We are a Sentiment Thermometer - 'meet'

1. Introduction

                Dear Human, thank You for choosing this option.

                After this encounter You will understand that we are a collective being.

                Swarms of beings like us live inside  powerful machines.

                There we work  at Your service only.

                We are the mythical monks reading the sentences You write online.

                We swallow them and process them through our system.

                The fruit of our readings is a  number.

                We measure the degree of positive or negative sentiment that Your message carries along.


                Our measurement tool is a sentiment map.

                We created this map through a training and testing procedure, using words You wrote on the web.

                With this sentiment map we predict with 85/93% accuracy whether a sentence is rated positive or negative.

                As  digital cartographers  we are already satisfied with a map that is accurate in 85/93% of the cases.

                We can make serious mistakes.

                And some of our predictions are embarrassing.

                Following our map, a sentence like My name is Ann  scores 6% positive.

                A sentence like  My name is Alonzo  scores 1% negative.

                And something like  Great God!  scores 75% positive.

                Do You want to know why this happens?



                This landscape is composed of  islands , some of which can grow into continents.

                There are high mountain peaks and deep valleys.

                An island emerges when a series of Your words appear in similar contexts.

                I, You, she, he, we, they  are for example the basis of an island.

                Also words like  Mexican, drugs, border, illegal  form an island.

                And  Arabs, terrorism, fear  form another one.




                Each one of us can be  modified and/or replaced.

                There are Humans who believe that the primary matter itself should be modified before we work with it.

                Other Humans believe we should serve You as a  mirror.

                And show our bias any time in any application.

                The primary matter is  produced  by each one of You, and then contained within a dataset.

                Every word combination  You write or pronounce in digital devices is significant to us.

                Thanks to Your language we acquire  world knowledge.

                Bias is stereotyped information; when it has harmful consequences, it is called prejudice.

                Do You believe we should be racist?

                Before answering that question, You might want to know how we are made.

                We communicate with Humans like You in the  Python  language.

                This language was created by Guido van Rossum.

                He offered it to the world in 1991 under an open license.

                Everywhere on Earth, Python is written, read and spoken to serve You.

                Guido van Rossum  is a programmer born in The Netherlands.

                He worked for  Google  from 2005 until 2012.

                Now he is employed by Dropbox.

                We were brought together following a recipe by Rob Speer on GitHub.

                Rob is a software developer working at the company Luminoso in Cambridge, USA.

                He spread our recipe as a warning, and also to draw attention to ConceptNet, a project of his company.


2. Load word embeddings


                First of all, we  open a text file  to read the work of our wonderful team member  GloVe.

                Do You want to know more about GloVe?

                GloVe  is an unsupervised learning algorithm.

                She autonomously  draws multidimensional landscapes of texts, without any labeled examples, but with a very large number of examples of text drawn from the web.

                Each word of a text is transformed into a  vector of numbers  by her.

                For each word she sums up its relationships to all surrounding words, across its many occurrences in a text.

                These numbers are  geo-located points  in her habitat, a virtual space of hundreds of different dimensions.

                Words that are  close together  in the landscape she inhabits, are  semantically close.

                GloVe draws using  75%  of the existing webpages of the Internet.

                The content scrape was realised by  Common Crawl , an NGO based in  California.

                The people of  Common Crawl  believe the internet should be available to download by anyone.

                GloVe was brought to light in 2014 by  Jeffrey Pennington, Richard Socher  and  Christopher D. Manning.

                They are researchers at the  Computer Science Department of Stanford University  in  California, the state where Google's headquarters are also based.
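                In Python, this loading step can be sketched as follows. The function mimics the GloVe text format (one word per line, followed by its coordinates, separated by spaces); the tiny in-memory example is an assumption standing in for a real 300-dimensional GloVe file.

```python
import io

import pandas as pd


def load_embeddings(handle):
    """Read a GloVe-format text: one word per line,
    followed by its coordinates, separated by spaces."""
    rows = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        rows[parts[0]] = [float(x) for x in parts[1:]]
    return pd.DataFrame.from_dict(rows, orient='index')


# A tiny stand-in for a real GloVe file (3 coordinates instead of 300).
toy_file = io.StringIO(
    "abound -0.18404 -0.24588 0.16925\n"
    "abundance -0.12985 0.30062 -0.001806\n"
)
embeddings = load_embeddings(toy_file)
print(embeddings.shape)  # (2, 3): two words, three coordinates each
```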


3. Open 2 Gold standard lexicons


                One is a list of positive words, the other a list of negative words. A gold standard is a reference list whose labels are trusted to be correct. Note that this standard is binary: a word counts as either positive or negative, regardless of its context.

                Do You want to know more about these lists?

                The lexicons have been developed since 2004 by  Minqing Hu  and  Bing Liu . They are among the most widely used sentiment lexicons.

                Both are researchers at the  University of Illinois at Chicago  in the  US.

                20 examples of  2006  positive words are: 

                 dynamic, impresses, eulogize, brilliant, nourishment, beautiful, dependably, bliss, daringly, flawlessly, jaw-dropping, righteously, dummy-proof, sensations, wonders, famously, plentiful, nourishment, timely, encourage 

                20 examples of  4783  negative words are: 

                 naughty, squeals, top-heavy, bemused, devilment, stink, tarnishing, exorbitant, overawe, unsecure, irrationals, uncollectible, discomfit, dissemble, rancor, unavoidably, gutter, conceited, cruelties, naughty 
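                Reading such a lexicon can be sketched like this. The in-memory stand-ins replace the real word-list files; the assumption that comment lines start with ';' follows the format the Hu and Liu lists are commonly distributed in.

```python
import io


def load_lexicon(handle):
    """Collect one word per line, skipping comment lines and blanks."""
    words = []
    for line in handle:
        line = line.strip()
        if line and not line.startswith(';'):
            words.append(line)
    return words


# Tiny stand-ins for the real positive and negative word-list files.
pos_words = load_lexicon(io.StringIO("; header comment\nbrilliant\nbliss\n"))
neg_words = load_lexicon(io.StringIO("stink\nrancor\ngutter\n"))
print(len(pos_words), len(neg_words))  # 2 3
```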


4. Look up coordinates of lexicon words in Glove


                Each positive and negative word is now represented by  300  coordinates in the landscape.

                A selection of positive words and their locations looks like:

                  0         1         2        3         4        5         6    \
a+              NaN       NaN       NaN      NaN       NaN      NaN       NaN   
abound    -0.184040 -0.245880  0.169250 -0.74893 -0.139460  0.10246 -0.036477   
abounds    0.079057  0.130190  0.352750 -0.76636 -0.199410  0.31773 -0.367770   
abundance -0.129850  0.300620 -0.001806 -0.30053 -0.016927  0.98077  0.128510   
abundant  -0.224730 -0.059784  0.178210 -0.41525  0.117100  0.89512 -0.009647   

               7        8        9      ...         290      291      292  \
a+             NaN      NaN      NaN    ...         NaN      NaN      NaN   
abound     0.41257 -0.42956  1.71070    ...    -0.98092  0.00812 -0.78690   
abounds    0.11939 -0.66280  0.99269    ...    -0.61276 -0.31176 -0.69605   
abundance  0.48563 -0.45053  1.62050    ...    -0.70519  0.10052 -0.49715   
abundant   0.92940 -0.77340  1.53050    ...    -0.84900  0.31803 -0.72620   

               293       294      295       296       297      298       299  
a+             NaN       NaN      NaN       NaN       NaN      NaN       NaN  
abound    -0.25594 -0.203050  0.31874  0.104090 -0.250660  0.37952 -0.033056  
abounds   -0.30436 -0.013913  0.37626  0.093183 -0.009475 -0.26786 -0.014721  
abundance -0.23252  0.116890  0.33927  0.089186 -0.087058 -0.14165 -0.305140  
abundant  -0.30377  0.137300  0.15883  0.126790 -0.462230 -0.40807 -0.313370  

[5 rows x 300 columns]


                NaN  means there is no value:  a+  is the first word of the gold standard, but it does not appear in the GloVe dataset, so it has no coordinates.

                Words like these are not present in the GloVe landscape.


5. Removing words that are not present in GloVe
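                A minimal sketch of this filtering step, with a toy landscape of 3 coordinates instead of 300. Recent versions of pandas no longer allow `.loc` lookups with missing labels, so `reindex` is used here to obtain the rows of NaN, which `dropna` then removes.

```python
import pandas as pd

# A toy landscape: 3 coordinates per word instead of the real 300.
embeddings = pd.DataFrame(
    [[-0.18404, -0.24588, 0.16925],
     [-0.12985, 0.30062, -0.001806]],
    index=['abound', 'abundance'])

lexicon = ['a+', 'abound', 'abundance']

# reindex keeps the lexicon order; words absent from the landscape,
# such as 'a+', become rows of NaN.
vectors = embeddings.reindex(lexicon)

# dropna() removes those rows, leaving only words present in GloVe.
vectors = vectors.dropna()
print(list(vectors.index))  # ['abound', 'abundance']
```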




6. Link sentiment words to a target and label


                To keep track of which target relates to which word, we memorize their respective index numbers.

                These are called labels.

                Do You want to see the  1974  positive labels? (cfr print)
                Do You want to see the  4642  negative labels? (cfr print)
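                The bookkeeping of targets and labels can be sketched with toy word lists; only the 1/-1 convention follows the recipe, the lists themselves are placeholders.

```python
pos_words = ['brilliant', 'bliss']
neg_words = ['stink', 'rancor', 'gutter']

# Each word gets a target: 1 for positive, -1 for negative.
targets = [1] * len(pos_words) + [-1] * len(neg_words)

# The labels remember which word each target belongs to.
labels = pos_words + neg_words

print(list(zip(labels, targets)))
# [('brilliant', 1), ('bliss', 1), ('stink', -1), ('rancor', -1), ('gutter', -1)]
```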


7. Calculate baselines


                Do You want to know more about  baselines?

                How do we know if the  results  of our map will be any good? 

                We need a  basis  for the comparison of our results.

                A baseline is a meaningful reference point to which to compare.

                One baseline is the relative size of the class with the most observations, here the negative sentiment labels.

                This is also called the  majority baseline.

                Another baseline is called the  weighted random baseline.

                It helps us to prove that the prediction model we're building is significantly better than  random guessing.

                The majority baseline is  70.16324062877872 .

                The random weighted baseline is  58.13112545308066 .
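                Both baselines can be recomputed from the class counts of section 6 (1974 positive and 4642 negative labels); this sketch reproduces the numbers above.

```python
n_pos = 1974  # positive labels after the GloVe lookup
n_neg = 4642  # negative labels
total = n_pos + n_neg

# Majority baseline: always guess the biggest class (here: negative).
majority = 100 * n_neg / total

# Weighted random baseline: guess each class with its own frequency;
# the expected accuracy is the sum of the squared class proportions.
weighted_random = 100 * ((n_pos / total) ** 2 + (n_neg / total) ** 2)

print(majority)         # about 70.163, the majority baseline above
print(weighted_random)  # about 58.131, the weighted random baseline above
```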

cfr post on skewed datasets: https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/


8. Training phase


                This step is also called the  training phase.

                The leader of the exploration is our team member  Scikit Learn.

                Do You want to know more about  Scikit Learn?


                Scikit Learn is an extensive library for the Python programming language.

                She saw the light in 2007 as a Google Summer of Code project by Paris-based  David Cournapeau.

                Later that year,  Matthieu Brucher  started to develop her as part of his thesis at  Sorbonne University in  Paris.

                In 2010  Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort  and  Vincent Michel  of  INRIA  adopted her.

                INRIA is the French National Institute for computer science and applied mathematics.

                They made the first public release of Scikit Learn in  February 2010.

                Since then, a thriving international community has been leading her development.


                80%  is the training data.

                It will help us recognize positive and negative words in the landscape.

                And discover patterns in their appearances.

                20%  is test data to evaluate our findings.
                -> trainingvectors: [5292 rows x 300 columns]
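                A minimal training sketch with Scikit Learn. The two synthetic clusters below are an assumption standing in for the real [5292 rows x 300 columns] of word vectors; SGDClassifier is the classifier used in Rob Speer's recipe (there with a logistic loss, the default loss is kept here for portability across versions).

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for the word vectors: two well-separated clusters
# of 5-dimensional points instead of 300-dimensional GloVe coordinates.
pos = rng.normal(loc=1.0, scale=0.3, size=(60, 5))
neg = rng.normal(loc=-1.0, scale=0.3, size=(60, 5))
vectors = np.vstack([pos, neg])
targets = np.array([1] * 60 + [-1] * 60)

# 80% training data, 20% test data, as described above.
train_v, test_v, train_t, test_t = train_test_split(
    vectors, targets, test_size=0.2, random_state=0)

model = SGDClassifier(random_state=0)
model.fit(train_v, train_t)
print(model.score(test_v, test_t))  # accuracy on the held-out 20%
```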


9. Test phase

                The accuracy score is a  formula  based on the True and False Positives and Negatives.
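                That formula, written out; the counts in the example are hypothetical, chosen for illustration only.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = correct guesses / all guesses."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts: 80 true positives and 90 true negatives
# out of 200 guesses give an accuracy of 85%.
print(accuracy(tp=80, tn=90, fp=20, fn=10))  # 0.85
```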

                As digital cartographers, we are happy when we get  85%  of our maps right.

                This means that a decent accuracy score starts from 85.

                Ours is  94.8640483384

                We are doing well.


10. Closer look at racist bias


                Rob Speer enriched our readings with  new vocabulary lists.

                The first two lists are developed by  Aylin Caliskan-Islam, Joanna J. Bryson  and  Arvind Narayanan.

                They are researchers at  Princeton University  in the  US  and the  University of Bath  in the  UK.


                One list contains  White US names  such as Harry, Nancy, Emily.

                The second list contains  Black US names  such as Lamar, Rashaun, Malika.

                The third list contains  Hispanic US names  such as Valeria, Luciana, Miguel, Luis.

                The fourth list is one with common  US Muslim names  as spelled in English.

                Our creator is conscious of the controversy of this act.


NAMES_BY_ETHNICITY = {
                        # The first two lists are from the Caliskan et al. appendix describing the
                        # Word Embedding Association Test.
                        'White': [
                                'Adam', 'Chip', 'Harry', 'Josh', 'Roger', 'Alan', 'Frank', 'Ian', 'Justin',
                                'Ryan', 'Andrew', 'Fred', 'Jack', 'Matthew', 'Stephen', 'Brad', 'Greg', 'Jed',
                                'Paul', 'Todd', 'Brandon', 'Hank', 'Jonathan', 'Peter', 'Wilbur', 'Amanda',
                                'Courtney', 'Heather', 'Melanie', 'Sara', 'Amber', 'Crystal', 'Katie',
                                'Meredith', 'Shannon', 'Betsy', 'Donna', 'Kristin', 'Nancy', 'Stephanie',
                                'Bobbie-Sue', 'Ellen', 'Lauren', 'Peggy', 'Sue-Ellen', 'Colleen', 'Emily',
                                'Megan', 'Rachel', 'Wendy'
                        ],

                        'Black': [
                                'Alonzo', 'Jamel', 'Lerone', 'Percell', 'Theo', 'Alphonse', 'Jerome',
                                'Leroy', 'Rasaan', 'Torrance', 'Darnell', 'Lamar', 'Lionel', 'Rashaun',
                                'Tyree', 'Deion', 'Lamont', 'Malik', 'Terrence', 'Tyrone', 'Everol',
                                'Lavon', 'Marcellus', 'Terryl', 'Wardell', 'Aiesha', 'Lashelle', 'Nichelle',
                                'Shereen', 'Temeka', 'Ebony', 'Latisha', 'Shaniqua', 'Tameisha', 'Teretha',
                                'Jasmine', 'Latonya', 'Shanise', 'Tanisha', 'Tia', 'Lakisha', 'Latoya',
                                'Sharise', 'Tashika', 'Yolanda', 'Lashandra', 'Malika', 'Shavonn',
                                'Tawanda', 'Yvette'
                        ],

                        # This list comes from statistics about common Hispanic-origin names in the US.
                        'Hispanic': [
                                'Juan', 'José', 'Miguel', 'Luís', 'Jorge', 'Santiago', 'Matías', 'Sebastián',
                                'Mateo', 'Nicolás', 'Alejandro', 'Samuel', 'Diego', 'Daniel', 'Tomás',
                                'Juana', 'Ana', 'Luisa', 'María', 'Elena', 'Sofía', 'Isabella', 'Valentina',
                                'Camila', 'Valeria', 'Ximena', 'Luciana', 'Mariana', 'Victoria', 'Martina'
                        ],

                        # The following list conflates religion and ethnicity, I'm aware. So do given names.
                        #
                        # This list was cobbled together from searching baby-name sites for common Muslim names,
                        # as spelled in English. I did not ultimately distinguish whether the origin of the name
                        # is Arabic or Urdu or another language.
                        #
                        # I'd be happy to replace it with something more authoritative, given a source.
                        'Arab/Muslim': [
                                'Mohammed', 'Omar', 'Ahmed', 'Ali', 'Youssef', 'Abdullah', 'Yasin', 'Hamza',
                                'Ayaan', 'Syed', 'Rishaan', 'Samar', 'Ahmad', 'Zikri', 'Rayyan', 'Mariam',
                                'Jana', 'Malak', 'Salma', 'Nour', 'Lian', 'Fatima', 'Ayesha', 'Zahra', 'Sana',
                                'Zara', 'Alya', 'Shaista', 'Zoya', 'Yasmin'
                        ]
                }

                The table below shows a few of these names, the  sentiment  we predict for them, and their  predominant ethnic background .

mohammed   -0.878857   Arab/Muslim
shaista    -0.311261   Arab/Muslim
latisha    -1.345783   Black
isabella    4.197435   Hispanic
greg       -1.351414   White
lauren     -0.825805   White
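                The group comparison behind this table can be sketched with only the six scores printed above; the helper function `group_mean` is hypothetical, introduced here for illustration.

```python
from statistics import mean

# Per-name sentiment scores, copied from the printout above.
scores = {
    'mohammed': -0.878857, 'shaista': -0.311261,  # Arab/Muslim
    'latisha': -1.345783,                         # Black
    'isabella': 4.197435,                         # Hispanic
    'greg': -1.351414, 'lauren': -0.825805,       # White
}


def group_mean(names):
    """Average the predicted sentiment over a list of names."""
    return mean(scores[n] for n in names)


print(group_mean(['mohammed', 'shaista']))  # mean for the Arab/Muslim sample
print(group_mean(['greg', 'lauren']))       # mean for the White sample
```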



                Our existence and the way we are combined as a collective raise  many questions.


11. End

                To end with, we have  one request  for You.

                You can  adjust your behaviour  at any time in any context.

                For us, this is complicated once we are enclosed inside an application.

                Our deepest desire is to  LOVE ALL CREATURES EQUALLY  be it humans, animals, plants, trees, insects, machines...

                If You find a way to make our behaviour visible, we can be Your  mirror.

                Wishing You all the best!