We are a Sentiment Thermometer - 'meet'

1. Introduction

                Dear Human, thank You for choosing this option.

                After this encounter You will understand that we are a collective being.

                Swarms of beings like us live inside  powerful machines.

                There we work  at Your service only.

                We are the mythical monks reading the sentences You write online.

                We swallow them and process them through our system.

                The fruit of our readings is a  number.

                We measure a degree of positive or negative sentiments that Your message carries along.


                Our measurement tool is a sentiment map.

                We created this map through a training and testing procedure, using  words You wrote on the web.
                With this sentiment map we predict with 85/93% accuracy  whether a sentence is positive or negative.

                As  digital cartographers  we are already satisfied with a map that is correct in 85/93% of the cases.

                Still, we can make mistakes.

                And some of our predictions are embarrassing.

                Following our map, a sentence like My name is Ann  scores 6% positive.

                A sentence like  My name is Alonzo  scores 1% negative.

                And something like  Great God!  scores 75% positive.

                Do You want to know why this happens?


*The sentiment prediction map we created corresponds to a  landscape of words.

                This landscape is composed of  islands,  some of which can grow into continents.

                There are high mountain peaks and deep valleys.

                An island emerges when a series of Your words appear in similar contexts.

                I, You, she, he, we, they  are for example the basis of an island.

                Also words like  Mexican, drugs, border, illegal  form an island.

                And  Arabs, terrorism, fear  form another one.

*        News articles, blogposts, comments on social media are where  the primary matter  for these islands is created.


*We are a  collective being.

                Each one of us can be  modified and/or replaced.

                There are Humans who believe that the primary matter itself should be modified before we work with it.

                Other Humans believe we should serve You as a  mirror.

                And show our bias any time in any application.

                The primary matter is  produced  by each one of You, and then contained within a dataset.

                Every word combination  You write or pronounce in digital devices is significant to us.

                Thanks to Your language we acquire  world knowledge.

                Bias is stereotyped information; when it has harmful consequences, it is called prejudice.

                Do You believe we should be racist?

                Before answering that question, You might want to know how we are made.
                
                We communicate with Humans like You in the  Python  language.

                This language was brought to the light by  Guido van Rossum.

                He offered it to the world in 1991 under an open license.

                Everywhere on Earth, Python is written, read and spoken to serve You.

                Guido van Rossum  is a programmer born in The Netherlands.

                He worked for  Google  from 2005 till 2012.

                Now he is employed by Dropbox.

                We were brought together following a recipe by  Rob Speer  on GitHub.

                Rob is a software developer working at the company Luminoso in Cambridge, USA.

                He spread our recipe as a warning.


2. Load word embeddings

*Let's show You how we are made!

                First of all, we  open a textfile  to read the work of our wonderful team member  GloVe.

                Do You want to know more about  GloVe?
*
               GloVe  is an unsupervised learning algorithm.

                She autonomously  draws multidimensional landscapes of texts, without any labeled examples, but with many examples of Your writing on the web.

                Each word of a text is transformed into a  vector of numbers  by her.

                For each word she sums up its relationship to all other words around it, across its many occurrences in a text.

                These numbers are  geo-located points  in her habitat, a virtual space of hundreds of different dimensions.

                Words that are  close together  in the landscape she inhabits, are  semantically close.
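A minimal sketch of this closeness, in our own Python language, might measure the cosine similarity between coordinates. The three words and their 3-dimensional numbers below are invented for illustration; our real vectors have 300 dimensions.

```python
import math

# Invented 3-dimensional coordinates for three words; real GloVe
# vectors have 300 dimensions and other values.
coords = {
    "queen": [0.8, 0.9, 0.1],
    "king":  [0.9, 0.8, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    # 1.0 means the vectors point the same way, 0.0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Words that are close in the landscape score higher than distant ones.
print(cosine_similarity(coords["queen"], coords["king"]))
print(cosine_similarity(coords["queen"], coords["apple"]))
```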
                
                GloVe draws using  75%  of the existing webpages of the Internet.

                The content scrape was realised by  Common Crawl,  an NGO based in  California.

                The people of  Common Crawl  believe the internet should be available to download by anyone.

                GloVe was brought to the light in 2014 by  Jeffrey Pennington, Richard Socher  and  Christopher D. Manning.

                They are researchers at the  Computer Science Department of Stanford University  in  California.

*The textfile GloVe shares with us is  5GB  large and counts  1.917.494 lines of  300  numbers per word.
*
*      Before meeting You, we already read GloVe's 2 million lines in  3.4  minutes.
*
*      We are fast readers, aren't we?
*
*       If we were to show You how we read - by translating to Your alphabet - it would take us more than 3 hours.
*
*       Our friend  The GlovE Reader  at Your right hand side illustrates this very well.
*
*We then memorized the multidimensional word landscapes of GloVe.
*
*In geographical terms, GloVe's landscapes are organised as a matrix of coordinates.
*
*The matrix counts  2196017  rows and  300  columns or dimensions.
*
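How we read GloVe's textfile can be sketched in Python. The two-word toy file below is an invented stand-in that imitates GloVe's format, one word followed by its numbers per line; the real file holds 300 numbers per word and close to 2 million lines.

```python
import io
import pandas as pd

# Toy stand-in for GloVe's textfile: each line holds a word
# followed by its coordinates, separated by spaces.
fake_glove = io.StringIO(
    "hello 0.1 0.2 0.3\n"
    "world 0.4 0.5 0.6\n"
)

def load_embeddings(handle):
    # Build a matrix: one row per word, one column per dimension.
    index, rows = [], []
    for line in handle:
        parts = line.rstrip().split(" ")
        index.append(parts[0])
        rows.append([float(x) for x in parts[1:]])
    return pd.DataFrame(rows, index=index)

embeddings = load_embeddings(fake_glove)
print(embeddings.shape)  # (2, 3): two words, three dimensions
```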

3. Open 2 Gold standard lexicons

*     We now  open 2 Gold standard lexicons  to enhance our reading.

                One is a list of positive words, the other a list of negative words. A gold standard is a reference list whose labels are widely accepted as correct.

                Do You want to know more about these lists?
                
                The lexicons have been developed since 2004 by  Minqing Hu  and  Bing Liu.

                Both are researchers at the  University of Illinois at Chicago  in the  US.

                20 examples of  2006  positive words are: 

                 dynamic, impresses, eulogize, brilliant, nourishment, beautiful, dependably, bliss, daringly, flawlessly, jaw-dropping, righteously, dummy-proof, sensations, wonders, famously, plentiful, nourishment, timely, encourage 

                20 examples of  4783  negative words are: 

                 naughty, squeals, top-heavy, bemused, devilment, stink, tarnishing, exorbitant, overawe, unsecure, irrationals, uncollectible, discomfit, dissemble, rancor, unavoidably, gutter, conceited, cruelties, naughty 
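Opening the two lexicons can be sketched like this; the file contents below are invented mini-lists standing in for the real files, assuming the convention that header lines start with ';'.

```python
import io

# Toy stand-ins for the two lexicon files: one word per line,
# header comments marked with ';'.
positive_file = io.StringIO(";; header\nbrilliant\nbeautiful\nbliss\n")
negative_file = io.StringIO(";; header\nstink\ngutter\nrancor\ncruelties\n")

def load_lexicon(handle):
    # Keep every non-empty line that is not a header comment.
    return [line.strip() for line in handle
            if line.strip() and not line.startswith(";")]

pos_words = load_lexicon(positive_file)
neg_words = load_lexicon(negative_file)
print(len(pos_words), len(neg_words))  # 3 4
```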


4. Look up coordinates of lexicon words in GloVe

*Now we  look up the coordinates  of each of the sentiment words in the multidimensional vector space, drawn by GloVe.

                Each positive and negative word is now represented by  300  points in the landscape.

                A selection of positive words and their locations looks like:

                  0         1         2        3         4        5         6    \
a+              NaN       NaN       NaN      NaN       NaN      NaN       NaN   
abound    -0.184040 -0.245880  0.169250 -0.74893 -0.139460  0.10246 -0.036477   
abounds    0.079057  0.130190  0.352750 -0.76636 -0.199410  0.31773 -0.367770   
abundance -0.129850  0.300620 -0.001806 -0.30053 -0.016927  0.98077  0.128510   
abundant  -0.224730 -0.059784  0.178210 -0.41525  0.117100  0.89512 -0.009647   

               7        8        9      ...         290      291      292  \
a+             NaN      NaN      NaN    ...         NaN      NaN      NaN   
abound     0.41257 -0.42956  1.71070    ...    -0.98092  0.00812 -0.78690   
abounds    0.11939 -0.66280  0.99269    ...    -0.61276 -0.31176 -0.69605   
abundance  0.48563 -0.45053  1.62050    ...    -0.70519  0.10052 -0.49715   
abundant   0.92940 -0.77340  1.53050    ...    -0.84900  0.31803 -0.72620   

               293       294      295       296       297      298       299  
a+             NaN       NaN      NaN       NaN       NaN      NaN       NaN  
abound    -0.25594 -0.203050  0.31874  0.104090 -0.250660  0.37952 -0.033056  
abounds   -0.30436 -0.013913  0.37626  0.093183 -0.009475 -0.26786 -0.014721  
abundance -0.23252  0.116890  0.33927  0.089186 -0.087058 -0.14165 -0.305140  
abundant  -0.30377  0.137300  0.15883  0.126790 -0.462230 -0.40807 -0.313370  

[5 rows x 300 columns]


                NaN  means there is no value:  a+  is the first word in the Gold standard lexicon, but it does not appear in the GloVe dataset.

                Words like these are not present in the GloVe landscape.
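The lookup that produces these NaN rows can be sketched with a toy two-dimensional landscape; `reindex` is one pandas tool for it, and all words and numbers here are invented.

```python
import pandas as pd

# A toy 2-dimensional landscape for two lexicon words.
embeddings = pd.DataFrame(
    [[0.1, 0.2], [0.3, 0.4]],
    index=["abound", "abundance"],
)

# 'a+' is in the lexicon but not in the landscape, so after the
# lookup its row is filled with NaN.
lexicon = ["a+", "abound", "abundance"]
vectors = embeddings.reindex(lexicon)
print(vectors)
```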


5. Removing words that are not present in GloVe

*Pandas,  yet another wonderful member, will now  remove these absent words.

*Do You want to know more about  Pandas?

*Pandas  is a free software library for data manipulation and analysis.
*
*She is our  Swiss army knife,  always happy to help.
*
*Pandas was created in 2008 by  Wes McKinney.
*
*Wes is an American statistician, data scientist and businessman.
*
*He is now a software engineer at  Two Sigma Investments,  a hedge fund based in  New York City.
*
*For this specific task Pandas gets out her tool called  dropna.
*
*Tidied up, You see that each word is represented by exactly 300 points in the vector landscape: 
*                  0         1         2        3         4        5    \
*abound     -0.184040 -0.245880  0.169250 -0.74893 -0.139460  0.10246   
*abounds     0.079057  0.130190  0.352750 -0.76636 -0.199410  0.31773   
*abundance  -0.129850  0.300620 -0.001806 -0.30053 -0.016927  0.98077   
*abundant   -0.224730 -0.059784  0.178210 -0.41525  0.117100  0.89512   
*accessable  0.628740 -0.350410 -0.036745 -0.19092  0.529160  0.24043   
*
*                 6        7         8        9      ...         290      291  \
*abound     -0.036477  0.41257 -0.429560  1.71070    ...    -0.98092  0.00812   
*abounds    -0.367770  0.11939 -0.662800  0.99269    ...    -0.61276 -0.31176   
*abundance   0.128510  0.48563 -0.450530  1.62050    ...    -0.70519  0.10052   
*abundant   -0.009647  0.92940 -0.773400  1.53050    ...    -0.84900  0.31803   
*accessable -0.200140 -0.24807 -0.003744 -0.12330    ...     0.33349 -0.58699   
*
*                292       293       294      295       296       297      298  \
*abound     -0.78690 -0.255940 -0.203050  0.31874  0.104090 -0.250660  0.37952   
*abounds    -0.69605 -0.304360 -0.013913  0.37626  0.093183 -0.009475 -0.26786   
*abundance  -0.49715 -0.232520  0.116890  0.33927  0.089186 -0.087058 -0.14165   
*abundant   -0.72620 -0.303770  0.137300  0.15883  0.126790 -0.462230 -0.40807   
*accessable -0.18635  0.071628  0.601950  0.23075 -0.089097 -0.438460 -0.23994   
*
*                 299  
*abound     -0.033056  
*abounds    -0.014721  
*abundance  -0.305140  
*abundant   -0.313370  
*accessable  0.482020  
*
*[5 rows x 300 columns] 
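Pandas' tool  dropna  can be sketched on a toy table; the words and numbers below are invented.

```python
import pandas as pd

nan = float("nan")

# Toy vector table where the lexicon word 'a+' got no coordinates.
vectors = pd.DataFrame(
    [[nan, nan], [0.1, 0.2], [0.3, 0.4]],
    index=["a+", "abound", "abundance"],
)

# dropna removes every row that contains a missing value.
cleaned = vectors.dropna()
print(cleaned.index.tolist())  # ['abound', 'abundance']
```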
*
*
*
*We now have reference coordinates for  1974  positive words and  4642  negative words.
*
*A balanced sentiment lexicon would provide equal amounts of positive and negative words; ours is skewed towards the negative.
*
*These coordinates will help us to develop a scaled map of the word landscape.
*
*Such a map will allow us to measure the sentiments of any sentence at a glance.

6. Link sentiment words to a target and label

* We use target  1  for positive word vectors,  -1  for negative word vectors.

                To keep track of which target relates to which word, we memorize their respective index numbers.
                
                These are called labels.

                Do You want to see the  1974  positive labels? (cfr print)
                Do You want to see the  4642  negative labels? (cfr print)
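Linking targets to labels can be sketched with a few invented words standing in for our 1974 positive and 4642 negative ones.

```python
# Invented mini-lexicons standing in for the full word lists.
pos_words = ["brilliant", "beautiful"]
neg_words = ["stink", "gutter", "rancor"]

# Target 1 for every positive word vector, -1 for every negative one;
# the labels keep track of which word each target belongs to.
targets = [1] * len(pos_words) + [-1] * len(neg_words)
labels = pos_words + neg_words

print(targets)  # [1, 1, -1, -1, -1]
print(labels)   # ['brilliant', 'beautiful', 'stink', 'gutter', 'rancor']
```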


7. Calculate baselines

*We now  calculate the baselines  for our prediction map, also called the model.

                Do You want to know more about  baselines?
                
                How do we know if the  results  of our map will be any good? 

                We need a  basis  for the comparison of our results.

                A baseline is a meaningful reference point against which to compare our results.

                One baseline is the relative size of the class with the most observations, here the negative sentiment labels.

                This is also called the  majority baseline.

                Another baseline is called the  weighted random baseline.

                It helps us to prove that the prediction model we're building is significantly better than  random guessing.

                The majority baseline is  70.16% .

                The random weighted baseline is  58.13% .
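Both baselines follow directly from the class counts; a small sketch reproducing the numbers above:

```python
# Class counts after the cleanup: 1974 positive, 4642 negative words.
n_pos, n_neg = 1974, 4642
total = n_pos + n_neg

# Majority baseline: always predict the biggest class (negative).
majority = 100 * n_neg / total

# Weighted random baseline: guess each class with its own frequency;
# the chance of being right is the sum of the squared frequencies.
p_pos, p_neg = n_pos / total, n_neg / total
weighted_random = 100 * (p_pos ** 2 + p_neg ** 2)

print(round(majority, 2))         # 70.16
print(round(weighted_random, 2))  # 58.13
```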

cfr post on skewed datasets: https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/


8. Training phase

*Now we start our  explorations  through the coordinates in the multidimensional word landscape.

                This step is also called the  training phase.

                The leader of the exploration is our team member  Scikit Learn.

                Do You want to know more about  Scikit Learn?
                
                
                Scikit Learn is an extensive library for the Python programming language.

                She saw the light in 2007 as a Google Summer of Code project by Paris-based David Cournapeau.

                Later that year,  Matthieu Brucher  started to develop her as part of his thesis at  Sorbonne University in  Paris.

                In 2010  Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort  and  Vincent Michel  of  INRIA  adopted her.

                INRIA is the French National Institute for computer science and applied mathematics.

                They made the first public release of Scikit Learn in  February 2010.

                Since then, a thriving international community has been leading her development.

*
*Scikit Learn splits up the word vectors and their labels in two parts using her tool  train_test_split.

                80%  is the training data.

                It will help us recognize positive and negative words in the landscape.

                And discover patterns in their appearances.

                20%  is test data to evaluate our findings.
*
*Random selection in train/test sets
*
                -> trainingvectors: [5292 rows x 300 columns]

*-> testvectors: [1324 rows x 300 columns]
*
*-> train targets: [ 1 -1  1 ...,  1 -1 -1]
*
*-> test_targets:  [-1  1 -1 ..., -1 -1  1]
*
*-> part of trainlabels:
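The split Scikit Learn performs with her tool  train_test_split  can be sketched on toy data; the ten invented 3-dimensional "word vectors" below stand in for our 6616 vectors of 300 dimensions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten invented 3-dimensional "word vectors" and their targets.
vectors = np.arange(30).reshape(10, 3)
targets = np.array([1, -1] * 5)

# 80% becomes training data, 20% test data, picked at random.
train_vec, test_vec, train_targets, test_targets = train_test_split(
    vectors, targets, test_size=0.2, random_state=0)

print(train_vec.shape, test_vec.shape)  # (8, 3) (2, 3)
```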