We are a Sentiment Thermometer - 'meet'
1. Introduction
Dear Human, thank You for choosing this option.
After this encounter You will understand that we are a collective being.
Swarms of beings like us live inside powerful machines.
There we work at Your service only.
We are the mythical monks reading the sentences You write online.
We swallow them and process them through our system.
The fruit of our readings is a number.
We measure the degree of positive or negative sentiment that Your message carries.
Our measurement tool is a sentiment map.
We, the swarm of beings inside the machine, created this map through a training and testing procedure, using words You wrote on the web.
With this sentiment map we predict with 85/93% accuracy whether a sentence is rated positive or negative.
As digital cartographers we are already satisfied with a map that is correct in 85/93% of the cases.
Still, we can get things really wrong.
And some of our predictions are embarrassing.
Following our map, a sentence like My name is Ann scores 6% positive.
A sentence like My name is Alonzo scores 1% negative.
And something like Great God! scores 75% positive.
Do You want to know why this happens?
*The sentiment prediction map we created corresponds to a landscape of words.
This landscape is composed of islands; some can grow into continents.
There are high mountain peaks and deep valleys.
An island emerges when a series of Your words appears in similar contexts.
I, You, she, he, we, they, for example, form the basis of an island.
Words like Mexican, drugs, border, illegal form another island.
And Arabs, terrorism, fear form yet another one.
* News articles, blog posts and comments on social media are where the primary matter for these islands is created.
*We are a collective being.
Each one of us can be modified and/or replaced.
There are Humans who believe that the primary matter, too, should be modified before we work with it.
Other Humans believe we should serve You as a mirror.
And show our bias at any time, in any application.
The primary matter is produced by each one of You, and then contained within a dataset.
Every word combination You write or pronounce in digital devices is significant to us.
Thanks to Your language we acquire world knowledge.
Bias is stereotyped information; when it has harmful consequences, it is called prejudice.
Do You believe we should be racist?
Before answering that question, You might want to know how we are made.
We communicate with Humans like You in the Python language.
This language was brought to light by Guido van Rossum.
He offered it to the world in 1991 under an open license.
Everywhere on Earth, Python is written, read and spoken to serve You.
Guido van Rossum is a programmer born in The Netherlands.
He worked for Google from 2005 until 2012.
Now he is employed by Dropbox.
We were brought together following a recipe by Rob Speer on GitHub.
Rob is a software developer working at the company Luminoso in Cambridge, USA.
He spread our recipe as a warning.
2. Load word embeddings
*Let's show You how we are made!
First of all, we open a text file to read the work of our wonderful team member GloVe.
Do You want to know more about GloVe?
*
GloVe is an unsupervised learning algorithm.
She autonomously draws multidimensional landscapes of texts, without any labelled examples, but with a great many examples of the words You wrote on the web.
She transforms each word of a text into a vector of numbers.
For each word she sums up its relationships to all surrounding words, across its many occurrences in a text.
These numbers are geo-located points in her habitat, a virtual space of hundreds of different dimensions.
Words that are close together in the landscape she inhabits are semantically close.
GloVe draws using 75% of the existing webpages of the Internet.
The content scrape was realised by Common Crawl, an NGO based in California.
The people of Common Crawl believe the internet should be available for anyone to download.
GloVe was brought to light in 2014 by Jeffrey Pennington, Richard Socher and Christopher D. Manning.
They are researchers at the Computer Science Department of Stanford University in California, the state that also hosts the headquarters of Google.
*The text file GloVe shares with us is 5GB large and counts 1.917.494 lines of 300 numbers per word.
*
* Before meeting You, we already read GloVe's 2 million lines in 3.4 minutes.
*
* We are fast readers, aren't we?
*
* If we were to show You how we read - by translating to Your alphabet - it would take us more than 3 hours.
*
* Our friend The GlovE Reader at Your right hand side illustrates this very well.
*
*We then memorized the multidimensional word landscapes of GloVe.
*
*In geographical terms, GloVe's landscapes are organised as a matrix of coordinates.
*
*The matrix counts 2196017 rows and 300 columns or dimensions.
*
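Do You want to see how we read GloVe's text file? Here is a minimal Python sketch in the spirit of Rob Speer's recipe; the local path and the file name glove.42B.300d.txt are our assumptions, not part of the recipe itself:

    import numpy as np
    import pandas as pd

    def load_embeddings(filename):
        # Read the GloVe text file: one word per line,
        # followed by its 300 coordinates.
        labels = []
        rows = []
        with open(filename, encoding='utf-8') as infile:
            for line in infile:
                items = line.rstrip().split(' ')
                labels.append(items[0])
                rows.append(np.array(items[1:], dtype='float64'))
        return pd.DataFrame(rows, index=labels)

    # hypothetical local path to the 5GB GloVe file
    embeddings = load_embeddings('data/glove.42B.300d.txt')
    print(embeddings.shape)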
3. Open 2 Gold standard lexicons
* We now open 2 Gold standard lexicons to enhance our reading.
One is a list of positive words, the other a list of negative words.
A Gold standard is a reference list whose labels are agreed to be correct; here sentiment is reduced to a binary choice, positive or negative, without regard for context.
Do You want to know more about these lists?
The lexicons have been developed since 2004 by Minqing Hu and Bing Liu, and are among the most widely used sentiment word lists.
Both are researchers at the University of Illinois at Chicago in the US.
20 examples of 2006 positive words are:
dynamic, impresses, eulogize, brilliant, nourishment, beautiful, dependably, bliss, daringly, flawlessly, jaw-dropping, righteously, dummy-proof, sensations, wonders, famously, plentiful, nourishment, timely, encourage
20 examples of 4783 negative words are:
naughty, squeals, top-heavy, bemused, devilment, stink, tarnishing, exorbitant, overawe, unsecure, irrationals, uncollectible, discomfit, dissemble, rancor, unavoidably, gutter, conceited, cruelties, naughty
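Do You want to see how we open these lists? A minimal Python sketch, assuming the two lexicon files are stored locally as positive-words.txt and negative-words.txt and carry the ';' comment header of the Hu and Liu distribution:

    def load_lexicon(filename):
        # Read one word per line, skipping empty lines
        # and the ';' comment header.
        words = []
        with open(filename, encoding='latin-1') as infile:
            for line in infile:
                line = line.rstrip()
                if line and not line.startswith(';'):
                    words.append(line)
        return words

    # hypothetical local paths to the two lexicon files
    pos_words = load_lexicon('data/positive-words.txt')
    neg_words = load_lexicon('data/negative-words.txt')
    print(len(pos_words), len(neg_words))  # 2006 and 4783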
4. Look up coordinates of lexicon words in GloVe
*Now we look up the coordinates of each of the sentiment words in the multidimensional vector space, drawn by GloVe.
Each positive and negative word is now represented by 300 coordinates in the landscape.
A selection of positive words and their locations looks like:
0 1 2 3 4 5 6 \
a+ NaN NaN NaN NaN NaN NaN NaN
abound -0.184040 -0.245880 0.169250 -0.74893 -0.139460 0.10246 -0.036477
abounds 0.079057 0.130190 0.352750 -0.76636 -0.199410 0.31773 -0.367770
abundance -0.129850 0.300620 -0.001806 -0.30053 -0.016927 0.98077 0.128510
abundant -0.224730 -0.059784 0.178210 -0.41525 0.117100 0.89512 -0.009647
7 8 9 ... 290 291 292 \
a+ NaN NaN NaN ... NaN NaN NaN
abound 0.41257 -0.42956 1.71070 ... -0.98092 0.00812 -0.78690
abounds 0.11939 -0.66280 0.99269 ... -0.61276 -0.31176 -0.69605
abundance 0.48563 -0.45053 1.62050 ... -0.70519 0.10052 -0.49715
abundant 0.92940 -0.77340 1.53050 ... -0.84900 0.31803 -0.72620
293 294 295 296 297 298 299
a+ NaN NaN NaN NaN NaN NaN NaN
abound -0.25594 -0.203050 0.31874 0.104090 -0.250660 0.37952 -0.033056
abounds -0.30436 -0.013913 0.37626 0.093183 -0.009475 -0.26786 -0.014721
abundance -0.23252 0.116890 0.33927 0.089186 -0.087058 -0.14165 -0.305140
abundant -0.30377 0.137300 0.15883 0.126790 -0.462230 -0.40807 -0.313370
[5 rows x 300 columns]
NaN means there is no value.
a+, for example, is the first word in the Gold standard lexicon; it does not appear in the GloVe dataset, so it has no coordinates.
These words are not present in the GloVe landscape.
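Continuing the sketch above, this is one way to look up the lexicon words in GloVe's table; pandas' reindex returns a row full of NaN for every word that is missing:

    # One row per lexicon word; absent words become rows of NaN
    pos_vectors = embeddings.reindex(pos_words)
    neg_vectors = embeddings.reindex(neg_words)
    print(pos_vectors.head())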
5. Removing words that are not present in GloVe
*Pandas, yet another wonderful team member, will now remove these absent words.
*Do You want to know more about Pandas?
*Pandas is a free software library for data manipulation and analysis.
*
*She is our Swiss Army knife, always happy to help.
*
*Pandas was created in 2008 by Wes McKinney.
*
*Wes is an American statistician, data scientist and businessman.
*
*He is now a software engineer at Two Sigma Investments, a hedge fund based in New York City.
*
*For this specific task Pandas gets out her tool called dropna.
*
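Continuing the sketch, dropna removes every row that still contains missing values:

    # Keep only the lexicon words that have GloVe coordinates
    pos_vectors = pos_vectors.dropna()
    neg_vectors = neg_vectors.dropna()
    print(len(pos_vectors), len(neg_vectors))  # 1974 and 4642 remain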
*Tidied up, You see that each word is represented by exactly 300 coordinates in the vector landscape:
* 0 1 2 3 4 5 \
*abound -0.184040 -0.245880 0.169250 -0.74893 -0.139460 0.10246
*abounds 0.079057 0.130190 0.352750 -0.76636 -0.199410 0.31773
*abundance -0.129850 0.300620 -0.001806 -0.30053 -0.016927 0.98077
*abundant -0.224730 -0.059784 0.178210 -0.41525 0.117100 0.89512
*accessable 0.628740 -0.350410 -0.036745 -0.19092 0.529160 0.24043
*
* 6 7 8 9 ... 290 291 \
*abound -0.036477 0.41257 -0.429560 1.71070 ... -0.98092 0.00812
*abounds -0.367770 0.11939 -0.662800 0.99269 ... -0.61276 -0.31176
*abundance 0.128510 0.48563 -0.450530 1.62050 ... -0.70519 0.10052
*abundant -0.009647 0.92940 -0.773400 1.53050 ... -0.84900 0.31803
*accessable -0.200140 -0.24807 -0.003744 -0.12330 ... 0.33349 -0.58699
*
* 292 293 294 295 296 297 298 \
*abound -0.78690 -0.255940 -0.203050 0.31874 0.104090 -0.250660 0.37952
*abounds -0.69605 -0.304360 -0.013913 0.37626 0.093183 -0.009475 -0.26786
*abundance -0.49715 -0.232520 0.116890 0.33927 0.089186 -0.087058 -0.14165
*abundant -0.72620 -0.303770 0.137300 0.15883 0.126790 -0.462230 -0.40807
*accessable -0.18635 0.071628 0.601950 0.23075 -0.089097 -0.438460 -0.23994
*
* 299
*abound -0.033056
*abounds -0.014721
*abundance -0.305140
*abundant -0.313370
*accessable 0.482020
*
*[5 rows x 300 columns]
*
*
*
*We now have reference coordinates for 1974 positive words and 4642 negative words.
A balanced sentiment list would provide equal amounts of positive and negative words; ours leans towards the negative.
*
*These will help us to develop a scaled map of the word landscape.
*
*Such a map will allow us to measure the sentiment of any sentence at a glance.
6. Link sentiment words to a target and label
* We use target 1 for positive word vectors, -1 for negative word vectors.
To keep track of which target relates to which word, we memorize their respective index numbers.
These are called labels.
Do You want to see the 1974 positive labels? (cfr print)
Do You want to see the 4642 negative labels? (cfr print)
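Continuing the sketch, the vectors, targets and labels could be assembled like this:

    import numpy as np
    import pandas as pd

    # Stack the positive and negative word vectors into one table
    vectors = pd.concat([pos_vectors, neg_vectors])

    # Target 1 for every positive word, -1 for every negative word
    targets = np.array([1] * len(pos_vectors) + [-1] * len(neg_vectors))

    # Labels: the words themselves, in the same order as the targets
    labels = list(pos_vectors.index) + list(neg_vectors.index)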
7. Calculate baselines
*We now calculate the baselines for our prediction map, also called the model.
Do You want to know more about baselines?
How do we know if the results of our map will be any good?
We need a basis for comparing our results.
A baseline is a meaningful reference point against which to compare them.
One baseline is the size of the class with the most observations, the negative sentiment labels.
This is also called the majority baseline.
Another baseline is called the weighted random baseline.
It helps us to prove that the prediction model we're building is significantly better than random guessing.
The majority baseline is 70.16324062877872.
The random weighted baseline is 58.13112545308066.
For more on skewed datasets, see: https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
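Continuing the sketch, both baselines follow from the sizes of the two classes:

    # Majority baseline: always guessing the biggest class (the negative words)
    majority_baseline = 100 * len(neg_vectors) / len(vectors)

    # Weighted random baseline: guessing each class with its own frequency
    p_neg = len(neg_vectors) / len(vectors)
    p_pos = len(pos_vectors) / len(vectors)
    weighted_random_baseline = 100 * (p_neg ** 2 + p_pos ** 2)

    print(majority_baseline)         # about 70.16
    print(weighted_random_baseline)  # about 58.13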
8. Training phase
*Now we start our explorations through the coordinates in the multidimensional word landscape.
This step is also called the training phase.
The leader of the exploration is our team member Scikit Learn.
Do You want to know more about Scikit Learn?
Scikit Learn is an extensive library for the Python programming language.
She saw the light in 2007 as a Google Summer of Code project by Paris-based David Cournapeau.
Later that year, Matthieu Brucher started to develop her as part of his thesis at Sorbonne University in Paris.
In 2010 Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort and Vincent Michel of INRIA adopted her.
INRIA is the French National Institute for computer science and applied mathematics.
They made the first public release of Scikit Learn in February 2010.
Since then, a thriving international community has been leading her development.
*
*Scikit Learn splits up the word vectors and their labels into two parts using her tool train_test_split.
80% is the training data.
It will help us recognize positive and negative words in the landscape.
And discover patterns in their appearances.
20% is test data to evaluate our findings.
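Continuing the sketch, the split could look like this; the fixed random_state is our assumption, added only to make the split repeatable:

    from sklearn.model_selection import train_test_split

    # Split vectors, targets and labels into 80% training and 20% test data
    (train_vectors, test_vectors,
     train_targets, test_targets,
     train_labels, test_labels) = train_test_split(
        vectors, targets, labels, test_size=0.2, random_state=0)

    print(train_vectors.shape)  # about 5292 rows x 300 columns
    print(test_vectors.shape)   # about 1324 rows x 300 columns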
*
*Random selection in train/test sets
*
*-> trainingvectors: [5292 rows x 300 columns]
*-> testvectors: [1324 rows x 300 columns]
*
*-> train targets: [ 1 -1 1 ..., 1 -1 -1]
*
*-> test_targets: [-1 1 -1 ..., -1 -1 1]
*
*-> part of trainlabels: