neural_networks

Previous session with all ressources: http://pad.constantvzw.org/public_pad/neural_networks_2 & http://pad.constantvzw.org/public_pad/neural_networks_1
Next session: http://pad.constantvzw.org/public_pad/neural_networks_4

Manetta, Cristina, Hans, An

2Manetta shows her script on word2vec
with algolit extension: each word gets an index number, you can rewrite text with similar words
* get text
* get most common words
* create index with these words
* only 100 most common words are input in tensoflow
* put data into the 'nodes': got stuck there, maybe visualising it with a graph (Tensorboard) can help
you define each node as a line in the code
every node is a mathematical function

How to proceed?
Hands-on code is interesting way, reading also
Other source from Seda: http://course.fast.ai/ - works with Amazon Web service
it seems we have to go through computer vision first, NLP always treated at the end...
other courses with specific focus on text:

(neural nets & nlp + image recognition)
https://www.coursera.org/learn/neural-networks starts with a general introduction, then jumps to word2vec-like predictions, then images and then generalization
(neural nets & nlp only)
https://cs224d.stanford.edu/syllabus.html - deep learning and NLP (provide slides, lots of articles and sometimes lecture notes and python examples, no video's)
https://www.youtube.com/watch?v=kZteabVD8sU&list=PLcGUo322oqu9n4i0X3cRJgKyVy7OkDdoi
-> we decide to watch the first videos of this today!!!

Course 1
video: https://www.youtube.com/watch?v=kZteabVD8sU&index=1&list=PLcGUo322oqu9n4i0X3cRJgKyVy7OkDdoi
slides: https://cs224d.stanford.edu/lectures/CS224d-Lecture1.pdf

Notes
TensorFlow and Torch becoming standard tools
examples are very company-focused, and also relative innocent examples (classifying text on reading level or sentiment for hotel reviews)

from machine learning to deep learning

How deep learning is different from other machine learning methods? Lots of ml methods need human-designed features, and uses very much the way we as humans look at text:

current word
previous word
next word
Current Word Character n-gram

Deep learning is a type of representation learning. It does so by giving it the "raw" data (pictures, words, characters), which are the examples that it learns from.
"deep"

History

starting ~1960s. For a historical overview on deep learning: http://www2.econ.iastate.edu/tesfatsi/DeepLearningInNeuralNetworksOverview.JSchmidhuber2015.pdf
There will be a focus on linear regressions.
We will look at unsupervised / supervised learning.
since 2006 deep learning models outperformed other machine learning techniques
Deep learning developments started with speech, then jumped to image recognition (break through in 2010 with the ImageNet classification models) and only then focused more on text.

from grammar to vectors

In stead of representing phonemes with characters for each sound, deep learning represents them with numbers.
Normally, we represent a word with 50 to 500 dimensions, but to represent the vector space we reduce them to 3 dimensions. [how do you choose the dimension?]
Vectors are used to compare them between eachother:
    phoneme(s) --> vectors
    word --> vectors
    noun phrase ("the cat") --> vector
    sentence --> vector
"a traditional lambda representation of a sentence is very discrete, it's not a long fuzzy list of numbers like the vectors are"
the place of human labour: 1000 examples of each of the things you want to identify is the minimum, and they need to be good.

"learn implicity" and "it knows now that something happens"
"it registers differences"

The machine translation section relates to this recent article:
Google AI's invent its own language to translate with
https://www.newscientist.com/article/2114748-google-translate-ai-invents-its-own-language-to-translate-with/
http://www.breitbart.com/tech/2016/11/26/google-ai-creates-its-own-language-to-translate-languages-it-doesnt-know/ -- watch out! Breitbart --

browser example of Deep-NLP by MetaMind: https://www.metamind.io/dmn

deep learning training options:
- pos tags
- sentiment
  - english
  - chinese
- story > question > [episodes] > answer

- machine translation
- image recognition

Qoutes
"you need to know a lot about life, love and god"
"have a deep learning model communicate with you through a chat interface"
"entity disambiguation"
"word2vec is actually not a deep model, and rather a shallow model"
"isn't deep learning another word for combining a bunch of algorithms together? well yes."
"deep learning provides: easier to adapt data, and is faster to learn, flexible, universal"
"visualizing [the models] is part of the art"
"DL techniques benefit more from a lot of data"
"deep NLP"
"grammar is old fashioned" (An)
"life is vectorspace" (Hans)
"it just learned it automatically"
"they cluster very well together" -- means: they semantically relate somehow

Course 2
slides: https://cs224d.stanford.edu/lectures/CS224d-Lecture2.pdf
video: https://www.youtube.com/watch?v=xhHOL3TNyJs&index=2&list=PLcGUo322oqu9n4i0X3cRJgKyVy7OkDdoi

"meaning is this illusive thing that were trying to capture"

Definition of "Meaning" (Webster dic4onary)
• the idea that is represented by a word, phrase, etc.
• the idea that a person wants to express by using words, signs, etc.
• the idea that is expressed in a work of wri4ng, art, etc.

History

How to represent meaning in a computer?
- wordnet (discrete representation)

one-hot vectors

[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
with one 0 for each word in a vocabulary
where the 1 is representing the place of a word in the vector
> In this kind of vector representation: none of the words are similar, they are all a 1.

distributional similarity based representation

"You shall know a word by the company it keeps" (J. Firth 1957)
How? By collecting a large corpus, there are two options then:
    - document based cooccurrence matrix
    on document level: any word in the document describes our word
    it will give us general topics (pool > swimming, water, sports)
    - windows based cooccurrence matrix
    like word2vec uses

web crawling

"a way to download to almost all of the websites that are allowed to be downloaded"
http://commoncrawl.org/

Vectors can become huge when the dataset is large.
Solution:

low dimensional vectors

- store most important data
- leave 25 - 1000 dimensions

> the more data the less dimensions could create high quality
> if you have not so much data, you need more dimensions to have a qualitative nnn,n,

methods to lower dimensions/dimensionality reduction:

Singular Value Decomposition method (not good for millions of documents):

Eigenvalues / Eigenwaarde and eigenvectors https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors
a way to change reading of matrix by changing the axis, until you get the most informative values, the strongest correlations
lower ones are cut off (noise), so you kick out relations between words
matrices need to have same dimensions and can be translated back
there is a way to keep track of what word is represented (in scikit learn f.ex)
-> have a look at the extra stuff in the cours: lectures notes + papers (linear algebra)
-> visualisation only represents 2 dimensions

PCA
directly learn from low-dimensional word vectors

How to know how many dimensions to use? depends on the task, will come back later

> Is it possible to visualize the multiple dimensions?

- time
- color
- for words, its more difficult
- "hypercubes" : visualisations of more than 3 dimensions: http://emeagwali.com/education/inventions-discoveries/ always from a certain perspective!
- you can write every matrix from 1 dimension on an A4, and then look at each one. Or visualize each one in a graph, and then take in a different dimension / view each time on the matrix (be it the original or eigenwaarde version).

Words are represented once in a vector. So words with multiple meanings, like "bank", are more difficult to represent.
there is research to multivectors for one word, so that it does not end up in the middle (but you get already far by using 1 dense vector/word)

"From now on, every word will be a dense vector for us."
(and still based on word frequency!)
(and the columns and rows in the matrix stay fixed, so left and right neighbour in a matrix is significant!)

Semantic patterns can be visualised in tree structures
-> you only project subset of the words in lower dimensionality vector space

Problems with SVD:
- difficult to scale, so not ideal for large datasets with million of words (needs lots of ram)

Recommended reading:

NLP (almost) from Scratch (Collobert & Weston, 2008)
A recent, even simpler and faster model: word2vec (Mikolov et al. 2013) >>> intro now

WORD2VEC
example of 'dynamic' logistic regression
- predict surrounding words for every word
- ex 840 billion of words :-)
- for each word: m words to left + m words to the right & calculate optimized log probability
- every (center) word has 2 vectors (at the end we average the 2 vectors):
1 for the actual window
and 1 for the prediction of the window
- u = outside word & v = predicted word ? & c = center word
- gradient descent (descent = afdaling / coborâre) ;)

[a short wrap-up]
Optimizing the log probability, to predict the next window word.
For this he uses gradient descent:
"Each word is a slope, which you need to find the path of. (in the mountain metaphor)."

To find the next window word, you need to find a word with a similar scope/descent as the current word.
Therefore, you need to calculate the most maximilized descent (which is the most efficient way to find the next word).

The positive thing of word2vec is, that each time you want to add a line, you don't need to calculate everything over again. So you can work additatively.

Linear relationships in word2vec > cooccurrence statistics:
- syntactically
- x apple ? x apples ? x car ? x cars ? x family ? x families
- semantically
- x shirt ? x clothing ? x chair ? x furniture

qoutes:
"in most of the cases the meaning will come through multiple dimensions"

Algolit extensions
(things we want to look at more closely)

matrix to singular value decomposition visualisations using numpy (see code snippets from slides course 2!!)
for example: we can experiment with clustering texts, and finding clusters through these dimensionality reduction methods or start from drawing and write the text (more complicated)
Google Earth for language: fly through word spaces