Previous sessions with all resources (in nr 1)
http://pad.constantvzw.org/public_pad/neural_networks_3
http://pad.constantvzw.org/public_pad/neural_networks_2
http://pad.constantvzw.org/public_pad/neural_networks_1
Response from some people at Transmediale:
interesting research, possible to do a demo at the end with what we call 'algolit extensions'
maybe dedicate a last session to preparing a workshop/small demo exhibition?
Proposal to start with homework & watch videos when Gijs joins (11h)
Here is the script: http://algolit.constantvzw.org/neural_networks/course-2_svd-word-vector.py
Following Slide 13 on:
https://cs224d.stanford.edu/lectures/CS224d-Lecture2.pdf
code Hans:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
import nltk

la = np.linalg

sentences = [
    "Vandaag hebben we neural networks bestudeerd",
    "Cristina was er ook, en Gijs niet",
    "vandaag was het deep",
    "net zo deep als deep learning"
]
# sentences = ["I like deep learning.", "I like NLP.", "I enjoy flying."]

# unique words of the text
prematrix = set()
for sentence in sentences:
    words = sentence.split(" ")
    for word in words:
        word = word.lower()
        word = word.strip()
        prematrix.add(word)

# order set & turn into list
pre2 = sorted(list(prematrix))

# create bigrams (lowercased, so they match the vocabulary above)
bigram = []
for sentence in sentences:
    for b in nltk.bigrams(sentence.lower().split()):
        bigram.append(b)

# create co-occurrence matrix:
# start from a matrix of zeros, with the length of the vocabulary on both sides
X = np.zeros((len(pre2), len(pre2)), dtype=int)

# for each bigram, add one in both directions (symmetric matrix)
for b in bigram:
    X[pre2.index(b[0]), pre2.index(b[1])] = X[pre2.index(b[0]), pre2.index(b[1])] + 1
    X[pre2.index(b[1]), pre2.index(b[0])] = X[pre2.index(b[1]), pre2.index(b[0])] + 1

print X
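The session script linked above (course-2_svd-word-vector.py) continues from such a co-occurrence matrix with an SVD, following slide 13. A minimal sketch of that step, assuming it is appended to Hans' code above (it reuses X, pre2, la and plt from there):

# reduce the co-occurrence matrix X with SVD and plot every word
# on its first two latent dimensions
U, s, Vh = la.svd(X, full_matrices=False)

for i, word in enumerate(pre2):
    plt.text(U[i, 0], U[i, 1], word)

plt.xlim(U[:, 0].min() - 0.5, U[:, 0].max() + 0.5)
plt.ylim(U[:, 1].min() - 0.5, U[:, 1].max() + 0.5)
plt.show()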
code Gijs:
import nltk

sentences = [
    "I like deep learning",
    "I like NLP",
    "I enjoy flying"
]

counter = dict()
words = list()
for sentence in sentences:
    for ngram in nltk.bigrams(sentence.split()):
        ngram_sorted = tuple(sorted(ngram))
        if ngram_sorted not in counter:
            counter[ngram_sorted] = 0
        counter[ngram_sorted] += 1
        for word in ngram_sorted:
            if word not in words:
                words.append(word)

words.sort()

# co-occurrence count for every word pair, 0 if the pair never occurs
matrix = [[counter[tuple(sorted((word1, word2)))] if tuple(sorted((word1, word2))) in counter else 0 for word2 in words] for word1 in words]

"""
'expanded' version of the one-line matrix comprehension above:
matrix = []
for word1 in words:
    row = []
    for word2 in words:
        key = tuple(sorted([word1, word2]))
        if key in counter:
            row.append(counter[key])
        else:
            row.append(0)
    matrix.append(row)
"""

# print the matrix with the words as row and column labels
print "{: >10}".format('') + ' ' + ''.join(["{: <10}".format(word) for word in words])
for k, word in enumerate(words):
    print "{: >10}".format(word) + ' ' + ''.join(["{: <10}".format(c) for c in matrix[k]])
Course 3
Slides: https://cs224d.stanford.edu/lectures/CS224d-Lecture3.pdf
Video: https://www.youtube.com/watch?v=UOGMsFw9V_w&index=3&list=PLcGUo322oqu9n4i0X3cRJgKyVy7OkDdoi
ways to calculate the probability that the outside word o appears close to the center word c
u_o = vector of the outside word, v_c = vector of the center word
exp = e to the power of
https://en.wikipedia.org/wiki/E_(mathematical_constant)
slide 2
p(o|c) = calculate the probability that two words appear next to each other (with window = 1)
p(o|c) = exp(u_o^T v_c) / sum over all words w in the vocabulary of exp(u_w^T v_c)
u_o^T v_c = vector of the outside word * vector of the center word; the denominator is the sum over all possibilities
gradient descent means trying to get to the lowest point
- u = outside word vector & v = center word vector & c = center word
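A minimal sketch of that "lowest point" idea, with a made-up one-dimensional function (not from the lecture code): gradient descent on f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).

# gradient of the toy function f(x) = (x - 3)^2
def gradient(x):
    return 2 * (x - 3)

x = 10.0         # arbitrary starting point
step_size = 0.1  # learning rate

for i in range(50):
    x = x - step_size * gradient(x)  # take a small step downhill

print("x after 50 steps: %.4f (the lowest point is at x = 3)" % x)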
EXAMPLE of the probability calculation used in stochastic gradient descent - a way to calculate grammar positions ;-)
sentences = ["I like deep learning", "I like NLP", "I enjoy flying"]
words = ['I', 'NLP', 'deep', 'enjoy', 'flying', 'learning', 'like']
matrix =
[[0 0 0 1 0 0 2]    (I)
 [0 0 0 0 0 0 1]    (NLP) > center word
 [0 0 0 0 0 1 1]    (deep) > outside word
 [1 0 0 0 1 0 0]    (enjoy)
 [0 0 0 1 0 0 0]    (flying)
 [0 0 1 0 0 0 0]    (learning)
 [2 1 1 0 0 0 0]]   (like)
p(o|c) = p(deep|NLP):
numerator: e^(center word · outside word) = e^(vector_NLP · vector_deep) = e^(0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*1 + 1*1) = e^1
denominator: sum over every word w in the vocabulary of e^(vector_w · vector_NLP) = e^2 (I) + e^1 (NLP) + e^1 (deep) + e^0 (enjoy) + e^0 (flying) + e^0 (learning) + e^0 (like) ≈ 16.8
p(deep|NLP) = e^1 / 16.8 ≈ 0.16
e = natural growth constant (exp, see the Wikipedia link above); exponentiating is a way to avoid zeros
- u^T (transpose): takes the column vector of a word and turns it into a row -> horizontal rows could be used directly, but this is the convention
- multiply it with the vector of the other word (still a column): multiplying the coordinates one by one (coordinates in the vector space, to see if the words are close to each other) - cf. the numerator in the example above
- take e to the exponent of that multiplication - cf. the numerator in the example above
- divide by the sum of the same multiplication for every word vector - cf. the denominator in the example above
- this gives a chance for each of the words
- the sum of all chances is always 1
cf. throwing a die: each side has a 1/6 chance of ending on top; all six together sum to 1
p(o|c) is in that case p(1|6), and p(1|6) * 6 = 1
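A small sketch to check the worked example above in code: the same matrix and word order as in Gijs' code, and the softmax probability p(deep|NLP) computed from the co-occurrence rows.

import numpy as np

words = ['I', 'NLP', 'deep', 'enjoy', 'flying', 'learning', 'like']
matrix = np.array([
    [0, 0, 0, 1, 0, 0, 2],
    [0, 0, 0, 0, 0, 0, 1],   # center word (NLP)
    [0, 0, 0, 0, 0, 1, 1],   # outside word (deep)
    [1, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
    [2, 1, 1, 0, 0, 0, 0],
])

v_c = matrix[words.index('NLP')]   # center word vector
u_o = matrix[words.index('deep')]  # outside word vector

numerator = np.exp(np.dot(u_o, v_c))               # e^1
denominator = np.sum(np.exp(np.dot(matrix, v_c)))  # sum over all words in the vocabulary
print("p(deep|NLP) = %.4f" % (numerator / denominator))

# all the probabilities together sum to 1, like the sides of the die
print("sum over all w of p(w|NLP) = %.4f" % np.sum(np.exp(np.dot(matrix, v_c)) / denominator))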
instead of doing this with all the words of the collection, he does it with a sample
and each time, it shifts with a step size to another sample
cfr linear algebra!!!
We only work with 3 vectors, that we update each time
don't keep around all zeros for all word vectors
-> working with indexing (either one index for each word vector // cf. the system in Hans' code above, or one large matrix with indexing)
-> one for U, one for V
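A small sketch of that indexing idea (names are made up for the example): one index per word and two matrices, U for the outside word vectors and V for the center word vectors; one SGD step only touches the rows of the words in the current window.

import numpy as np

vocabulary = ['I', 'NLP', 'deep', 'enjoy', 'flying', 'learning', 'like']
word_to_index = {word: i for i, word in enumerate(vocabulary)}

dimensions = 5  # small for the sketch; 300 is more usual (see the notes on slide 28 below)
U = np.random.rand(len(vocabulary), dimensions)  # outside word vectors
V = np.random.rand(len(vocabulary), dimensions)  # center word vectors

# during one step, only the sampled rows are read and updated,
# e.g. center word 'NLP' and outside word 'deep':
c = word_to_index['NLP']
o = word_to_index['deep']
v_c, u_o = V[c], U[o]  # the only vectors this step needs to touch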
slide 10
https://en.wikipedia.org/wiki/Hash_table
https://nl.wikipedia.org/wiki/Hashtabel
slide 12
Skip Gram Model
a way to not take into account all words that are not co-occurring (only looking for those that do & a few random words // k-random)
k = size of the sample set, "usually 20 or so is a good number" (a different sample for each window)
-> formula to sample less frequent words more regularly: P(w) = U(w)^(3/4) / Z
sigmoid function = the output ranges between 0 & 1, the input (in the plot) from about -4 to +4 >>> the derivative is always positive, as the curve is always going up from -4 to 4 >>> https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/320px-Logistic-curve.svg.png
s(-x) = 1 - s(x)
a way to have numbers between 0 and 1, therefore they are 'probabilities' ("proper statisticians will crunch here")
a logarithmic function is the inverse of an exponential function
quote: "especially if you have a very simple binary problem to solve"
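A small sketch of the sampling formula and the sigmoid, with made-up word counts (not the lecture code): raise the unigram counts U(w) to the power 3/4, divide by Z so they sum to 1, then draw k negative samples.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# made-up unigram counts U(w)
words = ['the', 'cat', 'deep', 'learning', 'aardvark']
counts = np.array([1000.0, 50.0, 20.0, 20.0, 1.0])

p = counts ** 0.75  # U(w)^(3/4): lifts the rare words a bit
p = p / p.sum()     # divide by Z, the normalising constant

k = 20              # "usually 20 or so is a good number"
negative_samples = np.random.choice(words, size=k, p=p)
print(negative_samples)

# the sigmoid squashes any score into (0, 1), and s(-x) = 1 - s(x):
print("s(2) = %.4f, 1 - s(-2) = %.4f" % (sigmoid(2.0), 1 - sigmoid(-2.0)))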
slide 14
Continuous bag of words
calculate the average of all surrounding word vectors to predict the center word
"if you code Skip Gram Model well, it will be very easy to invert to Continuous Bag of Words"
slide 15
count based vs. direct prediction
slide 16
GloVe ('global vector model', best of both worlds)
u_i^T v_j: prediction of how often i, j might co-occur ("co-occurrence counts")
log P (i,j): how often i, j co-occur (a way to maximize the extremes, very high/very low)
cut off with the f-function: otherwise the model would spend a lot of parameters on dominant (most frequent) co-occurrences, f.ex. "the cat"; it is a way to keep words that appear very often from dominating the whole corpus. You want to do this because you are interested in the most distinctive co-occurrences, and not so much in very common words.
slide 18
to capture similar word vectors:
calculate U and V vectors, sum of all / concatenate -> sum is most common
P_ij = the entry for i & j in the matrix > co-occurrence count > how often 'like' and 'learning' occur together (in absolute numbers)
(u_i^T v_j - log P_ij): the dot product of the two word vectors minus the log of the co-occurrence count of i, j
f() = maximum is 10
combines co-occurrence counts (P_ij) & probabilities
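A small sketch of the GloVe cost for a single word pair, with made-up numbers and without the bias terms of the full model: the f-weighted squared difference between the dot product u_i · v_j and log P_ij. The cut-off x_max = 10 follows the "maximum is 10" note above; alpha = 3/4 is an assumption.

import numpy as np

def f(x, x_max=10.0, alpha=0.75):
    # weighting function: grows with the count, but is capped so that
    # very frequent pairs like "the cat" cannot dominate
    return (x / x_max) ** alpha if x < x_max else 1.0

def pair_cost(u_i, v_j, P_ij):
    return f(P_ij) * (np.dot(u_i, v_j) - np.log(P_ij)) ** 2

# made-up vectors and a made-up co-occurrence count for ('like', 'learning')
u_like = np.array([0.2, 0.5, 0.1])
v_learning = np.array([0.3, 0.4, 0.0])
print("cost for this pair: %.4f" % pair_cost(u_like, v_learning, P_ij=3.0))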
slide 19
How to evaluate word vectors?
(evaluating how well you are doing)
Intrinsic / Extrinsic
you want both, but it will be hard to have both in most cases....
Machine translation with a NN can take weeks to train your model
INTRINSIC EVALUATION
- word vector analogies (man/woman ~ king/queen): be critical, might be less linear than is believed
cfr graphic slide 21, 22 'you eventually capture these kind of things', similar geometric relationships
-> able to look up facts
but when different cities share the same name, it will take the most frequent one (ex Paris/London in the US)
-> needs a lot of tests with different parameters (see examples), to see if the model captures all of them
retrain for different corpora?
train on largest possible corpus (put them all together)
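A minimal sketch of the analogy test with tiny made-up vectors: man is to woman as king is to ?, answered by vector arithmetic and picking the nearest word by cosine similarity.

import numpy as np

# made-up 2-dimensional word vectors, just to show the mechanics
vectors = {
    'man':   np.array([1.0, 0.0]),
    'woman': np.array([1.0, 1.0]),
    'king':  np.array([3.0, 0.2]),
    'queen': np.array([3.0, 1.2]),
    'apple': np.array([0.1, 2.0]),
}

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# man : woman = king : ?
target = vectors['king'] - vectors['man'] + vectors['woman']
candidates = [w for w in vectors if w not in ('man', 'woman', 'king')]
best = max(candidates, key=lambda w: cosine(vectors[w], target))
print("man : woman = king : %s" % best)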
slide 28
- precise calculations
ideally 6 billion tokens (6B)
same model, same dimensions (300), larger corpus (42B) - in almost all cases gives best results
-> always change only 1 element!!!
-> no gain between 300 & 1000 dimensions; in terms of efficiency you want to find the lowest dimensionality possible
-> window size: syntactic performance decreases with too large windows, but semantics keeps going up
overall: window size of 8
compute & visualise in as many plots as you can
Common crawl: combination of all different texts: wikipedia, google news, websites.... (it is a huge pain to deal with the gigantic size: it needs a couple of weeks of training with Python, a couple of days with efficient C++)
- Human judgement
"you can have student annotating word similarities, of course this is very subjective, but in general it is very useful" -- what is the degree of generalness?
slide 34/35
Ambiguity
see linguistic ambiguity examples: https://en.wikipedia.org/wiki/List_of_linguistic_example_sentences
- pull the word in different directions, f.ex. in noun or verb directions during training, and look at what happens in a 2-dimensional world
- cluster with K-means, in 3D, and train the model again; result: words with multiple indexes & multiple vectors
ok for intrinsic tasks, not very necessary (in the projects of the students) :-)
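A small sketch of that clustering step, with made-up context vectors: cluster the contexts a word appears in with K-means, and give each cluster its own index (and so its own vector) for that word.

import numpy as np
from sklearn.cluster import KMeans

# made-up 3-dimensional context vectors for different occurrences of "bank"
contexts_of_bank = np.array([
    [0.9, 0.1, 0.0],   # "river bank"-like contexts
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.8],   # "money bank"-like contexts
    [0.0, 0.8, 0.9],
])

kmeans = KMeans(n_clusters=2, n_init=10).fit(contexts_of_bank)
# each occurrence now becomes 'bank_0' or 'bank_1', each with its own vector
for label in kmeans.labels_:
    print("bank_%d" % label)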
EXTRINSIC EVALUATION
use all of the same?
differences might not be as large as for intrinsic evaluation
Q:
what are the names recurrent / reinforcement / convolutional neural networks referring to?
USES
- classification
- capturing facts
- multiclass classification:
-> uses softmax
slide 38
Next step:
softmax > classifying vectors into multiple classes (sigmoid is the version for 2 classes)
The formula builds upon the previous results/matrix, and uses the data to classify groups of word vectors
P(y|x) = give vector x, and ask what the probability is that vector x belongs to class y
word vector x
W_y: we take the y-th row of matrix W
C = the number of classes you have
c is a specific class, a row (because the row is already a row vector, we do not transpose)
d = the number of columns/dimensions
normalize over all classes (divide by the sum over all C classes, so the probabilities for all y sum to 1)
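A minimal sketch of the softmax formula from the slide, P(y|x) = exp(W_y x) / sum over all classes c of exp(W_c x), with made-up numbers for W and x.

import numpy as np

C, d = 3, 4               # C classes, d-dimensional word vectors
W = np.random.rand(C, d)  # one row W_y per class
x = np.random.rand(d)     # the word vector to classify

scores = np.dot(W, x)                                    # W_y x for every class y
probabilities = np.exp(scores) / np.sum(np.exp(scores))  # normalize over all C classes

print(probabilities)      # C numbers that sum to 1
print("P(y=0|x) = %.4f" % probabilities[0])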
Other sources:
http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression