Previous sessions with all resources (in nr 1)
http://pad.constantvzw.org/public_pad/neural_networks_3
http://pad.constantvzw.org/public_pad/neural_networks_2
http://pad.constantvzw.org/public_pad/neural_networks_1

Response from some people at Transmediale: interesting research, possible to do a demo at the end with what we call 'algolit extensions'
maybe dedicate a last session to the preparation of a workshop/small demo exhibition?

Proposal to start with homework & watch videos when Gijs joins (11h)

Here is the script: http://algolit.constantvzw.org/neural_networks/course-2_svd-word-vector.py
Following Slide 13 on: https://cs224d.stanford.edu/lectures/CS224d-Lecture2.pdf

code Hans:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
import nltk

la = np.linalg

sentences = [
    "Vandaag hebben we neural networks bestudeerd",
    "Cristina was er ook, en Gijs niet",
    "vandaag was het deep",
    "net zo deep als deep learning"
]
# sentences = ["I like deep learning.", "I like NLP.", "I enjoy flying."]

# unique words of the text
prematrix = set()
for sentence in sentences:
    words = sentence.split(" ")
    for word in words:
        word = word.lower()
        word = word.strip()
        prematrix.add(word)

# order set & turn into list
pre2 = sorted(list(prematrix))

# create bigrams (lowercased, so they match the vocabulary above)
bigram = []
for sentence in sentences:
    for b in nltk.bigrams(sentence.lower().split()):
        bigram.append(b)

# create co-occurrence matrix
# create matrix with zeros, having the length of the vocabulary
X = np.zeros((len(pre2), len(pre2)), dtype=int)

# for each bigram, add one (in both directions, so the matrix stays symmetric)
for b in bigram:
    X[pre2.index(b[0]), pre2.index(b[1])] = X[pre2.index(b[0]), pre2.index(b[1])] + 1
    X[pre2.index(b[1]), pre2.index(b[0])] = X[pre2.index(b[1]), pre2.index(b[0])] + 1

print X

code Gijs:

import nltk

sentences = [
    "I like deep learning",
    "I like NLP",
    "I enjoy flying"
]

counter = dict()
words = list()

# count each (sorted) bigram and collect the vocabulary
for sentence in sentences:
    for ngram in nltk.bigrams(sentence.split()):
        ngram_sorted = tuple(sorted(ngram))
        if ngram_sorted not in counter:
            counter[ngram_sorted] = 0
        counter[ngram_sorted] += 1
        for word in ngram_sorted:
            if word not in words:
                words.append(word)

words.sort()

matrix = [[counter[tuple(sorted((word1, word2)))] if tuple(sorted((word1, word2))) in counter else 0 for word2 in words] for word1 in words]

"""
'expanded' version of the one-line matrix comprehension above
matrix = []
for word1 in words:
    row = []
    for word2 in words:
        key = tuple(sorted([word1, word2]))
        if key in counter:
            row.append(counter[key])
        else:
            row.append(0)
    matrix.append(row)
"""

print "{: >10}".format('') + ' ' + ''.join(["{: <10}".format(word) for word in words])
for k, word in enumerate(words):
    print "{: >10}".format(word) + ' ' + ''.join(["{: <10}".format(c) for c in matrix[k]])

Course 3
Slides: https://cs224d.stanford.edu/lectures/CS224d-Lecture3.pdf
Video: https://www.youtube.com/watch?v=UOGMsFw9V_w&index=3&list=PLcGUo322oqu9n4i0X3cRJgKyVy7OkDdoi

ways to calculate the probability that word Uo appears close to Uc
Uo = outside word, Uc = center word
exp = e to the power of (see https://en.wikipedia.org/wiki/E_(mathematical_constant))

slide 2
p(o|c) = calculate the probability that 2 words appear next to each other (with window = 1)
p(o|c) = exp(Uo^T Vc) / the sum over all possibilities
Uo^T Vc = vector of the outside word * vector of the center word
gradient descent tries to move towards the lowest point
- u = outside word & v = predicted word?
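A minimal sketch of the slide 2 formula (this is not part of the course script; the toy vocabulary and the random vectors are our own assumptions, just to see the numbers work): p(o|c) = exp(Uo^T Vc) divided by the sum of exp(Uw^T Vc) over all words w, i.e. a softmax over dot products.

import numpy as np

np.random.seed(0)
vocabulary = ["I", "NLP", "deep", "enjoy", "flying", "learning", "like"]
dimensions = 5

# one 'outside' vector u and one 'center' vector v per word, initialised randomly
U = np.random.rand(len(vocabulary), dimensions)
V = np.random.rand(len(vocabulary), dimensions)

def p_outside_given_center(outside, center):
    v_c = V[vocabulary.index(center)]       # vector of the center word
    scores = U.dot(v_c)                     # Uw^T Vc for every word w
    return np.exp(scores[vocabulary.index(outside)]) / np.sum(np.exp(scores))

print(p_outside_given_center("deep", "learning"))
# the probabilities for one center word always sum to 1:
print(sum(p_outside_given_center(w, "learning") for w in vocabulary))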
- c = center word

EXAMPLE of stochastic gradient descent - a way to calculate grammar positions ;-)

sentences = "I like deep learning", "I like NLP", "I enjoy flying"
words = ['I', 'NLP', 'deep', 'enjoy', 'flying', 'learning', 'like']
matrix = [[0 0 0 1 0 0 2]
          [0 0 0 0 0 0 1]   > center word (NLP)
          [0 0 0 0 0 1 1]   > outer word (deep)
          [1 0 0 0 1 0 0]
          [0 0 0 1 0 0 0]
          [0 0 1 0 0 0 0]
          [2 1 1 0 0 0 0]]

p(o|c) = e^(center word * outer word) = e^(vector_NLP * vector_deep)
         = e^(0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*1 + 1*1) = e^1
         -----------------------------------------------------------------------------------
           e^(sum of the multiplications of all the vectors = 0*0*0*1*0*0*2 + 0*0*0*0*0*0*1 + 0*0*0*0*0*1*1 + etc.) = e^(4)

e = natural growth number (exp, see the Wikipedia link above), a way to avoid zeros

- U^T (transpose): takes the columns and converts them into rows -> horizontal rows could be used, but this is the norm
- multiplies the vector with u (outside word), another column: multiplying each number of each column (coordinates in the vector space, to see if they're close to each other) - cf. the worked example above
- e to the exponent of that multiplication - cf. above
- divide by the sum of the multiplication of each vector - cf. above
- this gives a chance for each of the words
- the sum of all chances is always 1
  cf. throwing a die: each side has a 1/6 chance to be on top; all six together sum to 1
  p(o|c) is in that case p(1|6), and p(1|6) * 6 = 1

instead of doing this with all the words of the collection, he does it with a sample, and each time it shifts with a step size to another sample
cf. linear algebra!!! we only work with 3 vectors, which we update each time
don't keep around all zeros for all word vectors -> work with indexing (either one index for each word vector // cf. the system of Hans in the code above, or one large matrix with indexing) -> one for U, one for V

slide 10
https://en.wikipedia.org/wiki/Hash_table
https://nl.wikipedia.org/wiki/Hashtabel

slide 12
Skip-Gram Model
a way to not take into account all the words that are not co-occurring (only looking at those that do & a few random words // k random)
- k = size of the sample set, "usally 20 or so is a good number" for the sample set (different for each window)
- formula to sample less frequent words more regularly: P(w) = U(w)^(3/4) / Z
- sigmoid function = the output ranges between 0 & 1, the input U ranges from -4 to +4
  >>> the derivative is always positive, as the line is always going up from -4 to 4
  >>> https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/320px-Logistic-curve.svg.png
- s(-x) = 1 - s(x)
- a way to have numbers between 0 and 1, therefore they are 'probabilities' ("proper statisticians will crunch here")
- a logarithmic function is the inverse of an exponential function
quote: "especially if you have a very simple binary problem to solve"
(a small sketch of the sigmoid & the sampling formula follows below, after the slide 15 note)

slide 14
Continuous Bag of Words
calculate the average of all surrounding words to predict the center word
"if you code Skip Gram Model well, it will be very easy to invert to Continuous Bag of Words"

slide 15
count based vs. direct prediction
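A rough sketch of two ingredients from the slide 12 notes above (the unigram counts are invented, not from the lecture): the sigmoid function, and the sampling distribution P(w) = U(w)^(3/4) / Z, where U(w) is the unigram count of word w and Z normalises everything so the probabilities sum to 1.

import numpy as np

def sigmoid(x):
    # output between 0 and 1, and s(-x) = 1 - s(x)
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(2.0))    # ~0.88
print(sigmoid(-2.0))   # ~0.12, equal to 1 - sigmoid(2.0)

# invented unigram counts, just for illustration
unigram_counts = {"the": 1000, "cat": 50, "flying": 5, "NLP": 2}
raised = {w: count ** 0.75 for w, count in unigram_counts.items()}
Z = sum(raised.values())
P = {w: r / Z for w, r in raised.items()}

# raising the counts to the power 3/4 flattens the distribution:
# frequent words ("the") are sampled a bit less often than their raw frequency,
# rare words ("NLP") a bit more often
total = float(sum(unigram_counts.values()))
for w in unigram_counts:
    print("%-8s raw frequency %.4f -> sampling probability %.4f" % (w, unigram_counts[w] / total, P[w]))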
"the cat", a way to get rid of words that appear very often to prevent that they dominate the whole corpus. You want to do this as you are interested in the most unique appearances, and not so much in very common words. slide 18 to capture similar word vectors: calculate U and V vectors, sum of all / concatenate -> sum is most common *P(ij) = positie van i & j in the matrix > coocurence count > how often 'like' and 'leanring' occur together in an abs *(u_i * v_j - log Pij) multiply two coocurence vectors - log (coocurence count of ij) *f() = max. is 10 combines coocurence counts (Pij) & probabilities slide 19 How to evaluate word vectors? (evaluating how well you are doing) Intrinsic / Extrinsic you want both, but it will be hard to have both in most cases.... Machine translation with NN, can take weeks to train your model INTRINSIC EVALUATION - word vector analogies (man/woman ~ king/queen): be critical, might be less linear than is believed cfr graphic slide 21, 22 'you eventually capture these kind of things', similar geometric relationships -> able to look up facts but when names of cities are the same - it will take most frequent (ex Paris/London in US) -> needs a lot of tests with diferent parameters (see examples), see if models captures all retrain for different corpora? train on largest possible corpus (put them all together) slide 28 - precise calculations ideally 6 billion tokens (6B) same model, same dimensions (300), larger corpus (42B) - in almost all cases gives best results -> always change only 1 element!!! -> no gain between 300 & 1000 dimensions, in terms of efficiency you want to find the lowest possible -> window size: syntactic performance decreases with too large windows, but semantics keeps going up overall: window size of 8 compute & visualise in as many plots as you can Common crawl: combination of all different texts: wikipedia, google news, websites.... (it is a huge pain to deal with the gigantic size, needs couple of weeks of training with Python, a couple of days with efficient C++) - Human judgement "you can have student annotating word similarities, of course this is very subjective, but in general it is very useful" -- what is the degree of generalness? slide 34/35 Ambiguity see linguistic ambiguity examples: https://en.wikipedia.org/wiki/List_of_linguistic_example_sentences - pull the word in different directions, f.ex. in noun or verb directions during training, and look what happens in 2-dimensional world - cluster with K-means, in 3D, and train the model again, result: have words with multiple indexes & multiple vectors ok for intrinsic tasks, not very necessary (in the projects of the students) :-) EXTRINSIC EVALUATION use all of the same? differences might not be as large as for intrinsic evaluation Q: what are the names recurrent / reinforcement / convulutional neural networks reffering to? 
USES
- classification
- capturing facts
- multiclass classification -> uses softmax

slide 38
Next step: softmax > classifying vectors into multiple classes (sigmoid is the version for 2 classes)
The formula builds upon the previous results/matrix, and uses the data to classify groups of word vectors
P(y|x) = given vector x, ask what the probability is that vector x belongs to class y
P(y|x) = exp(W_y x) / sum over all classes c of exp(W_c x)
- x: the word vector
- W_y: we take the y-th row of matrix W
- C: the amount of classes you have
- c: a specific class, a row (because the row is already a row vector, we do not transpose)
- d: columns/dimensions
- normalize over all classes (all probabilities of y, notated as C)

Other sources: http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression
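A small illustration of the slide 38 softmax classifier (the class count, the dimensions and the random numbers are made up for this sketch): P(y|x) = exp(W_y x) / the sum of exp(W_c x) over all C classes, where W has one row vector per class and x is a d-dimensional word vector.

import numpy as np

np.random.seed(2)
C = 3   # amount of classes
d = 5   # columns/dimensions of the word vector
W = np.random.rand(C, d)    # one row vector W_c per class
x = np.random.rand(d)       # the word vector we want to classify

scores = W.dot(x)                                          # W_c x for every class c
probabilities = np.exp(scores) / np.sum(np.exp(scores))    # normalize over all classes

print(probabilities)          # one probability per class y
print(probabilities.sum())    # always 1, like the die example above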