Tuesday 23 May 2017

Previous sessions with all resources (in nr 1):
http://pad.constantvzw.org/public_pad/neural_networks_6
http://pad.constantvzw.org/public_pad/neural_networks_5
http://pad.constantvzw.org/public_pad/neural_networks_4
http://pad.constantvzw.org/public_pad/neural_networks_3
http://pad.constantvzw.org/public_pad/neural_networks_2
http://pad.constantvzw.org/public_pad/neural_networks_1
http://pad.constantvzw.org/public_pad/neural_networks_algolit_extensions
http://pad.constantvzw.org/public_pad/neural_networks_small_dict
http://pad.constantvzw.org/public_pad/neural_networks_maisondulivre

A practical exercise: softmax exploration
Following assignment 1, from PSET 1, and listening back to lecture 3: the simplest window classifier, Softmax.
assignment 1: http://web.stanford.edu/class/cs224d/assignment1/assignment1.pdf
PSET 1 overview: http://web.stanford.edu/class/cs224d/assignment1/index.html
lecture notes on the softmax: http://web.stanford.edu/class/cs224d/lecture_notes/notes2.pdf
wiki page (with example): https://en.wikipedia.org/wiki/Softmax_function

*softmax
a classifier for classification problems
"Logistic regression = Softmax classification on word vector x to obtain probability for class y" (from slides lecture 3)
softmax, cross entropy and logistic regression are often used as synonyms; they are not exactly synonyms, but they do similar things:
    softmax = for multi-class problems
    logistic regression = for binary (2-class) problems
    cross entropy = the loss function used for softmax
'The softmax classifier is a linear classifier that uses the cross-entropy loss function. In other words, the gradient of the above function tells a softmax classifier how exactly to update its weights using something like gradient descent.'

example from lecture 3
problem: Named Entity Recognition, location detection
sample sentence: "museums in paris are amazing"
window size = 2, center word = "paris"
the resulting vector is a column vector: a concatenation of 5 word vectors (the center word plus 2 words on each side), so it has 5 times the dimension of a single word vector
how to take derivatives with respect to the word vectors for the next layers: the softmax is considered one layer, the word vectors can be considered another layer

step 1: define all variables
create word vectors; the softmax is then a simple next step. It is not a common task on its own, but it is a good, simple example for now.
p(y|x): given word vector x and class y, we take the y-th row of our matrix W and multiply it with the column vector x;
we normalize this over all the classes, so that the sum of the probabilities is 1 (the sigmoid function is the version that works with 2 classes).
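
A minimal sketch of this window classifier in Python (with numpy), to make the steps above concrete. Everything numerical here is invented for illustration: the word-vector dimension d, the two classes and the random values of the word vectors and of the weight matrix W are assumptions, not values from the course. Only the structure follows the notes: concatenate the window into x, take W times x, normalize with softmax, read off p(y|x), and compute the cross-entropy loss for the true class.

import numpy as np

np.random.seed(0)

d = 4                                                   # dimensions per word vector (made up)
window = ["museums", "in", "paris", "are", "amazing"]   # window size 2 around the center word "paris"
classes = ["LOCATION", "NOT_LOCATION"]                  # made-up set of classes for location detection

# made-up word vectors: one d-dimensional vector per word in the window
vectors = {w: np.random.randn(d) for w in window}

# the window vector x: concatenation of the 5 word vectors -> a 5*d dimensional column vector
x = np.concatenate([vectors[w] for w in window])        # shape (20,)

# weight matrix W: one row per class, one column per dimension of x
W = np.random.randn(len(classes), len(x))               # shape (2, 20)

# unnormalized score for class y: the y-th row of W multiplied with x
scores = W.dot(x)                                       # shape (2,)

# softmax: exponentiate and normalize so that the probabilities sum to 1
exp_scores = np.exp(scores)
p = exp_scores / exp_scores.sum()

for c, prob in zip(classes, p):
    print 'p(%s | x) = %.3f' % (c, prob)
print 'sum of probabilities:', p.sum()                  # should be 1.0

# cross-entropy loss for the true class ("paris" is a LOCATION)
y = classes.index("LOCATION")
print 'cross-entropy loss:', -np.log(p[y])

Training would then adjust W (and, in the full model, the word vectors themselves) by gradient descent so that this loss gets smaller; that is what the derivatives mentioned above are for.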
as we train our softmax, we use a loss/cost/objective function (we minimize or maximize the cost)
the loss for softmax = cross entropy
we compute the probability of a word vector x for a certain class y: take the y-th row of W and multiply that row with x
f(y) = the score for the y-th class (the y-th row of W times x)
C = the number of classes; we compute all the f's, one for each class
the loss wants to maximize the probability of x for class y
all this comes back in information theory: when training a softmax classifier we try to optimize (minimize) the cross-entropy error

our previous notes on this part of the course:
http://pad.constantvzw.org/public_pad/neural_networks_4
*slide 38
*Next step:
*softmax > classifying vectors into multiple classes (sigmoid is the version for 2 classes)
*The formula builds upon the previous results/matrix, and uses the data to classify groups of word vectors
*
*P(y|x) = given vector x, ask what the probability is that vector x belongs to class y
*P(y|x) = exp(Wy · x) / Σc exp(Wc · x)
*
*word vector x
*Wy: we take the y-th row of matrix W
*C = the number of classes you have
*c is a specific class, a row (because the row is already a row vector, we do not transpose)
*d = the number of columns/dimensions
*normalize over all classes (sum the probabilities of all classes, C in total)
window size (number of words in the window) x dimensions per word vector = number of dimensions of x

Softmax function scripts in Python:
https://martin-thoma.com/softmax/
https://stackoverflow.com/questions/34968722/softmax-function-python
wiki softmax function (expanded version with many options that do the same, or approximations to the very same)

following the Wikipedia example: https://en.wikipedia.org/wiki/Softmax_function

import math

print '~~'

x = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0]

# ORIGINAL CODE -----------------------------------
# x_exp = [math.exp(i) for i in x]

# STUDY CODE -----------------------------------
x_exp = []
for i in x:
    # exp = math.exp(i)
    exp = math.e**i                  # math.e = 2.718281828459045 (this is an approximation) [~~]
    # exp = 2.718281828459045**i     # (this is an approximation) [~~]
    x_exp.append(exp)
print x_exp
# Result: [2.72, 7.39, 20.09, 54.6, 2.72, 7.39, 20.09]

print '~~'

# ORIGINAL CODE -----------------------------------
sum_x_exp = sum(x_exp)
print sum_x_exp
# Result: 114.98

# STUDY CODE -----------------------------------
sumofall = 0
for value in x_exp:
    sumofall = sumofall + value
print sumofall

print '~~'

# ORIGINAL CODE -----------------------------------
softmax = [round(i / sum_x_exp, 3) for i in x_exp]
print softmax
# Result: [0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175]

# STUDY CODE -----------------------------------
# [~~] remember!!! in Python 2, 5/3=1
print 5/3
# [~~] remember!!! in Python 2, 3/5=0
print 3/5
# [~~] remember!!! 3.0/5.0=0.6
print 3.0/5.0

softmax = []
for y in x_exp:
    s = round(y / sumofall, 3)       # round() = round(number[, ndigits])
    print 'y:', y
    print 'y/sum:', y / sumofall
    # [~~] for input 1.0: 2.71(...) / 114.9(...) = 0.0236405430216
    # [~~] for input 2.0: 7.38(...) / 114.9(...) = 0.0642616585105
    # [~~] for input 3.0: 20.08(...) / 114.9(...) = 0.174681298596
    # [~~] for input 4.0: 54.59(...) / 114.9(...) = 0.474832999744
    softmax.append(s)
print softmax

print '~~'

# ~~~~~~~~~~~~~~~~~~~~~~~~
# links:
# ~~~~~~~~~~~~~~~~~~~~~~~~
# source: https://docs.scipy.org/doc/numpy/reference/generated/numpy.exp.html
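
The last link above points to numpy.exp; as a comparison with the step-by-step study script, here is a short numpy sketch of the same Wikipedia example. The input list and the expected rounded results are the ones used above; the rest is one possible way of writing it with numpy, not code from the course.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0])

x_exp = np.exp(x)                 # elementwise e**x_i, same as the loop with math.e above
softmax = x_exp / x_exp.sum()     # normalize so that the values sum to 1

print np.round(softmax, 3)
# expected: [ 0.024  0.064  0.175  0.475  0.024  0.064  0.175]
print softmax.sum()               # expected: (approximately) 1.0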