Tuesday 23 May 2017
Previous sessions with all resources (in nr 1)
http://pad.constantvzw.org/public_pad/neural_networks_6
http://pad.constantvzw.org/public_pad/neural_networks_5
http://pad.constantvzw.org/public_pad/neural_networks_4
http://pad.constantvzw.org/public_pad/neural_networks_3
http://pad.constantvzw.org/public_pad/neural_networks_2
http://pad.constantvzw.org/public_pad/neural_networks_1
http://pad.constantvzw.org/public_pad/neural_networks_algolit_extensions
http://pad.constantvzw.org/public_pad/neural_networks_small_dict
http://pad.constantvzw.org/public_pad/neural_networks_maisondulivre


a practical exercise
softmax exploration

Following assignment 1 from PSET 1,
and listening back to lecture 3, Simplest window classifier: Softmax

assignment 1: http://web.stanford.edu/class/cs224d/assignment1/assignment1.pdf
PSET 1 overview: http://web.stanford.edu/class/cs224d/assignment1/index.html
lecture notes on the softmax: http://web.stanford.edu/class/cs224d/lecture_notes/notes2.pdf
wiki page (with example): https://en.wikipedia.org/wiki/Softmax_function

*softmax = a classifier for classification problems
"Logistic regression = Softmax classification on word vector x to obtain probability for class y" (from slides lecture 3)

softmax = cross entropy = logistic regression (not exact synonyms, but they do similar things and are often used interchangeably)
softmax = for multiclass problems
logistic regression = for binary class problems
cross entropy = loss function for softmax
'The softmax classifier is a linear classifier that uses the cross-entropy loss function.  In other words, the gradient of the above function tells a softmax  classifier how exactly to update its weights using something like gradient descent.'
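A minimal numeric sketch of how these terms fit together (the scores and the "correct" class are made up for illustration; runs under Python 2 or 3):

import math

# hypothetical scores for 3 classes, e.g. LOCATION / PERSON / OTHER
scores = [2.0, 1.0, 0.1]
correct_class = 0

# softmax: exponentiate and normalize so the probabilities sum to 1
exp_scores = [math.exp(s) for s in scores]
total = sum(exp_scores)
probs = [e / total for e in exp_scores]

# cross-entropy loss = -log(probability of the correct class)
loss = -math.log(probs[correct_class])

print(probs)   # roughly [0.659, 0.242, 0.099]
print(loss)    # roughly 0.417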

example from lecture 3
problem: Named Entity Recognition, location detection
sample sentence: museums in paris are amazing
window size = 2
center word = "paris"

resulting vector is a column vector = an accumulation (concatenation) of 5 word (row?) vectors = a 5d-dimensional column vector (d = dimension of one word vector)
how to take derivatives of the word vectors for the next layers: the softmax is considered one layer, the word vectors could be considered as another layer
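A rough sketch of how the window vector for "paris" could be built by sticking the word vectors together (the 4-dimensional toy vectors are invented just for illustration; runs under Python 2 or 3):

# toy word vectors of dimension d = 4 (made-up numbers)
vectors = {
    'museums': [0.1, 0.2, 0.3, 0.4],
    'in':      [0.5, 0.1, 0.0, 0.2],
    'paris':   [0.9, 0.7, 0.1, 0.3],
    'are':     [0.2, 0.2, 0.6, 0.1],
    'amazing': [0.4, 0.8, 0.2, 0.5],
}

sentence = ['museums', 'in', 'paris', 'are', 'amazing']
center = 2        # index of the center word "paris"
window = 2        # two words on each side

# concatenate the 5 word vectors into one vector of size 5 * d = 20
x_window = []
for word in sentence[center - window : center + window + 1]:
    x_window = x_window + vectors[word]

print(len(x_window))  # 20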

step 1:
    define all variables 

create word vectors

then the softmax is a simple next step; not a common task on its own, but a good simple example for now
p(y|x)
class y given word vector x: we take the y'th row of our matrix W, x is the column(?)
we normalize this over all the classes, so that the sum of the probabilities is 1
(the sigmoid function works with 2 classes)
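Side note on the sigmoid remark above: with only 2 classes the softmax reduces to the logistic sigmoid, because

    exp(s1) / (exp(s1) + exp(s2)) = 1 / (1 + exp(-(s1 - s2))) = sigmoid(s1 - s2)

so logistic regression is the 2-class special case of the softmax classifier.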

as we train our softmax, we use a loss/cost/objective function (we minimize the loss/cost, or equivalently maximize the objective)
loss for softmax = cross entropy
we compute probability of word for certain class y: take y'th row of W and multiply that row with x

f(y) = the score for the y'th class (the y'th row of W multiplied with x)
C = number of classes; we compute f for all the different classes

the loss wants to maximize the probability of the correct class y given x
all this comes back in information theory
when training a softmax classifier we try to minimize the cross-entropy error
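Written out (roughly as in the lecture notes linked above), the cross entropy between the true distribution p and our predicted distribution q is

    H(p, q) = - sum over classes c of p(c) * log q(c)

and because the true distribution p is one-hot (all probability on the correct class y), this reduces to -log q(y): minimizing the cross entropy = maximizing the log probability of the correct class.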

our previous notes on this part of the course http://pad.constantvzw.org/public_pad/neural_networks_4
*slide 38
*Next step:
*softmax > classifying vectors into multiple classes (sigmoid is the version for 2 classes)
*The formula builds upon the previous results/matrix, and uses the data to classify groups of word vectors
*
*P(y|x) = give vector x, and ask what the probability is that vector x belongs to class y
*
*word vector x
*Wy: we take the y-th row of matrix W
*C amount of classes you have
*c is a specific class, a row (because the row is already a row vector, we do not transpose)
*d columns/dimensions
*normalize for all classes (all probabilities of y, notated as C)
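Written out, the formula those notes describe (W is the C x d weight matrix, W_y its y'th row, x the d-dimensional word/window vector):

    P(y|x) = exp(W_y . x) / sum_{c=1..C} exp(W_c . x)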

the window size determines how many word vectors are concatenated, and so the number of dimensions of the input vector

Softmax function scripts in Python
https://martin-thoma.com/softmax/
https://stackoverflow.com/questions/34968722/softmax-function-python



Wikipedia softmax function (expanded version of the example, with several variants that compute the same thing, or approximations of it)
following the Wikipedia example: https://en.wikipedia.org/wiki/Softmax_function

import math

print '~~'


x = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0]

# ORIGINAL CODE -----------------------------------
# x_exp = [math.exp(i) for i in x]  

# STUDY CODE -----------------------------------
x_exp = []
for i in x:
        # exp = math.exp(i)
        exp = math.e**i 
        # math.e = 2.718281828459045 # (this is an approximation) [~~] 
        # exp = 2.718281828459045**i # (this is an approximation) [~~] 
        x_exp.append(exp)
print x_exp  
# Result: [2.72, 7.39, 20.09, 54.6, 2.72, 7.39, 20.09]

print '~~'

# ORIGINAL CODE -----------------------------------
sum_x_exp = sum(x_exp)
print sum_x_exp  # Result: 114.98 

# STUDY CODE -----------------------------------
sumofall = 0
for value in x_exp:
        # use a separate name here, so the original list x is not overwritten
        sumofall = sumofall + value
print sumofall

print '~~'

# ORIGINAL CODE -----------------------------------
softmax = [round(i / sum_x_exp, 3) for i in x_exp]
print softmax  
# Result: [0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175]

# STUDY CODE -----------------------------------
# [~~] remember!!! 5/3=1
print 5/3
# [~~] remember!!! 3/5=0
print 3/5
# [~~] remember!!! 3.0/5.0=0.6
print 3.0/5.0

softmax = []
for y in x_exp:
        s = round(y / sumofall, 3)
        # round() = round(number [, ndigits])

        print 'y:', y
        print 'y/sum:', y/sumofall
        # [~~] for input 1.0, Result: 2.71(...) / 114.9(...) = 0.0236405430216
        # [~~] for input 2.0, Result: 7.38(...) / 114.9(...) = 0.0642616585105
        # [~~] for input 3.0, Result: 20.08(...) / 114.9(...) = 0.174681298596
        # [~~] for input 4.0, Result: 54.59(...) / 114.9(...) = 0.474832999744

        softmax.append(s)

print softmax

print '~~'

# ~~~~~~~~~~~~~~~~~~~~~~~~
# links:
# ~~~~~~~~~~~~~~~~~~~~~~~~
# source: https://docs.scipy.org/doc/numpy/reference/generated/numpy.exp.html
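
For comparison, a sketch of the same softmax with numpy (the numpy.exp documentation is linked above); subtracting the maximum first is a common trick against overflow and does not change the result, it is not part of the Wikipedia example:

import numpy as np

def softmax(scores):
    # shift by the maximum for numerical stability
    shifted = np.array(scores, dtype=float) - np.max(scores)
    exps = np.exp(shifted)
    # normalize so the probabilities sum to 1
    return exps / np.sum(exps)

x = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0]
print(np.round(softmax(x), 3))
# [ 0.024  0.064  0.175  0.475  0.024  0.064  0.175]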