Previous sessions with all resources (in nr 1)
http://pad.constantvzw.org/public_pad/neural_networks_4
http://pad.constantvzw.org/public_pad/neural_networks_3
http://pad.constantvzw.org/public_pad/neural_networks_2
http://pad.constantvzw.org/public_pad/neural_networks_1
http://pad.constantvzw.org/public_pad/neural_networks_algolit_extensions
http://pad.constantvzw.org/public_pad/neural_networks_small_dict

Organize an event with outcomes of the NN course? Good response at Transmediale, people are interested in a report on the outcomes / exercises. Also we have a bit of budget to spend. If we do this course until June we could develop some elements in October / November into an installation / workshops etc.
Maison du Livre - An checks dates
http://www.lamaisondulivre.be
10-11-12 November? Together with Gijs' installation in Constant V (opening on 9th November)
15-17th December 2017? possibly... together with Constant V in Recyclart (on 14th Dec)
Preparation: week of 18th till 22nd September

New course?
http://cs224n.stanford.edu/

Questions:
what does word2vec refer to? Is it a Google-specific term? What is the general field of vector math called?
what are the names recurrent / reinforcement / convolutional neural networks referring to?
when is a neural network a neural network? or, what makes a neural network a neural network?
what is logistic regression?

Course 4
Slides: https://cs224d.stanford.edu/lectures/CS224d-Lecture4.pdf
Video: https://www.youtube.com/watch?v=bjDbNbSbwY4&list=PLcGUo322oqu9n4i0X3cRJgKyVy7OkDdoi&index=4

In the previous courses (1-3) we looked at the following: 
turn text into vectors
explore different statistical formulas to calculate probabilities (co-occurrences, analogies, classification, multi-class classification)

slide 3
A training dataset is a set of samples {x_i, y_i}.
x = words (either window words, sentence, document)
y = labels (for example: other words, class, multi-word sequences > not a single label but a sequence of labels)

slide 4
every point is a 2-dimensional vector
linear decision boundary > way to create the straight line
visualisation tool, classification with 2-layer neural network: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html
the neural network is trying to draw the line between the two classes

word2vec > turn words into a vector
neural network > statistics

Visualizations in graphs are a way to speak about multi-dimensionality: the vectors are already a reduction, in order to be able to work with them. To visualize them you reduce even more. 

Q: how are vectors translated into graphs? (added this to the Algolit Extension pad)

http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
However, there remain a number of concerns about them. One is that it can be quite challenging to understand what a neural network is really doing. If one trains it well, it achieves high quality results, but it is challenging to understand how it is doing so. If the network fails, it is hard to understand what went wrong.

http://cs.stanford.edu/people/karpathy/convnetjs//demo/classify2d.html
toy 2d classification with 2-layer neural network
The simulation below shows a toy binary problem with a few data points of class 0 (red) and 1 (green)

The formula on the bottom:
softmax = a logistic regression classifier is used to draw the line between the green & red classes
classify each word separately: for each class y we compute its probability by taking the exponent of the inner product of the y-th row of the weight matrix (see last time) with the vector x, and normalizing
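
A minimal numpy sketch of that softmax classifier (the names W and x and the toy sizes are assumptions, not from the slides):

    import numpy as np

    # softmax: turn a vector of scores into a probability distribution
    def softmax(z):
        z = z - z.max()              # subtract the max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    W = np.random.randn(2, 5)        # 2 classes (green/red), 5-dimensional input
    x = np.random.randn(5)           # one input vector

    p = softmax(W @ x)               # p[y] = exp(W[y] . x) / sum_c exp(W[c] . x)
    print(p, p.sum())                # probabilities per class, summing to 1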

slide 5
matrix notation is important for this course, it is a way to implement these calculations in code https://en.wikipedia.org/wiki/General_matrix_notation_of_a_VAR%28p%29
classification intuition -> classification notation
square error loss does not work well with NN, not clear why
-> better to use a classification error than the squared error used for linear regression, his intuition

slide 6
"we always do regularized version of any error"
"we assume we penalize any deviation from zero for all our parameters"
better have very large supervised training data set, to prevent overfitting > overfitting: https://en.wikipedia.org/wiki/Overfitting
a lot of techniques have been around for 30 years, we now can combine them
the data by itself almost regularises the model

cfr graph: important for all ML
x-axis: nr of parameters or amount of training time (things that make your model more powerful/accurate)
y-axis: error rate on test data, will go down initially but increase again later
training error (blue), testing error (red)
you are going to want to compute this graph on a train (80%) and development (10%) split, in order to find the point at which your test data (10%) goes up in error rate again
This is a way to prevent overfitting, but not the only way to prevent overfitting.
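
A minimal sketch of that 80/10/10 split in plain Python (the toy dataset and variable names are made up for the example):

    import random

    # toy dataset: 100 (x, y) samples, just to make the split runnable
    data = [(i, i % 2) for i in range(100)]

    random.shuffle(data)
    n = len(data)
    train = data[:int(0.8 * n)]               # 80%: fit the parameters
    dev   = data[int(0.8 * n):int(0.9 * n)]   # 10%: watch when the error starts going up again
    test  = data[int(0.9 * n):]               # 10%: final check, used only once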

slide 7
W.1 > taking this column from the fixed dataset W (dataset = collection of vectors)
θ = theta https://en.wikipedia.org/wiki/Theta_%28disambiguation%29
limited amount of features & trying to find the right weight for every feature

theta: "So, what is ?? The ? is set of parameters that generated (x,y). In the criminal minds example, ? is the killing preference of the serial killer, because his killing preference generated the dead victims." https://www.quora.com/What-is-theta-in-machine-learning-From-where-do-we-get-theta-to-provide-to-any-machine-learning-algorithm

So, a theta is a collection of preferences that are discovered by the model. Theta are the parameters that you use to make your prediction. 

two machine learning problems:
*classification 
*information extraction
It's important to know beforehand what problem you have, to adjust your model to the right way of problem solving

slide 8
in machine learning we want to do end-to-end learning (?) > "learn both W and word vectors x"
d = dimensionality of word vectors
V = size of vocabulary
-> so many parameters, overfitting is likely

Slide 9: graph is 2-dimensional, but model is 200-dimensional
"nothing in real life will ever be 2-dimensional"

"unsupervised word-vectors, update them in a very small dataset" (?)

cfr graph in video:
    first sentiment analysis & then classification pos/neg
word vectors trained on sentiment-analysis, a simple pos-neg task
the more blue = the more neg 
the more red = the more pos
> classification by semantics/sentiment

if the dataset is a large dataset, you don't have to go through all the tasks of the previous class (GloVe, one-hot algorithm, ...)
you can just train on the task that you have (you can train on words)
the size of the dataset depends on the task: for a simple classification problem, a few 1000 examples for each class is enough (a small amount); with a more complicated task you need much more (a few million tokens is ok)

slide 9
note on word-vector notation
word vectors = word embeddings = word vector representations
L = final matrix (lookup table), also indicated with X sometimes
hash = index of the word, used for taking a specific vector out of the matrix
e = a tall vector with many '0's and a single '1' (one-hot)
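
A minimal numpy sketch of that lookup (the toy vocabulary size, dimensionality d and word index are assumptions):

    import numpy as np

    V, d = 6, 4                      # toy vocabulary size and word-vector dimensionality
    L = np.random.randn(d, V)        # lookup table: one column per word in the vocabulary

    i = 2                            # index ("hash") of the word we want
    e = np.zeros(V); e[i] = 1.0      # one-hot vector: all zeros and a single 1

    x = L @ e                        # multiplying L by e takes out column i of L
    assert np.allclose(x, L[:, i])   # same as indexing the matrix directly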

Glove project (Stanford): https://nlp.stanford.edu/projects/glove/
has pre-trained word vectors, 2GB in zip-format
Common Crawl is a good one to work with
Q: What are pre-trained word vectors? 
*our answer: the patterns that are detected in the dataset, by using the unsupervised learning method GloVe
*Richard's answer: saying "pre-trained" is a way to say that you pre-initialized the word vectors, with unsupervised learning using word vectors and GloVe
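
A minimal sketch of loading such pre-trained vectors in Python (assumes you downloaded and unzipped one of the GloVe files, e.g. glove.6B.50d.txt; the filename is an assumption):

    import numpy as np

    # each line of the GloVe text files is: the word followed by its vector values
    embeddings = {}
    with open("glove.6B.50d.txt", encoding="utf-8") as f:   # assumed filename
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.array(parts[1:], dtype=float)

    print(embeddings["queen"][:5])   # first 5 dimensions of the vector for "queen"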

slide 10 - Window Classification
Word vectors are used as a first step to create a 'real' system, never used by themselves
you need context to avoid ambiguity
-> consider word with its neighbouring words = a good baseline for your model!
possibility to average all words in the window.

slide 11
example use: train a classifier to assign a label to the center word, concatenate vectors of the surrounding words (resulting vector is a column vector)
R^5d = a vector with 5×d dimensions: 5 word vectors of d dimensions each, because window size = 2 (two words on each side plus the center word)
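
A minimal numpy sketch of that concatenation (toy dimensionality d; the example window is assumed):

    import numpy as np

    d = 4                                                    # toy word-vector dimensionality
    window = ["museums", "in", "Paris", "are", "amazing"]    # center word "Paris", window size 2

    vectors = {w: np.random.randn(d) for w in window}        # stand-in for real word vectors

    # concatenate the 5 word vectors into one column vector in R^{5d}
    x_window = np.concatenate([vectors[w] for w in window])
    print(x_window.shape)                                    # (20,) = 5 * d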

update word vectors:
    - softmax layer on top of word vectors (could be considered 2-layers model, softmax + word vectors)
    - define all your variables: derivatives for the softmax (y-th class), target probability distribution t, f(x) is a matrix-vector product (nr rows, nr columns) + keep track of their dimensionality
    - apply the chain rule, make sure you know which variables depend on what
    - include all partial derivatives in 1 vector
    
Chain rule (to find the derivative of a function): https://nl.wikipedia.org/wiki/Kettingregel
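
A sketch of how the chain rule is applied here, assuming the softmax + cross-entropy setup from the slides (z = Wx, y-hat = softmax(z), t = target probability distribution; this is a standard result, not copied from the pad):

    % J = cross-entropy loss, z = W x, \hat{y} = \mathrm{softmax}(z), t = target distribution
    \frac{\partial J}{\partial x}
      = \frac{\partial J}{\partial z} \, \frac{\partial z}{\partial x}
      = W^{\top} (\hat{y} - t)
    % the result has the same dimensionality as x -- keep track of it!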

slide 20
always keep track of your dimensionality

slide 21
update: theta new equals theta old, minus the step size times the gradient 
theta = current word vectors
after all the matrix multiplications the result needs to have the same number of dimensions
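
A minimal numpy sketch of that update rule (the toy loss, its gradient and the step size alpha are assumptions for illustration):

    import numpy as np

    # toy loss J(theta) = ||theta||^2, whose gradient is 2 * theta (assumed, just for illustration)
    def gradient(theta):
        return 2 * theta

    theta = np.array([3.0, -2.0, 1.0])   # current parameters (e.g. current word vectors)
    alpha = 0.1                          # step size (learning rate)

    for _ in range(100):
        theta = theta - alpha * gradient(theta)   # theta_new = theta_old - step size * gradient

    print(theta)                         # close to zero, the minimum of the toy loss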

slide 22
named entity recognition: the word before the center word is 'in', which indicates a location

slide 25
note on matrix implementations
for softmax function:
    - a large matrix multiplication is a large cost (a 5000-dimensional vector x and a 5000 by 5000 dimensional W)
    - in Python please always avoid FOR loops!

slide 26
see drawing A4 by An
N = 500 = number of full windows (we don't know the window size)
d = 300 = the dimension of the matrix after dimensionality reduction
c = 5 = number of classes (pos/less pos/neg/more neg...)
perform the dot multiplication on a matrix in Python, not on a list -> goes a lot faster: the difference between 12 days & 1 day :-)
so instead of a loop, use a matrix multiplication
use %timeit as a speed test for your code!!!
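
A minimal sketch of that comparison (toy sizes; timeit from the standard library stands in for the %timeit magic):

    import numpy as np
    import timeit

    N, d, c = 500, 300, 5
    W = np.random.randn(c, d)        # classifier weights
    X = np.random.randn(d, N)        # all window vectors stacked as columns of one matrix

    def with_loop():
        return [W @ X[:, i] for i in range(N)]     # one small multiplication per window

    def with_matrix():
        return W @ X                                # one big matrix multiplication for all windows

    print(timeit.timeit(with_loop, number=100))
    print(timeit.timeit(with_matrix, number=100))  # much faster than the loop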

slide 28
Softmax = logistic regression 
will always give linear decision boundary
can be good for little data (problems might be outliers), but for more data it is limiting
neural networks can give very complex nonlinear decision boundaries
Manetta: 'So no straight lines, but powerful curves!'
They are so powerful, they will quickly overfit

"from logistic regression to neural nets"
softmax/logistic regression is a lego block :-)

slide 30
terminology for neural nets

*a single neuron = neural network unit = binary logistic regression unit: 
*inputs
*activation function
*output

"real neurons in the brain are really different and much more complex"

slide 31
h = output of one neuron when it gets as input vector x
simple inner product = a row vector times a column vector x, giving a single number
T = transpose: turns a column vector into a row vector (we looked at this last time!)
b = bias term = single number

the function is a squashing function, making sure that every number is between 0 and 1
that's a neuron!
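
A minimal numpy sketch of one such neuron (the sigmoid squashing function and the toy sizes are assumptions consistent with the slide):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))   # squashes any number into the range (0, 1)

    x = np.random.randn(3)                # input vector
    w = np.random.randn(3)                # the neuron's weights
    b = 0.5                               # bias term: a single number

    h = sigmoid(w.T @ x + b)              # h = f(w^T x + b): the output of one neuron
    print(h)                              # a single number between 0 and 1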

slide 32
input: word vectors

A neural network = running several logistic regressions at the same time

neuron 1: is this a person or not?
neuron 2: is this a named entity or not?
each neuron contains a binary classification problem.

After this set of neurons, another regression unit sits on top of their outputs (and we can define what the 'hidden' units below do to get the best result)
== multilayer logistic regression / multilayer perceptron

bias term: in order to have a bias for each of the neurons, people visualize it by putting such a +1 node here. For example: if this neuron is almost never useful, the bias number is negative, so the neuron is not effective and will never be on, whatever the input is.

How do you usually decide on how many layers you use?
The best answer: try them all! 4 or 5. In many cases 1 or 2

question: word2vec has 2 layers, the logistic ... layer and the softmax?
but technically it is a single layer: word vectors (they are linear) and a sigmoid function; it is the 'least deep' model that is called 'deep learning'
shallow neural net = a single-layer neural net

slide 37
1 proper layer = 1 linear layer + non-linearity applied to all the elements
1 neuron a1: its output is computed as the inner product of the first row of the matrix W (which defines the entire layer here) with the input, plus the bias term

Definition of single layer:
z = Wx + b
number of rows in W is the number of neurons we have in the first layer
number of columns is the dimensionality of input x
f = function (non-linearity)
f(vector) = f([z1, z2, z3]) = [f(z1), f(z2), f(z3)]
each x in this example is a single number
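
A minimal numpy sketch of that single layer (a sigmoid as the non-linearity f and the toy sizes are assumptions):

    import numpy as np

    def f(z):
        return 1.0 / (1.0 + np.exp(-z))   # applied element-wise: f([z1, z2, z3]) = [f(z1), f(z2), f(z3)]

    x = np.random.randn(4)                # input: dimensionality 4
    W = np.random.randn(3, 4)             # 3 rows = 3 neurons, 4 columns = dimensionality of x
    b = np.random.randn(3)                # one bias number per neuron

    z = W @ x + b                         # z = Wx + b
    a = f(z)                              # activations of the layer, each between 0 and 1
    print(a.shape)                        # (3,) = one output per neuron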

Why is non-linearity needed?
without it, stacking separate linear matrices will end up with yet another straight line
1 hidden neuron (simple curve) to 10 hidden neurons (complex curve & overfitting)
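
A minimal numpy demo of why: two linear layers without a non-linearity in between collapse into a single linear map (W1 and W2 are arbitrary toy matrices):

    import numpy as np

    W1 = np.random.randn(3, 4)
    W2 = np.random.randn(2, 3)
    x  = np.random.randn(4)

    two_linear_layers = W2 @ (W1 @ x)        # two "layers" without a non-linearity
    one_linear_layer  = (W2 @ W1) @ x        # ... is just one linear map with matrix W2 @ W1

    print(np.allclose(two_linear_layers, one_linear_layer))   # True: no extra expressive power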

slide 38
deep neural nets are general function approximators

slide 39
very exciting example from 2008
1 hidden layer neural network == 3-layer NN
results from the hidden layer can be used to compute a function (can be softmax or a normalizing score)

word vectors = 1 layer
Wx + b = 1 layer (or hidden layer)
score = 1 layer
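
A minimal numpy sketch of those three layers (the sizes, the sigmoid non-linearity and the names W, b, U are toy assumptions, following the structure above):

    import numpy as np

    def f(z):
        return 1.0 / (1.0 + np.exp(-z))   # non-linearity

    x = np.random.randn(20)               # layer 1: concatenated word vectors of the window
    W = np.random.randn(8, 20)            # layer 2: hidden layer, Wx + b, with 8 neurons
    b = np.random.randn(8)
    U = np.random.randn(8)                # layer 3: scoring vector on top of the hidden layer

    a = f(W @ x + b)                      # hidden layer activations
    score = U @ a                         # a single score for this window
    print(score)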