Previous sessions with all resources (in nr 1)
http://pad.constantvzw.org/public_pad/neural_networks_5
http://pad.constantvzw.org/public_pad/neural_networks_4
http://pad.constantvzw.org/public_pad/neural_networks_3
http://pad.constantvzw.org/public_pad/neural_networks_2
http://pad.constantvzw.org/public_pad/neural_networks_1
http://pad.constantvzw.org/public_pad/neural_networks_algolit_extensions
http://pad.constantvzw.org/public_pad/neural_networks_small_dict

Introduction by Manetta, good for contextualizing neural networks:
http://orithalpern.net/
In her book Beautiful Data, Orit Halpern describes the history of neural networks in the Cold War years after WWII. She specifically describes how Warren McCulloch developed ideas of neural nets in the context of the cybernetic Macy conferences in New York at the time. McCulloch had a psychology background with a materialistic understanding of his field: in his opinion, psychological problems could be solved with experimental medication and chemicals that would correct broken connections in the brain.

In the third chapter of her book, called "Rationalizing, Cognition, Time, and Logic in the Social and Behavioral Sciences", Orit Halpern points out how McCulloch linked rationality with incomplete reasoning. McCulloch proposed to accept our partial and incomplete form of perception. "Our knowledge of the world, including ourselves, is incomplete as to space and indefinite as to time". He connected this incompleteness to the psychological state of a psychosis, as a metaphorical state of losing contact with the world. A person in a psychosis does not act reasonably, he states, but rather logically and mechanically (I hope to find examples of this later in the chapter). From this comparison to a psychological state, he proposed to create mechanisms that do not mimic what the mind is, but what it does. He called this an "experimental epistemology", a mode of perception based on invariant and changing structures. 

For McCulloch "rationality was not reasonable, had no relationship with consciousness, and demanded different concepts of systems, markets, and agents".

https://en.wikipedia.org/wiki/Warren_Sturgis_McCulloch

Connections:

Course 5
Slides: https://cs224d.stanford.edu/lectures/CS224d-Lecture5.pdf
Video: https://www.youtube.com/watch?v=bjDbNbSbwY4&list=PLcGUo322oqu9n4i0X3cRJgKyVy7OkDdoi&index=5

This will be the hardest lecture! Richard's advice: look at the lecture notes for help: http://web.stanford.edu/class/cs224d/syllabus.html

He runs through the different steps necessary for building a model:

Defining the Metric of your model
- with biased datasets, use F1 scores instead of accuracy scores (see the sketch below)
- summarization algorithm: compare n-grams
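
A minimal sketch of why this matters (our own toy numbers, assuming scikit-learn is available): on a skewed dataset a classifier that always predicts the majority class gets a high accuracy but a useless F1 score.

from sklearn.metrics import accuracy_score, f1_score

y_true = [0]*95 + [1]*5          # skewed dataset: 95 negative, 5 positive examples
y_pred = [0]*100                 # a lazy classifier that always predicts "negative"

print(accuracy_score(y_true, y_pred))   # 0.95, looks great
print(f1_score(y_true, y_pred))         # 0.0, reveals that nothing was learned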

Be close to your data! If your data changes over time, reflect that in your distribution of train and test data.
Train/Dev/Test: split it over time, to simulate a real-world project
What size of dataset do you need?
- for binary classification: start with 1000 examples per class, 10,000 samples with no skewed distribution
- summarization: GigaBytes
- machine translation: Gigabytes

Baseline
-> get an idea of what the baseline is getting wrong
if the task is interesting, the baseline score should not be very high
if the baseline score is too high, the task might be too easy and there is no use for a deep learning model
the baseline should be tuned the same way as your model, e.g. unigrams/bigrams with a logistic regression model (see the sketch below)
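
A hedged sketch of such a baseline (our own toy sentences and labels, assuming scikit-learn): unigram/bigram counts fed into a logistic regression classifier.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts  = ["museums in Paris are amazing", "the plane was delayed again",
          "we walked through London all day", "this keyboard is broken"]
labels = [1, 0, 1, 0]                      # toy annotation: 1 = contains a location

baseline = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),   # unigram and bigram counts
    LogisticRegression())
baseline.fit(texts, labels)
print(baseline.predict(["museums in Berlin are amazing"]))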


Visualise errors/data/confusion matrix
change parameters: window, size of dataset
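
A small sketch of such a confusion matrix (made-up labels, assuming scikit-learn): rows are the true classes, columns the predicted ones, so every off-diagonal cell is a type of error.

from sklearn.metrics import confusion_matrix

y_true = ['LOC', 'LOC', 'O', 'O', 'LOC', 'O']
y_pred = ['LOC', 'O',   'O', 'O', 'LOC', 'LOC']
print(confusion_matrix(y_true, y_pred, labels=['LOC', 'O']))
# rows = true class, columns = predicted class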

Try different model variants
- word vector averaging model (neural bag of words model; see the sketch after this list) 
- fixed window neural model 
- recurrent neural network ("best model for a lot of problems")
- recursive neural network
- convolutional neural network
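
A sketch of the simplest variant, the word vector averaging / neural bag of words model (our own toy vectors, assuming numpy): average the word vectors of a sentence and run a linear classifier on top of that average.

import numpy as np

word_vectors = {'museums': np.array([0.2, 0.7]),
                'in':      np.array([0.1, 0.1]),
                'paris':   np.array([0.9, 0.3])}
sentence = ['museums', 'in', 'paris']

avg = np.mean([word_vectors[w] for w in sentence], axis=0)   # one vector per sentence
weights, bias = np.array([0.5, -0.4]), 0.1                   # toy classifier parameters
score = np.dot(weights, avg) + bias
print(score)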

implement models & iterate over ideas
Machine translation is in general a very heavy task, directly linked to the large datasets you need to train a model and the time it needs to train.
The larger your dataset, the slower you can iterate over ideas.

Example tasks:
Interesting to see how reviews keep coming back as a source of data. Reviews contain a lot of opinion, are connected to marketing interests, etc.
Kaggle - a platform for predictive modelling and analytics competitions (lots of cash to win, and good grades to gain)


revision of a single neural network (a 1-hidden-layer NN, or 2-layer NN)
(A more powerful window classifier)

slide 14
task: predict if the center word is a location or not.
with window size = 2
example sentence: museums in Paris are amazing 

unnormalized score: 
    above 0 it is a location
    below 0 it is not a location

linear function > z = Wx + b (this draws a straight line between data points)
non-linear function = f > this could for example be a sigmoid function (this draws a curved line between data points; non-linear models can separate classes that no straight line can separate, and the number of regions they can carve out has been shown to grow exponentially with depth)

a = feature vector, neural activations

x = [x_museums, x_in, x_Paris, x_are, x_amazing], the word vectors of the window concatenated into one vector
score(x) = weighted sum of the features in a

the 3-layer graph > each bubble is for instance a single number

Comparison with a tick: it has a very simple way of perceiving; it can only determine whether a surface is warm and whether there is movement.
Our perception is influenced by the distance/size of what we see:
    Hermann Haken & Juval Portugali, Information Adaptation: The Interplay Between Shannon Information and Semantic Information in Cognition
    see p.22 (mixed picture Einstein - Monroe)
Finding the right filters to perceive the information in the dataset that you are looking for. As if you could switch layers of vision in your eyes on and off.

Always mention dimensionality for each of the layers/functions
x is a 20-dimensional column vector
the hidden layer is an 8-dimensional vector
so the weight matrix needs to be 8x20 to compute Wx (or 20x8 if you multiply the other way around)
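
Putting the pieces of slide 14 and these dimensions together, a sketch of the forward pass (random toy numbers, assuming numpy): x is the concatenation of the 5 word vectors in the window, z = Wx + b, a = f(z), and the score is a weighted sum of a.

import numpy as np

x = np.random.randn(20)            # 5 word vectors of 4 dimensions each, concatenated
W = np.random.randn(8, 20)         # 8 hidden units x 20 inputs
b = np.random.randn(8)

z = np.dot(W, x) + b               # linear step: z = Wx + b
a = 1.0 / (1.0 + np.exp(-z))       # non-linear step: sigmoid
U = np.random.randn(8)
score = np.dot(U, a)               # unnormalized score: above 0 -> location
print(score)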

slide 15
we want non-linear interactions between word vectors
a second layer is needed when you want to capture such interactions, for example that a word following the word "in" is likely a location
always teach the algorithm positive & negative examples
positive examples: center word = location
negative examples: center word = not a location
-> replace the center word by a random word, f.ex. "museums in plane are amazing"
-> or find other sentences with no location

the max() function sets all negative numbers to 0
it compares the score of a positive example with that of a negative one: if 1 - s + sc is lower than 0, the objective is given the value 0
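
As a toy computation (our own numbers for s and sc):

s  = 2.3                  # score of "museums in Paris are amazing" (positive example)
sc = 0.8                  # score of "museums in plane are amazing" (corrupted example)

J = max(0, 1 - s + sc)
print(J)                  # 0 here: the real window already wins by a margin of at least 1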

Backpropagation = propagating the error back through the network to update the weights of your classifier
decision boundary = the line that separates class A from class B. You want to maximize the distance between your two classes, to find the most optimal line.

SGD = stochastic gradient descent > an optimisation method: look for the steepest downhill direction of the loss and take a step down, a way to find your bottom values
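
One such downhill step, as a sketch (assuming numpy; the gradient here is made up, in a real model it comes out of backpropagation):

import numpy as np

W = np.random.randn(8, 20)         # current weights
grad_W = np.random.randn(8, 20)    # made-up gradient of the loss w.r.t. W
alpha = 0.01                       # learning rate / step size

W = W - alpha * grad_W             # move a small step against the gradient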

slide 16
derivatives = afgeleide [NL], the slope of your line, the rate at which it rises
https://en.wikipedia.org/wiki/Derivative
the derivative of 2x is 2

Now we compute: 
derivatives of S
derivatives of Sc (corrupted s)
variables: 

W(23): the weight on the 2nd row, 3rd column
how important the 3rd input is to the 2nd neuron

briefly ;-): each layer is a matrix / each row is a node - matrix calculations, chain rule, derivatives (shows which x has the most effect on the steepness of the line)

Two layer neural nets and full backprop
...

exercise

step 1. 
create a matrix with a uniform distribution

W = 
[[0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 ]
[0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 ]
[0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 ]
[0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 ]
[0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 ]
[0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 ]
[0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 ]
[0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 ]]

because: if you start with 0 and then multiply by 0, it stays 0
this is the beginning of a probability function, and the sum of a probability function is 1, so the sum of each row is 1

this matrix is an intermediate layer; it is not the representative matrix of the input text
W = the weight of the influence of Xi on Zi
e.g.: the word "in" at position 2 will get a larger W, because it often contributes to a positive score for the center word being a location
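
The same matrix built with numpy, as a sketch: 8 rows of 20 values of 0.05, so no weight starts at 0 and every row sums to 1 (20 x 0.05 = 1).

import numpy as np

W = np.full((8, 20), 0.05)
print(W.shape)          # (8, 20)
print(W.sum(axis=1))    # every row sums to 1.0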

score function
We have to decide how to compute S (the score function).
One option is to use the co-occurrence counts for this.

How do we combine the co-occurrence counts with the annotation of whether a word is a location or not?
> too difficult for now, because the co-occurrence is already computed in W later on........

As score function we use a list of locations. 
If the center word is in that word list, the score is 1. If not, the score is 0.

> NO! we should not define the score function ourselves.
> We need an annotated dataset that gives a score of 1 when a center word is a location.

---

We stopped because we could not figure out how to include an annotation in the Score(x) function. 
We tried to write a location recognition classifier, with the Frankenstein text as input data.

We found locations in the text by hand. 
Our short script, a start for annotating windows: 

f = open('input/frankenstein_gutenberg_tf.txt', 'r').read()
# print f

words = []
for word in f.split():               # split on any whitespace, not only single spaces
    # print word
    words.append(word)

uniquewords = sorted(set(words))     # alphabetically sorted list of unique words
# print uniquewords

locations = ['st petersburgh', 'holland', 'paris', 'london', 'oxford', 'edinburgh', 'britain', 'scotland', 'england', 'cumberland', 'westmorland', 'swiss', 'switzerland']
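
A possible next step, as a sketch (our own continuation, reusing the words and locations variables from the script above; punctuation is not stripped yet, so words like "Paris," would still be missed): slide a 5-word window over the text and label the center word 1 if it appears in the locations list.

windows = []
for i in range(2, len(words) - 2):
    window = words[i-2:i+3]                            # 2 words left, center, 2 words right
    label = 1 if words[i].lower() in locations else 0
    windows.append((window, label))

# print the positive examples that were found
for window, label in windows:
    if label == 1:
        print(window)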

That's as far as we got!