Previous sessions with all resources (in nr 1)
http://pad.constantvzw.org/public_pad/neural_networks_5
http://pad.constantvzw.org/public_pad/neural_networks_4
http://pad.constantvzw.org/public_pad/neural_networks_3
http://pad.constantvzw.org/public_pad/neural_networks_2
http://pad.constantvzw.org/public_pad/neural_networks_1
http://pad.constantvzw.org/public_pad/neural_networks_algolit_extensions
http://pad.constantvzw.org/public_pad/neural_networks_small_dict
Introduction by Manetta, good for contextualizing neural networks:
http://orithalpern.net/
In her book Beautiful Data, Orit Halpern describes the history of neural networks in the Cold War years after WWII. She specifically describes how Warren McCulloch developed his ideas of neural nets in the context of the cybernetic Macy conferences in New York at the time. McCulloch had a psychology background with a materialistic understanding of his field: in his opinion, psychological problems could be solved with experimental medication and chemicals, which would correct broken connections in the brain.
In the third chapter of her book, called "Rationalizing, Cognition, Time, and Logic in the Social and Behavioral Sciences", Orit Halpern points out how McCulloch linked rationality with
incomplete reasoning. McCulloch proposed to accept our partial and incomplete form of perception: "Our knowledge of the world, including ourselves, is incomplete as to space and indefinite as to time". He connected this incompleteness to the psychological state of psychosis, as a metaphorical state of losing contact with the world. A person in a psychosis does not act reasonably, he states, but rather logically and mechanically (I hope to find examples of this in the later part of the chapter). From this comparison to a psychological state, he proposed to create mechanisms that do not mimic what the mind is, but what it does. He called this an "experimental epistemology", a mode of perception based on invariant and changing structures.
For McCulloch "rationality was not reasonable, had no relationship with consciousness, and demanded different concepts of systems, markets, and agents".
https://en.wikipedia.org/wiki/Warren_Sturgis_McCulloch
Connections:
- in mathematics, there is the notion of real numbers: the idea that there are infinitely many numbers, with no discreteness anymore. But in a computer everything needs to be discrete, brought back to 0's and 1's. Chaos theory comes from there, from the rounding errors of those numbers.
- Another interesting book to read is by Katherine Hayles, also writing about cybernetics: How We Became Posthuman: Virtual Bodies in Cybernetics, Literature, and Informatics (1999). (available at http://gen.lib.rus.ec/ )
- The connection with the outside is almost secondary. People with psychosis act like a crashing computer, but for them there is a logic in another dimension, their own one... internal functioning, loose connection to the outside.
- The first generation of cybernetics looked at feedback loops. The second generation focused more on cognition and perception, in relation to the outside world.
- EU law: from summer 2018 onwards, companies will be forced to explain how their algorithms have made a decision. It is unfortunately weak and vague. By forcing people to explain exactly what an algorithm does, you throw away much of the approach of just throwing a lot of data at the computer, letting it calculate an outcome, and testing multiple outcomes as well.
- "Semantics derived automatically from language corpora contain human-like biases", an interesting article published recently, about human biases that can be retraced in a Machine Learning system. Link: http://science.sciencemag.org/content/356/6334/183.full
- f.ex. discriminatory systems: a piece of software running next to your system that checks it, but the starting situation can already be different (f.ex. recidivism in prison: more black people than white people go to prison to begin with, so an algorithm checking discrimination in recidivism starts out with a bias)
Course 5
Slides: https://cs224d.stanford.edu/lectures/CS224d-Lecture5.pdf
Video: https://www.youtube.com/watch?v=bjDbNbSbwY4&list=PLcGUo322oqu9n4i0X3cRJgKyVy7OkDdoi&index=5
This will be the hardest lecture! Richard's advice: look at the lecture notes for help: http://web.stanford.edu/class/cs224d/syllabus.html
He runs through different steps necessary for a model:
Defining the Metric of your model
- with biased datasets, you use F1 scores instead of accuracy scores (see the small sketch after this list)
- summarization algorithm: compare n-grams
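As a minimal illustration (not from the lecture; made-up numbers, using scikit-learn): on a skewed dataset a do-nothing classifier can get a high accuracy, while F1 exposes it.

from sklearn.metrics import accuracy_score, f1_score

# toy skewed dataset: 90 negative, 10 positive labels
y_true = [0] * 90 + [1] * 10
# a lazy classifier that always predicts the majority class
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))   # 0.9, looks good
print(f1_score(y_true, y_pred))         # 0.0, reveals the classifier is useless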
Be close to your data! If you're looking at how your data has changed over time, respond to that in your distribution of train and test data.
Train/Dev/Test: split it over time, to simulate a real-world project
what size of a dataset?
- for binary classification: start at 1000 examples per class, around 10000 samples if the distribution is not skewed
- summarization: gigabytes
- machine translation: gigabytes
Baseline
-> get an idea of what the baseline is getting wrong
if the task is interesting, the baseline should not be very high
if the score of the baseline is too high, the task might be too easy, and there is no need for a deep learning model
the baseline should be tuned the same way as your model, e.g. logistic regression on unigrams/bigrams (a small sketch follows below)
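A minimal sketch of such a baseline, assuming scikit-learn; the toy texts and labels below are made up, and in practice you would fit on the train set and score on the dev set.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["museums in paris are amazing", "the plane was awful",
               "great collections in london", "terrible and boring"]
train_labels = [1, 0, 1, 0]

baseline = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),   # unigrams and bigrams
    LogisticRegression())
baseline.fit(train_texts, train_labels)
print(baseline.score(train_texts, train_labels))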
Visualise errors / the data / the confusion matrix (a small sketch follows below)
change parameters: window size, size of the dataset
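A small sketch of a confusion matrix for the location task, with made-up labels (scikit-learn):

from sklearn.metrics import confusion_matrix

y_true = ['LOC', 'LOC', 'O', 'O', 'O', 'LOC']
y_pred = ['LOC', 'O', 'O', 'LOC', 'O', 'LOC']
# rows = true class, columns = predicted class
print(confusion_matrix(y_true, y_pred, labels=['LOC', 'O']))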
Try different model variants
- word vector averaging model (neural bag of words model)
- fixed window neural model
- recurrent neural network ("best model for a lot of problems")
- recursive neural network
- convolutional neural network
implement models & iterate over ideas
Machine translation is in general a very heavy task, directly linked to the large datasets you need to train a model and the time it takes to train.
the larger the dataset, the slower you can iterate over ideas
Example tasks:
Interesting to see how reviews keep coming back as a source of data: they contain a lot of opinion and are connected to marketing interests, etc.
Kaggle - a platform for predictive modelling and analytics competitions (lots of cash to win, and good grades to gain)
revision of a single neural network (a 1-hidden-layer NN, i.e. a 2-layer NN)
(A more powerful window classifier)
slide 14
task: predict if the center word is a location or not.
with window size = 2
example sentence: museums in Paris are amazing
unnormalized score:
above 0 it is a location
below 0 it is not a location
linear function > z = Wx + b (this draws a straight line between data points)
non-linear function = f > this could be for example a sigmoid function (this draws a curved line between data points, so classes that cannot be separated by a straight line can still be separated; with more layers the boundaries that can be represented grow much more complex)
a = f(z), the feature vector / neural activations
x = [Xmuseums, Xin, XParis, Xare, Xamazing], the window vector: the concatenated word vectors of the words in the window
score(x) = U^T a, a weighted sum of the features in a
the 3-layer graph > each bubble is a single number
Tick comparison: a tick has a very simple way of perceiving; it can only determine whether a surface is hot/warm and whether there is movement
Our perception is influenced by the distance/size of what we see:
Hermann Haken & Juval Portugali, Information Adaptation: The Interplay Between Shannon Information and Semantic Information in Cognition
see p.22 (mixed picture Einstein - Monroe)
Finding the right filters that need to be used in order to perceive the information you are looking for in the dataset. As if you were able to switch layers of vision in your eyes on and off.
Always mention the dimensionality of each of the layers/functions
x is a 20-dimensional column vector
the hidden layer vector has 8 dimensions
so the matrix W needs to be 8x20, so that Wx + b gives an 8-dimensional vector (see the sketch below)
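A small numpy sketch of the forward pass of this window classifier with the dimensions above (random numbers, only to check the shapes; following z = Wx + b, a = f(z), score = U·a):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.random.randn(20)        # window vector: 5 words of 4 dimensions each
W = np.random.randn(8, 20)     # weight matrix of the hidden layer
b = np.random.randn(8)         # bias term
U = np.random.randn(8)         # final scoring layer

z = W.dot(x) + b               # linear step, shape (8,)
a = sigmoid(z)                 # non-linear activations, shape (8,)
score = U.dot(a)               # unnormalized score: above 0 -> location, below 0 -> not
print(score)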
slide 15
we want non-linear interactions between word vectors
a second layer is needed when you want to capture interactions between the input words, for example that a word following the word "in" is likely to be a location
always teach the algorithm with both positive & negative examples
positive examples: center = location
negative examples: center = not a location
-> replace the center word by a random word, f.ex. "museums in plane are amazing"
-> or find other sentences with no location
the max() function is used to clip negative numbers to 0
it compares the score of a positive example (s) with that of a negative/corrupted one (s_c): the objective is max(0, 1 - s + s_c); if 1 - s + s_c is lower than 0, the loss is 0
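Written out as a small sketch, with s the score of the positive window and s_c the score of the corrupted one:

def max_margin_loss(s, s_c):
    # s   = score of "museums in Paris are amazing"
    # s_c = score of "museums in plane are amazing"
    # if the positive score is at least 1 higher than the corrupted one, the loss is 0
    return max(0.0, 1.0 - s + s_c)

print(max_margin_loss(2.3, -0.5))   # 0.0, already well separated
print(max_margin_loss(0.2, 0.1))    # 0.9, still something to learn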
Backpropagation = push the error of the classifier backwards through the network, layer by layer (using the chain rule), and use it to update the weight matrices
decision boundary = the line that separates class A from class B. You want to maximise the distance between your two classes, to get the most optimal line.
SGD = stochastic gradient descent > an optimisation method: repeatedly take a small step in the steepest downhill direction of the loss, a way to find a (local) minimum
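The update itself is one line per parameter; a tiny sketch with made-up numbers:

import numpy as np

learning_rate = 0.01
theta = np.array([0.5, -0.3])          # any parameter: U, W, b or a word vector
grad = np.array([0.1, -0.2])           # its gradient for one training example
theta = theta - learning_rate * grad   # take a small step downhill
print(theta)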
slide 16
derivatives = afgeleide [NL]: the slope of your line, the rate at which it rises
https://en.wikipedia.org/wiki/Derivative
the derivative of 2x is 2
Now we compute:
derivatives of S
derivatives of Sc (corrupted s)
variables:
- U: final scoring layer
- W: weight matrix of first hidden layer
- b: bias term
- x: word vectors
W(23): the weight in row 2, column 3
i.e. how important the 3rd input is to the 2nd neuron
briefly ;-): each layer is a matrix / each row is a node - matrix calculations, chain rule, derivatives (they show which x has the most effect on the steepness of the line)
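A sketch of those derivatives for the single hidden layer, written out with numpy (the chain rule in code, not the lecture's exact notation; random numbers only to check the shapes):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.random.randn(20)
W = np.random.randn(8, 20)
b = np.random.randn(8)
U = np.random.randn(8)

z = W.dot(x) + b
a = sigmoid(z)
s = U.dot(a)                  # the score

dU = a                        # ds/dU
delta = U * a * (1 - a)       # error signal at the hidden layer (sigmoid derivative)
dW = np.outer(delta, x)       # ds/dW, shape (8, 20): W(23) gets delta[1] * x[2]
db = delta                    # ds/db
dx = W.T.dot(delta)           # ds/dx, this is what flows back into the word vectors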
Two layer neural nets and full backprop
...
exercise
step 1.
create a matrix with a uniform distribution
W =
[[0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 ]
[0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 ]
[0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 ]
[0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 ]
[0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 ]
[0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 ]
[0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 ]
[0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 ]]
because: if you start with 0 and you multiply by 0, it stays 0
this is the start of a probability function, and the sum of a probability function is 1, so the sum of each row is 1
this matrix is an intermediate layer, it is not the matrix that represents the input text
W = the weight of the influence of Xi on Zi
e.g.: the word "in" at position 2 will get a larger W, because it often contributes to a positive score for the center word being a location
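A small numpy sketch of building such a matrix (8 hidden nodes by 20 input dimensions, no zeros, every row summing to 1):

import numpy as np

W = np.full((8, 20), 0.05)   # every entry 0.05
print(W.sum(axis=1))         # each row sums to 20 * 0.05 = 1.0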
score function
We have to decide how to compute S (the score function).
One option is to take the co-occurrence counts for this.
How do we combine the co-occurrence counts with the annotation of whether a word is a location or not?
> too tricky for now, because the co-occurrence is in effect already computed in W later........
As score function we use a list of locations.
If the center word is in the word list, the score is 1. If not, the score is 0.
> NO! we should not define the score function ourselves.
> We need an annotated dataset, which gives a score of 1 when a center word is a location.
---
We stopped, because we could not figure out how to feed an annotation into the Score(x) function.
We tried to write a location recognition classifier, with the Frankenstein text as input data.
We found locations in the text by hand.
Our short script that was a start for annotating the windows:
# read the Frankenstein text and collect its unique words
f = open('input/frankenstein_gutenberg_tf.txt', 'r').read()
# print f
uniquewords = set()
words = []
for word in f.split(' '):
    # print word
    words.append(word)
uniquewords.update(words)
# print uniquewords
uniquewords = sorted(uniquewords)
# hand-made list of locations found in the text
locations = ['st petersburgh', 'holland', 'paris', 'london', 'oxford', 'edinburgh', 'britain', 'scotland', 'england', 'cumberland', 'westmorland', 'swiss', 'switzerland']
That's how far we got!
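A possible continuation, sketched afterwards (this is an assumption, not what was written during the session): label each window of 5 words with 1 if its center word appears in the hand-made locations list, 0 otherwise. The file name and the list come from the script above.

f = open('input/frankenstein_gutenberg_tf.txt', 'r').read()
words = f.lower().split()
locations = ['st petersburgh', 'holland', 'paris', 'london', 'oxford', 'edinburgh', 'britain', 'scotland', 'england', 'cumberland', 'westmorland', 'swiss', 'switzerland']

windows = []
for i in range(2, len(words) - 2):
    window = words[i-2:i+3]                    # 5 words, window size 2
    label = 1 if words[i] in locations else 0  # 1 when the center word is a location
    windows.append((window, label))            # note: multi-word names like 'st petersburgh' won't match a single token

# print the first annotated windows
for window, label in windows[:5]:
    print(label, window)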