Algolit - Machine Learning Tutorial
23rd April 2016
Marc, Piero, Hans, Gijs, Olivier, Yann, An

presentation of basic models
connection to neural networks

Problem of correlation/causation: ex. high correlation between suncream/skin cancer

Overview of a few problems
2 main themes:

Feedback by humans? 
interactive learning: machine asks queries to human

Supervised Learning
-------------------------------
- data from the real world
- create a database / table
- learn predictive model

— in his examples Y uses simple datasets, as more complex ones need a lot of work and then have no pedagogical value —

ex in which each line describes a day
model to predict whether people come to play tennis or not (temperature, rain, wind)
categorical dataset (no numerical attributes)

ex yelp: average rating predicting the rating of restaurants
ex. Iris dataset (<R)
given length/width of petal, it predicts the type of iris

-> Yann works on metagenomic data (1 million columns), learns models on the metagenome (all the bacteria in your gut, 2kg, 90% of the cells of your body), to predict whether a treatment of obesity/other diseases will work -> personalised medicine

ex. robot wants to recognize humans with his camera (2000)
records lots of photos
human labels the photo : no human / name of human
depending on colours/regions of colours: values or categories

deep learning -> many layers of analysis...

Other fields: biometry, datamining, looking for freud (fraud ? -> cool project though: looking for dr. Freud)...

Values/categories
* numerical
* categorical


2 possible Tasks

* Regression task
what we want to predict is a number
-> Linear Regression
ex weight of car based on horsepower
Weight = 0.37 + 0.2 * Horsepower
-> dots are original data // linear line is the model you create
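A minimal sketch of fitting such a line by least squares, in Python with numpy. The horsepower/weight numbers are synthetic (generated from the formula in the notes plus some noise); they are an assumption for illustration, not the dataset shown in the workshop.

```python
import numpy as np

# Synthetic data (an assumption, not the workshop dataset):
# weights generated from the line in the notes, plus a little noise.
rng = np.random.RandomState(0)
horsepower = np.linspace(50, 250, 30)
weight = 0.37 + 0.2 * horsepower + rng.normal(0, 2.0, size=horsepower.shape)

# Least-squares fit of a straight line: weight ~ intercept + slope * horsepower
slope, intercept = np.polyfit(horsepower, weight, deg=1)

def predict_weight(hp):
    """Prediction of the fitted linear model for a given horsepower."""
    return intercept + slope * hp

print("slope = %.3f, intercept = %.3f" % (slope, intercept))
```

The dots are the (horsepower, weight) pairs, the model is the pair (slope, intercept). A curve instead of a line could be tried by raising deg.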

-> Non linear Regression
because there is no straight line, you look for more precision (curve)
count number of inflection points

-> Take care how data is represented
-> nr of data we have/need
-> type of model (neural networks, trees, linear/nonlinear regression)
-> criteria that fit the model: what does 'fit' mean?
-> afterwards: assess the model


* Classification problem
iris: 2 attributes (length/width) -  2 classes (2 types of iris/blue-red)
x = attributes without class
y = class
Goal: find the best model
Formula: error rate = (1/m) * (sum for i = 1 to m of: 1 if prediction(x_i) != y_i, else 0)
number of errors = no of times the prediction doesn't fit the class
m = total number of lines
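The error rate can be sketched in a few lines of plain Python. The toy model and the four rows below are made up for illustration; only the counting logic comes from the notes.

```python
def error_rate(model, rows, labels):
    """Fraction of rows whose predicted class differs from the true class."""
    m = len(rows)
    errors = sum(1 for x, y in zip(rows, labels) if model(x) != y)
    return errors / float(m)

# Hypothetical toy model: "red" if the petal length is above 2.5, else "blue"
def toy_model(x):
    length, width = x
    return "red" if length > 2.5 else "blue"

# Made-up rows: (petal length, petal width) and their true class
rows = [(1.4, 0.2), (4.7, 1.4), (1.3, 0.2), (5.1, 1.9)]
labels = ["blue", "red", "red", "red"]

print(error_rate(toy_model, rows, labels))  # 1 error out of 4 rows -> 0.25
```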

model would separate red from blue ones
ex complex polygon -- draws a polygon in the graph, if dot falls in the polygon it's predicted to be red, otherwise blue. This polygon does not change with the introduction of new data.
The example has an error rate of 3/41 ≈ 7%. The shape is very precisely fitted to the given data and will probably make more errors when applied in reality.
-> you have iris and don't know class: look at attributes, if it falls inside the polygon -> red
-> the more data, the better
-> train errors: the 3 red dots that are not in the model / prediction errors
-> would not work very well because of overfitting

Linear model: 1 straight line - left/right
-> makes a lot of mistakes (36%) but the model is much simpler than before, therefore it fits noise less and would probably perform about the same when applied in reality


How do you learn this?
* Naive
draw lots of random models, measure the error, keep the best
-> possible to code in 5 minutes
-> bad for complex models
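The naive approach really is a few lines. A sketch for linear separators in 2 dimensions, assuming a made-up toy dataset of two blobs; the function names are hypothetical.

```python
import numpy as np

def error_rate(w, X, y):
    """Fraction of rows on the wrong side of w[0]*x1 + w[1]*x2 + w[2] = 0."""
    pred = np.where(X.dot(w[:2]) + w[2] >= 0, 1, -1)
    return np.mean(pred != y)

def random_search(X, y, n_models=1000, seed=0):
    """Draw random linear separators and keep the one with the lowest error."""
    rng = np.random.RandomState(seed)
    best_w, best_err = None, 1.1
    for _ in range(n_models):
        w = rng.uniform(-1, 1, size=3)
        err = error_rate(w, X, y)
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err

# Toy data (an assumption): two well-separated blobs of 20 points each
rng = np.random.RandomState(1)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

best_w, best_err = random_search(X, y)
print("best error rate: %.2f" % best_err)
```

With only 3 numbers per model this works; with thousands of weights the chance that a random draw is any good becomes negligible.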

* iterative improvement / small improvements
draw random model (or separator), draw 1 row / separator, make small changes, if it improves the model keep the change.

Naive model. Can apply it on any data-set.
not as inefficient as it looks.
easy to find a good separator, but finding 'the best' is impossible
sensitive to starting points
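A sketch of the iterative improvement idea, under the same assumptions as before (made-up toy data, hypothetical function names): start from a random separator, try a small random change, keep it only if the error goes down.

```python
import numpy as np

def error_rate(w, X, y):
    """Fraction of rows on the wrong side of w[0]*x1 + w[1]*x2 + w[2] = 0."""
    pred = np.where(X.dot(w[:2]) + w[2] >= 0, 1, -1)
    return np.mean(pred != y)

def hill_climb(X, y, n_steps=2000, step=0.1, seed=0):
    """Start from a random separator and keep only the small changes that help."""
    rng = np.random.RandomState(seed)
    w = rng.uniform(-1, 1, size=3)              # random starting separator
    start_err = err = error_rate(w, X, y)
    for _ in range(n_steps):
        candidate = w + rng.normal(0, step, size=3)   # small random change
        cand_err = error_rate(candidate, X, y)
        if cand_err < err:                       # keep it only if it improves
            w, err = candidate, cand_err
    return w, start_err, err

# Toy data (an assumption): two overlapping blobs of 30 points each
rng = np.random.RandomState(1)
X = np.vstack([rng.normal(-1, 1.0, (30, 2)), rng.normal(1, 1.0, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)

w, start_err, final_err = hill_climb(X, y)
print("error before: %.2f, after: %.2f" % (start_err, final_err))
```

Changing seed changes the starting separator, and the result can differ a lot from one start to another.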

cfr going down a mountain -> try all possible steps & take the one that goes lowest, step & try all possible steps again -> bottom of the mountain = best model.
-> problem: valleys. If you encounter an ascent, it doesn't mean there isn't going to be a stronger descent later.
no guarantee you'll find the lowest point (local minimum vs global minimum)
all deep learners are based on this algorithm - 99% of ML
stochastic gradient descent algorithm: another approach, which minimises a function that has only 1 single minimum, but that function is not the error itself
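The valley problem can be sketched with plain (not stochastic) gradient descent on a 1-dimensional "mountain". The function below is a made-up example with two valleys, chosen only for illustration: one around x = -1.30 (the lowest) and one around x = 1.13.

```python
def f(x):
    """Made-up landscape with two valleys."""
    return x ** 4 - 3 * x ** 2 + x

def gradient(x):
    """Derivative of f: the slope of the mountain at x."""
    return 4 * x ** 3 - 6 * x + 1

def descend(x, learning_rate=0.01, n_steps=1000):
    """Repeatedly take a small step downhill."""
    for _ in range(n_steps):
        x = x - learning_rate * gradient(x)
    return x

left = descend(-2.0)    # starts on the left slope
right = descend(2.0)    # starts on the right slope
print("start -2.0 -> x = %.2f, f(x) = %.2f" % (left, f(left)))
print("start  2.0 -> x = %.2f, f(x) = %.2f" % (right, f(right)))
```

Both runs stop at the bottom of a valley, but only the one starting on the left ends up in the lowest one.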

Perceptron algorithm
1950s
trying to model neurons in the brain
linear separator with Hebb's rule to improve itself
now understood as stochastic gradient descent algorithm

input: multiple numbers - neurons of the brain
multiplied by weights
weights are dendrites into the neuron, they carry an electrical current which is either amplified or reduced in the neuron
if the total sum (of incoming neurons * dendrites) exceeds a certain amount (threshold), the neuron will fire.
ex if sum > 0: return 1 / otherwise: return -1
In the brain the weight of a dendrite is defined by its resistance
-> we're constantly learning, changing the weight of our neurons
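The artificial neuron described above, as a small sketch in plain Python. The input values and weights are arbitrary illustration values.

```python
def neuron(inputs, weights, threshold=0.0):
    """Weighted sum of the inputs; fire (1) if it exceeds the threshold, else -1."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s > threshold else -1

print(neuron([1.0, 2.0], [3.0, -2.0]))  # 3*1 - 2*2 = -1 -> returns -1
print(neuron([2.0, 1.0], [3.0, -2.0]))  # 3*2 - 2*1 =  4 -> returns 1
```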

when output & input are correlated -> the weight of the dendrite conveying the signal will be higher
cfr Pavlov: see icecream correlates with mouth watering 
= Hebbian rule / règle de Hebb

plane x1/x2
line is perpendicular to vector
Sum = 3*x1 - 2*x2 -> vector (3, -2)
line goes through the origin
constant is missing in the equation / add 4 for example: Sum = 3*x1 - 2*x2 + 4 = extra column next to the petal dimensions that has the value 1 in each row, with its own weight
why problematic: separator that doesn't go through the origin is more powerful 
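A small sketch of why the constant helps, with two made-up points on the x1 axis: no line through the origin can put (1, 0) and (3, 0) on different sides, but with an extra column of ones a separator exists. The weights are picked by hand for illustration.

```python
import numpy as np

# Two made-up points on the x1 axis, with different classes
X = np.array([[1.0, 0.0], [3.0, 0.0]])
y = np.array([-1, 1])

# Extra column that has the value 1 in each row
Xb = np.hstack([X, np.ones((len(X), 1))])

# Hand-picked weights: the last one is the constant, Sum = 1*x1 + 0*x2 - 2
w = np.array([1.0, 0.0, -2.0])
pred = np.sign(Xb.dot(w))
print(pred)  # sums are -1 and 1, so the signs match y
```

Without the constant, the sums for the two points would be w1*1 and w1*3, which always have the same sign.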

if form is correctly classified: ok
else: change the vector: add the attributes of the misclassified example to the coefficients of the vector (or subtract them, depending on its class) -> the separator has turned to include the incorrectly classified example
= single neuron network

= basis for SVM, linear regression ... all used for supervised learning

deep neural networks: supervised network / often come with unsupervised training to give a 'preshape' to your model

you can take neurons and plug them into another sum / take output of both and plug them into a 3rd
combine 2 linear separators to make a non-linear separator
= intersection of 2 : input layer - hidden layer - output layer
-> can be extended / universal approximator / not practical / theorem
ex picture of atomium in 2 dimensions - 4 neurons = square
-> needs as many neurons as pixels
-> better to add layers, than to add no of neurons/layer
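A sketch of combining 2 linear separators into a non-linear one, in plain Python. The weights are set by hand (an assumption for illustration, nothing is learned here): two hidden neurons each test one side of a band, the output neuron fires only if both fire.

```python
def neuron(inputs, weights, constant):
    """Linear separator: 1 if the weighted sum plus the constant is > 0, else -1."""
    s = sum(w * x for w, x in zip(weights, inputs)) + constant
    return 1 if s > 0 else -1

def network(x):
    """Input layer -> hidden layer (2 neurons) -> output layer (1 neuron)."""
    h1 = neuron(x, [1.0, 0.0], -1.0)    # fires if x1 > 1
    h2 = neuron(x, [-1.0, 0.0], 3.0)    # fires if x1 < 3
    # output fires only when both hidden neurons fire (intersection)
    return neuron([h1, h2], [1.0, 1.0], -1.5)

print(network([2.0, 0.0]))  # inside the band 1 < x1 < 3 -> 1
print(network([0.0, 0.0]))  # left of the band -> -1
print(network([4.0, 0.0]))  # right of the band -> -1
```

A single straight line cannot separate "inside the band" from "outside on both sides"; the hidden layer makes it possible.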


2006 (lots of data / structured data input / graphic card / 10 to 40 convolutional layers): deep neural networks: the neuron returns 0 below the threshold & the sum itself above it
if we do 1/-1 we lose a lot of information -> gives a flat landscape, you don't know where the improvements are / you can't learn these neural networks -> from the 80s till 2000: smoothed 1/-1 (hyperbolic tangent/sigmoid function) because it is differentiable (allows you to use stochastic gradient descent)
rectified linear unit (ReLU) https://en.wikipedia.org/wiki/Rectifier_%28neural_networks%29

transfer function

why is the rectified linear unit replacing the hyperbolic tangent? Because it's easier to compute than the tangent / exponents.
if s > 0: return s, otherwise return 0
in practice we never reach the global optimum, only a local optimum, but it works -> very little insight still
each point is configuration of weights with error rate
if you search too far, you might end up with a model fitting noise
very fast to compute (sine/cosine are too slow in comparison)
— Modern neural networks are not any longer fitted on biological neural network as it's too complex —
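The three transfer functions mentioned above, side by side, as a sketch in plain Python (the function names are my own):

```python
import math

def step(s):
    """Hard 1/-1 threshold: flat everywhere, so it gives no hint where to improve."""
    return 1.0 if s > 0 else -1.0

def smooth_step(s):
    """Hyperbolic tangent: a smoothed, differentiable version of 1/-1."""
    return math.tanh(s)

def rectifier(s):
    """Rectified linear unit: 0 below zero, the sum itself above."""
    return s if s > 0 else 0.0

for s in (-2.0, -0.5, 0.5, 2.0):
    print(s, step(s), round(smooth_step(s), 3), rectifier(s))
```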

convolutional layers ( convolutional neural network ): applying filter to signal
1 neuron on each pixel + its neighbours
weights of each neuron at first layer (recognizing borders), weight the same on all neurons, doesn't depend on where the neurons are
similar to what retina does: identifying shapes
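A sketch of such a filter with numpy: the same 3x3 weights are applied on every pixel and its neighbours. The tiny image and the hand-made vertical-edge filter are assumptions for illustration; in a real network the weights are learned. (Strictly speaking this computes a cross-correlation, which is what is usually called convolution in neural networks.)

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image; same weights at every position."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Made-up image: dark on the left (0), bright on the right (1)
image = np.zeros((5, 6))
image[:, 3:] = 1.0

# Hand-made filter that reacts to a dark-to-bright border from left to right
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

response = convolve2d(image, kernel)
print(response)  # non-zero only where the window covers the border
```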

Neural networks only work with a lot of data, it has to be structured.

For example in processing speech recognition, in the first step it might not matter what the position of the data is in the stream.

Evaluation methods
training error is low - but because there is not enough data!
cfr graph with training/test error
difference can be big if you have little data
-> the more parameters you have, the more data you need
rule of thumb: far fewer parameters (columns) than rows
-> drop columns, select meaningful features!

test error < train error + square root of (no of parameters / no of examples) (rough bound)
you have to test, it depends on your data
ex 1000 columns that are extremely correlated == 1 single column + noise
 ex Alpha Go: retrain model 1x in 6 months / based on imitation of game style & unclear evaluation
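A sketch of measuring training error and test error separately with scikit-learn, on its "digits" dataset. The 70/30 split, random_state and max_iter values are arbitrary choices of mine, not something from the workshop.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()

# Keep 30% of the rows aside: the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)

# score() returns the accuracy, so the error is 1 - score
train_error = 1.0 - clf.score(X_train, y_train)
test_error = 1.0 - clf.score(X_test, y_test)
print("train error: %.3f, test error: %.3f" % (train_error, test_error))
```

Shrinking the training part (for example test_size=0.9) is a quick way to see the gap between the two errors change.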

http://googleresearch.blogspot.be/2015/06/inceptionism-going-deeper-into-neural.html
 first attempt to produce image
 gradient descent on images, not on model (model is fixed)
 it sees the world differently than us (not 4 legs on a table, but eyes in the wood...)
 cats/dogs majority of images on the internet :-)

 Google algorithms
 -> to train it you need power, but not to use it (now they are trained)
 pattern recognition is good, but they will not kill the rest of ML
 cfr Yann LeCun - Collège de France - lectures online

 SVM: non-linear classifier - transform non-linear to linear : does linear separator in another 'bent' space
 ex parabola to line
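The parabola example can be sketched with numpy. This only shows the idea of the transformed space with an explicit, hand-picked mapping x -> (x, x^2) and a hand-picked line; a real SVM learns the separator and usually does the transformation implicitly through a kernel. The data is made up.

```python
import numpy as np

# Made-up 1-dimensional data: class 1 on both outer sides, class -1 in the middle.
# No single threshold on x separates them.
x = np.array([-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])

# Transform to 2 dimensions: the points now lie on a parabola
features = np.column_stack([x, x ** 2])

# In the new space a straight line separates them: x^2 = 1
w = np.array([0.0, 1.0])
b = -1.0
pred = np.where(features.dot(w) + b >= 0, 1, -1)
print(pred)
print((pred == y).all())
```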

---

Going practical with Python
scikit learn
statsmodels
scipy


SGDClassifier (in scikit-learn, with loss='perceptron') = same results as Perceptron
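A sketch of that comparison on the "digits" dataset. According to the scikit-learn documentation, Perceptron corresponds to SGDClassifier with loss="perceptron", a constant learning rate of 1 and no penalty; the random_state value is my own choice so that both see the data in the same order.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import Perceptron, SGDClassifier

digits = load_digits()

perceptron = Perceptron(random_state=0)
sgd = SGDClassifier(loss="perceptron", eta0=1.0, learning_rate="constant",
                    penalty=None, random_state=0)

perceptron.fit(digits.data, digits.target)
sgd.fit(digits.data, digits.target)

# Accuracy on the training data itself (not a test error)
perceptron_score = perceptron.score(digits.data, digits.target)
sgd_score = sgd.score(digits.data, digits.target)
print("Perceptron: %.3f, SGDClassifier: %.3f" % (perceptron_score, sgd_score))
```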

Main deep neural network frameworks in Python

Low-level: Theano
Low+High-level: Torch (not sure there's a Python binding), Lasagne (https://github.com/Lasagne/Lasagne), Caffe (http://caffe.berkeleyvision.org/), TensorFlow

Scikit Learn also requires numpy and scipy.

pip install scikit-learn

how to load the "digits" dataset with scikit learn
http://scikit-learn.org/stable/auto_examples/datasets/plot_digits_last_image.html
-> optical character recognition with very high precision


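from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

digits = load_digits()
clf = LogisticRegression()
clf.fit(digits.data, digits.target)
clf.score(digits.data, digits.target)  # accuracy on the training data itself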

OUTPUT:
    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
0.993322203673

-> classification task with 10 classes
one-vs-rest ('ovr'): 10 binary models, each one compares 1 class to all the other classes


# display the 64 values of the first digit of the database, and its label
digits.data[0] , digits.target[0]
output digits.data[0]
-> each number corresponds to pixel in the 8x8 box of number zero
[  0.   0.   5.  13.   9.   1.   0.   0.   0.   0.  13.  15.  10.  15.   5.
   0.   0.   3.  15.   2.   0.  11.   8.   0.   0.   4.  12.   0.   0.   8.
   8.   0.   0.   5.   8.   0.   0.   9.   8.   0.   0.   4.  11.   0.   1.
  12.   7.   0.   0.   2.  14.   5.  10.  12.   0.   0.   0.   0.   6.  13.
  10.   0.   0.   0.]

clf.coef_,clf.intercept_
recognizing digit 0
each of the 64 pixels that compose the image is multiplied by one of the 64 coefficients and summed, then the intercept is added (it plays the role of the threshold, the last one of the variables): if the result is >= 0, it is a zero

how to predict the label of 10th image
import numpy as np

x = digits.data[10]
# weighted sum of the 64 pixels with the coefficients of model 3, plus its intercept
if np.dot(clf.coef_[3], x) + clf.intercept_[3] >= 0:
    print("class of example 10 is predicted to be 3")

-> 10 models, for each model there is an intercept

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

# Test ten random numbers in the set
import random
import numpy as np

for i in range(10):
    n = random.randint(0, 1000)
    x = digits.data[n]

    # try each of the 10 one-vs-rest models on this example
    for t in range(10):
        if np.dot(clf.coef_[t], x) + clf.intercept_[t] >= 0:
            print("class of example {0} is predicted to be {1}".format(n, t))

[[  0.00000000e+00  -4.61488264e-02  -4.21745687e-02   4.18852652e-02
   -9.41324542e-02  -3.76048898e-01  -2.69221731e-01  -2.95125757e-02
   -1.27476272e-05  -1.18056989e-01  -1.16024692e-02   1.73758690e-01
    2.14618631e-01   2.81038917e-01  -4.17673514e-03  -3.25076594e-02
   -4.78277696e-03   9.41829800e-02   2.16248301e-01  -6.55830973e-02
   -3.80995268e-01   3.13868361e-01  -1.51719633e-02  -1.34085075e-02
   -2.38982361e-03   3.05036375e-02  -7.63426285e-02  -1.68203716e-01
   -6.43030746e-01   3.33508262e-02   4.03149022e-02  -1.54133606e-04
    0.00000000e+00   2.59834207e-01   1.56328486e-01  -1.36847000e-01
   -6.36512805e-01  -5.78812215e-02  -5.80958479e-02   0.00000000e+00
   -1.72569373e-03  -2.67489992e-02   2.25330916e-01  -3.03498374e-01
   -2.92844675e-01   3.23906129e-03   1.36460743e-01  -8.98661244e-05
   -1.20123721e-03  -1.32347269e-01   2.67496803e-02  -7.78937144e-02
    8.35157864e-02  -2.02847224e-02  -1.87567122e-01  -8.50602034e-02
   -2.83424174e-06  -5.65465335e-02  -2.74688532e-01   1.13840720e-01
   -3.16493957e-01  -1.24307964e-01  -1.87628035e-01  -7.39892549e-02]


text data set, newsgroups
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

CRF classifier for structured prediction (Hedge cues & their scopes)



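# CODE OF THE PERCEPTRON
# Running 10 times over the whole dataset is sufficient
# (learns to separate the digit 3 from all other digits)
import numpy as np

w = np.ones(65)                      # 64 pixel weights + 1 weight for the constant
for t in range(10):
    for x, y in zip(digits.data, digits.target):
        xb = np.append(x, [1.0])     # add the constant column (value 1)
        s = np.dot(w, xb)            # weighted sum
        pred = np.sign(s)
        truelab = 1.0 if y == 3 else -1.0
        if pred != truelab:          # misclassified: turn the separator towards the example
            w = w + truelab * xb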


Try to predict a fresh drawn number
Visualisation Gijs
-> coefficient tells you how the pixel influences the probability -> visualize for each picture the influence transforming coefficients in 0/1 and assigning a colour
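A rough sketch of that idea with numpy, not Gijs' actual visualisation: reshape the 64 coefficients of one model into the 8x8 pixel grid and turn them into 0/1. The coefficients below are placeholder values (an assumption); in practice they would be one row of clf.coef_. The text rendering with '#' and '.' stands in for assigning a colour.

```python
import numpy as np

def coef_to_grid(coef):
    """Reshape 64 coefficients to 8x8 and mark positive influence with 1, else 0."""
    grid = np.asarray(coef).reshape(8, 8)
    return (grid > 0).astype(int)

# Placeholder coefficients: negative for the top half of the image, positive below
fake_coef = np.linspace(-1.0, 1.0, 64)

grid = coef_to_grid(fake_coef)
for row in grid:
    print("".join("#" if value else "." for value in row))
```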