Algolit - Machine Learning Tutorial
23rd April 2016
Marc, Piero, Hans, Gijs, Olivier, Yann, An

Presentation of basic models, connection to neural networks
Problem of correlation/causation: ex. high correlation between sunscreen use and skin cancer
Overview of a few problems

2 main themes:
*supervised learning
 *data that is labeled by humans, create a model to predict
 *ex websites annotated with 'commercial/not commercial' -> a model that can predict whether a website is commercial
*unsupervised learning
 *just unlabeled data where you want the algorithm to recognize structure: websites will cluster together automatically
 *reduce the dimensionality of the data
*semi-supervised learning: you label a small number of websites, the algorithm learns with the rest of the web

Feedback by humans? interactive learning: the machine asks queries to a human

Supervised Learning
-------------------------------
- data from the real world
- create a database / table
- learn a predictive model
— in his examples Yann uses simple datasets, as more complex ones need a lot of work and then have no pedagogical value —
ex: each line describes a day; a model to predict whether people come to play tennis or not (temperature, rain, wind); a categorical dataset (no numerical attributes)
ex Yelp: predicting the average rating of restaurants
ex the Iris dataset
(Yann works on metagenomic data (1 million columns): learning models on the metagenome (all the bacteria in your gut, 2 kg, 90% of the cells of your body) to predict whether a treatment of obesity/other diseases will work -> personalised medicine)
ex a robot wants to recognize humans with its camera (2000): it records lots of photos, a human labels each photo: no human / name of the human; depending on colours/regions of colours: values or categories
deep learning -> many layers of analysis...
Other fields: biometry, data mining, looking for fraud (noted as 'freud' -> cool project though: looking for Dr. Freud)...

Values/categories
* numerical
* categorical

2 possible tasks
* Regression task: what we want to predict is a number
 -> Linear Regression
    ex the weight of a car based on its horsepower: Weight = 0.37 + 0.2 * Horsepower
    -> the dots are the original data // the straight line is the model you create
 -> Non-linear Regression: because there is no straight line, you look for more precision (a curve); count the number of inflection points
 -> take care how the data is represented
 -> the amount of data we have/need
 -> the type of model (neural networks, trees, linear/non-linear regression)
 -> the criteria to fit the model: what does 'fit' mean?
 -> afterwards: assess the model
* Classification problem
 iris: 2 attributes (length/width) - 2 classes (2 types of iris / blue-red)
 x = the attributes without the class, y = the class
 Goal: find the best model
 Formula: error = the number of i in 1..m for which prediction(x_i) ≠ y_i, i.e. the number of times the prediction doesn't fit the class; m = the total number of lines; error rate = error / m (see the small sketch below)
 the model should separate the red dots from the blue ones
 ex a complex polygon -- draw a polygon in the graph; if a dot falls inside the polygon it is predicted to be red, otherwise blue. This polygon does not change with the introduction of new data. The example has an error rate of 3/41 ≈ 7%. The shape is very precisely fitted to the given data and would probably be erroneous when applied in reality.
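A minimal sketch of that error count / error rate: the toy 2-attribute dataset and the simple threshold rule are made up here, standing in for a real model.

# minimal sketch of the 0/1 classification error above
# (the toy data and the simple rule are made up, just for illustration)
import numpy as np

X = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.8], [6.7, 3.1]])  # attributes (length/width)
y = np.array([0, 0, 1, 1])                                      # classes (0 = blue, 1 = red)

def predict(x):
    # hypothetical model: predict red if the first attribute exceeds 6
    return 1 if x[0] > 6.0 else 0

errors = sum(1 for xi, yi in zip(X, y) if predict(xi) != yi)    # times the prediction doesn't fit
error_rate = errors / float(len(y))                             # like the 3/41 above
print(error_rate)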
-> you have an iris and don't know its class: look at the attributes; if it falls inside the polygon -> red
-> the more data, the better
-> train errors: the 3 red dots that are not in the model / prediction errors
-> would not work very well because of overfitting

Linear model: 1 straight line - left/right
-> makes a lot of mistakes (36%) but the model is much simpler than before, therefore it fits the noise less and would probably perform the same when applied in reality

How do you learn this?
* Naive: draw lots of random models, measure the error, keep the best
 -> possible to code in 5 minutes (see the sketch further down)
 -> bad for complex models
* Iterative improvement / small improvements: draw a random model (or separator, i.e. 1 line), make small changes; if a change improves the model, keep it.
 A naive method, but it can be applied to any dataset; not as inefficient as it looks.
 Easy to find a good separator, but 'the best' is impossible; sensitive to the starting point.
 cf. going down a mountain -> try all possible steps & find the lowest one, take that step & try all possible steps again -> the bottom of the mountain = the best model.
 -> problem: valleys. If you encounter an ascent, it doesn't mean there isn't a steeper descent later. No guarantee you'll find the lowest point (local minimum vs global minimum).
 All deep learners are based on this algorithm - 99% of ML.
 Stochastic gradient descent algorithm: another approach, which descends a criterion that has one single minimum, but that criterion is not the error itself.

Perceptron algorithm (1950s)
Trying to model neurons in the brain: a linear separator with Hebb's rule to improve itself; now understood as a stochastic gradient descent algorithm.
Input: multiple numbers multiplied by weights - as in neurons of the brain: the weights are the dendrites into the neuron, they carry an electrical current which is either amplified or reduced in the neuron.
If the total sum (of incoming neurons * dendrites) exceeds a certain amount (the threshold), the neuron fires. ex if sum > 0: return 1 / otherwise: return -1
In the brain the weight of a dendrite is defined by its resistance -> we're constantly learning, changing the weights of our neurons: when output & input are correlated, the weight of the dendrite conveying the signal becomes higher. cf. Pavlov: seeing ice cream correlates with the mouth watering = Hebbian rule / règle de Hebb.
In the x1/x2 plane the separating line is perpendicular to the weight vector: Sum = 3*x1 - 2*x2 == the vector (3, -2); this line goes through the origin.
A constant is missing in the equation (add 4 for example) = add an extra attribute, next to the petal dimensions, that has the value 1 in each row, so the constant becomes one more weight.
Why this matters: a separator that doesn't have to go through the origin is more powerful.
Update rule: if the example is correctly classified: ok; else: add the example's attributes (with the sign of its class) to the coefficients of the vector -> the separator turns so as to include the incorrectly classified example.
= a single-neuron network = the basis for SVMs, linear regression ...
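The two naive learning strategies above, sketched on a made-up 2D dataset (the data, the step size and the number of trials are arbitrary choices, not from the session):

# naive learning of a linear separator, two ways
import numpy as np

rng = np.random.RandomState(0)
X = np.r_[rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]]  # two blobs of points
y = np.r_[np.ones(20), -np.ones(20)]                             # classes +1 / -1
Xb = np.c_[X, np.ones(len(X))]                                    # extra column of 1s = the constant

def error(w):
    # number of points that fall on the wrong side of the separator w
    return np.sum(np.sign(Xb.dot(w)) != y)

# 1) naive: draw lots of random separators, measure the error, keep the best
best = min((rng.randn(3) for _ in range(1000)), key=error)
print(error(best))

# 2) iterative improvement: start from a random separator, make small changes,
#    keep a change only if it doesn't worsen the error (may get stuck in a local minimum)
w = rng.randn(3)
for _ in range(2000):
    candidate = w + 0.1 * rng.randn(3)
    if error(candidate) <= error(w):
        w = candidate
print(error(w))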
All of these are used for supervised learning.

Deep neural networks: supervised networks / they often come with unsupervised pre-training to give a 'preshape' to your model.
You can take neurons and plug them into another sum / take the output of both and plug them into a 3rd: combining 2 linear separators makes a non-linear separator = the intersection of the 2.
Input layer - hidden layer - output layer
-> can be extended / universal approximator theorem / not practical in itself
ex a picture of the Atomium in 2 dimensions - 4 neurons = a square -> needs as many neurons as pixels
-> better to add layers than to add to the number of neurons per layer
2006 (lots of data / structured data input / graphics cards / 10 to 40 convolutional layers): deep neural networks; their transfer function returns 0 below the threshold & the sum above it (the rectifier, see below).
If we use 1/-1 we lose a lot of information -> it gives a flat landscape, you don't know where the improvements are / you can't learn these neural networks that way.
-> from the 80s till 2000: a smooth 1/-1 (hyperbolic tangent / sigmoid function), because it is differentiable (allows you to use stochastic gradient descent).
Rectified linear unit: https://en.wikipedia.org/wiki/Rectifier_%28neural_networks%29
Why is the rectified linear unit replacing the hyperbolic tangent as transfer function? Because it's easier to compute than the tangent / exponentials: if s > 0, return s, otherwise return 0. Very fast (sine/cosine-like functions are too slow). (See the sketch of the transfer functions below.)
In practice we never reach the global optimum, only a local optimum, but it works -> still very little insight.
Each point is a configuration of weights with an error rate; if you search too far, you might end up with a model fitting the noise.
— Modern neural networks are no longer modelled on biological neural networks, as those are too complex —

Convolutional layers (convolutional neural network): applying a filter to the signal, 1 neuron on each pixel + its neighbours.
The weights of the neurons in the first layer (recognizing edges/borders) are the same on all neurons; they don't depend on where the neuron is.
Similar to what the retina does: identifying shapes.
Neural networks only work with a lot of data, and it has to be structured. For example in speech recognition, in the first step it might not matter where the data sits in the stream.

Evaluation methods
The training error is low - but only because there is not enough data!
cf. the graph with training/test error: the difference can be big if you have little data -> the more parameters you have, the more data you need.
Rule of thumb: far fewer parameters (columns) than rows -> drop columns, select meaningful features!
test error < train error + sqrt(nr of parameters / nr of examples)
You have to test; it depends on your data. ex 1000 columns that are extremely correlated == 1 single column + noise.
ex AlphaGo: the model is retrained once every 6 months / based on imitation of game style & unclear evaluation.
(A sketch of a train/test evaluation follows below.)

http://googleresearch.blogspot.be/2015/06/inceptionism-going-deeper-into-neural.html
A first attempt to produce images: gradient descent on the images, not on the model (the model is fixed).
It sees the world differently than we do (not 4 legs on a table, but eyes in the wood...).
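The three transfer functions mentioned above (threshold, hyperbolic tangent, rectifier), written out as a tiny numpy sketch:

# the three transfer functions discussed above
import numpy as np

def threshold(s):      # the original perceptron: 1 / -1, flat almost everywhere
    return np.where(s > 0, 1.0, -1.0)

def tanh(s):           # smooth version used from the 80s to the 2000s, differentiable
    return np.tanh(s)

def relu(s):           # rectified linear unit: if s > 0 return s, otherwise 0
    return np.maximum(0.0, s)

s = np.linspace(-3, 3, 7)
print(threshold(s))
print(tanh(s))
print(relu(s))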
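And a minimal sketch of the evaluation point above: fit on half of the data, measure on the held-out half, compare the two scores. It uses the digits dataset and LogisticRegression that come back in the practical part below; the 50/50 split and the fixed random_state are arbitrary choices, and in newer scikit-learn versions train_test_split lives in sklearn.model_selection.

# train/test evaluation: fit on one part of the data, measure the error on held-out data
from sklearn import datasets
from sklearn.cross_validation import train_test_split   # sklearn.model_selection in newer versions
from sklearn.linear_model import LogisticRegression

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, random_state=0)

clf = LogisticRegression()
clf.fit(X_train, y_train)
print(clf.score(X_train, y_train))   # train score, usually higher
print(clf.score(X_test, y_test))     # test score, closer to real-world performance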
Cats/dogs are the majority of images on the internet :-) (Google's algorithms)
-> to train them you need computing power, but not to use them (now that they are trained)
Pattern recognition is good, but it will not kill the rest of ML
cf. Yann LeCun - Collège de France - lectures online
SVM: non-linear classifier - transforms a non-linear problem into a linear one: it finds a linear separator in another, 'bended' space, ex a parabola becomes a line

---

Going practical with Python
scikit-learn, statsmodels, scipy
SGDClassifier (in scikit-learn) = same results as the Perceptron
Main deep neural network frameworks in Python:
Low-level: Theano
Low+High-level: Torch (not sure there's a Python binding), Lasagne (https://github.com/Lasagne/Lasagne), Caffe (http://caffe.berkeleyvision.org/), TensorFlow
scikit-learn also requires numpy and scipy: pip install scikit-learn

How to load the "digits" dataset with scikit-learn:
http://scikit-learn.org/stable/auto_examples/datasets/plot_digits_last_image.html
-> optical character recognition with very high precision

# training with scikit-learn:
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

digits = datasets.load_digits()
clf = LogisticRegression()
clf.fit(digits.data, digits.target)
clf.score(digits.data, digits.target)

OUTPUT:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
0.993322203673

-> a classification task with 10 classes: 10 models are trained, each comparing 1 class to all the other classes (one-vs-rest)

# display the 64 values of the first digit of the database, and its label
digits.data[0], digits.target[0]

Output of digits.data[0] -> each number corresponds to a pixel in the 8x8 box of the number zero:
[  0.   0.   5.  13.   9.   1.   0.   0.
   0.   0.  13.  15.  10.  15.   5.   0.
   0.   3.  15.   2.   0.  11.   8.   0.
   0.   4.  12.   0.   0.   8.   8.   0.
   0.   5.   8.   0.   0.   9.   8.   0.
   0.   4.  11.   0.   1.  12.   7.   0.
   0.   2.  14.   5.  10.  12.   0.   0.
   0.   0.   6.  13.  10.   0.   0.   0.]
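A small side sketch (assuming matplotlib is installed) to look at that 8x8 box as a picture instead of 64 numbers:

# show the first digit of the dataset as an 8x8 grayscale image, with its label
import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()
plt.imshow(digits.data[0].reshape(8, 8), cmap='gray_r', interpolation='nearest')
plt.title("label: %d" % digits.target[0])
plt.show()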
clf.coef_, clf.intercept_
Recognizing digit 0: each of the 64 pixels that compose the image is multiplied with one of the 64 coefficients and summed; the intercept (the threshold, the last of the variables) is added, and if the result is >= 0: it is a zero.

# how to predict the label of the 10th image
x = digits.data[10]
if np.sum(wi*xi for wi, xi in zip(clf.coef_[3], x)) + clf.intercept_[3] >= 0:
    print "class of example 10 is predicted to be 3"

-> 10 models, and for each model there is an intercept
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

# Test ten random numbers in the set
import random
for i in range(10):
    n = random.randint(0, 1000)
    x = digits.data[n]
    for t in range(0, 10):
        if np.sum(wi*xi for wi, xi in zip(clf.coef_[t], x)) + clf.intercept_[t] >= 0:
            print "class of example {0} is predicted to be {1}".format(n, t)

Output (truncated to the first row of clf.coef_, i.e. the 64 coefficients of the model for digit 0):
[[  0.00000000e+00  -4.61488264e-02  -4.21745687e-02   4.18852652e-02  -9.41324542e-02  -3.76048898e-01  -2.69221731e-01  -2.95125757e-02
   -1.27476272e-05  -1.18056989e-01  -1.16024692e-02   1.73758690e-01   2.14618631e-01   2.81038917e-01  -4.17673514e-03  -3.25076594e-02
   -4.78277696e-03   9.41829800e-02   2.16248301e-01  -6.55830973e-02  -3.80995268e-01   3.13868361e-01  -1.51719633e-02  -1.34085075e-02
   -2.38982361e-03   3.05036375e-02  -7.63426285e-02  -1.68203716e-01  -6.43030746e-01   3.33508262e-02   4.03149022e-02  -1.54133606e-04
    0.00000000e+00   2.59834207e-01   1.56328486e-01  -1.36847000e-01  -6.36512805e-01  -5.78812215e-02  -5.80958479e-02   0.00000000e+00
   -1.72569373e-03  -2.67489992e-02   2.25330916e-01  -3.03498374e-01  -2.92844675e-01   3.23906129e-03   1.36460743e-01  -8.98661244e-05
   -1.20123721e-03  -1.32347269e-01   2.67496803e-02  -7.78937144e-02   8.35157864e-02  -2.02847224e-02  -1.87567122e-01  -8.50602034e-02
   -2.83424174e-06  -5.65465335e-02  -2.74688532e-01   1.13840720e-01  -3.16493957e-01  -1.24307964e-01  -1.87628035e-01  -7.39892549e-02]

Text dataset, newsgroups: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
CRF classifier for structured prediction (Hedge cues & their scopes)

# CODE OF THE PERCEPTRON
# Running 10 times over the whole dataset is sufficient
w = np.ones(65)
for t in range(10):
    for x, y in zip(digits.data, digits.target):
        xb = np.append(x, [1.0])                    # extra constant attribute = 1 (the intercept trick above)
        s = np.sum(wi*xi for wi, xi in zip(w, xb))  # weighted sum
        pred = np.sign(s)
        truelab = 1.0 if y == 3 else -1.0           # one class (the digit 3) against all the others
        if pred != truelab:
            w = w + truelab*xb                      # Hebbian update: turn the separator towards the example

Try to predict a freshly drawn number.

Visualisation (Gijs)
-> a coefficient tells you how the pixel influences the probability
-> visualize the influence for each picture, transforming the coefficients into 0/1 and assigning a colour
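A possible sketch of that visualisation (assuming matplotlib; it re-fits the LogisticRegression from above so the snippet stands alone, and the colour map is an arbitrary choice):

# for each of the 10 digit models, show which pixels push the prediction up (1) or down (0)
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

digits = datasets.load_digits()
clf = LogisticRegression().fit(digits.data, digits.target)

for t in range(10):                                              # one model per digit class
    influence = (clf.coef_[t] > 0).astype(float).reshape(8, 8)   # transform the coefficients into 0/1
    plt.subplot(2, 5, t + 1)
    plt.imshow(influence, cmap='coolwarm', interpolation='nearest')
    plt.title(str(t))
    plt.axis('off')
plt.show()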