Previous session with all resources: http://pad.constantvzw.org/public_pad/neural_networks_1
Next session of 20-1: http://pad.constantvzw.org/public_pad/neural_networks_3

Code & small movie is here:
    www.algolit.net/neural_networks_tensorflow

Deep Learning4J
https://deeplearning4j.org/neuralnet-overview.html

History of Big Data - project by Seda
which will be presented during the CPDP (Computers, Privacy & Data Protection) conference on 25/26/27 January in Brussels. Constant will also present something at the same time (?).
an event that brings policy makers together
*why techniques are used and become a hype
*continuous programming of services
*A/B testing and optimization procedures, "continuous calculations"

Questions: 
*What is Machine Learning and what is Deep Learning? Where do they overlap? 

Plan for the morning:

Cristina & Manetta > Word2Vec script (from last session extending with print comments)
https://www.tensorflow.org/versions/r0.11/tutorials/word2vec/
an attempt for a summary:
*this example works with the skip-gram model: "skip-gram predicts source context-words from the target words" >>> (M: "you have the word and it will predict the window")
*"the skip gram model treats each context-target pair as a new observation" -> better for large datasets, whereas CBOW -> better for smaller datasets (see the sketch after this list)
*"The CBOW and skip-gram models are instead trained using a binary classification objective (logistic regression) to discriminate the real target words from imaginary (noise) words, in the same context." >>> the Skip Gram model is trained with a logistic regression to seperate real words from noise words
*"stochastic gradient descent (SGD)" >>> a way of selecting chunks from the entire dataset to optimize the calculations
*"negative sampling" >>> assigning high probability for "real words", low probability for "noise words", operating as a loss function with selected noise words and not all words from the vocabulary -> faster
*"gradient of the loss" >>> what is meant with gradient  here? >>> moving the embedding vectors around for each word, untill it sucessfully seperates the real words from the noise words
*"t-SNE dimensionality reduction technique" >>> a way to visualize a multi-dimensional vector into a 2d illustration
*"When we inspect these visualizations it becomes apparent that the vectors capture some general, and in fact quite useful, semantic information about words and their relationships to one another." >>> when looking at the results, it appears that the visualization presents semantic relations (afterwards results appear to present something that makes sense ... )

see which words are excluded
using the Frankenstein txt: the vocabulary is reduced (from the tutorial's 50000 items to 5000), so we throw away 2208 words (the text has 7208 unique words in total): quite specific words, seemingly from the Gutenberg meta-text
the most common words are at the top of the list
from there the word count is thrown away; only the word order of the frequency count is kept (the index)
which means that the frequency gap between 1 & 2 is much bigger than between 2 & 3 and onwards, but this information is not included anymore: in the index, the step between 1 and 2 is the same as between 2 and 3:
    (index number = word = frequency)
    1 = "the" = 4193
    2 = "and" = 2972
    3 = "i" = 2841
    4 = "of" = 2642

first sentence of frankenstein_gutenberg_tf.txt in index numbers (where the index numbers indicate how common the words are, 1="the" being the most common with 4193 occurrences; the least common words are mapped to 0 and excluded from the vocabulary):
[2255, 0, 298, 23, 2632, 2671, 2579, 2838, 12, 26, 1954, 33, 24, 1, 923, 4, 1709, 0, 32, 52, 0, 2, 15, 198, 52, 4718, 3762, 17, 84, 0, 19, 419, 19, 154, 49, 0, 19, 337, 1, 952, 4, 1, 2255, 3889, 0, 3891, 15, 26, 1954, 49, 0]
=
project gutenbergs frankenstein by mary wollstonecraft godwin shelley  this ebook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever you may copy it give it away or reuse it under the terms of the project gutenberg license included with this ebook or online at wwwgutenbergnet
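
As a rough sketch of how such an index can be built (assuming the same approach as the tutorial's build_dataset step, and assuming frankenstein_gutenberg_tf.txt sits in the working directory):

import collections

# keep only the 5000 most common words; everything rarer is mapped to index 0 ("UNK")
words = open("frankenstein_gutenberg_tf.txt").read().lower().split()
vocabulary_size = 5000

dictionary = {"UNK": 0}
for word, count in collections.Counter(words).most_common(vocabulary_size - 1):
    dictionary[word] = len(dictionary)      # most common word gets index 1, and so on

# the counts themselves are now thrown away: only the rank order survives
data = [dictionary.get(word, 0) for word in words]
print(data[:20])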


An & Gijs
http://neuralnetworksanddeeplearning.com/chap1.html
summary:
http://pad.constantvzw.org/public_pad/readng_on_neural_networks
Some quotes/ideas
In each hemisphere of our brain, humans have a primary visual cortex, also known as V1, containing 140 million neurons, with tens of billions of connections between them. And yet human vision involves not just V1, but an entire series of visual cortices - V2, V3, V4, and V5 - doing progressively more complex image processing.
But nearly all that work is done unconsciously. 

Two models of neurons are introduced: perceptron and sigmoid.
Perceptron fires 1 or 0, depending on whether the weighted input is above or below a threshold.
A sigmoid fires 0, 1 or any value in between, based on this equation: 1 / (1 + e^(-Σj wj xj - b))

The advantage of sigmoid nodes is that the network responds more gradually to changing weights: rather than flipping a switch, the node emits a slightly changed value.
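
A minimal sketch of the two neuron models next to each other, with made-up weights and inputs:

import math

def perceptron(x, w, b):
    # fires 1 if the weighted sum clears the threshold (folded into the bias b), else 0
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

def sigmoid_neuron(x, w, b):
    # fires any value between 0 and 1: 1 / (1 + e^(-Σj wj xj - b))
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

x = [0.7, 0.2, 0.9]
w = [0.4, -0.3, 0.8]
print(perceptron(x, w, -0.5), sigmoid_neuron(x, w, -0.5))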

Example: a network to recognize handwritten numbers. The input consists of pictures of 28 by 28 pixels, so the first layer of the network has 28x28 = 784 nodes. An in-between (hidden) layer of 15 neurons leads to a final output of 10 neurons. Then the question is raised whether a binary output wouldn't be more efficient; then 4 output nodes would suffice.
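
A sketch of that 784-15-10 layout as a forward pass (not the book's code; the weights are random, so the output is meaningless until the network is trained):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

sizes = [784, 15, 10]                       # input, hidden, output layer sizes
rng = np.random.default_rng(0)
weights = [rng.standard_normal((y, x)) for x, y in zip(sizes[:-1], sizes[1:])]
biases  = [rng.standard_normal((y, 1)) for y in sizes[1:]]

def feedforward(a):
    # a: column vector of 784 pixel intensities
    for w, b in zip(weights, biases):
        a = sigmoid(w @ a + b)
    return a                                # 10 activations, one per digit 0-9

image = rng.random((784, 1))                # stand-in for a 28x28 MNIST image
print(feedforward(image).ravel())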

Introduction of a cost function to judge the configuration of weights and biases of the network. When the output of this function is close to zero the network's performance is good.
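
For example, the quadratic cost from the chapter can be sketched like this (toy outputs, just to show that a bad network gives a value far from zero):

import numpy as np

def quadratic_cost(outputs, targets):
    # C = 1/(2n) * sum over training examples of ||y - a||^2
    n = len(outputs)
    return sum(np.sum((a - y) ** 2) for a, y in zip(outputs, targets)) / (2 * n)

targets = [np.eye(10)[3], np.eye(10)[7]]         # desired one-hot outputs for digits 3 and 7
outputs = [np.full(10, 0.1), np.full(10, 0.1)]   # what a poor, untrained network might return
print(quadratic_cost(outputs, targets))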

"Recapping, our goal in training a neural network is to find weights and biases which minimize the quadratic cost function C(w,b). This is a well-posed problem, but it's got a lot of distracting structure as currently posed - the interpretation of w and b as weights and biases, the ? function lurking in the background, the choice of network architecture, MNIST, and so on. It turns out that we can understand a tremendous amount by ignoring most of that structure, and just concentrating on the minimization aspect. So for now we're going to forget all about the specific form of the cost function, the connection to neural networks, and so on. Instead, we're going to imagine that we've simply been given a function of many variables and we want to minimize that function. We're going to develop a technique called gradient descent which can be used to solve such minimization problems. Then we'll come back to the specific function we want to minimize for neural networks."

"The lesson to take away from this is that debugging a neural network is not trivial, and, just as for ordinary programming, there is an art to it. You need to learn that art of debugging in order to get good results from neural networks. More generally, we need to develop heuristics for choosing good hyper-parameters and a good architecture. We'll discuss all these at length through the book, including how I chose the hyper-parameters above."

the example of the Cheese festival is used, where the decision whether to go to the festival is calculated. This is based on the following conditions:
*the weather > x1
*liking cheese > x2
*friends > x3
giving each condition a value (x) and a weight, you rank the importance of each input condition (a small sketch follows below)
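
A tiny sketch of that decision as a perceptron (the weights and threshold here, 6/2/2 and 5, are one possible choice in the spirit of the chapter, not fixed values):

def go_to_festival(good_weather, likes_cheese, friends_going,
                   weights=(6, 2, 2), threshold=5):
    x = (good_weather, likes_cheese, friends_going)         # x1, x2, x3 as 0/1 inputs
    weighted_sum = sum(w * xi for w, xi in zip(weights, x))
    return 1 if weighted_sum > threshold else 0             # 1 = go, 0 = stay home

print(go_to_festival(1, 0, 0))   # good weather alone is enough with these weights
print(go_to_festival(0, 1, 1))   # cheese plus friends (2 + 2 = 4) does not clear the threshold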

weights: are constantly adjusted to optimize/check the results
*does it try out every weight, from 0.01 to 0.02 to 0.03 to 0.04 to 0.05? (no: gradient descent follows the slope of the cost function rather than trying out every possible value)

notes:
*the word "impact" has a special meaning ...
*ultimate trial and error algorithm
*hidden layer is only used in the sense of ... the input layer has input neurons, the output layer has output neurons. The hidden layer does not have any input or output neurons; that is why it is called the hidden layer.


Seda's Morning Journey:
    

Introduction to Word Embeddings
http://www.slideshare.net/BhaskarMitra3/a-simple-introduction-to-word-embeddings

Visualizing Representations: Deep Learning and Human Beings
http://colah.github.io/posts/2015-01-Visualizing-Representations/

A Neural Conversational Model
End-to-end neural networks can do more than mere classification: they can be used to map complicated structures to other complicated structures. An example is the task of mapping a sequence to another sequence, which has direct applications in natural language understanding.

Human: what is the purpose of life ? 
Machine: to serve the greater good . 
Human: what is the purpose of living ? 
Machine: to live forever .                                
Human: what is the purpose of existence ? 
Machine: to find out what happens when we get to the planet earth .
Human: where are you now ?
Machine: i ’m in the middle of nowhere . 
Human: what is the purpose of dying ? 
Machine: to have a life .
Human: what is the purpose of being intelligent ?
Machine: to find out what it is .
Human: what is the purpose of emotions ?
Machine: i don ’t know
 
    https://arxiv.org/pdf/1506.05869.pdf


look at translation as a way in?

deep learning
"deep" > strictly defined technical term that says that a neural net has more than three layers, including the input and output. Which means it needs to have more than one hidden layer.
It is also connected to the developments of distributed computing. Which enables you to compute different parts of the machine learning process on different computers.
Only since 2006 it is possible to work with more than 1 layer (apart of exeptions before), thanks to the avolution to 'distributed computing'

"Each layer calculates the features of the previous layer, where each layer can work with more complicated features." > this is called a feature hierarchy, when this happens. A hierarchy of increasing complexity and abstraction.
https://deeplearning4j.org/neuralnet-overview#define 
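
A toy sketch of that idea: each layer takes the previous layer's output as its input, so later layers work with increasingly abstract features (layer sizes and weights are arbitrary; anything with more than one hidden layer counts as "deep" in the sense above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

layer_sizes = [784, 128, 64, 10]          # input, two hidden layers, output -> "deep"
rng = np.random.default_rng(1)
layers = [rng.standard_normal((n_out, n_in))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

features = rng.random(784)                # raw input, e.g. pixel intensities
for i, w in enumerate(layers, start=1):
    features = sigmoid(w @ features)      # layer i computes features of layer i-1's output
    print("layer", i, "produces", features.shape[0], "features")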




event at KU Leuven on Big Data:
    http://homes.esat.kuleuven.be/~sistawww/leuvenbigdata/event_bigdataworkshop.php

word2vec example script, extended with notes and graphic ascii elements: http://virtualprivateserver.space/~mb/files/word2vec_basic_algolit.py