In each hemisphere of our brain, humans have a primary visual cortex, also known as V1, containing 140 million neurons, with tens of billions of connections between them. And yet human vision involves not just V1, but an entire series of visual cortices - V2, V3, V4, and V5 - doing progressively more complex image processing.
But nearly all that work is done unconsciously. 

the neural network uses the examples to automatically infer rules for recognizing handwritten digits. 
-> by increasing the number of training examples, the network can improve its accuracy.
-> using thousands or even millions or billions of training examples.

best commercial neural networks are now used by 
- banks to process cheques
- post offices to recognize addresses

handwriting recognition:
- is an excellent prototype problem for learning about neural networks in general
- great way to develop more advanced techniques, such as deep learning

along the way:
key ideas about neural networks:
- two important types of artificial neuron (the perceptron and the sigmoid neuron)
- standard learning algorithm for neural networks, known as stochastic gradient descent


---------------

PERCEPTRON
basic mathematical model developed in the 1950s and 1960s by the scientist Frank Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts. 
(today the main neuron model used is the sigmoid neuron - covered later)

takes several binary inputs, x1,x2,…, and produces a single binary output
Rosenblatt proposed a simple rule to compute the output: weights, w1,w2,…, real numbers expressing the importance of the respective inputs to the output

The neuron's output, 0 or 1, is determined by whether the weighted sum ∑j wj xj is less than or greater than some threshold value


SIMPLE EXAMPLE of prediction decision making:
Suppose the weekend is coming up, and you've heard that there's going to be a cheese festival in your city. You like cheese, and are trying to decide whether or not to go to the festival. You might make your decision by weighing up three factors:

    Is the weather good?
    Does your boyfriend or girlfriend want to accompany you?
    Is the festival near public transit? (You don't own a car). 

We can represent these three factors by corresponding binary variables x1,x2, and x3. For instance, we'd have x1=1 if the weather is good, and x1=0 if the weather is bad. Similarly, x2=1 if your boyfriend or girlfriend wants to go, and x2=0 if not. And similarly again for x3 and public transit.

Now, suppose you absolutely adore cheese, so much so that you're happy to go to the festival even if your boyfriend or girlfriend is uninterested and the festival is hard to get to. But perhaps you really loathe bad weather, and there's no way you'd go to the festival if the weather is bad. You can use perceptrons to model this kind of decision-making. One way to do this is to choose a weight w1=6
for the weather, and w2=2 and w3=2 for the other conditions. The larger value of w1 indicates that the weather matters a lot to you, much more than whether your boyfriend or girlfriend joins you, or the nearness of public transit. Finally, suppose you choose a threshold of 5 for the perceptron. With these choices, the perceptron implements the desired decision-making model, outputting 1 whenever the weather is good, and 0 whenever the weather is bad. It makes no difference to the output whether your boyfriend or girlfriend wants to go, or whether public transit is nearby.

By varying the weights and the threshold, we can get different models of decision-making. For example, suppose we instead chose a threshold of 3. Then the perceptron would decide that you should go to the festival whenever the weather was good or when both the festival was near public transit and your boyfriend or girlfriend was willing to join you. In other words, it'd be a different model of decision-making. Dropping the threshold means you're more willing to go to the festival.
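
A minimal sketch of this decision rule in Python (the function and variable names here are illustrative, not the book's code):

def perceptron_output(x, w, threshold):
    """Output 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    weighted_sum = sum(wj * xj for wj, xj in zip(w, x))
    return 1 if weighted_sum > threshold else 0

# Cheese-festival example: weather matters most (w1=6), threshold 5.
w = [6, 2, 2]
print(perceptron_output([1, 0, 0], w, threshold=5))  # good weather alone -> 1
print(perceptron_output([0, 1, 1], w, threshold=5))  # bad weather -> 0
# With the lower threshold of 3, the second case would flip to 1.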


FIRST LAYER
-> In this network, the first column of perceptrons - what we'll call the first layer of perceptrons - is making three very simple decisions, by weighing the input evidence.

SECOND LAYER
Each of those perceptrons is making a decision by weighing up the results from the first layer of decision-making.
can make a decision at a more complex and more abstract level

DRAWING NETWORK
perceptrons look like they have multiple outputs. In fact, they're still single output. The multiple output arrows are merely a useful way of indicating that the output from a perceptron is being used as the input to several other perceptrons. 

use a bias instead of the threshold: b ≡ -threshold, so the perceptron outputs 1 exactly when w·x + b > 0
the bias measures how easy it is to get the perceptron to output a 1. Or to put it in more biological terms, the bias is a measure of how easy it is to get the perceptron to fire


The computational universality of perceptrons is simultaneously reassuring and disappointing. It's reassuring because it tells us that networks of perceptrons can be as powerful as any other computing device. But it's also disappointing, because it makes it seem as though perceptrons are merely a new type of NAND gate. That's hardly big news!
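
For concreteness, a quick check of the NAND point: a perceptron with weights -2, -2 and bias 3 (one standard choice) computes NAND; a sketch:

# A perceptron with weights -2, -2 and bias 3 outputs 1 unless both inputs are 1, i.e. NAND.
def nand_perceptron(x1, x2):
    return 1 if (-2 * x1) + (-2 * x2) + 3 > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, nand_perceptron(x1, x2))  # 0 0 1 / 0 1 1 / 1 0 1 / 1 1 0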

BUT
we can devise learning algorithms which can automatically tune the weights and biases of a network of artificial neurons. This tuning happens in response to external stimuli, without direct intervention by a programmer.


-------------------

SIGMOID NEURONS
inputs to the network might be the raw pixel data from a scanned, handwritten image of a digit. And we'd like the network to learn weights and biases so that the output from the network correctly classifies the digit

ex network classifies number as 8 instead of 9:
a small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to completely flip, say from 0 to 1. That flip may then cause the behaviour of the rest of the network to completely change in some very complicated way
-> So while your "9" might now be classified correctly, the behaviour of the network on all the other images is likely to have completely changed in some hard-to-control way. 

-> overcome this problem by introducing a new type of artificial neuron called a sigmoid neuron
small changes in their weights and bias cause only a small change in their output
the sigmoid neuron has inputs, x1,x2,…. 
BUT:
instead of being just 0 or 1, these inputs can take on any value between 0 and 1
the output is not 0 or 1. Instead, it's σ(w·x + b), where σ is called the sigmoid function
((σ is sometimes called the logistic function, and this new class of neurons is often called logistic neurons))
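
Concretely, σ(z) = 1/(1 + e^(-z)); a minimal numpy sketch:

import numpy as np

def sigmoid(z):
    """The sigmoid (logistic) function: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(-10.0))  # ~0.000045: behaves like a perceptron outputting 0
print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # ~0.99995: behaves like a perceptron outputting 1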

So when z = w·x + b is very negative/positive, the behaviour of a sigmoid neuron also closely approximates a perceptron (output 0/1). 
It's only when w·x + b is of modest size that there's much deviation from the perceptron model.
The smoothness of σ means that small changes Δwj in the weights and Δb in the bias will produce a small change Δoutput in the output from the neuron: Δoutput ≈ ∑j (∂output/∂wj) Δwj + (∂output/∂b) Δb

Suppose we want the output from the network to indicate either "the input image is a 9" or "the input image is not a 9". Obviously, it'd be easiest to do this if the output was a 0 or a 1, as in a perceptron. But in practice we can set up a convention to deal with this, for example, by deciding to interpret any output of at least 0.5 as indicating a "9", and any output less than 0.5 as indicating "not a 9".

-------------------

3-LAYER FEEDFORWARD NEURAL NETWORK

- the leftmost layer: called the input layer, and the neurons within the layer are called input neurons
- the rightmost layer: called output layer, contains the output neurons, or, as in this case, a single output neuron. 
- the middle layer: called a hidden layer, since the neurons in this layer are neither inputs nor outputs

Note: Somewhat confusingly, and for historical reasons, multiple layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons. 

IN THIS BOOK: feedforward neural networks
there are no loops in the network - information is always fed forward, never fed back
VS: recurrent neural networks
other models of artificial neural networks in which feedback loops are possible
-> have neurons which fire for some limited duration of time, before becoming quiescent. That firing can stimulate other neurons, which may fire a little while later, also for a limited duration. That causes still more neurons to fire, and so over time we get a cascade of neurons firing. Loops don't cause problems in such a model, since a neuron's output only affects its input at some later time, not instantaneously.
-> much closer in spirit to how our brains work than feedforward networks



EXAMPLE
Classify images as digits from 0 to 9:
greyscale image of 28 by 28 pixels 
- input neurons: 784=28×28, with the intensities scaled appropriately between 0 and 1, a value of 0.0 representing white, a value of 1.0 representing black, and in-between values representing gradually darkening shades of grey

- output layer: normally would be a single neuron, with output values of less than 0.5 indicating "input image is not a 9", and values greater than 0.5 indicating "input image is a 9 "
-> in this case: 10 neurons
If the first neuron fires, i.e., has an output ≈ 1, then that will indicate that the network thinks the digit is a 0. If the second neuron fires then that will indicate that the network thinks the digit is a 1. And so on.
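
Reading off the answer from the 10 output neurons amounts to picking the most active one; a sketch (the activation values here are made up for illustration):

import numpy as np

# Ten output activations, one per digit 0-9 (illustrative values).
output_activations = np.array([0.02, 0.01, 0.03, 0.05, 0.01, 0.02, 0.90, 0.04, 0.01, 0.02])
predicted_digit = np.argmax(output_activations)  # index of the most active neuron
print(predicted_digit)  # 6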

- hidden layer(s): not possible to sum up the design process for the hidden layers with a few simple rules of thumb. Instead, neural networks researchers have developed many design heuristics for the hidden layers, which help people get the behaviour they want out of their nets. For example, such heuristics can be used to help determine how to trade off the number of hidden layers against the time required to train the network
-> for now: denote the number of neurons in this hidden layer by n, and we'll experiment with different values for n. The example shown illustrates a small hidden layer, containing just n=15 neurons.

the first neuron in the hidden layer detects whether or not an image like 1st quarter of zero is present.
the second, third, and fourth neurons in the hidden layer detect whether or not the other quarters are present:

-> this is all just a HEURISTIC. Nothing says that the three-layer neural network has to operate in the way I described, with the hidden neurons detecting simple component shapes

----------------------

LEARNING WITH GRADIENT DESCENT

- training data set: the MNIST data set (a modified subset of data sets collected by NIST, the US National Institute of Standards and Technology), images & their classifications
The FIRST part contains 60,000 images to be used as training data. These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees, and half of whom were high school students. The images are greyscale and 28 by 28 pixels in size.
The SECOND part of the MNIST data set is 10,000 images to be used as test data. Again, these are 28 by 28 greyscale images.
taken from a different set of 250 people than the original training data (albeit still a group split between Census Bureau employees and high school students)

- use the notation x to denote a training input. 
each training input x is a 28×28=784-dimensional vector
each entry in the vector represents the grey value for a single pixel in the image
the corresponding desired output is y=y(x), where y is a 10-dimensional vector. 
For example, if a particular training image, x, depicts a 6, then y(x)=(0,0,0,0,0,0,1,0,0,0)T is the desired output from the network
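
A small sketch of building y(x) for a given digit (the book's data loader has a similar helper; the name here is illustrative):

import numpy as np

def one_hot(digit):
    """Return the desired 10-dimensional output: 1.0 at position `digit`, 0.0 elsewhere."""
    y = np.zeros((10, 1))
    y[digit] = 1.0
    return y

print(one_hot(6).T)  # [[0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]]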

- To quantify how well we're achieving this goal we define a cost function: C(w,b) = (1/2n) ∑x ||y(x) − a||², where n is the number of training inputs and a is the network's output when x is the input
it's also sometimes known as the mean squared error or just MSE
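
A sketch of that cost in numpy, assuming a list of (network output, desired output) vector pairs (the list and function names are my own, not the book's):

import numpy as np

def quadratic_cost(results):
    """C(w,b) = (1/2n) * sum over training inputs of ||y(x) - a||^2."""
    n = len(results)
    return sum(np.linalg.norm(y - a) ** 2 for (a, y) in results) / (2.0 * n)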

- the aim of our training algorithm will be to minimize the cost C(w,b) as a function of the weights and biases; we want to find a set of weights and biases which make the cost as small as possible.  We'll do that using an algorithm known as gradient descent
The idea is to use gradient descent to find the weights wk and biases bl which minimize the cost
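
In sketch form, one gradient-descent step just moves every weight and bias a little way against the gradient (eta is the learning rate; grad_w and grad_b stand for the partial derivatives of C, however they are computed):

def gradient_descent_step(weights, biases, grad_w, grad_b, eta):
    """Update rule: w -> w - eta * dC/dw, b -> b - eta * dC/db."""
    weights = [w - eta * gw for w, gw in zip(weights, grad_w)]
    biases = [b - eta * gb for b, gb in zip(biases, grad_b)]
    return weights, biases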

STOCHASTIC GRADIENT DESCENT
An idea called stochastic gradient descent can be used to speed up learning.  The idea is to estimate the gradient ∇C by computing ∇Cx for a small sample of randomly chosen training inputs.  By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient ∇C, and this helps speed up gradient descent, and thus learning.
We'll label those random training inputs X1,X2,…,Xm, and refer to them as a mini-batch.

stochastic gradient descent works by picking out a randomly chosen mini-batch of training inputs, and training with those. Then we pick out another randomly chosen mini-batch and train with those.  And so on, until we've exhausted the training inputs, which is said to complete an epoch of training.  At that point we start over with a new training epoch.
-> We can think of stochastic gradient descent as being like political polling
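
The epoch / mini-batch bookkeeping looks roughly like this (a sketch only; update_mini_batch stands in for the per-batch gradient step):

import random

def sgd_epochs(training_data, epochs, mini_batch_size, update_mini_batch):
    """Shuffle the data each epoch, split it into mini-batches, and update on each batch."""
    for epoch in range(epochs):
        random.shuffle(training_data)
        mini_batches = [training_data[k:k + mini_batch_size]
                        for k in range(0, len(training_data), mini_batch_size)]
        for mini_batch in mini_batches:
            update_mini_batch(mini_batch)  # one stochastic-gradient step on this sample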

In neural networks the cost C is, of course, a function of many variables - all the weights and biases - and so in some sense defines a surface in a very high-dimensional space.  Some people get hung up thinking: "Hey, I have to be able to visualize all these extra dimensions".  And they may start to worry: "I can't think in four dimensions, let alone five (or five million)".  Is there some special ability they're missing, some ability that "real" supermathematicians have?  Of course, the answer is no.  Even most professional mathematicians can't visualize four dimensions especially well, if at all.  The trick they use, instead, is to develop other ways of representing what's going on.  That's exactly what we did above: we used an algebraic (rather than visual) representation of ∇C to figure out how to move so as to decrease C.  
 also see: http://mathoverflow.net/questions/25983/intuitive-crutches-for-higher-dimensional-thinking

STARTING POINT FOR THE STOCHASTIC GRADIENT DESCENT
import numpy as np

class Network(object):

    def __init__(self, sizes):
        # sizes lists the number of neurons in each layer, e.g. [2, 3, 1].
        self.num_layers = len(sizes)
        self.sizes = sizes
        # One bias vector per non-input layer, and one weight matrix per pair of
        # adjacent layers, drawn from a Gaussian with mean 0 and standard deviation 1.
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]
the list sizes contains the number of neurons in the respective layers.  So, for example, if we want to create a Network object with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the final layer, we'd do this with the code: 
net = Network([2, 3, 1])
-> GET everything on github
git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git

The biases and weights in the Network object are all initialized randomly, using the Numpy np.random.randn function to generate Gaussian distributions with mean 0 and standard deviation 1. This random initialization gives our stochastic gradient descent algorithm a place to start from.
the first layer of neurons is an input layer, and omits to set any biases for those neurons, since biases are only ever used in computing the outputs from later layers.
net.weights[1] is a Numpy matrix storing the weights connecting the second and third layers of neurons. 


a is the vector of activations of the second layer of neurons. To obtain a′ we multiply a by the weight matrix w, and add the vector b of biases. We then apply the function σ elementwise to every entry in the vector wa+b. (This is called vectorizing the function σ.)
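
As a method on the Network class above, the forward pass just repeats a′ = σ(wa + b) layer by layer; a sketch consistent with that description (it assumes the sigmoid helper defined earlier):

    def feedforward(self, a):
        """Return the network's output when a is the input activation (column) vector."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a) + b)  # a' = sigma(w a + b), applied elementwise
        return a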


the network guesses with its current weights, looks at the error (compared to the ground truth), and adjusts
similar in spirit to fitting a linear regression


the trained network (with 30 hidden neurons) gives us a classification rate of about 95 percent - 95.45 percent at its peak ("Epoch 24")!
Epoch 24: 9545 / 10000

2nd round (with 100 hidden neurons)
Epoch 28: 9641 / 10000

to obtain these accuracies I had to make specific choices for the number of epochs of training, the mini-batch size, and the learning rate, η. As I mentioned above, these are known as hyper-parameters for our neural network, in order to distinguish them from the parameters (weights and biases) learnt by our learning algorithm. If we choose our hyper-parameters poorly, we can get bad results. Suppose, for example, that we'd chosen the learning rate to be η=0.001,
the performance of the network is getting slowly better over time. That suggests increasing the learning rate, say to η=0.01. If we do that, we get better results, which suggests increasing the learning rate again. (If making a change improves things, try doing more!) If we do that several times over, we'll end up with a learning rate of something like η=1.0 (and perhaps fine tune to 3.0), which is close to our earlier experiments. So even though we initially made a poor choice of hyper-parameters, we at least got enough information to help us improve our choice of hyper-parameters.

-> debugging a neural network can be challenging. This is especially true when the initial choice of hyper-parameters produces results no better than random noise.
ex. try the successful 30 hidden neuron network architecture from earlier, but with the learning rate changed to η=100.0
-> learning rate is too high
Epoch 2: 1009 / 10000
Epoch 3: 1009 / 10000
...
Epoch 27: 982 / 10000
Epoch 28: 982 / 10000

What does this mean? A good result compared to what? cf. BASELINES


BASELINES
It's informative to have some simple (non-neural-network) baseline tests to compare against, to understand what it means to perform well.
 The simplest baseline of all, of course, is to randomly guess the digit. That'll be right about ten percent of the time. 

Less trivial baseline: look at how dark an image is, using the training data to compute average darknesses for each digit, 0,1,2,…,9
-> about 22.25% accuracy
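
A rough sketch of that darkness baseline (the function names are mine, not the book's mnist_average_darkness.py; images are assumed to be arrays of pixel intensities with matching digit labels):

import numpy as np

def average_darknesses(images, labels):
    """Average total darkness (sum of pixel intensities) per digit, from the training data."""
    return {d: np.mean([np.sum(img) for img, lab in zip(images, labels) if lab == d])
            for d in range(10)}

def guess_digit_by_darkness(image, avg):
    """Guess the digit whose average darkness is closest to this image's darkness."""
    darkness = np.sum(image)
    return min(avg, key=lambda d: abs(avg[d] - darkness))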

to get much higher accuracies it helps to use established machine learning algorithms, ex the support vector machine or SVM
mnist_svm.py
finetune parameters for SVM: http://peekaboo-vision.blogspot.be/2010/09/mnist-for-ever.html
-> get the performance up above 98.5 percent accuracy (see the SVM sketch after this block)
well-designed neural networks outperform every other technique for solving MNIST, including SVMs
-> at their best, neural nets are comparable to or arguably better than humans at this task :-)
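
The SVM baseline above, sketched with scikit-learn along the lines of mnist_svm.py (default SVC parameters; data assumed to be flattened 784-element vectors with integer labels; the tuning link above covers pushing the accuracy higher):

from sklearn import svm

def svm_baseline(training_images, training_labels, test_images, test_labels):
    """Train scikit-learn's support vector classifier and report test accuracy."""
    clf = svm.SVC()
    clf.fit(training_images, training_labels)
    predictions = clf.predict(test_images)
    correct = sum(int(p == y) for p, y in zip(predictions, test_labels))
    print("SVM baseline: %s / %s correct" % (correct, len(test_labels)))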


we don't immediately have an explanation of how the network does what it does. Can we find some way to understand the principles by which our network is classifying handwritten digits?
ex recognizing human face
Break down question into subquestions, further and further through multiple layers. Ultimately, we'll be working with sub-networks that answer questions so simple they can easily be answered at the level of single pixels. Those questions might, for example, be about the presence or absence of very simple shapes at particular points in the image. Such questions can be answered by single neurons connected to the raw pixels in the image.
-> early layers answering very simple and specific questions about the input image, and later layers building up a hierarchy of ever more complex and abstract concepts. Networks with this kind of many-layer structure - two or more hidden layers - are called deep neural networks.

-> Researchers in the 1980s and 1990s tried using stochastic gradient descent and backpropagation to train deep networks. Unfortunately, except for a few special architectures, they didn't have much luck. The networks would learn, but very slowly, and in practice often too slowly to be useful.

Since 2006, a set of techniques has been developed that enable learning in deep neural nets.
These deep learning techniques are based on stochastic gradient descent and backpropagation, but also introduce new ideas. 
-> people now routinely train networks with 5 to 10 hidden layers