.. jupyter-execute::
   :hide-code:

   import sys
   sys.path.append("../project/")
   import datasets
   from datasets_plotly import animate, plot
   s = datasets.Simple(20)

==================
ML Primer
==================

This guide is a primer on the very basics of machine learning that are
necessary to complete the assignments and motivate the final system.
Machine learning (ML) is a rich and well-developed field with many
different models, goals, and learning settings. There are many great
texts that cover all the aspects of the area in detail. (I recommend
this `textbook`_.) This guide is not that. Our goal is to explain the
minimal details of *one* dataset with *one* class of model.
Specifically, this is an introduction to `supervised binary
classification` with neural networks. The goal of this section is to
learn how a basic neural network works to classify simple points.

.. image:: figs/Graphs/mlpgraph.png
   :align: center
   :width: 300px

Dataset
-------------

Supervised learning problems begin with a labeled `training` dataset.
We assume that we are given a set of labeled points. Each point has two
coordinates :math:`x_1` and :math:`x_2`, and has a label :math:`y`
corresponding to an O or X. For instance, here is one O labeled point:

.. image:: figs/Graphs/data1.png
   :align: center
   :width: 250px

Here's another:

.. image:: figs/Graphs/data2.png
   :align: center
   :width: 250px

And here is an X labeled point:

.. image:: figs/Graphs/data3.png
   :align: center
   :width: 250px

It is often convenient to plot all of the points together on one set of
axes.

.. image:: figs/Graphs/data@3x.png
   :align: center
   :width: 250px

Here we can see that all the O points are in the top-right and all the
X points are in the bottom-left. Not all datasets are this simple, and
here is another dataset where the points are split up a bit more.

.. image:: figs/Graphs/split.png
   :align: center
   :width: 250px

Later in the class, we will consider datasets of different forms, e.g.
a dataset of handwritten numbers, where some are 8's and others are
2's:

.. image:: figs/Conv/im1.png
   :align: center
   :width: 200px

.. image:: figs/Conv/im2.png
   :align: center
   :width: 200px

Here is an example of what this dataset looks like.

.. image:: figs/Conv/mnist.png
   :align: center
   :width: 200px

Model
-------

In addition to a dataset, our ML system needs to specify a model type
that we want to `fit` to the data. A `model` is a function that assigns
labels to data points. In 2D, we can visualize a model by its decision
boundary. For instance, consider the following (Model A).

.. image:: figs/Graphs/model1.png
   :align: center
   :width: 250px

For most of the data points, the model puts them in class X. Only for a
small area in the top right would it decide to put those points in
class O. We can overlay the simple dataset described earlier on this
model. This tells us roughly how well the model fits this dataset.

.. image:: figs/Graphs/incorrect.png
   :align: center
   :width: 250px

Models can take many different forms. Here is another model, which we
will discuss more below, that splits the data points up based on three
regions (Model B).

.. image:: figs/Graphs/model2.png
   :align: center
   :width: 250px

Models may also have strange shapes and even disconnected regions. Any
blue/red split will do, for instance (Model C):

.. image:: figs/Graphs/model3.png
   :align: center
   :width: 250px
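In code, a model is simply a function from a point to a label. As a
rough sketch, a model with two disconnected regions in the spirit of
Model C might look like the following (the function ``model_c`` and its
cutoffs are invented here for illustration, not read off the figure):

.. jupyter-execute::

   def model_c(x):
       # Two disconnected corner regions get label 1;
       # everything else gets label 0. The cutoffs are arbitrary.
       if x[0] < 0.2 and x[1] < 0.2:
           return 1
       if x[0] > 0.8 and x[1] > 0.8:
           return 1
       return 0

   print(model_c((0.1, 0.1)), model_c((0.5, 0.5)), model_c((0.9, 0.9)))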
A `model class` specifies the general shape of models that you want to
explore. Given that we as programmers don't know what the dataset looks
like, we try to give a class of functions for our system to explore.
Machine learning is the process of finding the best model from that
class.

The first model class we consider is `linear models`. Linear models
separate the data space with only a single straight line. For instance,
Model A is a linear model, but an intuitively "better" model looks like
this:

.. image:: figs/Graphs/sector2@3x.png
   :align: center
   :width: 250px

Note that Model B also uses lines, but it is not a linear model: it
uses multiple lines to split up the space.

Let's look at an example of some models. Here is some randomly
generated data.

.. jupyter-execute::
   :hide-code:

   plot(s)

.. jupyter-execute::

   def model1(x):
       if x[0] < 0.25:
           return 1
       else:
           return 0

   plot(s, model1)

.. jupyter-execute::

   def model2(x):
       if x[0] < 0.25:
           return 1
       if x[0] > 0.5 and x[1] > 0.6:
           return 1
       else:
           return 0

   plot(s, model2)

Parameters
-----------

Once we have decided on our model class, we need a way to move between
models in that class. Ideally, we would have internal knobs that alter
the properties of the model.

.. jupyter-execute::

   def make_model(param):
       def model(x):
           if x[0] < param:
               return 1
           else:
               return 0
       return model

   plot(s, make_model(0.4))

.. jupyter-execute::

   plot(s, make_model(0.6))

In the case of linear models, there are two main knobs we might use:

a. rotating the linear separator ("slope")

.. image:: figs/Graphs/weight.png
   :align: center
   :width: 400px

b. changing the separator cutoff ("intercept")

.. image:: figs/Graphs/bias.png
   :align: center
   :width: 400px

`Parameters` are the set of numerical values that fully define a
model's decisions. Parameters are critical for storing how a model
acts, and necessary for producing its decision on a given data point.
In the case of linear models and binary classification, we can write
down the linear model as:

.. math::

   m(x_1, x_2; w_1, w_2, b) = x_1 \times w_1 + x_2 \times w_2 + b

Here :math:`w_1, w_2, b` are parameters, :math:`x_1, x_2` are the input
point, and the model predicts X if :math:`m` is greater than 0 and O
otherwise. The semicolon notation indicates which arguments are
parameters and which are data.

.. jupyter-execute::

   def make_linear(w1, w2, b):
       def model(x):
           return 1 if (x[0] * w1 + x[1] * w2 + b > 0.0) else 0
       return model

   biases = [-0.098 + (i / 100.0) for i in range(25)]
   animate(s, [make_linear(0.1, -0.2, b) for b in biases], biases)

.. note::
   See https://wikipedia.org/wiki/Linear_equation for a review of
   linear equations, and an explanation for why this corresponds to
   parameterizing the slope and intercept of a line.

Loss
-------

When we look at our data, we can clearly see that some models are good
and make no classification errors:

.. image:: figs/Graphs/sector2.png
   :align: center
   :width: 250px

And some are bad and make multiple errors:

.. image:: figs/Graphs/incorrect.png
   :align: center
   :width: 250px

In order to find a good model, we need to first define what good means.
We do this through a `loss` function that scores how badly we are
currently doing. A good model is one that makes this loss as small as
possible. Our loss function will be based on the distance and direction
from each point to the decision boundary. You can show that this
distance is equivalent to the absolute value of the function
:math:`m()` above.

.. image:: figs/Graphs/distance.png
   :align: center
   :width: 250px
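To see this concretely, here is a small sketch that evaluates :math:`m`
for a few points (the weights below pick the line :math:`x_1 + x_2 = 1`
and the points are made up for illustration): the sign tells us which
side of the line a point is on, and the magnitude grows with its
distance from the line.

.. jupyter-execute::

   def m(x, w1, w2, b):
       # The raw linear score: sign gives the predicted side,
       # magnitude grows with distance from the boundary.
       return x[0] * w1 + x[1] * w2 + b

   # Hypothetical points, moving toward the line x1 + x2 = 1
   # and then crossing to the other side.
   for pt in [(0.9, 0.9), (0.55, 0.5), (0.1, 0.1)]:
       print(pt, m(pt, 1.0, 1.0, -1.0))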
For simplicity, let us consider a single point with different models.
This point might be classified on the correct side and very far from
the line (Point A, "great"):

.. image:: figs/Graphs/pt3.png
   :align: center
   :width: 250px

Or it might be on the correct side of the line, but close to the line
(Point B, "worrisome"):

.. image:: figs/Graphs/pt1.png
   :align: center
   :width: 250px

Or this point might be classified on the wrong side of the line
(Point C, "bad"):

.. image:: figs/Graphs/pt2.png
   :align: center
   :width: 250px

The loss is determined by a function of this distance. The most
commonly used function (and the one we will focus on) is the sigmoid
function. For strongly negative inputs, it goes to zero, and for
strongly positive inputs, it goes to one. In between, it forms a smooth
S-curve.

.. image:: figs/Graphs/sigmoid.png
   :align: center
   :width: 400px

As shown below, the losses of the three X points land on the following
positions on the sigmoid curve: almost zero for Point A, a middle value
for Point B, and nearly one for Point C.

.. image:: figs/Graphs/sigmoid2.png
   :align: center
   :width: 400px

The total loss for a model is the product of each of the individual
losses. It's easy to see that a good model yields a lower loss than a
bad one.

Fitting Parameters
--------------------

The model class tells us what models we can consider, the parameters
tell us how to specify a given model, and the loss tells us how good
our current model is. What we need is a method for finding a good model
given a loss function. We refer to this step as *parameter fitting*.

Unfortunately, parameter fitting is quite difficult. For all but the
simplest ML models, it is a challenging and computationally demanding
task. For our sample problem, there are just 3 parameters, but nowadays
some of the largest models may have billions of parameters that need to
be fit.

This is the step where libraries like MiniTorch come in handy. This
library aims to demonstrate how, with careful coding, we can set up a
framework to fit parameters for supervised classification in an
automatic and efficient manner.

The library focuses on one form of parameter fitting: `gradient
descent`. Intuitively, gradient descent works in the following manner.

1. Compute the loss function, :math:`L`, for the data with the current
   parameters.
2. See how small changes to each of the parameters would change the
   loss.
3. Update the parameters with a small change in the direction that
   locally most reduces the loss.

Let's return to the incorrect model above.

.. image:: figs/Graphs/incorrect.png
   :align: center
   :width: 250px

As we noted, this model has a high loss, and we want to consider ways
to "turn the knobs" of the parameters to find a better model. Let us
focus on the parameter controlling the intercept of the model.

.. image:: figs/Graphs/bias.png
   :align: center
   :width: 300px

We can consider how the loss changes with respect to varying just this
parameter. It seems like the loss will go down if we lower the
intercept a bit.

.. image:: figs/Graphs/move.png
   :align: center
   :width: 400px

Doing this leads to a better model.

.. image:: figs/Graphs/incorrect3.png
   :align: center
   :width: 250px

We can repeat this process for the slope as well as for all the other
parameters in the model.

But how did we know how the loss function would change if we changed
the intercept? For a small problem, we can just move it a bit and see.
But remember that machine learning models can have billions of
parameters, so this would take a ton of time.
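To make "move it a bit and see" concrete, here is a sketch of that
naive approach on a made-up loss (the quadratic ``toy_loss``, the
starting value of :math:`b`, and the nudge size are all invented for
illustration):

.. jupyter-execute::

   def toy_loss(b):
       # A stand-in loss that happens to be smallest at b = 0.3.
       return (b - 0.3) ** 2

   b = 0.5
   eps = 1e-4
   # Nudge the parameter a tiny bit and re-measure the loss.
   change = (toy_loss(b + eps) - toy_loss(b)) / eps
   print(change)

The printed change is positive, so lowering :math:`b` reduces the loss,
matching the picture above. The catch is that this costs an extra loss
evaluation for every single parameter, which is hopeless when there are
billions of them.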
A better approach is to use calculus and take the derivative of the
loss function with respect to the parameter, :math:`L'_b`. If we can
efficiently and automatically take this derivative, it tells us which
direction to change the parameter in order to reduce the loss. Even
better, if we can efficiently take a set of derivatives (known as a
`gradient`) for all the parameters, then we know which direction they
all should move. The first 4 modules in MiniTorch are dedicated to
implementing this fitting procedure efficiently.

Neural Networks
------------------

The linear model class can be used to find good fits to the data we
have considered so far, but it fails for data that splits up into
multiple segments. These datasets are not *linearly separable*.

.. image:: figs/Graphs/splitfail.png
   :align: center
   :width: 300px

An alternative model class for this style of data is a neural network.
Neural networks can be used to specify a much wider range of
separators. Intuitively, neural networks divide classification into two
or more stages. Each stage uses a linear model to reshape the data into
new points. The final stage is a linear classifier over the transformed
points.

Let's look at our dataset:

.. image:: figs/Graphs/incorrect.png
   :align: center
   :width: 300px

A neural network might first produce a separator (yellow) to pull apart
the top red points:

.. image:: figs/Graphs/split1.png
   :align: center
   :width: 300px

And then produce another separator (green) to pull apart the bottom red
points:

.. image:: figs/Graphs/split2.png
   :align: center
   :width: 300px

The neural network is allowed to transform the points based on their
distance from these separators (very similar to the loss function
above). It can use whatever function it wants to do this
transformation. Ideally, the function would make the points in yellow
and green high, and the other points low. One function to do this is
the ReLU function (ReLU stands for Rectified Linear Unit, a very
complicated way of saying "delete values below 0"):

.. image:: figs/Graphs/relu2.png
   :align: center
   :width: 400px

For the yellow separator, the ReLU yields the following values:

.. image:: figs/Graphs/relu.png
   :align: center
   :width: 400px

Basically, the top X's are positive and the bottom O's and X's are 0.
Something very similar happens for the green separator. Finally, yellow
and green become our new :math:`x_1, x_2`. Since all the O's are now at
the origin, it is very easy to separate out the space.

.. image:: figs/Graphs/mlpmid.png
   :align: center
   :width: 300px

Looking back at the original model, this process appears as if it has
produced two lines to pull apart the data.

.. image:: figs/Graphs/mlpgraph.png
   :align: center
   :width: 300px

Mathematically, we can think of the transformed data as values
:math:`h_1, h_2`, which we get from applying separators with different
parameters to the original data. The final prediction then applies a
separator to :math:`h_1, h_2`.

.. math::

   \begin{eqnarray*}
   h_1 &=& ReLU(x_1 \times w^0_1 + x_2 \times w^0_2 + b^0) \\
   h_2 &=& ReLU(x_1 \times w^1_1 + x_2 \times w^1_2 + b^1) \\
   m(x_1, x_2) &=& h_1 \times w_1 + h_2 \times w_2 + b
   \end{eqnarray*}

Here :math:`w_1, w_2, w^0_1, w^0_2, w^1_1, w^1_2, b, b^0, b^1` are all
parameters. We have gained more flexible models, at the cost of now
needing to fit many more parameters to the data.
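As a sketch, this two-stage model can be written directly in Python.
The helper ``relu``, the builder ``make_mlp``, and the parameter values
passed in below are all hand-picked placeholders for illustration, not
fitted values:

.. jupyter-execute::

   def relu(z):
       # "Delete values below 0."
       return z if z > 0.0 else 0.0

   def make_mlp(w01, w02, b0, w11, w12, b1, w1, w2, b):
       def model(x):
           # Stage 1: two linear separators, reshaped by ReLU.
           h1 = relu(x[0] * w01 + x[1] * w02 + b0)
           h2 = relu(x[0] * w11 + x[1] * w12 + b1)
           # Stage 2: a linear classifier over the transformed point.
           return 1 if (h1 * w1 + h2 * w2 + b > 0.0) else 0
       return model

   # Arbitrary parameters, just to show the shape of the model class.
   plot(s, make_mlp(1.0, 0.0, -0.7, -1.0, 0.0, 0.3, 1.0, 1.0, -0.1))

Fitting these nine parameters automatically and efficiently is exactly
the infrastructure that the coming modules build.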
This neural network will be the main focus for the first couple of
modules. It appears quite simple, but fitting it effectively will
require building up systems infrastructure. Once we have this
infrastructure, though, we will be able to easily support most modern
neural network models.