import math
from dataclasses import dataclass
import chalk
from chalk import hcat
from colour import Color
from mt_diagrams.drawing import r
from mt_diagrams.mlprimer_draw import (
compare,
draw_graph,
draw_nn_graph,
draw_with_hard_points,
graph,
s,
s1,
s1_hard,
s2,
s2_hard,
show,
show_loss,
split_graph,
with_points,
)
import minitorch
chalk.set_svg_draw_height(300)
chalk.set_svg_height(300)
ML Primer
This guide is a primer on the very basics of machine learning that are necessary to complete the assignments and motivate the final system. Machine learning is a rich and well-developed field with many different models, goals, and learning settings. There are many great texts that cover all the aspects of the area in detail. This guide is not that. Our goal is to explain the minimal details of one dataset with one class of model. Specifically, this is an introduction to supervised binary classification with neural networks. The goal of this section is to learn how a basic neural network works to classify simple points.
Dataset
Supervised learning problems begin with a labeled training
dataset.
We assume that we are given a set of labeled points. Each point has
two coordinates $x_1$ and $x_2$, and has a label $y$
corresponding to an O or X. For instance, here is one O labeled point:
d = hcat([split_graph([s1[0]], []), split_graph([s1[1]], [])], 0.3)
r(d, "figs/Graphs/data1.svg")
And here is an X labeled point.
d = hcat([split_graph([], [s2[0]]), split_graph([], [s2[1]])], 0.3)
r(d, "figs/Graphs/data2.svg")
It is often convenient to plot all of the points together on one set of axes.
d = split_graph(s1, s2)
r(d, "figs/Graphs/data3.svg")
Here we can see that all the X points are in the top-right and all the O points are in the bottom-left. Not all datasets are this simple, and here is another dataset where the points are split up a bit more.
d = split_graph(s1_hard, s2_hard)
r(d, "figs/Graphs/data4.svg")
Later in the class, we will consider datasets of different forms, e.g. a dataset of handwritten numbers, where some are 8's and others are 2's. Here is an example of what this dataset looks like.
Model
Our ML system needs to specify a model that we want to fit to the data. A model is a function that assigns labels to data points. We can specify a model in Python through its parameters and a forward function.
@dataclass
class Linear:
    # Parameters
    w1: float
    w2: float
    b: float

    def forward(self, x1: float, x2: float) -> float:
        return self.w1 * x1 + self.w2 * x2 + self.b
This model can be written mathematically as,
$$m(x_1, x_2; w_1, w_2, b) = x_1 \times w_1 + x_2 \times w_2 + b$$.
We call it a linear model because it divides the data points up based on a line. We can visualize this by computing the "decision boundary", i.e. the regions where this function returns positive versus negative values.
model = Linear(1, 1, -0.9)
d = draw_graph(model)
r(d, "figs/Graphs/model1.svg")
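Concretely, the sign of the model's output gives the predicted label. A small standalone sketch (re-declaring `Linear` so it runs on its own, with hypothetical points chosen on each side of the boundary):

```python
from dataclasses import dataclass

@dataclass
class Linear:
    w1: float
    w2: float
    b: float

    def forward(self, x1: float, x2: float) -> float:
        return self.w1 * x1 + self.w2 * x2 + self.b

model = Linear(1.0, 1.0, -0.9)
top_right = model.forward(0.8, 0.5)    # positive -> classified as X
bottom_left = model.forward(0.2, 0.3)  # negative -> classified as O
```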
We can overlay the simple dataset described earlier over this model. This tells us roughly how well the model fits this dataset.
d = show(model)
r(d, "figs/Graphs/incorrect.svg")
Models can take many different forms. Here is another model with a compound form; we will discuss these types of models more below. It splits its decision into three regions (Model B).
@dataclass
class Split:
    m1: Linear
    m2: Linear

    def forward(self, x1, x2):
        return self.m1.forward(x1, x2) * self.m2.forward(x1, x2)
model_b = Split(Linear(1, 1, -1.5), Linear(1, 1, -0.5))
d = draw_graph(model_b)
r(d, "figs/Graphs/model2.svg")
Models may also have strange shapes and even disconnected regions. Any blue/red split will do, for instance (Model C):
@dataclass
class Part:
    def forward(self, x1, x2):
        return 1 if (0.0 <= x1 < 0.5 and 0.0 <= x2 < 0.6) else 0
d = draw_graph(Part())
r(d, "figs/Graphs/model3.svg")
Parameters
Once we have decided on the shape that we are using, we need a way to move between models in that class. Ideally, we would have internal knobs that alter the properties of the model.
show(Linear(1, 1, -0.5))
show(Linear(1, 1, -1))
In the case of linear models, there are two knobs:
a. rotating the separator
model1 = Linear(1, 1, -1.0)
model2 = Linear(0.5, 1.5, -1.0)
d = compare(model1, model2)
r(d, "figs/Graphs/weight.svg")
b. changing the separator cutoff
model1 = Linear(1, 1, -1.0)
model2 = Linear(1, 1, -1.5)
d = compare(model1, model2)
r(d, "figs/Graphs/bias.svg")
Parameters are the set of numerical values that fully define a model's decisions. Parameters are critical for storing how a model acts, and necessary for producing its decision on a given data point.
Recall the functional form of the model is,

$$m(x_1, x_2; w_1, w_2, b) = x_1 \times w_1 + x_2 \times w_2 + b$$

Here $w_1, w_2, b$ are the parameters and $x_1, x_2$ are the input point. The semi-colon notation indicates which arguments are parameters and which are data.
Our goal in this class will be to move these knobs to find the best data fit.
biases = [(i / 25.0) - 0.1 for i in range(0, 26, 5)]
d = hcat([show(Linear(1.0, 1.0, -b)) for b in biases], sep=0.5)
r(d, "figs/Graphs/knob.svg")
Loss
Observing the data, we can see that some parameters lead to good models with few classification errors,
show(Linear(1, 1, -1.0))
And some are bad and make multiple errors,
show(Linear(1, 1, -0.5))
In order to find a good model, we need to first define what good means. We do this through a loss function that quantifies how badly we are currently doing. A good model has a small loss.
Our loss function will be based on the distance and direction from each point to the decision boundary line.
d = with_points(s1, s2, Linear(1, 1, -0.4))
r(d, "figs/Graphs/to_boundary.svg")
Consider a single point with different models.
This point might be classified on the correct side and very far from the line (Point A, "great"):
d = with_points([s1[0]], [], Linear(1, 1, -1.5))
r(d, "figs/Graphs/pointA.svg")
Or it might be on the correct side of the line, but close to the line (Point B, "worrisome"):
d = with_points([s1[0]], [], Linear(1, 1, -1))
r(d, "figs/Graphs/pointB.svg")
Or this point might be classified on the wrong side of the line (Point C, "bad"):
d = with_points([s1[0]], [], Linear(1, 1, -0.5))
r(d, "figs/Graphs/pointC.svg")
The loss is determined by a function of this distance. The most commonly used function (and the one we will focus on) is the sigmoid function. For strongly negative inputs, it goes to 0; for strongly positive inputs, it goes to 1. In between, it forms a smooth S-curve.
d = graph(minitorch.operators.sigmoid, width=8).scale_x(0.5)
r(d, "figs/Graphs/loss.svg")
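To see these limits numerically, here is a standalone sketch with the sigmoid written out directly (standing in for `minitorch.operators.sigmoid`):

```python
import math

def sigmoid(x: float) -> float:
    # standalone stand-in for minitorch.operators.sigmoid
    return 1.0 / (1.0 + math.exp(-x))

# strongly negative inputs approach 0, strongly positive approach 1
values = [sigmoid(-6.0), sigmoid(0.0), sigmoid(6.0)]
```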
For computational reasons, in practice we work with the negative log of this function. This yields a loss that is near zero for well-classified points and gets much worse as a point moves further onto the wrong side of the decision boundary.
def point_loss(x):
    return -math.log(minitorch.operators.sigmoid(-x))
d = graph(point_loss, [], [])
r(d, "figs/Graphs/pointloss.svg")
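To make the shape concrete, here is a standalone sketch (with the sigmoid written out in place of minitorch's operator) evaluating this loss at a few signed distances:

```python
import math

def point_loss(x: float) -> float:
    # -log(sigmoid(-x)): near zero for x far on the correct side,
    # growing roughly linearly for x far on the wrong side
    return -math.log(1.0 / (1.0 + math.exp(x)))

losses = [point_loss(x) for x in [-2.0, -0.2, 1.0]]
```

The loss increases monotonically with the signed distance, so a point deep on the wrong side is penalized far more than one just past the boundary.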
The losses of the three points from above land at the following positions on this loss curve: almost zero for Point A, a middle value for Point B, and a large value for Point C.
d = graph(point_loss, [], [-2, -0.2, 1])
r(d, "figs/Graphs/pointloss2.svg")
Loss is given for the red points as well, but they are penalized in the opposite direction,
d = graph(lambda x: point_loss(-x), [-1, 0.4, 1.3], [])
r(d, "figs/Graphs/pointloss3.svg")
The total loss function $L$ for a model is the sum of the individual point losses. Specifically,

$$L(w_1, w_2, b) = \sum_j -\log \sigma\left(y^j \times m(x^j_1, x^j_2; w_1, w_2, b)\right)$$

where $(x^j, y^j)$ are the datapoints, $\sigma$ is the sigmoid function, and multiplying by $y^j$ reverses the function based on the true class of the point. Here is what this looks like in code.
def full_loss(m):
    # sum the per-point losses over the dataset
    loss = 0.0
    for x, y in zip(s.X, s.y):
        loss += point_loss(-y * m.forward(*x))
    return loss
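As a standalone sketch with a hypothetical two-point dataset in place of `s` (one X point with $y = +1$, one O point with $y = -1$):

```python
import math

def point_loss(x):
    return -math.log(1.0 / (1.0 + math.exp(x)))  # -log(sigmoid(-x))

def model(x1, x2):
    return x1 + x2 - 1.0  # Linear(1, 1, -1)

# hypothetical data: one X point (y = +1) and one O point (y = -1)
data = [((0.8, 0.7), 1.0), ((0.2, 0.1), -1.0)]

total = sum(point_loss(-y * model(*x)) for x, y in data)
```

Both points are correctly classified here, so each contributes a small loss; a misclassified point would dominate the total.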
Fitting Parameters

To review, the model class tells us what shapes we can consider, the parameters tell us the decision boundary, and the loss tells us how well the current model is doing.

The last step is to produce a method for finding a good model given a loss function, referred to as *parameter fitting*. Exact parameter fitting is difficult. For all but the simplest models, it is a challenging task. This example has just 3 parameters, but some large models may have billions of parameters that need to be fit.

We will focus on parameter fitting with *gradient descent*. Gradient descent works in the following manner:

1. Compute the loss function, $L$, for the data with the current parameters.
2. See how small changes to each of the parameters would change the loss.
3. Update the parameters with a small change in the direction that locally most reduces the loss.
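The steps above can be sketched with a finite-difference approximation of step 2. This is only an illustration on a hypothetical toy dataset; minitorch will later compute these gradients exactly and efficiently:

```python
import math

def point_loss(x):
    return -math.log(1.0 / (1.0 + math.exp(x)))  # -log(sigmoid(-x))

# hypothetical toy data: X points (y = +1) top-right, O points (y = -1) bottom-left
data = [((0.8, 0.7), 1.0), ((0.9, 0.8), 1.0),
        ((0.2, 0.1), -1.0), ((0.3, 0.2), -1.0)]

def loss(params):
    # step 1: total loss for the current parameters
    w1, w2, b = params
    return sum(point_loss(-y * (w1 * x1 + w2 * x2 + b)) for (x1, x2), y in data)

params = [1.0, 1.0, -0.5]
start = loss(params)
eps, lr = 1e-5, 0.5
for _ in range(100):
    # step 2: estimate how a small change to each parameter changes the loss
    grad = []
    for i in range(3):
        bumped = list(params)
        bumped[i] += eps
        grad.append((loss(bumped) - loss(params)) / eps)
    # step 3: move each parameter a small amount against its gradient
    params = [p - lr * g for p, g in zip(params, grad)]
```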
# Let's return to the incorrect model above.
m = Linear(1, 1, -0.5)
d = show(m)
r(d, "figs/Graphs/fit1.svg", 500)
As we noted, this model has a high loss, and we want to consider ways to "turn the knobs" of the parameters to find a better model. Let us focus on the parameter controlling the intercept.
We can consider how the loss changes with respect to just varying this parameter. It seems like the loss will go down if we move the intercept a bit.
m = Linear(1, 1, -0.55)
d = show(m)
r(d, "figs/Graphs/fit2.svg", 500)
d = show_loss(full_loss, Linear(1, 1, 0))
chalk.set_svg_height(500)
r(d, "figs/Graphs/loss.svg", 500)
d
Doing this leads to a better model.
chalk.set_svg_height(200)
We can repeat this process for the intercept as well as for all the other parameters in the model.
But how do we know how the loss function will change? For a small problem, we can simply nudge a parameter and observe. But remember that machine learning models can be very large.
In the first module of Minitorch, we will see how to compute this direction efficiently for small problems, and then scale it up to much larger models.
Neural Networks
The linear model class can be used to find good fits to the data we have considered so far, but it fails for data that splits up into multiple segments. These datasets are not linearly separable. Let us consider a very simple dataset with this property.
split_graph(s1_hard, s2_hard, show_origin=True)
Let's look at our dataset:
model = Linear(1, 1, -0.7)
draw_with_hard_points(model)
An alternative model class for this data is a neural network. Neural networks can be used to specify a much wider range of separators.
Neural networks are compound model classes that divide classification into two or more stages.
Each stage uses a linear model to separate the data, and then an activation function to reshape it.
To see how this works, consider how we might split up the dataset above. Instead of splitting all the points directly, we might first split off the left points,
yellow = Linear(-1, 0, 0.25)
ycolor = Color("#fde699")
draw_with_hard_points(yellow, ycolor, Color("white"))
And then produce another separator (green) to pull apart the red points,
green = Linear(1, 0, -0.8)
gcolor = Color("#d1e9c3")
draw_with_hard_points(green, gcolor, Color("white"))
We would like only points in the green or yellow sections to be classified as X's.
To do this, we employ an activation function that filters out only these points. This function is known as a ReLU function, which is a fancy way of saying "threshold".
$$\text{ReLU}(z) = \begin{cases} z & z \geq 0\\ 0 & z < 0 \end{cases}$$

For the yellow separator, the ReLU yields the following values:

graph(
    minitorch.operators.relu,
    [yellow.forward(*pt) for pt in s2_hard],
    [yellow.forward(*pt) for pt in s1_hard],
    3,
    0.25,
    c=ycolor,
)

And for the green separator:

graph(
    minitorch.operators.relu,
    [green.forward(*pt) for pt in s2_hard],
    [green.forward(*pt) for pt in s1_hard],
    3,
    0.25,
    c=gcolor,
)
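A minimal standalone sketch of this thresholding (with hypothetical separator outputs in place of the plotted values):

```python
def relu(z: float) -> float:
    # pass positive values through unchanged; clamp negatives to zero
    return z if z >= 0.0 else 0.0

separator_outputs = [-0.6, -0.1, 0.0, 0.3, 0.8]
thresholded = [relu(z) for z in separator_outputs]  # [0.0, 0.0, 0.0, 0.3, 0.8]
```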
Basically, the X's on the right are thresholded to positive values, while the other O's and X's are set to 0.
Finally, the yellow and green outputs become our new $x_1, x_2$. Since all the O's now sit at the origin, it is very easy to separate out the space.
draw_nn_graph(green, yellow)
Looking back at the original model, this process has effectively produced two lines that pull apart the data.
@dataclass
class MLP:
    lin1: Linear
    lin2: Linear
    final: Linear

    def forward(self, x1, x2):
        x1_1 = minitorch.operators.relu(self.lin1.forward(x1, x2))
        x2_1 = minitorch.operators.relu(self.lin2.forward(x1, x2))
        return self.final.forward(x1_1, x2_1)
mlp = MLP(green, yellow, Linear(3, 3, -0.3))
draw_with_hard_points(mlp)
d = draw_with_hard_points(mlp)
r(d, "figs/Graphs/hard.svg")
Mathematically we can think of the transformed data as values $h_1, h_2$ which we get from applying separators with different parameters to the original data. The final prediction then applies a separator to $h_1, h_2$.
\begin{eqnarray*} h_1 &=& \text{ReLU}(x_1 \times w^0_1 + x_2 \times w^0_2 + b^0) \\ h_2 &=& \text{ReLU}(x_1 \times w^1_1 + x_2 \times w^1_2 + b^1) \\ m(x_1, x_2) &=& h_1 \times w_1 + h_2 \times w_2 + b \end{eqnarray*}
Here $w_1, w_2, w^0_1, w^0_2, w^1_1, w^1_2, b, b^0, b^1$ are all parameters. We have gained more flexible models, at the cost of now needing to fit many more parameters to the data.
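These equations can be traced by hand using the separators from above (green as $h_1$ with weights $(1, 0)$ and bias $-0.8$; yellow as $h_2$ with weights $(-1, 0)$ and bias $0.25$; final weights $(3, 3, -0.3)$) on a few hypothetical test points:

```python
def relu(z):
    return z if z >= 0.0 else 0.0

def mlp(x1, x2):
    # hidden units: the green and yellow separators from above, then ReLU
    h1 = relu(1.0 * x1 + 0.0 * x2 - 0.8)    # green
    h2 = relu(-1.0 * x1 + 0.0 * x2 + 0.25)  # yellow
    # final linear separator on the transformed point (h1, h2)
    return 3.0 * h1 + 3.0 * h2 - 0.3

right_x = mlp(0.95, 0.5)  # green region fires: positive -> X
left_x = mlp(0.05, 0.5)   # yellow region fires: positive -> X
middle_o = mlp(0.5, 0.5)  # neither fires: negative -> O
```

Note how two disconnected regions both map to positive outputs, something no single linear model could do.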
This neural network will be the main focus for the first couple of modules. It appears quite simple, but fitting it effectively will require building up systems infrastructure. Once we have this infrastructure, though, we will easily be able to support most modern neural network models.