import math
import sys
import minitorch
sys.path.append("project/interface/")
sys.path.append("../project/interface/")
from plots import animate, plot
from drawing import split_graph, draw_below, aqua, black, draw_graph
from chalk import *
from drawing import x_mark, circle_mark
from colour import Color
import random
random.seed(10)
s = minitorch.datasets["Simple"](10)
spl = minitorch.datasets["Split"](10)
def linear(w, b):
def model(x1, x2):
return w[0] * x1 + w[1] * x2 + b
return model
base_model = linear((1, 1), -1.1)
s1 = [s.X[i] for i in range(len(s.y)) if base_model(*s.X[i]) < 0]
s2 = [s.X[i] for i in range(len(s.y)) if base_model(*s.X[i]) > 0]
def draw_with_easy_points(model):
return draw_graph(model) + split_graph(s1, s2, show_origin=False)
set_svg_draw_height(300)
This guide is a primer on the very basics of machine learning that are necessary to complete the assignments and motivate the final system. Machine learning is a rich and well-developed field with many different models, goals, and learning settings. There are many great texts that cover all the aspects of the area in detail. This guide is not that. Our goal is to explain the minimal details of one dataset with one class of model. Specifically, this is an introduction to supervised binary classification with neural networks. The goal of this section is to learn how a basic neural network works to classify simple points.
draw_with_easy_points(base_model)
Supervised learning problems begin with a labeled training
dataset.
We assume that we are given a set of labeled points. Each point has
two coordinates, $x_1$ and $x_2$, and a label $y$
corresponding to an O or an X. For instance, here are two points labeled O:
split_graph([s1[0]], []) | hstrut(0.3) | split_graph([s1[1]], [])
And here are two points labeled X:
split_graph([], [s2[0]]) | hstrut(0.3) | split_graph([], [s2[1]])
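Under the hood, each of these points is just a coordinate pair stored in s.X with a matching 0/1 label in s.y (the same attributes used by the plotting code above). A quick peek at the raw data:
# Print a couple of raw data points: (x_1, x_2) coordinates and their 0/1 labels.
print(s.X[0], s.y[0])
print(s.X[1], s.y[1])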
It is often convenient to plot all of the points together on one set of axes.
split_graph(s1, s2)
Here we can see that all the O points are in the top-right and all the X points are in the bottom-left. Not all datasets are this simple; here is another dataset where the points are split up a bit more.
d = spl
s1_hard = [d.X[i] for i in range(len(d.y)) if d.y[i] == 0]
s2_hard = [d.X[i] for i in range(len(d.y)) if d.y[i] == 1]
split_graph(s1_hard, s2_hard)
Later in the class, we will consider datasets of different forms, e.g. a dataset of handwritten digits, where some are 8's and others are 2's.
In addition to a dataset, our ML system needs to specify a model type that we want to fit
to the data. A model is a function that assigns labels to data points. In 2D,
we can visualize a model by its decision boundary.
For instance, consider the following (Model A).
def draw_with_hard_points(model, c1=None, c2=None):
if c1 is None:
d = draw_graph(model)
else:
d = draw_graph(model,c1=c1, c2=c2)
return d + split_graph(s1_hard, s2_hard, show_origin=False)
#-
bad_model = linear((1, 1), -1.5)
draw_graph(bad_model)
# For most of the data points, the model puts them in class X. Only for a
# small area in the top right would it decide to put points in class O.
# We can overlay the simple dataset described earlier over this model.
# This tells us roughly how well the model fits this dataset.
draw_with_easy_points(bad_model)
# Models can take many different forms. Here is another model, which we will
# discuss more below, that splits the data points up into three regions (Model B).
def split_model(x1, x2):
return linear((1, 1), -1.5)(x1, x2) * linear((1, 1), -0.5)(x1, x2)
draw_with_easy_points(split_model)
# Models may also have strange shapes and even disconnected regions. Any
# blue/red split will do, for instance (Model C):
def part_model(x1, x2):
return 1 if (0.0 <= x1 < 0.5 and 0.0 <= x2 < 0.6) else 0
draw_with_easy_points(part_model)
# A *model class* specifies the general shape of models that you want to
# explore. Given that we as programmers don't know what the dataset looks
# like, we try to give a class of functions for our system to explore. Machine
# learning is the process of finding the best model from that class.
# The first model class we consider is *linear models*. Linear models separate
# the data space with only a single straight line. For instance, Model A is
# a linear model, but an intuitively "better" model looks like this:
draw_with_easy_points(base_model)
# Note that Model B also uses lines, but it is not a linear model: it uses
# multiple lines to split up the space.
# Parameters
# -----------
# Once we have decided on our model class, we need a way to move between models
# in that
# class. Ideally, we would have internal knobs that alter the properties of
# the model.
def make_model(param):
def model(x1, x2):
if x1 < param:
return 1
else:
return 0
return model
draw_with_easy_points(make_model(0.4))
draw_with_easy_points(make_model(0.6))
# In the case of the linear models, there are two main knobs we might use,
# a. rotating the linear separator ("slope")
draw_graph(linear((1, 1), -1.0)).center_xy() | hstrut(0.5) | text("→", 0.5).fill_color(black) | hstrut(0.5) | draw_graph(linear((0.5, 1.5), -1.0)).center_xy()
# b. changing the separator cutoff ("intercept")
bias = draw_graph(linear((1, 1), -1.0)).center_xy() | hstrut(0.5) | text("→", 0.5).fill_color(black) | hstrut(0.5) | draw_graph(linear((1, 1), -1.5)).center_xy()
bias
Parameters are the set of numerical values that fully define a model's decisions. They are what we store to describe how a model acts, and they are what we need in order to produce its decision on a given data point.
In the case of linear models and binary classification, we can write down the linear model as:

$$m(x_1, x_2; w_1, w_2, b) = x_1 \times w_1 + x_2 \times w_2 + b$$

Here $w_1, w_2, b$ are parameters, $x_1, x_2$ are the input point, and the model predicts X if $m$ is greater than 0 and O otherwise. The semicolon notation indicates which arguments are for parameters and which are for data.
def make_linear(w1, w2, b):
def model(x1, x2):
return 1 if (x1 * w1 + x2 * w2 + b > 0.0) else 0
return model
biases = [-0.098 + (i / 100.0) for i in range(0, 25, 5)]
hcat([draw_with_easy_points(make_linear(0.1, -0.2, b)) for b in biases], sep=0.5)
Note:
See https://wikipedia.org/wiki/Linear_equation for a review of linear
equations, and an explanation of why this corresponds to parameterizing
the slope and intercept of a line.
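Concretely, the decision boundary is the set of points where $m(x_1, x_2; w_1, w_2, b) = 0$. Assuming $w_2 \neq 0$, solving for $x_2$ gives $x_2 = -\frac{w_1}{w_2} x_1 - \frac{b}{w_2}$: a line whose slope is $-w_1/w_2$ and whose $x_2$-intercept is $-b/w_2$.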
When we look at our data, we can clearly see that some models are good and make no classification errors:
draw_with_easy_points(linear((1, 1), -1.0))
And some are bad and make multiple errors:
draw_with_easy_points(linear((1, 1), -0.5))
In order to find a good model, we need to first define what good means. We do this through a loss function that scores how badly we are currently doing. A good model is the one that makes this loss as small as possible.
Our loss function will be based on the distance and direction of each point from the decision boundary line. You can show that this distance is proportional to the absolute value of the function $m()$ above.
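As a quick sanity check (a small sketch, not part of the original figures; the helper name m_demo is introduced only for this illustration): for the boundary $x_1 + x_2 + b = 0$, the geometric point-to-line distance is $|m(x_1, x_2)| / \sqrt{2}$, i.e. proportional to $|m|$.
# Sketch: for the line x1 + x2 + b = 0, the distance from a point (x1, x2)
# to the line is |x1 + x2 + b| / sqrt(2), proportional to |m(x1, x2)|.
m_demo = linear((1, 1), -0.5)
for x1, x2 in [(0.9, 0.9), (0.2, 0.1)]:
    print(abs(m_demo(x1, x2)) / math.sqrt(2))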
def with_points(pts1, pts2, b):
w1, w2 = 1, 1
model = linear((w1, w2), b)
line = make_path([(0, b), (1, b + 1)])
dia = draw_graph(model) + split_graph(pts1, pts2, False)
for pt in pts1 + pts2:
pt2 = line.get_trace().trace_p(P2(pt[0], -pt[1]), V2(-1, 1))
if pt2:
dia += make_path([(pt[0], -pt[1]), pt2]).dashing([5, 5], 0)
return dia
with_points(s1, s2, -0.5)
For simplicity, let us consider a single point with different models.
This point might be classified on the correct side of the line and be very far from it (Point A, "great"):
with_points([s1[0]], [], -1.5)
Or it might be on the correct side of the line, but close to the line (Point B, "worrisome"):
with_points([s1[0]], [], -1)
Or this point might be classified on the wrong side of the line (Point C, "bad"):
with_points([s1[0]], [], -0.5)
The loss is determined based on a function of this distance. The most commonly used function (and the one we will focus on) is based on the sigmoid function. For strongly negative inputs, the sigmoid goes to zero, and for strongly positive inputs, it goes to one. In between, it forms a smooth S-curve.
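To get a feel for this shape, here is a quick numerical check using the sigmoid provided by minitorch.operators (an illustrative sketch):
# The sigmoid saturates near 0 for negative inputs and near 1 for positive ones.
for v in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(v, minitorch.operators.sigmoid(v))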
def graph(fn, xs, os, width=4, offset=0, c=Color("black")):
path = []
for a in range(100):
a = width * ((a / 100) - 0.5) - offset
path.append((a, fn(a)))
dia = make_path([(-width/2, 0), (width/2, 0)]) + make_path(path).line_color(c)
for pt in xs:
dia += x_mark().scale(width / 2).translate(pt, fn(pt))
for pt in os:
dia += circle_mark().scale(width / 2).translate(pt, fn(pt))
return dia.reflect_y()
#-
def loss(x):
return -math.log(minitorch.operators.sigmoid(-x))
graph(loss, [], [])
As shown below, the losses of the three points land at the following positions on the curve: almost zero for Point A, a middle value for Point B, and a large value for Point C.
graph(loss, [], [-2, 0, 2])
The total loss for a model combines the individual losses over all of the points. It's easy to see that a good model yields a lower total loss than a bad one.
The model class tells us what models we can consider, the parameters tell us how to specify a given model, and the loss tells us how good our current model is. What we need is a method for finding a good model given a loss function. We refer to this step as parameter fitting.
Unfortunately, parameter fitting is quite difficult. For all but the simplest ML models, it is a challenging and computationally demanding task. For our sample problem there are just 3 parameters, but nowadays some large models have billions of parameters that need to be fit.
This is the step where libraries like MiniTorch come in handy. This library aims to demonstrate how, with careful coding, we can set up a framework to fit parameters for supervised classification in an automatic and efficient manner.
The library focuses on one form of parameter fitting: gradient descent. Intuitively, gradient descent works in the following manner.
Let's return to the incorrect model above.
draw_with_easy_points(linear((1, 1), -0.5))
As we noted, this model has a high loss, and we want to consider ways to "turn the knobs" of the parameters to find a better model. Let us focus on the parameter controlling the intercept of the model.
bias
We can consider how the loss changes with respect to just varying this parameter. It seems like the loss will go down if we lower the intercept a bit.
linear((1, 1), -0.5)(0, 0)
-0.5
def full_loss(b):
l = 0
m = linear((1, 1), b)
for pt in s1:
l += -loss(-m(*pt))
for pt in s2:
l += -loss( m(*pt))
return l
d = empty()
scores = []
path = []
i = 0
for j, b in enumerate(range(20)):
b = -1.5 + b / 20
pt = (b, full_loss(b))
path.append(pt)
if j % 5 == 0:
d = d | hstrut(0.5) | draw_with_easy_points(linear((1, 1), b)).named(("graph", i))
p = circle(0.01).translate(pt[0], pt[1]).fill_color(black)
p = p.named(("x", i))
i += 1
scores.append(p)
print(path)
d = (concat(scores) + make_path(path)).center_xy().scale(3) / vstrut(0.5) / d.scale(2).center_xy()
for i in range(i):
d = d.connect(("graph", i), ("x", i), ArrowOpts(head_pad=0.1))
set_svg_height(500)
d
[(-1.5, -8.91709764826237), (-1.45, -8.816705574754979), (-1.4, -8.722212419566382), (-1.35, -8.633669248579224), (-1.3, -8.55112072729222), (-1.25, -8.474604921691927), (-1.2, -8.40415312616738), (-1.15, -8.339789720058922), (-1.1, -8.281532054193912), (-1.05, -8.229390368502255), (-1.0, -8.183367741527482), (-0.95, -8.143460072358145), (-0.9, -8.109656095204183), (-0.85, -8.08193742653808), (-0.8, -8.06027864441572), (-0.75, -8.044647399291918), (-0.7, -8.03500455535482), (-0.65, -8.031304361126814), (-0.6, -8.033494647821058), (-0.55, -8.041517053706167)]
Doing this leads to a better model.
set_svg_height(300)
draw_with_easy_points(linear((1, 1), -1.1))
We can repeat this process for the slope as well as for all the other parameters in the model.
But how did we know how the loss function changed if we changed the intercept? For a small problem, we can just move it a bit and see. But remember that machine learning models can have billions of parameters, so this would take a ton of time.
A better approach is to utilize calculus and take the derivative of the loss
function with respect to the parameter, $L'_b$. If we can efficiently and
automatically compute this derivative, it tells us how to change the parameter
to reduce the loss. Even better, if we can efficiently compute a set of
derivatives for all the parameters (known as a gradient),
then we know which direction they all should move.
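As a concrete (if inefficient) stand-in for what such a derivative tells us, we can numerically estimate how the quantity computed by full_loss above changes when the intercept is nudged. This finite-difference sketch is only for intuition; the names eps and b0 are introduced here, and MiniTorch's whole point is to compute such derivatives automatically instead.
# Central-difference estimate of the derivative of full_loss with respect to b at b = -0.5.
eps = 1e-4
b0 = -0.5
print((full_loss(b0 + eps) - full_loss(b0 - eps)) / (2 * eps))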
The first 4 modules in MiniTorch are dedicated to implementing this fitting procedure efficiently.
The linear model class can be used to find good fits to the data we have considered so far, but it fails for data that splits up into multiple segments. These datasets are not linearly separable.
An alternative model class for this style of data is a neural network. Neural networks can be used to specify a much wider range of separators.
Intuitively, neural networks divide classification into two or more stages. Each stage uses a linear model to reshape the data into new points. The final stage is a linear classifier over the transformed points.
Let's look at our dataset:
draw_with_hard_points(linear((1, 1), -0.7))
A neural network might first produce a separator (yellow) to pull apart the top red points:
yellow = linear((-1, 0), 0.25)
ycolor = Color("#fde699")
draw_with_hard_points(yellow, ycolor, Color("white"))
And then produce another separator (green) to pull apart the bottom red points:
green = linear((1, 0), -0.8)
gcolor = Color("#d1e9c3")
draw_with_hard_points(green, gcolor, Color("white"))
The neural network is allowed to transform the points based on their distance from these separators (very similar to the loss function above). It can use whatever function it wants to do this transformation. Ideally, the function would make the points in the yellow and green regions high, and the other points low. One function that does this is the ReLU function (ReLU stands for Rectified Linear Unit, a very complicated way of saying "delete values below 0"):
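As a quick numerical check of ReLU itself, using the relu operator from minitorch.operators that the plots below rely on (an illustrative sketch):
# ReLU passes positive values through unchanged and clamps negatives to 0.
for v in [-1.0, 0.0, 0.5, 2.0]:
    print(v, minitorch.operators.relu(v))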
yellow(1.0, 0.0)
-0.75
For the yellow separator, the ReLU yields the following values:
graph(minitorch.operators.relu,
      [yellow(*pt) for pt in s2_hard],
      [yellow(*pt) for pt in s1_hard], 1, 0.25, c=ycolor)
Basically the top X's are positive and the bottom O's and X's are 0. Something very similar happens for the green separator:
graph(minitorch.operators.relu,
      [green(*pt) for pt in s2_hard],
      [green(*pt) for pt in s1_hard], 1.5, 0.25, c=gcolor)
Finally, the yellow and green values become our new $x_1, x_2$. Since all the O's are now at the origin, it is very easy to separate out the space.
sc = 3
s1_trans = [(minitorch.operators.relu(green(*p)) * sc, minitorch.operators.relu(yellow(*p)) * sc) for p in s1_hard]
s2_trans = [(minitorch.operators.relu(green(*p)) * sc, minitorch.operators.relu(yellow(*p)) * sc) for p in s2_hard]
final = linear((1, 1), -0.3)
draw_graph(final) + make_path([(0, 0), (0, -1)]).line_color(ycolor).line_width(1.0) + make_path([(0, 0), (1, 0)]).line_color(gcolor).line_width(1.0) + split_graph(s1_trans, s2_trans, show_origin=False)
Looking back at the original data space, this process looks as if it has produced two lines to pull apart the data.
def mlp(x, y):
    # hidden values: ReLU of the yellow and green separators
    yel = minitorch.operators.relu(yellow(x, y))
    gre = minitorch.operators.relu(green(x, y))
    # final linear separator over the transformed point
    return final(sc * gre, sc * yel)
draw_with_hard_points(mlp)
mlp(0.5, 0.5)
-0.3
Mathematically we can think of the transformed data as values $h_1, h_2$ which we get from applying separators with different parameters to the original data. The final prediction then applies a separator to $h_1, h_2$.
\begin{eqnarray*} h_1 &=& \text{ReLU}(x_1 \times w^0_1 + x_2 \times w^0_2 + b^0) \\ h_2 &=& \text{ReLU}(x_1 \times w^1_1 + x_2 \times w^1_2 + b^1)\\ m(x_1, x_2) &=& h_1 \times w_1 + h_2 \times w_2 + b \end{eqnarray*}
Here $w_1, w_2, w^0_1, w^0_2, w^1_1, w^1_2, b, b^0, b^1$ are all parameters. We have gained more flexible models, at the cost of now needing to fit many more parameters to the data.
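Written as code, the same three equations look as follows (a sketch with a hypothetical function name and parameter names mirroring $w^0, b^0, w^1, b^1, w, b$; it is not the API MiniTorch itself will use):
def neural_network(x1, x2, w0, b0, w1, b1, w, b):
    # hidden units: ReLU applied to two linear separators of the input
    h1 = minitorch.operators.relu(x1 * w0[0] + x2 * w0[1] + b0)
    h2 = minitorch.operators.relu(x1 * w1[0] + x2 * w1[1] + b1)
    # final linear separator over the hidden units
    return h1 * w[0] + h2 * w[1] + b
For example, neural_network(x1, x2, (-1, 0), 0.25, (1, 0), -0.8, (sc, sc), -0.3) should reproduce the mlp model above.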
This neural network will be the main focus of the first couple of modules. It appears quite simple, but fitting it effectively will require building up systems infrastructure. Once we have this infrastructure, though, we will be able to easily support most modern neural network models.