Module 4.3 - Advanced NNs¶

"Pooling"¶

Reduction applied to each region:

Simple Implementation¶

  • Ensure that it is contiguous
  • Use View to "fold" the tensor

Why does folding work?¶

  • View requires a "contiguous" tensor
  • View(4, 2) makes strides (2, 1), as the check below shows
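
A quick way to see the stride claim (using PyTorch rather than the course's own tensor class, purely for illustration):

```python
import torch

# A contiguous 1-D tensor of 8 elements has stride (1,).
t = torch.arange(8)
print(t.stride())              # (1,)

# view(4, 2) reinterprets the same memory: moving one row skips
# 2 elements, moving one column skips 1 element.
folded = t.view(4, 2)
print(folded.stride())         # (2, 1)
print(folded.is_contiguous())  # True
```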

Simple Implementation¶

  • Reduce along the created fold (sketch below)
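
Putting the steps together, a minimal average-pooling sketch (shown with PyTorch tensors for illustration; the assignment's own tensor API may differ):

```python
import torch

def avgpool1d(x, kernel):
    """Average-pool a (batch, width) tensor over non-overlapping windows."""
    batch, width = x.shape
    assert width % kernel == 0
    # 1. Ensure the storage is contiguous so that view is legal.
    x = x.contiguous()
    # 2. "Fold": view width as (width // kernel, kernel) blocks.
    folded = x.view(batch, width // kernel, kernel)
    # 3. Reduce along the created fold.
    return folded.mean(dim=2)

# avgpool1d(torch.arange(8.0).view(1, 8), kernel=2)  -> tensor([[0.5, 2.5, 4.5, 6.5]])
```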

Quiz¶

More Reductions¶

  • Heading for a max reduction
  • Heading for a softmax output
  • Quick detour

ReLU, Step, Sigmoid¶

Basic Operations¶

  • Introduced in Module-0
  • Widely used in ML
  • What are they?

Simple Function: ReLU¶

Main "activation" function

Primarily used to split the data.

Simple Function: Step¶

The step function $f(x) = x > 0$ determines the correct answer.

ReLU¶

Mathematically,

$$\text{ReLU}(x) = \max\{0, x\}$$

Simplest max function.

Step¶

Mathematically,

$$\text{step}(x) = x > 0 = \arg\max\{0, x\}$$

Simplest argmax function.

Relationship¶

Step is derivative of ReLU

$$ \begin{eqnarray*} \text{ReLU}'(x) &=& \begin{cases} 0 & \text{if } x \leq 0 \\ 1 & \text{otherwise} \end{cases} \\ \text{step}(x) &=& \text{ReLU}'(x) \end{eqnarray*} $$

A loss built from step tells us how many points are wrong.

Derivative of Step?¶

Mathematically,

$$\text{step}'(x) = \begin{cases} 0 & \text{if } x \leq 0 \\ 0 & \text{otherwise} \end{cases}$$

Not a useful function to differentiate

Alternative Function: Sigmoid¶

Used to determine the loss function

Soft (arg)max?¶

It would be nice to have a version with a useful derivative

$$\text{sigmoid}(x) = \text{softmax} \{0, x\}$$

Useful soft version of argmax.

Max, Argmax, Softmax¶

Challenge¶

How do we generalize sigmoid to multiple outputs?

Max reduction¶

  • Max is a binary associative operator
  • $\max(a, b)$ returns the larger value
  • Generalizes ReLU: $\text{ReLU}(a) = \max(a, 0)$ (see the sketch below)
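
For instance, because max is binary and associative, reducing a whole list is just repeated application of the two-argument version (plain Python sketch):

```python
from functools import reduce

# Chain the binary max(a, b) over a list to get a full max reduction.
values = [3.0, -1.0, 7.0, 2.0]
print(reduce(lambda a, b: a if a > b else b, values))  # 7.0

# The same binary operator also gives ReLU: ReLU(a) = max(a, 0).
relu = lambda a: max(a, 0.0)
print(relu(-2.5), relu(2.5))  # 0.0 2.5
```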

Max Pooling¶

  • Common to apply pooling with max
  • Sets pooled value to "most active" in block
  • Forward code is easy to implement (see the sketch below)
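
A sketch of the forward pass (a plain NumPy version, not the course's tensor API):

```python
import numpy as np

def maxpool1d(x, kernel):
    """Max-pool a (batch, width) array over non-overlapping windows."""
    batch, width = x.shape
    assert width % kernel == 0
    # Fold width into (width // kernel, kernel) blocks, then keep the
    # "most active" value in each block.
    folded = x.reshape(batch, width // kernel, kernel)
    return folded.max(axis=2)
```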

Max Backward¶

  • Unlike sum, max throws away other values
  • Only top value gets used
  • Backward needs to know this.

Argmax¶

  • Function that returns the argmax as a one-hot tensor
  • Generalizes step

Max Backward¶

  • First compute the argmax
  • Only send the gradient to the argmax position in the grad input (see the sketch below)
  • Everything else is 0
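
A minimal NumPy sketch of this routing (the helper name `argmax_one_hot` is illustrative, not the assignment's API):

```python
import numpy as np

def argmax_one_hot(x, dim):
    # One-hot tensor marking the (first) max position along `dim`.
    one_hot = np.zeros_like(x)
    idx = np.expand_dims(np.argmax(x, axis=dim), axis=dim)
    np.put_along_axis(one_hot, idx, 1.0, axis=dim)
    return one_hot

def max_backward(x, grad_out, dim):
    # Route the incoming gradient only to the argmax position;
    # every other position gets a 0 gradient.
    return argmax_one_hot(x, dim) * np.expand_dims(grad_out, axis=dim)

x = np.array([[1.0, 5.0, 3.0], [2.0, 2.0, 0.0]])
g = np.ones(2)                 # upstream gradient, one value per reduced row
print(max_backward(x, g, dim=1))
# [[0. 1. 0.]
#  [1. 0. 0.]]   <- the tie in row 1 is broken by picking the first max
```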

Ties¶

  • What if there are two or more argmax positions?
  • Max is non-differentiable there, just like ReLU at 0.
  • Short answer: ignore the issue and pick one

HW¶

  • When writing tests for max, ties will break finite-difference gradient checks
  • Suggestion: perturb your input by adding a small amount of random noise (see below).
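
For example (the exact gradient-check helper depends on your scaffold; `minitorch.grad_check` is just one possibility):

```python
import numpy as np

# Before a finite-difference check on max, nudge the input so that
# no two entries tie exactly.
x = np.array([1.0, 1.0, 3.0, 2.0])        # 1.0 appears twice: a tie
x = x + 1e-4 * np.random.rand(*x.shape)   # tiny noise breaks the tie
# ... then run the gradient check on the max op with this perturbed input.
```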

Soft argmax?¶

  • Need a soft version of argmax.
  • Generalizes sigmoid for our new loss function
  • Standard name -> softmax

Softmax¶

$$\text{softmax}(\textbf{x}) = \frac{\exp \textbf{x}}{\sum_i \exp x_i}$$
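
A direct NumPy sketch of this formula (the max subtraction is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def softmax(x):
    # Shifting by the max leaves the result unchanged but avoids overflow in exp.
    e = np.exp(x - np.max(x))
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))        # [0.09003057 0.24472847 0.66524096]
print(softmax(np.array([1.0, 2.0, 3.0])).sum())  # 1.0
```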

Sigmoid is Softmax¶

$$\text{softmax}([0, x])[1] = \frac{\exp x}{\exp x + \exp 0} = \sigma(x)$$
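
A quick numerical check of this identity (plain NumPy; index 1 picks the entry corresponding to $x$):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 1.7
p = np.exp([0.0, x]) / np.exp([0.0, x]).sum()   # softmax over [0, x]
print(p[1], sigmoid(x))                         # both ~0.8455
```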

Softmax¶

Review¶

  • ReLU -> Max
  • Step -> Argmax
  • Sigmoid -> Softmax

Softmax¶

Network¶

Softmax Layer¶

  • Produces a probability distribution over outputs (sums to 1)
  • Derivative similar to sigmoid (see below)
  • Lots of interesting practical properties
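
One way to make the "similar to sigmoid" claim concrete: writing $p = \text{softmax}(\textbf{x})$, the Jacobian is

$$\frac{\partial p_i}{\partial x_j} = p_i (\delta_{ij} - p_j)$$

so each diagonal entry has the familiar $p(1 - p)$ form of the sigmoid derivative.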

Softmax in Context¶

  • Not a map! Each output depends on every input.
  • Gradient spreads out from one point to all.

Softmax¶

  • [Colab](https://colab.research.google.com/drive/1EB7MI_3gzAR1gFwPPO27YU9uYzE_odSu)

Soft Gates¶

New Methods¶

  • Sigmoid and softmax produce distributions
  • Can be used to "control" information flow

Example¶

Returns a combination of $x$ and $y$:

$$f(x, y, r) = x \cdot \sigma(r) + y \cdot (1 - \sigma(r))$$

Gradient is controlled¶

$$\begin{eqnarray*} f'_x(x, y, r) &=& \sigma(r) \\ f'_y(x, y, r) &=& 1 - \sigma(r) \\ f'_r(x, y, r) &=& (x - y)\, \sigma'(r) \end{eqnarray*}$$
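
These can be checked directly with automatic differentiation (a small PyTorch sketch, separate from the course code):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(-1.0, requires_grad=True)
r = torch.tensor(0.5, requires_grad=True)

# f(x, y, r) = x * sigmoid(r) + y * (1 - sigmoid(r))
f = x * torch.sigmoid(r) + y * (1 - torch.sigmoid(r))
f.backward()

s = torch.sigmoid(torch.tensor(0.5))
print(x.grad, s)                              # df/dx = sigmoid(r)
print(y.grad, 1 - s)                          # df/dy = 1 - sigmoid(r)
print(r.grad, (2.0 - (-1.0)) * s * (1 - s))   # df/dr = (x - y) * sigmoid'(r)
```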

Neural Network Gates¶

Learn which one of the previous layers is most useful:

$$\begin{eqnarray*} r &=& NN_1 \\ x &=& NN_2 \\ y &=& NN_3 \end{eqnarray*}$$

Gradient Flow¶

  • Layers that are used get more updates
  • Gradient signals which aspect was important
  • Can have extra layers

Selecting Choices¶

  • Gating gives us a binary choice
  • What if we want to select between many elements?
  • Softmax!

Softmax Gating¶

Combines many elements of X based on R

$$f(X, R) = X \times \text{softmax}(R)$$
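
A toy NumPy sketch of this gating, with one score per candidate row (names are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_gate(X, R):
    """Combine the rows of X (n, d) using weights softmax(R) over the n scores."""
    weights = softmax(R)   # a distribution over the n candidates
    return weights @ X     # weighted combination, shape (d,)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
R = np.array([2.0, 0.1, -1.0])   # candidate 0 gets most of the weight
print(softmax_gate(X, R))        # weighted mostly toward X[0]
```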

Softmax Gating¶

  • Brand name: Attention

Example: Translation¶

  • Show example

Example: GPT-3¶

  • Show example

QA¶