Module 3.2 - Fusion and CUDA¶

Why are Python (and friends) "slow"?¶

  • Function calls
  • Types
  • Loops

Function Calls¶

  • Function calls are not free
  • Checks for args, special keywords, and lists
  • Methods check for overrides and class inheritance

Types¶

Critical code

out[o] = in_storage[j] + 3
  • Doesn't know the type of in_storage[j]
  • May need to coerce 3 to float or raise an error
  • May even call __add__ or __radd__! (see the sketch below)
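
A toy illustration of that dispatch (this Scalar class is hypothetical, not minitorch code):

class Scalar:
    def __init__(self, v):
        self.v = v

    def __radd__(self, other):
        # Python calls this when the left operand's __add__ gives up
        return Scalar(other + self.v)

print((3 + Scalar(4)).v)   # int.__add__ returns NotImplemented, so Scalar.__radd__ runs: 7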

How does it work?¶

Work

import numba

def my_code(x, y):
    for i in range(100):
        x[i] = y + 20

...
my_code(x, y)                          # plain Python: interpreted on every call
fast_my_code = numba.njit()(my_code)   # wrap the function for compilation
fast_my_code(x, y)                     # first call compiles for these argument types
fast_my_code(x, y)                     # later calls reuse the compiled code

Notebook¶

Colab Notebook

Terminology : JIT Compiler¶

  • Just-in-time
  • Waits until you call a function to compile it
  • Specializes the compiled code based on the argument types given (see the sketch below)
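
A minimal sketch of that specialization with numba (the scale function is illustrative):

import numba
import numpy as np

@numba.njit
def scale(x, c):
    return x * c

scale(np.arange(3), 2)       # compiles a version for (int array, int)
scale(np.arange(3.0), 2.5)   # new argument types trigger a second compilation
print(scale.signatures)      # lists the type signatures compiled so far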

Parallel Range¶

  • Replace for loops with a parallel version
  • Tells the compiler the iterations can run in any order
  • Be careful! Iterations must not depend on one another (see the sketch below)
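
A minimal sketch with numba's prange (function and variable names are illustrative):

import numba
import numpy as np

@numba.njit(parallel=True)
def add_ten(x, out):
    # prange promises the compiler that iterations are independent,
    # so they may run in any order, or concurrently
    for i in numba.prange(x.size):
        out[i] = x[i] + 10

x = np.arange(1_000.0)
out = np.zeros_like(x)
add_ten(x, out)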

Quiz¶

Outline¶

  • Fusion
  • CUDA
  • Threads

Operator Fusion¶

User API¶

  • Basic mathematical operations
  • Chained together as boxes with broadcasting
  • Optimization happens within each operator individually

Fusion¶

  • Optimization across operator boundaries
  • Saves time and memory by avoiding extra forward/backward passes
  • Can open up even greater optimization gains (example below)
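
A tiny illustration of the idea (a sketch, not minitorch's API): fusing an add with a ReLU into one loop avoids materializing the intermediate array for a + b.

import numba
import numpy as np

# Unfused: two passes over the data and one temporary array
def add_relu_unfused(a, b):
    tmp = a + b                     # allocates an intermediate
    return np.maximum(tmp, 0)

# Fused: one pass, no intermediate
@numba.njit
def add_relu_fused(a, b, out):
    for i in range(a.size):
        t = a[i] + b[i]
        out[i] = t if t > 0 else 0.0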

Automatic Fusion¶

  • Compiled languages can automatically fuse operators
  • Major area of research
  • Examples: TVM, XLA, ONNX

Manual Fusion¶

  • Utilize a pre-fused operator when needed
  • Standard libraries provide the implementations

Example: Matmul¶

In [2]:
chalk.hcat(
    [
        matrix(2, 4, colormap=lambda i, j: color(1, 4)(0, j)),
        matrix(4, 3, colormap=lambda i, j: color(1, 4)(0, i)),
    ],
    0.5,
)
Out[2]:

Example: Matmul¶

In [3]:
image_matmul_simple()
Out[3]:

Matmul Simple¶

In [4]:
image_matmul_full()
Out[4]:

Simple Matmul¶

A.shape == (I, J)
B.shape == (J, K)
out.shape == (I, K)

Simple Matmul Pseudocode¶

for outer_index in out.indices():
    for inner_val in range(J):
        out[outer_index] += A[outer_index[0], inner_val] * \
                            B[inner_val, outer_index[1]]
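
The same algorithm as concrete NumPy code (a sketch; out.indices() above stands for iterating over every (i, k) position of out):

import numpy as np

def simple_matmul(A, B):
    I, J = A.shape
    _, K = B.shape
    out = np.zeros((I, K))
    for i in range(I):              # outer_index = (i, k)
        for k in range(K):
            for j in range(J):      # inner_val: reduce over the shared dimension
                out[i, k] += A[i, j] * B[j, k]
    return out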

Complexities¶

  • Indices to strides (example below)
  • Minimizing index operations
  • Broadcasting
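
For instance, "indices to strides" converts a multidimensional index into a flat storage position (a sketch with made-up values):

# Contiguous (2, 4) matrix: one row step skips 4 elements, one column step skips 1
strides = (4, 1)
index = (1, 2)
position = index[0] * strides[0] + index[1] * strides[1]   # 1*4 + 2*1 = 6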

Matmul Speedups¶

What can be parallelized?

for outer_index in out.indices():
    for inner_val in range(J):
        out[outer_index] += A[outer_index[0], inner_val] * \
                            B[inner_val, outer_index[1]]
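
Every position of out is independent, so the loops over out's indices can run in parallel; the inner loop over J is a reduction into a single cell and stays sequential. A numba sketch (names are illustrative):

import numba
import numpy as np

@numba.njit(parallel=True)
def parallel_matmul(out, A, B):
    I, J = A.shape
    _, K = B.shape
    for i in numba.prange(I):       # independent across i: parallel
        for k in range(K):          # independent across k as well
            acc = 0.0
            for j in range(J):      # reduction: accumulate locally, write once
                acc += A[i, j] * B[j, k]
            out[i, k] = acc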

Advantages¶

  • No three-dimensional intermediate
  • No save_for_backward
  • Can use core matmul libraries (in the future)

CUDA¶

CUDA¶

  • NVidia's programming language for GPUs
  • Compute Unified Device Architecture
  • A parallel programming language

NVidia Structure¶

Where are we?¶

NVidia News

General Purpose GPUs¶

  • NVidia: can we make these programmable?
  • ~2008: the CUDA language

Machine Learning¶

  • Growth in ML parallels GPU development
  • Major deep learning results require GPUs
  • All modern training runs on GPUs (often many of them)

CUDA Challenges¶

  • Low-level control
  • Hard to code efficiently.
  • Our Goal: Enough to understand the benefits

Threads¶

Thread prange code¶

import numba

def add(a, b):
    b[0] = a[0] + 10           # write into the array; rebinding b would have no effect

cuda_add = numba.cuda.jit()(add)

cuda_add[1, 1](a, b)           # launch with 1 block, 1 thread

Threads code¶

@numba.cuda.jit()
def cuda_add(a, b):
    b[0] = a[0] + 10

cuda_add[1, 10](a, b)          # 1 block with 10 threads: each thread runs the same code

Threads code¶

@numba.cuda.jit()
def cuda_add(a, b):
    b[0] = a[0] + 10

cuda_add[1, (10, 10)](a, b)    # 1 block with a 10 x 10 arrangement of threads

Block code¶

@numba.cuda.jit()
def cuda_add(a, b):
    b[0] = a[0] + 10

cuda_add[(10, 10), (10, 10)](a, b)   # a 10 x 10 grid of blocks, each with 10 x 10 threads

Check¶

import numba
import numpy

@numba.cuda.jit()
def printer(a):
    print("hello!")
    a[:] = 10 + 50

a = numpy.zeros(10)
printer[10, 10](a)             # 10 blocks x 10 threads: the body runs 100 times

Output¶

Output

  hello!
  hello!
  hello!
  hello!
  hello!
  ...

Stack¶

  • Threads: Run the code
  • Block: Groups "close" threads
  • Grid: All the thread blocks
  • Total Threads: threads_per_block × total_blocks (worked example below)
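
For the launch on the "Block code" slide, the counts work out as:

total_blocks = 10 * 10                             # grid of 100 blocks
threads_per_block = 10 * 10                        # 100 threads per block
total_threads = threads_per_block * total_blocks   # 10,000 threads run the kernel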

Thread Names¶

Printing code

@numba.cuda.jit()
def printer(a):
    print(numba.cuda.threadIdx.x, numba.cuda.threadIdx.y)
    a[:] = 10 + 50

a = numpy.zeros(10)
printer[1, (10, 10)](a)

Output¶

Output

 6 3
 7 3
 8 3
 9 3
 0 4
 1 4
 2 4
 3 4
 4 4

Thread Names¶

@numba.cuda.jit()
def printer(a):
    print(numba.cuda.blockIdx.x,
          numba.cuda.threadIdx.x, numba.cuda.threadIdx.y)
    a[:] = 10 + 50

a = numpy.zeros(10)
printer[10, (10, 10)](a)

Output¶

Output

  7 6 9
  7 7 9
  7 8 9
  7 9 9
  2 6 9
  2 7 9

What's my name?¶

Name

BLOCKS_X = 32
BLOCKS_Y = 32
THREADS_X = 10
THREADS_Y = 10

@numba.cuda.jit()
def fn(a):
    # global position = block offset + position within the block
    x = numba.cuda.blockIdx.x * THREADS_X + numba.cuda.threadIdx.x
    y = numba.cuda.blockIdx.y * THREADS_Y + numba.cuda.threadIdx.y
    ...

fn[(BLOCKS_X, BLOCKS_Y), (THREADS_X, THREADS_Y)](a)

Simple Map¶

BLOCKS_X = 32
THREADS_X = 32

@numba.cuda.jit()
def fn(out, a):
    x = numba.cuda.blockIdx.x * THREADS_X + numba.cuda.threadIdx.x
    if x >= 0 and x < a.size:
        out[x] = a[x] + 10

fn[BLOCKS_X, THREADS_X](out, a)

Guards¶

Guards

x = numba.cuda.blockIdx.x * THREADS_X + numba.cuda.threadIdx.x
if x >= 0 and x < a.size:
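
The guard lets you launch more threads than there are elements. A common companion is ceil-dividing to pick just enough blocks (a sketch; add_ten is an illustrative kernel, and running it requires a CUDA device):

import numpy as np
from numba import cuda

THREADS_X = 32

@cuda.jit
def add_ten(out, a):
    x = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    if x < a.size:              # guard: spare threads in the last block do nothing
        out[x] = a[x] + 10

a = np.arange(100.0)
out = np.zeros_like(a)
blocks = (a.size + THREADS_X - 1) // THREADS_X     # ceil(100 / 32) = 4 blocks
add_ten[blocks, THREADS_X](out, a)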

Colab¶

  • https://colab.research.google.com/drive/1nzH-BHZi-LYK9Ee4t3xvfSr73-qaASwQ#scrollTo=mVmikf3wrekV

QA¶