Compute sliding average over a list
sub_size = 2
a = [4, 2, 5, 6, 2, 4]
out = [3, 3.5, 5.5, 4, 3]
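As a reference point, the same computation in plain Python (sliding_reference is an illustrative name, not part of the CUDA code below) ::

def sliding_reference(a, sub_size):
    # out[i] is the mean of the window a[i : i + sub_size]
    return [sum(a[i:i + sub_size]) / sub_size
            for i in range(len(a) - sub_size + 1)]

assert sliding_reference([4, 2, 5, 6, 2, 4], 2) == [3, 3.5, 5.5, 4, 3]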
CUDA ::
def sliding(out, a):
    i = numba.cuda.blockIdx.x * TPB \
        + numba.cuda.threadIdx.x
    # Guard: the window a[i : i + sub_size] must fit inside a.
    if i + sub_size <= a.size:
        out[i] = 0
        for j in range(sub_size):
            out[i] += a[i + j]
        out[i] = out[i] / sub_size
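A possible host-side launch (a sketch; it assumes sliding is compiled with numba.cuda.jit and that TPB and sub_size are the module-level constants the kernel refers to) ::

import numpy as np
import numba.cuda

TPB = 32        # threads per block (assumed module-level constant)
sub_size = 2    # window size from the example above

a = np.array([4, 2, 5, 6, 2, 4], dtype=np.float64)
out = np.zeros(len(a) - sub_size + 1)     # one slot per window
blocks = (len(a) + TPB - 1) // TPB        # enough blocks to cover a
sliding[blocks, TPB](out, a)              # Numba transfers host arrays in and back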
Two global reads per thread ::
def sliding(out, a):
    # Block-local buffer: TPB elements plus sub_size of overlap from the next block.
    shared = numba.cuda.shared.array(TPB + sub_size, numba.float64)
    i = numba.cuda.blockIdx.x * TPB \
        + numba.cuda.threadIdx.x
    local_idx = numba.cuda.threadIdx.x
    if i < a.size:
        shared[local_idx] = a[i]
        # The first sub_size threads also fetch the overlap region.
        if local_idx < sub_size and i + TPB < a.size:
            shared[local_idx + TPB] = a[i + TPB]
    numba.cuda.syncthreads()
    if i + sub_size <= a.size:
        temp = 0
        for j in range(sub_size):
            temp += shared[local_idx + j]
        out[i] = temp / sub_size
Compute sum reduction over a list
a = [4, 2, 5, 6, 1, 2, 4, 1]
out = [25]
Formula $$a = 4 + 2 + 5 + 6 + 1 + 2 + 4 + 1$$ Same as $$a = (((4 + 2) + (5 + 6)) + ((1 + 2) + (4 + 1)))$$
Round 1 (4 threads needed, 8 loads) $$a = (((4 + 2) + (5 + 6)) + ((1 + 2) + (4 + 1)))$$
Round 2 (2 threads needed, 4 loads) $$a = ((6 + 11) + (3 + 5))$$
Round 3 (1 thread needed, 2 loads) $$a = (17 + 8)$$
Round 4 $$a = 25$$
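A minimal kernel following this scheme (a sketch; sum_reduce and BLOCK_DIM are illustrative names, BLOCK_DIM is assumed to be a power of two, and each block writes one partial sum) ::

def sum_reduce(out, a):
    shared = numba.cuda.shared.array(BLOCK_DIM, numba.float64)
    i = numba.cuda.blockIdx.x * BLOCK_DIM + numba.cuda.threadIdx.x
    local_idx = numba.cuda.threadIdx.x
    # Each thread loads one element (0 if past the end of a).
    if i < a.size:
        shared[local_idx] = a[i]
    else:
        shared[local_idx] = 0.0
    numba.cuda.syncthreads()
    # Each round halves the number of active threads, as in the rounds above.
    stride = BLOCK_DIM // 2
    while stride > 0:
        if local_idx < stride:
            shared[local_idx] += shared[local_idx + stride]
        numba.cuda.syncthreads()
        stride //= 2
    if local_idx == 0:
        out[numba.cuda.blockIdx.x] = shared[0]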
Quiz
$$ \begin{eqnarray*} \text{lin}(x; w, b) &=& x_1 \times w_1 + x_2 \times w_2 + b \\ \end{eqnarray*} $$
draw_equation(
    [m(4, 1), m(4, 1), None, m(1, 1)]
)
Compute dot product for a batch of examples $x^{1}, \ldots, x^{J}$
draw_equation(
    [m(5, 4), m(4, 1), None, m(5, 1)]
)
$$ \begin{eqnarray*} \text{lin}(x; w, b) &=& x_1 \times w_1 + x_2 \times w_2 + b \\ h_1 &=& \text{ReLU}(\text{lin}(x; w^0, b^0)) \\ h_2 &=& \text{ReLU}(\text{lin}(x; w^1, b^1))\\ m(x_1, x_2) &=& \text{lin}(h; w, b) \end{eqnarray*} $$ Parameters: $w_1, w_2, w^0_1, w^0_2, w^1_1, w^1_2, b, b^0, b^1$
draw_equation(
    [m(5, 4), m(4, 3), None, m(5, 3)]
)
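As a shape check, the batched hidden layer is a single matrix multiply; a sketch with NumPy and illustrative names (5 examples, 4 features, 3 hidden units, matching the drawing above) ::

import numpy as np

x = np.random.rand(5, 4)         # batch of 5 examples, 4 features each
w0 = np.random.rand(4, 3)        # one weight column per hidden unit
b0 = np.random.rand(3)
h = np.maximum(x @ w0 + b0, 0)   # ReLU(lin(x; w0, b0)) for the whole batch
assert h.shape == (5, 3)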
image_matmul_full()

image_matmul_simple()
A.shape == (I, J)
B.shape == (J, K)
out.shape == (I, K)
for outer_index in out.indices():
    for inner_val in range(J):
        out[outer_index] += A[outer_index[0], inner_val] * \
                            B[inner_val, outer_index[1]]
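A runnable NumPy version of this loop (a sketch; np.ndindex stands in for out.indices()) ::

import numpy as np

I, J, K = 2, 3, 4
A, B = np.random.rand(I, J), np.random.rand(J, K)
out = np.zeros((I, K))
for outer_index in np.ndindex(I, K):
    for inner_val in range(J):
        out[outer_index] += A[outer_index[0], inner_val] * \
                            B[inner_val, outer_index[1]]
assert np.allclose(out, A @ B)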
Code
ZIP STEP
C = zeros(broadcast_shape(A.view(I, J, 1), B.view(1, J, K)))
for C_outer in C.indices():
    # C_outer = (i, j, k); zip multiplies A[i, j] with B[j, k].
    C[C_outer] = A[C_outer[0], C_outer[1]] * \
                 B[C_outer[1], C_outer[2]]
REDUCE STEP
for outer_index in out.indices():
    for inner_val in range(J):
        # Sum over the middle (J) dimension of C.
        out[outer_index] += C[outer_index[0], inner_val,
                              outer_index[1]]
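The same two steps with NumPy broadcasting (a sketch; the intermediate C is materialized explicitly, just as in the pseudocode) ::

import numpy as np

I, J, K = 2, 3, 4
A, B = np.random.rand(I, J), np.random.rand(J, K)
C = A.reshape(I, J, 1) * B.reshape(1, J, K)   # zip step: C[i, j, k] = A[i, j] * B[j, k]
out = C.sum(axis=1)                           # reduce step: sum over the shared J dimension
assert np.allclose(out, A @ B)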
What can be parallelized?
for outer_index in out.indices():
    for inner_val in range(J):
        out[outer_index] += A[outer_index[0], inner_val] * \
                            B[inner_val, outer_index[1]]
Basic CUDA ::
def mm_simple(out, a, b, K):
    i = numba.cuda.blockIdx.x * TPB \
        + numba.cuda.threadIdx.x
    j = numba.cuda.blockIdx.y * TPB \
        + numba.cuda.threadIdx.y
    for k in range(K):
        out[i, j] += a[i, k] * b[k, j]
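A possible launch over a 2D grid (a sketch; it assumes mm_simple is compiled with numba.cuda.jit, out starts at zero, and the 2x2 case from below) ::

import numpy as np
import numba.cuda

TPB = 2
a = np.random.rand(2, 2)
b = np.random.rand(2, 2)
out = np.zeros((2, 2))
# One 2x2 block of threads covers the whole 2x2 output.
mm_simple[(1, 1), (TPB, TPB)](out, a, b, 2)
assert np.allclose(out, a @ b)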
Which elements does out[i, j] depend on?
How many are there?

Assume a, b, out are all 2x2 matrices
Idea -> copy all needed values to shared?
Basic CUDA ::
def mm_shared1(out, a, b, K):
    ...
    sharedA[local_i, local_j] = a[i, j]
    sharedB[local_i, local_j] = b[i, j]
    ...
    for k in range(K):
        t += sharedA[local_i, k] * sharedB[k, local_j]
    out[i, j] = t
Basic CUDA ::
def mm_shared1(out, a, b, K):
    ...
    for s in range(0, K, TPB):
        sharedA[local_i, local_j] = a[i, s + local_j]
        sharedB[local_i, local_j] = b[s + local_i, j]
        ...
        for k in range(TPB):
            t += sharedA[local_i, k] * sharedB[k, local_j]
    out[i, j] = t
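Unlike the first version, which only works while the whole matrix fits in a single block, this one walks over tiles of size TPB along the shared dimension. Filling in the elided pieces, one plausible complete version (a sketch; mm_shared_full is an illustrative name, and it assumes square TPB x TPB blocks and matrix sizes that are exact multiples of TPB) ::

def mm_shared_full(out, a, b, K):
    sharedA = numba.cuda.shared.array((TPB, TPB), numba.float64)
    sharedB = numba.cuda.shared.array((TPB, TPB), numba.float64)
    i = numba.cuda.blockIdx.x * TPB + numba.cuda.threadIdx.x
    j = numba.cuda.blockIdx.y * TPB + numba.cuda.threadIdx.y
    local_i = numba.cuda.threadIdx.x
    local_j = numba.cuda.threadIdx.y
    t = 0.0
    for s in range(0, K, TPB):
        # Each thread copies one element of the current tile of a and b.
        sharedA[local_i, local_j] = a[i, s + local_j]
        sharedB[local_i, local_j] = b[s + local_i, j]
        numba.cuda.syncthreads()
        for k in range(TPB):
            t += sharedA[local_i, k] * sharedB[k, local_j]
        # Wait before overwriting the tile with the next chunk.
        numba.cuda.syncthreads()
    out[i, j] = t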
How do you handle matrices of different sizes?
How does this interact with the block size?