

threads_per_block x total_blocks
Quiz
(Puzzles)
Compute sliding average over a list
sub_size = 2
a = [4, 2, 5, 6, 2, 4]
out = [3, 3.5, 5.5, 4, 3]
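A plain-Python reference is handy for checking a kernel's output against the example above. This is only a sketch; the name `slide_reference` is an assumption and does not come from the slides.

```python
def slide_reference(a, sub_size):
    """Average every window of `sub_size` consecutive elements of `a`."""
    n_out = len(a) - sub_size + 1  # number of full windows
    out = []
    for i in range(n_out):
        window = a[i:i + sub_size]
        out.append(sum(window) / sub_size)
    return out


print(slide_reference([4, 2, 5, 6, 2, 4], 2))
```

Note that the output has `len(a) - sub_size + 1` entries, which is why the last valid index `i` satisfies `i + sub_size == len(a)`.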
Compute with CUDA
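import numba.cuda

def slide_cuda(out, a):
    # Global index of this thread across all blocks.
    i = numba.cuda.blockIdx.x * THREADS \
        + numba.cuda.threadIdx.x
    # Only threads with a full window of sub_size elements write a result.
    if i + sub_size <= a.size:
        out[i] = 0
        for j in range(sub_size):
            out[i] += a[i + j]
        out[i] = out[i] / sub_size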
Two global reads per thread ::
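def slide_cuda(out, a):
    # Block-local tile plus a halo of sub_size extra elements
    # (the float64 dtype is an assumption; match it to the input array).
    shared = numba.cuda.shared.array(THREADS + sub_size, numba.float64)
    i = numba.cuda.blockIdx.x * THREADS \
        + numba.cuda.threadIdx.x
    local_idx = numba.cuda.threadIdx.x
    # Each thread copies one element into shared memory.
    if i < a.size:
        shared[local_idx] = a[i]
    # The first sub_size threads also copy the halo past the block's end.
    if local_idx < sub_size and i + THREADS < a.size:
        shared[local_idx + THREADS] = a[i + THREADS]
    # Every thread must reach the barrier, so it sits outside the guards.
    numba.cuda.syncthreads()
    if i + sub_size <= a.size:
        temp = 0
        for j in range(sub_size):
            temp += shared[local_idx + j]
        out[i] = temp / sub_size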
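To see what staging through shared memory buys, the global-memory reads of the two kernels can be counted in a rough pure-Python model. This is a sketch under stated assumptions: `count_global_reads` is a hypothetical helper, the block size of 4 is chosen only to keep the numbers small, and the model assumes `sub_size <= threads`.

```python
def count_global_reads(n, sub_size, threads, use_shared):
    """Count reads of `a` from global memory for an input of length n."""
    blocks = (n + threads - 1) // threads  # enough blocks to cover the input
    reads = 0
    for block in range(blocks):
        for local_idx in range(threads):
            i = block * threads + local_idx
            if use_shared:
                if i < n:
                    reads += 1  # a[i] copied into shared memory
                if local_idx < sub_size and i + threads < n:
                    reads += 1  # halo element copied into shared memory
            elif i + sub_size <= n:
                reads += sub_size  # a[i] ... a[i + sub_size - 1]
    return reads


for sub_size in (2, 3):
    direct = count_global_reads(16, sub_size, 4, use_shared=False)
    staged = count_global_reads(16, sub_size, 4, use_shared=True)
    print(sub_size, direct, staged)
```

In this model the direct kernel's read count grows with `sub_size`, while the staged kernel reads each element roughly once plus a small halo per block.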
Compute sum reduction over a list
a = [4, 2, 5, 6, 1, 2, 4, 1]
out = [25]
Formula $$a = 4 + 2 + 5 + 6 + 1 + 2 + 4 + 1$$
Same as $$a = (((4 + 2) + (5 + 6)) + ((1 + 2) + (4 + 1)))$$
Round 1 $$a = (((4 + 2) + (5 + 6)) + ((1 + 2) + (4 + 1)))$$
Round 2 $$a = ((6 + 11) + (3 + 5))$$
Round 3 $$a = (17 + 8)$$
Round 4 $$a = 25$$
Round 1 (4 threads needed, 8 loads) $$a = (((4 + 2) + (5 + 6)) + ((1 + 2) + (4 + 1)))$$
Round 2 (2 threads needed, 4 loads) $$a = ((6 + 11) + (3 + 5))$$
Round 3 (1 thread needed, 2 loads) $$a = (17 + 8)$$
Round 4 $$a = 25$$
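The rounds above can be replayed with a short pairwise summation in plain Python. This is a minimal sketch that assumes the input length is a power of two; `reduce_rounds` is a hypothetical name.

```python
def reduce_rounds(values):
    """Return the intermediate lists produced by pairwise summation.

    Assumes len(values) is a power of two.
    """
    current = list(values)
    rounds = [current]
    while len(current) > 1:
        # Each pair of neighbours would be handled by one thread.
        current = [current[k] + current[k + 1]
                   for k in range(0, len(current), 2)]
        rounds.append(current)
    return rounds


for number, values in enumerate(reduce_rounds([4, 2, 5, 6, 1, 2, 4, 1])):
    print(number, values)
```

Each round halves the list, so 8 elements need 3 rounds of additions rather than 7 sequential ones.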
| Thread 0 | Thread 1 | Thread 2 | Thread 3 |
|----------|----------|----------|----------|
| 4 + 2    | 5 + 6    | 1 + 2    | 4 + 1    |
| 6 + 11   | (zzz)    | 3 + 5    | (zzz)    |
| 17 + 8   | (zzz)    | (zzz)    | (zzz)    |
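A per-thread schedule of this shape can be generated by simulating one block in plain Python. This is a sketch, not the slides' kernel: `simulate_block_reduce` is a hypothetical name, the list `shared` stands in for shared memory, and the input length is assumed to be a power of two.

```python
def simulate_block_reduce(a):
    """Simulate one block reducing `a` in place, round by round."""
    shared = list(a)              # stands in for the block's shared memory
    n_threads = len(shared) // 2  # one thread per pair in the first round
    schedule = []                 # per round: what each thread did
    stride = 1                    # distance between the two operands
    while stride < len(shared):
        row = []
        for t in range(n_threads):
            if t % stride == 0:   # active thread in this round
                left = 2 * t
                row.append(f"{shared[left]} + {shared[left + stride]}")
                shared[left] += shared[left + stride]
            else:                 # idle thread
                row.append("(zzz)")
        schedule.append(row)
        # On a GPU, a barrier (numba.cuda.syncthreads()) would separate rounds.
        stride *= 2
    return shared[0], schedule


total, schedule = simulate_block_reduce([4, 2, 5, 6, 1, 2, 4, 1])
for row in schedule:
    print(row)
print(total)
```

Half of the active threads go idle after every round, which is the cost of this simple indexing scheme.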
start, e.g. 0
Sequence may have more elements than our block.
We do not want to share values between blocks.
However, we can run the code multiple times.
Formula $$a = 4 + 2 + 5 + 6 + 1 + 2 + 4 + 1 + 10$$
Block size 8 $$a = (((4 + 2) + (5 + 6)) + ((1 + 2) + (4 + 1))) + 10$$
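One way to read "run the code multiple times" is as repeated passes: each pass lets every block reduce only its own slice, and the next pass reduces the per-block results. The sketch below models this in plain Python; `block_sums` and `multi_pass_sum` are hypothetical names, and the block size is assumed to be at least 2.

```python
BLOCK_SIZE = 8


def block_sums(values, block_size=BLOCK_SIZE):
    """One pass: each block sums its own slice; nothing crosses blocks."""
    return [sum(values[start:start + block_size])
            for start in range(0, len(values), block_size)]


def multi_pass_sum(values, block_size=BLOCK_SIZE):
    """Repeat passes until one value is left (block_size must be >= 2)."""
    if not values:
        return 0, [[]]
    passes = [list(values)]
    while len(passes[-1]) > 1:
        passes.append(block_sums(passes[-1], block_size))
    return passes[-1][0], passes


total, passes = multi_pass_sum([4, 2, 5, 6, 1, 2, 4, 1, 10])
for values in passes:
    print(values)
print(total)
```

For the nine-element example, the first pass leaves one partial sum per block and the second pass combines them, so no values are ever shared between blocks within a pass.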