for outer_index in out.indices():
    for inner_val in range(J):
        out[outer_index] += A[outer_index[0], inner_val] * \
                            B[inner_val, outer_index[1]]
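Concretely, the same loop with plain Python lists (a sketch; it assumes out has shape (I, K), so outer_index is a pair (i, k) and inner_val walks the shared dimension J):

I, J, K = 2, 3, 4
A = [[1.0 * (i + j) for j in range(J)] for i in range(I)]
B = [[1.0 * (j * k) for k in range(K)] for j in range(J)]
out = [[0.0] * K for _ in range(I)]

for i in range(I):
    for k in range(K):          # (i, k) plays the role of outer_index
        for j in range(J):      # j plays the role of inner_val
            out[i][k] += A[i][j] * B[j][k]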
Code
ZIP STEP
C = zeros(broadcast_shape(A.view(I, J, 1), B.view(1, J, K)))
for C_outer in C.indices():
    C[C_outer] = A[C_outer[0], C_outer[1]] * \
                 B[C_outer[1], C_outer[2]]
REDUCE STEP
for outer_index in out.indices():
    for inner_val in range(J):
        out[outer_index] += C[outer_index[0], inner_val,
                              outer_index[1]]
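As a sanity check, the same zip-then-reduce decomposition can be written with NumPy broadcasting (a sketch, not the minitorch code): the zip step materializes the (I, J, K) intermediate and the reduce step sums out J.

import numpy as np

I, J, K = 2, 3, 4
A = np.random.rand(I, J)
B = np.random.rand(J, K)

# Zip step: broadcast (I, J, 1) against (1, J, K) into an (I, J, K) tensor.
C = A.reshape(I, J, 1) * B.reshape(1, J, K)

# Reduce step: sum out the shared dimension J.
out = C.sum(axis=1)
assert np.allclose(out, A @ B)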
Basic CUDA
def mm_shared1(out, a, b, K):
    # ... (shared-memory allocation and thread/global index setup elided)
    for s in range(0, K, TPB):
        # Each thread copies one element of the current tile into shared memory.
        sharedA[local_i, local_j] = a[i, s + local_j]
        sharedB[local_i, local_j] = b[s + local_i, j]
        # ... (synchronize so the tile is fully loaded before it is read)
        for k in range(TPB):
            t += sharedA[local_i, k] * sharedB[k, local_j]
    out[i, j] = t
How do you handle matrix sizes that are not a multiple of the tile width?
How does this interact with the block size? (See the guarded sketch below.)
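One common answer, sketched below with Numba CUDA (the kernel name mm_shared_guarded and TPB = 16 are assumptions; the course kernel elides these details): guard every global load and store with a bounds check and zero-pad the shared tile, so the block size no longer has to divide the matrix dimensions.

from numba import cuda, float64

TPB = 16  # tile width; also the block size in each dimension (an assumption)

@cuda.jit
def mm_shared_guarded(out, a, b):
    sharedA = cuda.shared.array(shape=(TPB, TPB), dtype=float64)
    sharedB = cuda.shared.array(shape=(TPB, TPB), dtype=float64)
    i, j = cuda.grid(2)
    local_i = cuda.threadIdx.x
    local_j = cuda.threadIdx.y
    K = a.shape[1]
    t = 0.0
    for s in range(0, K, TPB):
        # Guarded loads: zero-pad where the tile hangs off the matrix edge.
        if i < a.shape[0] and s + local_j < K:
            sharedA[local_i, local_j] = a[i, s + local_j]
        else:
            sharedA[local_i, local_j] = 0.0
        if s + local_i < K and j < b.shape[1]:
            sharedB[local_i, local_j] = b[s + local_i, j]
        else:
            sharedB[local_i, local_j] = 0.0
        cuda.syncthreads()  # tile fully loaded before anyone reads it
        for k in range(TPB):
            t += sharedA[local_i, k] * sharedB[k, local_j]
        cuda.syncthreads()  # finish reading before the next tile overwrites
    if i < out.shape[0] and j < out.shape[1]:
        out[i, j] = t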
Quiz


1) How do we handle input features?
2) How do we look at variable-size areas?
3) How do we predict multiple labels?


Get word vector
VOCAB = 1000
EMB = 100
embeddings = rand(EMB, VOCAB)
word = 20
embeddings[0, word]
0.7649509638874962

# Challenge: How to compute `backward`?
Get word vector
word_one_hot = tensor([0 if i != word else 1
                       for i in range(VOCAB)])
embeddings @ word_one_hot.view(VOCAB, 1)
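A quick NumPy check (a sketch, not minitorch) that the one-hot matmul recovers exactly the indexed column, which is why the existing matmul `backward` covers the lookup:

import numpy as np

VOCAB, EMB = 1000, 100
embeddings = np.random.rand(EMB, VOCAB)
word = 20

one_hot = np.zeros((VOCAB, 1))
one_hot[word, 0] = 1.0
# The matmul selects column `word` of the table.
assert np.allclose(embeddings @ one_hot, embeddings[:, word:word + 1])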

(word_emb1 * word_emb2).sum()  # dot-product similarity between two word vectors
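In NumPy terms (a sketch; word_emb1 and word_emb2 are assumed names for two columns of the table above):

import numpy as np

embeddings = np.random.rand(100, 1000)      # (EMB, VOCAB), as above
word_emb1 = embeddings[:, 20]               # vector for one word
word_emb2 = embeddings[:, 21]               # vector for another word
similarity = (word_emb1 * word_emb2).sum()  # same as word_emb1 @ word_emb2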
Easy to write as a layer
class Embedding(minitorch.Module):
    def __init__(self, vocab_size, emb_size):
        super().__init__()
        self.weights = \
            minitorch.Parameter(minitorch.rand((vocab_size, emb_size)))
        self.vocab_size = vocab_size

    def forward(self, input):
        return input @ self.weights.value
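A hedged usage sketch (assumes minitorch.tensor and calling forward directly; VOCAB, EMB, and word as defined earlier): with weights of shape (vocab_size, emb_size), a (batch, vocab_size) one-hot input yields (batch, emb_size) embeddings.

layer = Embedding(vocab_size=VOCAB, emb_size=EMB)
one_hot_row = minitorch.tensor(
    [0.0 if i != word else 1.0 for i in range(VOCAB)]
).view(1, VOCAB)                        # a batch with a single word
word_emb = layer.forward(one_hot_row)   # shape (1, EMB)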
Query 1
^(lisbon|portugal|america|washington|rome|athens|london|england|greece|italy)$
Query 2
^(doctor|patient|lawyer|client|clerk|customer|author|reader)$