========================
Module 3 - Efficiency
========================

.. image:: figs/gpu/threadid@3x.png
   :align: center

In addition to helping simplify code, tensors provide a basis for speeding
up computation. In fact, they are really the only way to write efficient
deep learning code in a slow language like Python. However, nothing we have
done so far really makes anything faster than :doc:`module0`. This module is
focused on taking advantage of tensors to write fast code, first on standard
CPUs and then using GPUs.

All starter code is available in https://github.com/minitorch/Module-3 .

To begin, remember to activate your virtual environment first, and then
clone your assignment:

>>> git clone {{STUDENT_ASSIGNMENT3_URL}}
>>> cd {{STUDENT_ASSIGNMENT_NAME}}
>>> pip install -Ue .

You need the files from the previous assignments, so make sure to pull them
over to your new repo. Be sure to continue to follow the :doc:`contributing`
guidelines.

.. toctree::
   :maxdepth: 1
   :glob:
   :caption: Guides

   parallel
   matrixmult
   cuda


Tasks
********

For this assignment you will need to run your commands in the Google Colab
virtual environment. Follow these instructions for `Colab setup`_.


Task 3.1: Parallelization
==================================

.. note:: This task requires basic familiarity with Numba `prange`.
   Be sure to very carefully read the section on :doc:`parallel`,
   `Numba <https://numba.pydata.org/>`_, and review :doc:`module2`.

The main backend of our codebase consists of three functions: `map`, `zip`,
and `reduce`. If we can speed up these three, everything we have built so
far will get faster. This exercise asks you to utilize Numba and the `njit`
function to speed up these functions. In particular, if you can utilize
parallelization through `prange`, you can get some big wins. Be careful
though! Parallelization can lead to subtle bugs.

In order to help debug this code, we have created a parallel analytics
script for you::

    python project/parallel_test.py

Running this script will run NUMBA diagnostics on your functions.

.. todo::
   Complete the following in `minitorch/fast_ops.py` and pass tests marked
   as `task3_1`. Furthermore, include the diagnostics output from the above
   script in your README. Your code should have at least one parallelized
   loop.

.. autofunction:: minitorch.fast_ops.tensor_map
.. autofunction:: minitorch.fast_ops.tensor_zip
.. autofunction:: minitorch.fast_ops.tensor_reduce


Task 3.2: Matrix Multiplication
==================================

.. note:: This task requires basic familiarity with matrix multiplication.
   Be sure to read the Guide on :doc:`matrixmult`.

Matrix multiplication is key to all of the models that we have trained so
far. In the last module, we computed matrix multiplication using
broadcasting. In this task, we ask you to implement it directly as a
function. Do your best to make the function efficient, but for now all that
matters is that you correctly produce a multiply function that passes our
tests and has some parallelism.

In order to use this function, you will also need to add a new `MatMul`
Function to `tensor_functions.py`. We have added a version in the starter
code you can copy. You might also find it useful to add a slow broadcasted
`matrix_multiply` to `tensor_ops.py` for debugging.

In order to help debug this code, we have created a parallel analytics
script for you::

    python project/parallel_test.py

Running this script will run NUMBA diagnostics on your functions.
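As a rough illustration of the kind of loop Numba can parallelize, here is a
minimal sketch of a naive parallel matrix multiply over plain NumPy arrays.
This is not the assignment solution: your `tensor_matrix_multiply` must work
on minitorch's flat storage with strides and batch broadcasting, which this
toy version ignores. The function name `naive_matmul` is purely
illustrative::

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)
    def naive_matmul(a, b, out):
        # The outer loop runs in parallel across rows; each prange
        # iteration writes to a distinct row of `out`, so threads
        # never write to the same location.
        for i in prange(a.shape[0]):
            for j in range(b.shape[1]):
                acc = 0.0
                for k in range(a.shape[1]):
                    acc += a[i, k] * b[k, j]
                out[i, j] = acc

    a = np.random.rand(64, 32)
    b = np.random.rand(32, 16)
    out = np.zeros((64, 16))
    naive_matmul(a, b, out)
    assert np.allclose(out, a @ b)

Note the pattern: parallelize only the outer loop, and accumulate into a
local variable before a single write to `out`, so the parallel iterations
stay independent.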
After you finish this task, you may want to skip ahead to Task 3.5 and
experiment with training on the real task under speed conditions.

.. todo::
   Complete the following function in `minitorch/fast_ops.py` and copy the
   `MatMul` Function to `tensor_functions.py`. Pass tests marked as
   `task3_2`.

.. autofunction:: minitorch.fast_ops.tensor_matrix_multiply


Task 3.3: CUDA Operations
==================================

.. note:: This task requires basic familiarity with CUDA. Be sure to read
   the Guide on :doc:`cuda` and the Numba CUDA guide.

We can do even better than parallelization if we have access to specialized
hardware. This task asks you to build a GPU implementation of the backend
operations. It will be hard to match what PyTorch does, but if you are
clever you can make these computations really fast (aim for 2x the speed of
Task 3.1).

.. todo::
   Complete the following functions in `minitorch/cuda_ops.py`, and pass
   the tests marked as `task3_3`.

.. autofunction:: minitorch.cuda_ops.tensor_map
.. autofunction:: minitorch.cuda_ops.tensor_zip
.. autofunction:: minitorch.cuda_ops.tensor_reduce


Task 3.4: CUDA Matrix Multiplication
=================================================

.. note:: This task requires basic familiarity with CUDA. Be sure to read
   the Guide on :doc:`cuda` and the Numba CUDA guide.

Finally, we can combine both of these approaches and implement CUDA
`matmul`. This operation is probably the most important in all of deep
learning and is central to making models fast. Again, we first strive for
correctness, but the faster you can make it, the better.

.. todo::
   Implement `minitorch/cuda_ops.py` with CUDA, and pass tests marked as
   `task3_4`.

.. autofunction:: minitorch.cuda_ops.tensor_matrix_multiply
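To give a flavor of what Numba CUDA code looks like, here is a minimal
sketch of a matrix-multiply kernel on plain 2-D arrays (the kernel name and
launch configuration are illustrative, not part of the assignment API). The
real `tensor_matrix_multiply` must additionally handle strides,
broadcasting, and batching, and a fast version should use shared memory as
described in :doc:`cuda`. Running this requires a CUDA-capable GPU, e.g. on
Colab::

    import numpy as np
    from numba import cuda

    @cuda.jit
    def matmul_kernel(a, b, out):
        # Each thread computes one output cell from its global grid
        # position, guarded so out-of-range threads do nothing.
        i, j = cuda.grid(2)
        if i < out.shape[0] and j < out.shape[1]:
            acc = 0.0
            for k in range(a.shape[1]):
                acc += a[i, k] * b[k, j]
            out[i, j] = acc

    a = np.random.rand(64, 64)
    b = np.random.rand(64, 64)
    out = np.zeros((64, 64))
    threadsperblock = (16, 16)
    blockspergrid = (
        (out.shape[0] + 15) // 16,
        (out.shape[1] + 15) // 16,
    )
    # Passing NumPy arrays directly makes Numba copy them to and from
    # the device; explicit device arrays avoid redundant transfers.
    matmul_kernel[blockspergrid, threadsperblock](a, b, out)
    assert np.allclose(out, a @ b)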
Task 3.4b: Extra Credit
=======================

Implementing matrix multiplication and reduction efficiently is hugely
important for many deep learning tasks. We have seen one method for
implementing these functions, but there are many more optimizations that
you could apply. For extra credit, first read these two tutorials:

* https://numba.pydata.org/numba-doc/latest/cuda/reduction.html
* https://nyu-cds.github.io/python-numba/05-cuda/

Then implement a version of CUDA `tensor_reduce` or `tensor_matrix_multiply`
that takes advantage of other aspects of the GPU. To get full credit, you
should document your code to show us that you understand each line. Prove to
us that these changes lead to speed-ups on large matrix operations by making
a graph comparing them to the naive operations.


Task 3.5: Training
========================

If your code works, you should now be able to move on to the tensor training
script in `project/run_fast_tensor.py`. This code uses the same basic
training setup as :doc:`module2`, but now utilizes your fast tensor code. We
have left the `matmul` layer blank for you to implement with your tensor
code.

.. todo::
   * Implement the missing functions in `project/run_fast_tensor.py`. These
     should do exactly the same thing as the corresponding functions in
     `project/run_tensor.py`, but now use the faster backend.
   * Train a tensor model and add your results for all three datasets to
     the README.
   * Run a bigger model and record the time per epoch reported by the
     trainer.

Here are the commands::

    python run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET split --RATE 0.05
    python run_fast_tensor.py --BACKEND cpu --HIDDEN 100 --DATASET split --RATE 0.05

(As a reference, our parallel implementation gave a 10x speedup.) On a
standard Colab GPU setup, aim for your CPU run to get below 2 seconds per
epoch and your GPU run to get below 1 second per epoch. (With some
cleverness you can do much better.)
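If you want a quick sanity check on epoch timing outside the trainer's own
logging, a sketch like the following can help; `train_one_epoch` is a
hypothetical stand-in for one full pass of your training loop::

    import time

    def time_epochs(train_one_epoch, n_epochs=10):
        # Average wall-clock seconds per epoch. The warm-up call is
        # excluded because the first epoch also pays Numba/CUDA
        # compilation cost.
        train_one_epoch()  # warm-up: triggers JIT compilation
        start = time.time()
        for _ in range(n_epochs):
            train_one_epoch()
        return (time.time() - start) / n_epochs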