pip install -r requirements.extra.txt
pip install -r requirements.txt
pip install -e .
Copy the files below from your Assignment 1 implementation:
autodiff.py -> minitorch/autodiff.py
run_sentiment.py -> project/run_sentiment_linear.py
Note the slightly different suffix _linear.
Please ONLY copy your Assignment 1 solutions for MatrixMultiplyKernel, mapKernel, zipKernel, and reduceKernel into the combine.cu file for Assignment 2.
combine.cu -> src/combine.cu
We have made some changes to combine.cu and cuda_kernel_ops.py for Assignment 2 compared with Assignment 1. We have relocated the GPU memory allocation, deallocation, and memory copying operations (both host-to-device and device-to-host transfers) from cuda_kernel_ops.py to combine.cu. We have also changed the datatype of Tensor._tensor._storage from numpy.float64 to numpy.float32.
bash compile_cuda.sh
We're still missing a few important arithmetic operations for Transformers, namely element-wise (e-wise) power and element-wise tanh.
1. Implement the forward and backward functions for the Tanh and PowerScalar tensor functions in minitorch/tensor_functions.py
Recall from lecture the structure of minitorch: calling .tanh() on a tensor, for example, will call a Tensor Function defined in tensor_functions.py. These functions are implemented on the CudaKernelBackend, which executes the actual operations on the tensors.
You should utilize tanh_map and pow_scalar_zip, which have already been added to the TensorBackend class, which your CudaKernelOps should then implement.
Don't forget to save the necessary values in the context in the forward pass for your backward pass when calculating the derivatives.
Since we're taking e-wise tanh and power, your gradient calculation should be very simple.
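If it helps, the underlying element-wise math can be sanity-checked in plain NumPy. The snippet below is only an illustration of the derivatives you'll need, not the minitorch Function boilerplate; the variable names are ours.

import numpy as np

x = np.array([0.5, 1.0, 2.0], dtype=np.float32)
p = 3.0

# PowerScalar: forward computes x ** p, backward uses d/dx x^p = p * x^(p-1)
pow_out = x ** p
pow_grad = p * x ** (p - 1)   # multiply element-wise by grad_output in backward

# Tanh: forward computes tanh(x), backward uses d/dx tanh(x) = 1 - tanh(x)^2
tanh_out = np.tanh(x)
tanh_grad = 1.0 - tanh_out ** 2   # only the *output* is needed, so saving
                                  # tanh(x) in the context during forward suffices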
Edit the following snippet in your __device__ float fn function in src/combine.cu:
case POW: {
return;
}
case TANH: {
return;
}
Complete the CUDA code to support element-wise power and tanh.
You can look up the relevant mathematical functions here: CUDA Math API
The accompanying tests are in tests/test_tensor_general_student.py
Run the following to test an individual function, e.g.:
python -m pytest -l -v -k "test_pow_1_student"
Run the following to test all parts of Problem 1:
python -m pytest -l -v -m a2_1
We provide an Adam optimizer for HW2 in optim.py. You should be able to verify Adam's behavior through the performance of run_sentiment_linear.py once you've implemented PowerScalar.
python project/run_sentiment_linear.py
Its validation performance should exceed 60% within 5 epochs.
You will be implementing all the functions and modules necessary to build a decoder-only transformer model. PLEASE READ THE IMPLEMENTATION DETAILS SECTION BEFORE STARTING for advice on working with miniTorch.
Implement the GELU activation, logsumexp, one_hot, and softmax_loss functions in minitorch/nn.py
The accompanying tests are in tests/test_nn_student.py
Hints:
- one_hot: Since MiniTorch doesn't support slicing/indexing with tensors, you'll want to utilize Numpy's eye function. You can use the .to_numpy() function for MiniTorch Tensors here. (Try to avoid using this in other functions because it's expensive.) See the sketch after these hints.
- softmax_loss: You'll want to make use of your previously implemented one_hot function.
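For instance, a minimal sketch of the one_hot idea. The tensor_from_numpy import location and keyword are assumptions about your copy of minitorch, and the function name and parameters here are purely illustrative.

import numpy as np
from minitorch import tensor_from_numpy  # assumed import location

def one_hot_sketch(labels, num_classes, backend):
    # Pull the integer class ids out to numpy, index rows of the identity
    # matrix, then wrap the result back into a minitorch tensor on the backend.
    ids = labels.to_numpy().astype(int)                   # shape (n,)
    onehots = np.eye(num_classes, dtype=np.float32)[ids]  # shape (n, num_classes)
    return tensor_from_numpy(onehots, backend=backend)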
Run the following to test an individual function, e.g.:
python -m pytest -l -v -k "test_gelu_student"
Run the following to test all parts of Problem 2:
python -m pytest -l -v -m a2_2
Implement the Embedding, Dropout, Linear, and LayerNorm1d modules in minitorch/modules_basic_student.py
The accompanying tests are in tests/test_modules_basic.py
Updates:
- Dropout: Feel free to ignore the 3rd section of the dropout test that employs p=0.5. It fails unexpectedly because of a random seed problem.
- Linear: For people who've cloned the repo already, there is a typo in the initialization of the Linear layer. Please use Uniform(-sqrt(1/in_features), sqrt(1/in_features)) to initialize your weights, as in PyTorch.
Hints:
- Embedding: You'll want to use your one_hot function to easily get embeddings for all your tokens. The tests here exercise your one_hot function in combination with your Embedding module.
- Dropout: Please use numpy.random.binomial with the appropriate parameters and shape for your mask (see the sketch after these hints).
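A rough numpy-level sketch of the two hinted pieces. The Parameter/tensor_from_numpy lines in the comments are assumptions about your copy of minitorch, the sizes are made up, and whether your Dropout rescales by 1/(1-p) (or draws a drop-mask with probability p instead of a keep-mask) should follow whatever the tests expect.

import numpy as np

in_features, out_features, p = 4, 3, 0.1  # illustrative sizes and dropout probability

# Linear: weights drawn from Uniform(-sqrt(1/in_features), sqrt(1/in_features)), as stated above.
bound = np.sqrt(1.0 / in_features)
w_np = np.random.uniform(-bound, bound, size=(in_features, out_features)).astype(np.float32)
# In the module you would then wrap this so miniTorch updates it, e.g. (assumed API):
#   self.weights = Parameter(tensor_from_numpy(w_np, backend=backend, requires_grad=True))

# Dropout: a 0/1 keep-mask drawn with numpy.random.binomial, one draw per element.
x_shape = (2, in_features)
mask_np = np.random.binomial(1, 1.0 - p, size=x_shape).astype(np.float32)
# Multiply the input element-wise by this mask during training; when the module
# is in eval mode (or dropout is disabled), return the input unchanged.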
Run the following to test an individual function, e.g.:
python -m pytest -l -v -k "test_embedding_student"
Run the following to test all parts of Problem 3:
python -m pytest -l -v -m a2_3
Implement the MultiHeadAttention, FeedForward, TransformerLayer, and DecoderLM modules in minitorch/modules_transfomer_student.py.
The accompanying tests are in tests/test_modules_transformer.py
Run the following to test an individual function, e.g.:
python -m pytest -l -v -k "test_multihead_attention_student"
Run the following to test all parts of Problem 4:
python -m pytest -l -v -m a2_4
Implement a machine translation pipeline in project/run_machine_translation.py
Once all blanks are filled, run
python project/run_machine_translation.py
The outputs and BLEU scores will be saved in ./workdir.
You should get a BLEU score of around 7 after the first epoch and around 20 after 10 epochs. Each epoch takes around an hour.
You'll get all points if your BLEU score exceeds 10.
- Always add backend
Always ensure your parameters are initialized with the correct backend (your CudaKernelOps) so that they run correctly.
- Initializing parameters
When initializing weights in a Module, always wrap them with Parameter(...); otherwise miniTorch will not update them.
- Requiring Gradients
When you initialize parameters, e.g. in LayerNorm, make sure you set requires_grad_ for any parameters or tensors that you'll need to update.
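Putting the three tips above together, parameter setup in a module usually looks something like the sketch below. The import locations, the tensor_from_numpy keyword names, and the .value access are assumptions about your copy of minitorch, so double-check them against the provided modules; TinyScale is a made-up toy module.

import numpy as np
from minitorch import Module, Parameter, tensor_from_numpy  # assumed import locations

class TinyScale(Module):
    """Toy module with a single learnable vector, showing the registration pattern."""

    def __init__(self, dim, backend):
        super().__init__()
        w = np.ones(dim, dtype=np.float32)
        # Build the tensor on your CUDA backend, mark it as requiring gradients,
        # and wrap it in Parameter(...) so miniTorch tracks and updates it.
        t = tensor_from_numpy(w, backend=backend)  # assumed signature
        t.requires_grad_(True)
        self.weight = Parameter(t)

    def forward(self, x):
        return x * self.weight.value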
- Using _from_numpy functions
We've provided a new set of tensor initialization functions, e.g. tensor_from_numpy. Feel free to use them in functions like one_hot, since minitorch doesn't support slicing, or at other times when you need numpy functionality that minitorch doesn't support. In those cases, you can call .to_numpy() and compute your desired operation. However, use this sparingly as it impacts your performance.
- Initializing weights
You'll need to initialize weights from certain distributions. You may want to do so with Numpy's random functions and use tensor_from_numpy to create the corresponding tensor.
- Broadcasting - implicit broadcasting
Unlike numpy or torch, we don't have the broadcast_to function available. However, we do have implicit broadcasting: e.g. given tensors of shape (2, 2) and (1, 2), you can add the two tensors and the second tensor will be broadcast to the shape of the first using standard broadcasting rules. You will encounter this when building your modules, so keep it in mind if you ever feel like you need broadcast_to.
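For example (a sketch; the backend construction line is an assumption about how CudaKernelOps is wired into TensorBackend, so copy whatever the provided scripts use):

import minitorch
from minitorch.cuda_kernel_ops import CudaKernelOps  # assumed module path

backend = minitorch.TensorBackend(CudaKernelOps)  # assumed construction

x = minitorch.tensor([[1.0, 2.0], [3.0, 4.0]], backend=backend)  # shape (2, 2)
y = minitorch.tensor([[10.0, 20.0]], backend=backend)            # shape (1, 2)
z = x + y  # y is implicitly broadcast to (2, 2); no broadcast_to needed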
- Contiguous Arrays
Some operations like view require arrays to be contiguous. Sometimes adding a .contiguous() may fix your error.
- No sequential
There is no easy way to add sequential modules. Do not put transformer layers in a list/iterable and iterate through it in your forward function, because miniTorch will not recognize the layers inside it as sub-modules, so their parameters will not be registered or updated.
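Concretely, the safe pattern looks like the sketch below. Inner stands in for a real TransformerLayer; the import locations and backend construction are the same assumptions as in the earlier sketches.

import minitorch
from minitorch import Module, Parameter  # assumed import locations
from minitorch.cuda_kernel_ops import CudaKernelOps  # assumed module path

backend = minitorch.TensorBackend(CudaKernelOps)  # assumed construction

class Inner(Module):
    """Stand-in for a real layer with one learnable parameter."""
    def __init__(self):
        super().__init__()
        self.scale = Parameter(minitorch.tensor([2.0], backend=backend, requires_grad=True))
    def forward(self, x):
        return x * self.scale.value

class Stack(Module):
    def __init__(self):
        super().__init__()
        # GOOD: each sub-module is its own attribute, so miniTorch registers it
        # and its parameters show up in .parameters() for the optimizer.
        self.block_a = Inner()
        self.block_b = Inner()
        # BAD: self.blocks = [Inner(), Inner()] -- miniTorch would not see these,
        # and their parameters would never be updated.
    def forward(self, x):
        return self.block_b(self.block_a(x))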
- Batch Matrix Multiplication
We support batched matrix multiplication: given tensors A and B of shape (a, b, m, n) and (a, b, n, p), A @ B will be of shape (a, b, m, p), where the matrix multiplication is applied independently for every index along dimensions 0 and 1.
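For instance, the shape bookkeeping for attention-style scores might look like this sketch (q, k, and the sizes are made up; permute and contiguous are used as described in the surrounding tips, and the backend construction is the same assumption as in the earlier sketches):

import minitorch
from minitorch.cuda_kernel_ops import CudaKernelOps  # assumed module path

backend = minitorch.TensorBackend(CudaKernelOps)  # assumed construction
batch, heads, seq, dim = 2, 4, 8, 16

q = minitorch.rand((batch, heads, seq, dim), backend=backend)
k = minitorch.rand((batch, heads, seq, dim), backend=backend)

# Swap the last two axes of k, make it contiguous (see the Contiguous Arrays tip),
# then let batched matmul handle dimensions 0 and 1 implicitly.
kT = k.permute(0, 1, 3, 2).contiguous()   # (batch, heads, dim, seq)
scores = (q @ kT) / (dim ** 0.5)          # -> (batch, heads, seq, seq)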
- MiniTorch behavior when preserving dimensions
MiniTorch sometimes behaves differently from Numpy. For example, minitorch.tensor([[1, 2], [3, 4]]).sum(1).shape == (2, 1), whereas np.array([[1, 2], [3, 4]]).sum(1).shape == (2,). If you're relying on broadcasting in your operations and you're getting errors, be careful about the shapes.
- Linear/Layernorm
You need to make your input tensors 2D before passing them to these modules, e.g. by collapsing the batch and sequence dimensions and restoring the shape afterwards.
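For example, a typical reshape-apply-reshape pattern (shapes and names are illustrative; backend construction as assumed in the earlier sketches):

import minitorch
from minitorch.cuda_kernel_ops import CudaKernelOps  # assumed module path

backend = minitorch.TensorBackend(CudaKernelOps)  # assumed construction
batch, seq, dim = 2, 8, 16
x = minitorch.rand((batch, seq, dim), backend=backend)

# Collapse (batch, seq, dim) -> (batch * seq, dim). view needs a contiguous
# layout, hence the .contiguous() (see the Contiguous Arrays tip above).
x2d = x.contiguous().view(batch * seq, dim)
# ... pass x2d through your Linear / LayerNorm1d here ...
out = x2d.view(batch, seq, dim)  # restore the original 3-D shape afterwards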