Assignment 2: Graph Executor with TVM

In Assignment 1, we implemented the User API layer (computation graph and autodiff) of the deep learning system stack. In this assignment, we will go down the stack, and implement a simple version of the rest of the stack.

We need to implement a computation graph executor that can manage memory for users and execute the forward and backward passes. We also need to implement deep learning kernels using a compiler-based approach through TVM.

In the end, we will have implemented a simple version of the entire deep learning system stack. Our code should be able to construct a simple MLP model using the computation graph API implemented in Assignment 1, and train and test the model using operators generated by TVM.

The key concepts and data structures that we need to implement are:

Shape inference on computation graph given input shapes.
Executor memory management for computation graph.
TVM kernel implementations of common kernels, e.g. Relu, MatMul, Softmax.
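
As a reference point for the kernels in the last item, here is a minimal NumPy sketch of what Relu, its gradient, Softmax, and softmax cross-entropy are usually taken to compute. The function names and the batch-mean convention for the loss are assumptions made for illustration; check the test suite for the exact semantics it expects.

```python
import numpy as np


def relu(x):
    # Elementwise max(x, 0).
    return np.maximum(x, 0)


def relu_gradient(x, grad):
    # Pass the incoming gradient through only where the input was positive.
    return (x > 0).astype(x.dtype) * grad


def softmax(x):
    # Subtract the row-wise max before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=1, keepdims=True)


def softmax_cross_entropy(logits, labels):
    # Assumed convention: one-hot (or soft) labels, loss averaged over the batch.
    probs = softmax(logits)
    return np.mean(-np.sum(labels * np.log(probs), axis=1))
```

Comparing a TVM kernel's output against a reference like this on small random inputs is a quick way to localize a bug before running the full tests.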

Overview of Module

python/dlsys/autodiff.py: Implements computation graph, autodiff, executor.
python/dlsys/tvm_op.py: Implementation of kernels using TVM.
tests/test_tvm_op.py: Test suite for all TVM ops.
tests/mnist_dlsys.py: Training loop for MLP.

What do you need to do?

Understand the code skeleton and tests. Fill in implementation wherever marked """TODO: Your code here""".

There are only two files with TODOs for you.

python/dlsys/autodiff.py
python/dlsys/tvm_op.py

In autodiff.py, you need to implement shape inference and memory management for the executor. Make sure your executor reuses memory across training iterations.
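
The two pieces fit together roughly as follows: infer every node's shape from the feed shapes by walking the graph in topological order, then allocate output buffers once and keep them until the feed shapes change. The sketch below is a toy illustration only. `Node`, `infer_shape`, `Executor.prepare`, and the op tags are hypothetical names, not the interfaces in the autodiff.py skeleton, and NumPy arrays stand in for whatever array type the skeleton uses.

```python
import numpy as np


class Node:
    """Hypothetical graph node: a name, an op tag, and input nodes."""

    def __init__(self, name, op, inputs=()):
        self.name = name
        self.op = op
        self.inputs = list(inputs)


def infer_shape(node, input_shapes):
    """Return the output shape of `node` given its input shapes."""
    if node.op == "matmul":
        (m, k), (k2, n) = input_shapes
        if k != k2:
            raise ValueError("matmul inner dimensions differ: %d vs %d" % (k, k2))
        return (m, n)
    if node.op in ("add", "relu"):
        return tuple(input_shapes[0])
    raise ValueError("no shape rule for op %r" % node.op)


class Executor:
    def __init__(self, topo_order):
        self.topo_order = topo_order  # nodes sorted so inputs come first
        self.feed_shapes = None
        self.node_shapes = {}
        self.buffers = {}
        self.alloc_count = 0  # bookkeeping for this example only

    def prepare(self, feed_shapes):
        # Same feed shapes as last time: keep the existing buffers.
        if feed_shapes == self.feed_shapes:
            return
        shapes = dict(feed_shapes)
        for node in self.topo_order:
            if node.name in shapes:
                continue  # fed placeholder, shape already known
            in_shapes = [shapes[i.name] for i in node.inputs]
            shapes[node.name] = infer_shape(node, in_shapes)
        # Allocate one output buffer per computed (non-fed) node.
        self.buffers = {
            name: np.empty(shape, dtype=np.float32)
            for name, shape in shapes.items()
            if name not in feed_shapes
        }
        self.node_shapes = shapes
        self.feed_shapes = dict(feed_shapes)
        self.alloc_count += 1


x = Node("x", "placeholder")
w = Node("w", "placeholder")
h = Node("h", "matmul", [x, w])
y = Node("y", "relu", [h])

executor = Executor([x, w, h, y])
feed = {"x": (100, 784), "w": (784, 256)}
executor.prepare(feed)
first_buffer = executor.buffers["y"]
executor.prepare(feed)  # same shapes: no new allocation
print(executor.node_shapes["y"], executor.buffers["y"] is first_buffer)
```

Keying reallocation on the feed shapes means a training loop with a fixed batch size allocates once, while a differently sized final batch or dev set triggers a single re-inference and reallocation.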

In tvm_op.py, you need to implement kernels using TVM, and you need to come up with an optimized schedule for the matrix multiply kernel that achieves at least a 10x speedup compared to the default schedule (see more details below).

The available primitives can be found at TVM scheduling primitives.

Test cases

There are 12 tests in tests/test_tvm_op.py. We will grade your TVM kernel implementations based on 9 of those tests; the other 3 are implemented by us and are not graded (see the grading rubric).

We will also grade your implementation of shape inference and memory management based on tests/mnist_dlsys.py.

We will grade your optimized matrix multiply kernel by checking whether there is a ~10x reduction in per-epoch running time.

Compile

export PYTHONPATH="${PYTHONPATH}:/path/to/assignment2/python"

Run all tests with

# sudo pip install nose
nosetests -v tests/test_tvm_op.py

Run neural nets training and testing with

# see cmd options with 
# python tests/mnist_dlsys.py -h

# run logistic regression on numpy
python tests/mnist_dlsys.py -l -m logreg
# run MLP on numpy
python tests/mnist_dlsys.py -l -m mlp

If your implementation is correct, you should see:

a generally decreasing loss value over epochs
a dev set accuracy of about 92% for logreg and about 97% for MLP on MNIST with the default parameters

If you use default TVM schedules for all kernels, MLP training will be noticeably slower than logistic regression because the matrix multiply kernel with the default schedule is highly unoptimized. To get full score on this assignment, you need to optimize the matrix multiply kernel using a combination of techniques such as blocking, vectorization, loop permutation, and others mentioned in lecture.
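
To make the idea of blocking concrete, here is a small NumPy sketch of the loop structure. It is not a TVM schedule and will not be fast by itself; it only shows how the iteration space is cut into tiles so that each tile of the inputs is reused while it is still in cache. The function name and the default block size are arbitrary choices for illustration. In TVM, the same restructuring is expressed with scheduling primitives such as split or tile, reorder, and vectorize.

```python
import numpy as np


def matmul_blocked(a, b, block=32):
    """Compute a @ b one (block x block) tile at a time."""
    m, k = a.shape
    k2, n = b.shape
    if k != k2:
        raise ValueError("inner dimensions must match")
    c = np.zeros((m, n), dtype=a.dtype)
    # Outer loops walk over tiles; slicing clips the last partial tile.
    for i0 in range(0, m, block):
        for j0 in range(0, n, block):
            for p0 in range(0, k, block):
                # Inner tile product: accumulates into one output tile.
                c[i0:i0 + block, j0:j0 + block] += (
                    a[i0:i0 + block, p0:p0 + block]
                    @ b[p0:p0 + block, j0:j0 + block]
                )
    return c


rng = np.random.default_rng(0)
a = rng.standard_normal((50, 70))
b = rng.standard_normal((70, 45))
print(np.allclose(matmul_blocked(a, b, block=16), a @ b))
```

The block size is the main tuning knob: tiles should be small enough to stay in cache but large enough to amortize loop overhead, so it is worth trying a few values.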

References for optimizing TVM kernels:

With default schedules, you may see some numbers like

qiao$ python tests/mnist_dlsys.py -l -m mlp
=== Build 3-layer MLP model...
Loading data...
Start training loop...
epoch 0
loss = 0.567862; Time taken this epoch = 47.856022 s
epoch 1
loss = 0.312280; Time taken this epoch = 47.192292 s

With optimized schedules, you may see some numbers like

qiao$ python tests/mnist_dlsys.py -l -m mlp
=== Build 3-layer MLP model...
Loading data...
Start training loop...
epoch 0
loss = 0.568680; Time taken this epoch = 5.455820 s
epoch 1
loss = 0.309237; Time taken this epoch = 4.663048 s

qiao$ python tests/mnist_dlsys.py -l -m mlp
=== Build 3-layer MLP model...
Loading data...
Start training loop...
epoch 0
loss = 0.568680; Time taken this epoch = 2.073490 s
epoch 1
loss = 0.309237; Time taken this epoch = 1.202837 s

Your per-epoch running time will not exactly match ours, but you should aim for a 10x reduction.
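
As a quick check of what a 10x reduction looks like, the speedup is the baseline epoch time divided by the optimized epoch time. The helper below is a hypothetical convenience, not part of the assignment code; the timings are the epoch 1 values from the sample logs above.

```python
def speedup(baseline_seconds, optimized_seconds):
    """Ratio of baseline to optimized per-epoch time."""
    return baseline_seconds / optimized_seconds


# Epoch 1 timings copied from the sample logs.
default_epoch = 47.192292
optimized_epochs = [4.663048, 1.202837]

for t in optimized_epochs:
    print(round(speedup(default_epoch, t), 1))  # 10.1, then 39.2
```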

Grading rubrics

test_tvm_op.test_matrix_elementwise_add ... Implemented by us, not graded.
test_tvm_op.test_matrix_elementwise_add_by_const ... 1pt
test_tvm_op.test_matrix_elementwise_mul ... 1pt
test_tvm_op.test_matrix_elementwise_mul_by_const ... 1pt
test_tvm_op.test_matrix_multiply ... 2pt
test_tvm_op.test_conv2d ... 2pt
test_tvm_op.test_relu ... 1pt
test_tvm_op.test_relu_gradient ... 1pt
test_tvm_op.test_softmax ... 1pt
test_tvm_op.test_softmax_cross_entropy ... 2pt
test_tvm_op.test_reduce_sum_axis_zero ... Implemented by us, not graded.
test_tvm_op.test_broadcast_to ... Implemented by us, not graded.
mnist with logistic regression ... 2 pt
mnist with MLP ... 3 pt
mnist with MLP and optimized matrix multiply schedule ... 3 pt

Total: 12 pt + 8 pt = 20 pt

Submitting your work

Please submit your assignment2.tar.gz to Canvas dropbox under Assignment 2. Due: 5/8/2018, 5pm.

# compress
tar czvf assignment2.tar.gz assignment2/