hb-pytorch

Repo to hold HammerBlade PyTorch port. Based on PyTorch v1.4.0

PyTorch HammerBlade Port

This work aims to port PyTorch to HammerBlade.

How to build PyTorch to use COSIM

This assumes that you have a working HB cosimulation installation set up through bsg_bladerunner. Then:

  • Enable devtoolset-8 (e.g., scl enable devtoolset-8 bash) or any other toolchain that supports C++14.

  • Set the following variable to point to your bsg_bladerunner clone:

    export BRG_BSG_BLADERUNNER_DIR=<path to bsg_bladerunner that has been set up>
    
  • Clone hb-pytorch repo:

    git clone -b hb-device git@github.com:cornell-brg/hb-pytorch.git
    
  • Create a Python virtual environment:

    python3.6 -m venv ./venv_pytorch
    
  • Install dependencies:

    pip install --upgrade pip
    pip install numpy pyyaml mkl mkl-include setuptools cmake cffi typing sklearn tqdm pytest ninja hypothesis thop pillow
    
  • Remove automatically installed PyTorch:

    pip uninstall torch
    
  • Init PyTorch third party dependencies:

    git submodule update --init --recursive
    
  • Set up build environment variables:

    cd hb-pytorch && source setup_cosim_build_env.sh
    
  • Build PyTorch. This step can take up to 15 minutes:

    python setup.py develop
    

    The above command also compiles the device kernels with the RISC-V toolchain and installs the kernel binary. Optionally, the kernels can be compiled with Clang by running the following instead:

    CLANG=1 python setup.py develop
    

    Note that CLANG=1 must be present every time we build or re-build hb-pytorch in order to compile the kernels with Clang. To check whether the current build compiled the kernels with Clang, we can run:

    readelf -p .comment <hb-pytorch-root>/build/c10/hammerblade/kernel.riscv
    

    The output when compiled with Clang should be something like this:

    String dump of section '.comment':
      [     0]  clang version 10.0.0 (https://github.com/bespoke-silicon-group/llvm-project.git 3ee81f3def2c4c2a818f9f939f4421b3f3af313e)
      [    7a]  GCC: (GNU) 9.2.0
    
  • PyTorch can be used with cosim by running one of the following executables instead of python:

    • pycosim: Runs python with cosim backend
    • pycosim.trace: Enables device instruction trace
    • pycosim.wave: Enables device instruction trace AND waveform dumps

    For example, a PyTorch program foo.py can be executed with hb-pytorch's cosim backend with one of the following:

    pycosim foo.py
    pycosim.trace foo.py # To get HB device execution trace
    pycosim.wave foo.py # To get HB device execution trace and RTL simulation waveform.
    

How to build PyTorch with Emulation Layer

  • Clone this repository:

    git clone git@github.com:cornell-brg/hb-pytorch.git
    
  • Create a Python virtual environment:

    python3 -m venv ./venv_pytorch
    source ./venv_pytorch/bin/activate
    
  • Install some dependencies:

    pip install numpy pyyaml mkl mkl-include setuptools cmake cffi typing sklearn tqdm pytest ninja hypothesis
    
  • Init PyTorch third party dependencies:

    git submodule update --init --recursive
    
  • Set up build environment variables:

    source setup_emul_build_env.sh
    
  • Build PyTorch. This step can take up to 15 minutes:

    python setup.py develop
    
  • Turn on emulation debug info:

    export HBEMUL_DEBUG=1
    
  • Set up the emulated HB device size:

    export HBEMUL_TILE_X_DIM=16
    export HBEMUL_TILE_Y_DIM=8
    

Run Pytests

  • Go to the pytest directory: cd hb-pytorch/hammerblade/torch
  • Run the pytest runner: python pytest_runner.py
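
Assuming standard pytest conventions, an individual test file can also be run on its own. A minimal sketch that invokes pytest programmatically on a single (hypothetical) test file from hammerblade/torch:

import pytest

# equivalent to running: python -m pytest tests/test_sigmoid.py -v
pytest.main(['tests/test_sigmoid.py', '-v'])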

Important files and directories related to HammerBlade

Files used to run pytest (adapted from Baseline)

  • hammerblade/fragments/
  • hammerblade/environment.mk
  • baseline-README.md
  • run-hb-pytest.sh (source this one to run pytest!)
  • hammerblade/torch/

HammerBlade device code

  • hammerblade/torch/kernel

Pytest tests

  • hammerblade/torch/tests/

Files that interact with the HammerBlade CUDALite runtime

  • c10/hammerblade/

How to implement a new kernel

  1. Register the kernel for HammerBlade with PyTorch by editing aten/src/ATen/native/native_functions.yaml:

     func: sigmoid(Tensor self) -> Tensor
     use_c10_dispatcher: full
     supports_named_tensor: True
     variants: function, method
     dispatch:
       CPU: sigmoid
       CUDA: sigmoid
    +  HammerBlade: sigmoid
       MkldnnCPU: mkldnn_sigmoid
  2. Add host code to aten/src/ATen/native/hammerblade/Sigmoid.cpp. Start with the simplest possible host code, without calling the kernel.
  3. Add tests to hammerblade/torch/tests/test_sigmoid.py (see the sketch after this list).
  4. With the Emulation Layer, make sure the code compiles and the tests fail only because of incorrect results.
  5. Add kernel code to hammerblade/torch/kernel/kernel_sigmoid.cpp, again as the simplest possible version.
  6. Change the host code to be more realistic: call the kernel and do nothing else.
  7. Implement both the host and kernel code for real, assuming a 1x1 tile group.
  8. Make sure everything passes on the Emulation Layer, and write more tests. Then you are ready to create a PR!
  9. Make sure your code works on COSIM.
  10. Apply optimizations, like parallelization.
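
As a starting point for step 3, a test can simply compare the HB result against the CPU result. A minimal sketch of what hammerblade/torch/tests/test_sigmoid.py might contain (the conventions of the existing tests may differ):

import torch

def test_sigmoid():
    x = torch.randn(10)
    h = x.hammerblade()          # move the input to the HB device
    out = torch.sigmoid(h)       # run the new HB kernel
    # compare against the CPU reference implementation
    assert torch.allclose(out.cpu(), torch.sigmoid(x))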

Kernel Development Tips

  1. Maintaining two clones, one for emulation and one for cosim (e.g., hb-pytorch/ and hb-pytorch-cosim/), eases the burden of cosim evaluation. This requires two separate PyTorch environments as well (e.g., venv_pytorch and venv_pytorch_cosim).

  2. Ideally, you should only need to build and run once to debug an issue. Use gdb extensively with emulation:

$ gdb python
(gdb) b tensorlib_sigmoid
(gdb) r -m pytest test_sigmoid.py

Linking becomes a bottleneck when re-building in a tight loop, so gdb can save a lot of time compared to printf debugging.

  3. Sometimes new .cpp files are not picked up by CMake. Since kernel authors only ever need to add new files to either aten/src/ATen/native or hammerblade/torch/, running the following commands might solve the failure:

touch aten/src/ATen/CMakeLists.txt   # New host code sources
touch c10/hammerblade/CMakeLists.txt # New device code sources

Native Profiling Tools

Native Profiling tools provide ATen-operator-level info, including a per-operator execution time breakdown and info about unimplemented HB operators.

To enable the profiling tools, call torch.hammerblade.profiler.enable(). To disable them, call torch.hammerblade.profiler.disable(). To test whether the profiler is currently running, call torch.hammerblade.profiler.is_in_ROI().

import torch

# start of ROI
torch.hammerblade.profiler.enable()
x = torch.randn(10)
y = x + x
# end of ROI
torch.hammerblade.profiler.disable()
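
is_in_ROI() reports whether execution is currently inside the profiled region of interest, i.e., between an enable() and a disable() call. A minimal sketch:

import torch

torch.hammerblade.profiler.enable()
assert torch.hammerblade.profiler.is_in_ROI()      # inside the ROI
torch.hammerblade.profiler.disable()
assert not torch.hammerblade.profiler.is_in_ROI()  # outside the ROI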

To read the profiling data, call torch.hammerblade.profiler.stats(). By default, this returns a string with the per-ATen-operator execution time (ExecTime) and the unimplemented operators (Unimpl). One may also pass in a list of options using the keyword argument key. Available options are ExecTime, ExecTime-Latex, ExecTime-Raw, and Unimpl.

import torch

torch.hammerblade.profiler.enable()
x = torch.randn(10)
torch.hammerblade.profiler.disable()

print(torch.hammerblade.profiler.stats(key=['ExecTime-Raw'],
      trimming=True))

Here trimming is a "simulated time" correction mechanism.
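
Since key takes a list, the options can presumably be combined, e.g. to get both the execution time breakdown and the unimplemented operators in a single report. A sketch:

import torch

torch.hammerblade.profiler.enable()
x = torch.randn(10)
y = x + x
torch.hammerblade.profiler.disable()

# request two report sections at once
print(torch.hammerblade.profiler.stats(key=['ExecTime', 'Unimpl']))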

HB Profiling

HB Kernel Call Logs

HB emulation can output a file with the list of kernel calls along with associated data in JSON format. It can be used as follows:

import torch
import torch.hammerblade.kernel_logger as hblog

x = torch.rand(2, 2).hammerblade()
y = torch.rand(2, 2).hammerblade()

# Enables the log
hblog.enable()

print(x + y)

# Disables the log
hblog.disable()

# This is excluded from the log
print(x - y)

# Logs only the tensor add
print(hblog.json())

# Clears above operations from the logger
hblog.clear()

hblog.enable()
print(x * y)
hblog.disable()

# Logs only the tensor mul
print(hblog.json())

HB Key Kernel Charting

Chart provides a way to log the "execution chart" of key kernels in a workload.

To use Chart, one needs to register one or more ATen operator signatures.

import torch

M = torch.randn(2, 3)
mat1 = torch.randn(2, 3)
mat2 = torch.randn(3, 3)

# reset chart
torch.hammerblade.profiler.chart.clear()
# add signature
torch.hammerblade.profiler.chart.add("at::Tensor at::CPUType::{anonymous}::addmm(const at::Tensor&, const at::Tensor&, const at::Tensor&, c10::Scalar, c10::Scalar)")
# turn on profiling
torch.hammerblade.profiler.enable()
# run addmm
torch.addmm(M, mat1, mat2)
# end profiling
torch.hammerblade.profiler.disable()
# dump chart
print(torch.hammerblade.profiler.chart.json())

The output should be

[
    {
        "offload": false,
        "signature": "at::Tensor at::CPUType::{anonymous}::addmm(const at::Tensor&, const at::Tensor&, const at::Tensor&, c10::Scalar, c10::Scalar)"
    }
]

HB Key Kernel Redispatching

One may choose to redispatch a kernel that would normally run on the CPU to HB with Route. Route takes in the JSON produced by Chart. To redispatch a kernel, one just needs to change "offload": false to "offload": true.

import torch
import json

M = torch.randn(2, 3)
mat1 = torch.randn(2, 3)
mat2 = torch.randn(3, 3)

route = """[
{
    "offload": false,
    "signature": "at::Tensor at::CPUType::{anonymous}::addmm(const at::Tensor&, const at::Tensor&, const at::Tensor&, c10::Scalar, c10::Scalar)"
},
{
    "offload": true,
    "signature": "at::Tensor at::CPUType::{anonymous}::add(const at::Tensor&, const at::Tensor&, c10::Scalar)"
}
]
"""
data = json.loads(route)
torch.hammerblade.profiler.route.set_route_from_json(data)

torch.hammerblade.profiler.enable()
torch.addmm(M, mat1, mat2)
# this add should be redispatched to HB
torch.add(M, mat1)
torch.hammerblade.profiler.disable()