antares: A C++ repository from sahil280114

What is Antares:

Antares is an automatic engine to generate multi-platform kernels with optimization for DNN developers (targeting to backends like CUDA/ROCm/CPU/DirectX12/Graphcore/OneAPI/..). It is also a framework for Hardware developers to extend new backends/hareware quickly and easily. Antares provides IR that follows "One Language Syntax for All Platforms", and general-purpose device access APIs that hide the differences of not only DNN description but also device mapping.

Features
- Backend Extension
- Effective Auto Tuning
- Einsum-based Antares IR
- Framework JIT Extension (Op Maker Plugin for Pytorch/Tensorflow/Tensorflow2)
How to Use Antares
- Senario-1: Quick Start for Developers that Use Antares to Tune Operator/Sub-graph in Foreground Terminal
- Senario-2: Quick Start for Developers that Use Antares to Extend Operator/Sub-graph in Pytorch/Tensorflow
Antares Pre-dependencies for Different Backends
- Linux-based: cuda, rocm, mcpu, scpu, gc, sycl_intel, sycl_cuda, ocl_amdgpu, ocl_nvidia, ocl_android, ..
- Windows-based: cuda_win64, rocm_win64, hlsl_win64, ..
About Microsft Open Source

About Antares Features:

a. Backend Extension

The current version of Antares supports code generation for the following backends (in orange blocks) and devices (in black blocks):

b. Effective Auto Tuning

Auto tuning by Antares contributes to not only much less tuning time, but also equivalent or better performance for Intra-op/Inter-op execution (against TVM Ansor).

c. Einsum-based Antares IR

Antares IR is the frontend of both kernel generation and automatic optimization.
The syntax of Antares IR is slim to describe most MLP/CNN/RNN/LSTM/Transformer based models like MNIST/ResNet/BERT/GPT/..

E.g. The following computation logic describes a layer of standard BERT transformer:

  merged_layer_local[R, B, S1, N1, H1] +=! input_tensor[B, S1, N, H] * qkv_weight[R, N, H, N1, H1];
  merged_layer_trans[R, B, N1, S1, H1] = merged_layer_local[R, B, S1, N1, H1] + qkv_bias[R, N1, H1];
  attention_scores[B, N1, S1, S2] +=! merged_layer_trans[0, B, N1, S1, H1] * merged_layer_trans[1, B, N1, S2, H1] / const({H}).cast(`float32`);
    softmax_1_temp0[B, N1] >=! attention_scores[B, N1, S1, S2];
    softmax_1_temp1[B, N1] +=! (attention_scores[B, N1, S1, S2] - softmax_1_temp0[B, N1]).call(`exp`);
  attention_probs[B, N1, S1, S2] = (attention_scores[B, N1, S1, S2] - softmax_1_temp0[B, N1]).call(`exp`) / softmax_1_temp1[B, N1];
  ... ...
  layer_norm_2_src[B, S1, N2, H2] = layer_output[B, S1, N2, H2] + attention_output_norm[B, S1, N2, H2];
    layer_norm_2_temp0[B, S1] += layer_norm_2_src[B, S1, N2, H2];
    layer_norm_2_temp1[B, S1] += layer_norm_2_src[B, S1, N2, H2] * layer_norm_2_src[B, S1, N2, H2];
  layer_output_norm[B, S1, N2, H2] = (layer_norm_2_src[B, S1, N2, H2] * {N * H} - layer_norm_2_temp0[B, S1]) * (layer_norm_2_temp0[B, S1] * {N * H} - layer_norm_2_temp1[B, S1] * layer_norm_2_temp1[B, S1]).call(`max`, [1e-8]).call(`rsqrt`);

For more IR usage or examples, please follow documentation here: Antares IR & Examples

d. Pytorch/Tensorflow/Tensorflow2 Op Maker (JIT Plugin)

Antares provides JIT plugin for Pytorch/Tensorflow/Tensorflow2 to help frameworks to easily extend new operators, e.g.:

# Tensorflow/Tensorflow2 Example:
op = antares.make_op(ir='dot_0[N, M] +=! data[N, K] * weight[K, M]', feed_dict={'data': x, 'weight': y}).emit()
result_1 = sess.run(op)
print('The custom result_1 is:\n%s' % result_1)
result_2 = sess.run(tf.add(op, op))
print('The custom result_2 is:\n%s' % result_2)  

# Pytorch Example:
custom_op = CustomOp(ir='dot_0[N, M] +=! data[N, K] * weight[K, M]', input_orders={'data': x, 'weight': y}).to(device).emit()
result = custom_op(x, y)
print('The custom result is:', result)

For complete programs, please follow examples here: Antares Examples for Pytorch and Antares Examples for TF/TF2

How to Use Antares?

Senario-1: Quick Start for Developers that Use Antares to Tune Operator/Sub-graph in Foreground Terminal:

Step-1 (Recommend for User with Root): Prepare Environment for Docker Mode

# Setup Package dependencies
sudo apt install docker.io

# Get Antares
git clone https://github.com/microsoft/antares --branch v0.2.x
cd antares/

# Set default backend type:
echo 'c-cuda' > backend.default

# Build the environment with sudo (if this step failed, please go to "Pre-dependencies" section to check which "backend-related dependencies" are missing):
sudo make

Step-1 (Recommend for Non-root Users): Prepare Environment for Host Mode

# Ensure Package dependencies (Please ensure the following root-required dependencies has been installed.)
sudo apt install git python3-pip g++ make g++-mingw-w64-x86-64

# Get Antares
git clone https://github.com/microsoft/antares --branch v0.2.x
cd antares/

# Set default backend type:
echo 'c-cuda' > backend.default

# Build the environment without sudo (if this step failed, please go to "Pre-dependencies" section to check which "backend-related dependencies" are missing):
make

All valid backends are listed in directory antares/backends

Step-2: Tune a Specific Workload in Foreground

# Example-1: Run the following command in bash to tune MatMul (4096, 4096) x (4096, 4096) using 2000 trials:
COMMIT=force STEP=2000 COMPUTE_V1='- S = 4096; einstein_v2(input_dict={"input0": {"dtype": "float32", "shape": [S, S]}, "input1": {"dtype": "float32", "shape": [S, S]}}, exprss="output0[N, M] +=! input0[N, K] * input1[K, M]")' make

# Example-2: Run the following command in bash to tune MNIST-inference using 5000 trials:
COMMIT=force STEP=5000 COMPUTE_V1='- einstein_v2(input_dict={"data": {"dtype": "float32", "shape": [64, 784]}, "weight_0": {"dtype": "float32", "shape": [784, 512]}, "weight_1": {"dtype": "float32", "shape": [512, 512]}, "weight_2": {"dtype": "float32", "shape": [512, 10]}, "bias_0": {"dtype": "float32", "shape": [512]}, "bias_1": {"dtype": "float32", "shape": [512]}, "bias_2": {"dtype": "float32", "shape": [10]}}, extra_outputs=[], exprss="data_0[N, M] +=!  data[N, K] * weight_0[K, M];   data_1[N, K] =   (data_0[N, K] + bias_0[K]).call(`max`, [0.0]);   data_2[N, M] +=!  data_1[N, K] * weight_1[K, M];   data_3[N, K] =   (data_2[N, K] + bias_1[K]).call(`max`, [0.0]);   data_4[N, M] +=!  data_3[N, K] * weight_2[K, M];   data_5[N, K] =   (data_4[N, K] + bias_2[K]);")' make

Apart from detailed reporting logs during the tuning procedure, the best kernel record will be saved to directory antares/codehub. If you don't want to create/overwrite existing kernel record in codehub, environment variable COMMIT=force in the tuning command can be removed.

Senario-2: Quick Start for Developers that Use Antares to Extend Operator/Sub-graph in Pytorch/Tensorflow (only for CUDA & ROCm backend currently):

Step-1: Prepare Environment

You need to follow Step-1 from Senario-1 to finish environment preparation beforehand. This prevents many environmental issues when walking to the next step.

Step-2: Set up a corresponding TF/TF2/Pytorch version that matches your CUDA/ROCm driver version. (If you have installed TF/TF2/Pytorch, please just ignore this step)

Here we provide several prebuilt package sources that match different environment requirements:

  For Tensorflow 1.x & 2.x: Recommended Packages (tested in Ubuntu 20.04):
  #   Tensorflow-1 for NVIDIA CUDA 10.0:
  python3 -m pip install --upgrade pip && python3 -m pip install tensorflow-gpu==1.15.4
  #   Tensorflow-1 for NVIDIA CUDA 11.0:
  python3 -m pip install --upgrade pip && python3 -m pip install https://github.com/ghostplant/tensorflow-wheel-collections/releases/download/cuda-11/tensorflow_gpu-1.15.4_cuda11+nv-cp38-cp38-linux_x86_64.whl
  #   Tensorflow-1 for AMD ROCm 4.2:
  python3 -m pip install tensorflow-rocm==1.15.12

  #   Tensorflow-2 for NVIDIA CUDA 11.0:
  python3 -m pip install --upgrade pip && python3 -m pip install tensorflow-gpu==2.4.0
  #   Tensorflow-2 for AMD ROCm 4.2:
  python3 -m pip install tensorflow-rocm==2.4.3

  For Pytorch 1.10: Recommended Packages (tested in Ubuntu 20.04):
  #   Pytorch for NVIDIA CUDA >= 10.2:
  python3 -m pip install --user torch==1.10.0+cu102 torchvision==0.11.1+cu102 torchaudio==0.10.0+cu102 -f https://download.pytorch.org/whl/torch_stable.html
  #   Pytorch for NVIDIA CUDA >= 11.3:
  python3 -m pip install --user torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio==0.10.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html
  #   Pytorch for AMD ROCm 4.2:
  python3 -m pip install --user torch==1.10.0+rocm4.2 torchvision==0.11.1+rocm4.2 -f https://download.pytorch.org/whl/torch_stable.html

Step-3: Install JIT Plugin Client and Run Examples

# If using Pytorch, set up JIT Plugin for Pytorch:
sudo python3 ./frameworks/pytorch/setup.py

# If using Tensorflow, set up JIT Plugin for Tensorflow/Tensorflow2:
sudo python3 ./frameworks/tensorflow/setup.py

# Set the path of Antares root directory:
export ANTARES_ROOT="(..clone path to antares..)/"

# Test Examples for Pytorch:
cd ./frameworks/pytorch/examples
./1_hello_world.py

# Test Examples for Tensorflow:
cd ./frameworks/tensorflow/examples
./1_hello_world.py

More examples here: Antares Examples for Pytorch and Antares Examples for TF/TF2

Antares Predependencies for Different Backends:

Before running make command in antares root directory, you need to ensure the corresponding backend driver is installed correctly.

Predependencies for backend c-cuda, c-sycl_cuda:

Requirement: Ubuntu >= 18.04

Requirement: Install NVIDIA CUDA toolkit (>= 10.0) on Host OS

Requirement: docker
Predependencies for backend c-ocl_nvidia:

Requirement: Ubuntu >= 18.04

Requirement: Install NVIDIA CUDA toolkit (>= 10.0) to Host OS

Requirement: run bash command "make install_host" in antares root directory beforehand
Predependencies for backend c-ocl_android, c-mcpu_android:

Requirement: Ubuntu >= 18.04

Requirement: Install package "adb", connect to rooted Android device and ensure command "adb shell su -c 'ls /sdcard'" works

Requirement: run bash command "make install_host" in antares root directory beforehand

Requirement: Install Android NDK and make a link in PATH to ensure command "aarch64-linux-android-clang++" works, e.g. "sudo ln -s (path to ndk's clang)/bin/aarch64-linux-androidi26-clang++ /usr/local/bin/aarch64-linux-android-clang++"
Predependencies for backend c-rocm, c-ocl_amdgpu:

Requirement: Ubuntu >= 18.04

Requirement: Install AMD ROCm (>= 4.0) package "rock-dkms" & "rock-dkms-firmware" from repo http://repo.radeon.com/rocm/apt/debian to Host OS

Requirement: docker
Predependencies for backend c-ipu:

Requirement: Ubuntu >= 18.04

Requirement: Install Poplar SDK to Host OS, ensure "popc" command exists in system PATH

Requirement: run bash command "make install_host" in antares root directory beforehand
Predependencies for backend c-scpu, c-mcpu, c-sycl_intel:

Requirement: Ubuntu >= 18.04

Requirement: docker
Predependencies for backend c-hlsl_win64, c-hlsl_xbox:

Requirement: Windows 10 64 bit (>= 2004), run "dxdiag.exe" to ensure Direct3D 12.0 Accleration is enabled

Requirement: Windows Subsystem Linux 1.0 How to Install WSL 1.0

Requirement: GIT clones antares repo inside WSL environment, and the path of antares directory should be **visible to Windows**, (e.g. "/../c/Users/me/Desktop/antares" would be OK, but "/home/me/antares" won't).

Requirement: run bash command "make install_host" in antares root directory beforehand
Predependencies for backend c-rocm_win64:

Requirement: Windows 10 64 bit (>= 2004)

Requirement: Windows Subsystem Linux 1.0 How to Install WSL 1.0

Requirement: Install Official AMD GPU driver (release version >= 2020.11). Ensure C:\Windows\System32\amdhip64.dll exists after installation.

Requirement: GIT clones antares repo inside WSL environment, and the path of antares directory should be **visible to Windows**, (e.g. "/../c/Users/me/Desktop/antares" would be OK, but "/home/me/antares" won't).

Requirement: run bash command "make install_host" in antares root directory beforehand
Predependencies for backend c-cuda_win64:

Requirement: Windows 10 64 bit (>= 2004)

Requirement: Windows Subsystem Linux 1.0 How to Install WSL 1.0

Requirement: Install Official NVIDIA CUDA driver (>= 10.0). Ensure C:\Windows\System32\nvcuda.dll exists after installation.

Requirement: GIT clones antares repo inside WSL environment, and the path of antares directory should be **visible to Windows**, (e.g. "/../c/Users/me/Desktop/antares" would be OK, but "/home/me/antares" won't).

Requirement: run bash command "make install_host" in antares root directory beforehand

Current Support Table:

	HIP-C(c-rocm/c-rocm_win64)	CUDA(c-cuda/c-cuda_win64)	CPU(c-mcpu/c-scpu)	DirectX12(c-hlsl_win64)	Graphcore(c-ipu)	Intel OneAPI(c-sycl_intel)	Codeplay DPCPP (c-sycl_cuda)
Deploy Environment	Linux/WSL1/WSL2	Linux	Linux	WSL1/WSL2	Linux	Linux
Target Device	AMDGPU	NVGPU	Generic CPU	Generic Graphic Card	IPU Device	Intel CPU/HD Graphic/FPGA	NVGPU
Global schedules	Y	Y	Y	Y	Y	Y	Y
Local schedules	Y	Y	Y	Y		Y	Y
Head fusion	Y	Y	Y	Y	Y	Y	Y
Tail fusion	Y	Y		Y			Y
Evaluator	Y	Y	Y	Y	Y	Y	Y
Tensorflow Plugin	Y	Y	Y (intel-tensorflow)
Pytorch Plugin	Y	Y	Y
Multi Kernel Eval	Y	Y	Y	Y		Y	Y

About Microsft Open Source

For more information about Microsoft Open Source Policy, please see Microsoft Open Source Code of Conduct

sahil280114/antares