Path-1: Antares Kernel Optimizer (CUDA/ROCm/DX/SYCL/OpenCL/CPU/IPU/Android):
python3 -m pip install antares
For usage details, follow the README for Antares.
Path-2: AutoRT for Pytorch Runtime & Device Benchmark (CUDA/DirectX/..):
AutoRT is a compiler solution that helps runtime users invent, benchmark, and optimize operators for Pytorch on their own accelerators:
- AutoRT can be used as a benchmark utility for device performance testing and profiling.
- AutoRT can also provide a Pytorch 2 backend for your device (e.g. DirectX) to accelerate standard Pytorch applications.
- Additionally, AutoRT further helps to construct custom-defined / fused operators that go beyond the built-in functions of Pytorch.
- An experimental version of AutoRT has been released for Windows DirectX 12 and Linux CUDA.
- Click here to suggest more platforms (e.g. Pytorch2 for Windows ROCm / OpenCL / SYCL / Apple Metal / ..) you would like AutoRT to support in follow-up releases.
Platform | OS Requirement | Python Requirement | Download Link |
---|---|---|---|
DirectX 12 | Windows >= 10 / Microsoft Xbox | Python 3.8 | python.exe -m pip install --verbose https://github.com/microsoft/antares/releases/download/v0.9.1/autort-0.9.1.3_directx-cp38-cp38-win_amd64.whl |
CUDA >= 11 | Ubuntu >= 18.04 (or images) | Python 3.8/3.9/3.10/3.11/3.12 | python3 -m pip install --verbose https://github.com/microsoft/antares/releases/download/v0.9.1/autort-0.9.1.3+cuda.linux.tar.gz |
.. | .. | .. | .. (More coming soon) .. |
For CUDA, the following container images are equivalent to Ubuntu >= 18.04:
- Docker Image: nvidia/cuda:12.0.1-cudnn8-devel-ubuntu18.04
- Docker Image: nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
- Docker Image: nvidia/cuda:12.0.1-cudnn8-devel-ubuntu20.04
- Docker Image: nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04
- ..
To enable AutoRT to produce custom CUDA operators for Pytorch 2, please also ensure Pytorch is installed before AutoRT, with:
python3 -m pip install torch --index-url https://download.pytorch.org/whl/cu118
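Because the install order matters here, it can help to confirm that Pytorch is already importable before installing AutoRT. The following is a minimal sketch that uses only the Python standard library; the helper name `is_installed` is hypothetical and is not part of AutoRT.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located (it is not imported)."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("torch"):
        print("Pytorch found; AutoRT can be installed next.")
    else:
        print("Pytorch not found; install it first with the command above.")
```

Using `find_spec` instead of a plain `import torch` keeps the check fast, since the module is only located and never loaded.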
$ python.exe -m autort.utils.memtest
...
[1000/1000] AutoRT Device Memory Bandwidth: (Actual ~= 468.12 GB/s) (Theoretical ~= 561.75 GB/s)
$ python.exe -m autort.utils.fp32test
...
[5000/5000] AutoRT FP32 TFLOPS: (Actual ~= 9.84 TFLOPS) (Theoretical ~= 10.93 TFLOPS)
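Each summary line reports a measured figure next to a theoretical peak, so their ratio gives a rough efficiency number for the device. The sketch below parses lines in the format shown above and computes that ratio. The function name `efficiency` and the regular expression are illustrative assumptions derived only from the sample output; they are not an AutoRT API.

```python
import re

# Matches "(Actual ~= 468.12 GB/s) (Theoretical ~= 561.75 GB/s)" and the TFLOPS variant.
PATTERN = re.compile(
    r"\(Actual ~= ([\d.]+) (\S+)\) \(Theoretical ~= ([\d.]+) (\S+)\)"
)


def efficiency(line: str) -> float:
    """Return actual / theoretical for one benchmark summary line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"unrecognized benchmark line: {line!r}")
    actual, actual_unit, peak, peak_unit = match.groups()
    if actual_unit != peak_unit:
        raise ValueError("actual and theoretical units differ")
    return float(actual) / float(peak)


if __name__ == "__main__":
    samples = [
        "[1000/1000] AutoRT Device Memory Bandwidth: "
        "(Actual ~= 468.12 GB/s) (Theoretical ~= 561.75 GB/s)",
        "[5000/5000] AutoRT FP32 TFLOPS: "
        "(Actual ~= 9.84 TFLOPS) (Theoretical ~= 10.93 TFLOPS)",
    ]
    for sample in samples:
        print(f"{efficiency(sample):.1%}")  # prints 83.3% then 90.0%
```

For the sample device this gives about 83% of peak memory bandwidth and about 90% of peak FP32 throughput.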
- Style-1: "AutoRT API Style" Custom Operator Generation:
>> import torch, autort
>> data = torch.arange(0, 10, dtype=torch.float32, device=autort.device())
>> f = autort.export(ir="sigmoid_f32[N] = 1 - 1 / (1 + data[N].call(strs.exp))", inputs=["data=float32[N:4096000]"], config="tune:5")
>> print(f(data))
tensor([0.5000, 0.7311, 0.8808, 0.9526, 0.9820, 0.9933, 0.9975, 0.9991, 0.9997, 0.9999])
>> print(autort.ops.sigmoid_f32(data))
tensor([0.5000, 0.7311, 0.8808, 0.9526, 0.9820, 0.9933, 0.9975, 0.9991, 0.9997, 0.9999])
- Style-2: "Command Line Style" Custom Operator Generation:
# First, create a custom sigmoid activation operator with auto-tuning steps == 5:
$ python.exe -m autort.utils.export --ir "sigmoid_f32[N] = 1 - 1 / (1 + data[N].call(strs.exp))" -i data=float32[N:4096000] -c "tune:5"
# Then, use it in a Pytorch 2 session:
$ python.exe
>> import torch, autort
>>
>> data = torch.arange(0, 10, dtype=torch.float32, device=autort.device())
>> output = autort.ops.sigmoid_f32(data)
>> print(output)
tensor([0.5000, 0.7311, 0.8808, 0.9526, 0.9820, 0.9933, 0.9975, 0.9991, 0.9997,
0.9999])
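Both styles compile the same expression, `1 - 1 / (1 + exp(x))`, which is algebraically equal to the usual sigmoid `1 / (1 + exp(-x))`. As a sanity check that needs neither AutoRT nor a GPU, the plain-Python sketch below evaluates that expression for the inputs 0 through 9 and reproduces the values printed above. `sigmoid_ref` is a hypothetical reference helper, not part of AutoRT.

```python
import math


def sigmoid_ref(x: float) -> float:
    """Evaluate the same expression as the IR string: 1 - 1 / (1 + exp(x))."""
    return 1.0 - 1.0 / (1.0 + math.exp(x))


if __name__ == "__main__":
    values = [sigmoid_ref(float(n)) for n in range(10)]
    print(", ".join(f"{v:.4f}" for v in values))
    # prints 0.5000, 0.7311, 0.8808, 0.9526, 0.9820, 0.9933, 0.9975, 0.9991, 0.9997, 0.9999

    # The expression agrees with the textbook form 1 / (1 + exp(-x)).
    for n in range(10):
        assert math.isclose(sigmoid_ref(n), 1.0 / (1.0 + math.exp(-n)), rel_tol=1e-12)
```

A reference like this is handy for comparing against the output of a freshly tuned operator, since tuning changes the kernel schedule but should not change the numerical result beyond floating-point tolerance.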
$ python.exe -m autort.examples.mnist
...
step = 100, loss = 2.2871, accuracy = 21.88 %
step = 200, loss = 2.1408, accuracy = 46.88 %
step = 300, loss = 1.6713, accuracy = 62.50 %
step = 400, loss = 0.9573, accuracy = 62.50 %
step = 500, loss = 0.8338, accuracy = 68.75 %
step = 600, loss = 0.5882, accuracy = 84.38 %
step = 700, loss = 0.2738, accuracy = 87.50 %
step = 800, loss = 0.5159, accuracy = 87.50 %
step = 900, loss = 0.5511, accuracy = 84.38 %
step = 1000, loss = 0.2616, accuracy = 93.75 %
...
Quick Test 3: Fine-tune existing operators to make Pytorch Builtin Operators run faster (DirectX only).
$ python.exe -m autort.utils.mmtest
>> Performance of your device:
`MM-Perf` (current) = 4.15 TFLOPS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> ...
$ python.exe -m autort.utils.export -s 4000
Module file for operator `gemm_f32` has been exported to `.\ops\gemm_f32.mod`.
..
$ python.exe -m autort.utils.mmtest
>> Performance of your device:
`MM-Perf` (current) = 9.71 TFLOPS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> ...
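The two `mmtest` runs above report 4.15 TFLOPS before and 9.71 TFLOPS after exporting the tuned `gemm_f32` module, which is roughly a 2.3x speedup. As a rough illustration of how a TFLOPS figure can be derived, the sketch below uses the standard `2 * M * N * K` operation count for a dense matrix multiplication. The matrix sizes and timings that `mmtest` actually uses are not stated here, so the helper names and the example inputs are assumptions.

```python
def matmul_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """TFLOPS for an (m x k) @ (k x n) product: one multiply and one add per term."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 2.0 * m * n * k / seconds / 1e12


def speedup(before_tflops: float, after_tflops: float) -> float:
    """Ratio of tuned throughput to baseline throughput."""
    return after_tflops / before_tflops


if __name__ == "__main__":
    # Hypothetical measurement: a 4096 x 4096 x 4096 product that took 20 ms.
    print(f"{matmul_tflops(4096, 4096, 4096, 0.020):.2f} TFLOPS")
    # Figures taken from the two mmtest runs shown above.
    print(f"{speedup(4.15, 9.71):.2f}x")  # prints 2.34x
```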
If you like it, you are welcome to report issues or give the project a star; this encourages AutoRT to support more backends, more OS types, and more documentation. See More Information about Microsoft Contributing and Trademarks.