Python-based GEMM benchmark for tensor computation
1. cd utils/deviceQuery
2. make -j8
1. git clone https://github.com/llvm/llvm-project.git
2. cd llvm-project
3. git checkout llvmorg-13.0.1
4. mkdir build && cd build
5. cmake -G Ninja ../llvm \
-DLLVM_ENABLE_PROJECTS="mlir;clang" \
-DLLVM_TARGETS_TO_BUILD="host;RISCV;NVPTX" \
-DMLIR_ENABLE_CUDA_RUNNER=ON \
-DLLVM_ENABLE_ASSERTIONS=ON \
-DCMAKE_BUILD_TYPE=RELEASE
6. ninja -j12
7. ninja check-all
8. sudo ninja install
9. export PATH=/path/to/llvm-project/build/bin:$PATH
1. git clone --recursive https://github.com/Hzfengsy/asplos-tvm.git
2. cd asplos-tvm
3. mkdir build && cd build
4. cp ../cmake/config.cmake ./
5. edit config.cmake: set(USE_LLVM ON) and set(USE_CUDA ON)
6. cmake .. && make -j12
7. pip install synr xgboost==1.5
8. export TVM_HOME=/path/to/tvm
9. export PYTHONPATH=$TVM_HOME/python:${PYTHONPATH}
10. export TVM_TARGET="nvidia/geforce-rtx-3090"
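Before running the benchmark, it can help to confirm that the three variables exported above are visible to Python. The following is a minimal sketch using only the standard library; the helper name `check_tvm_env` is hypothetical and not part of this repository.

```python
import os


def check_tvm_env(env=None):
    """Return a list of problems found with the TVM-related environment variables."""
    env = os.environ if env is None else env
    problems = []
    tvm_home = env.get("TVM_HOME")
    if not tvm_home:
        problems.append("TVM_HOME is not set")
    else:
        # Step 9 above prepends $TVM_HOME/python to PYTHONPATH.
        python_dir = os.path.join(tvm_home, "python")
        entries = env.get("PYTHONPATH", "").split(os.pathsep)
        if python_dir not in entries:
            problems.append("PYTHONPATH does not contain " + python_dir)
    if not env.get("TVM_TARGET"):
        problems.append("TVM_TARGET is not set")
    return problems


if __name__ == "__main__":
    for line in check_tvm_env() or ["environment looks OK"]:
        print(line)
```

An empty list means all three variables are set as described in steps 8 to 10; it does not verify that TVM itself imports correctly.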
1. git clone https://github.com/NVIDIA/cutlass.git
2. cd cutlass && git checkout v2.9.1
3. export CUDACXX=/usr/local/cuda/bin/nvcc
4. mkdir build && cd build
5. cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS=all
6. make cutlass_profiler -j16
python bench.py --engine tvm_ms --workload GEMM --BSA 10 --BSB 10 --m 512 --k 1024 --n 512 --TransA N --TransB T
current engine is tvm_ms
batch_GEMM_10_10_512_512_1024_N_T_tvm_ms_input_f16_acc_f32_output_f32
./results/RTX3090/workloads
/home/yangbai/Desktop/github_yang/gemm-benchmark/utils/deviceQuery/deviceQuery
2024-02-26 17:49:26.865 INFO LocalRunner: max_workers = 1
primfn(var_X: handle, var_Y: handle, var_Z: handle) -> ()
attr = {"global_symbol": "main", "tir.noalias": True}
buffers = {X: Buffer(X_1: Pointer(global float16), float16, [10, 512, 1024], []),
Y: Buffer(Y_1: Pointer(global float16), float16, [10, 512, 1024], []),
Z: Buffer(Z_1: Pointer(global float32), float32, [10, 512, 512], [])}
buffer_map = {var_X: X, var_Y: Y, var_Z: Z} {
block([], "root") {
tir.reads([])
tir.writes([])
for (i0: int32, 0, 10) {
for (i1: int32, 0, 512) {
for (i2: int32, 0, 512) {
for (i3: int32, 0, 1024) {
block([10, 512, 512, tir.reduce_axis(0, 1024)], "Z") as [b, i, j, k] {
bind(b, i0)
bind(i, i1)
bind(j, i2)
bind(k, i3)
tir.reads([X[b, i, k], Y[b, j, k]])
tir.writes([Z[b, i, j]])
with init() {
Z[b, i, j] = 0f32
}
Z[b, i, j] = (Z[b, i, j] + (cast(float32, X[b, i, k])*cast(float32, Y[b, j, k])))
}
}
}
}
}
start tuning with meta schedule ...
2024-02-26 17:49:27.166 INFO LocalBuilder: max_workers = 24
hhhhh
2024-02-26 17:49:27.422 INFO Logging directory: ./results/RTX3090/workloads/batch_GEMM_10_10_512_512_1024_N_T_tvm_ms_input_f16_acc_f32_output_f32_trials_1000/logs
2024-02-26 17:49:27.422 INFO Logging directory: ./results/RTX3090/workloads/batch_GEMM_10_10_512_512_1024_N_T_tvm_ms_input_f16_acc_f32_output_f32_trials_1000/logs
2024-02-26 17:49:27.422 INFO Working directory: ./results/RTX3090/workloads/batch_GEMM_10_10_512_512_1024_N_T_tvm_ms_input_f16_acc_f32_output_f32_trials_1000
2024-02-26 17:49:27.422 INFO Creating JSONDatabase. Workload at: ./results/RTX3090/workloads/batch_GEMM_10_10_512_512_1024_N_T_tvm_ms_input_f16_acc_f32_output_f32_trials_1000/database_workload.json. Tuning records at: ./results/RTX3090/workloads/batch_GEMM_10_10_512_512_1024_N_T_tvm_ms_input_f16_acc_f32_output_f32_trials_1000/database_tuning_record.json
2024-02-26 17:49:27.505 INFO Initializing Task #0: "main"
2024-02-26 17:49:27.536 INFO
ID | Name | FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Terminated
---------------------------------------------------------------------------------------------------------------
0 | main | 5368709120 | 1 | N/A | N/A | N/A | 0 |
---------------------------------------------------------------------------------------------------------------
Total trials: 0
Total latency (us): 0
2024-02-26 17:49:27.536 INFO Scheduler picks Task #0: "main"
2024-02-26 17:49:41.048 INFO Sending 64 sample(s) to builder
2024-02-26 17:49:48.840 INFO Sending 64 sample(s) to runner
/home/yangbai/anaconda3/lib/python3.10/site-packages/xgboost/compat.py:36: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
from pandas import MultiIndex, Int64Index
/home/yangbai/anaconda3/lib/python3.10/site-packages/xgboost/training.py:17: UserWarning: Old style callback is deprecated. See: https://xgboost.readthedocs.io/en/latest/python/callbacks.html
warnings.warn(f'Old style callback is deprecated. See: {link}', UserWarning)
2024-02-26 17:50:13.101 INFO [Updated] Task #0: "main"
ID | Name | FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Terminated
---------------------------------------------------------------------------------------------------------------
0 | main | 5368709120 | 1 | 57412.0973 | 93.5118 | 93.5118 | 64 |
---------------------------------------------------------------------------------------------------------------
Total trials: 64
Total latency (us): 93.5118
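The FLOP and Speed columns in the table above can be reproduced by hand. The sketch below assumes the usual convention of two floating-point operations (one multiply, one add) per iteration of the innermost loop, which matches the 5368709120 reported for this workload. The function names are illustrative only.

```python
def gemm_flop(batch, m, n, k):
    """Count 2 FLOPs (multiply + add) per point of the batch x m x n x k loop nest."""
    return 2 * batch * m * n * k


def gflops(flop, latency_us):
    """Convert a FLOP count and a latency in microseconds to GFLOPS."""
    seconds = latency_us * 1e-6
    return flop / seconds / 1e9


flop = gemm_flop(batch=10, m=512, n=512, k=1024)
print(flop)                    # matches the FLOP column: 5368709120
print(gflops(flop, 93.5118))   # close to the logged 57412.0973 GFLOPS
```

The small difference from the logged speed comes from the latency being rounded to four decimals in the table.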
python bench.py --engine tvm_ms --workload GEMM --BSA 10 --BSB 10 --m 512 --k 1024 --n 512 --TransA T --TransB N
current engine is tvm_ms
batch_GEMM_10_10_512_512_1024_T_N_tvm_ms_input_f16_acc_f32_output_f32
./results/RTX3090/workloads
/home/yangbai/Desktop/github_yang/gemm-benchmark/utils/deviceQuery/deviceQuery
2024-02-26 17:51:08.859 INFO LocalRunner: max_workers = 1
primfn(var_X: handle, var_Y: handle, var_Z: handle) -> ()
attr = {"global_symbol": "main", "tir.noalias": True}
buffers = {X: Buffer(X_1: Pointer(global float16), float16, [10, 1024, 512], []),
Y: Buffer(Y_1: Pointer(global float16), float16, [10, 1024, 512], []),
Z: Buffer(Z_1: Pointer(global float32), float32, [10, 512, 512], [])}
buffer_map = {var_X: X, var_Y: Y, var_Z: Z} {
block([], "root") {
tir.reads([])
tir.writes([])
for (i0: int32, 0, 10) {
for (i1: int32, 0, 512) {
for (i2: int32, 0, 512) {
for (i3: int32, 0, 1024) {
block([10, 512, 512, tir.reduce_axis(0, 1024)], "Z") as [b, i, j, k] {
bind(b, i0)
bind(i, i1)
bind(j, i2)
bind(k, i3)
tir.reads([X[b, k, i], Y[b, k, j]])
tir.writes([Z[b, i, j]])
with init() {
Z[b, i, j] = 0f32
}
Z[b, i, j] = (Z[b, i, j] + (cast(float32, X[b, k, i])*cast(float32, Y[b, k, j])))
}
}
}
}
}
start tuning with meta schedule ...
2024-02-26 17:51:09.169 INFO LocalBuilder: max_workers = 24
hhhhh
2024-02-26 17:51:09.420 INFO Logging directory: ./results/RTX3090/workloads/batch_GEMM_10_10_512_512_1024_T_N_tvm_ms_input_f16_acc_f32_output_f32_trials_1000/logs
2024-02-26 17:51:09.421 INFO Logging directory: ./results/RTX3090/workloads/batch_GEMM_10_10_512_512_1024_T_N_tvm_ms_input_f16_acc_f32_output_f32_trials_1000/logs
2024-02-26 17:51:09.421 INFO Working directory: ./results/RTX3090/workloads/batch_GEMM_10_10_512_512_1024_T_N_tvm_ms_input_f16_acc_f32_output_f32_trials_1000
2024-02-26 17:51:09.421 INFO Creating JSONDatabase. Workload at: ./results/RTX3090/workloads/batch_GEMM_10_10_512_512_1024_T_N_tvm_ms_input_f16_acc_f32_output_f32_trials_1000/database_workload.json. Tuning records at: ./results/RTX3090/workloads/batch_GEMM_10_10_512_512_1024_T_N_tvm_ms_input_f16_acc_f32_output_f32_trials_1000/database_tuning_record.json
2024-02-26 17:51:09.422 INFO Initializing Task #0: "main"
2024-02-26 17:51:09.450 INFO
ID | Name | FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Terminated
---------------------------------------------------------------------------------------------------------------
0 | main | 5368709120 | 1 | N/A | N/A | N/A | 0 |
---------------------------------------------------------------------------------------------------------------
Total trials: 0
Total latency (us): 0
2024-02-26 17:51:09.450 INFO Scheduler picks Task #0: "main"
2024-02-26 17:51:23.017 INFO Sending 64 sample(s) to builder
2024-02-26 17:51:39.480 INFO Sending 64 sample(s) to runner
/home/yangbai/anaconda3/lib/python3.10/site-packages/xgboost/compat.py:36: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
from pandas import MultiIndex, Int64Index
/home/yangbai/anaconda3/lib/python3.10/site-packages/xgboost/training.py:17: UserWarning: Old style callback is deprecated. See: https://xgboost.readthedocs.io/en/latest/python/callbacks.html
warnings.warn(f'Old style callback is deprecated. See: {link}', UserWarning)
2024-02-26 17:51:52.878 INFO [Updated] Task #0: "main"
ID | Name | FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Terminated
---------------------------------------------------------------------------------------------------------------
0 | main | 5368709120 | 1 | 26579.7810 | 201.9847 | 201.9847 | 64 |
---------------------------------------------------------------------------------------------------------------
Total trials: 64
Total latency (us): 201.985
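The TIR dumps in these logs differ only in how X and Y are stored and indexed. With `--TransA N` the first operand has shape [batch, m, k] and is read as X[b, i, k]; with `--TransA T` it has shape [batch, k, m] and is read as X[b, k, i]. Likewise `--TransB N` reads Y[b, k, j] and `--TransB T` reads Y[b, j, k]. The pure-Python sketch below mirrors that loop nest on tiny shapes to show that all four layouts compute the same product. It does not model the float16 inputs or float32 accumulation, and the function names are illustrative.

```python
def batched_gemm(X, Y, batch, m, n, k, trans_a="N", trans_b="N"):
    """Reference batched GEMM on nested lists, mirroring the TIR loop nest."""
    Z = [[[0.0] * n for _ in range(m)] for _ in range(batch)]
    for b in range(batch):
        for i in range(m):
            for j in range(n):
                acc = 0.0
                for kk in range(k):
                    # Index choice follows the tir.reads lines in the dumps.
                    x = X[b][i][kk] if trans_a == "N" else X[b][kk][i]
                    y = Y[b][kk][j] if trans_b == "N" else Y[b][j][kk]
                    acc += x * y
                Z[b][i][j] = acc
    return Z


def transpose(M):
    """Swap the last two axes of a [batch][rows][cols] nested list."""
    return [[list(col) for col in zip(*mat)] for mat in M]


# Tiny shapes so the pure-Python loops finish instantly.
batch, m, n, k = 2, 3, 4, 5
A = [[[float(b + i * k + kk) for kk in range(k)] for i in range(m)]
     for b in range(batch)]  # stored as [batch, m, k]
B = [[[float(b - kk + 2 * j) for j in range(n)] for kk in range(k)]
     for b in range(batch)]  # stored as [batch, k, n]

ref = batched_gemm(A, B, batch, m, n, k, "N", "N")
# The same product must come out regardless of storage layout.
assert batched_gemm(A, transpose(B), batch, m, n, k, "N", "T") == ref
assert batched_gemm(transpose(A), B, batch, m, n, k, "T", "N") == ref
assert batched_gemm(transpose(A), transpose(B), batch, m, n, k, "T", "T") == ref
```

Since the arithmetic is identical, the speed differences between layouts in the logs come from memory access patterns and the schedules found, not from the amount of work.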
python bench.py --engine tvm_ms --workload GEMM --BSA 10 --BSB 10 --m 512 --k 1024 --n 512 --TransA T --TransB T
current engine is tvm_ms
batch_GEMM_10_10_512_512_1024_T_T_tvm_ms_input_f16_acc_f32_output_f32
./results/RTX3090/workloads
/home/yangbai/Desktop/github_yang/gemm-benchmark/utils/deviceQuery/deviceQuery
2024-02-26 17:52:14.148 INFO LocalRunner: max_workers = 1
primfn(var_X: handle, var_Y: handle, var_Z: handle) -> ()
attr = {"global_symbol": "main", "tir.noalias": True}
buffers = {X: Buffer(X_1: Pointer(global float16), float16, [10, 1024, 512], []),
Y: Buffer(Y_1: Pointer(global float16), float16, [10, 512, 1024], []),
Z: Buffer(Z_1: Pointer(global float32), float32, [10, 512, 512], [])}
buffer_map = {var_X: X, var_Y: Y, var_Z: Z} {
block([], "root") {
tir.reads([])
tir.writes([])
for (i0: int32, 0, 10) {
for (i1: int32, 0, 512) {
for (i2: int32, 0, 512) {
for (i3: int32, 0, 1024) {
block([10, 512, 512, tir.reduce_axis(0, 1024)], "Z") as [b, i, j, k] {
bind(b, i0)
bind(i, i1)
bind(j, i2)
bind(k, i3)
tir.reads([X[b, k, i], Y[b, j, k]])
tir.writes([Z[b, i, j]])
with init() {
Z[b, i, j] = 0f32
}
Z[b, i, j] = (Z[b, i, j] + (cast(float32, X[b, k, i])*cast(float32, Y[b, j, k])))
}
}
}
}
}
start tuning with meta schedule ...
2024-02-26 17:52:14.454 INFO LocalBuilder: max_workers = 24
hhhhh
2024-02-26 17:52:14.718 INFO Logging directory: ./results/RTX3090/workloads/batch_GEMM_10_10_512_512_1024_T_T_tvm_ms_input_f16_acc_f32_output_f32_trials_1000/logs
2024-02-26 17:52:14.718 INFO Logging directory: ./results/RTX3090/workloads/batch_GEMM_10_10_512_512_1024_T_T_tvm_ms_input_f16_acc_f32_output_f32_trials_1000/logs
2024-02-26 17:52:14.718 INFO Working directory: ./results/RTX3090/workloads/batch_GEMM_10_10_512_512_1024_T_T_tvm_ms_input_f16_acc_f32_output_f32_trials_1000
2024-02-26 17:52:14.718 INFO Creating JSONDatabase. Workload at: ./results/RTX3090/workloads/batch_GEMM_10_10_512_512_1024_T_T_tvm_ms_input_f16_acc_f32_output_f32_trials_1000/database_workload.json. Tuning records at: ./results/RTX3090/workloads/batch_GEMM_10_10_512_512_1024_T_T_tvm_ms_input_f16_acc_f32_output_f32_trials_1000/database_tuning_record.json
2024-02-26 17:52:14.763 INFO Initializing Task #0: "main"
2024-02-26 17:52:14.791 INFO
ID | Name | FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Terminated
---------------------------------------------------------------------------------------------------------------
0 | main | 5368709120 | 1 | N/A | N/A | N/A | 0 |
---------------------------------------------------------------------------------------------------------------
Total trials: 0
Total latency (us): 0
2024-02-26 17:52:14.791 INFO Scheduler picks Task #0: "main"
2024-02-26 17:52:28.603 INFO Sending 64 sample(s) to builder
2024-02-26 17:52:49.406 INFO Sending 64 sample(s) to runner
/home/yangbai/anaconda3/lib/python3.10/site-packages/xgboost/compat.py:36: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
from pandas import MultiIndex, Int64Index
/home/yangbai/anaconda3/lib/python3.10/site-packages/xgboost/training.py:17: UserWarning: Old style callback is deprecated. See: https://xgboost.readthedocs.io/en/latest/python/callbacks.html
warnings.warn(f'Old style callback is deprecated. See: {link}', UserWarning)
2024-02-26 17:53:07.494 INFO [Updated] Task #0: "main"
ID | Name | FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Terminated
---------------------------------------------------------------------------------------------------------------
0 | main | 5368709120 | 1 | 26706.9909 | 201.0226 | 201.0226 | 64 |
---------------------------------------------------------------------------------------------------------------
Total trials: 64
Total latency (us): 201.023
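Each run prints a workload name such as batch_GEMM_10_10_512_512_1024_T_T_tvm_ms_input_f16_acc_f32_output_f32, which is also used for the results directory. The sketch below splits such a name into fields with a regular expression. The pattern and field names are inferred from the commands and names in these logs, not taken from bench.py. Because m and n are both 512 here, the order of the two 512 fields (assumed to be m, n, k) cannot be confirmed from this output.

```python
import re

# Pattern inferred from the names printed in the logs (an assumption).
WORKLOAD_RE = re.compile(
    r"^batch_(?P<workload>[A-Za-z]+)"
    r"_(?P<bsa>\d+)_(?P<bsb>\d+)"
    r"_(?P<m>\d+)_(?P<n>\d+)_(?P<k>\d+)"
    r"_(?P<trans_a>[NT])_(?P<trans_b>[NT])"
    r"_(?P<engine>\w+?)"
    r"_input_(?P<input>\w+?)_acc_(?P<acc>\w+?)_output_(?P<output>\w+)$"
)


def parse_workload_name(name):
    """Split a workload name into a dict; numeric fields become ints."""
    match = WORKLOAD_RE.match(name)
    if match is None:
        raise ValueError("unrecognized workload name: " + name)
    fields = match.groupdict()
    for key in ("bsa", "bsb", "m", "n", "k"):
        fields[key] = int(fields[key])
    return fields


info = parse_workload_name(
    "batch_GEMM_10_10_512_512_1024_T_T_tvm_ms_input_f16_acc_f32_output_f32"
)
print(info["engine"], info["trans_a"], info["trans_b"], info["k"])
```

The sketch does not handle the `_trials_1000` suffix that appears on the working directory names.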
python bench.py --engine tvm_ms --workload GEMM --BSA 10 --BSB 10 --m 512 --k 1024 --n 512 --TransA N --TransB N
current engine is tvm_ms
batch_GEMM_10_10_512_512_1024_N_N_tvm_ms_input_f16_acc_f32_output_f32
./results/RTX3090/workloads
/home/yangbai/Desktop/github_yang/gemm-benchmark/utils/deviceQuery/deviceQuery
2024-02-26 17:55:30.266 INFO LocalRunner: max_workers = 1
primfn(var_X: handle, var_Y: handle, var_Z: handle) -> ()
attr = {"global_symbol": "main", "tir.noalias": True}
buffers = {X: Buffer(X_1: Pointer(global float16), float16, [10, 512, 1024], []),
Y: Buffer(Y_1: Pointer(global float16), float16, [10, 1024, 512], []),
Z: Buffer(Z_1: Pointer(global float32), float32, [10, 512, 512], [])}
buffer_map = {var_X: X, var_Y: Y, var_Z: Z} {
block([], "root") {
tir.reads([])
tir.writes([])
for (i0: int32, 0, 10) {
for (i1: int32, 0, 512) {
for (i2: int32, 0, 512) {
for (i3: int32, 0, 1024) {
block([10, 512, 512, tir.reduce_axis(0, 1024)], "Z") as [b, i, j, k] {
bind(b, i0)
bind(i, i1)
bind(j, i2)
bind(k, i3)
tir.reads([X[b, i, k], Y[b, k, j]])
tir.writes([Z[b, i, j]])
with init() {
Z[b, i, j] = 0f32
}
Z[b, i, j] = (Z[b, i, j] + (cast(float32, X[b, i, k])*cast(float32, Y[b, k, j])))
}
}
}
}
}
start tuning with meta schedule ...
2024-02-26 17:55:30.579 INFO LocalBuilder: max_workers = 24
hhhhh
2024-02-26 17:55:30.837 INFO Logging directory: ./results/RTX3090/workloads/batch_GEMM_10_10_512_512_1024_N_N_tvm_ms_input_f16_acc_f32_output_f32_trials_1000/logs
2024-02-26 17:55:30.838 INFO Logging directory: ./results/RTX3090/workloads/batch_GEMM_10_10_512_512_1024_N_N_tvm_ms_input_f16_acc_f32_output_f32_trials_1000/logs
2024-02-26 17:55:30.838 INFO Working directory: ./results/RTX3090/workloads/batch_GEMM_10_10_512_512_1024_N_N_tvm_ms_input_f16_acc_f32_output_f32_trials_1000
2024-02-26 17:55:30.838 INFO Creating JSONDatabase. Workload at: ./results/RTX3090/workloads/batch_GEMM_10_10_512_512_1024_N_N_tvm_ms_input_f16_acc_f32_output_f32_trials_1000/database_workload.json. Tuning records at: ./results/RTX3090/workloads/batch_GEMM_10_10_512_512_1024_N_N_tvm_ms_input_f16_acc_f32_output_f32_trials_1000/database_tuning_record.json
2024-02-26 17:55:30.838 INFO Initializing Task #0: "main"
2024-02-26 17:55:30.868 INFO
ID | Name | FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Terminated
---------------------------------------------------------------------------------------------------------------
0 | main | 5368709120 | 1 | N/A | N/A | N/A | 0 |
---------------------------------------------------------------------------------------------------------------
Total trials: 0
Total latency (us): 0
2024-02-26 17:55:30.868 INFO Scheduler picks Task #0: "main"
2024-02-26 17:55:44.699 INFO Sending 64 sample(s) to builder
2024-02-26 17:55:52.089 INFO Sending 64 sample(s) to runner
/home/yangbai/anaconda3/lib/python3.10/site-packages/xgboost/compat.py:36: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
from pandas import MultiIndex, Int64Index
/home/yangbai/anaconda3/lib/python3.10/site-packages/xgboost/training.py:17: UserWarning: Old style callback is deprecated. See: https://xgboost.readthedocs.io/en/latest/python/callbacks.html
warnings.warn(f'Old style callback is deprecated. See: {link}', UserWarning)
2024-02-26 17:56:10.128 INFO [Updated] Task #0: "main"
ID | Name | FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Terminated
---------------------------------------------------------------------------------------------------------------
0 | main | 5368709120 | 1 | 30468.7195 | 176.2040 | 176.2040 | 64 |
---------------------------------------------------------------------------------------------------------------
Total trials: 64
Total latency (us): 176.204
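The four tvm_ms runs above differ only in `--TransA` and `--TransB`. The following sketch generates those command lines using only the flags that appear in this document. It prints the commands rather than running them, and the helper name `layout_sweep` is hypothetical.

```python
import itertools
import shlex


def layout_sweep(engine="tvm_ms", bs=10, m=512, k=1024, n=512):
    """Yield one bench.py command line per (TransA, TransB) combination."""
    for trans_a, trans_b in itertools.product("NT", repeat=2):
        argv = [
            "python", "bench.py",
            "--engine", engine, "--workload", "GEMM",
            "--BSA", str(bs), "--BSB", str(bs),
            "--m", str(m), "--k", str(k), "--n", str(n),
            "--TransA", trans_a, "--TransB", trans_b,
        ]
        yield shlex.join(argv)


for command in layout_sweep():
    print(command)
```

To execute a generated command instead of printing it, it could be passed to `subprocess.run(shlex.split(command), check=True)`.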
python bench.py --engine torch --workload GEMM --BSA 10 --BSB 10 --m 512 --k 1024 --n 512 --TransA T --TransB N
current engine is torch
batch_GEMM_10_10_512_512_1024_T_N_torch_input_f16_acc_f32_output_f32
./results/RTX3090/workloads
/home/yangbai/Desktop/github_yang/gemm-benchmark/utils/deviceQuery/deviceQuery
58.25422133909627 TFLOPS
python bench.py --engine triton --workload GEMM --BSA 10 --BSB 10 --m 512 --k 1024 --n 512 --TransA N --TransB N
current engine is triton
batch_GEMM_10_10_512_512_1024_N_N_triton_input_f16_acc_f32_output_f32
./results/RTX3090/workloads
/home/yangbai/Desktop/github_yang/gemm-benchmark/utils/deviceQuery/deviceQuery
✅ Triton and Torch match
70.84972664319672 TFLOPS
python bench.py --engine cutlass --workload GEMM --BSA 10 --BSB 10 --m 512 --k 1024 --n 512 --TransA N --TransB N
current engine is cutlass
batch_GEMM_10_10_512_512_1024_N_N_cutlass_input_f16_acc_f32_output_f32
./results/RTX3090/workloads
/home/yangbai/Desktop/github_yang/gemm-benchmark/utils/deviceQuery/deviceQuery
Running: GEMM-10-f32-f32
GEMM-10-f32-f32: 56.090625 TFLOPS
Full benchmark results have been written to ./results/RTX3090/workloads/batch_GEMM_10_10_512_512_1024_N_N_cutlass_input_f16_acc_f32_output_f32/GEMM-10-f32-f32.csv
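The engines report throughput in different units: the meta schedule table prints GFLOPS, while the torch, triton and cutlass runs print TFLOPS. The sketch below puts the figures from the logs above on one scale. The tvm_ms numbers are taken after the first 64 trials shown here, not from a finished tuning run, and the torch run uses a different layout (T/N) from the triton and cutlass runs (N/N), so this is a unit conversion rather than a like-for-like ranking.

```python
# (engine, TransA, TransB, value, unit) copied from the logs in this document.
RESULTS = [
    ("tvm_ms", "N", "T", 57412.0973, "GFLOPS"),
    ("tvm_ms", "T", "N", 26579.7810, "GFLOPS"),
    ("tvm_ms", "T", "T", 26706.9909, "GFLOPS"),
    ("tvm_ms", "N", "N", 30468.7195, "GFLOPS"),
    ("torch", "T", "N", 58.25422133909627, "TFLOPS"),
    ("triton", "N", "N", 70.84972664319672, "TFLOPS"),
    ("cutlass", "N", "N", 56.090625, "TFLOPS"),
]

# Multipliers that convert each reported unit to TFLOPS.
TO_TFLOPS = {"GFLOPS": 1e-3, "TFLOPS": 1.0}


def to_tflops(value, unit):
    """Convert a throughput figure to TFLOPS."""
    return value * TO_TFLOPS[unit]


for engine, trans_a, trans_b, value, unit in RESULTS:
    print(f"{engine:8s} {trans_a}/{trans_b}  {to_tflops(value, unit):7.2f} TFLOPS")
```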