[Feature]
MalaJeans opened this issue · 2 comments
MalaJeans commented
Hi @yaoyaoding,
Hidet is amazing and I have been learning it recently.
Are there any official tutorials for benchmarking matrix multiplication with Hidet? Thank you!
yaoyaoding commented
Hi @MalaJeans,
There is no operator-level benchmark suite in Hidet for now. In the meantime, you can use the following script to benchmark the performance of the matrix multiplication kernels generated by Hidet:
import hidet
import torch
import torch.backends.cuda

from hidet.graph.ops.definitions.matmul.matmul_f16 import matmul_f16
from hidet.graph.ops.definitions.matmul.batch_matmul import batch_matmul


def bench_matmul_f16():
    hidet.option.cache_dir('./outs/cache')  # see this cache dir for the generated kernels
    hidet.option.search_space(2)
    for m, k, n in [(1024, 1024, 1024), (1024, 768, 3072)]:
        print('Benchmarking {} x {} x {}'.format(m, k, n))
        aa = torch.randn(m, k, dtype=torch.float16, device='cuda')
        bb = torch.randn(k, n, dtype=torch.float16, device='cuda')
        print('torch: {:.3f} ms'.format(hidet.utils.benchmark_func(lambda: torch.matmul(aa, bb))))
        a = hidet.symbol([m, k], dtype='float16', device='cuda')
        b = hidet.symbol([k, n], dtype='float16', device='cuda')
        c = matmul_f16(a, b)
        print('hidet: {:.3f} ms'.format(c.op.latency()))


def bench_matmul_f32():
    hidet.option.cache_dir('./outs/cache')  # see this cache dir for the generated kernels
    hidet.option.search_space(2)
    torch.backends.cuda.matmul.allow_tf32 = True  # let torch use tf32 tensor cores
    for m, k, n in [(1024, 1024, 1024), (1024, 768, 3072)]:
        print('Benchmarking {} x {} x {}'.format(m, k, n))
        aa = torch.randn(m, k, dtype=torch.float32, device='cuda')
        bb = torch.randn(k, n, dtype=torch.float32, device='cuda')
        print('torch: {:.3f} ms'.format(hidet.utils.benchmark_func(lambda: torch.matmul(aa, bb))))
        a = hidet.symbol([1, m, k], dtype='float32', device='cuda')
        b = hidet.symbol([1, k, n], dtype='float32', device='cuda')
        c = batch_matmul(a, b, mma='mma')
        print('hidet: {:.3f} ms'.format(c.op.latency()))


def main():
    print('float16')
    bench_matmul_f16()
    print()
    print('float32')
    bench_matmul_f32()


if __name__ == '__main__':
    main()
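For context, `hidet.utils.benchmark_func` in the script above takes care of warmup, repetition, and CUDA synchronization for you. As a rough illustration of the idea (this is a hypothetical sketch, not Hidet's actual implementation), a warmup-then-median timer looks like:

```python
import time

def bench_ms(fn, warmup=5, repeat=20):
    """Return the median latency of fn() in milliseconds after a warmup phase.

    Note: for CUDA kernels the device must also be synchronized
    (e.g. torch.cuda.synchronize()) before reading the clock.
    """
    for _ in range(warmup):
        fn()  # warmup runs: JIT compilation, caches, clocks ramping up
    samples = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return samples[len(samples) // 2]  # median is robust to outliers
```

The median (rather than the mean) avoids being skewed by occasional slow runs caused by OS scheduling noise.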
On my workstation equipped with an RTX 3090, I got the following numbers:
float16
Benchmarking 1024 x 1024 x 1024
torch: 0.043 ms
hidet: 0.037 ms
Benchmarking 1024 x 768 x 3072
torch: 0.092 ms
hidet: 0.078 ms
float32
Benchmarking 1024 x 1024 x 1024
torch: 0.082 ms
hidet: 0.103 ms
Benchmarking 1024 x 768 x 3072
torch: 0.185 ms
hidet: 0.188 ms
(PyTorch uses cuBLAS kernels; both Hidet and PyTorch used tensor cores.)
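As a sanity check on numbers like these, a latency can be converted to achieved TFLOPS: an m x k x n matmul performs 2*m*k*n floating-point operations (one multiply and one add per inner-product term). The helper below is just illustrative arithmetic, not part of Hidet:

```python
def achieved_tflops(m: int, k: int, n: int, latency_ms: float) -> float:
    """TFLOPS achieved by an m x k x n matmul taking latency_ms milliseconds."""
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    return flops / (latency_ms * 1e-3) / 1e12

# The hidet fp16 result above (1024 x 1024 x 1024 in 0.037 ms)
# works out to roughly 58 TFLOPS.
print(f'{achieved_tflops(1024, 1024, 1024, 0.037):.1f} TFLOPS')
```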
MalaJeans commented
Hi @yaoyaoding,
Thank you for taking the time to reply despite your busy schedule.
I will give it a try.