[Feature]
MalaJeans opened this issue · 2 comments
MalaJeans commented
Hi @yaoyaoding,
Hidet is amazing and I have been learning it recently.
Are there any official tutorials for benchmarking matrix multiplication with Hidet? Thank you!
yaoyaoding commented
Hi @MalaJeans,
There is no operator-level benchmark suite in Hidet for now. In the meantime, you can use the following script to benchmark the performance of the matrix multiplication kernels generated by Hidet:
import hidet
import torch
import torch.backends.cuda

from hidet.graph.ops.definitions.matmul.matmul_f16 import matmul_f16
from hidet.graph.ops.definitions.matmul.batch_matmul import batch_matmul


def bench_matmul_f16():
    hidet.option.cache_dir('./outs/cache')  # see this cache dir for the generated kernels
    hidet.option.search_space(2)
    for m, k, n in [(1024, 1024, 1024), (1024, 768, 3072)]:
        print('Benchmarking {} x {} x {}'.format(m, k, n))
        aa = torch.randn(m, k, dtype=torch.float16, device='cuda')
        bb = torch.randn(k, n, dtype=torch.float16, device='cuda')
        print('torch: {:.3f} ms'.format(hidet.utils.benchmark_func(lambda: torch.matmul(aa, bb))))
        a = hidet.symbol([m, k], dtype='float16', device='cuda')
        b = hidet.symbol([k, n], dtype='float16', device='cuda')
        c = matmul_f16(a, b)
        print('hidet: {:.3f} ms'.format(c.op.latency()))


def bench_matmul_f32():
    hidet.option.cache_dir('./outs/cache')  # see this cache dir for the generated kernels
    hidet.option.search_space(2)
    torch.backends.cuda.matmul.allow_tf32 = True  # let torch use tf32 tensor cores
    for m, k, n in [(1024, 1024, 1024), (1024, 768, 3072)]:
        print('Benchmarking {} x {} x {}'.format(m, k, n))
        aa = torch.randn(m, k, dtype=torch.float32, device='cuda')
        bb = torch.randn(k, n, dtype=torch.float32, device='cuda')
        print('torch: {:.3f} ms'.format(hidet.utils.benchmark_func(lambda: torch.matmul(aa, bb))))
        a = hidet.symbol([1, m, k], dtype='float32', device='cuda')
        b = hidet.symbol([1, k, n], dtype='float32', device='cuda')
        c = batch_matmul(a, b, mma='mma')
        print('hidet: {:.3f} ms'.format(c.op.latency()))


def main():
    print('float16')
    bench_matmul_f16()
    print()
    print('float32')
    bench_matmul_f32()


if __name__ == '__main__':
    main()
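For context, `hidet.utils.benchmark_func` in the script above takes care of warmup, repetition, and CUDA synchronization for you. As a rough illustration of the idea (this is a hypothetical sketch, not Hidet's actual implementation), a warmup-then-median timer looks like:

```python
import time

def bench_ms(fn, warmup=5, repeat=20):
    """Return the median latency of fn() in milliseconds after a warmup phase.

    Note: for CUDA kernels the device must also be synchronized
    (e.g. torch.cuda.synchronize()) before reading the clock.
    """
    for _ in range(warmup):
        fn()  # warmup runs: JIT compilation, caches, clocks ramping up
    samples = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return samples[len(samples) // 2]  # median is robust to outliers
```

The median (rather than the mean) avoids being skewed by occasional slow runs caused by OS scheduling noise.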
On my workstation equipped with an RTX 3090, I got the following numbers:
float16
Benchmarking 1024 x 1024 x 1024
torch: 0.043 ms
hidet: 0.037 ms
Benchmarking 1024 x 768 x 3072
torch: 0.092 ms
hidet: 0.078 ms
float32
Benchmarking 1024 x 1024 x 1024
torch: 0.082 ms
hidet: 0.103 ms
Benchmarking 1024 x 768 x 3072
torch: 0.185 ms
hidet: 0.188 ms
(PyTorch uses cuBLAS kernels; both Hidet and PyTorch used tensor cores.)
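As a sanity check on numbers like these, a latency can be converted to achieved TFLOPS: an m x k x n matmul performs 2*m*k*n floating-point operations (one multiply and one add per inner-product term). The helper below is just illustrative arithmetic, not part of Hidet:

```python
def achieved_tflops(m: int, k: int, n: int, latency_ms: float) -> float:
    """TFLOPS achieved by an m x k x n matmul taking latency_ms milliseconds."""
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    return flops / (latency_ms * 1e-3) / 1e12

# The hidet fp16 result above (1024 x 1024 x 1024 in 0.037 ms)
# works out to roughly 58 TFLOPS.
print(f'{achieved_tflops(1024, 1024, 1024, 0.037):.1f} TFLOPS')
```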
MalaJeans commented
Hi @yaoyaoding,
Thank you for taking the time to reply despite your busy schedule.
I will give it a try.