hidet-org/hidet

Hidet .cpu() function is slow

hsonetta opened this issue · 5 comments

After I compile a torch model using hidet, calling .cpu() on the output tensor is slow compared to calling .cpu() on the original torch model's output tensor. This negates the speedup achieved by the hidet-compiled model. What's the reason, and is there a workaround? Thanks

import time
import torch

with torch.no_grad():
    y1, y2 = pytorchModel(x)
start = time.time()
y1 = y1.sigmoid().cpu()
end = time.time()
print("Pytorch time: ", end - start)

hidetModel = torch.compile(pytorchModel, backend='hidet')
x1, x2 = hidetModel(x)
start1 = time.time()
x1 = x1.sigmoid().cpu()
end1 = time.time()
print("Hidet time: ", end1 - start1)

Output:

Pytorch time:  0.008183717727661133
[2023-11-05 17:26:34,578] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing forward
[2023-11-05 17:26:35,532] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo done tracing forward (RETURN_VALUE)
[2023-11-05 17:26:35,544] torch._dynamo.output_graph: [INFO] Step 2: calling compiler function hidet_backend
[2023-11-05 17:27:15,066] torch._dynamo.output_graph: [INFO] Step 2: done compiler function hidet_backend
Hidet time:  2.328554630279541

Hi @hsonetta,

Can I know the shape and data type of y1 (aka x1 in the hidet case), as well as the GPU you are using?

Besides, could you also add a synchronization after the model execution, like

...
x1, x2 = hidetModel(x)
torch.cuda.synchronize()  # <---
start1 = time.time()
x1 = x1.sigmoid().cpu()
end1 = time.time()

for both the torch model and the hidet-compiled model, and benchmark again?
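A complete sketch of that pattern (assuming the pytorchModel, hidetModel, and x from your snippet; time_to_cpu is just an illustrative helper):

import time

import torch


def time_to_cpu(model, x, label):
    with torch.no_grad():
        y1, y2 = model(x)
    torch.cuda.synchronize()   # wait for all asynchronously launched kernels to finish
    start = time.time()
    out = y1.sigmoid().cpu()   # now only the sigmoid and the device-to-host copy are timed
    end = time.time()
    print(label, "time:", end - start)


time_to_cpu(pytorchModel, x, "Pytorch")
time_to_cpu(hidetModel, x, "Hidet")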

I did a small experiment on my machine and did not find much difference:

import time
import hidet
import torch


def main():
    shape = [4096, 4096]
    # create a tensor with hidet's memory allocator
    a = hidet.randn(shape, device='cuda')
    # convert to torch tensor by dlpack protocol (zero copy)
    b = torch.from_dlpack(a)
    # create a torch tensor with torch's memory allocator 
    c = torch.randn(*shape, device='cuda')

    for x in [b, c]:
        t1 = time.time()
        x.sigmoid().cpu()
        t2 = time.time()
        print(t2 - t1)
    # output on RTX 3090
    # 0.034680843353271484
    # 0.0343472957611084

if __name__ == '__main__':
    main()
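For context: CUDA kernels are launched asynchronously, and .cpu() (like torch.cuda.synchronize()) blocks until every queued kernel has finished, so an unsynchronized timer charges all previously queued work to whichever call happens to synchronize. A standalone illustration of this effect:

import time

import torch

x = torch.randn(4096, 4096, device='cuda')

t1 = time.time()
y = x.sigmoid()             # the launch is asynchronous; this returns almost immediately
t2 = time.time()
torch.cuda.synchronize()    # block until the kernel actually finishes
t3 = time.time()

print('launch only:     ', t2 - t1)   # tiny: just the enqueue cost
print('launch + finish: ', t3 - t1)   # includes the real kernel execution time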

I understand now that the .cpu() call waits for the pending kernels to finish. However, I then correctly used the benchmark script from the hidet quickstart guide page, and the results still weren't convincing. Do you know why? Code and GPU details below:

import hidet
import torch

# example input (script adapted from the hidet quickstart guide)
x = torch.randn(1, 3, 1024, 1024).cuda()

# enable kernel tuning (search space 1)
hidet.torch.dynamo_config.search_space(1)           # <---

# optimize the model with 'hidet' backend
model_opt = torch.compile(pytorchModel, backend='hidet')

# run the optimized model
y1 = model_opt(x)
y2 = pytorchModel(x)

# check the correctness
torch.testing.assert_close(actual=y1, expected=y2, rtol=1e-2, atol=1e-2)


# benchmark the performance
for name, model in [('eager', pytorchModel), ('hidet', model_opt)]:
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start_event.record()
    for _ in range(100):
        y1, y2 = model(x)
    end_event.record()
    torch.cuda.synchronize()
    print('{:>10}: {:.3f} ms'.format(name, start_event.elapsed_time(end_event) / 100.0))

Output:
eager: 52.849 ms
hidet: 155.792 ms

The same script with the tuning knob changed to search space 2:

hidet.torch.dynamo_config.search_space(2)           # <---

Output:
eager: 51.189 ms
hidet: 75.518 ms

GPU:
NVIDIA GeForce RTX 2060
CUDA Version: 12.2

Hi @hsonetta,

It is possible that hidet is slower than PyTorch on some models/operators that have already been highly optimized by PyTorch and vendor libraries (e.g., cuDNN, cuBLAS) but that we have not paid much attention to. Thus, we cannot do much before we know your model and have the hardware on hand to optimize for.

It may be worth trying the steps in #368 (comment) to find which operators are slow and then optimizing them.
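One generic way to do that (this uses torch.profiler rather than the exact steps from #368) is to break the compiled model's time down by operator; a minimal sketch, assuming the model_opt and x from the benchmark above:

import torch
from torch.profiler import profile, ProfilerActivity

# profile a few iterations of the hidet-compiled model
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        model_opt(x)
    torch.cuda.synchronize()

# print the operators/kernels with the largest total CUDA time
print(prof.key_averages().table(sort_by='cuda_time_total', row_limit=10))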