hidet-org/hidet

[Bug] Some hidet tensor methods do not support symbolic tensors?

eric8607242 opened this issue · 4 comments

Hi, thanks for the great work!

I am wondering why some hidet tensor methods (e.g., to, cuda, and cpu) do not support symbolic tensors.

import torch
from torch import nn
import hidet


class TestMode(nn.Module):
    def __init__(self):
        super().__init__()

        self.conv = nn.Linear(10, 10)

    def forward(self, x):
        z = x.unsqueeze(0).expand(4, 4, 512).to(torch.device("cuda"))
        return z


if __name__ == "__main__":
    model = TestMode()
    model = model.eval().half()
    device = torch.device("cuda")
    model = model.to(device)
    hidet.torch.dynamo_config.search_space(2)
    hidet.torch.dynamo_config.use_fp16()
    model_opt = torch.compile(model, backend='hidet')

    # shape (4, 512) so that unsqueeze(0).expand(4, 4, 512) is a valid expand
    tokens = torch.zeros(4, 512).cuda()
    model_opt(tokens)

In the above test case, the following exception is raised:

NotImplementedError: hidet: Tensor.to(..., device=...) is not supported for symbolic tensors., occurred when calling tensor_to(Tensor(shape=(4, 4, 512), dtype='bool', device='cuda:0'), device(type='cuda'))

I think .to(device) is a common operation in deep learning models, e.g., in the Hugging Face LLaMA implementation.

Are there any concerns or limitations regarding these operations under symbolic tracing?
Looking forward to your response. Thanks!

Hi @eric8607242,

Thanks for bringing this up. We have partially fixed this issue in #214. With this PR, we can run your example:

import torch
from torch import nn
import hidet


class TestMode(nn.Module):
    def __init__(self):
        super().__init__()

        self.conv = nn.Linear(10, 10)

    def forward(self, x):
        # x arrives on cuda, so this .to() is a same-device cast (supported)
        z = x.unsqueeze(0).expand(4, 4, 512).to(torch.device("cuda"))
        return z


if __name__ == "__main__":
    model = TestMode()
    model = model.eval().half()
    device = torch.device("cpu")
    model = model.to(device)
    hidet.torch.dynamo_config.search_space(2)
    hidet.torch.dynamo_config.use_fp16()
    model_opt = torch.compile(model, backend='hidet')

    tokens = torch.zeros(4, 512).cuda()
    model_opt(tokens)

The limitation is: a tensor that depends on the model input (e.g., x.unsqueeze(0).expand(4, 4, 512) in your example) can only be cast to the device it already resides on, whether via .cuda(), .cpu(), or .to(device=...). Weight tensors do not have this limitation.
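To make this boundary concrete, here is a minimal sketch (the module names are hypothetical; the behavior is as described above for the post-#214 backend): a same-device cast on an input-derived tensor compiles, while a cross-device cast is expected to raise NotImplementedError.

import torch
import hidet  # importing hidet registers the 'hidet' backend for torch.compile


class SameDeviceCast(torch.nn.Module):
    def forward(self, x):
        # supported: x is already on cuda, so this cast stays on the same device
        return x.unsqueeze(0).to(torch.device("cuda"))


class CrossDeviceCast(torch.nn.Module):
    def forward(self, x):
        # not supported: an input-derived tensor cannot move cuda -> cpu
        return x.unsqueeze(0).to(torch.device("cpu"))


x = torch.zeros(4, 512).cuda()
torch.compile(SameDeviceCast(), backend='hidet')(x)   # expected to run
torch.compile(CrossDeviceCast(), backend='hidet')(x)  # expected: NotImplementedError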

See the tests for more examples of what is and is not supported.

Hi @yaoyaoding,

Thanks for your kind response and quick fix. It is very helpful.

Sorry for two more silly questions.
Do you mean that if a model input is on the cpu, then we cannot cast the input to cuda with .cuda() or .to(torch.device("cuda"))?
Why is there such a limitation? Big thanks for your help!

Hi @eric8607242,

Yes, it is exactly as you said, and that is a good question.

This is a temporary limitation of our current IR and runtime system. The direct reason is that we do not have an operator like "to_device". We also do not yet have a C++ runtime; instead, we rely on CUDA graphs to get rid of the framework-level overhead, and it is not trivial to capture both CPU kernels and GPU kernels in the same CUDA graph. So, until we have an efficient C++ runtime, we will not support mixing CPU and GPU kernels in a single computation graph.
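In practice, this means device transfers should happen outside the compiled region. Here is a minimal sketch of that workaround (assuming model and tokens are defined as in the examples above; standard PyTorch calls only):

import torch
import hidet  # registers the 'hidet' backend for torch.compile

# put the whole model on cuda before compiling
model_opt = torch.compile(model.cuda().eval(), backend='hidet')

# do the cpu -> cuda transfer outside the compiled region, so that every
# kernel hidet captures runs on a single device
tokens = tokens.cuda()
out = model_opt(tokens)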

Of course, if there are important DNNs that rely on this feature, we would be happy to give it a higher priority. Currently, we are focusing on dynamic shape support.

Hi @yaoyaoding,

Thanks for the very clear answer.
I have no more questions, and the issue is solved as well.

Thanks again for this amazing work.
Closing the issue.