facebookincubator/AITemplate

multi-GPU runtime error

ecilay opened this issue · 4 comments

ecilay commented

Say I have two AIT-converted models: model0 on cuda0 and model1 on cuda1.
Even though I used cudaSetDevice to load each model onto its own device, at runtime, after running inference with model0 on cuda0, model1 fails to run. Once I move both models onto the same device, the problem goes away.

Is this expected? Is there a possible short-term fix? I ran the experiment on A10G with 4 GPUs.
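For concreteness, the setup looks roughly like this (a minimal sketch, not my exact code: the module paths, shapes, and dtypes are placeholders; only `Model` and `run_with_tensors` come from the AITemplate Python runtime):

```python
import torch
from aitemplate.compiler import Model

# Load each compiled AIT module with its target device active.
torch.cuda.set_device(0)
model0 = Model("model0/test.so")
torch.cuda.set_device(1)
model1 = Model("model1/test.so")

# Inputs/outputs live on the same device as the module that runs them.
x0 = torch.randn(1, 3, 224, 224, device="cuda:0").half()
y0 = torch.empty(1, 1000, device="cuda:0").half()
x1 = torch.randn(1, 3, 224, 224, device="cuda:1").half()
y1 = torch.empty(1, 1000, device="cuda:1").half()

model0.run_with_tensors([x0], [y0])  # runs fine on cuda0
model1.run_with_tensors([x1], [y1])  # fails after the cuda0 run
```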

chenyang78 commented

Hi @ecilay, thanks for reporting the issue. What's the error message that you got?

ecilay commented

File "/home/test/runtime/runtime/ait/eps_ait.py", line 485, in __call__ return self.forward(
File "/home/test/runtime/runtime/ait/eps_ait.py", line 791, in forward noise_pred = self.dispatch_resolution_forward(inputs)
File "/home/test/runtime/runtime/ait/eps_ait.py", line 890, in dispatch_resolution_forward cur_engines[f"{h}x{w}"].run_with_tensors(inputs, ys, graph_mode=False)
File "/opt/conda/envs/test/lib/python3.10/site-packages/aitemplate/compiler/model.py", line 587, in run_with_tensors outputs_ait = self.run(
File "/opt/conda/envs/test/lib/python3.10/site-packages/aitemplate/compiler/model.py", line 490, in run return self._run_impl(
File "/opt/conda/envs/test/lib/python3.10/site-packages/aitemplate/compiler/model.py", line 429, in _run_impl self.DLL.AITemplateModelContainerRun(
File "/opt/conda/envs/test/lib/python3.10/site-packages/aitemplate/compiler/model.py", line 196, in _wrapped_func raise RuntimeError(f"Error in function: {method.__name__}")
RuntimeError: Error in function: AITemplateModelContainerRun

chenyang78 commented

Thanks, @ecilay! Hmm, I don't have any clue yet. If possible, could you share a small repro that would help us investigate? Thanks!

ecilay commented

@chenyang78 I think you can repro this by taking any two AIT models (they could even be the same model), loading them on different GPUs, and running inference to see if it works. If it does, I would appreciate you sharing your inference script, thanks.
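In the meantime, one thing that might be worth trying (an untested guess on my part, reusing `model0`/`model1` and the tensors from the sketch above): make each model's device current immediately before each run, in case the runtime binds to whatever CUDA context is active at call time:

```python
import torch

def run_on_device(model, device_index, inputs, outputs):
    # Make this model's device current for the calling thread before
    # dispatching, so the AIT runtime sees the intended CUDA context.
    with torch.cuda.device(device_index):
        model.run_with_tensors(inputs, outputs, graph_mode=False)

run_on_device(model0, 0, [x0], [y0])
run_on_device(model1, 1, [x1], [y1])
```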