HITsz-TMG/UMOE-Scaling-Unified-Multimodal-LLMs

Error when running demo.py

kevinkhanhvu opened this issue · 3 comments

When I run demo.py on a single H100 (80 GB), I get the error below while the model is loading. I have downloaded all the required models and installed all dependencies. Could you please help me look into this? @longyuewangdcu @eltociear @YanshekWoo @imryanxu @expapa

While copying the parameter named "base_model.model.model.layers.30.mlp.experts.3.down_proj.lora_B.default.weight", whose dimensions in the model are torch.Size([4096, 8]) and whose dimensions in the checkpoint are torch.Size([4096, 8]), an exception occurred : ('CUDA error: no kernel image is available for execution on the device\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n',).
While copying the parameter named "base_model.model.model.layers.30.mlp.gate.lora_A.default.weight", whose dimensions in the model are torch.Size([8, 4096]) and whose dimensions in the checkpoint are torch.Size([8, 4096]), an exception occurred : ('CUDA error: no kernel image is available for execution on the device\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n',).
While copying the parameter named "base_model.model.model.layers.30.mlp.gate.lora_B.default.weight", whose dimensions in the model are torch.Size([4, 8]) and whose dimensions in the checkpoint are torch.Size([4, 8]), an exception occurred : ('CUDA error: no kernel image is available for execution on the device\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n',).
While copying the parameter named "base_model.model.model.layers.31.self_attn.q_proj.lora_A.default.weight", whose dimensions in the model are torch.Size([8, 4096]) and whose dimensions in the checkpoint are torch.Size([8, 4096]), an exception occurred : ('CUDA error: no kernel image is available for execution on the device\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n',).
While copying the parameter named "base_model.model.model.layers.31.self_attn.q_proj.lora_B.default.weight", whose dimensions in the model are torch.Size([4096, 8]) and whose dimensions in the checkpoint are torch.Size([4096, 8]), an exception occurred : ('CUDA error: no kernel image is available for execution on the device\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n',).
While copying the parameter named "base_model.model.model.layers.31.self_attn.k_proj.lora_A.default.weight", whose dimensions in the model are torch.Size([8, 4096]) and whose dimensions in the checkpoint are torch.Size([8, 4096]), an exception occurred : ('CUDA error: no kernel image is available for execution on the device\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n',).
While copying the parameter named "base_model.model.model.layers.31.self_attn.k_proj.lora_B.default.weight", whose dimensions in the model are torch.Size([4096, 8]) and whose dimensions in the checkpoint are torch.Size([4096, 8]), an exception occurred : ('CUDA error: no kernel image is available for execution on the device\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n',).
While copying the parameter named "base_model.model.model.layers.31.self_attn.v_proj.lora_A.default.weight", whose dimensions in the model are torch.Size([8, 4096]) and whose dimensions in the checkpoint are torch.Size([8, 4096]), an exception occurred : ('CUDA error: no kernel image is available for execution on the device\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n',).
While copying the parameter named "base_model.model.model.layers.31.self_attn.v_proj.lora_B.default.weight", whose dimensions in the model are torch.Size([4096, 8]) and whose dimensions in the checkpoint are torch.Size([4096, 8]), an exception occurred : ('CUDA error: no kernel image is available for execution on the device\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n',).
While copying the parameter named "base_model.model.model.layers.31.self_attn.o_proj.lora_A.default.weight", whose dimensions in the model are torch.Size([8, 4096]) and whose dimensions in the checkpoint are torch.Size([8, 4096]), an exception occurred : ('CUDA error: no kernel image is available for execution on the device\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n',).
While copying the parameter named "base_model.model.model.layers.31.self_attn.o_proj.lora_B.default.weight", whose dimensions in the model are torch.Size([4096, 8]) and whose dimensions in the checkpoint are torch.Size([4096, 8]), an exception occurred : ('CUDA error: no kernel image is available for execution on the device\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n',).

We are checking the code to find and fix the cause quickly. Could you tell me which model versions you are running, for example Uni-MoE-speech-base-interval and Uni-MoE-speech-v1.5 as suggested in demo.py?

Thank you so much for posting this issue. The demo was not working properly because of some problems in the code (these have now been fixed and the code has been updated). However, the error you encountered may not be related to the code itself; it looks like a mismatch between your PyTorch version and your CUDA version. Could you please check this?
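For anyone hitting the same message, here is a minimal sketch of one way to check for that mismatch. "No kernel image is available" usually means the installed PyTorch binary was not compiled for the GPU's compute capability (9.0 on an H100). The helper name `build_supports` is hypothetical and its matching rule is a simplification (it ignores details such as architecture-specific `sm_90a` targets); `torch.version.cuda`, `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list` are standard PyTorch calls.

```python
import re


def build_supports(capability, arch_list):
    """Rough check: does a build listing `arch_list` cover a GPU of `capability`?

    Hypothetical helper, simplified on purpose.
    """
    major, minor = capability
    for arch in arch_list:
        m = re.fullmatch(r"(sm|compute)_(\d+)[a-z]?", arch)
        if not m:
            continue  # skip entries in an unexpected format
        a_major, a_minor = divmod(int(m.group(2)), 10)
        if m.group(1) == "sm":
            # Binary kernels run on the same major architecture,
            # with the same or a newer minor revision.
            if a_major == major and a_minor <= minor:
                return True
        else:
            # PTX ("compute_XY") can be JIT-compiled for the same
            # or any newer architecture.
            if (a_major, a_minor) <= (major, minor):
                return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
        if torch.cuda.is_available():
            cap = torch.cuda.get_device_capability(0)
            archs = torch.cuda.get_arch_list()
            print("GPU:", torch.cuda.get_device_name(0), "| capability:", cap)
            print("architectures in this build:", archs)
            print("build covers this GPU:", build_supports(cap, archs))
        else:
            print("CUDA is not available to this PyTorch build")
```

If the last line prints `False` on the H100, the likely fix is to install a PyTorch build whose architecture list includes `sm_90`, which requires a build against CUDA 11.8 or newer.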

Thanks @expapa. I only installed the dependencies listed in env.txt; I'll check the CUDA and torch versions again!