Lightning-AI/lit-llama

LORA: RuntimeError: GET was unable to find an engine to execute this computation

LamOne1 opened this issue · 1 comment

Hello,

I ran the adapter_v2 finetuning code without any issue, but the LoRA code fails in the same environment, without any changes to the code.

[rank: 0] Global seed set to 1337

Traceback (most recent call last):
  File ".../finetune/lora.py", line 218, in <module>
    CLI(main)
  File ".../.conda/envs/llm2/lib/python3.9/site-packages/jsonargparse/cli.py", line 85, in CLI
    return _run_component(component, cfg_init)
  File ".../.conda/envs/llm2/lib/python3.9/site-packages/jsonargparse/cli.py", line 147, in _run_component
    return component(**cfg)
  File ".../finetune/lora.py", line 79, in main
    train(fabric, model, optimizer, train_data, val_data, out_dir)
  File ".../finetune/lora.py", line 112, in train
    logits = model(input_ids)
  File ".../.conda/envs/llm2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File ".../.conda/envs/llm2/lib/python3.9/site-packages/lightning-2.1.0.dev0-py3.9.egg/lightning/fabric/wrappers.py", line 116, in forward
    output = self._forward_module(*args, **kwargs)
  File ".../.conda/envs/llm2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File ".../.conda/envs/llm2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File ".../.conda/envs/llm2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File ".../.conda/envs/llm2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File ".../lit_llama/model.py", line 105, in forward
    x, _ = block(x, rope, mask, max_seq_length)
  File ".../.conda/envs/llm2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File ".../lit_llama/model.py", line 164, in forward
    h, new_kv_cache = self.attn(self.rms_1(x), rope, mask, max_seq_length, input_pos, kv_cache)
  File ".../.conda/envs/llm2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File ".../lit_llama/model.py", line 196, in forward
    q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
  File ".../.conda/envs/llm2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File ".../lit_llama/lora.py", line 318, in forward
    after_B = F.conv1d(
RuntimeError: GET was unable to find an engine to execute this computation
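For context, the call that fails at lit_llama/lora.py line 318 is a grouped `F.conv1d` used to apply the LoRA B matrices. Assuming it follows the upstream LoRA `MergedLinear` pattern (kernel size 1, one group per enabled projection), that convolution is mathematically just an independent matmul per group, so cuDNN is not strictly required for it. A minimal NumPy sketch of that equivalence (function and variable names are mine, not from the repo):

```python
import numpy as np

def grouped_conv1d_k1(x, w, groups):
    """Grouped 1D convolution with kernel size 1, written as per-group matmuls.

    x: (batch, in_ch, length) activations
    w: (out_ch, in_ch // groups, 1) conv weights, kernel size 1
    """
    b, in_ch, length = x.shape
    out_ch = w.shape[0]
    # Split channels into groups: each output group only sees its input group.
    xg = x.reshape(b, groups, in_ch // groups, length)
    wg = w[..., 0].reshape(groups, out_ch // groups, in_ch // groups)
    # Per-group matmul: (out_pg, in_pg) x (b, in_pg, L) -> (b, out_pg, L)
    out = np.einsum("goi,bgil->bgol", wg, xg)
    return out.reshape(b, out_ch, length)
```

This is only meant to show why the operation itself is simple; whether replacing the conv with matmuls avoids the error in lit-llama would need to be verified against the actual code.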

The issue was fixed by switching the GPU from an A100 to a V100.
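That the error is GPU-dependent suggests cuDNN failing to find a convolution engine for this dtype/device combination, so the installed torch / CUDA / cuDNN versions are worth checking before swapping hardware. A small helper to collect that information (the function name is mine; the `torch` attributes it reads are standard PyTorch APIs):

```python
import importlib.util

def cuda_stack_report():
    """Collect version info useful when cuDNN cannot select a conv engine."""
    if importlib.util.find_spec("torch") is None:
        return {"torch": None}  # torch not installed in this environment
    import torch
    return {
        "torch": torch.__version__,
        "cuda": torch.version.cuda,
        "cudnn": torch.backends.cudnn.version(),
        "device": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
    }

print(cuda_stack_report())
```

As a further experiment (untested here), disabling cuDNN with `torch.backends.cudnn.enabled = False` before training would show whether the failure is specific to cuDNN's engine selection on the A100.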