qwopqwop200/GPTQ-for-LLaMa

Errors encountered when running the FP16 baseline benchmark on multiple GPUs

foamliu opened this issue · 2 comments

Trying to run the FP16 baseline benchmark for the LLaMA 30B model on a server with 8 V100 32GB GPUs:

CUDA_VISIBLE_DEVICES=0,1 python llama.py /dev/shm/ly/models/hf_converted_llama/30B/ wikitext2 --benchmark 2048 --check

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:49<00:00, 7.09s/it]
Using the latest cached version of the module from /home/ly/.cache/huggingface/modules/datasets_modules/datasets/wikitext/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126 (last modified on Tue Apr 11 15:29:08 2023) since it couldn't be found locally at wikitext., or remotely on the Hugging Face Hub.
Found cached dataset wikitext (/home/ly/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
Using the latest cached version of the module from /home/ly/.cache/huggingface/modules/datasets_modules/datasets/wikitext/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126 (last modified on Tue Apr 11 15:29:08 2023) since it couldn't be found locally at wikitext., or remotely on the Hugging Face Hub.
Found cached dataset wikitext (/home/ly/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
Benchmarking ...
Traceback (most recent call last):
  File "/home/ly/GPTQ-for-LLaMa/llama.py", line 492, in <module>
    benchmark(model, input_ids, check=args.check)
  File "/home/ly/GPTQ-for-LLaMa/llama.py", line 411, in benchmark
    out = model(input_ids[:, i:i + 1], past_key_values=cache['past'], attention_mask=attention_mask[:, :(i + 1)].reshape((1, -1)))
  File "/home/ly/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ly/.local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "/home/ly/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ly/.local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 577, in forward
    layer_outputs = decoder_layer(
  File "/home/ly/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/ly/GPTQ-for-LLaMa/llama.py", line 351, in forward
    tmp = self.module(*inp, **kwargs)
  File "/home/ly/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ly/.local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/ly/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ly/.local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 202, in forward
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
  File "/home/ly/.local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 134, in apply_rotary_pos_emb
    cos = cos[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:1)
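The failure appears to be a device mismatch: position_ids is created on the benchmark's default device, while the rotary cos/sin caches live on whichever GPU the sharded layer was placed on. A minimal, hypothetical workaround sketch (not the repo's own fix) is to monkeypatch apply_rotary_pos_emb so the index tensor follows the indexed tensor's device:

# Hypothetical workaround sketch: keep the rotary-embedding index tensor on
# the same device as the cos/sin cache it indexes into. Assumes the
# transformers LLaMA implementation shown in the traceback above.
import transformers.models.llama.modeling_llama as modeling_llama

_orig_apply_rotary_pos_emb = modeling_llama.apply_rotary_pos_emb

def _patched_apply_rotary_pos_emb(q, k, cos, sin, position_ids):
    # cos[position_ids] raises if position_ids sits on a different GPU,
    # so move the indices to cos's device before delegating.
    return _orig_apply_rotary_pos_emb(q, k, cos, sin, position_ids.to(cos.device))

modeling_llama.apply_rotary_pos_emb = _patched_apply_rotary_pos_emb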

accelerate nailed it:
[screenshot]
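For context, a minimal sketch of letting accelerate handle placement with the stock transformers API (the max_memory limits here are hypothetical, and this bypasses llama.py's own loading path); accelerate's dispatch hooks then move activations between GPUs automatically:

# Sketch only: shard the HF-converted LLaMA 30B across two GPUs via
# device_map="auto" and let accelerate's hooks handle cross-device moves.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/dev/shm/ly/models/hf_converted_llama/30B/"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "30GiB", 1: "30GiB"},  # hypothetical per-GPU limits
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

inputs = tokenizer("Hello", return_tensors="pt").to(0)
with torch.no_grad():
    out = model(**inputs)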

Still not working. After replacing the code as mentioned, I got the following error:
[screenshot]