Cannot copy out of meta tensor; no data!
Closed this issue · 2 comments
Gooooooogo commented
When I run `litgpt finetune lora --data Alpaca`, I get the following error:
{'checkpoint_dir': PosixPath('checkpoints/TinyLlama/TinyLlama-1.1B-Chat-v1.0'),
'data': Alpaca(mask_prompt=False, val_split_fraction=0.03865, prompt_style=<litgpt.prompts.Alpaca object at 0x7f1976ff0d00>, ignore_index=-100, seed=42, num_workers=4, download_dir=PosixPath('data/alpaca')),
'devices': 3,
'eval': EvalArgs(interval=100, max_new_tokens=100, max_iters=100, initial_validation=False),
'logger_name': 'csv',
'lora_alpha': 16,
'lora_dropout': 0.05,
'lora_head': False,
'lora_key': False,
'lora_mlp': False,
'lora_projection': False,
'lora_query': True,
'lora_r': 8,
'lora_value': True,
'out_dir': PosixPath('out/finetune/lora'),
'precision': None,
'quantize': None,
'seed': 1337,
'train': TrainArgs(save_interval=1000, log_interval=1, global_batch_size=16, micro_batch_size=1, lr_warmup_steps=100, lr_warmup_fraction=None, epochs=1, max_tokens=None, max_steps=None, max_seq_length=None, tie_embeddings=None, learning_rate=0.0003, weight_decay=0.02, beta1=0.9, beta2=0.95, max_norm=None, min_lr=6e-05)}
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/3
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/3
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/3
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 3 processes
----------------------------------------------------------------------------------------------------
[rank: 0] Seed set to 1337
[rank: 2] Seed set to 1337
[rank: 1] Seed set to 1337
Number of trainable parameters: 1,126,400
Number of non-trainable parameters: 1,100,048,384
The longest sequence length in the train data is 1305, the model's maximum sequence length is 1305 and context length is 2048
Validating ...
Traceback (most recent call last):
File "/home/jwan3704/litgpt-venv/bin/litgpt", line 8, in <module>
sys.exit(main())
File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/litgpt/__main__.py", line 143, in main
fn(**kwargs)
File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/litgpt/finetune/lora.py", line 144, in setup
fabric.launch(main, devices, seed, config, data, checkpoint_dir, out_dir, train, eval)
File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/lightning/fabric/fabric.py", line 845, in launch
return self._wrap_and_launch(function, self, *args, **kwargs)
File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/lightning/fabric/fabric.py", line 931, in _wrap_and_launch
return to_run(*args, **kwargs)
File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/lightning/fabric/fabric.py", line 936, in _wrap_with_setup
return to_run(*args, **kwargs)
File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/litgpt/finetune/lora.py", line 197, in main
fit(
File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/litgpt/finetune/lora.py", line 259, in fit
validate(fabric, model, val_dataloader, dataclasses.replace(eval, max_iters=2)) # sanity check
File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/litgpt/finetune/lora.py", line 354, in validate
logits = model(input_ids)
File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
...
lora = self.zero_pad(after_B) * self.scaling # (64, 64, 256) after zero_pad (64, 64, 384)
File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/litgpt/lora.py", line 345, in zero_pad
self._lora_ind_cache[result.device] = lora_ind = self._lora_ind.to(result.device)
NotImplementedError: Cannot copy out of meta tensor; no data!
rasbt commented
Haven't had a chance to test it yet, but this looks familiar @robieta re #1374:
self._lora_ind_cache[result.device] = lora_ind = self._lora_ind.to(result.device)
NotImplementedError: Cannot copy out of meta tensor; no data!
It may or may not be related, but I'm curious: when you implemented #1374, did you test it on multi-GPU?
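For anyone hitting this: the exception itself is generic PyTorch behavior, not specific to LoRA. A tensor on the `meta` device carries only shape and dtype metadata with no backing storage, so copying it to a real device is impossible. A minimal reproduction, independent of litgpt (the variable names are illustrative only):

```python
import torch

# A tensor on the "meta" device has shape and dtype but no backing storage.
ind = torch.arange(4, device="meta")
print(ind.shape)  # torch.Size([4])

# Copying it to a real device would require data that doesn't exist,
# so PyTorch raises NotImplementedError.
try:
    ind.to("cpu")
except NotImplementedError as err:
    print(err)  # "Cannot copy out of meta tensor; no data! ..."
```

In the traceback above, `self._lora_ind` presumably never got materialized off the meta device during sharded/empty initialization, which is why `self._lora_ind.to(result.device)` fails at validation time.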