Training Problem
ZetangForward opened this issue · 3 comments
Then I tried the following command, but it still fails:
(zecheng) amax@amax:~/zecheng/gfn-lm-tuning/next_sentence$ torchrun --nnodes=1 --nproc_per_node=1 train.py task=openwebtext_gpt2 device=gpu
[rank: 0] Seed set to 27
[2024-02-20 13:54:15,022][sentence_transformers.SentenceTransformer][INFO] - Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
[2024-02-20 13:54:19,179][sentence_transformers.SentenceTransformer][INFO] - Use pytorch device_name: cuda
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Error executing job with overrides: ['task=openwebtext_gpt2', 'device=gpu']
Traceback (most recent call last):
File "/home/amax/zecheng/gfn-lm-tuning/next_sentence/train.py", line 103, in train
trainer.fit(model=task, datamodule=data)
File "/home/amax/anaconda3/envs/zecheng/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
call._call_and_handle_interrupt(
File "/home/amax/anaconda3/envs/zecheng/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/home/amax/anaconda3/envs/zecheng/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 99, in launch
self.cluster_environment.validate_settings(num_devices=self.num_processes, num_nodes=self.num_nodes)
File "/home/amax/anaconda3/envs/zecheng/lib/python3.10/site-packages/lightning_fabric/plugins/environments/torchelastic.py", line 90, in validate_settings
raise ValueError(
ValueError: You set `devices=8` and `num_nodes=1` in Lightning, but the product (8 * 1) does not match the world size (1).
Hi @ZetangForward. I've only designed and tested the code for single-GPU use at the moment, and I can't guarantee that it will work easily, or run at all, on multiple GPUs. I think you'll have to play around with both hydra and lightning settings to get it to work correctly, and you may have to change some Python code as well. Here are some lightning docs I found that could help: https://lightning.ai/docs/pytorch/stable/clouds/cluster_advanced.html
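For reference, the `ValueError` in the traceback comes from a consistency check: under `torchrun`, Lightning requires `devices * num_nodes` to equal the world size that `torchrun` launched. Below is a minimal sketch of that arithmetic in plain Python, assuming the standard `WORLD_SIZE` and `LOCAL_WORLD_SIZE` environment variables set by `torchrun`. The helper names `settings_match` and `devices_for_torchrun` are hypothetical and not part of this repo or of Lightning, and where the resulting values go in this repo's hydra config is not shown in the thread.

```python
import os


def settings_match(devices, num_nodes, world_size):
    """Mirror the check from the traceback: devices * num_nodes == world size."""
    return devices * num_nodes == world_size


def devices_for_torchrun(env=None):
    """Derive (devices, num_nodes) from torchrun's environment variables.

    Hypothetical helper: falls back to a single process when the
    variables are absent (i.e. when not launched through torchrun).
    """
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))
    # LOCAL_WORLD_SIZE is the number of processes on this node.
    local_world_size = int(env.get("LOCAL_WORLD_SIZE", str(world_size)))
    if world_size % local_world_size != 0:
        raise ValueError(
            f"WORLD_SIZE={world_size} is not a multiple of "
            f"LOCAL_WORLD_SIZE={local_world_size}"
        )
    return local_world_size, world_size // local_world_size


# The failing case from the log: devices=8, num_nodes=1, world size 1.
print(settings_match(devices=8, num_nodes=1, world_size=1))

# Values that would agree with `torchrun --nnodes=1 --nproc_per_node=1`.
devices, num_nodes = devices_for_torchrun(
    {"WORLD_SIZE": "1", "LOCAL_WORLD_SIZE": "1"}
)
print(devices, num_nodes)
```

With `--nproc_per_node=1`, the matching setting is `devices=1` and `num_nodes=1`; the two values would then be passed to the `devices` and `num_nodes` arguments of `Trainer`, through whichever config key this repo uses for them.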
OK, thanks for your suggestion. I will try it.