GFNOrg/gfn-lm-tuning

Training Problem

ZetangForward opened this issue · 3 comments

Hi, thanks for your great work. However, when I run the code

[screenshot of the command being run]

The GPU usage is strange:

[screenshot of the GPU usage]

Normally there should be one process per GPU, but here each GPU has 8 processes.
In addition, I run into the following problem:

[screenshot of the error]

How do I fix these issues? I believe it is a Hydra configuration problem...

Then I tried the following command, but it still fails:

(zecheng) amax@amax:~/zecheng/gfn-lm-tuning/next_sentence$ torchrun --nnodes=1 --nproc_per_node=1 train.py task=openwebtext_gpt2 device=gpu
[rank: 0] Seed set to 27
[2024-02-20 13:54:15,022][sentence_transformers.SentenceTransformer][INFO] - Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
[2024-02-20 13:54:19,179][sentence_transformers.SentenceTransformer][INFO] - Use pytorch device_name: cuda
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Error executing job with overrides: ['task=openwebtext_gpt2', 'device=gpu']
Traceback (most recent call last):
  File "/home/amax/zecheng/gfn-lm-tuning/next_sentence/train.py", line 103, in train
    trainer.fit(model=task, datamodule=data)
  File "/home/amax/anaconda3/envs/zecheng/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
    call._call_and_handle_interrupt(
  File "/home/amax/anaconda3/envs/zecheng/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/amax/anaconda3/envs/zecheng/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 99, in launch
    self.cluster_environment.validate_settings(num_devices=self.num_processes, num_nodes=self.num_nodes)
  File "/home/amax/anaconda3/envs/zecheng/lib/python3.10/site-packages/lightning_fabric/plugins/environments/torchelastic.py", line 90, in validate_settings
    raise ValueError(
ValueError: You set `devices=8` and `num_nodes=1` in Lightning, but the product (8 * 1) does not match the world size (1).
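For context, the `ValueError` comes from a consistency check in Lightning's torchelastic environment: `torchrun` exports a `WORLD_SIZE` environment variable equal to `nnodes * nproc_per_node`, and Lightning requires `devices * num_nodes` to match it. Here `torchrun` launched a single process (`WORLD_SIZE=1`), while Lightning auto-detected all 8 GPUs and set `devices=8`. The check can be sketched roughly like this (simplified names, not the actual library code):

```python
import os

def validate_settings(num_devices: int, num_nodes: int) -> None:
    """Rough sketch of Lightning's torchelastic consistency check
    (simplified; not the real library implementation)."""
    # torchrun exports WORLD_SIZE = nnodes * nproc_per_node for every rank.
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    if num_devices * num_nodes != world_size:
        raise ValueError(
            f"You set `devices={num_devices}` and `num_nodes={num_nodes}`, "
            f"but the product ({num_devices} * {num_nodes}) does not match "
            f"the world size ({world_size})."
        )

# With `torchrun --nnodes=1 --nproc_per_node=1`, WORLD_SIZE is 1:
os.environ["WORLD_SIZE"] = "1"
validate_settings(num_devices=1, num_nodes=1)    # passes
# validate_settings(num_devices=8, num_nodes=1)  # raises ValueError
```

So the two sides of the launch disagree: either the Lightning `devices` setting must be brought down to 1, or `torchrun` must start 8 processes per node.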

Hi @ZetangForward. I've only designed and tested the code for single-GPU use at the moment, and I can't guarantee that it will run at all on multiple GPUs. I think you'll have to play around with both the Hydra and Lightning settings to get it to work correctly, and you may also have to change some Python code. Here are some Lightning docs I found that could help: https://lightning.ai/docs/pytorch/stable/clouds/cluster_advanced.html
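One possible single-GPU workaround (an assumption, not something tested against this repo): clamp the requested device count to the `WORLD_SIZE` that `torchrun` exported before constructing the `Trainer`, so `devices * num_nodes` always matches what the launcher actually started. The config key names below are hypothetical.

```python
import os

def resolve_devices(requested: int) -> int:
    """Clamp a requested device count to the world size exported by torchrun,
    so Lightning's `devices * num_nodes == WORLD_SIZE` check passes.
    (`requested` would come from the Hydra config; the exact key used by
    this repo is an assumption.)"""
    world_size = int(os.environ.get("WORLD_SIZE", str(requested)))
    return min(requested, world_size)

# e.g. inside train.py, before constructing the Trainer (hypothetical names):
# trainer = pl.Trainer(devices=resolve_devices(config.devices), num_nodes=1, ...)
```

With `torchrun --nproc_per_node=1`, this yields `devices=1` even if the config asks for 8, which avoids the `ValueError` above while keeping the run single-GPU.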

OK, thanks for your suggestion. I will try it.