Accelerator='ddp' is an invalid accelerator name

Question


Closed this issue 2 years ago · 6 comments

Hello again! I get an error when I try to run train.py: accelerator='ddp' is reported as an invalid accelerator name. The full error message is shown at the end of this issue.
My environment is:
CUDA 11.3
Python 3.10.8
PyTorch 1.12.1
PyTorch Lightning 1.8.3

I've also tried the following environment and still hit the same problem:
CUDA 11.3
Python 3.8.13
PyTorch 1.10.0
PyTorch Lightning 1.7.7

Can you kindly offer some suggestions? Thanks a lot and looking forward to your reply!

~/FastFlow3D-main$ python train.py --accelerator='ddp' --batch_size=16 --gpus=4 --num_workers=16 --learning_rate=0.0001 --disable_ddp_unused_check=True
No weights and biases API key set. Using tensorboard instead!
Disabling unused parameter check for DDP
Traceback (most recent call last):
File "/home/fjy/FastFlow3D-main/train.py", line 286, in <module>
cli()
File "/home/fjy/FastFlow3D-main/train.py", line 263, in cli
trainer = pl.Trainer.from_argparse_args(args,
File "/home/fjy/anaconda3/envs/fastflow/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1917, in from_argparse_args
return from_argparse_args(cls, args, **kwargs)
File "/home/fjy/anaconda3/envs/fastflow/lib/python3.10/site-packages/pytorch_lightning/utilities/argparse.py", line 66, in from_argparse_args
return cls(**trainer_kwargs)
File "/home/fjy/anaconda3/envs/fastflow/lib/python3.10/site-packages/pytorch_lightning/utilities/argparse.py", line 340, in insert_env_defaults
return fn(self, **kwargs)
File "/home/fjy/anaconda3/envs/fastflow/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 408, in __init__
self._accelerator_connector = AcceleratorConnector(
File "/home/fjy/anaconda3/envs/fastflow/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 192, in __init__
self._check_config_and_set_final_flags(
File "/home/fjy/anaconda3/envs/fastflow/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 291, in _check_config_and_set_final_flags
raise ValueError(
ValueError: You selected an invalid accelerator name: accelerator='ddp'. Available names are: cpu, cuda, hpu, ipu, mps, tpu.

Answer 1 · 2022-11-25T21:03:15.000Z

There seem to have been a lot of changes since the PyTorch Lightning version we used, especially regarding distributed training. Set accelerator="gpu" and strategy="ddp" in the trainer. https://pytorch-lightning.readthedocs.io/en/stable/clouds/cluster_intermediate_1.html#setup-the-training-script
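The change described above can be sketched as a small translation step. This is only an illustration: `modernize_trainer_args` and `LEGACY_DISTRIBUTED_MODES` are hypothetical names that do not appear in train.py, and the sketch assumes a Lightning 1.7/1.8 install where `devices` is the preferred replacement for `gpus`.

```python
from typing import Any, Dict

# Assumption: values that older Lightning releases accepted as `accelerator`
# but that newer releases expect under `strategy`. Extend as needed.
LEGACY_DISTRIBUTED_MODES = {"ddp", "ddp_spawn"}


def modernize_trainer_args(accelerator: str, gpus: int) -> Dict[str, Any]:
    """Map old-style (accelerator, gpus) flags onto newer Trainer keywords."""
    if accelerator in LEGACY_DISTRIBUTED_MODES:
        # The hardware goes into `accelerator`, the parallelism into `strategy`.
        return {"accelerator": "gpu", "devices": gpus, "strategy": accelerator}
    # Anything else is passed through unchanged.
    return {"accelerator": accelerator, "devices": gpus}


kwargs = modernize_trainer_args("ddp", 4)
print(kwargs)
# With Lightning installed, the result would be used as:
# trainer = pl.Trainer(**kwargs)
```

On the command line from the question, the equivalent would presumably be passing --accelerator='gpu' together with a strategy of 'ddp', provided train.py forwards that argument to the Trainer.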

Answer 2 · 2022-11-26T09:02:56.000Z

Yeah, I think you're right! This is my first time using PyTorch Lightning, and I'm quite confused about how the training parameters are passed. Some essential settings such as max_epochs and weights_save_path do not seem to be given in run.sh or train.py. Is this related to the PyTorch Lightning version, or are these training configurations given elsewhere? Looking forward to your reply, thanks a lot!

Answer 3 · 2022-11-26T14:13:52.000Z

Also, after setting accelerator='gpu' and strategy='ddp' in the trainer, a new error shows up, given at the end of this comment. Since the PyTorch Lightning version has changed, I think the plugin setting needs to be updated as well, but I'm not sure which type to choose for the plugin. Looking forward to your guidance, thank you!

~/FastFlow3D-main$ python train.py --accelerator='gpu' --batch_size=16 --gpus=4 --num_workers=16 --learning_rate=0.0001 --disable_ddp_unused_check=True
No weights and biases API key set. Using tensorboard instead!
Disabling unused parameter check for GPU
Traceback (most recent call last):
File "train.py", line 287, in <module>
cli()
File "train.py", line 264, in cli
trainer = pl.Trainer.from_argparse_args(args,
File "/home/fjy/anaconda3/envs/pvraft/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 2449, in from_argparse_args
return from_argparse_args(cls, args, **kwargs)
File "/home/fjy/anaconda3/envs/pvraft/lib/python3.8/site-packages/pytorch_lightning/utilities/argparse.py", line 72, in from_argparse_args
return cls(**trainer_kwargs)
File "/home/fjy/anaconda3/envs/pvraft/lib/python3.8/site-packages/pytorch_lightning/utilities/argparse.py", line 345, in insert_env_defaults
return fn(self, **kwargs)
File "/home/fjy/anaconda3/envs/pvraft/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 433, in __init__
self._accelerator_connector = AcceleratorConnector(
File "/home/fjy/anaconda3/envs/pvraft/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 193, in __init__
self._check_config_and_set_final_flags(
File "/home/fjy/anaconda3/envs/pvraft/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 327, in _check_config_and_set_final_flags
raise MisconfigurationException(
pytorch_lightning.utilities.exceptions.MisconfigurationException: Found invalid type for plugin <pytorch_lightning.strategies.ddp.DDPStrategy object at 0x7f4d4c1d08b0>. Expected one of: PrecisionPlugin, CheckpointIO, ClusterEnviroment, or LayerSync.

Answer 4 · 2022-11-27T11:37:54.000Z

Hi, I had a look. PyTorch Lightning really has changed a lot, and it looks even more powerful now. This means you either need to adapt the code to the new PL version or use the same version as we did. I don't know offhand which version that was. Sorry for not pinning it.
1. The issue is likely that the way DDP is set up and configured has changed: https://pytorch-lightning.readthedocs.io/en/latest/advanced/model_parallel.html#ddp-optimizations Instead of using a plugin to configure it, you now set a custom strategy object. I actually like this way more. I guess the culprit is the plugin we use to disable the unused-parameter check for DDP.
2. You are right about the parameters. We had no need to limit the number of epochs, because training took so long that we could stop it manually. There is a "max_time" parameter, however.
3. Yes, the checkpoint path is not set. We used the default checkpoint path, which is based on the logger, and the checkpoints were automatically uploaded to W&B.
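A minimal sketch of passing the DDP setting as a strategy object rather than as a plugin, assuming a Lightning release that ships `pytorch_lightning.strategies.DDPStrategy` (the traceback earlier in this thread shows that class on the 1.7.7 install). `choose_ddp_strategy` is a hypothetical helper, not a function from train.py, and how train.py currently builds its plugin is not shown in this thread.

```python
from typing import Any


def choose_ddp_strategy(disable_unused_check: bool) -> Any:
    """Return a value suitable for Trainer(strategy=...)."""
    if not disable_unused_check:
        # The registered name keeps Lightning's default DDP settings.
        return "ddp"
    # Imported lazily so the default path works without Lightning installed.
    from pytorch_lightning.strategies import DDPStrategy

    # Extra keyword arguments are forwarded to torch's DistributedDataParallel.
    return DDPStrategy(find_unused_parameters=False)


# Intended use (not executed here; needs Lightning and GPUs):
# trainer = pl.Trainer(accelerator="gpu", devices=4,
#                      strategy=choose_ddp_strategy(disable_unused_check=True))
print(choose_ddp_strategy(disable_unused_check=False))
```

The key point is that the object goes to `strategy=` and not to `plugins=`, since the MisconfigurationException above is raised when a DDPStrategy instance is passed as a plugin.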

Answer 5 · 2022-11-27T11:43:29.000Z

In general, I highly recommend using PyTorch Lightning. It's great. Judging by the date, the version we used should be 1.3.8: https://pypi.org/project/pytorch-lightning/1.3.8/ You can test whether it works with this version. There may also be a migration guide if you would rather upgrade.
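To check quickly whether an environment is close to that version, something like the following could be run before training. It is a sketch only: 1.3.8 is the maintainer's estimate above rather than a verified pin, and `installed_version` and `same_minor` are hypothetical helper names.

```python
from importlib import metadata
from typing import Optional

ASSUMED_VERSION = "1.3.8"  # the maintainer's estimate, not a verified pin


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def same_minor(installed: str, expected: str) -> bool:
    """Compare only the major and minor components, e.g. 1.3.x vs 1.3.y."""
    return installed.split(".")[:2] == expected.split(".")[:2]


found = installed_version("pytorch-lightning")
if found is None:
    print("pytorch-lightning is not installed")
elif not same_minor(found, ASSUMED_VERSION):
    print(f"Found {found}; the code was probably written against {ASSUMED_VERSION}")
else:
    print(f"Found {found}, which matches the assumed {ASSUMED_VERSION} series")
```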

Answer 6 · 2022-11-28T11:31:44.000Z

I will look into it, thanks for your suggestions! 👍