zju3dv/mlp_maps

Distributed training

ch1998 opened this issue · 1 comment

ch1998 commented

I ran the command you provided for distributed training, "python -m torch.distributed.launch --nproc_per_node=4 train_net.py --config configs/nhr/sport1.py", but the following error pops up:

RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
data/trained_model/nhr/sport1
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 68315 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 68316) of binary: /mnt/data/local-disk2/software/anaconda3/envs/mlp_maps/bin/python

Single-GPU training works fine.

Are there any other parameters that need to be set?
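For what it's worth, "invalid device ordinal" from torch.distributed.launch usually means a worker's local rank maps to a GPU index that doesn't exist or isn't visible on the machine, e.g. --nproc_per_node=4 when fewer than four GPUs are visible to the process. Below is a minimal diagnostic sketch (hypothetical, not part of this repo) that mimics the device assignment each worker performs; it assumes a newer launcher that exports LOCAL_RANK (older versions pass --local_rank as a CLI argument unless --use_env is set):

```python
import os
import torch

# Hypothetical pre-launch check, not from the mlp_maps codebase.
# torch.distributed.launch spawns one process per rank with LOCAL_RANK in
# 0..nproc_per_node-1, and training code typically calls
# torch.cuda.set_device(local_rank). If local_rank >= the number of
# visible GPUs, CUDA raises "invalid device ordinal".
local_rank = int(os.environ.get("LOCAL_RANK", 0))
n_gpus = torch.cuda.device_count()
print(f"local_rank={local_rank}, visible GPUs={n_gpus}")

if local_rank >= n_gpus:
    raise RuntimeError(
        f"local_rank {local_rank} has no matching GPU (only {n_gpus} "
        "visible); reduce --nproc_per_node or adjust CUDA_VISIBLE_DEVICES"
    )
torch.cuda.set_device(local_rank)
```

If the node has fewer than four usable GPUs, matching the launch to what is actually visible (e.g. CUDA_VISIBLE_DEVICES=0,1 with --nproc_per_node=2) keeps the rank-to-device mapping valid; this is general PyTorch DDP behavior, not something specific to this repo's configs.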

@ch1998 Hi, have you figured out how to solve this problem? I am facing the same error during distributed training. Thanks!