Distributed training
ch1998 opened this issue · 1 comment
I used the command `python -m torch.distributed.launch --nproc_per_node=4 train_net.py --config configs/nhr/sport1.py` that you gave for distributed training, but the following error pops up:
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
data/trained_model/nhr/sport1
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 68315 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 68316) of binary: /mnt/data/local-disk2/software/anaconda3/envs/mlp_maps/bin/python
Single-GPU training works fine. Are there any other parameters that need to be set for the distributed case?
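For context, my guess (an assumption on my side, not confirmed) is that `invalid device ordinal` means the launcher tried to select a GPU index that doesn't exist on this machine, i.e. `--nproc_per_node=4` exceeds the number of GPUs PyTorch can see. A minimal diagnostic sketch:

```python
# Minimal sketch: check how many GPUs PyTorch can see.
# If this prints fewer than 4, launching with --nproc_per_node=4
# will ask for a non-existent device index and can fail with
# "CUDA error: invalid device ordinal".
import torch

print("visible GPUs:", torch.cuda.device_count())
```

If fewer than 4 GPUs are visible, lowering `--nproc_per_node` to that count, or restricting the visible devices (e.g. `CUDA_VISIBLE_DEVICES=0,1` before the launch command), might be worth trying.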