zhengchen1999/DAT

DDP expects same model across all ranks

tahir0khalil opened this issue · 2 comments

Hi,

I am trying to train the DAT model on my custom dataset and have made all the required changes in the .yml file. I have added the data to the designated directories, but when I run the training command, it stalls for a long time after the following message is displayed:
INFO: Network [DAT] is created.

Then I get the following error message and training fails. Could you let me know how I can fix this? Running the model in inference mode with the pretrained models on my custom data works perfectly.

PS: I am training on four 3090 GPUs.

[Three attached images: screenshots of the error message]

It seems there is a problem with DDP. This is probably caused by the GPU/PyTorch setup. You can try creating a fresh environment and installing PyTorch with CUDA separately:
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
Also, comment out lines 12 and 13 in requirements.txt.
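
After reinstalling, it may help to confirm which PyTorch build is actually active in the environment. The following is a minimal sketch, not part of the DAT repository; the helper names `split_torch_version` and `describe_torch_install` are made up for illustration. It assumes the version string follows the `1.8.0+cu111` pattern used in the pip command above.

```python
import importlib.util


def split_torch_version(version: str):
    """Split a version string such as '1.8.0+cu111' into (release, build_tag).

    build_tag is None when the string has no '+...' suffix.
    """
    release, _, build_tag = version.partition("+")
    return release, build_tag or None


def describe_torch_install():
    """Return a summary of the local PyTorch install, or None if torch is missing."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    release, build_tag = split_torch_version(torch.__version__)
    return {
        "release": release,                            # e.g. '1.8.0'
        "build_tag": build_tag,                        # e.g. 'cu111'
        "cuda_runtime": torch.version.cuda,            # CUDA version torch was built with
        "cuda_available": torch.cuda.is_available(),   # True if a usable GPU is detected
        "gpu_count": torch.cuda.device_count(),        # number of GPUs visible to this process
    }


if __name__ == "__main__":
    print(describe_torch_install())
```

If `cuda_available` is false or `gpu_count` is lower than the number of GPUs you intend to train on, the environment is a more likely culprit than the training configuration.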

Try adding NCCL_P2P_DISABLE=1 to your command. I added it to mine, and the following works:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NCCL_P2P_DISABLE=1 torchrun --nnode=1 --nproc_per_node=8 --master_port=12345 run.py
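
For a different GPU count, such as the four GPUs mentioned in the original post, the same launch can be assembled programmatically. This is only a sketch under assumptions: `build_launch` and `launch` are hypothetical helper names, `run.py` and port 12345 are taken from the comment above and stand in for your own training script and port, and `torchrun` is assumed to be available (it ships with newer PyTorch releases; older ones use `python -m torch.distributed.launch` instead).

```python
import os
import subprocess


def build_launch(gpu_ids, script="run.py", master_port=12345):
    """Build the environment and torchrun command for a single-node launch.

    Sets NCCL_P2P_DISABLE=1 so NCCL does not use direct GPU-to-GPU
    (peer-to-peer) transfers, as suggested in the comment above.
    """
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    env["NCCL_P2P_DISABLE"] = "1"
    cmd = [
        "torchrun",
        "--nnodes=1",
        f"--nproc_per_node={len(gpu_ids)}",  # one process per visible GPU
        f"--master_port={master_port}",
        script,
    ]
    return env, cmd


def launch(gpu_ids, script="run.py", master_port=12345):
    """Run the training script and raise CalledProcessError if it fails."""
    env, cmd = build_launch(gpu_ids, script, master_port)
    subprocess.run(cmd, env=env, check=True)


if __name__ == "__main__":
    # Print the command for four GPUs without starting any training.
    _, cmd = build_launch([0, 1, 2, 3])
    print(" ".join(cmd))
```

Disabling peer-to-peer transfers can slow down inter-GPU communication, so it is worth treating this as a workaround for the hang rather than a permanent setting.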