RLHFlow/Online-RLHF

Distributed training in stage 3.3 keeps hanging

srzer opened this issue · 2 comments

In stage 3.3, when I set `distributed_type` to `NO`, the code runs fine; but when I try `DEEPSPEED` or `MULTI_GPU`, the code gets stuck while creating `training_args = TrainingArguments(...)`. For `DEEPSPEED`, the terminal hangs after printing

```
[2024-06-11 00:23:36,254] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-11 00:23:36,254] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-06-11 00:23:36,296] [INFO] [comm.py:637:init_distributed] cdb=None
```

Do you have any ideas? My CUDA version is 12.4.
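For reference, the `distributed_type` switch lives in the Accelerate config file passed to `accelerate launch`. A minimal sketch of what a DEEPSPEED-style config looks like; the specific values below are illustrative assumptions, not copied from the repo:

```yaml
# Illustrative Accelerate + DeepSpeed config (values are assumptions)
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED      # NO runs fine; DEEPSPEED / MULTI_GPU hang here
deepspeed_config:
  zero_stage: 3
  offload_optimizer_device: none
  offload_param_device: none
  gradient_accumulation_steps: 1
mixed_precision: bf16
num_machines: 1
num_processes: 4
machine_rank: 0
main_process_port: 29500         # rendezvous port; relevant to the resolution below
```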

I have run into this issue before when other jobs using DeepSpeed or Accelerate were running on the same machine, but I am not sure whether that is related to your situation.

You may look into this potential solution:
microsoft/DeepSpeed#3416
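If another Accelerate/DeepSpeed job on the same machine is the culprit, one thing worth trying is giving each concurrent job its own rendezvous port in its Accelerate config. A sketch, where the port numbers are arbitrary examples assumed to be free:

```yaml
# Accelerate config for job A (example port)
main_process_port: 29500
---
# Accelerate config for job B, kept in its own file, so the two concurrent
# jobs do not compete for the same rendezvous port
main_process_port: 29501
```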

Thank you for your suggestions. I finally found that this issue was due to my adding the line `main_process_port: 0` to the zerox.yaml configs.
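For anyone hitting the same hang: per the resolution above, removing that added line fixes it. A minimal sketch of the offending line and a presumed alternative; the surrounding values are assumptions, not copied from zerox.yaml:

```yaml
distributed_type: DEEPSPEED
# main_process_port: 0        # <- the added line that caused the hang; delete it
main_process_port: 29500      # or, presumably, set an explicit free port instead
```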