Distributed training in stage 3.3 keeps hanging
srzer opened this issue · 2 comments
srzer commented
In stage 3.3, when I set `distributed_type` to `NO`, the code runs well; but when I set `distributed_type` to `DEEPSPEED` or `MULTI_GPU`, the code gets stuck while creating `training_args = TrainingArguments(...)`. For `DEEPSPEED`, the terminal hangs after printing:
[2024-06-11 00:23:36,254] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-11 00:23:36,254] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-06-11 00:23:36,296] [INFO] [comm.py:637:init_distributed] cdb=None
Do you have any ideas? My CUDA version is 12.4.
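For context, the setting being toggled here lives in the Accelerate launch config (the zerox.yaml files). A minimal sketch, with illustrative values that are assumptions rather than the exact config used in this repo:

```yaml
# Minimal Accelerate config sketch (field values are illustrative assumptions).
# Switching distributed_type between NO / MULTI_GPU / DEEPSPEED is the
# toggle that triggers or avoids the hang described above.
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 3
  offload_optimizer_device: none
  offload_param_device: none
mixed_precision: bf16
num_machines: 1
num_processes: 8
```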
WeiXiongUST commented
I once encountered this issue when other jobs using DeepSpeed or Accelerate were running at the same time, but I am not sure whether that is related to your situation.
You may look into this potential solution:
microsoft/DeepSpeed#3416
srzer commented
Thank you for your suggestions. I finally found that this issue was due to my adding the line `main_process_port: 0` in the zerox.yaml configs.
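For anyone hitting the same hang, a minimal sketch of the offending line and one possible fix; the replacement port and surrounding layout are assumptions, not taken from the repo's actual zerox.yaml:

```yaml
# Reported culprit in this thread: a hard-coded port of 0.
# A likely reason for the hang is that 0 is not a usable fixed rendezvous
# port, so the launched processes cannot agree on where to connect.
# main_process_port: 0

# Possible fix (assumption): drop the line so Accelerate falls back to its
# default, or point it at a specific free port, e.g.
main_process_port: 29500
```

In the original report, simply not adding this line was enough to resolve the hang.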