[BUG]: weird stuck while training
Is there an existing issue for this bug?
- I have searched the existing issues
🐛 Describe the bug
When training a language model with the GeminiPlugin, I encountered an issue where the process got stuck during the forward step. I was saving a checkpoint every 3000 steps, and when it got stuck, I had to kill the process and resume from the latest checkpoint.
The steps at which training got stuck:
| start step | stuck step | total steps in each run |
| --- | --- | --- |
| 225000 | 271464 | 46464 |
| 180000 | 226463 | 46463 |
| 135000 | 181463 | 46463 |
| 90000 | 136463 | 46463 |
| 45000 | 91463 | 46463 |
| 0 | 46465 | 46465 |
Do you have any idea how to track down the cause? Thanks a lot.
Environment
CUDA: 12.1
NCCL: 2.18
PyTorch: 2.1.2
Python: 3.8
ColossalAI: 0.4.2
Can you share any relevant messages and stack trace on stuck or exit?
I didn’t receive any useful information or logs. All nodes appear to be functioning correctly, so the only option I have is to kill the training process and resume it.
After adding more logging, I can see that the process gets stuck at the forward step.
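Not from the original thread, but one low-overhead way to see where each rank is stuck (assuming you can send signals to the training processes) is Python's built-in `faulthandler`:

```python
import faulthandler
import signal

# Register SIGUSR1 so that `kill -USR1 <pid>` dumps every thread's Python
# stack trace to stderr without terminating the training process.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Optionally arm a watchdog: dump all stack traces after 30 minutes unless
# the timer is reset (e.g. by calling dump_traceback_later again each step).
faulthandler.dump_traceback_later(timeout=1800, repeat=False)
```

Alternatively, `py-spy dump --pid <PID>` on a hung rank gives the same kind of stack trace without modifying the training code.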
Could you share the stack trace when you kill it with Ctrl-C, and a reproducible script?
Could it be caused by the weird behavior described in #6111?
You can probably test the behavior of `all_gather_object` and see if it spawns multiple processes.
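A minimal standalone check (a sketch, not taken from the training code; the script name and launch command are placeholders) could look like this. Launch it with `torchrun --nproc_per_node=<num_gpus> test_all_gather_object.py` and watch `nvidia-smi` to see whether extra PIDs show up on each GPU:

```python
import os
import time

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Each rank contributes a Python object wrapping a CUDA tensor,
    # mimicking how sharded optimizer states are gathered.
    obj = {"rank": rank, "state": torch.ones(4, device="cuda") * rank}
    gathered = [None] * world_size
    dist.all_gather_object(gathered, obj)

    print(f"[rank {rank}] gathered ranks: {[o['rank'] for o in gathered]}")

    # Keep the processes alive so `nvidia-smi` can be checked for extra
    # PIDs (i.e. CUDA contexts created on GPUs belonging to other ranks).
    time.sleep(60)
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```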
What happens with `booster.save_optimizer(optimizer, path_optimizer, shard=True, size_per_shard=2048)` is that it calls into `save_sharded_optimizer`, which all-gathers the states. You can try removing some barriers along this call stack and ping other members with your findings (whether it fixes the hang).
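In rough outline (a simplified sketch of that call path, not the actual ColossalAI implementation; the function name below is made up), the sharded save gathers every rank's states and then synchronizes, so a hang can hide in either the collective or the barrier:

```python
import torch
import torch.distributed as dist


def save_sharded_optimizer_sketch(local_shard: dict, path: str):
    """Simplified stand-in for the gather-then-save pattern described above."""
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Every rank contributes its optimizer-state shard as a Python object.
    gathered = [None] * world_size
    dist.all_gather_object(gathered, local_shard)

    if rank == 0:
        # Only rank 0 persists the merged states in this sketch.
        torch.save(gathered, path)

    # A barrier like this keeps ranks in sync after saving; it is the kind
    # of synchronization point the comment above suggests experimenting with.
    dist.barrier()
```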
I observed that, following this line:
the PID for other ranks starts appearing on rank 0.
Furthermore, after reaching this line:
if `device` is replaced with `torch.device(f"cuda:{torch.cuda.current_device()}")`, each rank retains only one PID, just as at the start:
```python
compacted_states = torch.zeros(
    compacted_size,
    dtype=dtype,
    device=torch.device(f"cuda:{torch.cuda.current_device()}"),
    requires_grad=False,
)
```
And after reaching this line:
the PIDs for other ranks still start appearing on each rank.
This might just be the default behavior. All gather by definition collects tensor-based objects from other ranks.
https://discuss.pytorch.org/t/distributed-all-gather-object-produces-multiple-additional-processes/164991
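For context, the PyTorch documentation for object collectives recommends pinning the current device on each rank before calling `all_gather_object` with NCCL, which is consistent with the `torch.device(f"cuda:{torch.cuda.current_device()}")` change above. A minimal sketch (assuming a `torchrun` launch):

```python
import os

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun

# For NCCL, all_gather_object moves the pickled objects' tensors to
# torch.cuda.current_device(); if that is not pinned per rank, every rank
# ends up creating a context on cuda:0, which shows up as extra PIDs.
torch.cuda.set_device(local_rank)

payload = {"rank": dist.get_rank()}
gathered = [None] * dist.get_world_size()
dist.all_gather_object(gathered, payload)
```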
For the hang, please try removing the `dist.barrier` call.