hpcaitech/ColossalAI

[BUG]: weird stuck while training


Is there an existing issue for this bug?

  • I have searched the existing issues

🐛 Describe the bug

When training a language model with the GeminiPlugin, I encountered an issue where the process got stuck during the forward step. I was saving a checkpoint every 3000 steps, and when it got stuck, I had to kill the process and resume from the latest checkpoint.

The steps at which training got stuck:

| start step | stuck step | total steps in each run |
| ---------- | ---------- | ----------------------- |
| 225000     | 271464     | 46464                   |
| 180000     | 226463     | 46463                   |
| 135000     | 181463     | 46463                   |
| 90000      | 136463     | 46463                   |
| 45000      | 91463      | 46463                   |
| 0          | 46465      | 46465                   |
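
For context, the training setup is roughly like the sketch below (a self-contained toy with a tiny model, random data, and placeholder paths, not the actual script; the ColossalAI calls follow the 0.4.x booster API as far as I know):

# Rough, self-contained sketch of the setup (toy model and random data stand in for
# the real language model and dataset; paths and hyperparameters are placeholders).
# Launch with: torchrun --nproc_per_node=<num_gpus> train_sketch.py
import colossalai
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.nn.optimizer import HybridAdam

colossalai.launch_from_torch()  # reads rank/world size from the torchrun env vars

booster = Booster(plugin=GeminiPlugin(precision="bf16"))

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
optimizer = HybridAdam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()
dataset = TensorDataset(torch.randn(512, 1024), torch.randn(512, 1024))
dataloader = DataLoader(dataset, batch_size=8)

model, optimizer, criterion, dataloader, _ = booster.boost(model, optimizer, criterion, dataloader)

for step, (x, y) in enumerate(dataloader, start=1):
    x, y = x.cuda().bfloat16(), y.cuda().bfloat16()  # cast inputs to the bf16 working dtype
    loss = criterion(model(x), y)
    booster.backward(loss, optimizer)
    optimizer.step()
    optimizer.zero_grad()

    if step % 3000 == 0:  # checkpoint every 3000 steps, as in the real run
        booster.save_model(model, f"ckpt/step_{step}/model", shard=True)
        booster.save_optimizer(optimizer, f"ckpt/step_{step}/optimizer", shard=True, size_per_shard=2048)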

Do you have any idea how I can find out why? Thanks a lot.

Environment

CUDA: 12.1
NCCL: 2.18
PyTorch: 2.1.2
Python: 3.8
ColossalAI: 0.4.2

Can you share any relevant messages and stack trace on stuck or exit?


I didn’t receive any useful information or logs. All nodes seem to be functioning correctly. The only option I have is to kill the training process and resume it.

After adding more logging, I can see that the process gets stuck at the forward step.

Could you share the stack trace printed when you kill the process with Ctrl-C, along with a reproducible script?
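
If Ctrl-C prints nothing useful, you can also inspect the hung rank without killing it, e.g. with py-spy dump --pid <pid>, or by registering faulthandler at startup so a signal dumps every thread's Python stack (a small, optional addition to the training script):

# Optional debugging aid: dump all Python thread stacks when the process receives
# SIGUSR1, so a hung rank can be inspected with `kill -USR1 <pid>` instead of being killed.
import faulthandler
import signal
import sys

faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)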


Could it be caused by the weird behavior described in #6111?

You can probably test the behavior of all_gather_object and see whether it spawns multiple processes.
What happens with booster.save_optimizer(optimizer, path_optimizer, shard=True, size_per_shard=2048) is that it calls into save_sharded_optimizer, which all-gathers the optimizer states. You can try removing some of the barriers along this call stack and ping the other members with your findings (i.e. whether it fixes the hang).
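
A minimal standalone probe along those lines could look like this (the file name and payload are just placeholders mimicking the compacted optimizer states):

# Probe for dist.all_gather_object; launch with: torchrun --nproc_per_node=2 probe.py
# Watch nvidia-smi while it runs to see whether extra PIDs/CUDA contexts show up on each GPU.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# A CUDA tensor inside the gathered object, mimicking the compacted optimizer states.
payload = [torch.randn(1024, device=f"cuda:{local_rank}"), local_rank, 1024]
gathered = [None] * dist.get_world_size()
dist.all_gather_object(gathered, payload)

# Each deserialized tensor still reports its sender's GPU, so every process now
# holds (and keeps a CUDA context on) every other rank's device.
print(f"rank {dist.get_rank()}: {[obj[0].device for obj in gathered]}", flush=True)
dist.barrier()
dist.destroy_process_group()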

I observed that, following this line:

compacted_states = self.pack_optimizer_states_to_tensor(param_id, state_names) if own_param else None
the PIDs of the other ranks start appearing on rank 0.

Furthermore, after reaching this line:

compacted_states = torch.zeros(compacted_size, dtype=dtype, device=device, requires_grad=False)

If device is replaced with torch.device(f"cuda:{torch.cuda.current_device()}"), each rank retains only one PID, just as at the start.

compacted_states = torch.zeros(
    compacted_size,
    dtype=dtype,
    device=torch.device(f"cuda:{torch.cuda.current_device()}"),
    requires_grad=False
) 
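
If I understand the mechanism correctly, allocating a tensor on a GPU other than the rank's own opens a CUDA context there, which is exactly what shows up as an extra PID in nvidia-smi. A toy illustration of just that effect (not the actual checkpoint code; cuda:0 simply stands in for "some other rank's GPU"):

# Toy illustration (launch with torchrun --nproc_per_node=2): allocating on another
# rank's GPU opens a CUDA context there, i.e. an extra PID in nvidia-smi, while
# pinning the allocation to torch.cuda.current_device() keeps each process on its own GPU.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

pinned = torch.zeros(16, device=torch.device(f"cuda:{torch.cuda.current_device()}"))  # own GPU only
stray = torch.zeros(16, device="cuda:0")  # every rank now also holds a context on GPU 0

print(dist.get_rank(), pinned.device, stray.device, flush=True)
dist.barrier()
dist.destroy_process_group()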

And after reaching this line:

dist.all_gather_object(gathered_state_shards, [compacted_states, shard_offset, shard_size], group=zero_group)

the PIDs of the other ranks still appear on each rank.

Hi @ver217, could you take a look? Thanks very much.

The extra PIDs you see after dist.all_gather_object might just be the default behavior: all_gather_object by definition collects tensor-based objects from the other ranks.
https://discuss.pytorch.org/t/distributed-all-gather-object-produces-multiple-additional-processes/164991
For the hang, please try removing the dist.barrier call.
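
One way to sanity-check that explanation (a sketch, not a patch for the checkpoint code) is to gather CPU copies of the same payload and compare: with CPU objects no rank ever holds another rank's CUDA tensor, so no extra PIDs should appear.

# Sketch: all_gather_object with CUDA payloads vs. CPU copies (launch with torchrun).
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
rank, world_size = dist.get_rank(), dist.get_world_size()

payload = torch.full((4,), rank, device="cuda")

out_gpu = [None] * world_size
dist.all_gather_object(out_gpu, payload)        # gathered tensors keep their senders' cuda devices -> extra PIDs

out_cpu = [None] * world_size
dist.all_gather_object(out_cpu, payload.cpu())  # plain CPU copies -> no cross-rank CUDA contexts

print(rank, [t.device for t in out_gpu], [t.device for t in out_cpu], flush=True)
dist.barrier()
dist.destroy_process_group()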