OpenPipe/ART

Training stuck

Closed this issue · 5 comments

My RULER judge model is a local model deployed with vLLM, and whenever it needs to score after generating trajectories, it gets stuck for a long time:
Training data size: 3683
Training for 3 epoch(s)
Generating 3 responses per input for RULER to compare
============ Training Loop =============
Training: 0it [00:00, ?it/s]
Training step 285: 3%|████▍ | 285/9549 [00:00<?, ?batch/s]
gather: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:16<00:00, 5.56s/it, reward=0, completion_tokens=902]

However, no logs were output on the vLLM side.
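For reference, this is the kind of minimal check I run to see whether the judge endpoint is even reachable from the training machine, with a hard timeout so it fails fast instead of hanging (host, port, and model name below are placeholders, not my actual values):

```python
# Minimal sanity check against the vLLM judge endpoint.
# If this call also hangs, the problem is connectivity to the judge, not ART itself.
from openai import OpenAI

judge = OpenAI(
    base_url="http://<judge-host>:8000/v1",  # placeholder: wherever vLLM serves the judge
    api_key="EMPTY",                         # vLLM ignores the key by default
    timeout=30,                              # fail fast instead of blocking indefinitely
)

resp = judge.chat.completions.create(
    model="<judge-model-name>",              # placeholder: the model name vLLM was launched with
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=8,
)
print(resp.choices[0].message.content)
```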

@linkailong555-del that sounds really annoying. Are you running one of the notebooks?

I am running modified code based on the AutoRL notebook, but I found that it often gets stuck during training. GPU utilization for both the model being trained and the RULER judge LLM drops to 0, and then the training process hangs, like this:

Rank 1: Score 0.900

Rank 2: Score 0.600

Rank 3: Score 0.500
No "val/reward" metric found in history
Deleted checkpoint ./output/auto-rl-02/models/Qwen2.5-3B-Instruct/checkpoints/0378
Packed 3 trajectories into 1 sequences of length 6144
Unsloth - 2x faster free finetuning | Num GPUs used = 1
Num examples = 10,000,000 | Num Epochs = 3 | Total steps = 60,000,000 | 0/1 [00:00<?, ?it/s]
Batch size per device = 1 | Gradient accumulation steps = 1
Data Parallel GPUs = 1 | Total batch size (1 x 1 x 1) = 1
Trainable parameters = 59,867,136 of 3,145,805,824 (1.90% trained)
More details:
My training model runs on an RTX 4090 GPU, while the judge model is deployed on an A100 on another server.

Every time I encounter this situation, I restart the code. At first the training runs fast, but after a few runs the same hang occurs again.
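A simple watchdog along these lines (the polling interval and idle threshold are arbitrary choices, not ART settings) makes the stall easy to spot from the shell side:

```python
# Flag a probable hang when GPU utilization stays at 0% for several polls in a row.
import subprocess
import time

IDLE_POLLS = 10      # consecutive idle readings before flagging a hang
POLL_SECONDS = 30

idle = 0
while True:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
    ).stdout
    utils = [int(x) for x in out.split()]
    idle = idle + 1 if utils and all(u == 0 for u in utils) else 0
    if idle >= IDLE_POLLS:
        print(f"GPUs idle for {idle * POLL_SECONDS}s -- training is likely stuck")
        idle = 0
    time.sleep(POLL_SECONDS)
```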

@linkailong555-del
I assume it's an out-of-memory issue.
Can you verify this by running the training job on a more performant GPU (such as an H100) and seeing whether it still gets stuck?
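If moving to a bigger GPU isn't convenient, another way to test the OOM theory is to log CUDA memory around each training step on the 4090 and watch whether it climbs toward the 24 GB limit right before the hang. A sketch (where exactly to hook it in depends on how your training loop is structured):

```python
import torch

def log_gpu_memory(tag: str) -> None:
    # All values are for the current CUDA device, converted from bytes to GiB.
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    peak = torch.cuda.max_memory_allocated() / 2**30
    print(f"[{tag}] allocated={allocated:.2f} GiB reserved={reserved:.2f} GiB peak={peak:.2f} GiB")

# e.g. call log_gpu_memory(f"step {step}") after each training step, and
# torch.cuda.reset_peak_memory_stats() at the start of a step for per-step peaks.
```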

Most likely the training code gets stuck because a tool call in the rollout phase fails to establish a connection correctly when communicating with Smithery or a locally hosted MCP server.
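If that is the cause, putting an explicit timeout around the tool call during rollout would at least turn the silent hang into a visible error. A sketch, where `call_mcp_tool` is a hypothetical stand-in for however the rollout actually invokes the Smithery or local MCP tool:

```python
import asyncio

TOOL_CALL_TIMEOUT = 30  # seconds; arbitrary


async def call_tool_with_timeout(call_mcp_tool, name: str, arguments: dict):
    """Run one tool call but give up after TOOL_CALL_TIMEOUT instead of blocking forever."""
    try:
        return await asyncio.wait_for(call_mcp_tool(name, arguments), timeout=TOOL_CALL_TIMEOUT)
    except asyncio.TimeoutError:
        # Surface the stuck connection instead of hanging the whole training loop.
        print(f"MCP tool call {name!r} timed out after {TOOL_CALL_TIMEOUT}s")
        return None
```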