training stuck
I'm hitting an issue during training where the process gets stuck at the gather stage, specifically at this progress point:
gather: 75%|█████████████████████████████████████████████████████████████████████████████████████▌ | 9/12 [00:24<00:05, 1.79s/it, reward=0, correct=0, completion_tokens=82]
I recommend running nvidia-smi to see if vLLM is still running. You can also look at .art/{project}/{model}/logs/vllm.log to get more visibility into what vLLM is doing.
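If it's easier to do from a script, here is a minimal sketch of both checks in Python. The project and model names in the log path are placeholders; substitute your own:

import subprocess
from pathlib import Path

# Show GPU processes/memory so you can confirm the vLLM worker is still alive.
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)

# Print the last 50 lines of the vLLM log (placeholder project/model names).
log_path = Path(".art/my-project/my-model/logs/vllm.log")
if log_path.exists():
    print("\n".join(log_path.read_text().splitlines()[-50:]))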
I just pushed something to address the OpenAI-compatible server hanging. Hopefully it will now crash instead of getting stuck, and you can add retry logic like the following if you like:
for _ in range(RETRIES):
    # re-register the model on every attempt
    await model.register(backend)
    try:
        # train loop, something like this
        for _ in range(await model.get_step(), 1_000):
            train_groups = await art.gather_trajectory_groups(
                (
                    art.TrajectoryGroup(rollout(openai_client, prompt) for _ in range(32))
                    for prompt in prompts
                ),
                pbar_desc="gather",
            )
            await model.train(train_groups)
    except Exception:
        pass
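One design note on the sketch above: except Exception: pass hides whatever actually went wrong. If you go this route, it may be worth logging the error, breaking out once training finishes, and backing off briefly before retrying. Something like this, under the same assumptions as the snippet above (RETRIES, model, backend, etc. already defined inside an async function):

import asyncio
import traceback

for attempt in range(RETRIES):
    await model.register(backend)
    try:
        # ... same train loop as above ...
        break  # all steps completed, stop retrying
    except Exception:
        traceback.print_exc()  # surface the failure instead of silently swallowing it
        await asyncio.sleep(10 * (attempt + 1))  # simple linear backoff before re-registering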
I'm not sure this will address the underlying issue, so I'd be interested to hear whether it helps.
To get the latest version of openpipe-art:
uv add 'git+https://github.com/OpenPipe/ART.git#egg=openpipe-art[backend]'
Same issue. I could share the code with you if it would help.