training stuck
I'm hitting an issue during training where the process gets stuck at the gather stage, specifically at this progress point:
gather: 75%|█████████████████████████████████████████████████████████████████████████████████████▌ | 9/12 [00:24<00:05, 1.79s/it, reward=0, correct=0, completion_tokens=82]
I recommend running nvidia-smi to see if vLLM is still running. You can also look at .art/{project}/{model}/logs/vllm.log to get more visibility into what vLLM is doing.
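If it's easier to do from a script, here is a minimal sketch of both checks in Python. The project and model names in the log path are placeholders; substitute your own:

import subprocess
from pathlib import Path

# Show GPU processes/memory so you can confirm the vLLM worker is still alive.
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)

# Print the last 50 lines of the vLLM log (placeholder project/model names).
log_path = Path(".art/my-project/my-model/logs/vllm.log")
if log_path.exists():
    print("\n".join(log_path.read_text().splitlines()[-50:]))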
I just pushed something to address the OpenAI-compatible server hanging. Hopefully it will now crash instead of getting stuck, and you can add retry logic like the following if you like:
for _ in range(RETRIES):
    # re-register the model on every attempt
    await model.register(backend)
    try:
        # train loop, something like this
        for _ in range(await model.get_step(), 1_000):
            train_groups = await art.gather_trajectory_groups(
                (
                    art.TrajectoryGroup(rollout(openai_client, prompt) for _ in range(32))
                    for prompt in prompts
                ),
                pbar_desc="gather",
            )
            await model.train(train_groups)
    except Exception:
        pass
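One design note on the sketch above: except Exception: pass hides whatever actually went wrong. If you go this route, it may be worth logging the error, breaking out once training finishes, and backing off briefly before retrying. Something like this, under the same assumptions as the snippet above (RETRIES, model, backend, etc. already defined inside an async function):

import asyncio
import traceback

for attempt in range(RETRIES):
    await model.register(backend)
    try:
        # ... same train loop as above ...
        break  # all steps completed, stop retrying
    except Exception:
        traceback.print_exc()  # surface the failure instead of silently swallowing it
        await asyncio.sleep(10 * (attempt + 1))  # simple linear backoff before re-registering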
I'm not sure this will address the underlying issue, so I'd be interested to hear whether it helps.
To get the latest version of openpipe-art:
uv add 'git+https://github.com/OpenPipe/ART.git#egg=openpipe-art[backend]'
Same issue. I could share the code with you if it would help.