Appropriate groups_per_step setting given training and validation dataset volume
We basically implemented our trainer based on https://github.com/OpenPipe/ART/blob/5a60fa017ab876910bbec61add43f81ef4103eb9/dev/art-e/art_e/train.py, using essentially the same code skeleton with vLLM hosted locally.
However, during our experiments, a strange issue came up when setting groups_per_step to 50 or 40:
- On a small training and validation dataset (for instance 100 and 50 examples), end-to-end training completes successfully (groups_per_step = 50).
- After scaling the training and validation datasets out to 300 and 100 examples, end-to-end training can no longer complete, because the LLM client reports many timeouts during rollout (with both groups_per_step = 50 and groups_per_step = 40). The exception trace is as follows:
```
2025-09-08 10:56:55.806 ERROR llm_utils - parse_cot_result: Error parsing COT result: No <think> section found in the response
2025-09-08 10:56:55.845 ERROR llm_utils - parse_cot_result: Error parsing COT result: No <think> section found in the response
2025-09-08 10:56:56.022 ERROR llm_utils - parse_cot_result: Error parsing COT result: No <think> section found in the response
2025-09-08 10:56:56.416 INFO _base_client - _sleep_for_retry: Retrying request to /chat/completions in 0.902451 seconds
2025-09-08 10:56:56.422 INFO _base_client - _sleep_for_retry: Retrying request to /chat/completions in 0.842315 seconds
2025-09-08 10:56:56.460 ERROR rl_train_func - rollout_per_prediction_turn: API call error (attempt 1/3): litellm.InternalServerError: InternalServerError: Hosted_vllmException - Connection error.
2025-09-08 10:56:56.460 INFO rl_train_func - rollout_per_prediction_turn: Waiting for 0.22 seconds before retrying...
2025-09-08 10:56:57.790 ERROR llm_utils - parse_cot_result: Error parsing COT result: No <think> section found in the response
2025-09-08 10:56:57.924 ERROR llm_utils - parse_cot_result: Error parsing COT result: No <think> section found in the response
2025-09-08 10:56:58.015 ERROR llm_utils - parse_cot_result: Error parsing COT result: No <think> section found in the response
2025-09-08 10:56:58.555 ERROR llm_utils - parse_cot_result: Error parsing COT result: No <think> section found in the response
2025-09-08 10:56:58.757 ERROR llm_utils - parse_cot_result: Error parsing COT result: No <think> section found in the response
2025-09-08 10:57:00.028 ERROR llm_utils - parse_cot_result: Error parsing COT result: No <think> section found in the response
2025-09-08 10:57:00.160 ERROR llm_utils - parse_cot_result: Error parsing COT result: No <think> section found in the response
2025-09-08 10:57:04.125 INFO _base_client - _sleep_for_retry: Retrying request to /chat/completions in 0.804868 seconds
2025-09-08 10:57:04.486 ERROR llm_utils - parse_cot_result: Error parsing COT result: No <think> section found in the response
2025-09-08 10:57:05.256 ERROR rl_train_func - rollout_per_prediction_turn: API call error (attempt 1/3): litellm.Timeout: APITimeoutError - Request timed out. Error_str: Request timed out. - timeout value=600.0, time taken=1813.92 seconds
2025-09-08 10:57:05.256 INFO rl_train_func - rollout_per_prediction_turn: Waiting for 0.91 seconds before retrying...
2025-09-08 10:57:05.263 ERROR rl_train_func - rollout_per_prediction_turn: API call error (attempt 1/3): litellm.Timeout: APITimeoutError - Request timed out. Error_str: Request timed out. - timeout value=600.0, time taken=1814.36 seconds
2025-09-08 10:57:05.264 INFO rl_train_func - rollout_per_prediction_turn: Waiting for 0.44 seconds before retrying...
2025-09-08 10:57:05.271 ERROR rl_train_func - rollout_per_prediction_turn: API call error (attempt 1/3): litellm.Timeout: APITimeoutError - Request timed out. Error_str: Request timed out. - timeout value=600.0, time taken=1813.89 seconds
```

I compared the vLLM monitor for the successful training run (pink line) with the failed run (green line). The virtual memory allocation of the vLLM hosting process looks abnormal.
Some additional information about the data volume: each data point is a dialogue with at most 20 turns. We let the LLM expand specific turns of the conversation, and each expanded turn is rolled out 4 times.
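For a rough sense of the fan-out, here is a back-of-the-envelope estimate of the worst-case number of concurrent requests per step, assuming every expanded turn of every group is rolled out at the same time (the actual number depends on how many turns we expand per conversation):

```python
# Worst-case request fan-out per training step if nothing throttles the rollouts.
groups_per_step = 50          # groups sampled per training step
max_turns_per_dialogue = 20   # each data point is a dialogue with at most 20 turns
rollouts_per_turn = 4         # each expanded turn is rolled out 4 times

peak_in_flight_requests = groups_per_step * max_turns_per_dialogue * rollouts_per_turn
print(peak_in_flight_requests)  # up to 4000 chat/completions requests hitting vLLM at once
```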
As I understand it, in ART groups_per_step decides how many parallel requests are sent to the locally hosted vLLM server. Since we have not tuned the vLLM settings, I can see that delayed client requests get queued until vLLM consumes them, and the vLLM monitor diagrams do show the corresponding delays.
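(For reference, this is the kind of check I mean by "queued": looking at vLLM's Prometheus metrics for the scheduler queue depth. The endpoint URL and metric names below are assumptions from our local setup and may differ across vLLM versions.)

```python
import re
import urllib.request

# Scrape the local vLLM /metrics endpoint and print the scheduler queue depth.
# URL and metric names are assumptions and may vary across vLLM versions.
METRICS_URL = "http://localhost:8000/metrics"

with urllib.request.urlopen(METRICS_URL) as resp:
    metrics = resp.read().decode()

for line in metrics.splitlines():
    # num_requests_running = requests currently generating tokens
    # num_requests_waiting = requests sitting in the scheduler queue
    if re.match(r"vllm:num_requests_(running|waiting)", line):
        print(line)
```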
Maybe the accumulated backlog of requests cannot be processed by vLLM in time, so the LLM clients run into timeouts. But why does the virtual memory of the hosting process stay at such a high level until the client reports the timeout?
Are there ways to solve or alleviate this issue? We run our experiments on a powerful machine, so perhaps an appropriate setting could help us address it.
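For example, would capping the client-side rollout concurrency (and raising the client timeout) along these lines be a reasonable direction? This is only a rough sketch; the semaphore limit, model name, and endpoint are placeholders, not settings we have validated:

```python
import asyncio
import litellm

# Cap how many rollout requests are in flight against the local vLLM server at
# any one time, instead of letting groups_per_step * expanded_turns * 4 requests
# pile up at once. The limit of 64, the model name, and the api_base below are
# placeholders for illustration only.
ROLLOUT_CONCURRENCY = asyncio.Semaphore(64)

async def throttled_chat_completion(messages):
    async with ROLLOUT_CONCURRENCY:
        return await litellm.acompletion(
            model="hosted_vllm/our-model",        # placeholder model name
            api_base="http://localhost:8000/v1",  # local vLLM endpoint (placeholder)
            messages=messages,
            timeout=1800,  # raise above litellm's 600 s default seen in the logs
        )
```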
Best Regards
Orlando