OpenPipe/ART

Training process hangs after a few steps

Closed this issue · 2 comments

train log
[16:33:24] LLM returned no tool_calls; skipping tool execution | turn=3
[16:33:24] LLM request | step=4 model='mcprl-8b-tac2' tools=12 last_user='Please complete this task: Determine the number of days in the current month and calculate how many days have passed so far. Include a summary of the work done ...'
[16:33:24] LLM request (preview):
{
"model": "mcprl-8b-tac2",
"messages_len": 8,
"tools_len": 12
}
[16:33:24] LLM response parsed | finish_reason='stop' has_tool_calls=False content_preview='The current month, March 2025, has 31 days. \n\nTo determine how many days have passed so far in March 2025, we calculated the relative time from the start of the month (March 1, 2025) to the current da...'
[16:33:24] LLM returned no tool_calls; skipping tool execution | turn=2
[16:33:24] LLM request | step=3 model='mcprl-8b-tac2' tools=12 last_user='Please complete this task: Determine the number of days in the current month and calculate how many days have passed so far. Include a summary of the work done ...'
[16:33:24] LLM request (preview):
{
"model": "mcprl-8b-tac2",
"messages_len": 6,
"tools_len": 12
}
[16:33:28] LLM response parsed | finish_reason='tool_calls' has_tool_calls=True content_preview='None'
[16:33:28] Tool call received | name='complete_task' raw_args='{"summary": "Determined the number of days in March 2025 (31 days) and calculated that 23 days have passed so far."}'
[16:33:29] LLM response parsed | finish_reason='tool_calls' has_tool_calls=True content_preview='None'
[16:33:29] Tool call received | name='complete_task' raw_args='{"summary": "The number of days in the current month is 30, and 17 hours have passed since the last time check. The task has been successfully completed."}'
[16:33:30] LLM response parsed | finish_reason='tool_calls' has_tool_calls=True content_preview='None'
[16:33:30] Tool call received | name='complete_task' raw_args='{"summary": "Determined the number of days in the current month (30 days), calculated the current date (September 7, 2025), and found the relative time (17 hours ago). The task was completed with a detailed analysis and report of results."}'

train gather step 9: 75%|▊| 3/4 [00:33<00:07, 7.90s/it, reward=0, task_completed=1, success=0, ran_out_of_turns=0, llm_completion_duration=32.3,
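The progress bar above stalls at 3/4, so one rollout in the group apparently never returns. A minimal sketch of one way to keep a single stalled rollout from blocking the whole gather, assuming the rollouts are plain asyncio coroutines. `fake_rollout`, `run_with_timeout` and `gather_rollouts` are hypothetical names for illustration, not ART functions:

```python
import asyncio


async def run_with_timeout(coro, timeout_s: float, label: str):
    """Await `coro`; return None instead of blocking forever if it stalls."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        print(f"rollout {label} exceeded {timeout_s}s; dropping it")
        return None


async def fake_rollout(delay_s: float) -> str:
    # Hypothetical stand-in for one rollout; replace with the real coroutine.
    await asyncio.sleep(delay_s)
    return f"done after {delay_s}s"


async def gather_rollouts(delays, timeout_s: float):
    # Wrap every rollout so the gather always finishes within roughly timeout_s.
    tasks = [
        run_with_timeout(fake_rollout(d), timeout_s, label=str(i))
        for i, d in enumerate(delays)
    ]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    # The third rollout sleeps far longer than the timeout and is dropped.
    print(asyncio.run(gather_rollouts([0.01, 0.02, 5.0], timeout_s=0.2)))
```

This only hides the symptom: the dropped rollout still needs to be investigated, but the timeout message at least shows which one stalled.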

process list and vllm log
root@training-pod2-1:/workspace/pytorch# ps aux | grep python
root 9348 0.0 0.0 430408 136212 pts/4 Sl+ Sep03 0:23 /data/venvs/arte/bin/python3 /data/venvs/arte/bin/jupyter-lab
root 86126 1.2 0.0 219936 138076 pts/27 Sl+ 16:11 0:02 /data/lhy/MCP-Bridge/.venv/bin/python3 mcp_bridge/main.py
root 86177 11.7 0.4 1360975668 2593156 pts/40 Dl+ 16:12 0:25 python mcp_rl_training.py
root 86355 327 0.0 4376080 291320 pts/40 Dl+ 16:15 0:06 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86440 0.0 0.0 4092 2024 pts/41 S+ 16:15 0:00 grep --color=auto python
root@training-pod2-1:/workspace/pytorch# ps aux | grep python
root 9348 0.0 0.0 430408 136212 pts/4 Sl+ Sep03 0:23 /data/venvs/arte/bin/python3 /data/venvs/arte/bin/jupyter-lab
root 86126 0.8 0.0 219936 138288 pts/27 Sl+ 16:11 0:03 /data/lhy/MCP-Bridge/.venv/bin/python3 mcp_bridge/main.py
root 86177 115 3.4 1383536428 18463484 pts/40 Sl+ 16:12 6:37 python mcp_rl_training.py
root 86355 6.6 0.3 1356876848 2089944 pts/40 Sl+ 16:15 0:08 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86442 0.0 0.1 1354474112 986692 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86444 0.0 0.1 1354474112 986420 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86446 0.0 0.1 1354474112 986552 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86448 0.0 0.1 1354474112 986488 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86450 0.0 0.1 1354474112 986492 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86452 0.0 0.1 1354474112 986476 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86454 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86456 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86458 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86460 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86462 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86464 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86466 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86468 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86470 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86472 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86474 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86476 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86478 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86480 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86482 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86484 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86486 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86488 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86490 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86492 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86494 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86496 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86498 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86500 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86502 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86504 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86515 0.1 0.0 21876 18896 pts/40 S+ 16:16 0:00 /data/venvs/arte/bin/python -B -c from multiprocessing.resource_tracker import main;main(34)
root 86953 0.0 0.0 4092 1968 pts/41 S+ 16:17 0:00 grep --color=auto python
root@training-pod2-1:/workspace/pytorch# ps aux | grep python
root 9348 0.0 0.0 430408 136212 pts/4 Sl+ Sep03 0:23 /data/venvs/arte/bin/python3 /data/venvs/arte/bin/jupyter-lab
root 86126 0.3 0.0 218912 137680 pts/27 Sl+ 16:11 0:05 /data/lhy/MCP-Bridge/.venv/bin/python3 mcp_bridge/main.py
root 86177 91.7 4.1 1385606832 22108948 pts/40 Sl+ 16:12 23:56 python mcp_rl_training.py
root 86355 0.6 0.3 1356876848 2089944 pts/40 Sl+ 16:15 0:08 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86442 0.0 0.1 1354474112 986692 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86444 0.0 0.1 1354474112 986420 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86446 0.0 0.1 1354474112 986552 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86448 0.0 0.1 1354474112 986488 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86450 0.0 0.1 1354474112 986492 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86452 0.0 0.1 1354474112 986476 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86454 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86456 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86458 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86460 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86462 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86464 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86466 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86468 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86470 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86472 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86474 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86476 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86478 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86480 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86482 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86484 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86486 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86488 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86490 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86492 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86494 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86496 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86498 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86500 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86502 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86504 0.0 0.1 1354474112 986428 pts/40 Sl+ 16:15 0:00 /data/venvs/arte/bin/python /usr/local/lib/python3.12/site-packages/torch/_inductor/compile_worker/main.py --pickler=torch._inductor.compile_worker.subproc_pool.SubprocPickler --kind=fork --workers=32 --parent=86177 --read-fd=7 --write-fd=30
root 86515 0.0 0.0 21876 18896 pts/40 S+ 16:16 0:00 /data/venvs/arte/bin/python -B -c from multiprocessing.resource_tracker import main;main(34)
root 87122 0.0 0.0 4092 2004 pts/41 S+ 16:38 0:00 grep --color=auto python
root@training-pod2-1:/workspace/pytorch# nvtop
root@training-pod2-1:/workspace/pytorch# cd ..
root@training-pod2-1:/workspace# cd ..
root@training-pod2-1:/# cd data/lhy/.art/
gamer--rl/ tac--rl/ tac1--rl/ tac2--rl/ time--rl/ timeandcook--rl/ usweather--rl/
root@training-pod2-1:/# cd data/lhy/.art/tac2--rl/models/mcprl-8b-tac2/logs/
root@training-pod2-1:/data/lhy/.art/tac2--rl/models/mcprl-8b-tac2/logs# vi vllm.log

INFO 09-07 16:35:54 [metrics.py:433] Prefix cache hit rate: GPU: 94.01%, CPU: 0.00%
INFO: 127.0.0.1:34726 - "POST /v1/completions HTTP/1.1" 200
INFO 09-07 16:36:04 [metrics.py:417] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-07 16:36:04 [metrics.py:433] Prefix cache hit rate: GPU: 94.01%, CPU: 0.00%
INFO: 127.0.0.1:41398 - "GET /metrics HTTP/1.1" 200
INFO 09-07 16:36:09 [metrics.py:417] Avg prompt throughput: 0.2 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-07 16:36:09 [metrics.py:433] Prefix cache hit rate: GPU: 94.01%, CPU: 0.00%
INFO: 127.0.0.1:41402 - "POST /v1/completions HTTP/1.1" 200
INFO: 127.0.0.1:49040 - "GET /metrics HTTP/1.1" 200
INFO 09-07 16:36:24 [metrics.py:417] Avg prompt throughput: 0.1 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-07 16:36:24 [metrics.py:433] Prefix cache hit rate: GPU: 94.01%, CPU: 0.00%
INFO: 127.0.0.1:49052 - "POST /v1/completions HTTP/1.1" 200
INFO 09-07 16:36:34 [metrics.py:417] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-07 16:36:34 [metrics.py:433] Prefix cache hit rate: GPU: 94.01%, CPU: 0.00%
INFO: 127.0.0.1:58164 - "GET /metrics HTTP/1.1" 200
INFO 09-07 16:36:39 [metrics.py:417] Avg prompt throughput: 0.2 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-07 16:36:39 [metrics.py:433] Prefix cache hit rate: GPU: 94.01%, CPU: 0.00%
INFO: 127.0.0.1:58170 - "POST /v1/completions HTTP/1.1" 200
INFO: 127.0.0.1:36434 - "GET /metrics HTTP/1.1" 200
INFO 09-07 16:36:54 [metrics.py:417] Avg prompt throughput: 0.1 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-07 16:36:54 [metrics.py:433] Prefix cache hit rate: GPU: 94.01%, CPU: 0.00%
INFO: 127.0.0.1:36440 - "POST /v1/completions HTTP/1.1" 200
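Every metrics line in the excerpt above reports zero running and zero pending requests, which suggests the vLLM server is idle rather than stuck on a generation. A rough sketch for checking this over a longer `vllm.log`, assuming the metrics lines keep the format shown above. `parse_queue_state` and `server_is_idle` are hypothetical helper names:

```python
import re

# Matches the queue counters in a vLLM metrics line as it appears in the log above.
METRICS_RE = re.compile(
    r"Running: (?P<running>\d+) reqs, "
    r"Swapped: (?P<swapped>\d+) reqs, "
    r"Pending: (?P<pending>\d+) reqs"
)


def parse_queue_state(line: str):
    """Return the queue counters from one log line, or None if it has none."""
    match = METRICS_RE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


def server_is_idle(lines) -> bool:
    """True if at least one metrics line exists and none shows queued or running work."""
    states = [s for s in map(parse_queue_state, lines) if s is not None]
    return bool(states) and all(
        s["running"] == 0 and s["pending"] == 0 for s in states
    )


if __name__ == "__main__":
    sample = (
        "INFO 09-07 16:36:04 [metrics.py:417] Avg prompt throughput: 0.0 tokens/s, "
        "Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, "
        "Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%."
    )
    print(parse_queue_state(sample))
    print(server_is_idle([sample]))
```

If this reports idle while the progress bar is frozen, the stall is more likely on the training or client side than inside the inference server.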

I tried restarting the training script (it is the MCP RL notebook), but it always hangs after a few steps.
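To find out where the process is stuck, one option is Python's standard `faulthandler` module, which can dump the stack of every thread on demand. A minimal sketch, assuming a Unix host and that this code is added near the top of the training script. `install_hang_diagnostics` and the log path are hypothetical:

```python
import faulthandler
import signal


def install_hang_diagnostics(log_path: str, dump_after_s: float = 600.0):
    """Write thread stacks to `log_path` on SIGUSR1 and every `dump_after_s` seconds."""
    # The file must stay open for as long as the handlers are installed.
    log_file = open(log_path, "w")
    # On-demand dump: `kill -USR1 <pid>` from another shell (Unix only).
    faulthandler.register(signal.SIGUSR1, file=log_file, all_threads=True)
    # Periodic dump, useful when nobody is watching the run.
    faulthandler.dump_traceback_later(dump_after_s, repeat=True, file=log_file)
    return log_file


def remove_hang_diagnostics(log_file) -> None:
    """Undo install_hang_diagnostics and close the log file."""
    faulthandler.cancel_dump_traceback_later()
    faulthandler.unregister(signal.SIGUSR1)
    log_file.close()


if __name__ == "__main__":
    handle = install_hang_diagnostics("/tmp/train_hang_stacks.log")
    # ... run training here; on a hang, send SIGUSR1 and read the log file ...
    remove_hang_diagnostics(handle)
```

The dumped stacks should show whether the main thread is waiting on an HTTP response, a lock, or a subprocess, which would narrow down the report considerably.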

me too