Training freezes at 0% "train" after "gather" stage is complete
Running ART's 2048.ipynb notebook locally in Docker, skipping the first cell.
Training freezes at 0% "train" after the "gather" stage is complete. GPU utilization is at 0% in nvidia-smi.
Unsloth's Qwen3 GRPO notebook (without ART) works as expected; training there does not freeze.
NVIDIA RTX 5060 Ti
Dockerfile:
FROM quay.io/jupyter/pytorch-notebook:cuda12-python-3.12
USER root
RUN apt-get update && apt-get install -y build-essential
RUN wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
RUN dpkg -i cuda-keyring_1.1-1_all.deb
RUN apt update && apt install -y cuda-toolkit
USER jovyan
RUN pip install openpipe-art==0.4.7 openpipe-art[backend]==0.4.7 --extra-index-url https://download.pytorch.org/whl/cu128 --extra-index-url https://wheels.vllm.ai/nightly
# Blackwell fix:
RUN pip uninstall -y xformers
RUN git clone --depth=1 https://github.com/facebookresearch/xformers --recursive && cd xformers && export TORCH_CUDA_ARCH_LIST="12.0" && python setup.py install
Output from the training cell:
gather: 100%
18/18 [01:35<00:00, 3.23s/it, reward=1.19, max_value=102, board_value=187, move_number=82.4, completion_tokens=21.8]
WARNING:weave.trace.op:Warning: Traces will not be logged. Call weave.init to log your traces to a project.
(subsequent messages of this type will be suppressed)
No "val/reward" metric found in history
Packed 18 trajectories into 15 sequences of length 6144
train: 0%
0/15 [00:00<?, ?it/s]
==((====))== Unsloth - 2x faster free finetuning | Num GPUs used = 1
\\ /| Num examples = 10,000,000 | Num Epochs = 3 | Total steps = 30,000,000
O^O/ \_/ \ Batch size per device = 2 | Gradient accumulation steps = 1
\ / Data Parallel GPUs = 1 | Total batch size (2 x 1 x 1) = 2
"-____-" Trainable parameters = 14,966,784 of 3,100,905,472 (0.48% trained)
Unsloth: Will smartly offload gradients to save VRAM!
==((====))== Unsloth - 2x faster free finetuning | Num GPUs used = 1
\\ /| Num examples = 10,000,000 | Num Epochs = 3 | Total steps = 60,000,000
O^O/ \_/ \ Batch size per device = 1 | Gradient accumulation steps = 1
\ / Data Parallel GPUs = 1 | Total batch size (1 x 1 x 1) = 1
"-____-" Trainable parameters = 14,966,784 of 3,100,905,472 (0.48% trained)
Any help is appreciated.
I pushed something earlier today that may help address the hanging you're seeing. To get the latest version of openpipe-art:
uv add 'git+https://github.com/OpenPipe/ART.git#egg=openpipe-art[backend]'
No, unfortunately, it doesn't seem to change anything for me
Are you trying to run this on a B200? I'm not sure we support Blackwell yet.
Haha, I wish I had a B200. No, I use an NVIDIA RTX 5060 Ti.
Unsloth supports Blackwell and I was under the impression that ART GPU support is basically tied to Unsloth GPU support, since ART uses Unsloth under the hood.
Is that not the case?
openpipe-art 0.4.4
GPU: H200 x2
I unfortunately encounter the same phenomenon quite regularly.
As far as I can tell, in my case it results from LLMEngine crashes.
It's important to note that I use two clones of the model to run rollouts simultaneously on 2 GPUs, but these crashes always occur in the vLLM backend from the art library (I start the second vLLM instance manually and sync its LoRA weights as training goes).
Example crash logs from logs/vllm.log:
I use Qwen3 14B with thinking disabled.
For logs 1 and 2 the program got stuck at train 0%; for log 3 it got stuck at rollout 0%. Note that during rollout I use max_exceptions=4, so it's possible the exceptions for logs 1 and 2 also occurred during rollout, but max_exceptions>0 allowed the rollout to finish even though the engine itself had crashed, and then it got stuck at the training stage.
It's also important to note that, when it happens, it happens in the middle of training rather than at the beginning, so it's pretty hard to reproduce. In my experience, running the model on CPU would show the actual error clearly (instead of the cryptic CUDA messages), but since this only occurs after a while, that is quite infeasible.
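For reference, this is roughly how I call the gather (a hypothetical sketch run in an async notebook cell; rollout, model, and scenarios are my own names, and the exact gather_trajectory_groups signature may differ between openpipe-art versions):

import art

# max_exceptions lets the gather finish even if a few rollouts raise,
# which is why a crashed engine can still "complete" the gather stage
# and only show up as a hang later.
groups = await art.gather_trajectory_groups(
    (
        art.TrajectoryGroup(rollout(model, scenario) for _ in range(8))
        for scenario in scenarios
    ),
    pbar_desc="gather",
    max_exceptions=4,
)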
Log 1
...
INFO 08-22 14:42:17 [metrics.py:433] Prefix cache hit rate: GPU: 83.33%, CPU: 0.00%
INFO 08-22 14:42:22 [metrics.py:417] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 338.3 tokens/s, Running: 5 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 7.1%, CPU KV cache usage: 0.0%.
INFO 08-22 14:42:22 [metrics.py:433] Prefix cache hit rate: GPU: 83.33%, CPU: 0.00%
INFO 08-22 14:42:27 [metrics.py:417] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 334.9 tokens/s, Running: 5 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 7.5%, CPU KV cache usage: 0.0%.
INFO 08-22 14:42:27 [metrics.py:433] Prefix cache hit rate: GPU: 83.33%, CPU: 0.00%
INFO: 127.0.0.1:60158 - "POST /v1/chat/completions HTTP/1.1" 200
ERROR 08-22 14:42:34 [async_llm_engine.py:67] Engine background task failed
Traceback (most recent call last):
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/engine/async_llm_engine.py", line 57, in _log_task_completion
return_value = task.result()
^^^^^^^^^^^^^
File "/home/fre.gilad/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/futures.py", line 202, in result
raise self._exception.with_traceback(self._exception_tb)
File "/home/fre.gilad/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/tasks.py", line 314, in __step_run_and_handle_result
result = coro.send(None)
^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/engine/async_llm_engine.py", line 834, in run_engine_loop
result = task.result()
^^^^^^^^^^^^^
File "/home/fre.gilad/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/futures.py", line 202, in result
raise self._exception.with_traceback(self._exception_tb)
File "/home/fre.gilad/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/tasks.py", line 316, in __step_run_and_handle_result
result = coro.throw(exc)
^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/art/vllm/engine.py", line 75, in engine_step
return await _engine_step(virtual_engine)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/engine/async_llm_engine.py", line 757, in engine_step
request_outputs = await self.engine.step_async(virtual_engine)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/engine/async_llm_engine.py", line 355, in step_async
outputs = await self.model_executor.execute_model_async(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 266, in execute_model_async
output = await make_async(self.execute_model)(execute_model_req)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/futures.py", line 289, in __await__
yield self # This tells Task to wait for completion.
^^^^^^^^^^
File "/home/fre.gilad/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/tasks.py", line 385, in __wakeup
future.result()
File "/home/fre.gilad/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/futures.py", line 202, in result
raise self._exception.with_traceback(self._exception_tb)
File "/home/fre.gilad/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/concurrent/futures/thread.py", line 59, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 141, in execute_model
output = self.collective_rpc("execute_model",
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/utils.py", line 2671, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 421, in execute_model
output = self.model_runner.execute_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/worker/multi_step_model_runner.py", line 593, in execute_model
outputs = self._final_process_outputs(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/worker/multi_step_model_runner.py", line 437, in _final_process_outputs
output.pythonize(model_input, self._copy_stream,
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/worker/multi_step_model_runner.py", line 101, in pythonize
self._pythonize_sampler_output(input_metadata, copy_stream,
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/worker/multi_step_model_runner.py", line 129, in _pythonize_sampler_output
self.sampler_output_ready_event.synchronize()
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/torch/cuda/streams.py", line 227, in synchronize
super().synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
INFO: 127.0.0.1:60136 - "POST /v1/chat/completions HTTP/1.1" 500
INFO: 127.0.0.1:60498 - "POST /v1/chat/completions HTTP/1.1" 500
INFO: 127.0.0.1:60508 - "POST /v1/chat/completions HTTP/1.1" 500
INFO: 127.0.0.1:60520 - "POST /v1/chat/completions HTTP/1.1" 500
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [1881623]
Log 2
INFO 08-15 11:28:41 [metrics.py:417] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 333.7 tokens/s, Running: 4 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.1%, CPU KV cache usage: 0.0%.
INFO 08-15 11:28:41 [metrics.py:433] Prefix cache hit rate: GPU: 69.38%, CPU: 0.00%
ERROR 08-15 11:28:41 [async_llm_engine.py:67] Engine background task failed
Traceback (most recent call last):
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/engine/async_llm_engine.py", line 57, in _log_task_completion
return_value = task.result()
^^^^^^^^^^^^^
File "/home/fre.gilad/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/futures.py", line 202, in result
raise self._exception.with_traceback(self._exception_tb)
File "/home/fre.gilad/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/tasks.py", line 314, in __step_run_and_handle_result
result = coro.send(None)
^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/engine/async_llm_engine.py", line 834, in run_engine_loop
result = task.result()
^^^^^^^^^^^^^
File "/home/fre.gilad/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/futures.py", line 202, in result
raise self._exception.with_traceback(self._exception_tb)
File "/home/fre.gilad/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/tasks.py", line 316, in __step_run_and_handle_result
result = coro.throw(exc)
^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/art/vllm/engine.py", line 75, in engine_step
return await _engine_step(virtual_engine)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/engine/async_llm_engine.py", line 757, in engine_step
request_outputs = await self.engine.step_async(virtual_engine)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/engine/async_llm_engine.py", line 355, in step_async
outputs = await self.model_executor.execute_model_async(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 266, in execute_model_async
output = await make_async(self.execute_model)(execute_model_req)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/futures.py", line 289, in __await__
yield self # This tells Task to wait for completion.
^^^^^^^^^^
File "/home/fre.gilad/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/tasks.py", line 385, in __wakeup
future.result()
File "/home/fre.gilad/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/futures.py", line 202, in result
raise self._exception.with_traceback(self._exception_tb)
File "/home/fre.gilad/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/concurrent/futures/thread.py", line 59, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 141, in execute_model
output = self.collective_rpc("execute_model",
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/utils.py", line 2671, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 421, in execute_model
output = self.model_runner.execute_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/worker/multi_step_model_runner.py", line 593, in execute_model
outputs = self._final_process_outputs(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/worker/multi_step_model_runner.py", line 437, in _final_process_outputs
output.pythonize(model_input, self._copy_stream,
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/worker/multi_step_model_runner.py", line 101, in pythonize
self._pythonize_sampler_output(input_metadata, copy_stream,
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/worker/multi_step_model_runner.py", line 131, in _pythonize_sampler_output
_pythonize_sampler_output(input_metadata, self.sampler_output,
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/worker/multi_step_model_runner.py", line 823, in _pythonize_sampler_output
) = (deferred_pythonize_logprobs(output, sampling_metadata,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/worker/multi_step_model_runner.py", line 722, in deferred_pythonize_logprobs
) = get_logprobs(logprobs_tensor, sampling_metadata, sampler_result)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/sampler.py", line 902, in get_logprobs
selected_logprobs = selected_logprobs.to('cpu')
^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
INFO: 127.0.0.1:48508 - "POST /v1/chat/completions HTTP/1.1" 500
INFO: 127.0.0.1:52710 - "POST /v1/chat/completions HTTP/1.1" 500
INFO: 127.0.0.1:52440 - "POST /v1/chat/completions HTTP/1.1" 500
INFO: 127.0.0.1:52392 - "POST /v1/chat/completions HTTP/1.1" 500
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [1937525]
Log 3
INFO 08-08 19:20:00 [metrics.py:417] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 69.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.8%, CPU KV cache usage: 0.0%.
INFO 08-08 19:20:00 [metrics.py:433] Prefix cache hit rate: GPU: 91.78%, CPU: 0.00%
INFO: 127.0.0.1:35408 - "POST /v1/chat/completions HTTP/1.1" 200
ERROR 08-08 19:20:03 [async_llm_engine.py:67] Engine background task failed
Traceback (most recent call last):
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/engine/async_llm_engine.py", line 57, in _log_task_completion
return_value = task.result()
^^^^^^^^^^^^^
File "/home/fre.gilad/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/futures.py", line 202, in result
raise self._exception.with_traceback(self._exception_tb)
File "/home/fre.gilad/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/tasks.py", line 314, in __step_run_and_handle_result
result = coro.send(None)
^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/engine/async_llm_engine.py", line 834, in run_engine_loop
result = task.result()
^^^^^^^^^^^^^
File "/home/fre.gilad/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/futures.py", line 202, in result
raise self._exception.with_traceback(self._exception_tb)
File "/home/fre.gilad/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/tasks.py", line 316, in __step_run_and_handle_result
result = coro.throw(exc)
^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/art/vllm/engine.py", line 75, in engine_step
return await _engine_step(virtual_engine)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/engine/async_llm_engine.py", line 757, in engine_step
request_outputs = await self.engine.step_async(virtual_engine)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/engine/async_llm_engine.py", line 355, in step_async
outputs = await self.model_executor.execute_model_async(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 266, in execute_model_async
output = await make_async(self.execute_model)(execute_model_req)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/futures.py", line 289, in __await__
yield self # This tells Task to wait for completion.
^^^^^^^^^^
File "/home/fre.gilad/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/tasks.py", line 385, in __wakeup
future.result()
File "/home/fre.gilad/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/futures.py", line 202, in result
raise self._exception.with_traceback(self._exception_tb)
File "/home/fre.gilad/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/concurrent/futures/thread.py", line 59, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 141, in execute_model
output = self.collective_rpc("execute_model",
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/utils.py", line 2671, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 421, in execute_model
output = self.model_runner.execute_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/worker/multi_step_model_runner.py", line 593, in execute_model
outputs = self._final_process_outputs(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/worker/multi_step_model_runner.py", line 437, in _final_process_outputs
output.pythonize(model_input, self._copy_stream,
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/worker/multi_step_model_runner.py", line 101, in pythonize
self._pythonize_sampler_output(input_metadata, copy_stream,
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/vllm/worker/multi_step_model_runner.py", line 129, in _pythonize_sampler_output
self.sampler_output_ready_event.synchronize()
File "/home/fre.gilad/source/AgentDaC/.venv/lib/python3.12/site-packages/torch/cuda/streams.py", line 227, in synchronize
super().synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
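If it helps anyone dig further: the traceback itself suggests CUDA_LAUNCH_BLOCKING=1. A minimal sketch of setting it from the driver/notebook process, assuming it runs before torch and vLLM are imported and that the spawned server inherits the environment:

import os

# Debug-only: forces synchronous kernel launches so the illegal memory
# access is reported at the failing call instead of at a later sync point.
# This slows generation down considerably.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
# Note: TORCH_USE_CUDA_DSA is a build-time option; it only takes effect
# with a PyTorch build compiled with device-side assertions enabled.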
As far as I can tell, in my case it results from LLMEngine crashes.
I think you might have a different problem. I just checked my vllm.log and there are no crashes or errors. Everything is normal until the training stage begins; after that there are no new log lines and the training remains stuck at 0%.
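One way to see where the process is actually stuck (a sketch using only the standard library; the signal choice is arbitrary, and it only covers the notebook process, not a separately spawned vLLM server):

import faulthandler
import signal

# Run this in a cell before starting training. While training is stuck,
# `kill -USR1 <notebook pid>` from a shell dumps every thread's stack to stderr.
faulthandler.register(signal.SIGUSR1, all_threads=True)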
I have the same issue. I'm using an RTX A6000.
I've rechecked this issue with ART 0.4.9, just in case, but there is no improvement so far.
I also tried running the new 2048 notebook unmodified in Colab (on a T4) and encountered the same problem. I think this is no longer limited to Blackwell - nobody can run ART? @bradhilton
Let me try reproducing.
K, I was able to reproduce on a T4. Thank you @Aranxtonel.
Appears to be an OOM error, but it's failing silently.
Interesting, but why? This model doesn't use that much VRAM, and reducing gpu_memory_utilization doesn't seem to help either. Could it be excessive memory allocation or some memory leak?
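For context, gpu_memory_utilization is the fraction of VRAM the vLLM engine reserves up front for weights plus KV cache. A standalone way to sanity-check it outside ART with plain vLLM (the model name and values are just placeholders):

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-3B-Instruct",  # placeholder; use whatever base model you train
    gpu_memory_utilization=0.6,        # lower this if the engine OOMs at startup
    max_model_len=6144,
)
print(llm.generate(["2 + 2 ="])[0].outputs[0].text)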
Going through the same issue.
I'm wondering whether this is related to huggingface/trl#3933.
Same issue with the latest code.
Well, after trying the new 0.5.0 version with the old notebook (which still supports local execution), I can say it somewhat works. Training no longer freezes, and I managed to finish local training after a couple of restarts and to complete a few training steps on Colab.
Unfortunately, it only somewhat works, because it randomly throws the errors below; restarting the notebook allows training to continue from the previous checkpoint (a resume sketch follows the logs).
ERROR:asyncio:Task exception was never retrieved
future: <Task finished name='Task-11' coro=<LocalBackend._monitor_openai_server() done, defined at /usr/local/lib/python3.12/dist-packages/art/local/backend.py:287> exception=NotFoundError("Error code: 404 - {'detail': 'Not Found'}")>
Traceback (most recent call last):
File "/usr/lib/python3.12/asyncio/tasks.py", line 314, in __step_run_and_handle_result
result = coro.send(None)
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/art/local/backend.py", line 332, in _monitor_openai_server
raise e
File "/usr/local/lib/python3.12/dist-packages/art/local/backend.py", line 322, in _monitor_openai_server
await openai_client.models.retrieve(
File "/usr/local/lib/python3.12/dist-packages/openai/resources/models.py", line 182, in retrieve
return await self._get(
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/openai/_base_client.py", line 1730, in get
return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/openai/_base_client.py", line 1584, in request
raise self._make_status_error_from_response(err.response) from None
openai.NotFoundError: Error code: 404 - {'detail': 'Not Found'}
"./.art/2048-multi-turn/models/agent-002/history.jsonl" not found
and
AssertionError Traceback (most recent call last)
/usr/local/lib/python3.12/dist-packages/unsloth_zoo/vllm_utils.py in load_vllm(model_name, config, gpu_memory_utilization, max_seq_length, dtype, training, float8_kv_cache, random_state, enable_lora, max_lora_rank, max_loras, use_async, use_engine, disable_log_stats, enforce_eager, enable_prefix_caching, compilation_config, conservativeness, max_logprobs, use_bitsandbytes, unsloth_vllm_standby, is_vision_model, return_args)
1660 if use_async:
-> 1661 llm = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**engine_args))
1662 elif use_engine:
... (21 frames omitted)
AssertionError: Sleep mode can only be used for one instance per process.
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
/usr/local/lib/python3.12/dist-packages/unsloth_zoo/vllm_utils.py in load_vllm(model_name, config, gpu_memory_utilization, max_seq_length, dtype, training, float8_kv_cache, random_state, enable_lora, max_lora_rank, max_loras, use_async, use_engine, disable_log_stats, enforce_eager, enable_prefix_caching, compilation_config, conservativeness, max_logprobs, use_bitsandbytes, unsloth_vllm_standby, is_vision_model, return_args)
1688 )
1689 else:
-> 1690 raise RuntimeError(error)
1691 pass
1692 pass
RuntimeError: Sleep mode can only be used for one instance per process.
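For anyone hitting the same thing, this is roughly how I resume after a restart (a minimal sketch in an async notebook cell; the names mirror the 2048 notebook, and the base_model value is an assumption):

import art
from art.local import LocalBackend

backend = LocalBackend()
model = art.TrainableModel(
    name="agent-002",
    project="2048-multi-turn",
    # assumption: whichever base model your copy of the notebook uses
    base_model="Qwen/Qwen2.5-3B-Instruct",
)
# Re-registering in a fresh kernel picks up the latest checkpoint under
# ./.art/2048-multi-turn/models/agent-002/, so training continues from
# the previous step (in my experience).
await model.register(backend)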
I'm not quite sure whether I should open a new issue for this, since it appears to be a continuation of the mentioned training problem in the same environments.
RuntimeError: Sleep mode can only be used for one instance per process.
In my experience, this is usually raised due to insufficient GPU memory.
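A quick way to confirm how much VRAM is actually free before the engine starts (assumes a CUDA-capable PyTorch install):

import torch

# Returns (free, total) device memory in bytes for the current GPU.
free, total = torch.cuda.mem_get_info()
print(f"free: {free / 2**30:.1f} GiB of {total / 2**30:.1f} GiB")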
I still encounter this problem with the latest ART version (0.5.1). Does anyone know a workaround?