CUDA error: no kernel image is available for execution on the device
/home/jake/anaconda3/lib/python3.12/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
from vllm.version import version as VLLM_VERSION
INFO 10-14 22:51:57 llm_engine.py:237] Initializing an LLM engine (vdev) with config: model='/home/jake/LLaMA-Factory/finetunes', speculative_config=None, tokenizer='/home/jake/LLaMA-Factory/finetunes', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/jake/LLaMA-Factory/finetunes, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
INFO 10-14 22:51:59 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-14 22:51:59 selector.py:115] Using XFormers backend.
/home/jake/anaconda3/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/home/jake/anaconda3/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 10-14 22:52:00 model_runner.py:1060] Starting to load model /home/jake/LLaMA-Factory/finetunes...
INFO 10-14 22:52:00 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-14 22:52:00 selector.py:115] Using XFormers backend.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.95s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.95s/it]
INFO 10-14 22:52:02 model_runner.py:1071] Loading model weights took 2.3185 GB
INFO 10-14 22:52:02 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241014-225202.pkl...
INFO 10-14 22:52:02 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241014-225202.pkl.
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1665, in execute_model
[rank0]: hidden_or_intermediate_states = model_executable(
[rank0]: ^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 556, in forward
[rank0]: model_output = self.model(input_ids, positions, kv_caches,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 345, in forward
[rank0]: hidden_states, residual = layer(positions, hidden_states,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 257, in forward
[rank0]: hidden_states = self.self_attn(positions=positions,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 184, in forward
[rank0]: qkv, _ = self.qkv_proj(hidden_states)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 371, in forward
[rank0]: output_parallel = self.quant_method.apply(self, input, bias)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 135, in apply
[rank0]: return F.linear(x, layer.weight, bias)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: CUDA error: no kernel image is available for execution on the device
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[rank0]: The above exception was the direct cause of the following exception:
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/jake/Downloads/MMLU-Pro-main/evaluate_from_local.py", line 284, in <module>
[rank0]: main()
[rank0]: File "/home/jake/Downloads/MMLU-Pro-main/evaluate_from_local.py", line 200, in main
[rank0]: model, tokenizer = load_model()
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/home/jake/Downloads/MMLU-Pro-main/evaluate_from_local.py", line 30, in load_model
[rank0]: llm = LLM(model=args.model, gpu_memory_utilization=float(args.gpu_util),
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 177, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 574, in from_engine_args
[rank0]: engine = cls(
[rank0]: ^^^^
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 349, in __init__
[rank0]: self._initialize_kv_caches()
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 484, in _initialize_kv_caches
[rank0]: self.model_executor.determine_num_available_blocks())
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/executor/gpu_executor.py", line 114, in determine_num_available_blocks
[rank0]: return self.driver_worker.determine_num_available_blocks()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocks
[rank0]: self.model_runner.profile_run()
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1309, in profile_run
[rank0]: self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
[rank0]: raise type(err)(
[rank0]: RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241014-225202.pkl): CUDA error: no kernel image is available for execution on the device
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
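
As the error text itself suggests, rerunning with synchronous kernel launches should make the stack trace point at the real failing call. A minimal sketch, assuming the variable is set before torch (or anything that imports it) is loaded:

```python
# Force synchronous CUDA launches so the error surfaces at the true call site.
# This must run before torch / vllm are imported.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

Exporting `CUDA_LAUNCH_BLOCKING=1` in the shell before launching evaluate_from_local.py has the same effect.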
Running on Debian Linux with CUDA 12.6 installed.
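
A "no kernel image is available" error usually means the installed PyTorch/vLLM binaries were not compiled for this GPU's compute capability; prebuilt wheels only ship kernels for a limited set of architectures. A quick diagnostic sketch, assuming only a working PyTorch install, to compare the device's sm_XX against what this build supports:

```python
# Compare the GPU's compute capability against the architectures
# this torch build ships kernels for; a missing sm_XX entry
# reproduces the "no kernel image" failure.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"GPU compute capability: sm_{major}{minor}")
print(f"torch built for:        {torch.cuda.get_arch_list()}")
```

If the GPU's architecture is absent from the list, PyTorch (and vLLM) would need to be reinstalled from a build that includes it, or compiled from source for that architecture.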