[Bug]: Serving Whisper with trtllm-serve – Error Encountered
Opened this issue · 1 comment
Alireza3242 commented
System Info
GPU: A100
TensorRT-LLM version: 1.1.0rc4
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I want to serve the Whisper model, so I run the following command:
trtllm-serve /app/data/asr/whisper --max_batch_size 128 --host 0.0.0.0 --kv_cache_free_gpu_memory_fraction 0.3
But it throws an error:
[2025-09-15 04:49:26] INFO config.py:54: PyTorch version 2.8.0a0+5228986c39.nv25.6 available.
[2025-09-15 04:49:26] INFO config.py:66: Polars version 1.25.2 available.
/usr/local/lib/python3.12/dist-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.55.0 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
_warnings.warn(
2025-09-15 04:49:30,123 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT-LLM version: 1.1.0rc4
/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_fields.py:198: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
warnings.warn(
[09/15/2025-04:49:31] [TRT-LLM] [I] Using LLM with PyTorch backend
[09/15/2025-04:49:31] [TRT-LLM] [I] Set nccl_plugin to None.
[09/15/2025-04:49:31] [TRT-LLM] [I] neither checkpoint_format nor checkpoint_loader were provided, checkpoint_format will be set to HF.
/app/data/asr/whisper
[09/15/2025-04:49:31] [TRT-LLM] [I] Unregistered model, using DefaultInputProcessor
rank 0 using MpiPoolSession to spawn MPI processes
[09/15/2025-04:49:31] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_queue
[09/15/2025-04:49:31] [TRT-LLM] [I] Generating a new HMAC key for server worker_init_status_queue
[09/15/2025-04:49:31] [TRT-LLM] [I] Generating a new HMAC key for server proxy_result_queue
[09/15/2025-04:49:31] [TRT-LLM] [I] Generating a new HMAC key for server proxy_stats_queue
[09/15/2025-04:49:31] [TRT-LLM] [I] Generating a new HMAC key for server proxy_kv_cache_events_queue
[2025-09-15 04:49:38] INFO config.py:54: PyTorch version 2.8.0a0+5228986c39.nv25.6 available.
[2025-09-15 04:49:38] INFO config.py:66: Polars version 1.25.2 available.
Multiple distributions found for package optimum. Picked distribution: optimum
/usr/local/lib/python3.12/dist-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.55.0 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
_warnings.warn(
2025-09-15 04:49:43,326 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT-LLM version: 1.1.0rc4
/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_fields.py:198: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
warnings.warn(
[TensorRT-LLM][INFO] Refreshed the MPI local session
[09/15/2025-04:49:44] [TRT-LLM] [I] PyTorchConfig(extra_resource_managers={}, use_cuda_graph=True, cuda_graph_batch_sizes=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 64, 128], cuda_graph_max_batch_size=128, cuda_graph_padding_enabled=False, disable_overlap_scheduler=False, moe_max_num_tokens=None, moe_load_balancer=None, attention_dp_enable_balance=False, attention_dp_time_out_iters=50, attention_dp_batching_wait_iters=10, batch_wait_timeout_ms=0, attn_backend='TRTLLM', moe_backend='CUTLASS', moe_disable_finalize_fusion=False, enable_mixed_sampler=False, sampler_type=<SamplerType.auto: 'auto'>, kv_cache_dtype='auto', mamba_ssm_cache_dtype='auto', enable_iter_perf_stats=False, enable_iter_req_stats=False, print_iter_log=False, torch_compile_enabled=False, torch_compile_fullgraph=True, torch_compile_inductor_enabled=False, torch_compile_piecewise_cuda_graph=False, torch_compile_piecewise_cuda_graph_num_tokens=None, torch_compile_enable_userbuffers=True, torch_compile_max_num_streams=1, enable_autotuner=True, enable_layerwise_nvtx_marker=False, load_format=<LoadFormat.AUTO: 0>, enable_min_latency=False, allreduce_strategy='AUTO', stream_interval=1, force_dynamic_quantization=False, mm_encoder_only=False, _limit_torch_cuda_mem_fraction=True)
[09/15/2025-04:49:44] [TRT-LLM] [I] ATTENTION RUNTIME FEATURES: AttentionRuntimeFeatures(chunked_prefill=False, cache_reuse=True, has_speculative_draft_tokens=False, chunk_size=8192)
/app/data/asr/whisper
[09/15/2025-04:49:45] [TRT-LLM] [I] Validating KV Cache config against kv_cache_dtype="auto"
[09/15/2025-04:49:45] [TRT-LLM] [I] KV cache quantization set to "auto". Using checkpoint KV quantization.
[09/15/2025-04:49:45] [TRT-LLM] [I] Fallback to regular model init: Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 995, in _load_model
model = AutoModelForCausalLM.from_config(config_copy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: Unknown architecture for AutoModelForCausalLM: WhisperForConditionalGeneration
[09/15/2025-04:49:45] [TRT-LLM] [E] Failed to initialize executor on rank 0: Unknown architecture for AutoModelForCausalLM: WhisperForConditionalGeneration
[09/15/2025-04:49:45] [TRT-LLM] [E] Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 995, in _load_model
model = AutoModelForCausalLM.from_config(config_copy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_auto.py", line 37, in from_config
raise ValueError(
ValueError: Unknown architecture for AutoModelForCausalLM: WhisperForConditionalGeneration
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 849, in worker_main
worker: GenerationExecutorWorker = worker_cls(
^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 177, in __init__
self.engine = _create_py_executor(
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 149, in _create_py_executor
_executor = create_executor(**args)
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor_creator.py", line 262, in create_py_executor
model_engine = PyTorchModelEngine(
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 306, in __init__
self.model = self._load_model(
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 1013, in _load_model
model = AutoModelForCausalLM.from_config(config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_auto.py", line 37, in from_config
raise ValueError(
ValueError: Unknown architecture for AutoModelForCausalLM: WhisperForConditionalGeneration
[09/15/2025-04:49:45] [TRT-LLM] [E] Executor worker initialization error: Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 995, in _load_model
model = AutoModelForCausalLM.from_config(config_copy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_auto.py", line 37, in from_config
raise ValueError(
ValueError: Unknown architecture for AutoModelForCausalLM: WhisperForConditionalGeneration
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 849, in worker_main
worker: GenerationExecutorWorker = worker_cls(
^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 177, in __init__
self.engine = _create_py_executor(
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 149, in _create_py_executor
_executor = create_executor(**args)
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor_creator.py", line 262, in create_py_executor
model_engine = PyTorchModelEngine(
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 306, in __init__
self.model = self._load_model(
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 1013, in _load_model
model = AutoModelForCausalLM.from_config(config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_auto.py", line 37, in from_config
raise ValueError(
ValueError: Unknown architecture for AutoModelForCausalLM: WhisperForConditionalGeneration
ValueError: Unknown architecture for AutoModelForCausalLM: WhisperForConditionalGeneration
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/bin/trtllm-serve", line 8, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1442, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1363, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1830, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1226, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 794, in invoke
return callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/serve.py", line 358, in serve
launch_server(host, port, llm_args, metadata_server_cfg, server_role)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/serve.py", line 164, in launch_server
llm = PyTorchLLM(**llm_args)
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 1031, in __init__
super().__init__(model, tokenizer, tokenizer_mode, skip_tokenizer_init,
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 946, in __init__
super().__init__(model,
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 216, in __init__
self._build_model()
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 975, in _build_model
self._executor = self._executor_cls.create(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/executor.py", line 432, in create
return GenerationExecutorProxy(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/proxy.py", line 107, in __init__
self._start_executor_workers(worker_kwargs)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/proxy.py", line 332, in _start_executor_workers
raise RuntimeError(
RuntimeError: Executor worker returned error
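For reference, the same failure reproduces without the HTTP server, directly through the LLM API (a minimal sketch, assuming the same container and the same local checkpoint path as above):

```python
# Minimal sketch (not part of the original report): constructing the
# PyTorch-backend LLM directly goes through the same loading path as
# trtllm-serve and raises the same
# "Unknown architecture for AutoModelForCausalLM: WhisperForConditionalGeneration".
from tensorrt_llm import LLM

llm = LLM(model="/app/data/asr/whisper")  # fails during executor/worker initialization
```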
Expected behavior
trtllm-serve starts and serves the Whisper model without errors.
Actual behavior
Startup fails with `ValueError: Unknown architecture for AutoModelForCausalLM: WhisperForConditionalGeneration` and the server never comes up.
Additional notes
It is worth mentioning that vLLM can serve Whisper with a similar command.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.
tongyuantongyu commented
Whisper is not supported by the PyTorch backend yet. The list of supported models can be found here.
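As a quick pre-flight check (a minimal sketch, not from the official docs; the checkpoint path is the one from this report), you can print the architecture the HF checkpoint declares and compare it against that list before launching trtllm-serve:

```python
# Hypothetical pre-flight check: read the architecture declared by the local
# HF checkpoint so it can be compared against the PyTorch backend's
# supported-model list before serving.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("/app/data/asr/whisper")
print(cfg.architectures)  # ['WhisperForConditionalGeneration'] -> not in the supported list
```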