NVIDIA/TensorRT-LLM

[Bug]: Serving Whisper with trtllm-serve – Error Encountered

Opened this issue · 1 comment

System Info

GPU: NVIDIA A100
TensorRT-LLM version: 1.1.0rc4

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I want to serve the Whisper model. I run the following command:

trtllm-serve /app/data/asr/whisper --max_batch_size 128 --host 0.0.0.0 --kv_cache_free_gpu_memory_fraction 0.3

But it throws an error:

[2025-09-15 04:49:26] INFO config.py:54: PyTorch version 2.8.0a0+5228986c39.nv25.6 available.
[2025-09-15 04:49:26] INFO config.py:66: Polars version 1.25.2 available.
/usr/local/lib/python3.12/dist-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.55.0 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
  _warnings.warn(
2025-09-15 04:49:30,123 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT-LLM version: 1.1.0rc4
/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_fields.py:198: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
  warnings.warn(
[09/15/2025-04:49:31] [TRT-LLM] [I] Using LLM with PyTorch backend
[09/15/2025-04:49:31] [TRT-LLM] [I] Set nccl_plugin to None.
[09/15/2025-04:49:31] [TRT-LLM] [I] neither checkpoint_format nor checkpoint_loader were provided, checkpoint_format will be set to HF.
/app/data/asr/whisper
[09/15/2025-04:49:31] [TRT-LLM] [I] Unregistered model, using DefaultInputProcessor
rank 0 using MpiPoolSession to spawn MPI processes
[09/15/2025-04:49:31] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_queue
[09/15/2025-04:49:31] [TRT-LLM] [I] Generating a new HMAC key for server worker_init_status_queue
[09/15/2025-04:49:31] [TRT-LLM] [I] Generating a new HMAC key for server proxy_result_queue
[09/15/2025-04:49:31] [TRT-LLM] [I] Generating a new HMAC key for server proxy_stats_queue
[09/15/2025-04:49:31] [TRT-LLM] [I] Generating a new HMAC key for server proxy_kv_cache_events_queue
[2025-09-15 04:49:38] INFO config.py:54: PyTorch version 2.8.0a0+5228986c39.nv25.6 available.
[2025-09-15 04:49:38] INFO config.py:66: Polars version 1.25.2 available.
Multiple distributions found for package optimum. Picked distribution: optimum
/usr/local/lib/python3.12/dist-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.55.0 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
  _warnings.warn(
2025-09-15 04:49:43,326 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT-LLM version: 1.1.0rc4
/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_fields.py:198: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
  warnings.warn(
[TensorRT-LLM][INFO] Refreshed the MPI local session
[09/15/2025-04:49:44] [TRT-LLM] [I] PyTorchConfig(extra_resource_managers={}, use_cuda_graph=True, cuda_graph_batch_sizes=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 64, 128], cuda_graph_max_batch_size=128, cuda_graph_padding_enabled=False, disable_overlap_scheduler=False, moe_max_num_tokens=None, moe_load_balancer=None, attention_dp_enable_balance=False, attention_dp_time_out_iters=50, attention_dp_batching_wait_iters=10, batch_wait_timeout_ms=0, attn_backend='TRTLLM', moe_backend='CUTLASS', moe_disable_finalize_fusion=False, enable_mixed_sampler=False, sampler_type=<SamplerType.auto: 'auto'>, kv_cache_dtype='auto', mamba_ssm_cache_dtype='auto', enable_iter_perf_stats=False, enable_iter_req_stats=False, print_iter_log=False, torch_compile_enabled=False, torch_compile_fullgraph=True, torch_compile_inductor_enabled=False, torch_compile_piecewise_cuda_graph=False, torch_compile_piecewise_cuda_graph_num_tokens=None, torch_compile_enable_userbuffers=True, torch_compile_max_num_streams=1, enable_autotuner=True, enable_layerwise_nvtx_marker=False, load_format=<LoadFormat.AUTO: 0>, enable_min_latency=False, allreduce_strategy='AUTO', stream_interval=1, force_dynamic_quantization=False, mm_encoder_only=False, _limit_torch_cuda_mem_fraction=True)
[09/15/2025-04:49:44] [TRT-LLM] [I] ATTENTION RUNTIME FEATURES:  AttentionRuntimeFeatures(chunked_prefill=False, cache_reuse=True, has_speculative_draft_tokens=False, chunk_size=8192)
/app/data/asr/whisper
[09/15/2025-04:49:45] [TRT-LLM] [I] Validating KV Cache config against kv_cache_dtype="auto"
[09/15/2025-04:49:45] [TRT-LLM] [I] KV cache quantization set to "auto". Using checkpoint KV quantization.
[09/15/2025-04:49:45] [TRT-LLM] [I] Fallback to regular model init: Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 995, in _load_model
    model = AutoModelForCausalLM.from_config(config_copy)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: Unknown architecture for AutoModelForCausalLM: WhisperForConditionalGeneration


[09/15/2025-04:49:45] [TRT-LLM] [E] Failed to initialize executor on rank 0: Unknown architecture for AutoModelForCausalLM: WhisperForConditionalGeneration
[09/15/2025-04:49:45] [TRT-LLM] [E] Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 995, in _load_model
    model = AutoModelForCausalLM.from_config(config_copy)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_auto.py", line 37, in from_config
    raise ValueError(
ValueError: Unknown architecture for AutoModelForCausalLM: WhisperForConditionalGeneration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 849, in worker_main
    worker: GenerationExecutorWorker = worker_cls(
                                       ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 177, in __init__
    self.engine = _create_py_executor(
                  ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 149, in _create_py_executor
    _executor = create_executor(**args)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor_creator.py", line 262, in create_py_executor
    model_engine = PyTorchModelEngine(
                   ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 306, in __init__
    self.model = self._load_model(
                 ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 1013, in _load_model
    model = AutoModelForCausalLM.from_config(config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_auto.py", line 37, in from_config
    raise ValueError(
ValueError: Unknown architecture for AutoModelForCausalLM: WhisperForConditionalGeneration

[09/15/2025-04:49:45] [TRT-LLM] [E] Executor worker initialization error: Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 995, in _load_model
    model = AutoModelForCausalLM.from_config(config_copy)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_auto.py", line 37, in from_config
    raise ValueError(
ValueError: Unknown architecture for AutoModelForCausalLM: WhisperForConditionalGeneration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 849, in worker_main
    worker: GenerationExecutorWorker = worker_cls(
                                       ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 177, in __init__
    self.engine = _create_py_executor(
                  ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 149, in _create_py_executor
    _executor = create_executor(**args)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor_creator.py", line 262, in create_py_executor
    model_engine = PyTorchModelEngine(
                   ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 306, in __init__
    self.model = self._load_model(
                 ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 1013, in _load_model
    model = AutoModelForCausalLM.from_config(config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_auto.py", line 37, in from_config
    raise ValueError(
ValueError: Unknown architecture for AutoModelForCausalLM: WhisperForConditionalGeneration

ValueError: Unknown architecture for AutoModelForCausalLM: WhisperForConditionalGeneration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/trtllm-serve", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1442, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1363, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1830, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1226, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 794, in invoke
    return callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/serve.py", line 358, in serve
    launch_server(host, port, llm_args, metadata_server_cfg, server_role)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/serve.py", line 164, in launch_server
    llm = PyTorchLLM(**llm_args)
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 1031, in __init__
    super().__init__(model, tokenizer, tokenizer_mode, skip_tokenizer_init,
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 946, in __init__
    super().__init__(model,
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 216, in __init__
    self._build_model()
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 975, in _build_model
    self._executor = self._executor_cls.create(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/executor.py", line 432, in create
    return GenerationExecutorProxy(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/proxy.py", line 107, in __init__
    self._start_executor_workers(worker_kwargs)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/proxy.py", line 332, in _start_executor_workers
    raise RuntimeError(
RuntimeError: Executor worker returned error

Expected behavior

The server should start and serve the Whisper model without errors.

actual behavior

trtllm-serve fails during executor worker initialization with "ValueError: Unknown architecture for AutoModelForCausalLM: WhisperForConditionalGeneration" (full traceback above), and the server never comes up.

additional notes

It is worth mentioning that vLLM can serve Whisper with a similar command.
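
For reference, the comparable vLLM invocation looks roughly like this (a rough sketch rather than the exact command; the model path mirrors the one above, and flag names may differ between vLLM versions):

vllm serve /app/data/asr/whisper --host 0.0.0.0 --port 8000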

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Whisper is not yet supported by the PyTorch backend. The list of supported models can be found here.
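
A quick way to confirm what the backend is tripping on is to inspect the architectures field of the checkpoint's Hugging Face config (the path below is the one from the reproduction command; this assumes config.json uses the usual multi-line HF layout):

grep -A2 '"architectures"' /app/data/asr/whisper/config.json

For this checkpoint it should print WhisperForConditionalGeneration, which is exactly the architecture the PyTorch backend's AutoModelForCausalLM registry rejects in the traceback above.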