NVIDIA/TensorRT-LLM
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.
C++ · Apache-2.0
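For reference, a minimal sketch of the high-level Python API described above, following the pattern from the project's quickstart; the model name is only an example:

```python
from tensorrt_llm import LLM, SamplingParams

# Downloads/loads the model; a Hugging Face ID or a local checkpoint path works.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() batches the prompts and returns one result per prompt.
for output in llm.generate(["Hello, my name is"], params):
    print(output.outputs[0].text)
```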
Issues
[New Model]: Qwen3-Next-80B-A3B
#7694 opened by troycheng - 0
[Feature][AutoDeploy] Update sequence info to also carry other metadata about the request, such as the number of prefill and decode requests
#8032 opened by suyoggupta - 7
[Bug]: FusedAddRMSNorm failed with "device-side assert triggered" when using Eagle3
#7691 opened by ValeGian - 0
[Installation]: I am building NVIDIA/TensorRT-LLM on Ubuntu 24.04, CUDA 12.9, RTX 5060. Here are some records.
#7896 opened by cainslayer - 3
[Bug]: min_tokens does not work
#7693 opened by Alireza3242 - 0
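A hedged repro sketch for the min_tokens report above, assuming the vLLM-style `SamplingParams(min_tokens=...)` field the title points at and a `token_ids` field on each completion; the model is a placeholder, not from the report:

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model

# min_tokens should suppress EOS until at least 32 tokens are produced;
# per the report, generation can still stop short of that.
params = SamplingParams(max_tokens=128, min_tokens=32)

result = llm.generate(["Say hi."], params)[0]
print(len(result.outputs[0].token_ids), "tokens generated (expected >= 32)")
```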
Is W4A4KV4 inference (NVFP4 KV cache) on the 5090 still not supported by TRT-LLM?
#7988 opened by zjq0455 - 0
[New Model]: Qwen2.5-VL
#7978 opened by indexer0318 - 5
[Bug]: EXAONE 4.0 with VSWA trtllm-serve failure
#7741 opened by lkm2835 - 1
[Usage]: How to set tp_size and pp_size when there is one server and one Jetson?
#7823 opened by mamba824824 - 1
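Not an answer to the question above, just a sketch of where these knobs live, assuming the usual `tensor_parallel_size` / `pipeline_parallel_size` constructor arguments of the LLM API:

```python
from tensorrt_llm import LLM

# Tensor parallelism shards each layer across GPUs with fast interconnect;
# pipeline parallelism splits the layer stack across devices/nodes, which is
# typically the only option for a heterogeneous server + Jetson pair.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    tensor_parallel_size=1,
    pipeline_parallel_size=2,
)
```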
[New Model]: support for Qwen3-1.7B
#7748 opened by MengyuanHan1 - 6
[Bug]: Multimodal Model Fails in Multi-turn Dialogue with Mixed Message Types
#7929 opened by Nekofish-L - 0
[Bug]: Guided Decoding with MTP > 1, draft model produces tokens not accepted by matcher (xgrammar)
#7878 opened by Shang-Pin - 0
[Usage]: Mistral 3.1 Torch Backend
#7847 opened by CAlexander0614 - 2
[Bug]: CUDA Out Of Memory on GPU 0
#7818 opened by CAlexander0614 - 0
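For OOM reports like the one above, one commonly adjusted knob is the KV-cache memory budget; a sketch assuming the llmapi `KvCacheConfig` with `free_gpu_memory_fraction`:

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# free_gpu_memory_fraction caps how much of the remaining GPU memory the
# KV cache may claim; lowering it leaves headroom for activations.
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.3",  # placeholder model
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.7),
)
```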
[Feature]: Eagle-3 one-model for FP8 target model
#7842 opened by ValeGian - 0
[Bug]: Qwen3-30B-A3B-Thinking-2507-FP4 stuck at "Loading weights concurrently: 63%|██████▎ | 703/1113"
#7703 opened by lsm03624 - 1
[Usage]: Does Medusa support multi-batch multimodal model (like InternVL3) inference?
#7688 opened by cmsfw-github - 1
[Installation]: About building a cross-compiled version of tensorrt-llm with C++ support
#7459 opened by chunyuetang - 3
[Doc]: Failed to parse the arguments for the LLM constructor: _TrtLLM got invalid argument: disable_overlap_scheduler
#7752 opened by wenruihua - 1
[Bug]: TritonFP8QDQFusedMoEMethod crashed on H100
#7418 opened by Atream - 3
[Bug]: Encountered truncation of \u0000
#7533 opened by vip-china - 3
[Usage]: Greedy Search Yields Different Results
#7625 opened by kzhou92 - 1
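A sketch of how greedy decoding is typically requested, relevant to the determinism report above; `top_k=1` / `temperature=0` as the greedy setting is an assumption, and run-to-run differences can still come from non-deterministic kernels:

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model

# top_k=1 with temperature 0 is the usual way to request greedy decoding.
greedy = SamplingParams(temperature=0.0, top_k=1, max_tokens=64)

a = llm.generate(["The capital of France is"], greedy)[0].outputs[0].text
b = llm.generate(["The capital of France is"], greedy)[0].outputs[0].text
print(a == b)  # the report observes this can come out False
```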
[Installation]: tensorrt_llm ImportError: tensorrt_llm/libs/libth_common.so: undefined symbol
#7665 opened by haohaibo - 0
[Bug]: TRT-LLM Deepseek-R1 Reasoning Parser error
#7680 opened by access2rohit - 1
[Usage]: [Qwen2-VL] Inquiry Regarding the Usage of cross_attention_mask Input in C++ Runtime
#7677 opened by deadpoppy - 0
Why do so many kernels use cubins? Is the project fully open-source?
#7652 opened by wenqibiao - 0
[AutoDeploy] Transformer Mode Unit Tests
#7631 opened by lucaslie - 0
[Bug]: [FMHA] Only partial results align with eager attention when using specific mask pattern
#7506 opened by fenghuohuo2001 - 1
[Feature]: Improve CUDA graph capture time
#7440 opened by nzmora-nvidia - 0
[Feature]: Optimize usage of torch.compile
#7540 opened by nzmora-nvidia - 0
[AutoDeploy] Remove Llama 4 MoE Accuracy Patch
#7494 opened by lucaslie - 0
[Feature]: PyTorch workflow does not support bad words
#7438 opened by tonyay163 - 0
[Feature]: Improve the performance of FP8 models
#7434 opened by nzmora-nvidia - 0
[Feature]: Reduce model compilation time
#7428 opened by nzmora-nvidia - 0
[Question]: FP4 KV cache usage
#7411 opened by mengniwang95