NVIDIA/TensorRT-LLM
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.
C++ · Apache-2.0
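For reference, a minimal sketch of the high-level Python API described above, following the pattern from the project's quickstart; the model name is only an example:

```python
from tensorrt_llm import LLM, SamplingParams

# Downloads/loads the model; a Hugging Face ID or a local checkpoint path works.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() batches the prompts and returns one result per prompt.
for output in llm.generate(["Hello, my name is"], params):
    print(output.outputs[0].text)
```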
Issues
[New Model]: Qwen3-Next-80B-A3B
#7694 opened by troycheng - 0
[Feature][AutoDeploy] Update sequence info to also carry other metadata about the request, such as the number of prefill and decode requests
#8032 opened by suyoggupta - 7
[Bug]: FusedAddRMSNorm failed with "device-side assert triggered" when using Eagle3
#7691 opened by ValeGian - 0
[Installation]: I am building NVIDIA/TensorRT-LLM on Ubuntu 24.04, CUDA 12.9, RTX 5060. Here are some records.
#7896 opened by cainslayer - 3
[Bug]: min_tokens does not work
#7693 opened by Alireza3242 - 0
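A hedged repro sketch for the min_tokens report above, assuming the vLLM-style `SamplingParams(min_tokens=...)` field the title points at and a `token_ids` field on each completion; the model is a placeholder, not from the report:

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model

# min_tokens should suppress EOS until at least 32 tokens are produced;
# per the report, generation can still stop short of that.
params = SamplingParams(max_tokens=128, min_tokens=32)

result = llm.generate(["Say hi."], params)[0]
print(len(result.outputs[0].token_ids), "tokens generated (expected >= 32)")
```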
Is W4A4KV4 inference (NVFP4 KV cache) on the 5090 still not supported by TRT-LLM?
#7988 opened by zjq0455 - 0
[New Model]: Qwen2.5-VL
#7978 opened by indexer0318 - 5
[Bug]: EXAONE 4.0 with VSWA trtllm-serve failure
#7741 opened by lkm2835 - 1
[Usage]: How to set tp_size and pp_size when there is one server and one Jetson?
#7823 opened by mamba824824 - 1
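Not an answer to the question above, just a sketch of where these knobs live, assuming the usual `tensor_parallel_size` / `pipeline_parallel_size` constructor arguments of the LLM API:

```python
from tensorrt_llm import LLM

# Tensor parallelism shards each layer across GPUs with fast interconnect;
# pipeline parallelism splits the layer stack across devices/nodes, which is
# typically the only option for a heterogeneous server + Jetson pair.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    tensor_parallel_size=1,
    pipeline_parallel_size=2,
)
```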
[New Model]: support for Qwen3-1.7B
#7748 opened by MengyuanHan1 - 6
[Bug]: Multimodal Model Fails in Multi-turn Dialogue with Mixed Message Types
#7929 opened by Nekofish-L - 0
[Bug]: Guided Decoding with MTP > 1, draft model produces tokens not accepted by matcher (xgrammar)
#7878 opened by Shang-Pin - 0
[Usage]: Mistral 3.1 Torch Backend
#7847 opened by CAlexander0614 - 2
[Bug]: CUDA Out Of Memory on GPU 0
#7818 opened by CAlexander0614 - 0
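For OOM reports like the one above, one commonly adjusted knob is the KV-cache memory budget; a sketch assuming the llmapi `KvCacheConfig` with `free_gpu_memory_fraction`:

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# free_gpu_memory_fraction caps how much of the remaining GPU memory the
# KV cache may claim; lowering it leaves headroom for activations.
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.3",  # placeholder model
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.7),
)
```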
[Feature]: Eagle-3 one-model for FP8 target model
#7842 opened by ValeGian - 0
[Bug]: Qwen3-30B-A3B-Thinking-2507-FP4 stuck at "Loading weights concurrently: 63%|██████▎ | 703/1113"
#7703 opened by lsm03624 - 1
[Usage]: Does Medusa support multi-batch multimodal model (like InternVL3) inference?
#7688 opened by cmsfw-github - 1
[Installation]: About building a cross-compiled version of tensorrt-llm with C++ support
#7459 opened by chunyuetang - 3
[Doc]: Failed to parse the arguments for the LLM constructor: _TrtLLM got invalid argument: disable_overlap_scheduler
#7752 opened by wenruihua - 1
[Bug]: TritonFP8QDQFusedMoEMethod crashed on H100
#7418 opened by Atream - 3
[Bug]: Encountered truncation of \u0000
#7533 opened by vip-china - 3
[Usage]: Greedy Search Yields Different Results
#7625 opened by kzhou92 - 1
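A sketch of how greedy decoding is typically requested, relevant to the determinism report above; `top_k=1` / `temperature=0` as the greedy setting is an assumption, and run-to-run differences can still come from non-deterministic kernels:

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model

# top_k=1 with temperature 0 is the usual way to request greedy decoding.
greedy = SamplingParams(temperature=0.0, top_k=1, max_tokens=64)

a = llm.generate(["The capital of France is"], greedy)[0].outputs[0].text
b = llm.generate(["The capital of France is"], greedy)[0].outputs[0].text
print(a == b)  # the report observes this can come out False
```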
[Installation]: tensorrt_llm ImportError: tensorrt_llm/libs/libth_common.so: undefined symbol
#7665 opened by haohaibo - 0
[Bug]: TRT-LLM Deepseek-R1 Reasoning Parser error
#7680 opened by access2rohit - 1
[Usage]: [Qwen2-VL] Inquiry Regarding the Usage of cross_attention_mask Input in C++ Runtime
#7677 opened by deadpoppy - 0
Why do so many kernels use cubins? Is the project fully open-source?
#7652 opened by wenqibiao - 0
[AutoDeploy] Transformer Mode Unit Tests
#7631 opened by lucaslie - 0
[Bug]: [FMHA] Only partial results align with eager attention when using specific mask pattern
#7506 opened by fenghuohuo2001 - 1
[Feature]: Improve CUDA graph capture time
#7440 opened by nzmora-nvidia - 0
[Feature]: Optimize usage of torch.compile
#7540 opened by nzmora-nvidia - 0
[AutoDeploy] Remove Llama 4 MoE Accuracy Patch
#7494 opened by lucaslie - 0
[Feature]: PyTorch workflow does not support bad words
#7438 opened by tonyay163 - 0
[Feature]: Improve the performance of FP8 models
#7434 opened by nzmora-nvidia - 0
[Feature]: Reduce model compilation time
#7428 opened by nzmora-nvidia - 0
[Question]: FP4 KV cache usage
#7411 opened by mengniwang95