NVIDIA/TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
C++ · Apache-2.0
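For orientation, a minimal sketch of the Python API described above, following the pattern of the project's quickstart (hedged: assumes the high-level `LLM` entry point available in recent releases; the TinyLlama model name is only a placeholder):

```python
from tensorrt_llm import LLM, SamplingParams

def main():
    # Engine build/load happens inside this constructor; TensorRT-LLM's
    # optimizations are applied when the engine is created.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder model

    prompts = ["Hello, my name is", "The capital of France is"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # generate() runs inference on the built TensorRT engine.
    for output in llm.generate(prompts, sampling_params):
        print(f"Prompt: {output.prompt!r} -> {output.outputs[0].text!r}")

if __name__ == "__main__":
    main()
```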
Issues
setuptools conflict
#2655 opened by kanebay - 0
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: 'NoneType' object is not iterable
#2652 opened by Whisht - 0
Mpool Failure on H100 DGX node
#2649 opened by christian-ci - 3
Adding custom sampling config
#2609 opened by buddhapuneeth - 0
Throughput Measurements
#2648 opened by Alireza3242 - 0
How to suppress the WARNING logging?
#2610 opened by lxp3 - 0
Why can't I get a correct response when using ExecutorInstanceBasic.cpp to run inference on the Qwen model with system prompt tokens as input?
#2642 opened by aaIce - 0
Failed to build engine with lookahead_decoding
#2641 opened by aikitoria - 1
Greedy search results vary with each inference
#2640 opened by fclearner - 0
Multi-modal TRT-LLM on aarch64 (Holoscan IGX Devkit) fails to convert VILA checkpoints
#2638 opened by MMelQin - 0
Attention clarification
#2639 opened by Saeedmatt3r - 2
Support for T4
#2620 opened by krishnanpooja - 0
RuntimeError: Encountered an error when fetching new request: Inplace update to inference tensor outside InferenceMode is not allowed. You can make a clone to get a normal tensor before doing inplace update.
#2635 opened by anapple-hub - 0
RuntimeError: Encountered an error when fetching new request: Inplace update to inference tensor outside InferenceMode is not allowed. You can make a clone to get a normal tensor before doing inplace update.
#2636 opened by anapple-hub - 3
Unable to install TensorRT-LLM
#2597 opened by gowthamtupili - 1
Troubleshooting the Mistral model
#2632 opened by krishnanpooja - 0
Qwen2.5-72B-Instruct YaRN BUG
#2630 opened by PaulX1029 - 4
[Performance] Why do compiled TRT-LLM models have worse performance than torch.compile models?
#2627 opened by FPTMMC - 0
Error with LoRA Weights Data Type in Quantized TensorRT-LLM Model Execution
#2628 opened by Alireza3242 - 0
How to use continuous KV cache with prefix prompt caching in the GPT attention plugin during the context phase?
#2593 opened by FPTMMC - 1
Redrafter fp8 support
#2607 opened by darraghdog - 1
[Performance] What is the purpose of compiling a model?
#2617 opened by Flynn-Zh - 0
gather_generation_logits doesn't seem to work correctly for SequenceClassification models
#2615 opened by TriLoo - 0
Phi4 support?
#2616 opened by oscarbg - 0
Error during TensorRT-LLM build - Invalid shape and type mismatch in elementwise addition
#2622 opened by cocovoc - 0
No module named 'tensorrt_llm.bindings'
#2599 opened by WGS-note - 1
Qwen 2.5 hallucinating
#2600 opened by ChristophHandschuh - 1
trtllm-serve: Failure to launch the OpenAI API on multiple nodes with 8 GPUs each
#2594 opened by sivabreddy - 0
Does prefix caching support VLM models?
#2608 opened by sleepwalker2017 - 0
[Performance] TTFT of Qwen2.5 0.5B model
#2598 opened by ReginaZh - 0
SmoothQuant doesn't work with LoRA
#2604 opened by ShuaiShao93 - 0
Gemma 2 LoRA support
#2606 opened by Aquasar11 - 0
LoRA doesn't work with --use_fp8_rowwise
#2603 opened by ShuaiShao93 - 0
--use_fp8 doesn't work with Llama 3.1 8B
#2602 opened by ShuaiShao93 - 0
Garbled Chinese characters when decoding in stream mode
#2595 opened by fan-niu - 0
Is there any way to only convert the visual part of qwen2vl into a TensorRT model?
#2590 opened by zhaopings - 0
InternVL with batch_size > 1
#2591 opened by nzarif