tensorrt-llm

There are 27 repositories under the tensorrt-llm topic.

  • Awesome-LLM-Inference

    xlite-dev/Awesome-LLM-Inference

    📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉

    Language: Python
  • collabora/WhisperLive

    A nearly-live implementation of OpenAI's Whisper.

    Language: Python
  • shashikg/WhisperS2T

    An Optimized Speech-to-Text Pipeline for the Whisper Model, Supporting Multiple Inference Engines

    Language: Jupyter Notebook
  • coderonion/awesome-cuda-and-hpc

    🚀🚀🚀 This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX and High Performance Computing (HPC) projects.

  • huggingface/optimum-benchmark

    🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of Optimum's hardware optimizations & quantization schemes.

    Language: Python
  • npuichigo/openai_trtllm

    OpenAI compatible API for TensorRT LLM triton backend

    Language: Rust
  • NetEase-Media/grps

    Deep learning deployment framework supporting tf/torch/trt/trtllm/vllm and other NN frameworks, with dynamic batching and streaming modes. It is dual-language compatible with Python and C++, offering scalability, extensibility, and high performance, and helps users quickly deploy models and serve them through HTTP/RPC interfaces.

    Language: C++
  • NetEase-Media/grps_trtllm

    A pure-C++, high-performance OpenAI-compatible LLM service built on GRPS + TensorRT-LLM + Tokenizers.cpp, supporting chat and function calling, AI agents, distributed multi-GPU inference, multimodal inputs, and a Gradio chat interface; claims higher performance than `vllm serve`.

    Language: Python
  • openhackathons-org/End-to-End-LLM

    AI Bootcamp material consisting of an end-to-end workflow for LLMs.

    Language: Jupyter Notebook
  • vossr/Chat-With-RTX-python-api

    Chat With RTX Python API

    Language: Python
  • guidance-ai/llgtrt

    TensorRT-LLM server with Structured Outputs (JSON) built with Rust

    Language: Rust
  • fgblanch/OutlookLLM

    Add-in for the new Outlook that adds LLM features (composition, summarization, Q&A) using a local LLM served via NVIDIA TensorRT-LLM.

    Language: Python
  • menloresearch/cortex.tensorrt-llm

    Cortex.Tensorrt-LLM is a C++ inference library that can be loaded by any server at runtime. It includes NVIDIA's TensorRT-LLM as a submodule for GPU-accelerated inference on NVIDIA GPUs.

    Language: C++
  • argonne-lcf/LLM-Inference-Bench

    LLM-Inference-Bench

    Language: Jupyter Notebook
  • CactusQ/TensorRT-LLM-Tutorial

    Getting started with TensorRT-LLM using BLOOM as a case study

    Language: Jupyter Notebook
  • lix19937/llm-deploy

    AI infra: LLM inference with TensorRT-LLM and vLLM.

    Language: Python
  • zRzRzRzRzRzRzR/lm-fly

    Accelerating LLM inference frameworks to make LLMs fly.

    Language: Python
  • EdVince/whisper-trtllm

    Whisper in TensorRT-LLM

    Language: C++
  • Delxrius/MiniMax-01

    MiniMax-01 is a simple implementation of the MiniMax algorithm, a widely used strategy for decision-making in two-player turn-based games like Tic-Tac-Toe. The algorithm aims to minimize the maximum possible loss for the player, making it a popular choice for developing AI opponents in various game scenarios.

  • j3soon/LLM-Tutorial

    LLM tutorial materials covering, but not limited to, NVIDIA NeMo, TensorRT-LLM, Triton Inference Server, and NeMo Guardrails.

    Language: Jupyter Notebook
  • ccyrene/flash_whisper

    Whisper optimization for real-time applications

    Language: Python
  • MustaphaU/Simplify-Documentation-Review-on-Atlassian-Confluence-with-LLAMA2-and-NVIDIA-TensorRT-LLM

    A simple project demonstrating LLM-assisted review of documentation on Atlassian Confluence.

    Language: Python
  • Rahman2001/nim-factory

    A factory for NVIDIA NIM containers in which users and businesses can quantize models and build their own TensorRT-LLM engines for optimized inference.

    Language: Jupyter Notebook
  • YconquestY/cc

    Summary of the call graphs and data structures of the collective communication plugin in NVIDIA TensorRT-LLM.

    Language: D2
  • yui-mhcp/language_models

    A Large Language Models (LLM) oriented project providing easy-to-use features like RAG, translation, summarization, ...

    Language: Python
  • cyanff/nyxt

    Language: TypeScript
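Several of the servers listed above (openai_trtllm, grps_trtllm, llgtrt) expose OpenAI-compatible HTTP APIs, so a standard `/v1/chat/completions` request works against them. The sketch below builds such a request with only the standard library; the endpoint URL and model name are placeholders, not values from any of these projects — check each project's docs for the real ones.

```python
import json
import urllib.request

def build_chat_request(base_url, model, messages):
    """Build an OpenAI-style /v1/chat/completions request (not yet sent)."""
    payload = {"model": model, "messages": messages, "stream": False}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "http://localhost:8000",  # placeholder endpoint
    "ensemble",               # placeholder model name
    [{"role": "user", "content": "Hello!"}],
)
# Once a server is running, send it with: urllib.request.urlopen(req)
```

Because the payload follows the OpenAI wire format, the same request shape works whether the backend is TensorRT-LLM, vLLM, or the hosted OpenAI API.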
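The MiniMax-01 entry describes the classic minimax algorithm for two-player turn-based games. As a point of reference, here is a minimal self-contained sketch for Tic-Tac-Toe (the function names are illustrative and not taken from that repository):

```python
# Minimal minimax for Tic-Tac-Toe. A board is a 9-character string,
# indices 0-8 row by row, each cell 'X', 'O', or ' '.

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Return 'X' or 'O' if that player has three in a line, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Best achievable score for the side to move: X maximizes (+1 = X wins),
    O minimizes (-1 = O wins), 0 is a draw under perfect play."""
    w = winner(board)
    if w == 'X':
        return 1
    if w == 'O':
        return -1
    if ' ' not in board:
        return 0  # board full, draw
    other = 'O' if player == 'X' else 'X'
    scores = [minimax(board[:i] + player + board[i + 1:], other)
              for i, cell in enumerate(board) if cell == ' ']
    return max(scores) if player == 'X' else min(scores)
```

With X to move on `"XX OO    "`, `minimax` returns 1 (X completes the top row), and from the empty board it returns 0, reflecting that perfect play in Tic-Tac-Toe is a draw.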