vllm
There are 158 repositories under the vllm topic.
meta-llama/llama-cookbook
Welcome to the Llama Cookbook! This is your go-to guide for building with Llama: getting started with inference, fine-tuning, and RAG. We also show you how to solve end-to-end problems with the Llama model family and how to use the models on various provider services.
xorbitsai/inference
Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
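Several projects in this list (Xinference, vLLM, llmaz, llmariner) expose the same OpenAI-compatible chat completions API, which is what makes the "single line of code" swap possible: the request payload stays identical and only the endpoint URL changes. A minimal sketch of that idea, using only the standard library; the local port and model name are assumptions for illustration, not values taken from any of these projects:

```python
import json

# Hosted endpoint vs. an assumed local OpenAI-compatible server.
# The port and model name below are hypothetical -- adjust to your deployment.
OPENAI_URL = "https://api.openai.com/v1/chat/completions"
LOCAL_URL = "http://localhost:9997/v1/chat/completions"

def build_chat_request(base_url: str, model: str, prompt: str) -> tuple[str, bytes]:
    """Build an OpenAI-style chat completion request; only the URL differs."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return base_url, json.dumps(payload).encode("utf-8")

# The "single line" swap: same payload shape, different endpoint and model id.
url, body = build_chat_request(LOCAL_URL, "my-local-model", "Hello!")
```

Because the wire format is shared, any OpenAI SDK or HTTP client pointed at the local URL works unchanged.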
OpenRLHF/OpenRLHF
An easy-to-use, scalable, and high-performance RLHF framework based on Ray (PPO, GRPO, REINFORCE++, vLLM, dynamic sampling, and async agentic RL)
katanaml/sparrow
Structured data extraction and instruction calling with ML, LLMs, and vision LLMs
xlite-dev/Awesome-LLM-Inference
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
containers/ramalama
RamaLama is an open-source developer tool that simplifies the local serving of AI models from any source and facilitates their use for inference in production, all through the familiar language of containers.
vllm-project/vllm-ascend
Community maintained hardware plugin for vLLM on Ascend
bricks-cloud/BricksLLM
🔒 Enterprise-grade API gateway that helps you monitor and impose cost or rate limits per API key. Get fine-grained access control and monitoring per user, application, or environment. Supports OpenAI, Azure OpenAI, Anthropic, vLLM, and open-source LLMs.
substratusai/kubeai
AI Inference Operator for Kubernetes. The easiest way to serve ML models in production. Supports VLMs, LLMs, embeddings, and speech-to-text.
apconw/sanic-web
A lightweight, full-pipeline, easily extensible large-model application project (Large Model Data Assistant) supporting models such as DeepSeek and Qwen3. A one-stop development project built on Dify, LangChain/LangGraph, Ollama & vLLM, Sanic, and Text2SQL 📊, with a modern UI built on Vue3, TypeScript, and Vite 5. It supports LLM-driven graphical data Q&A via ECharts 📈 and tabular Q&A over CSV files 📂, and integrates easily with third-party open-source RAG retrieval systems 🌐 to support broad general-knowledge Q&A.
prometheus-eval/prometheus-eval
Evaluate your LLM's responses with Prometheus and GPT-4 💯
harleyszhang/llm_note
LLM notes covering model inference, transformer model structure, and LLM framework code analysis.
ModelCloud/GPTQModel
LLM quantization (compression) toolkit with hardware acceleration support for NVIDIA CUDA, AMD ROCm, Intel XPU, and Intel/AMD/Apple CPUs via HF, vLLM, and SGLang.
jakobdylanc/llmcord
Make Discord your LLM frontend - supports any OpenAI-compatible API (Ollama, xAI, Gemini, OpenRouter, and more)
varunvasudeva1/llm-server-docs
Documentation on setting up a local LLM server on Debian from scratch, using Ollama/llama.cpp/vLLM, Open WebUI, Kokoro FastAPI, and ComfyUI.
ModelTC/llmc
[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".
microsoft/vidur
A large-scale simulation framework for LLM inference
varunshenoy/super-json-mode
Low latency JSON generation using LLMs ⚡️
runpod-workers/worker-vllm
The RunPod worker template for serving our large language model endpoints. Powered by vLLM.
chtmp223/topicGPT
TopicGPT: A Prompt-Based Framework for Topic Modeling (NAACL'24)
jasonacox/TinyLLM
Set up and run a local LLM and chatbot using consumer-grade hardware.
InftyAI/llmaz
☸️ Easy, advanced inference platform for large language models on Kubernetes. 🌟 Star to support our work!
shell-nlp/gpt_server
gpt_server is an open-source framework for production-grade deployment of LLMs, embedding models, rerankers, ASR, TTS, text-to-image, image editing, and text-to-video.
HuiResearch/Fast-Spark-TTS
High-quality Chinese speech synthesis and voice cloning, based on models such as SparkTTS and OrpheusTTS.
lucasjinreal/Namo-R1
A real-time CPU VLM with 500M parameters. Surpasses Moondream2 and SmolVLM. Train from scratch with ease.
JackYFL/awesome-VLLMs
This repository collects papers on VLLM applications; new papers are added irregularly.
NetEase-Media/grps
Deep learning deployment framework: supports tf/torch/trt/trtllm/vllm and other NN frameworks. Supports dynamic batching and streaming modes. Dual-language compatible with Python and C++, offering scalability, extensibility, and high performance. It helps users quickly deploy models and provide services through HTTP/RPC interfaces.
gotzmann/booster
Booster - an open accelerator for LLMs. Better inference and debugging for AI hackers.
yoziru/nextjs-vllm-ui
Fully-featured, beautiful web interface for vLLM - built with NextJS.
nbasyl/DoRA
Official implementation of "DoRA: Weight-Decomposed Low-Rank Adaptation"
IDEA-Research/RexSeek
Referring any person or objects given a natural language description. Code base for RexSeek and HumanRef Benchmark
ALucek/ppt2desc
Convert PowerPoint files into semantically rich text using vision language models
Trainy-ai/llm-atc
Fine-tuning and serving LLMs on any cloud
OpenCSGs/llm-inference
llm-inference is a platform for publishing and managing LLM inference, providing a wide range of out-of-the-box features for model deployment, such as a UI, a RESTful API, auto-scaling, computing resource management, monitoring, and more.
asprenger/ray_vllm_inference
A simple service that integrates vLLM with Ray Serve for fast and scalable LLM serving.
llmariner/llmariner
Extensible generative AI platform on Kubernetes with OpenAI-compatible APIs.