llm-inference
There are 521 repositories under the llm-inference topic.
nomic-ai/gpt4all
GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
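GPT4All also ships Python bindings; a minimal sketch of fully local generation, assuming the `gpt4all` package and a model name from its catalog (the name below is an example and may have rotated out):

```python
from gpt4all import GPT4All

# Downloads the model on first use; runs entirely on the local machine.
model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")  # catalog name is an assumption
with model.chat_session():
    print(model.generate("Why run LLMs locally?", max_tokens=128))
```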
microsoft/autogen
A programming framework for agentic AI 🤖
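A minimal sketch of AutoGen's two-agent loop, assuming the classic `pyautogen` 0.2-style API (newer releases restructure it); model name and key are placeholders:

```python
from autogen import AssistantAgent, UserProxyAgent

llm_config = {"config_list": [{"model": "gpt-4o-mini", "api_key": "YOUR_KEY"}]}  # placeholders
assistant = AssistantAgent("assistant", llm_config=llm_config)
user = UserProxyAgent("user", human_input_mode="NEVER", code_execution_config=False)

# The user proxy drives the conversation; the assistant replies via the LLM.
user.initiate_chat(assistant, message="List three bottlenecks in LLM inference.")
```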
Lightning-AI/litgpt
20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
bentoml/OpenLLM
Run any open-source LLM, such as Llama 3.1 or Gemma, as an OpenAI-compatible API endpoint in the cloud.
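Because the endpoint is OpenAI-compatible, the stock `openai` client works against it; a sketch assuming a locally started server (the port and model name are placeholders, not verified OpenLLM defaults):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="na")  # local server, key unused
reply = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",          # whatever model the server loaded
    messages=[{"role": "user", "content": "Hello!"}],
)
print(reply.choices[0].message.content)
```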
mistralai/mistral-inference
Official inference library for Mistral models
liguodongiot/llm-action
This project shares the technical principles behind large language models along with hands-on, practical experience.
SJTU-IPADS/PowerInfer
High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
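PowerInfer exploits activation sparsity: for a given token most FFN neurons output (near) zero, so only the predicted-active "hot" neurons need computing. A toy NumPy illustration of the idea, not PowerInfer's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4096, 1024))               # FFN weights: 4096 neurons
x = rng.standard_normal(1024)

active = rng.choice(4096, size=410, replace=False)  # a predictor flags ~10% as "hot"
y = np.zeros(4096)
y[active] = W[active] @ x                           # compute only the active rows
```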
bentoml/BentoML
The easiest way to serve AI apps and models: build reliable inference APIs, LLM apps, multi-model chains, RAG services, and much more!
openvinotoolkit/openvino
OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
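A minimal OpenVINO inference sketch, assuming a model already converted to OpenVINO IR with a static input shape ("model.xml" is a placeholder path):

```python
import numpy as np
import openvino as ov

core = ov.Core()
compiled = core.compile_model(core.read_model("model.xml"), "CPU")  # placeholder path

shape = list(compiled.input(0).shape)               # static input shape of the model
x = np.random.rand(*shape).astype(np.float32)
result = compiled(x)[compiled.output(0)]            # run inference, fetch first output
```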
superduper-io/superduper
Superduper: integrate AI models and machine-learning workflows with your database to implement custom AI applications without moving your data, including streaming inference, scalable model hosting, training, and vector search.
InternLM/lmdeploy
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
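A sketch of LMDeploy's high-level pipeline API; the model id is an example and the `.text` field on response objects is an assumption from its docs:

```python
from lmdeploy import pipeline

pipe = pipeline("internlm/internlm2-chat-7b")       # example model id
responses = pipe(["What does continuous batching buy you?"])
print(responses[0].text)
```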
neuralmagic/deepsparse
Sparsity-aware deep learning inference runtime for CPUs
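A conceptual sketch of why sparsity helps on CPUs (not DeepSparse's API): a heavily pruned weight matrix stored in a sparse format skips the zero entries entirely:

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
W = rng.standard_normal((2048, 2048)).astype(np.float32)
W[rng.random(W.shape) < 0.9] = 0.0        # 90% unstructured sparsity, as after pruning

W_sparse = csr_matrix(W)                  # stores only the ~10% nonzero weights
x = rng.standard_normal(2048).astype(np.float32)
assert np.allclose(W_sparse @ x, W @ x, atol=1e-3)
```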
DefTruth/Awesome-LLM-Inference
📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, continuous batching, FlashAttention, PagedAttention, etc.
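Several of the listed techniques (WINT8/4, AWQ) are weight-only quantization; a toy per-channel INT8 sketch of the core idea:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512)).astype(np.float32)

scale = np.abs(W).max(axis=1, keepdims=True) / 127.0   # one scale per output channel
W_int8 = np.round(W / scale).astype(np.int8)           # weights stored at 8 bits

x = rng.standard_normal(512).astype(np.float32)
y = (W_int8.astype(np.float32) * scale) @ x            # dequantize on the fly
print(float(np.abs(y - W @ x).max()))                  # small quantization error
```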
databricks/dbrx
Code examples and resources for DBRX, a large language model developed by Databricks
FasterDecoding/Medusa
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
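Medusa's extra decoding heads draft several future tokens that the base model then verifies. A toy greedy draft-and-verify loop over a mock model (the general idea; real implementations verify all drafts in one batched forward pass):

```python
def base_next_token(ctx):                 # stand-in for one base-model forward pass
    return (sum(ctx) + 1) % 7

def verify(draft, ctx):
    """Keep the longest draft prefix the base model agrees with, then append
    the base model's own token at the first disagreement (one 'free' token)."""
    accepted = []
    for tok in draft:
        if base_next_token(ctx + accepted) != tok:
            break
        accepted.append(tok)
    accepted.append(base_next_token(ctx + accepted))
    return accepted

print(verify(draft=[1, 2, 3], ctx=[0]))   # -> [1, 2, 4]: two accepted + one corrected
```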
NVIDIA/GenerativeAIExamples
Generative AI reference workflows optimized for accelerated infrastructure and microservice architecture.
intel/intel-extension-for-transformers
⚡ Build your chatbot within minutes on your favorite device; offers SOTA compression techniques for LLMs and runs LLMs efficiently on Intel platforms ⚡
predibase/lorax
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
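The reason multi-LoRA serving scales: adapters are tiny low-rank factors applied on top of one shared base weight, so a server like LoRAX can hot-swap the (A, B) pair per request. A toy sketch of the math:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 1024, 8, 16.0
W = rng.standard_normal((d, d)).astype(np.float32)   # frozen base weight (shared)
A = rng.standard_normal((r, d)).astype(np.float32)   # per-adapter down-projection
B = rng.standard_normal((d, r)).astype(np.float32)   # per-adapter up-projection

x = rng.standard_normal(d).astype(np.float32)
y = W @ x + (alpha / r) * (B @ (A @ x))              # adapter applied without merging
```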
liltom-eth/llama2-webui
Run any Llama 2 model locally with a Gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac). Use `llama2-wrapper` as your local Llama 2 backend for generative agents/apps.
microsoft/aici
AICI: Prompts as (Wasm) Programs
b4rtaz/distributed-llama
Tensor parallelism is all you need. Run LLMs on an AI cluster at home using any device. Distribute the workload, divide RAM usage, and increase inference speed.
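The core trick behind running one model across home devices: shard each weight matrix so every node computes a slice of the matmul, then combine the partial results. A toy single-process illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4096, 1024))
x = rng.standard_normal(1024)

shards = np.split(W, 4, axis=0)            # 4 "devices", each holding 1/4 of the rows
partials = [w_i @ x for w_i in shards]     # each runs in parallel on a real cluster
y = np.concatenate(partials)               # gather the slices
assert np.allclose(y, W @ x)
```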
dstackai/dstack
dstack is an open-source alternative to Kubernetes, designed to simplify the development, training, and deployment of AI across any cloud or on-prem. It supports NVIDIA, AMD, and TPU accelerators.
ray-project/ray-llm
RayLLM - LLMs on Ray
flashinfer-ai/flashinfer
FlashInfer: Kernel Library for LLM Serving
lean-dojo/LeanCopilot
LLMs as Copilots for Theorem Proving in Lean
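A minimal Lean sketch of how such a copilot is invoked mid-proof, assuming LeanCopilot's documented tactics (tactic name taken from its README; treat as an assumption):

```lean
import LeanCopilot

-- Ask the LLM-backed tactic to search for a complete proof of the goal.
example (a b : Nat) : a + b = b + a := by
  search_proof
```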
character-ai/prompt-poet
Streamlines and simplifies prompt design for both developers and non-technical users with a low-code approach.
SafeAILab/EAGLE
Official Implementation of EAGLE-1 and EAGLE-2
stoyan-stoyanov/llmflows
LLMFlows - Simple, Explicit and Transparent LLM Apps
ghimiresunil/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing
LLM-PowerHouse: Unleash LLMs' potential through curated tutorials, best practices, and ready-to-use code for custom training and inferencing.
anarchy-ai/LLM-VM
irresponsible innovation. Try now at https://chat.dev/
run-ai/genv
GPU environment and cluster management with LLM support
hpcaitech/SwiftInfer
Efficient AI Inference & Serving
rohan-paul/LLM-FineTuning-Large-Language-Models
LLM (Large Language Model) FineTuning
Kenza-AI/sagify
LLMs and Machine Learning done easily
FlagAI-Open/Aquila2
The official repo of Aquila2 series proposed by BAAI, including pretrained & chat large language models.
EulerSearch/embedding_studio
Embedding Studio is a framework that allows you to transform your vector database into a feature-rich search engine.