- Open LLM Leaderboard
- LLM Perf Leaderboard
- LLMPerf Leaderboard
- LLM API Hosts Leaderboard
- LLM Safety Leaderboard (for compressed models)
- MTEB (Massive Text Embedding Benchmark) Leaderboard
- BIG-bench
- Megatron-LM Ongoing research on training transformer models at scale.
- DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective (see the sketch below).
- RedCoast (Redco) A lightweight tool to automate distributed training and inference.
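As a reference for the DeepSpeed entry above, here is a minimal sketch of wrapping a PyTorch model for ZeRO-based training; the model, batch size, and config values are placeholders, and the script would normally be launched with the `deepspeed` launcher.

```python
# Minimal sketch of wrapping a PyTorch model with DeepSpeed (placeholder model and config).
# Normally launched with the DeepSpeed launcher, e.g.: deepspeed train_sketch.py
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # placeholder model

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},  # ZeRO-2: shard optimizer states and gradients
}

# Returns (engine, optimizer, dataloader, lr_scheduler); the engine wraps forward/backward/step.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for step in range(10):
    batch = torch.randn(8, 1024, device=engine.device, dtype=torch.half)
    loss = engine(batch).float().pow(2).mean()  # dummy loss just to drive the loop
    engine.backward(loss)  # DeepSpeed handles loss scaling and gradient partitioning
    engine.step()          # optimizer step + gradient zeroing
```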
- LocalAI
- Ollama
- vLLM A high-throughput and memory-efficient inference and serving engine for LLMs (see the sketch after this list)
- TensorRT-LLM
- llama.cpp
- LM Studio
- Outlines
- gpt4all
- gpt4free
- privateGPT
- MLC-LLM (C++) Enable everyone to develop, optimize, and deploy AI models natively on their own devices.
- llamafile Distribute and run LLMs with a single file
- koboldcpp
- exllamav2 (C++) A fast inference library for running LLMs locally on modern consumer-class GPUs.
- xinference
- lmdeploy is a toolkit for compressing, deploying, and serving LLMs
- FlexGen (Python) Running large language models on a single GPU for throughput-oriented scenarios
- OpenLLM Run any open-source LLM, such as Llama 2 or Mistral, as an OpenAI-compatible API endpoint in the cloud
- Text Generation Inference
- CTranslate2 (C++) A fast inference engine for Transformer models in C++.
- DeepSpeed-MII MII makes low-latency and high-throughput inference possible, powered by DeepSpeed
- AirLLM
- FlexFlow Serve (C++, Python) An open-source compiler and distributed system for low-latency, high-performance LLM serving.
- InferFlow (C++) is an efficient and highly configurable inference engine for large language models (LLMs).
- ExeGPT Constraint-Aware Resource Scheduling for LLM Inference.
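For the vLLM entry above, a minimal sketch of its offline batched-generation API; the model name is only an example, and the same model can instead be exposed through vLLM's OpenAI-compatible server.

```python
# Minimal sketch of vLLM's offline batched generation (the model name is just an example).
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "Explain KV caching in one sentence:",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # weights are pulled from the HF Hub
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)

# The same model can be served behind an OpenAI-compatible endpoint with:
#   python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.2
```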
- Chunking of input documents
- Compression of input tokens: LLMLingua Series (see the sketch after this list)
- Summarization of input tokens
- Avoid adding few-shot examples
- Limit the length of the output and its formatting
- LlamaIndex Routers and LLMSingleSelector
- NVIDIA NeMo Guardrails
- Dynamically route logic based on input with LangChain
- GPTCache A semantic cache for LLM queries (see the sketch below)
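For the prompt-compression item above, a minimal sketch of the LLMLingua `PromptCompressor` API; the context strings and token budget are placeholders, and the default compressor downloads a LLaMA-family scoring model.

```python
# Minimal sketch of prompt compression with LLMLingua (placeholder context and token budget).
from llmlingua import PromptCompressor

compressor = PromptCompressor()  # default setup downloads a LLaMA-family model to score tokens

context = [
    "First retrieved document (placeholder text).",
    "Second retrieved document (placeholder text).",
]
result = compressor.compress_prompt(
    context,
    instruction="Answer the question using the context.",
    question="What does the warranty cover?",
    target_token=300,  # rough budget for the compressed context
)

# Send the compressed prompt to the LLM instead of the full context;
# the returned dict also reports original vs. compressed token counts.
print(result["compressed_prompt"])
```

For the GPTCache item, the project's quickstart pattern: its adapter wraps the legacy `openai` client so that repeated (or, with an embedding backend, semantically similar) requests are answered from the cache. A sketch, assuming an OpenAI API key is available.

```python
# Minimal sketch of response caching with GPTCache (exact-match cache by default;
# embedding-based semantic caching is configured via cache.init(...)).
from gptcache import cache
from gptcache.adapter import openai  # drop-in wrapper around the legacy openai client

cache.init()
cache.set_openai_key()  # reads OPENAI_API_KEY from the environment

# The first call goes to the API; an identical follow-up call is served from the cache.
answer = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What is a KV cache?"}],
)
print(answer["choices"][0]["message"]["content"])
```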
- KV-Runahead Scalable Causal LLM Inference by Parallel Key-Value Cache Generation.
- 202404 PrimePar: Efficient Spatial-temporal Tensor Partitioning for Large Transformer Model Training
- 2024 DistPar: Tensor Partitioning for Distributed Neural Network Computing
- 202211 Efficiently Scaling Transformer Inference
- LLM-PQ Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization.
- HexGen Serving LLMs on heterogeneous decentralized clusters.
- MOIRAI Towards Optimal Placement for Distributed Inference on Heterogeneous Devices.
- 202403 HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices
- 202401 Accelerating Heterogeneous Tensor Parallelism via Flexible Workload Control
- LLM AutoEval: Automatically evaluate your LLMs using RunPod
- LazyMergekit Easily merge models using MergeKit in one click
- AutoQuant Quantize LLMs in GGUF, GPTQ, EXL2, AWQ, and HQQ formats in one click
- Model Family Tree Visualize the family tree of merged models
- ZeroSpace Automatically create a Gradio chat interface using a free ZeroGPU
- ExLlamaV2 Colab Quantize and run EXL2 models and upload them to the HF Hub
- LMQL is a Python-based programming language for LLMs with declarative constraint elements (see the sketch below).
- Sarathi-Serve Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve.
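For the LMQL entry above, a minimal sketch of its declarative style through the Python decorator API; the constraint syntax follows the LMQL documentation, and the model choice is an assumption that requires an OpenAI API key.

```python
# Minimal sketch of LMQL's declarative constraints via its Python decorator API
# (syntax follows the LMQL docs; the model choice is an assumption and needs an OpenAI API key).
import lmql

@lmql.query(model="openai/gpt-3.5-turbo-instruct")
def summarize(text):
    '''lmql
    "Summarize in one sentence: {text}\n"
    "Summary: [SUMMARY]" where len(TOKENS(SUMMARY)) < 40
    return SUMMARY
    '''

print(summarize("LMQL mixes Python control flow with templated LLM calls and constraints."))
```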