- Open LLM Leaderboard
- LLM Perf Leaderboard
- LLMPerf Leaderboard
- LLM API Hosts Leaderboard
- LLM Safety Leaderboard (for compressed models)
- MTEB (Massive Text Embedding Benchmark) Leaderboard
- BIG-bench
- Megatron-LM Ongoing research on training transformer models at scale.
- DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective (see the sketch below).
- RedCoast (Redco) A lightweight tool to automate distributed training and inference.
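As a reference for the DeepSpeed entry above, here is a minimal sketch of wrapping a PyTorch model for ZeRO-based training; the model, batch size, and config values are placeholders, and the script would normally be launched with the `deepspeed` launcher.

```python
# Minimal sketch of wrapping a PyTorch model with DeepSpeed (placeholder model and config).
# Normally launched with the DeepSpeed launcher, e.g.: deepspeed train_sketch.py
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # placeholder model

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},  # ZeRO-2: shard optimizer states and gradients
}

# Returns (engine, optimizer, dataloader, lr_scheduler); the engine wraps forward/backward/step.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for step in range(10):
    batch = torch.randn(8, 1024, device=engine.device, dtype=torch.half)
    loss = engine(batch).float().pow(2).mean()  # dummy loss just to drive the loop
    engine.backward(loss)  # DeepSpeed handles loss scaling and gradient partitioning
    engine.step()          # optimizer step + gradient zeroing
```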
- LocalAI
- Ollama
- vLLM A high-throughput and memory-efficient inference and serving engine for LLMs (see the sketch after this list)
- TensorRT-LLM
- llama.cpp
- LM Studio
- Outlines
- gpt4all
- gpt4free
- privateGPT
- MLC-LLM (C++) Enable everyone to develop, optimize, and deploy AI models natively on their own devices.
- llamafile Distribute and run LLMs with a single file
- koboldcpp
- exllamav2 (C++) A fast inference library for running LLMs locally on modern consumer-class GPUs.
- xinference
- lmdeploy is a toolkit for compressing, deploying, and serving LLMs
- FlexGen (Python) Running large language models on a single GPU for throughput-oriented scenarios
- OpenLLM Run any open-source LLM, such as Llama 2 or Mistral, as an OpenAI-compatible API endpoint in the cloud
- Text Generation Inference
- CTranslate2 (C++) A fast inference engine for Transformer models in C++.
- DeepSpeed-MII MII makes low-latency and high-throughput inference possible, powered by DeepSpeed
- AirLLM
- FlexFlow Serve (C++, Python) An open-source compiler and distributed system for low-latency, high-performance LLM serving.
- InferFlow (C++) is an efficient and highly configurable inference engine for large language models (LLMs).
- ExeGPT Constraint-Aware Resource Scheduling for LLM Inference.
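For the vLLM entry above, a minimal sketch of its offline batched-generation API; the model name is only an example, and the same model can instead be exposed through vLLM's OpenAI-compatible server.

```python
# Minimal sketch of vLLM's offline batched generation (the model name is just an example).
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "Explain KV caching in one sentence:",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # weights are pulled from the HF Hub
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)

# The same model can be served behind an OpenAI-compatible endpoint with:
#   python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.2
```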
- Chunking of input documents
- Compression of input tokens: LLMLingua Series (see the sketch after this list)
- Summarization of input tokens
- Avoid adding few-shot examples
- Limit the length of the output and its formatting
- LlamaIndex Routers and LLMSingleSelector
- NVIDIA NeMo Guardrails
- Dynamically route logic based on input with LangChain
- GPTCache A semantic cache for LLM queries (see the sketch below)
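For the prompt-compression item above, a minimal sketch of the LLMLingua `PromptCompressor` API; the context strings and token budget are placeholders, and the default compressor downloads a LLaMA-family scoring model.

```python
# Minimal sketch of prompt compression with LLMLingua (placeholder context and token budget).
from llmlingua import PromptCompressor

compressor = PromptCompressor()  # default setup downloads a LLaMA-family model to score tokens

context = [
    "First retrieved document (placeholder text).",
    "Second retrieved document (placeholder text).",
]
result = compressor.compress_prompt(
    context,
    instruction="Answer the question using the context.",
    question="What does the warranty cover?",
    target_token=300,  # rough budget for the compressed context
)

# Send the compressed prompt to the LLM instead of the full context;
# the returned dict also reports original vs. compressed token counts.
print(result["compressed_prompt"])
```

For the GPTCache item, the project's quickstart pattern: its adapter wraps the legacy `openai` client so that repeated (or, with an embedding backend, semantically similar) requests are answered from the cache. A sketch, assuming an OpenAI API key is available.

```python
# Minimal sketch of response caching with GPTCache (exact-match cache by default;
# embedding-based semantic caching is configured via cache.init(...)).
from gptcache import cache
from gptcache.adapter import openai  # drop-in wrapper around the legacy openai client

cache.init()
cache.set_openai_key()  # reads OPENAI_API_KEY from the environment

# The first call goes to the API; an identical follow-up call is served from the cache.
answer = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What is a KV cache?"}],
)
print(answer["choices"][0]["message"]["content"])
```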
- KV-Runahead Scalable Causal LLM Inference by Parallel Key-Value Cache Generation.
- 202404 PrimePar: Efficient Spatial-temporal Tensor Partitioning for Large Transformer Model Training
- 2024 DistPar: Tensor Partitioning for Distributed Neural Network Computing
- 202211 Efficiently Scaling Transformer Inference
- LLM-PQ Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization.
- HexGen Serving LLMs on heterogeneous decentralized clusters.
- MOIRAI Towards Optimal Placement for Distributed Inference on Heterogeneous Devices.
- 202403 HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices
- 202401 Accelerating Heterogeneous Tensor Parallelism via Flexible Workload Control
- LLM AutoEval: Automatically evaluate your LLMs using RunPod
- LazyMergekit Easily merge models using MergeKit in one click
- AutoQuant Quantize LLMs in GGUF, GPTQ, EXL2, AWQ, and HQQ formats in one click
- Model Family Tree Visualize the family tree of merged models
- ZeroSpace Automatically create a Gradio chat interface using a free ZeroGPU
- ExLlamaV2 Colab Quantize and run EXL2 models and upload them to the HF Hub
- LMQL is a Python-based programming language for LLMs with declarative constraint elements (see the sketch below).
- Sarathi-Serve Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve.
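For the LMQL entry above, a minimal sketch of its declarative style through the Python decorator API; the constraint syntax follows the LMQL documentation, and the model choice is an assumption that requires an OpenAI API key.

```python
# Minimal sketch of LMQL's declarative constraints via its Python decorator API
# (syntax follows the LMQL docs; the model choice is an assumption and needs an OpenAI API key).
import lmql

@lmql.query(model="openai/gpt-3.5-turbo-instruct")
def summarize(text):
    '''lmql
    "Summarize in one sentence: {text}\n"
    "Summary: [SUMMARY]" where len(TOKENS(SUMMARY)) < 40
    return SUMMARY
    '''

print(summarize("LMQL mixes Python control flow with templated LLM calls and constraints."))
```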