Since the emergence of ChatGPT in 2022, accelerating Large Language Model (LLM) inference and serving has become increasingly important. Here is a list of papers on LLM inference and serving.

Paper | Keywords | Institute (first) | Publication | Others |
---|---|---|---|---|
Full Stack Optimization for Transformer Inference: a Survey | Hardware and software co-design | UCB | Arxiv | |
A survey of techniques for optimizing transformer inference | Transformer optimization | Iowa State University | Journal of Systems Architecture | |
A Survey on Model Compression for Large Language Models | Model Compression | UCSD | Arxiv | |
Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems | Optimization techniques: quantization, pruning, continuous batching, virtual memory | CMU | Arxiv | |
LLM Inference Unveiled: Survey and Roofline Model Insights | Performance analysis | Infinigence-AI | Arxiv | LLM-Viewer |

Paper/Open-source Project | Keywords | Institute (first) | Publication | Others |
---|---|---|---|---|
DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale | DeepSpeed; Kernel fusion | Microsoft | SC 2022 | Github repo |
DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference | DeepSpeed; SplitFuse | Microsoft | Arxiv | Github repo |
Efficient Memory Management for Large Language Model Serving with PagedAttention | vLLM; PagedAttention | UCB | SOSP 2023 | Github repo; sketch below |
TensorRT-LLM/FastTransformer | | NVIDIA | | |
lightLLM | | Shanghai Artificial Intelligence Laboratory | | |
MLC LLM | TVM; Multi-platform | MLC Team | | |
Text-Generation-Inference (TGI) | | Huggingface | | |
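
The PagedAttention entry above (vLLM, SOSP 2023) manages the KV cache like virtual memory: each sequence's logically contiguous tokens are mapped onto fixed-size physical blocks through a per-sequence block table. The sketch below illustrates only that mapping idea; the class and method names (`BlockAllocator`, `Sequence`, etc.) are hypothetical and are not vLLM's actual API.

```python
BLOCK_SIZE = 16  # tokens per physical KV-cache block (vLLM's default is also 16)

class BlockAllocator:
    """Hands out fixed-size physical KV-cache blocks from a free pool."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        if not self.free:
            raise RuntimeError("out of KV-cache blocks; preempt or swap a sequence")
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)

class Sequence:
    """A request's logical token stream mapped onto scattered physical blocks."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.num_tokens = 0
        self.block_table = []  # logical block index -> physical block id

    def append_token(self):
        # Allocate a new physical block only when the last one is full, so
        # internal fragmentation is bounded by one block per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def physical_slot(self, token_idx: int):
        # Attention kernels gather K/V through this level of indirection.
        return self.block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

allocator = BlockAllocator(num_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):                 # 40 tokens -> ceil(40/16) = 3 physical blocks
    seq.append_token()
print(seq.block_table, seq.physical_slot(33))
```

Because blocks are fixed-size and non-contiguous, sequences can grow on demand and freed blocks are immediately reusable, which is what lets vLLM pack many more concurrent requests into the same GPU memory.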
Paper | Keywords | Institute (first) | Publication | Others |
---|---|---|---|---|
AIOS: LLM Agent Operating System | OS; LLM Agent | Rutgers University | Arxiv | |

Paper | Keywords | Institute (first) | Publication | Others |
---|---|---|---|---|
Atom: Low-bit Quantization for Efficient and Accurate LLM Serving | Low-bit quantization | SJTU | Arxiv | Github repo; sketch below |
Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference | Dynamic Compression | NVIDIA | Arxiv | |
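
As a rough illustration of what the low-bit quantization entry above involves, here is a minimal symmetric per-tensor round-trip in PyTorch. It is a sketch only: real systems such as Atom use per-group scales, outlier-aware handling, and fused dequantization kernels, none of which appear here.

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int = 4):
    """Symmetric per-tensor quantization to a signed `bits`-bit grid."""
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for int4
    scale = w.abs().max() / qmax                  # one scale for the whole tensor
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale                # int8 container for sub-8-bit values

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)
q, s = quantize_symmetric(w, bits=4)
err = (dequantize(q, s) - w).abs().mean()         # mean quantization error
print(f"mean abs error at 4 bits: {err:.4f}")
```

The per-tensor scale is the weakest point of this sketch: a single outlier weight inflates `scale` and wastes the low-bit grid, which is exactly why production quantizers move to per-group scales and outlier-specific treatment.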
Paper | Keywords | Institute (first) | Publication | Others |
---|---|---|---|---|
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | Unstructured sparsity | University of Sydney | VLDB 2024 | Github repo |
CLLMs: Consistency Large Language Models | Consistency | Shanghai Jiao Tong University | Arxiv | Github repo |
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | Disaggregating Prefill and Decoding | PKU | Arxiv | Sketch below |
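
DistServe's core move (last row above) is to run the compute-bound prefill phase and the memory-bound decode phase on separate workers, handing the KV cache across. The toy sketch below shows only that control flow with threads and queues; `run_prefill` and `run_decode_step` are placeholders, and a real deployment would use separate GPUs with an NCCL/RDMA KV-cache transfer rather than an in-process queue.

```python
import queue
import threading
import time

def run_prefill(prompt: str):             # placeholder for a full forward pass
    time.sleep(0.01)                      # compute-bound phase
    return {"kv": f"kv({prompt})"}

def run_decode_step(kv) -> str:           # placeholder for one decode iteration
    time.sleep(0.001)                     # memory-bound phase
    return "tok"

prefill_q: queue.Queue = queue.Queue()
decode_q: queue.Queue = queue.Queue()

def prefill_worker():
    while True:
        req = prefill_q.get()
        kv = run_prefill(req["prompt"])   # long kernel; batch for throughput
        decode_q.put((req, kv))           # KV handoff to the decode side

def decode_worker():
    while True:
        req, kv = decode_q.get()
        tokens = [run_decode_step(kv) for _ in range(req["max_new_tokens"])]
        req["done"].put(tokens)           # return generated tokens to the caller

threading.Thread(target=prefill_worker, daemon=True).start()
threading.Thread(target=decode_worker, daemon=True).start()

done: queue.Queue = queue.Queue()
prefill_q.put({"prompt": "hello", "max_new_tokens": 4, "done": done})
print(done.get())                         # ['tok', 'tok', 'tok', 'tok']
```

Separating the two phases lets each side batch and scale independently, so a long prefill no longer stalls the per-token latency of requests already in decode.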
Paper | Keywords | Institute (first) | Publication | Others |
---|---|---|---|---|
Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models | Overlap | Google | ASPLOS 2023 | |
Efficiently Scaling Transformer Inference | Scaling | Google | MLSys 2023 | |
Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning | Communication partitioning | PKU | ASPLOS 2024 | Sketch below |
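
All three rows above share one theme: hide collective communication behind independent computation. A minimal PyTorch expression of that pattern is an asynchronous all-reduce whose wait is deferred past unrelated work. This sketch assumes a `torch.distributed` process group is already initialized (e.g., via `torchrun`) and stands in for the far finer-grained decomposition and scheduling these papers actually perform.

```python
import torch
import torch.distributed as dist

def overlapped_step(grad: torch.Tensor, x: torch.Tensor, w: torch.Tensor):
    # Kick off the collective first; async_op=True returns a work handle
    # immediately instead of blocking until the reduction completes.
    work = dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True)
    y = x @ w     # independent computation proceeds while grads are in flight
    work.wait()   # synchronize only at the point the reduced gradient is needed
    return y, grad
```

The papers go further by decomposing both the collective and the dependent computation into finer pieces so that even *dependent* work can be pipelined, but the handle-then-wait structure is the same primitive underneath.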
Paper | Keywords | Institute (first) | Publication | Others |
---|---|---|---|---|
Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training | GPU energy consumption | University of Michigan | NSDI 2023 | Github repo; sketch below |
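
Zeus ships its own measurement and optimization library (see the repo above); the sketch below shows only the raw primitive such tools build on: sampling GPU power through NVML and integrating it over time. It assumes the `nvidia-ml-py` bindings (`pip install nvidia-ml-py`) and one NVIDIA GPU at index 0.

```python
import time
import pynvml  # NVIDIA Management Library bindings (nvidia-ml-py)

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

energy_j, prev = 0.0, time.time()
for _ in range(100):                        # sample for ~10 s at 10 Hz
    time.sleep(0.1)
    now = time.time()
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
    energy_j += power_w * (now - prev)      # rectangle-rule integral of power
    prev = now

print(f"~{energy_j:.1f} J consumed over the sampled window")
pynvml.nvmlShutdown()
```

Run your training or inference workload in another process during the sampling window; comparing the integral across configurations (batch size, power limit, GPU frequency) is the basic loop an energy optimizer like Zeus automates.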
Paper | Keywords | Institute (first) | Publication | Others |
---|---|---|---|---|
FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs | Consumer-grade GPU | HKBU | Arxiv | |
Petals: Collaborative Inference and Fine-tuning of Large Models | | Yandex | Arxiv | |

Paper | Keywords | Institute (first) | Publication | Others |
---|---|---|---|---|
ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models | Cold start | The University of Edinburgh | OSDI 2024 | Github repo (empty) |

Paper | Keywords | Institute (first) | Publication | Others |
---|---|---|---|---|
Characterization of Large Language Model Development in the Datacenter | Cluster trace (for LLM) | Shanghai AI Lab | NSDI 2024 | Github |