This survey, "Accelerated Generation Techniques for Large Language Models (LLMs)", covers the algorithmic advances developed to make LLM response generation faster and more efficient. As LLMs are deployed in applications such as chatbots, content creation, and language translation, the computational cost of real-time generation has become a significant bottleneck. The survey reviews algorithms and methodologies that optimize the generation process, organized into three families: speculative decoding, early exiting, and non-autoregressive decoding. By consolidating the latest research in these approaches, the paper aims to give researchers and practitioners practical guidance for building faster and more efficient LLM systems.
You can find our paper here.
## Speculative Decoding

- Blockwise Parallel Decoding for Deep Autoregressive Models, NeurIPS, 2018. Paper
- Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation, EMNLP-Findings, 2023. Paper, Code
- Accelerating Large Language Model Decoding with Speculative Sampling, ArXiv, 2023. Paper
- Fast Inference from Transformers via Speculative Decoding, ICML, 2023. Paper
- Online Speculative Decoding, ArXiv, 2023. Paper
- Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding, ArXiv, 2023. Paper, Code
- DistillSpec: Improving Speculative Decoding via Knowledge Distillation, ArXiv, 2023. Paper
- REST: Retrieval-Based Speculative Decoding, NAACL, 2024. Paper, Code
- Cascade Speculative Drafting for Even Faster LLM Inference, ArXiv, 2023. Paper, Code
- Accelerating LLM Inference with Staged Speculative Decoding, ICML, 2023. Paper
- PaSS: Parallel Speculative Sampling, NeurIPS, 2023. Paper
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads, ArXiv, 2024. Paper, Code
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty, ICML, 2024. Paper, Code
- SpecInfer: Accelerating Generative Large Language Model Serving with Tree-Based Speculative Inference and Verification, ACL, 2024. Paper
- SpecTr: Fast Speculative Decoding via Optimal Transport, NeurIPS, 2023. Paper
- Speculative Decoding with Big Little Decoder, NeurIPS, 2023. Paper
- Optimal Block-Level Draft Verification for Accelerating Speculative Decoding, ArXiv, 2024. Paper
- Multi-Candidate Speculative Decoding, ArXiv, 2024. Paper, Code
- The Synergy of Speculative Decoding and Batching in Serving Large Language Models, ArXiv, 2023. Paper
- BiTA: Bi-directional Tuning for Lossless Acceleration in Large Language Models, ArXiv, 2024. Paper
- Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-correct Decoding, ArXiv, 2024. Paper
- Inference with Reference: Lossless Acceleration of Large Language Models, ArXiv, 2023. Paper
- Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy, ArXiv, 2024. Paper, Code
- Speculative Contrastive Decoding, ArXiv, 2023. Paper
- Sarathi: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills, ArXiv, 2023. Paper
- SPEED: Speculative Pipelined Execution for Efficient Decoding, NeurIPS, 2023. Paper
- TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding, ArXiv, 2024. Paper
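For orientation, here is a minimal, dependency-free sketch of the draft-then-verify loop that the speculative decoding papers above build on; it is an illustration, not any specific paper's algorithm. `draft_next` and `target_next` are hypothetical stand-ins for a small draft model and the large target model, and real systems verify the whole drafted block in a single batched target forward pass and use rejection sampling over full distributions rather than this greedy shortcut.

```python
def draft_next(tokens):
    """Hypothetical cheap draft model: a deterministic toy rule."""
    return (tokens[-1] * 31 + 7) % 100

def target_next(tokens):
    """Hypothetical expensive target model: mostly agrees with the draft."""
    t = (tokens[-1] * 31 + 7) % 100
    return t if tokens[-1] % 5 else (t + 1) % 100

def speculative_decode(prompt, max_new_tokens=16, k=4):
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # 1) Draft k tokens autoregressively with the cheap model.
        ctx, block = list(tokens), []
        for _ in range(k):
            block.append(draft_next(ctx))
            ctx.append(block[-1])
        # 2) Verify the block against the target's greedy choices
        #    (a real system does this in one batched forward pass).
        ctx = list(tokens)
        for t in block:
            if target_next(ctx) != t:
                break
            ctx.append(t)
        # 3) Keep the accepted prefix plus one target token, so every
        #    iteration makes progress even if the whole block is rejected.
        tokens = ctx
        tokens.append(target_next(tokens))
    return tokens[:target_len]

print(speculative_decode([1, 2, 3]))
```

Because a drafted token is kept only when it equals the target model's own greedy choice, the output matches plain greedy decoding with the target exactly; the speedup comes from accepting several drafted tokens per expensive target pass.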
## Early Exiting

- Confident Adaptive Language Modeling, NeurIPS, 2022. Paper
- Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding, EMNLP, 2023. Paper, Code
- A Simple Hash-Based Early Exiting Approach for Language Understanding and Generation, ACL, 2022. Paper
- Accelerating Inference for Pretrained Language Models by Unified Multi-Perspective Early Exiting, ACL, 2022. Paper, Code
- Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding, ICML, 2023. Paper
- ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for Accelerating Language Models Inference, AAAI, 2024. Paper
- SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference, ArXiv, 2023. Paper
- EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism, ArXiv, 2023. Paper, Code
- LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding, ArXiv, 2024. Paper
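The early-exit papers above share one control-flow idea: attach a prediction head to intermediate layers and stop running layers once the intermediate prediction looks reliable. Below is a minimal NumPy sketch under assumed toy shapes; the single shared head and the plain softmax-probability threshold are simplifying assumptions, and the papers study stronger exit criteria (calibration, hash lookups, consistency objectives) as well as how to keep the KV cache coherent once some tokens exit early.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: L layers, hidden size d, vocabulary size V.
L, d, V = 8, 32, 100
layers = [rng.normal(scale=0.1, size=(d, d)) for _ in range(L)]
head = rng.normal(scale=0.1, size=(d, V))  # shared exit/output head

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def next_token_with_early_exit(h, threshold):
    """Run layers sequentially; return early once the head is confident."""
    for i, W in enumerate(layers):
        h = np.tanh(h @ W)            # stand-in for one transformer block
        probs = softmax(h @ head)     # intermediate next-token prediction
        if probs.max() >= threshold:  # confidence-based exit criterion
            return int(probs.argmax()), i + 1
    return int(probs.argmax()), L     # no exit fired: full-depth prediction

# The threshold is set unrealistically low so the random toy model can
# actually trigger an early exit.
token, used = next_token_with_early_exit(rng.normal(size=d), threshold=0.02)
print(f"token {token} predicted after {used}/{L} layers")
```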
## Non-Autoregressive Decoding

- Non-Autoregressive Neural Machine Translation, ICLR, 2018. Paper
- Non-Autoregressive Neural Machine Translation with Enhanced Decoder Input, AAAI, 2019. Paper
- Fast Decoding in Sequence Models Using Discrete Latent Variables, ICML, 2018. Paper
- FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow, EMNLP, 2019. Paper, Code
- Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement, EMNLP, 2018. Paper, Code
- Syntactically Supervised Transformers for Faster Neural Machine Translation, ACL, 2019. Paper
- Fine-Tuning by Curriculum Learning for Non-Autoregressive Neural Machine Translation, AAAI, 2020. Paper
- Non-Autoregressive Translation with Dependency-Aware Decoder, IWSLT, 2023. Paper
- Non-Autoregressive Text Generation with Pre-trained Language Models, EACL, 2021. Paper, Code
- Semi-Autoregressive Neural Machine Translation, EMNLP, 2018. Paper, Code
- Fast Structured Decoding for Sequence Models, NeurIPS, 2019. Paper
- Mask-Predict: Parallel Decoding of Conditional Masked Language Models, EMNLP, 2019. Paper
- Accelerating Transformer Inference for Translation via Parallel Decoding, ACL, 2023. Paper, Code
- CLLMs: Consistency Large Language Models, ICML, 2024. Paper, Code
- Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding, ICLR, 2024. Paper
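Finally, a sketch of the iterative parallel refinement pattern behind several of the non-autoregressive papers above, loosely following the mask-predict schedule: predict every position in parallel, then re-mask the least confident tokens and re-predict, with the number of re-masked tokens decaying linearly to zero. `parallel_predict` is a hypothetical stand-in for a conditional masked language model; it returns random tokens and confidences purely so the sketch runs.

```python
import numpy as np

rng = np.random.default_rng(0)
MASK, V, n = -1, 50, 10  # mask id, toy vocabulary size, fixed output length

def parallel_predict(tokens):
    """Hypothetical conditional masked LM: one parallel pass yields a
    (token, confidence) pair for every position at once."""
    return rng.integers(0, V, size=len(tokens)), rng.random(len(tokens))

def mask_predict(n_iters=4):
    tokens = np.full(n, MASK)
    confs = np.zeros(n)
    for t in range(n_iters):
        preds, new_confs = parallel_predict(tokens)
        masked = tokens == MASK
        tokens[masked] = preds[masked]      # fill only masked positions
        confs[masked] = new_confs[masked]
        # Re-mask the k least confident tokens; k decays linearly, so the
        # last iteration leaves a fully unmasked sequence.
        k = int(n * (n_iters - t - 1) / n_iters)
        if k > 0:
            tokens[np.argsort(confs)[:k]] = MASK
    return tokens

print(mask_predict())
```

A single pass (`n_iters=1`) corresponds to the fully non-autoregressive setting of the earliest papers above; more iterations trade some latency back for quality, which is the knob the iterative-refinement and mask-predict lines of work study.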