This survey, "Accelerated Generation Techniques for Large Language Models (LLMs)", covers the algorithmic advances developed to make LLM response generation faster and more efficient. As LLMs are deployed in applications such as chatbots, content creation, and language translation, the computational cost of real-time generation has become a significant bottleneck. The survey reviews algorithms and methodologies that optimize the generation process, organized into three families: speculative decoding, early exiting, and non-autoregressive decoding. By consolidating the latest research in these approaches, the paper aims to give researchers and practitioners practical guidance for building faster and more efficient LLM systems.
You can find our paper here.
## Speculative Decoding

- Blockwise Parallel Decoding for Deep Autoregressive Models, NeurIPS, 2018. Paper
- Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation, EMNLP-Findings, 2023. Paper, Code
- Accelerating Large Language Model Decoding with Speculative Sampling, ArXiv, 2023. Paper
- Fast Inference from Transformers via Speculative Decoding, ICML, 2023. Paper
- Online Speculative Decoding, ArXiv, 2023. Paper
- Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding, ArXiv, 2023. Paper, Code
- DistillSpec: Improving Speculative Decoding via Knowledge Distillation, ArXiv, 2023. Paper
- REST: Retrieval-Based Speculative Decoding, NAACL, 2024. Paper, Code
- Cascade Speculative Drafting for Even Faster LLM Inference, ArXiv, 2023. Paper, Code
- Accelerating LLM Inference with Staged Speculative Decoding, ICML, 2023. Paper
- PaSS: Parallel Speculative Sampling, NeurIPS, 2023. Paper
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads, ArXiv, 2024. Paper, Code
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty, ICML, 2024. Paper, Code
- SpecInfer: Accelerating Generative Large Language Model Serving with Tree-Based Speculative Inference and Verification, ACL, 2024. Paper
- SpecTr: Fast Speculative Decoding via Optimal Transport, NeurIPS, 2023. Paper
- Speculative Decoding with Big Little Decoder, NeurIPS, 2023. Paper
- Optimal Block-Level Draft Verification for Accelerating Speculative Decoding, ArXiv, 2024. Paper
- Multi-Candidate Speculative Decoding, ArXiv, 2024. Paper, Code
- The Synergy of Speculative Decoding and Batching in Serving Large Language Models, ArXiv, 2023. Paper
- BiTA: Bi-directional Tuning for Lossless Acceleration in Large Language Models, ArXiv, 2024. Paper
- Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-correct Decoding, ArXiv, 2024. Paper
- Inference with Reference: Lossless Acceleration of Large Language Models, ArXiv, 2023. Paper
- Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy, ArXiv, 2024. Paper, Code
- Speculative Contrastive Decoding, ArXiv, 2023. Paper
- Sarathi: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills, ArXiv, 2023. Paper
- SPEED: Speculative Pipelined Execution for Efficient Decoding, NeurIPS, 2023. Paper
- TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding, ArXiv, 2024. Paper
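For orientation, here is a minimal, dependency-free sketch of the draft-then-verify loop that the speculative decoding papers above build on; it is an illustration, not any specific paper's algorithm. `draft_next` and `target_next` are hypothetical stand-ins for a small draft model and the large target model, and real systems verify the whole drafted block in a single batched target forward pass and use rejection sampling over full distributions rather than this greedy shortcut.

```python
def draft_next(tokens):
    """Hypothetical cheap draft model: a deterministic toy rule."""
    return (tokens[-1] * 31 + 7) % 100

def target_next(tokens):
    """Hypothetical expensive target model: mostly agrees with the draft."""
    t = (tokens[-1] * 31 + 7) % 100
    return t if tokens[-1] % 5 else (t + 1) % 100

def speculative_decode(prompt, max_new_tokens=16, k=4):
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # 1) Draft k tokens autoregressively with the cheap model.
        ctx, block = list(tokens), []
        for _ in range(k):
            block.append(draft_next(ctx))
            ctx.append(block[-1])
        # 2) Verify the block against the target's greedy choices
        #    (a real system does this in one batched forward pass).
        ctx = list(tokens)
        for t in block:
            if target_next(ctx) != t:
                break
            ctx.append(t)
        # 3) Keep the accepted prefix plus one target token, so every
        #    iteration makes progress even if the whole block is rejected.
        tokens = ctx
        tokens.append(target_next(tokens))
    return tokens[:target_len]

print(speculative_decode([1, 2, 3]))
```

Because a drafted token is kept only when it equals the target model's own greedy choice, the output matches plain greedy decoding with the target exactly; the speedup comes from accepting several drafted tokens per expensive target pass.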
## Early Exiting

- Confident Adaptive Language Modeling, NeurIPS, 2022. Paper
- Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding, EMNLP, 2023. Paper, Code
- A Simple Hash-Based Early Exiting Approach for Language Understanding and Generation, ACL, 2022. Paper
- Accelerating Inference for Pretrained Language Models by Unified Multi-Perspective Early Exiting, ACL, 2022. Paper, Code
- Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding, ICML, 2023. Paper
- ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for Accelerating Language Models Inference, AAAI, 2024. Paper
- SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference, ArXiv, 2023. Paper
- EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism, ArXiv, 2023. Paper, Code
- LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding, ArXiv, 2024. Paper
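The early-exit papers above share one control-flow idea: attach a prediction head to intermediate layers and stop running layers once the intermediate prediction looks reliable. Below is a minimal NumPy sketch under assumed toy shapes; the single shared head and the plain softmax-probability threshold are simplifying assumptions, and the papers study stronger exit criteria (calibration, hash lookups, consistency objectives) as well as how to keep the KV cache coherent once some tokens exit early.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: L layers, hidden size d, vocabulary size V.
L, d, V = 8, 32, 100
layers = [rng.normal(scale=0.1, size=(d, d)) for _ in range(L)]
head = rng.normal(scale=0.1, size=(d, V))  # shared exit/output head

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def next_token_with_early_exit(h, threshold):
    """Run layers sequentially; return early once the head is confident."""
    for i, W in enumerate(layers):
        h = np.tanh(h @ W)            # stand-in for one transformer block
        probs = softmax(h @ head)     # intermediate next-token prediction
        if probs.max() >= threshold:  # confidence-based exit criterion
            return int(probs.argmax()), i + 1
    return int(probs.argmax()), L     # no exit fired: full-depth prediction

# The threshold is set unrealistically low so the random toy model can
# actually trigger an early exit.
token, used = next_token_with_early_exit(rng.normal(size=d), threshold=0.02)
print(f"token {token} predicted after {used}/{L} layers")
```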
## Non-Autoregressive Decoding

- Non-Autoregressive Neural Machine Translation, ICLR, 2018. Paper
- Non-Autoregressive Neural Machine Translation with Enhanced Decoder Input, AAAI, 2019. Paper
- Fast Decoding in Sequence Models Using Discrete Latent Variables, ICML, 2018. Paper
- FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow, EMNLP, 2019. Paper, Code
- Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement, EMNLP, 2018. Paper, Code
- Syntactically Supervised Transformers for Faster Neural Machine Translation, ACL, 2019. Paper
- Fine-Tuning by Curriculum Learning for Non-Autoregressive Neural Machine Translation, AAAI, 2020. Paper
- Non-Autoregressive Translation with Dependency-Aware Decoder, IWSLT, 2023. Paper
- Non-Autoregressive Text Generation with Pre-trained Language Models, EACL, 2021. Paper, Code
- Semi-Autoregressive Neural Machine Translation, EMNLP, 2018. Paper, Code
- Fast Structured Decoding for Sequence Models, NeurIPS, 2019. Paper
- Mask-Predict: Parallel Decoding of Conditional Masked Language Models, EMNLP, 2019. Paper
- Accelerating Transformer Inference for Translation via Parallel Decoding, ACL, 2023. Paper, Code
- CLLMs: Consistency Large Language Models, ICML, 2024. Paper, Code
- Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding, ICLR, 2024. Paper
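Finally, a sketch of the iterative parallel refinement pattern behind several of the non-autoregressive papers above, loosely following the mask-predict schedule: predict every position in parallel, then re-mask the least confident tokens and re-predict, with the number of re-masked tokens decaying linearly to zero. `parallel_predict` is a hypothetical stand-in for a conditional masked language model; it returns random tokens and confidences purely so the sketch runs.

```python
import numpy as np

rng = np.random.default_rng(0)
MASK, V, n = -1, 50, 10  # mask id, toy vocabulary size, fixed output length

def parallel_predict(tokens):
    """Hypothetical conditional masked LM: one parallel pass yields a
    (token, confidence) pair for every position at once."""
    return rng.integers(0, V, size=len(tokens)), rng.random(len(tokens))

def mask_predict(n_iters=4):
    tokens = np.full(n, MASK)
    confs = np.zeros(n)
    for t in range(n_iters):
        preds, new_confs = parallel_predict(tokens)
        masked = tokens == MASK
        tokens[masked] = preds[masked]      # fill only masked positions
        confs[masked] = new_confs[masked]
        # Re-mask the k least confident tokens; k decays linearly, so the
        # last iteration leaves a fully unmasked sequence.
        k = int(n * (n_iters - t - 1) / n_iters)
        if k > 0:
            tokens[np.argsort(confs)[:k]] = MASK
    return tokens

print(mask_predict())
```

A single pass (`n_iters=1`) corresponds to the fully non-autoregressive setting of the earliest papers above; more iterations trade some latency back for quality, which is the knob the iterative-refinement and mask-predict lines of work study.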