This repo reproduces the results of the paper *Speculative Decoding with Big Little Decoder* (BiLD) by Kim et al. (2023). The paper was published at NeurIPS 2023, and the code can be found in the official repository.
Big Little Decoder is a simple framework that enables faster generative inference. It accelerates text generation by roughly 2× across a variety of text generation scenarios without compromising quality. Furthermore, it is a plug-and-play solution that requires no training or architecture redesign.
Here's the key underlying idea:
- BiLD offloads the majority of simple word decisions to a smaller model, and only switches the control back to the larger model when needed.
- The small model "fallbacks" to the large model, when it runs into a hard-to-predict word.
- In case the small model makes a misstep, the larger model can "rollback" the predictions to correct the error
- This collaborative text generation combines the small model's fast autoregressive execution with the large model's accurate and efficient non-autoregressive execution! A minimal sketch of the decoding loop is shown below.
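The loop below is a minimal sketch of how fallback and rollback interact during decoding, not the authors' implementation. `small_model.next_token_probs` and `large_model.all_token_probs` are hypothetical helpers that return per-position probability distributions (e.g. numpy arrays), and the two thresholds correspond to the `fallback_threshold` and `rollback_threshold` arguments of the run scripts described below.

```python
import math

def bild_generate(small_model, large_model, prompt_tokens,
                  fallback_threshold=0.5, rollback_threshold=5.0,
                  max_new_tokens=128, eos_id=1):
    """Sketch of BiLD decoding: the small model drafts, the large model verifies."""
    tokens = list(prompt_tokens)
    while len(tokens) - len(prompt_tokens) < max_new_tokens:
        # Small model predicts the next token autoregressively.
        probs = small_model.next_token_probs(tokens)            # hypothetical helper
        token = int(probs.argmax())
        if probs[token] >= fallback_threshold:
            # Confident prediction: the small model keeps control.
            tokens.append(token)
        else:
            # Fallback: the large model scores the whole draft in a single
            # non-autoregressive forward pass. Row `pos` is the distribution over
            # the token at position `pos` given the preceding tokens, with one
            # extra row for the next (not yet generated) token.
            large_probs = large_model.all_token_probs(tokens)   # hypothetical helper
            # Rollback: rewind to the first drafted token whose negative
            # log-likelihood under the large model exceeds the threshold.
            for pos in range(len(prompt_tokens), len(tokens)):
                if -math.log(large_probs[pos][tokens[pos]]) > rollback_threshold:
                    tokens = tokens[:pos]
                    break
            # The large model emits the token at the fallback/rollback position,
            # then control returns to the small model.
            tokens.append(int(large_probs[len(tokens)].argmax()))
        if tokens[-1] == eos_id:
            break
    return tokens
```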
Install the required packages using the following command:

```bash
pip install -r requirements.txt
```
We provide scripts to evaluate the pretrained models on the datasets used in the paper. The scripts are located in the root directory and follow the naming convention `run_<experiment>.sh`.
```bash
# Syntax for running experiments
bash ./run_<experiment>.sh [model] [fallback_threshold] [rollback_threshold] [max_eval_samples]
```
Arguments:

- `model`: Specifies the model configuration to use. Possible values are:
  - `vanilla`: Uses only the large model.
  - `unaligned`: Uses BiLD with a large model and an unaligned small model.
  - `aligned`: Uses BiLD with a large model and an aligned small model.
- `fallback_threshold`: (Optional) Specifies the fallback threshold for the small model. Values range from 0.1 to 0.9.
- `rollback_threshold`: (Optional) Specifies the rollback threshold for the large model. Values range from 1 to 10.
- `max_eval_samples`: (Optional) Specifies the maximum number of samples to evaluate.
To evaluate the pretrained models on the translation datasets, run the following commands:

```bash
# IWSLT2017 with vanilla model
CUDA_VISIBLE_DEVICES=0 bash run_iwslt2017.sh vanilla

# WMT2014 with BiLD unaligned model
CUDA_VISIBLE_DEVICES=0 bash run_wmt2014.sh unaligned
```
To evaluate the pretrained models on the summarization datasets, run the following commands:

```bash
# XSUM with BiLD aligned model
CUDA_VISIBLE_DEVICES=0 bash run_xsum.sh aligned

# CNNDM with BiLD aligned model with (FB, RB) = (0.5, 5) over 10 samples
CUDA_VISIBLE_DEVICES=0 bash run_cnndm.sh aligned 0.5 5 10
```
Results reported in the original paper:

| | IWSLT BLEU | Speedup | (RB, FB) | WMT BLEU | Speedup | (RB, FB) | XSUM RL | Speedup | (RB, FB) | CNNDM RL | Speedup | (RB, FB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Unaligned | 40.33 | 1.43× | (2, 0.6) | 31.28 | 1.34× | (2, 0.6) | 35.12 | 1.48× | (3, 0.5) | 41.44 | 1.71× | (3, 0.4) |
| | 39.44 | 1.58× | (3, 0.5) | 30.47 | 1.43× | (3, 0.5) | 34.02 | 1.72× | (5, 0.3) | 40.57 | 2.05× | (6, 0.2) |
| Aligned | 40.24 | 1.62× | (3, 0.9) | 31.26 | 1.47× | (2, 0.8) | 35.05 | 1.50× | (2, 0.6) | 41.52 | 1.85× | (3, 0.3) |
| | 39.13 | 1.78× | (4, 0.6) | 30.33 | 1.70× | (5, 0.6) | 33.95 | 1.80× | (5, 0.4) | 40.96 | 2.12× | (6, 0.2) |
| IWSLT2017 | Thresholds (RB, FB) | Original (BLEU / Speedup) | Reproduction (BLEU / Speedup) |
|---|---|---|---|
| BiLD (unaligned) | (2, 0.6) | 40.33 / 1.43× | 40.33 / 1.46× |
| | (3, 0.5) | 39.44 / 1.58× | 39.44 / 1.61× |
| BiLD (aligned) | (3, 0.9) | 40.24 / 1.62× | 40.09 / 1.52× |
| | (4, 0.6) | 39.13 / 1.62× | 39.15 / 1.73× |
- For the BLEU scores, we were able to reproduce the results within 1% of the original scores.
- For the speedup, we were able to reproduce the results within 5% of the reported values.
| WMT14 | Thresholds (RB, FB) | Original (BLEU / Speedup) | Reproduction (BLEU / Speedup) |
|---|---|---|---|
| BiLD (unaligned) | (2, 0.6) | 31.28 / 1.34× | 31.65 / 1.27× |
| | (3, 0.5) | 30.47 / 1.43× | 30.83 / 1.34× |
| BiLD (aligned) | (2, 0.8) | 31.26 / 1.47× | 31.65 / 1.37× |
| | (5, 0.6) | 30.33 / 1.70× | 30.54 / 1.65× |
- For the BLEU scores, we were able to reproduce the results within 1% of the original scores.
- For the speedup, we were able to reproduce the results within 10% of the reported values.
| XSUM | Thresholds (RB, FB) | Original (ROUGE-L / Speedup) | Reproduction (ROUGE-L / Speedup) |
|---|---|---|---|
| BiLD (unaligned) | (3, 0.5) | 35.12 / 1.48× | 35.12 / 1.40× |
| | (5, 0.3) | 34.02 / 1.72× | 34.02 / 1.54× |
| BiLD (aligned) | (2, 0.6) | 35.05 / 1.50× | 34.96 / 1.41× |
| | (5, 0.4) | 33.95 / 1.80× | 33.96 / 1.73× |
- For the ROUGE-L scores, we were able to reproduce the results within 1% of the original scores.
- For the speedup, we were able to reproduce the results within 10% of the reported values.
| CNNDM | Thresholds (RB, FB) | Original (ROUGE-L / Speedup) | Reproduction (ROUGE-L / Speedup) |
|---|---|---|---|
| BiLD (unaligned) | (3, 0.4) | 41.44 / 1.71× | 41.44 / 1.54× |
| | (6, 0.2) | 40.57 / 2.05× | 40.56 / 1.87× |
| BiLD (aligned) | (3, 0.3) | 41.52 / 1.85× | 41.33 / 1.48× |
| | (6, 0.2) | 40.96 / 2.12× | 40.79 / 1.67× |
- For the ROUGE-L scores, we were able to reproduce the results within 1% of the original scores.
- For the speedup, we were able to reproduce the results within 20% of the reported values.
The authors provided the finetuned checkpoints used in the paper.
Dataset | Model | Link |
---|---|---|
IWSLT-2017-De-En | mT5-small | link |
IWSLT-2017-De-En | mT5-small (aligned) | link |
IWSLT-2017-De-En | mT5-large | link |
WMT-2014-De-En | mT5-small | link |
WMT-2014-De-En | mT5-small (aligned) | link |
WMT-2014-De-En | mT5-large | link |
XSUM | T5-small | link |
XSUM | T5-small (aligned) | link |
XSUM | T5-large | link |
CNNDM | T5-small | link |
CNNDM | T5-small (aligned) | link |
CNNDM | T5-large | link |
We trained our own aligned models using the outputs of the authors' large finetuned models on each of the four benchmarks. We provide the links to these aligned models below.
Benchmark for Alignment | Link |
---|---|
IWSLT-2017-De-En | link |
WMT-2014-De-En | link |
XSUM | link |
CNNDM | link |
The general idea of alignment is to align the predictions produced by the small and large models. As part of this process, we require a calibration dataset for each benchmark which represents the output sentence distribution of the large model. Then, we fine-tune the small model on this dataset so that it will better follow the output distribution of the large model. To create the calibration dataset for a particular benchmark, we take the inputs of the benchmark dataset and generate the corresponding output sequence using the large model, which creates the (input, output) dataset samples.
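As an illustration, the snippet below sketches how such a calibration dataset can be generated with Hugging Face `transformers`. The checkpoint path, generation settings, and output file name are placeholders rather than the exact settings used to build the datasets linked below.

```python
import json
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder path to the large finetuned checkpoint for the target benchmark.
checkpoint = "path/to/large-finetuned-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint).eval().to("cuda")

# Example benchmark: XSUM. Other benchmarks swap the dataset and field names.
dataset = load_dataset("xsum", split="train")

with open("calibration_xsum.jsonl", "w") as f, torch.no_grad():
    for example in dataset:
        inputs = tokenizer(example["document"], return_tensors="pt",
                           truncation=True, max_length=512).to(model.device)
        output_ids = model.generate(**inputs, max_length=128, num_beams=4)
        target = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        # (input, output) pair where the target comes from the large model.
        f.write(json.dumps({"document": example["document"],
                            "summary": target}) + "\n")
```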
The authors did not open source the calibration datasets. As such, we had to create these ourselves, which we link below.
Calibration Dataset | Link |
---|---|
IWSLT-2017-De-En | link |
WMT-2014-De-En | link |
XSUM | link |
CNNDM | link |
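The alignment step itself is ordinary sequence-to-sequence fine-tuning of the small model on the calibration dataset. The sketch below uses the Hugging Face `Seq2SeqTrainer`; the checkpoint path, file name, and hyperparameters are placeholders, not the exact settings used for the aligned models linked above.

```python
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

# Placeholder path to the small (unaligned) checkpoint.
checkpoint = "path/to/small-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
small = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Calibration dataset generated by the large model (see the sketch above).
calib = load_dataset("json", data_files="calibration_xsum.jsonl", split="train")

def preprocess(batch):
    model_inputs = tokenizer(batch["document"], truncation=True, max_length=512)
    labels = tokenizer(text_target=batch["summary"], truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

calib = calib.map(preprocess, batched=True, remove_columns=calib.column_names)

trainer = Seq2SeqTrainer(
    model=small,
    args=Seq2SeqTrainingArguments(output_dir="small-aligned",
                                  learning_rate=1e-4,
                                  per_device_train_batch_size=16,
                                  num_train_epochs=1),
    train_dataset=calib,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=small),
)
trainer.train()
```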
In addition to using the model checkpoints provided by the original paper, we also trained our own small models from scratch, targeting a more complete reproduction of the performance pattern between aligned and unaligned models. The pattern in the original paper was that aligned models offer a larger speedup while showing only a marginal performance difference from the unaligned models. Despite training our small models for only 100,000 steps, compared to the 500,000 steps in the original paper, we were largely able to reproduce this pattern. A table with the complete benchmarks of our small models can be found below. Our self-trained models were trained and benchmarked on Nvidia RTX 2080 GPUs provided by the UW NLPG machines.
| | Speedup | Fallback Percentage | Rollback Percentage | ROUGE-L / BLEU |
|---|---|---|---|---|
| XSum Vanilla | NA | NA | NA | 35.10 (ROUGE-L) |
| XSum Aligned | 1.62× | 10.90% | 4.125% | 34.28 (ROUGE-L) |
| XSum Unaligned | 1.35× | 30.43% | 3.008% | 34.26 (ROUGE-L) |
| WMT14 Vanilla | NA | NA | NA | 31.88 (BLEU) |
| WMT14 Aligned | 1.60× | 3.230% | 2.327% | 28.71 (BLEU) |
| WMT14 Unaligned | 1.41× | 16.36% | 2.359% | 27.81 (BLEU) |
As part of the reproduction, we additionally conducted several experiments not done in the paper to probe the robustness of the Big Little Decoder architecture.
As an additional experiment, we tested Big Little Decoder on a Code API Generation task available on GitHub CodeTrans. The dataset is published on Huggingface and can be found here. We used the pretrained small and large T5 models provided by the authors.
```bash
# API Generation with BiLD unaligned model
CUDA_VISIBLE_DEVICES=0 bash run_api.sh unaligned
```