A minimal C implementation of speculative decoding for llama2 models.
Speculative decoding is a technique that speeds up autoregressive inference by using a lightweight draft model to propose tokens that the base model then verifies. This project demonstrates the approach in simple, pure C code.
In essence, I modified llama2.c's run.c to support forwarding multiple tokens at once and implemented speculative_decoding.c on top of it.
Special thanks to:
- @karpathy for llama2.c, the starting point and inspiration for this project; llama2.c/run.c was copied into this project along with its license notices.
- @ggerganov for llama.cpp, where I first had the opportunity to study and work on speculative decoding.
- download base/draft models
mkdir -p models
wget -P models https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
wget -P models https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin
- build and run
make && ./speculative_decoding -m ./models/stories42M.bin -d ./models/stories15M.bin -n 256 -i "Once upon a time"
- orange text: accepted draft model tokens
- black text: base model tokens
To use Llama 2 models, follow the instructions in the llama2.c repository.
@inproceedings{leviathan2023fast,
title={Fast inference from transformers via speculative decoding},
author={Leviathan, Yaniv and Kalman, Matan and Matias, Yossi},
booktitle={International Conference on Machine Learning},
pages={19274--19286},
year={2023},
organization={PMLR}
}
- Generation length is bounded by the draft model's maximum sequence length, so long generations are not possible with the current setup when the draft model has a short maximum sequence length.
MIT
I added the original copyright notice to the copied run.c file. Please let me know if I made any mistakes with the licensing.
Any sort of feedback is very welcome :)
More speculative-decoding related C implementations are to come!
I'm thinking of https://github.com/SafeAILab/EAGLE next.