With DéjàVu, we aim to achieve fault-tolerant and resource-efficient serving of LLMs. We observe that distributed LLM serving is costly and often underutilizes hardware accelerators due to three key challenges:
- Bubbles in pipeline-parallel deployments caused by the bimodal latency of prompt and token processing
- GPU memory overprovisioning
- Long recovery times in case of failures
DéjàVu addresses these challenges with DéjàVuLib, a versatile and efficient KV cache streaming library. On top of DéjàVuLib, we propose and implement:
- Efficient prompt-token disaggregation to reduce pipeline bubbles (a conceptual hand-off sketch follows this list)
- Microbatch swapping for efficient GPU memory management
- State replication for fault-tolerance
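The hand-off at the heart of prompt-token disaggregation can be pictured as a prompt worker streaming the KV cache it produced during prefill to a separate token-generation worker, which then decodes without recomputing the prompt. The sketch below illustrates the idea with plain NCCL point-to-point calls; the names (`send_kv_cache`, `kv_dev`, peer ranks) are hypothetical and this is not the DéjàVu/DéjàVuLib API, only a minimal illustration under those assumptions.

```cpp
// Illustrative prompt -> token KV-cache hand-off over NCCL.
// Names and structure are hypothetical, not the DéjàVu API.
#include <cuda_runtime.h>
#include <nccl.h>

// Prompt worker: after the prefill pass, push a layer's KV cache to the
// token worker so decoding can start without re-running the prompt.
void send_kv_cache(const void* kv_dev, size_t kv_bytes, int token_rank,
                   ncclComm_t comm, cudaStream_t stream) {
    // NCCL transfers raw bytes; count them as int8 elements.
    ncclSend(kv_dev, kv_bytes, ncclInt8, token_rank, comm, stream);
}

// Token worker: receive the prefilled KV cache into its own cache buffer,
// then continue with per-token decoding steps.
void recv_kv_cache(void* kv_dev, size_t kv_bytes, int prompt_rank,
                   ncclComm_t comm, cudaStream_t stream) {
    ncclRecv(kv_dev, kv_bytes, ncclInt8, prompt_rank, comm, stream);
}
```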
DéjàVu is implemented on top of NVIDIA FasterTransformer. Like the original FasterTransformer implementation, it supports both tensor and pipeline parallelism.
DéjàVuLib is a library built to handle KV cache streaming to and from the GPU. We support the following (currently tested with the GPT, OPT, and BLOOM models):
- Streaming of the KV cache to/from CPU memory and flushing it to local disk (see the offload sketch after this list)
- Streaming of KV cache to/from another GPU (in a different machine) via NCCL
- Streaming of KV cache to local CPU, and then flushing to another machine's CPU over the network, via MPI or Boost
- Disaggregation of Prompt and Token processing
- Fault Tolerance support with cache replication
- Swapping to CPU for pipeline parallelism
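To give a sense of what streaming a KV cache block to host memory involves, here is a minimal, self-contained CUDA sketch: the block is copied to pinned CPU memory on a dedicated stream so the copy can overlap with compute on other microbatches. This is only an illustration of the technique; the buffer names and sizes are assumptions, not DéjàVuLib's actual interface.

```cpp
// Minimal sketch of overlapping KV-cache offload with compute using a
// dedicated CUDA copy stream. Illustrative only; not the DéjàVuLib API.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t kv_bytes = 64 << 20;          // example: one 64 MiB KV block
    void *kv_dev = nullptr, *kv_host = nullptr;

    cudaMalloc(&kv_dev, kv_bytes);             // KV cache block on the GPU
    cudaMallocHost(&kv_host, kv_bytes);        // pinned host buffer enables async copies

    cudaStream_t copy_stream;
    cudaStreamCreate(&copy_stream);            // side stream dedicated to streaming the cache

    // Enqueue the device->host copy on the side stream; kernels launched on the
    // compute stream keep running while the copy is in flight.
    cudaMemcpyAsync(kv_host, kv_dev, kv_bytes, cudaMemcpyDeviceToHost, copy_stream);

    // ... launch the next microbatch's kernels on the compute stream here ...

    cudaStreamSynchronize(copy_stream);        // wait before reusing or flushing the host buffer
    printf("KV block offloaded to host\n");

    cudaStreamDestroy(copy_stream);
    cudaFreeHost(kv_host);
    cudaFree(kv_dev);
    return 0;
}
```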
- Installation: Check docs/install.md
- DéjàVuLib documentation and microbenchmarks: Check docs/dejavulib.md
- DéjàVu serving system documentation and benchmarks: Check docs/dejavu.md
- DéjàVu Planner documentation: Check docs/dejavu_planner.md
- DéjàVu simulator: Check docs/dejavu_simulator.md
- Original FasterTransformer documentation: Check docs/original_ft
If you use DéjàVu or DéjàVuLib in your research, please cite our paper:
@misc{strati2024dejavu,
      title={D\'ej\`aVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving},
      author={Foteini Strati and Sara Mcallister and Amar Phanishayee and Jakub Tarnawski and Ana Klimovic},
      year={2024},
      eprint={2403.01876},
      archivePrefix={arXiv},
      primaryClass={cs.DC}
}