Repo to benchmark inference optimizations for the WizardCoder / StarCoder models.
Note: All reported results were run on an A100 40GB instance with WizardLM/WizardCoder-15B-V1.0.
Install required packages

```bash
git clone https://github.com/infinitylogesh/Wizcoder_benchmark.git
cd Wizcoder_benchmark/scripts && make install-vllm && make install-tgi
make install-flash-attn
```
Download the model weights

```bash
text-generation-server download-weights WizardLM/WizardCoder-15B-V1.0
```
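The command above uses the TGI CLI. If you also want to pre-fetch the weights for the Python-based engines (which otherwise download from the Hugging Face Hub on first use), a minimal optional sketch using `huggingface_hub` (not part of this repo's scripts) is:

```python
# Optional: pre-fetch the model weights for the non-TGI engines.
# Hedged sketch using huggingface_hub; the repo itself only documents
# the text-generation-server download step above.
from huggingface_hub import snapshot_download

local_path = snapshot_download(repo_id="WizardLM/WizardCoder-15B-V1.0")
print(f"Weights cached at: {local_path}")
```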
To run the benchmark for a specific inference engine:

```bash
python3 main.py --batch_size <BATCH-SIZE> --num_tokens <NUM-TOKENS-TO-GENERATE> --inference_engine <INFERENCE-ENGINE>
```
Values of `<INFERENCE-ENGINE>` can be:

- `hf`: Vanilla HF inference (sketched below)
- `tgi`: Flash Attention using HF's Text-generation-inference
- `vllm`: Paged Attention using vLLM
- `hf_pipeling`: Inference using the Hugging Face pipeline
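For orientation, the `hf` engine is the vanilla `AutoModelForCausalLM` baseline described in the findings below. The snippet below is a minimal standalone sketch of that path; the prompt, batch size, dtype, and generation settings are illustrative assumptions, not the repo's `main.py`:

```python
# Minimal sketch of the vanilla HF baseline (the `hf` engine).
# Illustrative only: prompt, batch size, and dtype are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "WizardLM/WizardCoder-15B-V1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompts = ["def fibonacci(n):"] * 8  # small batch of identical prompts
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```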
To run the complete benchmark:

```bash
sh scripts/run_benchmark.sh
```
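The benchmark compares generation performance across engines. As a rough guide, the sketch below shows the throughput metric (output tokens per second) such comparisons typically report; the function and its arguments are hypothetical and are an assumption about the metric, not the exact computation in `main.py`:

```python
# Generic sketch of a throughput measurement (tokens/second).
# Hypothetical helper; an assumption about the metric, not the
# exact computation performed in main.py.
import time


def measure_throughput(generate_fn, batch_size: int, num_tokens: int) -> float:
    """generate_fn() should run one batched generation of num_tokens tokens."""
    start = time.perf_counter()
    generate_fn()
    elapsed = time.perf_counter() - start
    return (batch_size * num_tokens) / elapsed
```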
- Flash Attention (implemented from Text-generation-inference) performs the best in various settings. However, with long sequences (especially long input sequences), it seems to result in OOM.
- Paged Attention (via vLLM) performs second best in our benchmark runs and is better at handling long sequences: even in settings where Flash Attention fails, vLLM completes the generation without OOM (see the sketch after this list).
- HF Generate (baseline): Hugging Face's vanilla `AutoModelForCausalLM` generation is taken as the baseline.
- HF Pipeline: Hugging Face's pipeline for text generation performed the worst of all (results to be added).
- With a batch size of 64, the HF baseline threw OOM; Flash Attention performed better than Paged Attention.
- With a batch size of 128, both HF and Flash Attention threw OOM, while Paged Attention completed the generations.
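For reference, this is roughly how batched generation with vLLM's paged-attention engine looks as a standalone script (not the repo's benchmark harness; the prompt, batch size, and sampling settings are assumptions):

```python
# Minimal sketch of batched generation with vLLM (Paged Attention).
# Standalone illustration, not the benchmark code in main.py.
from vllm import LLM, SamplingParams

llm = LLM(model="WizardLM/WizardCoder-15B-V1.0", dtype="float16")
sampling_params = SamplingParams(temperature=0.0, max_tokens=100)

# A batch size of 64 is one of the settings where the HF baseline hit OOM.
prompts = ["def fibonacci(n):"] * 64
outputs = llm.generate(prompts, sampling_params)
print(outputs[0].outputs[0].text)
```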
CSV results of the benchmark are available at `results/results.csv`.
For further improvements in throughput:

- Performance comparison with a quantized model (GPTQ)
- Flash Attention + Paged Attention (using the latest Text-generation-inference)
- Flash Attention v2
- Continuous batching
- Other optimizations listed here.
- The Flash Attention implementation was used from Text-generation-inference; the TGI wrapper was adapted from BigCode's bigcode-inference-benchmark.
- Paged Attention via vLLM.