WizardCoder/StarCoder Benchmark

Repo to benchmark inference optimizations for WizardCoder / StarCoder models.

Note: All reported results were run on an A100 40GB instance with WizardLM/WizardCoder-15B-V1.0.

Installation steps:

Install required packages

git clone https://github.com/infinitylogesh/Wizcoder_benchmark.git
cd Wizcoder_benchmark/scripts
make install-vllm && make install-tgi
make install-flash-attn

Download the model weights

text-generation-server download-weights WizardLM/WizardCoder-15B-V1.0

Usage:

To run the benchmark for a specific inference engine:

python3 main.py --batch_size <BATCH-SIZE> --num_tokens <NUM-TOKENS-TO-GENERATE> --inference_engine <INFERENCE-ENGINE>
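
For example, the following run (with illustrative values, not the benchmark's exact settings) benchmarks vLLM generating 100 tokens per sequence at batch size 32:

python3 main.py --batch_size 32 --num_tokens 100 --inference_engine vllm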

Values of <INFERENCE-ENGINE> can be:

  • hf: Vanilla Hugging Face inference with AutoModelForCausalLM (a minimal sketch follows this list)
  • tgi: Flash Attention via Hugging Face's text-generation-inference (TGI)
  • vllm: PagedAttention via vLLM
  • hf_pipeling: Inference using the Hugging Face text-generation pipeline
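
For reference, the hf baseline corresponds roughly to the following minimal sketch. It assumes only standard transformers APIs; the prompt, dtype, and generation settings are illustrative, not the exact code in main.py:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch of the `hf` baseline engine: plain AutoModelForCausalLM
# generation in fp16. Prompt and generation settings are illustrative.
model_id = "WizardLM/WizardCoder-15B-V1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompts = ["def fibonacci(n):"] * 8  # one batch of identical prompts
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])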

To run the complete benchmark:

sh scripts/run_benchmark.sh

Results:

  • Flash Attention (implemented via text-generation-inference) performs best across most settings. However, with long sequences (especially long input sequences), it tends to run out of memory (OOM).

  • Paged Attention (via vLLM) performs second best in our benchmark runs and handles long sequences better: in the settings where Flash Attention fails with OOM, vLLM completes the generation (a minimal sketch of the vLLM path follows these results).

  • HF Generate (baseline) - Hugging Face's vanilla AutoModelForCausalLM, taken as the baseline.

  • HF Pipeline - Hugging Face's text-generation pipeline performed the worst of all (results to be added).
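
As an illustration of the vLLM path, a minimal sketch; the prompt and sampling parameters are assumptions, not the benchmark's exact settings:

from vllm import LLM, SamplingParams

# Minimal sketch of the `vllm` engine: PagedAttention allocates the KV cache
# in fixed-size blocks on demand, which is why it survives the long-sequence
# settings described above. Prompt and parameters are illustrative.
llm = LLM(model="WizardLM/WizardCoder-15B-V1.0")
params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["def fibonacci(n):"] * 64, params)
print(outputs[0].outputs[0].text)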

Results with short sequence inputs:

Results with long sequence inputs:

  • With a batch size of 64, the HF baseline threw OOM; Flash Attention performed better than Paged Attention.
  • With a batch size of 128, both HF and Flash Attention threw OOM, while Paged Attention completed the generations.
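
A rough back-of-the-envelope estimate suggests why. Assuming a StarCoder-style architecture (40 layers, head dimension 128, multi-query attention, fp16) — assumptions about the model family, not numbers measured in this repo — the KV cache alone becomes substantial at large batch sizes, while vLLM's paged allocator only commits blocks as sequences actually grow:

# Hedged back-of-the-envelope KV-cache estimate. Architecture numbers are
# assumptions based on StarCoder (40 layers, head dim 128, multi-query
# attention => a single KV head), not values measured from this repo.
n_layers, head_dim, bytes_fp16 = 40, 128, 2
kv_per_token = 2 * n_layers * head_dim * bytes_fp16    # K and V -> ~20 KB/token

batch, max_seq = 128, 8192                             # full StarCoder context
kv_gib = batch * max_seq * kv_per_token / 2**30
print(f"KV cache at full context: ~{kv_gib:.0f} GiB")  # ~20 GiB
# ~20 GiB of cache on top of ~29 GiB of fp16 weights exceeds a 40 GB card;
# vLLM commits KV blocks on demand, so generations that stay short of the
# maximum length stay within budget.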

CSV results of the benchmark are available at results/results.csv

TODO (Future Optimisations):

For further improvements in throughput:

References & Credits: