🔥Quick Start • 📚Documents • 💻LLM code • 📜Citation • 🙏Acknowledgement
EvalPlus is a rigorous evaluation framework for LLM4Code, with:
- ✨ HumanEval+: 80x more tests than the original HumanEval!
- ✨ MBPP+: 35x more tests than the original MBPP!
- ✨ Evaluation framework: our packages/images/tools let you easily and safely evaluate LLMs on the above benchmarks.
Why EvalPlus?
- ✨ Precise evaluation & ranking: See our leaderboard for the latest LLM rankings before & after rigorous evaluation.
- ✨ Coding rigorousness: Look at the score differences before and after applying the EvalPlus tests! A smaller drop is better, as it indicates more rigorous and less lax code generation, while a big drop means the generated code tends to be fragile.
- ✨ Pre-generated samples: EvalPlus accelerates LLM4Code research by open-sourcing LLM-generated samples for various models -- no need to re-run the expensive benchmarks!
Want to know more details? Read our NeurIPS'23 paper as well as our Google Slides!
Important
🚧 MBPP+ update (v0.1.0 to v0.2.0):
We recently improved and stabilized the MBPP+ dataset by removing tasks whose test_list is incorrect (an issue inherited from the original MBPP dataset), making the remaining tasks more reasonable to solve.
MBPP+ v0.1.0 has 399 tasks, while the new v0.2.0 has 378 tasks.
We also improved the test oracle. Therefore, with v0.2.0 you can expect a ~4pp pass@1 improvement on both base and plus tests.
Tip
EvalPlus ❤️ bigcode-evaluation-harness! HumanEval+ and MBPP+ have been integrated into bigcode-evaluation-harness, so you can also run the EvalPlus datasets there!
To quickly perform code generation and evaluation on HumanEval+:
pip install evalplus --upgrade
transformers backend:
evalplus.evaluate --model "mistralai/Mistral-7B-Instruct-v0.3" \
--dataset [humaneval|mbpp] \
--backend hf \
--greedy
Note
EvalPlus uses different prompts for base and chat models.
By default, the mode is detected via tokenizer.chat_template when using the hf/vllm backends; for other backends, only chat mode is supported.
Therefore, if your base model ships a tokenizer.chat_template, please add --force-base-prompt to avoid evaluating it in chat mode.
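For example, evaluating a base model whose tokenizer nonetheless defines a chat template could look like the following (the model name is a placeholder):

```bash
# Hypothetical model name; force completion-style (base) prompts even though
# the tokenizer defines a chat template.
evalplus.evaluate --model "your-org/your-base-model" \
                  --dataset humaneval \
                  --backend vllm \
                  --force-base-prompt \
                  --greedy
```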
Enable Flash Attention 2:
# Install Flash Attention 2
pip install packaging ninja
pip install flash-attn --no-build-isolation
# Note: if you have installation problem, consider using pre-built
# wheels from https://github.com/Dao-AILab/flash-attention/releases
# Run evaluation with FA2
evalplus.evaluate --model "mistralai/Mistral-7B-Instruct-v0.3" \
--dataset [humaneval|mbpp] \
--backend hf \
--attn-implementation [flash_attention_2|sdpa] \
--greedy
vllm backend:
pip install "evalplus[vllm]" --upgrade # Install vLLM backend
evalplus.evaluate --model "mistralai/Mistral-7B-Instruct-v0.3" \
--dataset [humaneval|mbpp] \
--backend vllm \
--tp [TENSOR_PARALLEL_SIZE] \
--greedy
openai compatible servers (e.g., vLLM):
# Launch a model server first: e.g., https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html
evalplus.evaluate --model "mistralai/Mistral-7B-Instruct-v0.3" \
--dataset [humaneval|mbpp] \
--backend openai \
--base-url http://localhost:8000/v1 \
--greedy
- Access OpenAI APIs from OpenAI Console
export OPENAI_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gpt-4o" \
--dataset [humaneval|mbpp] \
--backend openai \
--greedy
- Access Anthropic APIs from Anthropic Console
export ANTHROPIC_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "claude-3-haiku-20240307" \
--dataset [humaneval|mbpp] \
--backend anthropic \
--greedy
- Access Gemini APIs from Google AI Studio
export GOOGLE_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gemini-1.5-pro" \
--dataset [humaneval|mbpp] \
--backend gemini \
--greedy
You can check out the generated code samples and evaluation results at evalplus_results/[humaneval|mbpp]/.
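A quick way to see what was produced is to list that directory. The exact file names may vary by EvalPlus version; the layout sketched in the comments below is an assumption based on the default output path above:

```bash
# Inspect the output directory (humaneval shown; mbpp works analogously).
ls evalplus_results/humaneval/
# Typically contains the raw and sanitized generations (*.jsonl) plus the
# cached evaluation results (*.eval_results.json) for each evaluated model.
```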
⏬ Install nightly version:
pip install --upgrade "git+https://github.com/evalplus/evalplus.git" # without vLLM
pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus@master" # with vLLM
⏬ Using EvalPlus as a local repo?
git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
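With the repository on PYTHONPATH, you can run the tools directly from the source checkout. The sketch below assumes the evaluate module exposes the same entry point as the pip-installed console script; check the repository if `python -m` does not work in your version:

```bash
# Run the evaluator from the local source tree instead of the pip package.
python -m evalplus.evaluate --model "mistralai/Mistral-7B-Instruct-v0.3" \
                            --dataset humaneval \
                            --backend vllm \
                            --greedy
```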
To learn more about how to use EvalPlus, please refer to the project documentation.
We also share pre-generated code samples from LLMs we have evaluated:
- HumanEval+: See the attachment of our v0.1.0 release.
- MBPP+: See the attachment of our v0.2.0 release.
Each sample file is packaged in a zip file named like ${model_name}_temp_${temperature}.zip. You can unzip it to a folder named like ${model_name}_temp_${temperature} and run the evaluation from scratch with:
evalplus.evaluate --dataset humaneval --samples ${model_name}_temp_${temperature}
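For instance, with a hypothetical sample archive named codellama-7b_temp_0.2.zip downloaded from the release attachments:

```bash
# Unzip the pre-generated samples; per the naming convention above, this
# should yield a folder like codellama-7b_temp_0.2/
unzip codellama-7b_temp_0.2.zip

# Re-run the evaluation locally on the unpacked samples.
evalplus.evaluate --dataset humaneval --samples codellama-7b_temp_0.2
```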
@inproceedings{evalplus,
title = {Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
author = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming},
booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
year = {2023},
url = {https://openreview.net/forum?id=1qvx610Cu7},
}