code-eval

Run evaluation on LLMs using the HumanEval benchmark.

What

This is a repo I use to run HumanEval on code models; adjust it as needed. Some scripts are adapted from the WizardCoder repo. The code is duplicated across the per-model scripts, mostly to handle edge cases around model tokenization and loading (it might eventually get cleaned up).
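For example, the loading path differs between models: WizardCoder loads with default Hugging Face settings, while the Replit checkpoints ship custom model/tokenizer code and need trust_remote_code. The snippet below only illustrates that difference; the hub IDs and arguments are assumptions, not copied from this repo's scripts.

from transformers import AutoModelForCausalLM, AutoTokenizer

# WizardCoder is a standard checkpoint and loads with default settings.
wizard_tokenizer = AutoTokenizer.from_pretrained("WizardLM/WizardCoder-15B-V1.0")
wizard_model = AutoModelForCausalLM.from_pretrained("WizardLM/WizardCoder-15B-V1.0")

# Replit ships custom tokenizer/model code, so it needs trust_remote_code=True.
replit_tokenizer = AutoTokenizer.from_pretrained("replit/replit-code-v1-3b", trust_remote_code=True)
replit_model = AutoModelForCausalLM.from_pretrained("replit/replit-code-v1-3b", trust_remote_code=True)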

Results

model                             | size | pass@1 | pass@10 | screenshot
WizardCoder-15B-V1.0              | 15B  | 57%    | 68.9%   | wizardcoder
openchat/opencoderplus            | 15B  | 27.3%  | 43.9%   | opencoder
teknium/Replit-v1-CodeInstruct-3B | 3B   | 25.8%  | 42.6%   | replit-codeinstruct-v1
teknium/Replit-v2-CodeInstruct-3B | 3B   | 21.5%  | 31%     | replit-codeinstruct-v2
replit-code-v1-3b                 | 3B   | 15.1%  | 27.4%   | replit-code-v1

Setup

Create a Python environment

python -m venv env && source env/bin/activate

Install dependencies

pip install -r requirements.txt

Run the eval script

# replace script file name for various models:
# eval_wizard.py
# eval_opencode.py
# eval_replit.py
# eval_replit_instruct.py

python eval_wizard.py
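Each eval script follows roughly the same pattern: read the HumanEval problems, generate completions for every task, and write them to a jsonl file under results/. A minimal sketch of that loop is shown below; the model ID, prompt handling, sampling parameters, and output path are illustrative assumptions, not the exact code in eval_wizard.py.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from human_eval.data import read_problems, write_jsonl

model_id = "WizardLM/WizardCoder-15B-V1.0"  # assumed hub ID, for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

problems = read_problems()  # {task_id: {"prompt": ..., "test": ..., ...}}
samples = []
for task_id, problem in problems.items():
    inputs = tokenizer(problem["prompt"], return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.2)
    # Keep only the newly generated tokens, not the echoed prompt.
    completion = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    samples.append({"task_id": task_id, "completion": completion})

write_jsonl("results/wizard/eval.jsonl", samples)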

Process the jsonl file to extract code samples from model completions

Note: the Replit base and instruct models do not go through this step

# replace args for various models:
# --path results/wizard --out_path results/wizard/processed.jsonl
# --path results/opencode --out_path results/opencode/processed.jsonl

python process_eval.py --path results/wizard --out_path results/wizard/processed.jsonl --add_prompt
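The job of this step is to pull runnable code out of chat-style completions (typically a markdown-fenced block) before the functional-correctness check, with --add_prompt presumably re-attaching the original HumanEval prompt. The extraction amounts to something like the sketch below; the function name and regex are illustrative, not the repo's actual process_eval.py implementation.

import re

def extract_code(completion: str) -> str:
    # Prefer the first fenced code block in the completion, e.g. ```python ... ```
    match = re.search(r"```(?:python)?\n(.*?)```", completion, re.DOTALL)
    if match:
        return match.group(1)
    # Fall back to the raw completion if no fence was emitted.
    return completion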

Then get the results

# replace args for various models:
# results/wizard/processed.jsonl
# results/opencode/processed.jsonl
# results/replit_instruct/eval.jsonl
# results/replit/eval.jsonl

evaluate_functional_correctness results/wizard/processed.jsonl
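evaluate_functional_correctness is the CLI installed by the human-eval package: it expects a jsonl file whose lines contain task_id and completion fields, executes each completion against the HumanEval unit tests, and prints the pass@k scores. The same check can be run from Python; the path below simply mirrors the wizard example above.

from human_eval.evaluation import evaluate_functional_correctness

# Returns a dict such as {"pass@1": ..., "pass@10": ...}; pass@10 needs at least 10 samples per task.
results = evaluate_functional_correctness("results/wizard/processed.jsonl", k=[1, 10])
print(results)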