# code-eval

## What
This is a repo I use to run HumanEval on code models; adjust it as needed. Some scripts are adapted from the WizardCoder repo. The code is duplicated across the per-model scripts, mostly to handle edge cases around model tokenization and loading (I might eventually clean this up).
## Results
model | size | pass@1 | pass@10
---|---|---|---
WizardCoder-15B-V1.0 | 15B | 57.0% | 68.9%
openchat/opencoderplus | 15B | 27.3% | 43.9%
teknium/Replit-v1-CodeInstruct-3B | 3B | 25.8% | 42.6%
teknium/Replit-v2-CodeInstruct-3B | 3B | 21.5% | 31.0%
replit-code-v1-3b | 3B | 15.1% | 27.4%
## Setup
Create a Python environment:

```sh
python -m venv env && source env/bin/activate
```
Install dependencies:

```sh
pip install -r requirements.txt
```
Run the eval script:

```sh
# replace the script file name for other models:
# eval_opencode.py
# eval_replit.py
# eval_replit_instruct.py
python eval_wizard.py
```
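For reference, each `eval_*.py` script follows roughly the same shape: load the model, sample several completions per HumanEval task, and dump them to a JSONL file. Below is a minimal sketch assuming the `transformers` and `human-eval` packages; the model name, dtype, and sampling settings are illustrative placeholders, not the repo's exact configuration.

```python
# Minimal sketch of an eval script: sample completions for every HumanEval
# task and write them to a JSONL file. Model name and generation settings
# are illustrative placeholders, not this repo's exact configuration.
import torch
from human_eval.data import read_problems, write_jsonl
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "WizardLM/WizardCoder-15B-V1.0"  # swap per model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

problems = read_problems()  # dict: task_id -> {"prompt": ..., "test": ...}
samples = []
for task_id, problem in problems.items():
    inputs = tokenizer(problem["prompt"], return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.2,
        num_return_sequences=10,  # need >= 10 samples per task for pass@10
    )
    for seq in outputs:
        # drop the prompt tokens, keep only the generated completion
        completion = tokenizer.decode(
            seq[inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        samples.append({"task_id": task_id, "completion": completion})

write_jsonl("results/wizard/eval.jsonl", samples)
```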
Process the JSONL file to extract the code samples from the model completions. Note: the Replit base and instruct models do not go through this step; their `eval.jsonl` files are scored directly in the next step.

```sh
# replace args for other models, e.g.:
# --path results/opencode --out_path results/opencode/processed.jsonl
python process_eval.py --path results/wizard --out_path results/wizard/processed.jsonl --add_prompt
```
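The extraction itself is mostly string surgery on the raw completions. Here is a sketch of what that step amounts to, assuming the model wraps its answer in a markdown code fence; `extract_code` is an illustrative helper name, not necessarily what `process_eval.py` calls it.

```python
# Sketch of post-processing: pull the code out of each raw completion so only
# executable code reaches the evaluator. The fenced-block assumption and the
# helper name are illustrative.
import json
import re

def extract_code(completion: str) -> str:
    """Return the first fenced code block, or the raw text if none is found."""
    match = re.search(r"```(?:python)?\n(.*?)```", completion, re.DOTALL)
    return match.group(1) if match else completion

with open("results/wizard/eval.jsonl") as f_in, \
        open("results/wizard/processed.jsonl", "w") as f_out:
    for line in f_in:
        sample = json.loads(line)
        sample["completion"] = extract_code(sample["completion"])
        f_out.write(json.dumps(sample) + "\n")
```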
Then get the results:

```sh
# replace the path for other models:
# results/opencode/processed.jsonl
# results/replit_instruct/eval.jsonl
# results/replit/eval.jsonl
evaluate_functional_correctness results/wizard/processed.jsonl
```
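`evaluate_functional_correctness` (the human-eval package entry point) runs every sample against the task's unit tests and reports pass@k using the unbiased estimator from the HumanEval paper. For reference, the estimator itself is small; this mirrors `human_eval.evaluation.estimate_pass_at_k`:

```python
# Unbiased pass@k (Chen et al., 2021): pass@k = E[1 - C(n-c, k) / C(n, k)],
# where n = samples per task and c = samples that pass the tests.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# e.g. 10 samples for a task, 4 of which pass:
print(pass_at_k(10, 4, 1))   # -> 0.4
print(pass_at_k(10, 4, 10))  # -> 1.0
```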