code-eval

Run evaluation on LLMs using the HumanEval benchmark.

What

This is a repo I use to run HumanEval on code models; adjust it as needed. Some scripts are adapted from the WizardCoder repo. The code is duplicated across the per-model scripts, mostly to handle edge cases around model tokenization and loading (it might eventually get cleaned up).
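For example, the loading path differs between models: WizardCoder loads with default Hugging Face settings, while the Replit checkpoints ship custom model/tokenizer code and need trust_remote_code. The snippet below only illustrates that difference; the hub IDs and arguments are assumptions, not copied from this repo's scripts.

from transformers import AutoModelForCausalLM, AutoTokenizer

# WizardCoder is a standard checkpoint and loads with default settings.
wizard_tokenizer = AutoTokenizer.from_pretrained("WizardLM/WizardCoder-15B-V1.0")
wizard_model = AutoModelForCausalLM.from_pretrained("WizardLM/WizardCoder-15B-V1.0")

# Replit ships custom tokenizer/model code, so it needs trust_remote_code=True.
replit_tokenizer = AutoTokenizer.from_pretrained("replit/replit-code-v1-3b", trust_remote_code=True)
replit_model = AutoModelForCausalLM.from_pretrained("replit/replit-code-v1-3b", trust_remote_code=True)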

Results

model                             | size | pass@1 | pass@10 | screenshot
WizardCoder-15B-V1.0              | 15B  | 57%    | 68.9%   | wizardcoder
openchat/opencoderplus            | 15B  | 27.3%  | 43.9%   | opencoder
teknium/Replit-v1-CodeInstruct-3B | 3B   | 25.8%  | 42.6%   | replit-codeinstruct-v1
teknium/Replit-v2-CodeInstruct-3B | 3B   | 21.5%  | 31%     | replit-codeinstruct-v2
replit-code-v1-3b                 | 3B   | 15.1%  | 27.4%   | replit-code-v1

Setup

Create a Python environment

python -m venv env && source env/bin/activate

Install dependencies

pip install -r requirements.txt

Run the eval script

# replace script file name for various models:
# eval_wizard.py
# eval_opencode.py
# eval_replit.py
# eval_replit_instruct.py

python eval_wizard.py
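Each eval script follows roughly the same pattern: read the HumanEval problems, generate completions for every task, and write them to a jsonl file under results/. A minimal sketch of that loop is shown below; the model ID, prompt handling, sampling parameters, and output path are illustrative assumptions, not the exact code in eval_wizard.py.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from human_eval.data import read_problems, write_jsonl

model_id = "WizardLM/WizardCoder-15B-V1.0"  # assumed hub ID, for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

problems = read_problems()  # {task_id: {"prompt": ..., "test": ..., ...}}
samples = []
for task_id, problem in problems.items():
    inputs = tokenizer(problem["prompt"], return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.2)
    # Keep only the newly generated tokens, not the echoed prompt.
    completion = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    samples.append({"task_id": task_id, "completion": completion})

write_jsonl("results/wizard/eval.jsonl", samples)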

Process the jsonl file to extract code samples from model completions

Note: the Replit base and instruct models do not go through this step

# replace args for various models:
# --path results/wizard --out_path results/wizard/processed.jsonl
# --path results/opencode --out_path results/opencode/processed.jsonl

python process_eval.py --path results/wizard --out_path results/wizard/processed.jsonl --add_prompt
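The job of this step is to pull runnable code out of chat-style completions (typically a markdown-fenced block) before the functional-correctness check, with --add_prompt presumably re-attaching the original HumanEval prompt. The extraction amounts to something like the sketch below; the function name and regex are illustrative, not the repo's actual process_eval.py implementation.

import re

def extract_code(completion: str) -> str:
    # Prefer the first fenced code block in the completion, e.g. ```python ... ```
    match = re.search(r"```(?:python)?\n(.*?)```", completion, re.DOTALL)
    if match:
        return match.group(1)
    # Fall back to the raw completion if no fence was emitted.
    return completion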

Then get the results

# replace args for various models:
# results/wizard/processed.jsonl
# results/opencode/processed.jsonl
# results/replit_instruct/eval.jsonl
# results/replit/eval.jsonl

evaluate_functional_correctness results/wizard/processed.jsonl
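evaluate_functional_correctness is the CLI installed by the human-eval package: it expects a jsonl file whose lines contain task_id and completion fields, executes each completion against the HumanEval unit tests, and prints the pass@k scores. The same check can be run from Python; the path below simply mirrors the wizard example above.

from human_eval.evaluation import evaluate_functional_correctness

# Returns a dict such as {"pass@1": ..., "pass@10": ...}; pass@10 needs at least 10 samples per task.
results = evaluate_functional_correctness("results/wizard/processed.jsonl", k=[1, 10])
print(results)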