Performance of StarCoder on HumanEvalFixDocs
With StarCoder, I am observing a pass@1 score of 58.9 instead of 43.5 as reported in the OctoCoder paper.
Script used:
accelerate launch main.py \
--model $MODEL_DIR \
--tasks humanevalfixdocs-python \
--do_sample True \
--temperature 0.2 \
--n_samples 20 \
--batch_size 1 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt starcodercommit \
--save_generations_path $MODEL_DIR/generations_humanevalfixdocspython_starcodercommit_prompt.json \
--metric_output_path $MODEL_DIR/evaluation_humanevalfixdocspython_starcodercommit_prompt.json \
--max_length_generation 2048 \
--precision fp16
Results:
{
"humanevalfixdocs-python": {
"pass@1": 0.589329268292683,
"pass@10": 0.6989868047455075
},
"config": {
"prefix": "",
"do_sample": true,
"temperature": 0.2,
"top_k": 0,
"top_p": 0.95,
"n_samples": 20,
"eos": "<|endoftext|>",
"seed": 0,
"model": "starcoder",
"modeltype": "causal",
"peft_model": null,
"revision": null,
"use_auth_token": false,
"trust_remote_code": true,
"tasks": "humanevalfixdocs-python",
"instruction_tokens": null,
"batch_size": 1,
"max_length_generation": 2048,
"precision": "fp16",
"load_in_8bit": false,
"load_in_4bit": false,
"limit": null,
"limit_start": 0,
"postprocess": true,
"allow_code_execution": true,
"generation_only": false,
"load_generations_path": null,
"load_data_path": null,
"metric_output_path": "starcoder/evaluation_humanevalfixdocspython_starcodercommit_sample_prompt.json",
"save_generations": true,
"save_generations_path": "starcoder/generations_humanevalfixdocspython_starcodercommit_sample_prompt.json",
"save_references": false,
"prompt": "starcodercommit",
"max_memory_per_gpu": null,
"check_references": false
}
}
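For context on how these numbers are aggregated: with `--n_samples 20`, the evaluation scores each problem with the unbiased pass@k estimator from the Codex paper and averages over problems. A minimal sketch of that estimator (the standard formula, not copied from the harness source):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k from Chen et al. 2021: 1 - C(n-c, k) / C(n, k),
    # computed as a numerically stable product.
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# e.g. a problem where 2 of the 20 samples pass:
print(pass_at_k(n=20, c=2, k=1))   # 0.1
print(pass_at_k(n=20, c=2, k=10))  # ~0.763
```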
CC: @Muennighoff
A few things are different in the command we ran: we use `--precision bf16` instead of `fp16`, `--max_length_generation 1800`, and `--batch_size 5`. All of them can slightly affect the score, though I would be surprised if by this much.
You can verify the 43.5 we got here: https://huggingface.co/datasets/bigcode/evaluation/blob/main/starcoder/humanevalfixdocs/commit_format/evaluation_humanevalfixdocspy_starcoder_temp02.json and the generations here: https://huggingface.co/datasets/bigcode/evaluation/blob/main/starcoder/humanevalfixdocs/commit_format/generations_humanevalfixdocspy_starcoder_temp02.json. If you want, you can directly compare those generations to yours to see where the discrepancies may be.
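If it helps, a rough sketch of such a comparison (the filenames are placeholders; download the reference file from the link above first, and it assumes the list-of-lists layout that `--save_generations` produces, i.e. one list of n_samples strings per problem):

```python
import json

# Placeholder filenames: the reference generations from the HF dataset link
# above, and your own save_generations output.
with open("generations_reference.json") as f:
    ref = json.load(f)
with open("generations_mine.json") as f:
    mine = json.load(f)

# Report, per problem, how many of the paired samples differ.
for i, (r, m) in enumerate(zip(ref, mine)):
    differing = sum(a != b for a, b in zip(r, m))
    if differing:
        print(f"problem {i}: {differing}/{len(r)} samples differ")
```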
Overall, yeah, the commit format on the pretrained StarCoder works really well. On the regular HumanEvalFix, StarCoder + commit format also outperforms OctoCoder; see the table below from Appendix G. The problem with the commit format is that it does not work well for code synthesis or explanation.
[Screenshot: table from Appendix G comparing StarCoder with the commit format against OctoCoder on HumanEvalFix]
> On the regular HumanEvalFix, StarCoder + Commit Format also outperforms OctoCoder, see the below Table from Appendix G.
This is helpful. Thanks! I feel this deserves a mention in Table 2 itself then :)
Could you also share the script you use to obtain https://huggingface.co/datasets/bigcode/evaluation/blob/main/starcoder/humanevalfixdocs/commit_format/evaluation_humanevalfixdocspy_starcoder_temp02.json?
I can try re-running it in the exact same config that you used.
Thanks!
Sure, it would be:
accelerate launch main.py \
--model $MODEL_DIR \
--tasks humanevalfixdocs-python \
--do_sample True \
--temperature 0.2 \
--n_samples 20 \
--batch_size 5 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt starcodercommit \
--save_generations_path $MODEL_DIR/generations_humanevalfixdocspython_starcodercommit_prompt.json \
--metric_output_path $MODEL_DIR/evaluation_humanevalfixdocspython_starcodercommit_prompt.json \
--max_length_generation 1800 \
--precision bf16
With this script, I observe a pass@1 score of 60.1.
{
"humanevalfixdocs-python": {
"pass@1": 0.6009146341463415,
"pass@10": 0.6974812593960444
},
"config": {
"prefix": "",
"do_sample": true,
"temperature": 0.2,
"top_k": 0,
"top_p": 0.95,
"n_samples": 20,
"eos": "<|endoftext|>",
"seed": 0,
"model": "starcoder",
"modeltype": "causal",
"peft_model": null,
"revision": null,
"use_auth_token": false,
"trust_remote_code": true,
"tasks": "humanevalfixdocs-python",
"instruction_tokens": null,
"batch_size": 5,
"max_length_generation": 1800,
"precision": "bf16",
"load_in_8bit": false,
"load_in_4bit": false,
"limit": null,
"limit_start": 0,
"postprocess": true,
"allow_code_execution": true,
"generation_only": false,
"load_generations_path": null,
"load_data_path": null,
"metric_output_path": "starcoder/evaluation_humanevalfixdocspython_starcodercommit_prompt_bf16.json",
"save_generations": true,
"save_generations_path": "starcoder/generations_humanevalfixdocspython_starcodercommit_prompt_bf16.json",
"save_references": false,
"prompt": "starcodercommit",
"max_memory_per_gpu": null,
"check_references": false
}
}
CC: @Muennighoff
You're right, it seems the result in the paper is too low. I reran it & got the below:
{
"humanevalfixdocs-python": {
"pass@1": 0.5878048780487805,
"pass@10": 0.6939082542089792
},
"config": {
"prefix": "",
"do_sample": true,
"temperature": 0.2,
"top_k": 0,
"top_p": 0.95,
"n_samples": 20,
"eos": "<|endoftext|>",
"seed": 0,
"model": "starcoder",
"modeltype": "causal",
"revision": null,
"use_auth_token": false,
"trust_remote_code": true,
"tasks": "humanevalfixdocs-python",
"instruction_tokens": null,
"batch_size": 5,
"max_length_generation": 1800,
"precision": "bf16",
"load_in_8bit": false,
"load_in_4bit": false,
"limit": null,
"limit_start": 0,
"postprocess": true,
"allow_code_execution": true,
"generation_only": false,
"load_generations_path": null,
"load_data_path": null,
"metric_output_path": "evaluation_humanevalfixdocspython_starcoder_temp02_commit.json",
"save_generations": true,
"save_generations_path": "generations_humanevalfixdocspython_starcoder_temp02_commit.json",
"save_references": false,
"prompt": "starcodercommit",
"max_memory_per_gpu": null,
"check_references": false
}
}
I will update the paper soon. Thanks a lot for noting this!
Thanks @Muennighoff, this is very helpful! :)