Performance of StarCoder on HumanEvalFixDocs
With StarCoder, I am observing a pass@1 score of 58.9 instead of 43.5 as reported in the OctoCoder paper.
Script used:
accelerate launch main.py \
--model $MODEL_DIR \
--tasks humanevalfixdocs-python \
--do_sample True \
--temperature 0.2 \
--n_samples 20 \
--batch_size 1 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt starcodercommit \
--save_generations_path $MODEL_DIR/generations_humanevalfixdocspython_starcodercommit_prompt.json \
--metric_output_path $MODEL_DIR/evaluation_humanevalfixdocspython_starcodercommit_prompt.json \
--max_length_generation 2048 \
--precision fp16
Results:
{
"humanevalfixdocs-python": {
"pass@1": 0.589329268292683,
"pass@10": 0.6989868047455075
},
"config": {
"prefix": "",
"do_sample": true,
"temperature": 0.2,
"top_k": 0,
"top_p": 0.95,
"n_samples": 20,
"eos": "<|endoftext|>",
"seed": 0,
"model": "starcoder",
"modeltype": "causal",
"peft_model": null,
"revision": null,
"use_auth_token": false,
"trust_remote_code": true,
"tasks": "humanevalfixdocs-python",
"instruction_tokens": null,
"batch_size": 1,
"max_length_generation": 2048,
"precision": "fp16",
"load_in_8bit": false,
"load_in_4bit": false,
"limit": null,
"limit_start": 0,
"postprocess": true,
"allow_code_execution": true,
"generation_only": false,
"load_generations_path": null,
"load_data_path": null,
"metric_output_path": "starcoder/evaluation_humanevalfixdocspython_starcodercommit_sample_prompt.json",
"save_generations": true,
"save_generations_path": "starcoder/generations_humanevalfixdocspython_starcodercommit_sample_prompt.json",
"save_references": false,
"prompt": "starcodercommit",
"max_memory_per_gpu": null,
"check_references": false
}
}
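For context on how these numbers are aggregated: with `--n_samples 20`, the evaluation scores each problem with the unbiased pass@k estimator from the Codex paper and averages over problems. A minimal sketch of that estimator (the standard formula, not copied from the harness source):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k from Chen et al. 2021: 1 - C(n-c, k) / C(n, k),
    # computed as a numerically stable product.
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# e.g. a problem where 2 of the 20 samples pass:
print(pass_at_k(n=20, c=2, k=1))   # 0.1
print(pass_at_k(n=20, c=2, k=10))  # ~0.763
```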
CC: @Muennighoff
A few things are different in the command we ran: we use `--precision bf16` instead of `fp16`, `--max_length_generation 1800`, and `--batch_size 5`. All of them can slightly affect the score, though I would be surprised if by this much.
You can verify the 43.5 we got here: https://huggingface.co/datasets/bigcode/evaluation/blob/main/starcoder/humanevalfixdocs/commit_format/evaluation_humanevalfixdocspy_starcoder_temp02.json and the generations here: https://huggingface.co/datasets/bigcode/evaluation/blob/main/starcoder/humanevalfixdocs/commit_format/generations_humanevalfixdocspy_starcoder_temp02.json. If you want, you can directly compare those generations to yours to see where the discrepancies may be.
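If it helps, a rough sketch of such a comparison (the filenames are placeholders; download the reference file from the link above first, and it assumes the list-of-lists layout that `--save_generations` produces, i.e. one list of n_samples strings per problem):

```python
import json

# Placeholder filenames: the reference generations from the HF dataset link
# above, and your own save_generations output.
with open("generations_reference.json") as f:
    ref = json.load(f)
with open("generations_mine.json") as f:
    mine = json.load(f)

# Report, per problem, how many of the paired samples differ.
for i, (r, m) in enumerate(zip(ref, mine)):
    differing = sum(a != b for a, b in zip(r, m))
    if differing:
        print(f"problem {i}: {differing}/{len(r)} samples differ")
```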
Overall, yeah, the commit format on the pretrained StarCoder works really well. On the regular HumanEvalFix, StarCoder + commit format also outperforms OctoCoder; see the table below from Appendix G. The problem with the commit format is that it does not work well for code synthesis or explanation.
[Screenshot: table from Appendix G comparing StarCoder with the commit format against OctoCoder on HumanEvalFix]
> On the regular HumanEvalFix, StarCoder + Commit Format also outperforms OctoCoder, see the below Table from Appendix G.
This is helpful. Thanks! I feel this deserves a mention in Table 2 itself then :)
Could you also share the script you use to obtain https://huggingface.co/datasets/bigcode/evaluation/blob/main/starcoder/humanevalfixdocs/commit_format/evaluation_humanevalfixdocspy_starcoder_temp02.json?
I can try re-running it in the exact same config that you used.
Thanks!
Sure, it would be:
accelerate launch main.py \
--model $MODEL_DIR \
--tasks humanevalfixdocs-python \
--do_sample True \
--temperature 0.2 \
--n_samples 20 \
--batch_size 5 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt starcodercommit \
--save_generations_path $MODEL_DIR/generations_humanevalfixdocspython_starcodercommit_prompt.json \
--metric_output_path $MODEL_DIR/evaluation_humanevalfixdocspython_starcodercommit_prompt.json \
--max_length_generation 1800 \
--precision bf16
With this script, I observe a pass@1 score of 60.1.
{
"humanevalfixdocs-python": {
"pass@1": 0.6009146341463415,
"pass@10": 0.6974812593960444
},
"config": {
"prefix": "",
"do_sample": true,
"temperature": 0.2,
"top_k": 0,
"top_p": 0.95,
"n_samples": 20,
"eos": "<|endoftext|>",
"seed": 0,
"model": "starcoder",
"modeltype": "causal",
"peft_model": null,
"revision": null,
"use_auth_token": false,
"trust_remote_code": true,
"tasks": "humanevalfixdocs-python",
"instruction_tokens": null,
"batch_size": 5,
"max_length_generation": 1800,
"precision": "bf16",
"load_in_8bit": false,
"load_in_4bit": false,
"limit": null,
"limit_start": 0,
"postprocess": true,
"allow_code_execution": true,
"generation_only": false,
"load_generations_path": null,
"load_data_path": null,
"metric_output_path": "starcoder/evaluation_humanevalfixdocspython_starcodercommit_prompt_bf16.json",
"save_generations": true,
"save_generations_path": "starcoder/generations_humanevalfixdocspython_starcodercommit_prompt_bf16.json",
"save_references": false,
"prompt": "starcodercommit",
"max_memory_per_gpu": null,
"check_references": false
}
}
CC: @Muennighoff
You're right, it seems the result in the paper is too low. I reran it & got the below:
{
"humanevalfixdocs-python": {
"pass@1": 0.5878048780487805,
"pass@10": 0.6939082542089792
},
"config": {
"prefix": "",
"do_sample": true,
"temperature": 0.2,
"top_k": 0,
"top_p": 0.95,
"n_samples": 20,
"eos": "<|endoftext|>",
"seed": 0,
"model": "starcoder",
"modeltype": "causal",
"revision": null,
"use_auth_token": false,
"trust_remote_code": true,
"tasks": "humanevalfixdocs-python",
"instruction_tokens": null,
"batch_size": 5,
"max_length_generation": 1800,
"precision": "bf16",
"load_in_8bit": false,
"load_in_4bit": false,
"limit": null,
"limit_start": 0,
"postprocess": true,
"allow_code_execution": true,
"generation_only": false,
"load_generations_path": null,
"load_data_path": null,
"metric_output_path": "evaluation_humanevalfixdocspython_starcoder_temp02_commit.json",
"save_generations": true,
"save_generations_path": "generations_humanevalfixdocspython_starcoder_temp02_commit.json",
"save_references": false,
"prompt": "starcodercommit",
"max_memory_per_gpu": null,
"check_references": false
}
}
I will update the paper soon. Thanks a lot for noting this!
Thanks @Muennighoff, this is very helpful! :)