salesforce/CodeRL

Problems in reproducing the RL fine-tuned results

abhik1505040 opened this issue · 8 comments

Hi, thanks for open-sourcing your amazing work!

I have been trying to reproduce the RL fine-tuned results reported in the paper, but unfortunately, I am encountering some issues. Here is a brief overview of the steps I followed:

  • Fine-tuned the actor model with CE loss for 10 epochs using train_actor.sh and the CodeT5-NTP model. This fine-tuned model gives results similar to the paper's (2.86 pass@5 vs. 2.90 reported).

  • With some modifications to generate.py, generated 20 candidate samples per problem (following the sample files given in the repo) and greedy baseline code for the training set with the CE fine-tuned model. The result key required for the corresponding gen_solutions.json and baseline_solutions.json was generated with this snippet.

  • Generated the token level hidden states/critic scores with the released critic model through generate_critic_scores.sh.

  • Ran RL fine-tuning with the default hyperparameters in train_actor_rl.sh. The resulting RL-finetuned model gives severely degraded results (0.84 pass@5).

I would greatly appreciate any suggestions you may have on hyperparameter choices or other settings that could help me reproduce the RL-finetuned results accurately.

Many thanks!
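For anyone comparing the pass@5 numbers above against their own runs, the standard way to compute them from n sampled candidates is the unbiased pass@k estimator from the Codex paper (Chen et al., 2021). A minimal sketch (the function name and usage here are illustrative, not from the CodeRL repo):

```python
# Unbiased pass@k estimator: given n generated samples per problem,
# of which c pass all unit tests, estimate the probability that at
# least one of k randomly drawn samples is correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """1 - C(n-c, k) / C(n, k); returns 1.0 when fewer than k samples fail."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 20 samples per problem, 1 of which is correct:
print(pass_at_k(20, 1, 5))  # 0.25
```

Averaging this quantity over all problems gives the reported pass@k percentage.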

@abhik1505040 Thanks for reporting the observations. The RL finetuning stage can be quite sensitive to hyperparameters. Based on my experience, you should experiment with a larger batch size e.g. 256 samples per training step, and experiment with lower learning rates.
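If 256 samples per step don't fit in GPU memory, the usual workaround is gradient accumulation. A toy sketch of the pattern (the linear model and loss here are stand-ins, not the CodeRL actor):

```python
# Gradient accumulation: reach an effective batch size of 256 by
# accumulating gradients over several small micro-batches before
# taking one optimizer step.
import torch

model = torch.nn.Linear(8, 2)  # toy stand-in for the actor model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

per_device_batch = 8
accum_steps = 256 // per_device_batch  # 32 micro-batches per update

optimizer.zero_grad()
for _ in range(accum_steps):
    x = torch.randn(per_device_batch, 8)
    # Divide by accum_steps so accumulated grads average, not sum.
    loss = model(x).pow(2).mean() / accum_steps
    loss.backward()
optimizer.step()  # one parameter update per 256 samples
optimizer.zero_grad()
```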

Another trick is to use a new LM head for the RL training iterations. We could initialize this head as a clone of the original LM head from the fine-tuned checkpoint, following this. This strategy can help stabilize RL fine-tuning for T5 models. In some cases, though, e.g. in GPT-J experiments, I found the benefit not too significant.
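The cloning trick above amounts to deep-copying the fine-tuned head so RL gradients flow into a separate module. A minimal sketch with toy dimensions (the real head would be the `lm_head` of the CodeT5 checkpoint):

```python
# Clone the CE fine-tuned LM head into a fresh module for RL training:
# both start from identical weights, but updates to one no longer
# affect the other.
import copy
import torch

ce_lm_head = torch.nn.Linear(16, 100, bias=False)  # stand-in for the fine-tuned head
rl_lm_head = copy.deepcopy(ce_lm_head)             # independent clone used during RL

assert torch.equal(ce_lm_head.weight, rl_lm_head.weight)  # same init
with torch.no_grad():
    rl_lm_head.weight += 0.01  # simulate an RL update
assert not torch.equal(ce_lm_head.weight, rl_lm_head.weight)  # now decoupled
```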

Yeah, I'm hitting the same failure cases. I haven't got the numbers for the RL fine-tuned model yet, but they should be similar to yours @abhik1505040 . In particular, in many files it just generates repetitive text like MockRecorder over and over instead of a proper function. For me, the result from the model fine-tuned on ground-truth examples (train_actor.sh) is quite similar to yours.

@henryhungle Thank you very much for the pointers. I'll give them a try!

I'm also facing the same issue!

@abhik1505040 Hi, I'd like to know the pass@1 result of your model fine-tuned with CE loss for 10 epochs. My pass@1 is much lower than the one in the paper, but my pass@5 is similar to both yours and the paper's.

Hi @sssszh, apologies for the late response; I observed similarly poor performance for pass@1 as well. The exact score was 0.67.

Hi, @abhik1505040
Did you get any better results? And what are your changes to the default hyperparameters? I found that using --clone_rl_head did improve the result a little bit (well, strict accuracy > 0 ^^).

Hi folks, I also get pass@1 of approximately 1% but pass@5 of 2.4% with the CE-loss fine-tuned model. After trying a bunch of temperatures, 0.2 seems to get me the best pass@1, at 1.1%. Does anyone have any updates on reproducing the CE fine-tuned model? Thanks a lot!! @doviettung96 @abhik1505040 @sssszh
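For what it's worth, the effect of a low temperature like 0.2 on pass@1 is easy to see directly: dividing the logits by T < 1 sharpens the softmax, concentrating probability mass on the model's top token so sampling behaves almost greedily. A toy illustration (the logits here are made up, not real model outputs):

```python
# Temperature-scaled softmax: lower temperature -> sharper distribution,
# so samples concentrate on the highest-logit token (better pass@1,
# less diversity for pass@k).
import math

def softmax(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
p_sharp = softmax(logits, 0.2)  # near-greedy
p_flat = softmax(logits, 1.0)   # more diverse
assert max(p_sharp) > max(p_flat)
```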