Problems in reproducing the RL fine-tuned results
abhik1505040 opened this issue · 8 comments
Hi, thanks for open-sourcing your amazing work!
I have been trying to reproduce the RL fine-tuned results reported in the paper, but unfortunately, I am encountering some issues. Here is a brief overview of the steps I followed:
1. Fine-tuned the actor model with CE loss for 10 epochs with `train_actor.sh` and the CodeT5-NTP model. This fine-tuned model gives similar results to the paper (2.86 pass@5 compared to 2.90 in the paper).
2. With some modifications to `generate.py`, generated 20 candidate samples per problem (following the sample files given in the repo) and greedy baseline codes for the training set with the CE fine-tuned model. The `result` key required for the corresponding `gen_solutions.json` and `baseline_solutions.json` was generated with this snippet.
3. Generated the token-level hidden states/critic scores with the released critic model through `generate_critic_scores.sh`.
4. RL fine-tuning with the default hyperparameters present in `train_actor_rl.sh` gives very degraded results (0.84 pass@5).
I would greatly appreciate any suggestions you may have on hyperparameter choices or other settings that could help me reproduce the RL-finetuned results accurately.
Many thanks!
@abhik1505040 Thanks for reporting the observations. The RL fine-tuning stage can be quite sensitive to hyperparameters. Based on my experience, you should experiment with a larger batch size, e.g. 256 samples per training step, and with lower learning rates.
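If GPU memory won't fit 256 samples at once, the usual workaround is gradient accumulation. A sketch with a toy stand-in model (none of these names come from the CodeRL codebase):

```python
import torch
import torch.nn as nn

# Illustrative only: reach an effective batch of 256 samples per
# optimizer step via gradient accumulation, with a lowered learning rate.
EFFECTIVE_BATCH = 256
MICRO_BATCH = 8
ACCUM_STEPS = EFFECTIVE_BATCH // MICRO_BATCH  # 32 micro-batches per step

model = nn.Linear(16, 1)  # toy stand-in for the actor
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def train_step(micro_batches):
    """One optimizer step accumulated over ACCUM_STEPS micro-batches."""
    optimizer.zero_grad()
    for x, y in micro_batches:
        # scale each micro-loss so the summed gradient equals the
        # gradient of the mean loss over the effective batch
        loss = nn.functional.mse_loss(model(x), y) / ACCUM_STEPS
        loss.backward()  # gradients accumulate across micro-batches
    optimizer.step()

batches = [(torch.randn(MICRO_BATCH, 16), torch.randn(MICRO_BATCH, 1))
           for _ in range(ACCUM_STEPS)]
train_step(batches)
```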
Another trick is to use a new LM head for the RL training iterations. We can initialize this head as a clone of the original LM head from the fine-tuned checkpoint, following this. This strategy can help stabilize RL fine-tuning for T5 models, but in some cases, e.g. in GPT-J experiments, I found the benefit not too significant.
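The cloning trick itself is just a deep copy of the fine-tuned head. A minimal sketch with a toy module (illustrative names, not the repo's actual internals):

```python
import copy
import torch
import torch.nn as nn

# Toy stand-in for a T5-style model with a tied-off LM head.
class TinyLM(nn.Module):
    def __init__(self, hidden=32, vocab=100):
        super().__init__()
        self.lm_head = nn.Linear(hidden, vocab, bias=False)

model = TinyLM()
# Clone the fine-tuned head: same initial weights, independent
# parameters, so RL updates don't disturb the original head.
model.rl_head = copy.deepcopy(model.lm_head)
```

During RL training, logits would then come from `rl_head` while the original `lm_head` keeps its supervised weights.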
Yeah, I'm hitting the same failure cases. I haven't got the numbers for the fine-tuned model with generated code yet, but they should be similar to yours @abhik1505040. In particular, in many files it just generates repetitive text like `MockRecorder` over and over (instead of a proper function). For me, the result from the model fine-tuned on ground-truth examples (`train_actor.sh`) is quite similar to yours.
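For auditing generations like that before scoring them, a cheap n-gram heuristic can flag degenerate repetitive outputs. Purely an illustrative helper, not part of the repo:

```python
def looks_repetitive(text: str, ngram: int = 3, threshold: float = 0.5) -> bool:
    """True if fewer than `threshold` of the token n-grams are distinct,
    which catches outputs that loop on a phrase like 'MockRecorder'."""
    tokens = text.split()
    if len(tokens) < ngram * 2:
        return False  # too short to judge
    ngrams = [tuple(tokens[i:i + ngram])
              for i in range(len(tokens) - ngram + 1)]
    return len(set(ngrams)) / len(ngrams) < threshold

print(looks_repetitive("MockRecorder " * 50))           # degenerate loop
print(looks_repetitive("def add(a, b): return a + b"))  # normal code
```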
@henryhungle Thank you very much for the pointers. I'll give them a try!
I'm also facing the same issue!
@abhik1505040 Hi, I want to know the pass@1 result of your model fine-tuned with CE loss for 10 epochs. My pass@1 is much lower than the one in the paper, but my pass@5 is similar to both yours and the paper's.
Hi @sssszh, apologies for the late response; I observed similarly poor performance for pass@1 as well. The exact score was 0.67.
Hi @abhik1505040, did you get any better results? And what are your changes to the default hyperparameters? I found that using `--clone_rl_head` did improve the result a little bit (well, strict accuracy > 0 ^^).
Hi folks, I also get pass@1 of approximately 1% but pass@5 at 2.4% with the CE-loss fine-tuned model. After trying a bunch of temperatures, 0.2 seems to get me the best pass@1 at 1.1%. I wonder if anyone has any updates on reproducing the CE fine-tuned model? Thanks a lot!! @doviettung96 @abhik1505040 @sssszh
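The temperature effect here is easy to see directly: dividing logits by a temperature below 1 sharpens the softmax toward the greedy token, which usually helps pass@1 at the cost of sample diversity. A self-contained sketch with made-up logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits scaled by 1/temperature (numerically stable)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up next-token logits
print(softmax_with_temperature(logits, 1.0))  # relatively flat
print(softmax_with_temperature(logits, 0.2))  # nearly one-hot on argmax
```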