Reward model in the reinforcement learning process
Hello DeepSeek Team, thanks for your great work!
I fine-tuned your previous DeepSeek-Coder 33B model and got a model that performs well on the HumanEval benchmark: https://github.com/bin123apple/AutoCoder. However, when tested on the HumanEval+ benchmark, the new model's performance falls short.
I suspect this is because, for all the data entries with execution feedback in my dataset, I only covered a small number of test cases. I also noticed that in your paper you mention that your reward model is trained on data provided by the compiler.
Would it be possible for you to disclose whether the data used to train the reward model included test cases, or whether it only required the code to pass the compiler? If test cases were included, could you also share how many test cases each data entry typically contains?
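For concreteness, here is a minimal sketch of the two reward definitions I am asking about. This is purely my own illustration, not your implementation: the function names, the 0/1 reward values, and the stdin/stdout format of the test cases are all assumptions on my side.

```python
import os
import subprocess
import sys
import tempfile


def compile_only_reward(code: str) -> float:
    """Reward 1.0 if the code passes a syntax check (a stand-in for
    'accepted by the compiler'), 0.0 otherwise."""
    try:
        compile(code, "<generated>", "exec")
        return 1.0
    except SyntaxError:
        return 0.0


def test_case_reward(code: str, test_cases: list[tuple[str, str]]) -> float:
    """Reward = fraction of (stdin, expected stdout) pairs the code satisfies."""
    passed = 0
    for stdin_text, expected_stdout in test_cases:
        # Write the candidate program to a temporary file and execute it.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        try:
            result = subprocess.run(
                [sys.executable, path],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=5,
            )
            if result.stdout.strip() == expected_stdout.strip():
                passed += 1
        except subprocess.TimeoutExpired:
            pass  # treat timeouts as failures
        finally:
            os.unlink(path)
    return passed / len(test_cases) if test_cases else 0.0
```

My concern is that if each entry only has a handful of test cases, the second signal degenerates toward the first, which might explain the gap I see between HumanEval and HumanEval+.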
Thanks again for your great work!