salesforce/CodeRL

RL with execution-based reward

ysymyth opened this issue · 2 comments

Hi, it seems that in the code the RL trainer uses the critic model to generate rewards. Or is it using offline RL with preloaded rewards?

I wonder if there is a way to use online RL with execution-based rewards, or if it's too slow/unstable in practice. Thanks!
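To be concrete, by execution-based rewards I mean something roughly like the sketch below: run each generated program against the problem's unit tests and turn the outcome into a scalar reward. This is only an illustration; the function name, the test format, and the ±1 reward scale are my own assumptions, not anything taken from this repo.

```python
import subprocess
import sys
import tempfile

def execution_reward(generated_code: str, test_code: str, timeout: float = 5.0) -> float:
    """Return +1.0 if the generated code passes the unit tests, -1.0 otherwise."""
    # Write the candidate program plus its tests to a temporary script.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return 1.0 if result.returncode == 0 else -1.0
    except subprocess.TimeoutExpired:
        # Treat hangs/infinite loops as failures.
        return -1.0
```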

Yes, the trainer uses offline RL with pre-generated rewards. The rewards were calculated from a trained critic together with the execution results of generated code samples. We follow standard RL training practice, which often freezes the critic model and updates only the actor. Another reason for our approach is, as you mentioned, the slow and unstable process of obtaining execution results for generated code samples (e.g., in an online approach, executing generated code may accidentally affect the training process). I expect an online approach might be ideal, as it would utilize the most up-to-date policy/actor network.
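For intuition, an offline actor update with a frozen critic and pre-generated rewards could look roughly like the sketch below. This is a minimal illustration, not the repository's actual trainer: the `actor`/`critic` call signatures and the batch fields are assumptions, and the loss is a simple reward-weighted log-likelihood rather than the exact objective used in the paper.

```python
import torch
import torch.nn.functional as F

def offline_actor_step(actor, critic, batch, optimizer):
    """One actor update on a batch of previously generated samples and rewards."""
    input_ids = batch["input_ids"]    # problem description tokens
    sample_ids = batch["sample_ids"]  # previously generated code tokens
    rewards = batch["rewards"]        # pre-computed from unit-test execution

    # Critic stays frozen: its token-level scores only reweight the loss.
    with torch.no_grad():
        token_scores = critic(input_ids, sample_ids)   # (batch, seq_len)

    logits = actor(input_ids, sample_ids)              # (batch, seq_len, vocab)
    log_probs = F.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(-1, sample_ids.unsqueeze(-1)).squeeze(-1)

    # Reward-weighted maximum likelihood: sequence-level reward times
    # critic-estimated token importance, applied to the actor's log-probs.
    loss = -(rewards.unsqueeze(-1) * token_scores * chosen).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the samples and rewards are generated ahead of time, nothing in this loop executes untrusted code during training, which is the stability advantage mentioned above.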

Thanks!