Actor model finetuning code based on reward and policy gradient
parshinsh opened this issue · 1 comment
parshinsh commented
Thanks for the great work! Could you share the code for the whole RL finetuning framework (Actor and Critic updates based on the reward defined in the paper) for better reproducibility? For example, the code for updating the Actor network based on the reward and policy gradient is missing.
henryhungle commented
@parshinsh We have updated the code for finetuning the actor model with synthetic samples and their return estimates.
Thank you for your patience!
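
For readers following this thread, here is a minimal sketch of what an actor update "based on reward and policy gradient" generally looks like. It is not the repository's code. It assumes a plain REINFORCE-style objective in which the log-probability of a sampled sequence is weighted by its return estimate minus an optional baseline (for example, a critic's prediction). The names `log_softmax`, `policy_gradient_loss`, `token_logits`, `sampled_tokens`, `return_estimate`, and `baseline` are all hypothetical and chosen for illustration only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def policy_gradient_loss(token_logits, sampled_tokens, return_estimate, baseline=0.0):
    """REINFORCE-style loss for a single sampled sequence (illustrative only).

    token_logits:    one list of vocabulary logits per generated step
    sampled_tokens:  the token id that was sampled at each step
    return_estimate: scalar return assigned to the whole sequence (assumed given)
    baseline:        optional scalar subtracted from the return (assumed given)
    """
    advantage = return_estimate - baseline
    # Sum of log-probabilities of the tokens that were actually sampled.
    log_prob = sum(
        log_softmax(logits)[tok]
        for logits, tok in zip(token_logits, sampled_tokens)
    )
    # Minimizing this loss raises the probability of sequences with a
    # positive advantage and lowers it for those with a negative one.
    return -advantage * log_prob


# Toy usage: two steps over a two-token vocabulary with uniform logits.
loss = policy_gradient_loss(
    token_logits=[[0.0, 0.0], [0.0, 0.0]],
    sampled_tokens=[0, 1],
    return_estimate=1.0,
)
```

In an actual training loop the logits would come from the actor model and the loss would be backpropagated by an autodiff framework. How the return estimates and any baseline are computed is specific to the paper and is not reproduced here.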