Just me playing around with the latest GPT zoo
https://arxiv.org/pdf/2203.02155.pdf
Explains a three-step recipe:
- Supervised fine-tuning of a "knowledge" model, in this case GPT-3
- A "reward model" to summarize human understanding of what is considered "acceptable"
- Policy trained using PPO and the reward model
https://arxiv.org/abs/1706.03741 Uses a pairwise ranking loss to learn a reward function from human preference comparisons
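
A minimal sketch of the pairwise ranking idea, assuming the reward model has already produced one scalar score for the preferred item and one for the rejected item. The function name and the scalar inputs are mine for illustration, not taken from either paper's code. The loss is the negative log-probability that the preferred item wins under a Bradley-Terry style model, i.e. -log(sigmoid(r_preferred - r_rejected)).

```python
import math


def pairwise_ranking_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_preferred - reward_rejected)).

    Hypothetical helper: both arguments are scalar scores that a reward
    model is assumed to have produced for the two items being compared.
    """
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so that exp() is never called on a large positive number.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When both scores are equal the loss is log(2), about 0.693, and it shrinks toward 0 as the preferred item is scored further above the rejected one.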
https://arxiv.org/pdf/1707.06347.pdf PPO paper, a simplified alternative to TRPO
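
A rough sketch of the clipped surrogate objective from the PPO paper, to show where the simplification comes from: instead of TRPO's constrained optimization, the probability ratio is simply clipped to [1 - epsilon, 1 + epsilon]. The function names and the plain-list inputs are assumptions for illustration. The ratio is new_policy_prob / old_policy_prob for the action taken, and the advantages are assumed to be precomputed.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-sample PPO term: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the min makes the objective a pessimistic bound: moving the
    # ratio outside the clip range cannot increase the objective.
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_clip_objective(ratios, advantages, epsilon: float = 0.2) -> float:
    """Average the clipped term over a batch (to be maximized)."""
    terms = [clipped_surrogate(r, a, epsilon) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)
```

With epsilon = 0.2, a ratio of 1.5 and a positive advantage of 1.0 gives a clipped term of 1.2, while the same ratio with advantage -1.0 gives -1.5, since the min keeps the more pessimistic value.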