
utterances-bot opened this issue · 0 comments

RLHF without RL - Direct Preference Optimization | ICLR Blogposts 2024

We discuss the RL part of RLHF and its recent displacement by direct preference optimization (DPO). With DPO, a language model can be aligned with human preferences without sampling from the LM during training, which significantly simplifies the training process. By now, DPO has been implemented in many projects and seems to be here to stay.
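For context, the DPO objective the post discusses trains directly on preference pairs: it compares how much the policy's log-probability of the preferred completion has moved relative to a frozen reference model against the same quantity for the dispreferred completion. A minimal sketch in plain Python (the function name, `beta` value, and log-probabilities below are illustrative, not from the post):

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (margin_chosen - margin_rejected))."""
    # Log-ratios of policy vs. reference: the model's implicit reward for each completion.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable -log(sigmoid(logits)) = log(1 + exp(-logits)).
    return math.log1p(math.exp(-logits))

# One hypothetical preference pair: the policy already favors the chosen answer
# slightly more than the reference model does, so the loss is below log(2).
loss = dpo_loss(policy_logp_chosen=-10.0, policy_logp_rejected=-12.0,
                ref_logp_chosen=-10.5, ref_logp_rejected=-11.0, beta=0.1)
```

No sampling or reward model is needed: the loss is computed from log-probabilities of fixed, pre-collected completions, which is the simplification over PPO-based RLHF that the post describes.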

https://iclr-blogposts.github.io/2024/blog/rlhf-without-rl/