RLHF without RL - Direct Preference Optimization | ICLR Blogposts 2024
We discuss the RL component of RLHF and its recent displacement by direct preference optimization (DPO). With DPO, a language model can be aligned with human preferences without sampling completions from the model during training, which significantly simplifies the training process. By now, DPO has been implemented in many projects and seems to be here to stay.
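To make the contrast with RL-based RLHF concrete, the DPO objective can be sketched as a simple supervised loss over preference pairs. The following is a minimal illustration, not the blog post's own code; the function name and scalar log-probability inputs are assumptions for clarity (real implementations operate on batched token log-probs, e.g. in PyTorch):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (sketch, scalar inputs assumed).

    Inputs are summed log-probabilities of the preferred ("chosen") and
    dispreferred ("rejected") completions under the trainable policy and
    under a frozen reference model.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # completion over the rejected one, relative to the reference model.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Negative log-sigmoid of the margin (a binary cross-entropy form).
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy equals the reference model, the margin is zero and the
# loss is log 2; increasing the margin drives the loss toward zero.
print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
```

No sampling, reward model, or policy-gradient machinery appears anywhere: the loss is computed directly from log-probabilities of fixed preference data, which is the simplification the post describes.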