CyberAgentAILab/filtered-dpo
Filtered Direct Preference Optimization (fDPO) improves language-model alignment with human preferences by discarding preference-dataset samples whose quality is lower than that of samples generated by the learning model.
Jupyter Notebook · MIT license
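The filtering idea in the description could be sketched roughly as follows — compare each dataset pair's chosen response against a fresh sample from the current policy under a reward model, and drop pairs whose chosen response scores lower. All names and the data layout here are illustrative assumptions, not the repository's actual API:

```python
# Hypothetical sketch of the fDPO filtering step.
# `policy_generate` and `reward_fn` are placeholder callables, not
# functions from this repository.

def filter_dataset(dataset, policy_generate, reward_fn):
    """Keep only preference pairs whose chosen response scores at least
    as high as a sample generated by the current policy."""
    kept = []
    for prompt, chosen, rejected in dataset:
        generated = policy_generate(prompt)
        # Discard the pair if the model's own sample already beats
        # the dataset's "chosen" response under the reward model.
        if reward_fn(prompt, chosen) >= reward_fn(prompt, generated):
            kept.append((prompt, chosen, rejected))
    return kept
```

In this sketch, filtering would be re-run periodically (e.g. each epoch) before standard DPO updates, so the retained data stays higher-quality than what the improving policy can already produce.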