filtered-dpo

Filtered Direct Preference Optimization (fDPO) improves language-model alignment with human preferences by discarding preference-dataset samples whose quality is lower than that of responses generated by the learning model during training.
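
A minimal sketch of the filtering idea, assuming a reward model and a policy-sampling function (hypothetical names, not the repository's actual API): pairs whose chosen response scores below a fresh sample from the current policy are dropped before the next DPO update.

```python
# Hypothetical sketch of the fDPO filtering step; see the repository for the real implementation.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str      # human-preferred response from the offline dataset
    rejected: str    # dispreferred response


def filter_pairs(
    pairs: List[PreferencePair],
    sample_from_policy: Callable[[str], str],  # hypothetical: generate with the current policy
    reward: Callable[[str, str], float],       # hypothetical: reward-model score r(prompt, response)
) -> List[PreferencePair]:
    """Keep only pairs whose chosen response beats a fresh policy sample."""
    kept = []
    for pair in pairs:
        policy_response = sample_from_policy(pair.prompt)
        # Discard the pair when the dataset's chosen response is judged
        # lower-quality than what the learning model already produces.
        if reward(pair.prompt, pair.chosen) >= reward(pair.prompt, policy_response):
            kept.append(pair)
    return kept
```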

Primary language: Jupyter Notebook · License: MIT
