Code release for *Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization*.

TL;DR: MODPO extends the DPO loss with a margin term that steers language models toward multiple objectives at once.
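A minimal sketch of the idea (scalar, per-example form; the function names, the `beta=0.1` default, and the scalar weights are illustrative, not this repo's API): relative to DPO, the MODPO loss subtracts a margin given by the reward difference of the chosen vs. rejected response under the *other* objectives' reward models, weighted by the objective weights.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logratio_chosen, logratio_rejected, beta=0.1):
    """Standard DPO loss on the policy/reference log-prob ratios."""
    return -math.log(sigmoid(beta * (logratio_chosen - logratio_rejected)))

def modpo_loss(logratio_chosen, logratio_rejected,
               rest_reward_chosen, rest_reward_rejected,
               w_k=0.5, w_rest=0.5, beta=0.1):
    """MODPO loss: DPO scaled by 1/w_k, minus a margin from the
    other objectives' rewards (illustrative scalar form)."""
    margin = (w_rest / w_k) * (rest_reward_chosen - rest_reward_rejected)
    implicit = (beta / w_k) * (logratio_chosen - logratio_rejected)
    return -math.log(sigmoid(implicit - margin))
```

With `w_rest = 0` and `w_k = 1` the margin vanishes and MODPO reduces to plain DPO; a positive margin (the rejected response scores worse on the other objectives) makes the example harder, increasing the loss.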
```shell
conda create -n modpo python=3.10
conda activate modpo
pip install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
# (optional) pip install flash-attn==2.3.2 --no-build-isolation
```
This repository includes two MODPO examples:

- Safety alignment (`scripts/modpo/beavertails`): balances different values such as safety vs. helpfulness.
- Summarization with length penalty (`scripts/modpo/summarize_w_length_penalty`): reduces length bias (verbosity) in summarization.
This repository also contains other off-the-shelf tuning recipes:

- SFT (Supervised Fine-tuning): `scripts/examples/sft/run.sh`
- RM (Reward Modeling): `scripts/examples/rm/run.sh`
- DPO (Direct Preference Optimization): `scripts/examples/dpo/run.sh`
To implement new alignment algorithms, add new trainers under `src/trainer`.

For supported datasets, refer to `REAL_DATASET_CONFIGS` in `src/data/configs.py`.

To train on your own datasets, add them under `src/data/raw_data` and modify `REAL_DATASET_CONFIGS` in `src/data/configs.py` accordingly. See `src/data/raw_data/shp` for an example.
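As a hypothetical sketch of what registering a dataset might look like (the actual schema of `REAL_DATASET_CONFIGS` lives in `src/data/configs.py`; the dict layout and `register_dataset` helper below are assumptions, not the repo's real API):

```python
# Hypothetical registry sketch -- consult src/data/configs.py for the
# real REAL_DATASET_CONFIGS schema before adding an entry.
REAL_DATASET_CONFIGS = {
    "shp": {"path": "src/data/raw_data/shp"},  # existing example (layout assumed)
}

def register_dataset(name, path):
    """Illustrative helper: map a dataset name to its raw-data location
    so that training scripts can reference it by name."""
    REAL_DATASET_CONFIGS[name] = {"path": path}

register_dataset("my_dataset", "src/data/raw_data/my_dataset")
```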
```bibtex
@misc{zhou2023onepreferencefitsall,
  title={Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization},
  author={Zhanhui Zhou and Jie Liu and Chao Yang and Jing Shao and Yu Liu and Xiangyu Yue and Wanli Ouyang and Yu Qiao},
  year={2023},
  eprint={2310.03708},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
```