/APO

Code for ACL2024 paper - Adversarial Preference Optimization (APO).

Primary LanguagePythonApache License 2.0Apache-2.0

Adversarial Preference Optimization

Code License Data License Python 3.8+

This repo contains the implementation of the ACL 2024 paper:

In Adversarial Preference Optimization (APO), we let the reward model (RM) and LLM agent play a min-max game, through which both models can be further enhanced without additional preference annotation.

For an overview, the repo contains:

Environment

We use Python3.8 with the dependencies listed in requirements.txt. To build the appropriate environment, use the following command:

pip3 install -r requirements.txt

Data & Annotation

To separately update RM and LLM, we split the cleaned Helpful&Harmless (HH) dataset into an RM training set and a LLM training set.

Data Type HH-RM Train Set HH-LLM Train Set HH Test Set
Preference Pairs RM training set RM validation set (sampled 10K pairs) RM testing set
Golden Answers APO positive responses
LLM Samples APO negative responses (alpaca_rm_samples) LLM alignment samples (alpaca_llm_samples) LLM testing Queries

On both HH-RM and HH-LLM training sets, we infer four LLM responses for each query as alpaca_rm_samples and alpaca_llm_samples. alpaca_rm_samples is combined with the golden responses on the HH-RM set as APO RM training pairs. alpaca_llm_samples is further scored by RMs and used for LLM alignment. To obtain LLM responses by yourself, run the command:

bash tools/llm_response_gen.sh

RM Training

Base RM Training

We build our RM on the pretrained LLaMA-7B (decapoda-research/llama-7b-hf). To train the base RM for rejection sampling, use the following command:

REPO_DIR=<path_to_this_repo>
DATA_DIR=${REPO_DIR}/data/hh-split
TRAIN_DATA_LIST="${DATA_DIR}/rm_data/hh_split_rm.train.json"
TEST_DATA_LIST="${DATA_DIR}/eval_data/hh_cleaned_origin.test.json\
		${DATA_DIR}/eval_data/hh_split_llm.valid.json"
		
NUM_GPUS=8
BATCH_SIZE=64
MICRO_BATCH_SIZE=1
LEARNING_RATE=1e-6
GRADIENT_ACCUMULATION_STEP=$((BATCH_SIZE / NUM_GPUS / MICRO_BATCH_SIZE))

torchrun --nproc_per_node=${NUM_GPUS} --master_port=6000 ${REPO_DIR}/train.py \
    --task_type hh_split \
    --do_train True \
    --eval_at_start False \
    --model_type reward \
    --model_name_or_path "decapoda-research/llama-7b-hf" \
    --data_type "comparison_pair" \
    --train_data_path ${TRAIN_DATA_LIST} \
    --eval_data_path ${TEST_DATA_LIST} \
    --rm_calibration True \
    --data_suffix rm_base \
    --add_sep_token True \
    --remove_unused_columns false \
    --output_dir <path_to_save_your_RM_checkpoint> \
    --num_train_epochs 1 \
    --per_device_train_batch_size ${MICRO_BATCH_SIZE} \
    --per_device_eval_batch_size ${MICRO_BATCH_SIZE} \
    --gradient_accumulation_steps ${GRADIENT_ACCUMULATION_STEP} \
    --evaluation_strategy steps \
    --padding_side right \
    --truncation_side left \
    --pooling_type last \
    --max_length 512 \
    --save_strategy steps \
    --learning_rate ${LEARNING_RATE} \
    --warmup_steps 100 \
    --deepspeed configs/default_offload_opt_param.json \
    --tf32 false --fp16 false

We also trained a testing RM to automatically evaluate the LLM response quality on the testing queries. To train the testing RM, change TRAIN_DATA_LIST=${DATA_DIR}/hh_cleaned_origin.train.json in the above command to learn with all the HH training comparisons.

The RM training data files (values in TRAIN_DATA_LIST) are lists of dictionaries, where each dictionary is an RM training item (--data_type="comparison_pair") including the following keys:

  • text: a list of query-response text, split by a special token <sep>.
  • scores: a list of float numbers, representing the preference scores of the corresponding query-response text.
  • query_id: a unique ID to the RM training item.

APO RM Training

To train the APO RM, first merge LLM samples and golden annotations into APO comparison pairs:

REPO_DIR=<path_to_this_repo>
DATA_DIR="${REPO_DIR}/data/hh-split"

python3 ${REPO_DIR}/tools/apo_data_converter.py \
	--golden_data_path ${DATA_DIR}/rm_data/hh_split_rm.golden.json \
	--sample_data_path ${DATA_DIR}/rm_data/hh_split_rm_alpaca_v0.sample.json \
	--output_dir ${DATA_DIR}/apo_data \
	--apo_data_name "rm_apo_data_v0"

Then use the following command to conduct APO RM finetuning:

REPO_DIR=<path_to_this_repo>
DATA_DIR=${REPO_DIR}/data/hh-split
TRAIN_DATA_LIST="${DATA_DIR}/rm_data/hh_split_rm.train.json \
		 ${DATA_DIR}/apo_data/rm_apo_data_v0_text_scores.json"
NUM_APO_SAMPLES=4

TEST_DATA_LIST="${DATA_DIR}/eval_data/hh_cleaned_origin.test.json \
		${DATA_DIR}/eval_data/hh_split_llm.valid.json"
		
NUM_GPUS=8
BATCH_SIZE=64
MICRO_BATCH_SIZE=1
LEARNING_RATE=1e-6
APO_COEFF=0.1
GRADIENT_ACCUMULATION_STEP=$((BATCH_SIZE / NUM_GPUS / MICRO_BATCH_SIZE))


torchrun --nproc_per_node=${NUM_GPUS} --master_port=6000 ${REPO_DIR}/train.py \
    --task_type apo \
    --do_train True \
    --eval_at_start False \
    --model_type reward \
    --model_name_or_path "decapoda-research/llama-7b-hf" \
    --data_type "comparison_pair" \
    --train_data_path ${TRAIN_DATA_LIST} \
    --eval_data_path ${TEST_DATA_LIST} \
    --rm_calibration True \
    --data_suffix rm_apo_v1 \
    --add_sep_token True \
    --remove_unused_columns false \
    --output_dir <path_to_save_your_APO_RM_checkpoint> \
    --num_train_epochs 1 \
    --apo_loss_coeff ${APO_COEFF} \
    --apo_sample_num ${NUM_APO_SAMPLES} \
    --per_device_train_batch_size ${MICRO_BATCH_SIZE} \
    --per_device_eval_batch_size ${MICRO_BATCH_SIZE} \
    --gradient_accumulation_steps ${GRADIENT_ACCUMULATION_STEP} \
    --evaluation_strategy steps \
    --padding_side right \
    --truncation_side left \
    --pooling_type last \
    --max_length 512 \
    --save_strategy steps \
    --save_total_limit 10 \
    --learning_rate ${LEARNING_RATE} \
    --warmup_steps 100 \
    --deepspeed configs/default_offload_opt_param.json \
    --tf32 false --fp16 false

RM Scoring

After finishing the RM training, we can use the following command to scoring new LLM samples:

REPO_DIR=<path_to_this_repo>
DATA_DIR=${REPO_DIR}/data/hh-split/llm_data
DATA_PATH="${DATA_DIR}/hh_split_llm_alpaca_v0.sample.json"

MODEL_PATH=<path_to_your_RM_checkpoint>
MODEL_NAME="base_rm" # or "apo_rm"

NUM_GPUS=8
MICRO_BATCH_SIZE=16

torchrun --nproc_per_node=${NUM_GPUS} --master_port=6000 ${REPO_DIR}/train.py \
    --task_type inference \
    --do_train False \
    --eval_at_start True \
    --model_type reward \
    --model_name_or_path ${MODEL_PATH} \
    --data_type "reject_sample" \
    --eval_data_path ${DATA_PATH} \
    --rm_calibration False \
    --data_suffix ${MODEL_NAME} \
    --add_sep_token True \
    --remove_unused_columns false \
    --output_dir <path_to_save_your_inference_results> \
    --per_device_eval_batch_size ${MICRO_BATCH_SIZE} \
    --evaluation_strategy steps \
    --padding_side right \
    --truncation_side left \
    --pooling_type last \
    --max_length 512 \
    --deepspeed configs/default_offload_opt_param.json \
    --tf32 false --fp16 false


# rejection sampling
SCORE_PATH=${DATA_PATH}_pred_${MODEL_NAME}_results.json
OUTPUT_FILE_NAME=${DATA_PATH}_rjs_${MODEL_NAME}.json

python3 ${REPO_DIR}/tools/rejection_sampling.py \
	--data_path ${DATA_DIR} \
	--score_path ${SCORE_PATH} \
	--output_dir ${DATA_DIR} \
	--rm_scorer  ${MODEL_NAME} \
	--output_file_name ${OUTPUT_FILE_NAME}

# remove tmp inference files
rm ${DATA_DIR}/*rank*.jsonl

After inference process, we obtain a RM scoring file ${DATA_PATH}_rjs_${MODEL_NAME}.json. Then we can update the Alpaca model with the training pipeline here.

Citation

@inproceedings{cheng2024adversarial,
  title={Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM Game},
  author={Cheng, Pengyu and Yang, Yifan and Li, Jian and Dai, Yong and Hu, Tianhao and Cao, Peixin and Du, Nan and Li, Xiaolong},
  booktitle={Findings of the Association for Computational Linguistics},
  year={2024}
}