Paper 🤗Data 🤗Rating Reward Model 🤗Ranking Reward Model
Authors: Hritik Bansal, John Dang, Aditya Grover
pip install -r requirements.txt
You can access the large-scale ratings and rankings feedback data at 🤗 HF Datasets.
- The first step is to collect instruction data in the instructions.json format.
- The file should be formatted as:
[
  {"instruction": instruction_1, "input": input_1 (optional, as many instances may not have any input)},
  {"instruction": instruction_2, "input": input_2},
  ...
]
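For concreteness, here is a minimal sketch of producing a file in this format. The example instructions are purely illustrative and not from the repo:

```python
import json

# Illustrative instances only; "input" may be omitted when an instance has none.
instructions = [
    {"instruction": "Summarize the following paragraph.", "input": "Large language models ..."},
    {"instruction": "Write a haiku about autumn."},  # no input for this instance
]

with open("instructions.json", "w") as f:
    json.dump(instructions, f, indent=2)
```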
- In our work, we merge three instruction datasets: a subset of Dolly, UserOrient Self Instruct, and a subset of SuperNI-V2. Feel free to create your own custom instruction dataset. Our subset of SuperNI-V2 is present here.
- Once the data.json is ready, we use the Alpaca-7B model to generate five responses for each instance.
- Use the original Alpaca repo to set up the environment and download the model checkpoints.
- Use this script, inference_alpaca, to generate a CSV file with instructions, input, response1, response2, response3, response4, and response5 in it.
- A sample Alpaca generation is present here.
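If you want to see the sampling step in isolation, the following rough sketch draws five responses per instance from a HF causal LM. The prompt template and generation hyperparameters are placeholders and not necessarily those used by inference_alpaca:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "<path to alpaca-7b>"  # assumption: a local Alpaca-7B checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).cuda()

def generate_responses(instruction, input_text="", n=5):
    # Placeholder prompt template; the repo's script defines its own.
    prompt = f"Instruction: {instruction}\nInput: {input_text}\nResponse:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        do_sample=True,          # sample (rather than greedy decode) to get diverse responses
        temperature=1.0,
        max_new_tokens=256,
        num_return_sequences=n,  # five responses per instance
    )
    prompt_len = inputs["input_ids"].shape[1]
    return [tokenizer.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs]
```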
- Convert the json file to a csv file using the conversion code here:
python utils/convert_data_to_csv_ratings.py --input data/alpaca_generation/sample_generation.json --output data/alpaca_generation/ratings_sample_generation.csv
- Run llm_feedback_ratings.py using the following sample command:
OPENAI_API_KEY=<Your OAI API KEY> python llm_feedback_ratings.py --input_csv ../data/alpaca_generation/ratings_sample_generation.csv --save_feedback_csv ../data/alpaca_generation/feedback_ratings_sample_generation.csv
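Under the hood, llm_feedback_ratings.py queries the OpenAI API for a score per response. The sketch below shows the general pattern only; the prompt wording, model name, rating scale, and output column names are assumptions, not the repo's exact choices:

```python
import pandas as pd
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rate_response(instruction, input_text, response):
    # Illustrative rating prompt; the repo defines its own prompt and scale.
    prompt = (
        "Rate the quality of the response on a scale of 1 to 7.\n"
        f"Instruction: {instruction}\nInput: {input_text}\nResponse: {response}\nRating:"
    )
    out = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=5,
        temperature=0,
    )
    return out.choices[0].message.content.strip()

df = pd.read_csv("ratings_sample_generation.csv")
for i in range(1, 6):
    df[f"rating{i}"] = [
        rate_response(row["instructions"], row.get("input", ""), row[f"response{i}"])
        for _, row in df.iterrows()
    ]
df.to_csv("feedback_ratings_sample_generation.csv", index=False)
```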
- Convert the json file to a csv file using the conversion code here:
python utils/convert_data_to_csv_rankings.py --input data/alpaca_generation/sample_generation.json --output data/alpaca_generation/rankings_sample_generation.csv
- Run llm_feedback_rankings.py using the following sample command:
OPENAI_API_KEY=<Your OAI API KEY> python scripts/llm_feedback_rankings.py --input_csv data/alpaca_generation/rankings_sample_generation.csv --save_feedback_csv data/alpaca_generation/feedback_rankings_sample_generation.csv
Run consistency.py with the following command:
python consistency.py --ratings_csv <ratings csv w/ AI feedback> --rankings_csv <rankings csv w/ AI feedback>
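Conceptually, the consistency check asks, for each pair of responses, whether the ratings feedback and the rankings feedback agree on which response is better. A minimal sketch of that comparison (the tie handling and data layout here are illustrative, not the script's exact logic):

```python
def pair_is_consistent(rating_a, rating_b, ranked_preference):
    """ranked_preference: 'A' if the rankings feedback prefers response A, else 'B'.
    Returns None on rating ties (the paper's exact tie handling may differ)."""
    if rating_a == rating_b:
        return None
    rated_preference = "A" if rating_a > rating_b else "B"
    return rated_preference == ranked_preference

# Toy usage over hypothetical (rating_a, rating_b, ranked_preference) triples.
pairs = [(6, 4, "A"), (3, 5, "A"), (5, 5, "B")]
results = [pair_is_consistent(*p) for p in pairs]
kept = [r for r in results if r is not None]
print(f"consistency: {sum(kept) / len(kept):.2f}")  # 0.50 on this toy data
```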
We use Alpaca-7B as the backbone for the ratings and rankings reward models. Please set up the Alpaca checkpoints before moving forward. We use a single A6000 GPU to train our reward models. Since we use the HF Trainer, it should be straightforward to extend the code to multiple GPUs.
Some of the code is adapted from Alpaca-LoRA; thanks to its developers!
- Split your ratings feedback data into train.csv and val.csv.
- Set up a wandb dashboard for logging purposes.
- Sample command to get started with the reward model:
CUDA_VISIBLE_DEVICES=4 python train.py --output_dir <dir> --train_input_csv <train.csv> --test_input_csv <val.csv> --model_name <path to alpaca 7b>
- You should be able to see the trained checkpoints being stored in your output_dir after every epoch.
Our pretrained checkpoint is present 🤗 here. Please note that you need to make a few changes to the code to load this checkpoint, as mentioned here.
- First, we convert the rankings feedback data into a suitable format for reward model training. Run convert_data_for_ranking_rm.py:
python convert_data_for_ranking_rm.py --input ../data/alpaca_generation/feedback_rankings_sample_generation.csv --output ../data/alpaca_generation/feedback_rankings_for_rm_sample.json
- The output of the above code would be:
{
  'i_1': {'sentences': [[a1, a2], [a3, a4], ...]},
  ...
  'i_n': {'sentences': [[b1, b2], [b3, b4], ...]}
}
where [a1, a2] and [a3, a4] are pairs of responses for the instruction i_1 such that a1 is preferred over a2 and a3 is preferred over a4. A sketch of the standard pairwise reward-model loss for data in this form is given below.
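For data in this paired format, the common pairwise (Bradley-Terry style) reward-model objective pushes the reward of the preferred response above that of the dispreferred one. This is a minimal PyTorch sketch of that formulation, not necessarily the exact objective implemented in train.py:

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(reward_preferred, reward_dispreferred):
    # reward_*: tensors of scalar rewards, shape (batch,)
    # loss = -log sigmoid(r(preferred) - r(dispreferred))
    return -F.logsigmoid(reward_preferred - reward_dispreferred).mean()

# Toy usage: for the pair [a1, a2] above, a1 is preferred over a2,
# so training drives the model to score a1 higher than a2.
r_a1 = torch.tensor([1.3])
r_a2 = torch.tensor([0.2])
print(pairwise_rm_loss(r_a1, r_a2))
```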
- Split the created json into train.json and val.json.
- Sample command to launch the training:
CUDA_VISIBLE_DEVICES=4 python train.py --per_device_train_batch_size 1 --output_dir <output_dir> --train_input <train.json> --test_input <val.json> --model_name <path to alpaca 7b> --learning_rate 1e-4 --run_name test --gradient_accumulation_steps 4
- You should be able to see the trained checkpoints being stored in your output_dir after every epoch.
- Our pretrained checkpoint is present 🤗 here. Please note that you need to make a few changes to the code to load this checkpoint, as mentioned here.
- We utilize AlpacaEval (thanks to the developers!) for the outputs of the reference model text-davinci-003 on the unseen instructions data.
- First, generate 64 responses for the 553 unseen instructions from the Alpaca-7B model. Run this command to do so:
python inference/eval_inference_alpaca.py --device_id 3 --input_data data/inference_model_outputs/davinci003.json --save_generations data/inference_model_outputs/alpaca7b.json --model_path <path to alpaca-7b>
This code should generate a file that looks like alpaca.json.
- We will re-rank the 64 responses from the model using the reward models trained in the previous step. Run this command to do so:
python inference/reranking.py --device_id 3 --input_data data/inference_model_outputs/alpaca7b_temp1_64.json --save_generations data/inference_model_outputs/ratings_alpaca.json --reward_model_path <path to checkpoint-xxx> --alpaca_model_path <path to alpaca 7B> --reward_model_name ratings_alpaca
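The re-ranking step is best-of-n selection: score each of the 64 sampled responses with the trained reward model and keep the highest-scoring one per instruction. A minimal sketch follows; the score_fn callable stands in for a forward pass through the reward model, and the JSON field names are assumptions rather than the repo's schema:

```python
import json

def rerank(score_fn, generations_path, output_path):
    """score_fn(instruction, response) -> scalar reward from the trained reward model."""
    with open(generations_path) as f:
        data = json.load(f)  # assumed: one entry per instruction with a list of 64 responses
    for entry in data:
        responses = entry["responses"]  # assumed field name
        scores = [score_fn(entry["instruction"], r) for r in responses]
        best = max(range(len(responses)), key=lambda i: scores[i])
        entry["output"] = responses[best]  # keep only the best-of-64 response
    with open(output_path, "w") as f:
        json.dump(data, f, indent=2)
```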
- Create csv files from the json of responses using the following commands:
python utils/convert_data_for_ratings_eval.py --input data/inference_model_outputs/davinci003.json --output data/inference_model_outputs/davinci_for_llm_eval.csv
python utils/convert_data_for_ratings_eval.py --input data/inference_model_outputs/ratings_alpaca.json --output data/inference_model_outputs/ratings_for_llm_eval.csv
- Run the following commands to rate the outputs using the LLM:
OPENAI_API_KEY=<YOUR OPENAI KEY> python scripts/llm_feedback_ratings.py --input_csv data/inference_model_outputs/davinci_for_llm_eval.csv --save_feedback_csv data/inference_model_outputs/feedback_ratings_davinci_for_llm_eval.csv
OPENAI_API_KEY=<YOUR OPENAI KEY> python scripts/llm_feedback_ratings.py --input_csv data/inference_model_outputs/ratings_for_llm_eval.csv --save_feedback_csv data/inference_model_outputs/feedback_ratings_ratings_for_llm_eval.csv
- Run the following command to get the win-rate for the aligned LLM against the reference model:
python scripts/calc_win_rate_from_ratings_eval.py --input1 data/inference_model_outputs/feedback_ratings_davinci_for_llm_eval.csv --input2 data/inference_model_outputs/feedback_ratings_ratings_for_llm_eval.csv
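The win rate here is simply the fraction of instructions on which the aligned model's response receives a higher rating than text-davinci-003's. A minimal sketch of that computation; the "rating" column name and the half-credit tie convention are placeholders, not necessarily what the script does:

```python
import pandas as pd

davinci = pd.read_csv("feedback_ratings_davinci_for_llm_eval.csv")
aligned = pd.read_csv("feedback_ratings_ratings_for_llm_eval.csv")

wins = ties = 0
for d, a in zip(davinci["rating"], aligned["rating"]):  # "rating" is a placeholder column name
    if a > d:
        wins += 1
    elif a == d:
        ties += 1

# Count ties as half a win (a common convention; the repo's script may differ).
win_rate = (wins + 0.5 * ties) / len(davinci)
print(f"win rate vs. text-davinci-003: {win_rate:.3f}")
```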
- Create a single csv file from the json files containing the model outputs -- reference and aligned LLM.
python utils/convert_data_for_rankings_eval.py --davinci_file data/inference_model_outputs/davinci003.json --alpaca_file data/inference_model_outputs/ratings_alpaca.json --output data/inference_model_outputs/rankings_for_llm_eval.csv
- Run the following command to get the pairwise ranking judgments from the LLM:
OPENAI_API_KEY=<YOUR OPENAI API KEY> python scripts/llm_feedback_rankings.py --input_csv data/inference_model_outputs/rankings_for_llm_eval.csv --save_feedback_csv data/inference_model_outputs/feedback_rankings_davincivsratings_for_llm_eval.csv
- Run the following command to get the win-rate for the aligned LLM against the reference model:
python scripts/calc_win_rate_from_rankings_eval.py --input data/inference_model_outputs/feedback_rankings_davincivsratings_for_llm_eval.csv
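With pairwise judgments, the win rate is just the fraction of comparisons that the aligned model wins. A short sketch under the assumption of a hypothetical "preferred" column marking which of the two responses the LLM chose:

```python
import pandas as pd

fb = pd.read_csv("feedback_rankings_davincivsratings_for_llm_eval.csv")
# Hypothetical convention: "preferred" == "aligned" when the aligned LLM's response wins.
win_rate = (fb["preferred"] == "aligned").mean()
print(f"win rate vs. text-davinci-003: {win_rate:.3f}")
```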
TODO