This is the official codebase for Contrastive Preference Learning: Learning From Human Feedback without RL by Joey Hejna, Rafael Rafailov*, Harshit Sikchi*, Chelsea Finn, Scott Niekum, W. Bradley Knox, and Dorsa Sadigh.
Below we include instructions for reproducing the results found in the paper. This repository is based on a frozen version of research-lightning; for detailed information about how to use it, refer to the research-lightning repository.
If you find our paper or code insightful, feel free to cite us with the following bibtex:
```
@InProceedings{hejna23contrastive,
  title = {Contrastive Preference Learning: Learning From Human Feedback without RL},
  author = {Hejna, Joey and Rafailov, Rafael and Sikchi, Harshit and Finn, Chelsea and Niekum, Scott and Knox, W. Bradley and Sadigh, Dorsa},
  booktitle = {ArXiv preprint},
  year = {2023},
  url = {https://arxiv.org/abs/2310.13639}
}
```
Complete the following steps:

- Clone the repository to your desired location using `git clone https://github.com/jhejna/cpl`.
- Create the conda environment using `conda env create -f environment_<cpu or gpu>.yaml`. Note that the correct MetaWorld version must be used.
- Install the repository's research package via `pip install -e .`.
- Modify the `setup_shell.sh` script by updating the appropriate values as needed. The `setup_shell.sh` script should load the environment, move the shell to the repository directory, and additionally set up any external dependencies. All the required flags should be at the top of the file. This is necessary for support with the SLURM launcher, which we used to run experiments.
- Download the MetaWorld datasets here. Extract the files into a `datasets` folder in the repository root. This should match the paths in the config files.
When using the repository, you should be able to set up the environment by running `. path/to/setup_shell.sh`.
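Put together, a first-time setup might look like the sketch below. The GPU environment file and the conda environment name `cpl` are assumptions here; use whatever your machine and the environment yaml actually specify.

```bash
git clone https://github.com/jhejna/cpl
cd cpl
conda env create -f environment_gpu.yaml   # or environment_cpu.yaml on a CPU-only machine
conda activate cpl                         # hypothetical env name; check the yaml for the real one
pip install -e .
# Edit the values at the top of setup_shell.sh for your machine, then activate everything with:
. setup_shell.sh
# Finally, download the MetaWorld datasets (link above) and extract them into ./datasets
```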
To train a model, simply run `python scripts/train.py --config path/to/config --path path/to/save/folder` after activating the environment.
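For example, a concrete (hypothetical) invocation might look like the following; the config path and output folder are placeholders for whichever experiment config in this repository you want to run.

```bash
# Placeholder paths: substitute a config shipped with this repository and any output directory.
python scripts/train.py --config configs/some_experiment.yaml --path outputs/some_experiment
```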
Multiple experiments can be run at once using a `.json` sweep file. To run a sweep, first create a sweep file, then launch it with either `tools/run_slurm.py` or `tools/run_local.py`. Specify the config and output directory with `--arguments config=path/to/config path=path/to/save/folder`. For example sweep files, check out the Inverse Preference Learning repository.
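As a rough sketch (assuming the `.json` sweep file is passed as the config, with paths as placeholders and any launcher-specific options such as your SLURM settings omitted), launching a sweep looks like:

```bash
# Run the sweep on the local machine.
python tools/run_local.py --arguments config=path/to/sweep.json path=path/to/save/folder

# Or submit it through SLURM (SLURM-specific flags omitted; see research-lightning).
python tools/run_slurm.py --arguments config=path/to/sweep.json path=path/to/save/folder
```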
To test Contrastive Preference Learning (CPL) on real human data, we use a fork of the Direct Preference-based Policy Optimization without Reward Modeling (DPPO) codebase; the implementation of these experiments can be found in that fork.
To move from DPPO to CPL, we simply change the loss function and switch to a probabilistic policy, leaving all hyperparameters of the preference model unchanged.
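To make the change concrete, the sketch below shows roughly what the CPL objective looks like in PyTorch. This is not the code from either repository: the function name, arguments, and the default values for the temperature `alpha` and conservative weight `lam` are illustrative, and it assumes the policy's per-step log-probabilities along the preferred and dis-preferred segments have already been computed (and discounted, if desired).

```python
import torch
import torch.nn.functional as F


def cpl_loss(logp_pos, logp_neg, alpha=0.1, lam=0.5):
    """Illustrative CPL objective for a batch of preference pairs.

    logp_pos, logp_neg: (batch, segment_len) log pi(a_t | s_t) along the
        preferred and dis-preferred segments.
    alpha: temperature scaling the log-probabilities (advantage ~ alpha * log pi).
    lam: conservative bias in (0, 1] that down-weights the dis-preferred segment.
    """
    # Segment scores are (discounted) sums of scaled log-probabilities.
    score_pos = alpha * logp_pos.sum(dim=-1)
    score_neg = alpha * logp_neg.sum(dim=-1)
    # Contrastive objective: the preferred segment should win the two-way softmax.
    logits = torch.stack([score_pos, lam * score_neg], dim=-1)
    labels = torch.zeros(logits.shape[0], dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```

Cross-entropy with the preferred segment as the target class is just the negative log-softmax over the two segment scores, i.e. the conservative (CPL-lambda) form of the loss.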
This code is released under the MIT License; see the LICENSE file.