
Contrastive Preference Learning: Learning from Human Feedback without RL

This is the official codebase for Contrastive Preference Learning: Learning From Human Feedback without RL by Joey Hejna, Rafael Rafailov*, Harshit Sikchi*, Chelsea Finn, Scott Niekum, W. Bradley Knox, and Dorsa Sadigh.

Below we include instructions for reproducing the results in the paper. This repository is based on a frozen version of research-lightning; for detailed information about how to use the codebase, refer to that repository.

If you find our paper or code insightful, feel free to cite us with the following bibtex:

@InProceedings{hejna23contrastive,
  title = {Contrastive Preference Learning: Learning From Human Feedback without RL},
  author = {Hejna, Joey and Rafailov, Rafael and Sikchi, Harshit and Finn, Chelsea and Niekum, Scott and Knox, W. Bradley and Sadigh, Dorsa},
  booktitle = {ArXiv preprint},
  year = {2023},
  url = {https://arxiv.org/abs/2310.13639}
}

Installation

Complete the following steps:

  1. Clone the repository to your desired location using git clone https://github.com/jhejna/cpl.
  2. Create the conda environment using conda env create -f environment_<cpu or gpu>.yaml. Note that the correct MetaWorld version must be used.
  3. Install the repository's research package by running pip install -e . from the repository root.
  4. Modify the setup_shell.sh script by updating the values at the top of the file as needed. When sourced, setup_shell.sh should load the environment, move the shell to the repository directory, and set up any external dependencies. This is necessary for support with the SLURM launcher, which we used to run experiments.
  5. Download the MetaWorld datasets here. Extract the files into a datasets folder in the repository root so that the paths match those expected by the config files.

When using the repository, you should be able to set up the environment by running . path/to/setup_shell.sh.

Usage

To train a model, simply run python scripts/train.py --config path/to/config --path path/to/save/folder after activating the environment.

Multiple experiments can be run at once using a .json sweep file. To run a sweep, first create a sweep file, then launch it with either tools/run_slurm.py or tools/run_local.py, passing the training config and save folder through --arguments config=path/to/config path=path/to/save/folder. For example sweep files, check out the Inverse Preference Learning repository.

D4RL Results with Human Feedback

To test Contrastive Preference Learning (CPL) on real human data, we use a fork of the Direct Preference-based Policy Optimization without Reward Modeling (DPPO) codebase. The implementation for these experiments can be found in this repository.

To move from DPPO to CPL, we only change the loss function and switch to a probabilistic policy, leaving all hyperparameters of the preference model the same.
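To make the change concrete, below is a minimal PyTorch-style sketch of the CPL objective: each segment is scored by the discounted sum of the policy's alpha-scaled log-probabilities (standing in for the maximum-entropy advantage), and a Bradley-Terry style contrastive loss is applied between the preferred and non-preferred segments. The function name, argument names, default values, and the exact placement of the conservative bias are illustrative assumptions; refer to the paper and the code in this repository for the precise formulation and hyperparameters used in the experiments.

import torch
import torch.nn.functional as F

def cpl_loss(logp_pos, logp_neg, alpha=0.1, gamma=1.0, bias=0.5):
    # Illustrative sketch of the CPL objective; names and defaults are assumptions.
    # logp_pos, logp_neg: (batch, horizon) per-step log pi(a_t | s_t) evaluated by
    # the learned stochastic policy on the preferred / non-preferred segments.
    horizon = logp_pos.shape[1]
    discount = gamma ** torch.arange(horizon, device=logp_pos.device, dtype=logp_pos.dtype)

    # Segment scores: discounted sums of alpha-scaled log-probabilities, which
    # stand in for the max-ent advantage A*(s, a) = alpha * log pi(a | s).
    score_pos = alpha * (discount * logp_pos).sum(dim=-1)
    # The conservative bias (lambda in the paper) scales the non-preferred score;
    # its placement here is an assumption, so check the released code.
    score_neg = bias * alpha * (discount * logp_neg).sum(dim=-1)

    # Negative log-likelihood of the Bradley-Terry preference model:
    # -log( exp(score_pos) / (exp(score_pos) + exp(score_neg)) )
    logits = torch.stack([score_pos, score_neg], dim=-1)
    labels = torch.zeros(logits.shape[0], dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)

The per-step log-probabilities come from evaluating the learned probabilistic policy on the state-action pairs of each labeled segment in the preference dataset; no reward model or RL step is involved, which is the point of CPL.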

License

This code is released under the MIT License, found in the LICENSE file.