Uni-RLHF-Platform

Uni-RLHF platform for "Uni-RLHF: Universal Platform and Benchmark Suite for Reinforcement Learning with Diverse Human Feedback" (ICLR2024)


Project Website · Paper · Datasets · Clean Offline RLHF

This is the Uni-RLHF platform implementation of the paper Uni-RLHF: Universal Platform and Benchmark Suite for Reinforcement Learning with Diverse Human Feedback by Yifu Yuan, Jianye Hao, Yi Ma, Zibin Dong, Hebin Liang, Jinyi Liu, Zhixin Feng, Kai Zhao, Yan Zheng. Uni-RLHF aims to provide a complete workflow built on real human feedback, fostering progress in RLHF for the decision-making domain. We develop a user-friendly annotation interface tailored to various feedback types and compatible with a wide range of mainstream RL environments. We then establish a systematic pipeline of crowdsourced annotations, resulting in a large-scale annotated dataset (≈15 million steps). We also provide offline RLHF baselines that use the collected feedback datasets and various design choices in Clean Offline RLHF.



Table of Contents
  1. Getting Started
  2. Usage
  3. Roadmap
  4. Contributing
  5. License
  6. Contact
  7. Citation

🛠️ Getting Started

The Uni-RLHF platform consists of a Vue front end and a Flask back end. It also supports a wide range of mainstream RL environments for annotation.

Installation

Platform

  1. Clone the repo
    git clone https://github.com/TJU-DRL-LAB/Uni-RLHF.git
    cd Uni-RLHF
  2. Create a conda environment and install the Python dependencies
    conda create -n rlhf python==3.9
    conda activate rlhf
    pip install -r requirements.txt
  3. Install NPM packages
    npm install --prefix ./uni_rlhf/vue_part
  4. Configure a MySQL Database
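Step 4 does not show what the database setting looks like. As a minimal sketch, assuming the `SQLALCHEMY_DATABASE` value in `uni_rlhf/config.py` is a standard SQLAlchemy connection URL and that the PyMySQL driver is used (both are assumptions, as are the placeholder credentials and the helper name `build_mysql_url`), such a URL could be assembled like this:

```python
from urllib.parse import quote_plus


def build_mysql_url(user: str, password: str, host: str, port: int, database: str) -> str:
    """Return a SQLAlchemy-style MySQL URL (hypothetical helper, not part of Uni-RLHF)."""
    # Percent-encode the credentials so characters such as '@' or '/' do not break the URL.
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Placeholder values (assumptions); replace them with your own MySQL settings.
SQLALCHEMY_DATABASE = build_mysql_url("rlhf_user", "p@ss/word", "localhost", 3306, "uni_rlhf")
print(SQLALCHEMY_DATABASE)
```

Check `uni_rlhf/config.py` for the exact form the project expects before copying the result in.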

Datasets

Uni-RLHF supports the following classic datasets; a full list of all tasks is available here. Uni-RLHF also supports uploading custom datasets, as long as the dataset contains the observations and terminals keys.

  • Install D4RL dependencies. Note that we made some small changes to the camera view for better visualisations.

    cd d4rl
    pip install -e .
  • Install Atari dependencies.

    pip install git+https://github.com/takuseno/d4rl-atari
  • Install V-D4RL dependencies. Note that V-D4RL provides image datasets; the full datasets can be found on GoogleDrive and must be downloaded before running the code. The expected file structure is:

     uni_rlhf
     ├── datasets
     │   └── dataset_resource
     │       ├── vd4rl
     │       │   ├── cheetah
     │       │   │   ├── cheetah_run_medium
     │       │   │   └── cheetah_run_medium_expert
     │       │   ├── humanoid
     │       │   │   ├── humanoid_walk_medium
     │       │   │   └── humanoid_walk_medium_expert
     │       │   └── walker
     │       │       ├── walker_walk_medium
     │       │       └── walker_walk_medium_expert
     │       └── smarts
     │           ├── cruise
     │           ├── curin
     │           └── left_c
     ├── vue_part
     │   └── ...
     └── controllers
         └── ...
  • Install MiniGrid dependencies. These are the same as the dependencies for the D4RL datasets.

  • Install SMARTS dependencies. We used online reinforcement learning algorithms to train two agents for dataset collection, each trained specifically for its scenario. The first agent shows medium driving proficiency, with a success rate between 40% and 80% in its scenario. The second agent performs at expert level, with a success rate of 95% or higher in the same scenario. For dataset construction, we collected 800 driving trajectories with the medium agent and 200 with the expert agent, then combined them into a mixed dataset of 1,000 driving trajectories. The full datasets, containing images (for rendering) and vectors (for training), are available on GoogleDrive and must be downloaded before running the code. The expected file structure is the same as for the V-D4RL datasets.

  • Upload custom datasets. Custom datasets must be in HDF5 format and contain the observations and terminals keys:

    observations: An N by observation dimensional array of observations.
    terminals: An N dimensional array of episode termination flags. 
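Before uploading, it can help to check a custom dataset against the two requirements above. The following is a minimal sketch that works on any mapping of array-like values; the function names `validate_dataset` and `split_episodes` are hypothetical and not part of the platform, and the toy data is made up for illustration.

```python
from typing import List, Mapping, Sequence

REQUIRED_KEYS = ("observations", "terminals")


def validate_dataset(data: Mapping[str, Sequence]) -> int:
    """Check the required keys and matching lengths; return the number of steps N."""
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise KeyError(f"dataset is missing required keys: {missing}")
    n_obs, n_term = len(data["observations"]), len(data["terminals"])
    if n_obs != n_term:
        raise ValueError(f"observations has {n_obs} rows but terminals has {n_term}")
    return n_obs


def split_episodes(data: Mapping[str, Sequence]) -> List[range]:
    """Return one index range per episode, each ending where a terminal flag is set."""
    n = validate_dataset(data)
    episodes, start = [], 0
    for i in range(n):
        if data["terminals"][i]:
            episodes.append(range(start, i + 1))
            start = i + 1
    if start < n:  # trailing steps that never reached a terminal flag
        episodes.append(range(start, n))
    return episodes


# Toy dataset: five steps of 2-dimensional observations forming two episodes.
toy = {
    "observations": [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5], [0.6, 0.7], [0.8, 0.9]],
    "terminals": [False, False, True, False, True],
}
episodes = split_episodes(toy)
print([list(ep) for ep in episodes])  # [[0, 1, 2], [3, 4]]
```

The same checks should apply to arrays read from an HDF5 file, since they only rely on `len` and integer indexing.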

(back to top)

Setup

To run the platform, set SQLALCHEMY_DATABASE in uni_rlhf/config.py, then start the app with:

python run.py

The app is then available at:

http://localhost:5001

To kill all related processes, run:

python scripts/kill_process.py

💻 Usage

Overview



  • Pipelines and tasks tailored specifically to reinforcement learning and decision-making problems.
  • A clean pipeline designed for coordination between employers and annotators.
  • Supports multi-user synchronised labeling and export without conflicts.
  • Supports a large number of mainstream decision-making datasets and makes it easy to customize and upload your own.
  • Supports several mainstream feedback types for decision-making problems and provides configurable label formats that let you combine them into new ways of giving feedback.

Supported Tasks

We support several built-in environments and datasets. See config for the expected name formatting of all domains and tasks.

Supported Feedbacks Format



We support five common feedback types and propose a standardized feedback encoding format that specifies how annotators interact with each type and how the resulting feedback is encoded. The Uni-RLHF paper also briefly outlines potential forms and applications of reinforcement learning that integrate various forms of human feedback.
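To make the idea of a feedback encoding concrete, here is a minimal sketch for pairwise comparison feedback between two trajectory segments. The label values `(1, 0)`, `(0, 1)`, and `(0.5, 0.5)` follow a convention common in preference-based RL; the class, field, and choice names are hypothetical, and the platform's actual export schema may differ.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed encoding: preference mass assigned to (segment A, segment B).
LABELS = {
    "left": (1.0, 0.0),   # annotator prefers segment A
    "right": (0.0, 1.0),  # annotator prefers segment B
    "equal": (0.5, 0.5),  # annotator judges both segments equally good
}


@dataclass(frozen=True)
class ComparativeFeedback:
    """One annotated pair of segments (hypothetical record layout)."""
    segment_a: Tuple[int, int]  # (start, end) step indices of the first segment
    segment_b: Tuple[int, int]  # (start, end) step indices of the second segment
    label: Tuple[float, float]


def encode_choice(segment_a: Tuple[int, int], segment_b: Tuple[int, int],
                  choice: str) -> ComparativeFeedback:
    """Turn an annotator's choice into an encoded feedback record."""
    if choice not in LABELS:
        raise ValueError(f"unknown choice {choice!r}; expected one of {sorted(LABELS)}")
    return ComparativeFeedback(segment_a, segment_b, LABELS[choice])


fb = encode_choice((0, 25), (40, 65), "left")
print(fb.label)  # (1.0, 0.0)
```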

Offline RLHF Datasets and Benchmark

Using Uni-RLHF, we establish a systematic pipeline of crowdsourced annotations, resulting in an open-source, reusable, large-scale annotated dataset (≈15 million steps). We then train offline RL baselines on the collected feedback datasets; see the offline RLHF baselines in the sister repository. We hope to build valuable open-source platforms, datasets, and baselines that facilitate the development of more robust and reliable RLHF solutions for decision making based on realistic human feedback.
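One common way to use pairwise feedback in offline RLHF is to fit a reward model with a Bradley-Terry preference loss. The sketch below shows only that loss for a single labeled pair, in plain Python; it is an illustration under that assumption, not a description of the baselines in the sister repository, and the function name is hypothetical.

```python
import math
from typing import Sequence, Tuple


def bradley_terry_loss(rewards_a: Sequence[float], rewards_b: Sequence[float],
                       label: Tuple[float, float]) -> float:
    """Cross-entropy between a soft preference label and the Bradley-Terry probability.

    P(A preferred) = sigmoid(sum(rewards_a) - sum(rewards_b)).
    """
    diff = sum(rewards_a) - sum(rewards_b)
    # Numerically stable log-sigmoid of diff.
    if diff >= 0:
        log_p_a = -math.log1p(math.exp(-diff))
    else:
        log_p_a = diff - math.log1p(math.exp(diff))
    log_p_b = log_p_a - diff  # log sigmoid(-diff)
    return -(label[0] * log_p_a + label[1] * log_p_b)


# Predicted per-step rewards for two segments (made-up numbers).
seg_a, seg_b = [0.5, 0.7, 0.9], [0.1, 0.2, 0.3]
loss_agree = bradley_terry_loss(seg_a, seg_b, (1.0, 0.0))     # label agrees with the model
loss_disagree = bradley_terry_loss(seg_a, seg_b, (0.0, 1.0))  # label contradicts the model
print(loss_agree < loss_disagree)  # True
```

Minimizing this loss over many labeled pairs pushes the reward model to assign higher total reward to the segments annotators preferred.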

For more examples, please refer to the Documentation.

(back to top)

🧭 Roadmap

  • Support an automated reward model training process
  • Fix the online training bug
  • Adapt the sampler to the new code framework

See the open issues for a full list of proposed features (and known issues).

(back to top)

🙏 Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

(back to top)

🏷️ License

Distributed under the MIT License. See LICENSE.txt for more information.

(back to top)

✉️ Contact

For any questions, please feel free to email yuanyf@tju.edu.cn.

(back to top)

📝 Citation

If you find our work useful, please consider citing:

@inproceedings{anonymous2023unirlhf,
    title={Uni-{RLHF}: Universal Platform and Benchmark Suite for Reinforcement Learning with Diverse Human Feedback},
    author={Yuan, Yifu and Hao, Jianye and Ma, Yi and Dong, Zibin and Liang, Hebin and Liu, Jinyi and Feng, Zhixin and Zhao, Kai and Zheng, Yan},
    booktitle={The Twelfth International Conference on Learning Representations, ICLR},
    year={2024},
    url={https://openreview.net/forum?id=WesY0H9ghM},
}

(back to top)