
High-quality single-file implementation of Deep Reinforcement Learning algorithms with research-friendly features


CleanRL (Clean Implementation of RL Algorithms)


CleanRL aims to be the most user-friendly Reinforcement Learning library. The implementations are clean, simple, and self-contained; you don't have to dig through dozens of files to understand what is going on. Just read the code, print out a few things, and customize it easily.

At the same time, CleanRL provides many research-friendly features such as cloud experiment management, support for continuous and discrete observation and action spaces, video recording of gameplay, etc. These features are very helpful for research, especially the video recording feature, which lets you visually inspect the agents' behavior at various stages of training.

Good luck have fun 🚀

This project is a work in progress, currently at the 0.2.1 release; expect breaking changes.

The highlight features of this repo are:

  • Our implementation is self-contained in a single file. Everything about an algorithm is right there! Easy to understand and do research with.
  • Easy logging of training runs using TensorBoard, plus integration with wandb.com to log experiments in the cloud. Check out https://cleanrl.costa.sh.
  • Hackable and easy to debug directly in Python's interactive shell (especially if you use the Spyder editor from Anaconda :) ).
  • Simple use of command-line arguments for hyper-parameter tuning; no need for arcane configuration files (see the sketch below this list).
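To illustrate the last point: each script defines its hyper-parameters with plain argparse, so every setting is a command-line flag. The sketch below mirrors the flags used in the examples later in this README (--seed, --gym-id, --total-timesteps); the exact set of flags differs per algorithm, so treat it as illustrative rather than a copy of ppo.py.

import argparse

parser = argparse.ArgumentParser(description="illustrative hyper-parameter flags")
parser.add_argument("--seed", type=int, default=1, help="random seed of the experiment")
parser.add_argument("--gym-id", type=str, default="CartPole-v0", help="gym environment id")
parser.add_argument("--total-timesteps", type=int, default=50000, help="total environment steps")
parser.add_argument("--learning-rate", type=float, default=2.5e-4, help="optimizer learning rate")
args = parser.parse_args()

# every hyper-parameter of the run is then just vars(args),
# which is what gets logged to TensorBoard / wandb
print(vars(args))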

Benchmarked Implementation

Our implementation is benchmarked to ensure quality. We log all of our benchmarked experiments using wandb so that you can check the hyper-parameters, videos of the agents playing the game, and the exact commands to reproduce them. See https://cleanrl.costa.sh.

The current wandb dashboard does not allow us to show the agents' performance in all the games in the same panel, so you have to click through each panel at https://cleanrl.costa.sh to check the benchmarked performance, which can be inconvenient. So we additionally post the benchmarked performance for each game using seaborn, as shown below (the plots are created with benchmark/plot_benchmark.py).
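As an aside, a seaborn learning-curve plot of this kind can be produced roughly as follows; benchmark/plot_benchmark.py is the authoritative script, and the file name and column names below (global_step, episode_reward) are assumptions for illustration only.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# hypothetical export of the logged metrics for one game, several seeds
df = pd.read_csv("CartPole-v0_runs.csv")

# seaborn aggregates across seeds: the line is the mean, the band is the spread
sns.lineplot(data=df, x="global_step", y="episode_reward")
plt.title("CartPole-v0")
plt.xlabel("global step")
plt.ylabel("episode reward")
plt.savefig("CartPole-v0_benchmark.png")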

Benchmarked Learning Curves: Atari
Metrics, logs, and recorded videos are at cleanrl.benchmark/reports/Atari

Benchmarked Learning Curves: Mujoco
Metrics, logs, and recorded videos are at cleanrl.benchmark/reports/Mujoco

Benchmarked Learning Curves: PyBullet
Metrics, logs, and recorded videos are at cleanrl.benchmark/reports/PyBullet-and-Other-Continuous-Action-Tasks

Benchmarked Learning Curves: Classic Control
Metrics, logs, and recorded videos are at cleanrl.benchmark/reports/Classic-Control

Benchmarked Learning Curves: Others (Experimental Domains)
Metrics, logs, and recorded videos are at cleanrl.benchmark/reports/Others
This is a rather challenging continuous action task that usually requires 100M+ timesteps to solve.
This is a self-play environment from https://github.com/hardmaru/slimevolleygym, so its episode reward should not steadily increase. Check out the videos for the agent's actual performance (i.e. see cleanrl.benchmark/reports/Others).
This is a MicroRTS environment in which the goal is to build as many combat units as possible; see https://github.com/vwxyzjn/gym-microrts. These runs are created by https://github.com/vwxyzjn/gym-microrts/blob/master/experiments/ppo.py, which additionally implements invalid action masking and handling of the multi-discrete action space for PPO (a rough sketch of the masking idea is shown below).
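For the curious, invalid action masking boils down to pushing the logits of invalid actions to a very large negative value before sampling, so their probability becomes effectively zero. Here is a minimal PyTorch sketch of the idea (variable names are illustrative, not taken from that script):

import torch
from torch.distributions import Categorical

logits = torch.randn(1, 6)                                           # policy logits for 6 discrete actions
action_mask = torch.tensor([[1, 1, 0, 1, 0, 1]], dtype=torch.bool)   # True = valid action

# invalid actions get a huge negative logit, i.e. (near) zero probability
masked_logits = torch.where(action_mask, logits, torch.tensor(-1e8))
dist = Categorical(logits=masked_logits)
action = dist.sample()                                               # never picks a masked action
log_prob = dist.log_prob(action)                                     # used in the PPO objective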

Get started

To run experiments locally, give the following a try:

$ git clone https://github.com/vwxyzjn/cleanrl.git && cd cleanrl
$ pip install -e .
$ cd cleanrl
$ python ppo.py \
    --seed 1 \
    --gym-id CartPole-v0 \
    --total-timesteps 50000
# open another terminal and enter `cd cleanrl/cleanrl`
$ tensorboard --logdir runs

demo.gif
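Inside each single-file script, the metrics that tensorboard picks up are written with a standard SummaryWriter. A minimal sketch of that pattern is below; the run name and tag names are illustrative (the tags are based on the episode_rewards and losses/policy_loss metrics mentioned later in this README, not copied verbatim from ppo.py).

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/CartPole-v0__ppo__seed1")  # hypothetical run name
global_step = 0
episode_reward, policy_loss = 21.0, 0.37                # placeholder values

writer.add_scalar("charts/episode_reward", episode_reward, global_step)
writer.add_scalar("losses/policy_loss", policy_loss, global_step)
writer.close()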

To use the wandb integration, sign up for an account at https://wandb.com and copy your API key. Then run

$ cd cleanrl
$ pip install wandb
$ wandb login ${WANDB_API_KEY}
$ python ppo.py \
    --seed 1 \
    --gym-id CartPole-v0 \
    --total-timesteps 50000 \
    --prod-mode \
    --wandb-project-name cleanrltest 
# Then go to https://app.wandb.ai/${WANDB_USERNAME}/cleanrltest/

Check out the demo site at https://app.wandb.ai/costa-huang/cleanrltest

demo2.gif
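Roughly speaking, the --prod-mode flag makes the script initialize wandb and mirror the TensorBoard logs to the cloud. The sketch below shows one way that can be wired up; the exact call in ppo.py may use different or additional options, so treat this as an assumption-laden illustration.

import wandb

wandb.init(
    project="cleanrltest",                 # matches --wandb-project-name above
    config={"seed": 1, "gym_id": "CartPole-v0", "total_timesteps": 50000},
    sync_tensorboard=True,                 # forward SummaryWriter scalars to wandb
    monitor_gym=True,                      # upload recorded gameplay videos
)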

Algorithms Implemented

Support and get involved

We have a Slack Community for support. Feel free to ask questions. Posting in GitHub Issues and PRs is also welcome.

In addition, we have a monthly development cycle to implement new RL algorithms. Feel free to participate or ask questions there, too. You can sign up for our mailing list at our Google Groups to receive the event RSVP, which contains the Hangouts video call address each week. Our past video recordings are available on YouTube.

User's Guide for Researchers (please read this if you are considering using CleanRL)

CleanRL focuses on the early and mid stages of RL research, where one tries to understand ideas and run hacky experiments with the algorithms. If your goal does not include modifying different parts of RL algorithms, a library like stable-baselines, ray, or catalyst would probably be better suited to your use case, since they are built to be highly optimized, concurrent, and fast.

CleanRL, however, is built to provide a simplified and streamlined approach to conducting RL experiments. Let's give an example. Say you are interested in implementing the GAE (Generalized Advantage Estimation) technique to see if it improves A2C's performance on CartPole-v0. The workflow roughly looks like this:

  1. Make a copy of cleanrl/cleanrl/a2c.py to cleanrl/cleanrl/experiments/a2c_gae.py
  2. Implement the GAE technique. This should be relatively simple because you don't have to navigate dozens of files to find some function named compute_advantages() (a rough sketch of the computation appears after this section).
  3. Run python cleanrl/cleanrl/experiments/a2c_gae.py in the terminal or in an interactive shell like Spyder. The latter gives you the ability to stop the program at any time and execute arbitrary code, so you can program on the fly.
  4. Open another terminal, run tensorboard --logdir cleanrl/cleanrl/experiments/runs, and check out episode_rewards, losses/policy_loss, etc. If something looks off, go back to step 2 and continue.
  5. If the technique works, you may want to see if it works with other games such as Taxi-v3 or with different parameters as well. Execute
    $ wandb login ${WANDB_API_KEY}
    $ for seed in {1..2}
        do
            (sleep 0.3 && nohup python a2c_gae.py \
            --seed $seed \
            --gym-id CartPole-v0 \
            --total-timesteps 30000 \
            --wandb-project-name myRLproject \
            --prod-mode
            ) >& /dev/null &
        done
    $ for seed in {1..2}
        do
            # use a different env (Taxi-v3) and a lower discount factor (--gamma 0.8)
            (sleep 0.3 && nohup python a2c_gae.py \
            --seed $seed \
            --gym-id Taxi-v3 \
            --total-timesteps 30000 \
            --gamma 0.8 \
            --wandb-project-name myRLproject \
            --prod-mode
            ) >& /dev/null &
        done
    
    Then you can monitor the performance and keep good track of all the parameters used in your experiments.
  6. Continue this process

The pipeline described above should give you an idea of how to use CleanRL for your research.
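For reference, step 2's GAE computation might look roughly like the following (a minimal sketch, assuming you already keep per-step rewards, value estimates, and done flags from the rollout; the variable names are illustrative, not copied from a2c.py):

import numpy as np

def compute_gae(rewards, values, dones, next_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards, values, dones are 1-D numpy arrays of equal length;
    next_value is the value estimate of the state after the last step.
    """
    advantages = np.zeros_like(rewards, dtype=np.float64)
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        # bootstrap with next_value at the end of the rollout
        next_val = next_value if t == len(rewards) - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        # TD error, then the exponentially-weighted GAE recursion
        delta = rewards[t] + gamma * next_val * next_nonterminal - values[t]
        last_gae = delta + gamma * lam * next_nonterminal * last_gae
        advantages[t] = last_gae
    returns = advantages + values  # targets for the value function
    return advantages, returns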

Feature TODOs:

References

I have been heavily inspired by many repos and blog posts. Below is an incomplete list of them.

The following ones helped me a lot with the continuous action space handling: