/CORL

High-quality single-file implementations of SOTA Offline RL algorithms: AWAC, BC, CQL, DT, EDAC, IQL, SAC-N, TD3+BC

Primary LanguagePythonApache License 2.0Apache-2.0

CORL (Clean Offline Reinforcement Learning)

Code style: black Imports: isort

🧵 CORL is an Offline Reinforcement Learning library that provides high-quality and easy-to-follow single-file implementations of SOTA ORL algorithms. Each implementation is backed by a research-friendly codebase, allowing you to run or tune thousands of experiments. Heavily inspired by cleanrl for online RL, check them out too!

  • 📜 Single-file implementation
  • 📈 Benchmarked Implementation for N algorithms
  • 🖼 Weights and Biases integration

Getting started

git clone https://github.com/tinkoff-ai/CORL.git && cd CORL
pip install -r requirements/requirements_dev.txt

# alternatively, you could use docker
docker build -t <image_name> .
docker run gpus=all -it --rm --name <container_name> <image_name>

Algorithms Implemented

Algorithm Variants Implemented Wandb Report
✅ Behavioral Cloning
(BC)
any_percent_bc.py Gym-MuJoCo, Maze2D
✅ Behavioral Cloning-10%
(BC-10%)
any_percent_bc.py Gym-MuJoCo, Maze2D
Conservative Q-Learning for Offline Reinforcement Learning
(CQL)
cql.py Gym-MuJoCo, Maze2D
Accelerating Online Reinforcement Learning with Offline Datasets
(AWAC)
awac.py Gym-MuJoCo, Maze2D
Offline Reinforcement Learning with Implicit Q-Learning
(IQL)
iql.py Gym-MuJoCo, Maze2D
A Minimalist Approach to Offline Reinforcement Learning
(TD3+BC)
td3_bc.py Gym-MuJoCo, Maze2D
Decision Transformer: Reinforcement Learning via Sequence Modeling
(DT)
dt.py Gym-MuJoCo, Maze2D
Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble
(SAC-N)
sac_n.py Gym-MuJoCo, Maze2D
Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble
(EDAC)
edac.py Gym-MuJoCo, Maze2D

D4RL Benchmarks

For learning curves and all the details, you can check the links above. Here, we report reproduced final and best scores. Note that thay differ by a big margin, and some papers may use different approaches not making it always explicit which one reporting methodology they chose.

Last Scores

Gym-MuJoCo

Task-Name BC BC-10% TD3 + BC CQL IQL AWAC SAC-N EDAC DT
halfcheetah-medium-v2 42.40±0.21 42.29±0.40 48.10±0.21 46.64±0.24 48.31±0.11 49.78±0.42 68.20±1.48 67.70±1.20 41.44±0.39
halfcheetah-medium-expert-v2 55.95±8.49 91.45±2.57 90.78±6.98 87.10±11.41 94.55±0.21 95.56±1.09 98.96±10.74 104.76±0.74 84.39±4.27
halfcheetah-medium-replay-v2 35.66±2.68 29.65±2.11 44.84±0.68 44.67±0.28 43.53±0.43 44.95±0.86 60.70±1.17 62.06±1.27 27.50±5.49
hopper-medium-v2 53.51±2.03 51.16±12.98 60.37±4.03 56.88±4.46 62.75±6.02 65.06±5.97 40.82±11.44 101.70±0.32 48.41±6.11
hopper-medium-expert-v2 52.30±4.63 105.17±7.12 101.17±10.48 86.95±17.45 106.24±6.09 105.38±7.31 101.31±13.43 105.19±11.64 83.20±26.68
hopper-medium-replay-v2 29.81±2.39 23.89±11.61 64.42±24.84 84.21±18.27 84.57±13.49 98.15±2.85 100.33±0.90 99.66±0.94 42.83±22.92
walker2d-medium-v2 63.23±18.76 58.56±4.14 82.71±5.51 80.58±3.80 84.03±5.42 69.39±31.97 87.47±0.76 93.36±1.60 69.15±6.76
walker2d-medium-expert-v2 98.96±18.45 108.45±0.30 110.03±0.41 110.23±0.48 111.68±0.56 111.65±1.74 114.93±0.48 114.75±0.86 92.64±3.35
walker2d-medium-replay-v2 21.80±11.72 41.99±17.77 85.62±4.63 82.16±2.32 82.55±8.00 80.43±3.95 78.99±0.58 87.10±3.21 16.93±19.57
locomotion average 50.40 61.40 76.45 75.49 79.80 80.04 83.52 92.92 56.28

Maze2d

Task-Name BC BC-10% TD3 + BC CQL IQL AWAC SAC-N EDAC DT
maze2d-umaze-v1 0.36±10.03 -2.98±6.68 29.41±14.22 -6.97±17.41 37.69±1.99 60.09±19.09 131.08±19.36 90.74±6.51 -14.55±0.15
maze2d-medium-v1 0.79±3.76 2.04±3.52 59.45±41.86 2.77±7.24 35.45±0.98 79.42±50.93 88.55±21.68 62.36±9.76 -0.38±7.26
maze2d-large-v1 2.26±5.07 3.14±4.77 97.10±29.34 1.29±7.11 49.64±22.02 217.44±4.93 205.13±1.33 108.17±25.02 -0.45±1.51
maze2d average 1.13 0.74 61.99 -0.97 40.92 118.98 141.59 87.09 -5.13

Antmaze

Task-Name BC BC-10% TD3 + BC CQL IQL AWAC SAC-N EDAC DT
antmaze-umaze-v0 51.50±8.81 0.00±0.00 93.25±1.50 67.00±6.24 74.50±11.03 63.50±9.33 TBD±TBD TBD±TBD 52.75±11.47
antmaze-medium-play-v0 0.00±0.00 0.00±0.00 0.00±0.00 0.00±0.00 68.00±12.77 0.00±0.00 TBD±TBD TBD±TBD 0.00±0.00
antmaze-large-play-v0 0.00±0.00 0.00±0.00 0.00±0.00 0.00±0.00 45.00±11.53 0.00±0.00 TBD±TBD TBD±TBD 0.00±0.00
antmaze average 17.17 0.00 31.08 22.33 62.50 21.17 TBD TBD 17.58

Best Scores

Gym-MuJoCo

Task-Name BC BC-10% TD3 + BC CQL IQL AWAC SAC-N EDAC DT
halfcheetah-medium-v2 43.60±0.16 43.74±0.18 48.93±0.13 47.26±0.23 48.77±0.06 50.79±0.19 72.21±0.35 69.72±1.06 42.63±0.09
halfcheetah-medium-expert-v2 79.69±3.58 93.98±0.18 96.59±1.01 95.82±0.31 95.83±0.38 96.85±0.32 111.73±0.55 110.62±1.20 87.34±0.65
halfcheetah-medium-replay-v2 40.52±0.22 41.45±0.10 45.84±0.30 45.97±0.32 45.06±0.16 46.56±0.27 67.29±0.39 66.55±1.21 32.20±2.50
hopper-medium-v2 69.04±3.35 66.91±2.30 70.44±1.37 69.09±0.85 80.74±1.27 99.25±0.87 101.79±0.23 103.26±0.16 61.95±4.63
hopper-medium-expert-v2 90.63±12.68 113.05±0.17 113.22±0.50 111.01±1.93 111.79±0.47 113.25±0.50 111.24±0.17 111.80±0.13 107.01±3.28
hopper-medium-replay-v2 68.88±11.93 53.82±8.10 98.12±1.34 102.10±0.41 102.33±0.44 101.68±0.38 103.83±0.61 103.28±0.57 59.65±13.50
walker2d-medium-v2 80.64±1.06 80.46±1.41 86.91±0.32 84.76±0.15 87.99±0.83 85.98±4.43 90.17±0.63 95.78±1.23 75.54±0.53
walker2d-medium-expert-v2 109.95±0.72 109.57±0.33 112.21±0.07 111.70±0.28 113.19±0.33 113.30±2.51 116.93±0.49 116.52±0.86 96.30±1.18
walker2d-medium-replay-v2 48.41±8.78 71.54±1.16 91.17±0.83 88.02±1.18 91.85±2.26 86.79±0.96 85.18±1.89 89.69±1.60 67.23±6.73
locomotion average 70.15 74.95 84.83 83.97 86.40 88.27 95.60 96.36 69.98

Maze2d

Task-Name BC BC-10% TD3 + BC CQL IQL AWAC SAC-N EDAC DT
maze2d-umaze-v1 16.09±1.00 16.85±0.60 99.33±18.66 18.82±0.63 44.04±3.02 137.96±12.50 151.28±8.14 144.30±5.60 -14.19±0.56
maze2d-medium-v1 19.16±1.44 24.81±4.09 150.93±4.50 17.96±5.24 92.25±40.74 152.11±23.00 90.04±20.74 150.82±2.76 45.13±6.25
maze2d-large-v1 20.75±7.69 35.66±6.40 197.64±6.07 12.27±5.34 138.70±44.70 227.79±1.99 207.10±1.46 179.90±2.41 3.94±2.24
maze2d average 18.67 25.77 149.30 16.35 91.66 172.62 149.47 158.34 11.63

Antmaze

Task-Name BC BC-10% TD3 + BC CQL IQL AWAC SAC-N EDAC DT
antmaze-umaze-v0 71.25±9.07 0.00±0.00 97.75±1.50 78.33±4.73 87.00±2.94 74.75±8.77 TBD±TBD TBD±TBD 65.50±8.96
antmaze-medium-play-v0 4.75±2.22 0.00±0.00 6.00±2.00 2.00±1.41 85.33±2.08 14.00±11.80 TBD±TBD TBD±TBD 1.00±2.00
antmaze-large-play-v0 0.75±0.50 0.00±0.00 0.33±0.58 0.00±0.00 56.00±4.00 0.00±0.00 TBD±TBD TBD±TBD 0.00±0.00
antmaze average 25.58 0.00 34.69 26.78 76.11 29.58 TBD TBD 22.17

Citing CORL

If you use CORL in your work, please use the following bibtex

@misc{corl2022,
  author={Tarasov, Denis and Nikulin, Alexander and Akimov, Dmitriy and Kurenkov, Vladislav and Sergey Kolesnikov},
  title={CORL: Research-oriented Deep Offline Reinforcement Learning Library},
  year={2022},
  url={https://github.com/tinkoff-ai/CORL},
}