Deep Reinforcement Learning for Continuous Control in PyTorch

Deep Control

Simple PyTorch implementations of Deep RL algorithms for continuous control research

This repository contains re-implementations of Deep RL algorithms for continuous action spaces. Some highlights:

  1. Code is readable, and written to be easy to modify for future research.
  2. Train and test on different environments (for generalization research).
  3. Built-in TensorBoard logging and parameter saving.
  4. Support for offline (batch) RL.
  5. Quick setup for benchmarks like Gym MuJoCo, Atari, PyBullet, and DeepMind Control Suite.
  6. Separate training and learning routines, which make it easy to mix and match techniques that improve the training process with techniques that improve the learning update.

What's included?

Deep Deterministic Policy Gradient (DDPG)

Paper: Continuous control with deep reinforcement learning, Lillicrap et al., 2015.

Description: A baseline model-free, off-policy, actor-critic method that forms the template for many of the other algorithms here.

Code: deep_control.ddpg (with extra comments for an intro to deep actor-critics) Examples: examples/basic_control/ddpg_gym.py

Twin Delayed DDPG (TD3)

Paper: Addressing Function Approximation Error in Actor-Critic Methods, Fujimoto et al., 2018.

Description: Builds on DDPG and makes several changes to improve the critic's learning and performance (Clipped Double Q Learning, Target Smoothing, Actor Delay). Also includes the TD regularization term from "TD-Regularized Actor-Critic Methods."

Code: deep_control.td3 Examples: examples/basic_control/td3_gym.py

Other References: author's implementation
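As a rough illustration of two of the TD3 changes named above (Target Smoothing and Clipped Double Q Learning), here is a minimal, framework-free sketch of the critic target for a single transition with a scalar action. The function name `td3_target`, its arguments, and the default hyperparameters are assumptions made for this example; they are not taken from `deep_control.td3`, which operates on batched PyTorch tensors.

```python
import random


def td3_target(reward, done, next_action, q1_target, q2_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0,
               rng=random):
    """Hypothetical helper: TD3-style critic target for one transition.

    q1_target / q2_target are callables mapping a (scalar) action to a
    Q-value for the fixed next state; they stand in for target critics.
    """
    # Target smoothing: perturb the target action with clipped Gaussian noise.
    noise = rng.gauss(0.0, noise_std)
    noise = max(-noise_clip, min(noise_clip, noise))
    smoothed = max(-max_action, min(max_action, next_action + noise))

    # Clipped double Q: use the smaller of the two target critic estimates.
    q_next = min(q1_target(smoothed), q2_target(smoothed))

    # Standard one-step bootstrap, with no bootstrapping past terminal states.
    return reward + gamma * (1.0 - done) * q_next
```

Taking the minimum of the two critics biases the target downward, which counteracts the overestimation that a single bootstrapped critic tends to accumulate.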

Soft Actor Critic (SAC)

Paper: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, Haarnoja et al., 2018.

Description: Samples actions from a stochastic actor rather than relying on added exploration noise during training. Uses a TD3-like double critic system. We implement the learnable entropy coefficient described in the follow-up paper. This version also supports discrete action spaces and can avoid using target networks by applying the self-regularized critic updates from GRAC (see below).

Code: deep_control.sac Examples: examples/dmc/sac.py, examples/sacd_demo.py

Other References: Yarats and Kostrikov's implementation, author's implementation.

Pixel SAC with Data Augmentation (SAC+AUG)

Paper: Measuring Visual Generalization in Continuous Control from Pixels

Description: This is a pixel-specific version of SAC with a few tricks and hyperparameter settings to improve performance. We include many different data augmentation techniques, including those used in RAD, DrQ and Network Randomization. The DrQ augmentation is turned on by default and has a large impact on performance.

Code: deep_control.sac_aug Examples: examples/dmcr/sac_aug.py

Other References: SAC+AE code, RAD Procgen code.

Self-Guided and Self-Regularized Actor-Critic (GRAC)

Paper: GRAC: Self-Regularized Actor-Critic, Shao et al., 2020.

Description: GRAC combines a stochastic policy with TD3-like stability improvements and CEM-based action selection, as in QT-Opt or CAQL.

Code: deep_control.grac Examples: examples/dmc/grac.py

Other References: author's implementation

Randomized Ensemble Double Q-Learning (REDQ)

Paper: Randomized Ensemble Double Q-Learning: Learning Fast Without a Model

Description: Extends the double Q trick to random subsets of a larger critic ensemble. Reduced Q function bias allows for a much higher replay ratio. REDQ is sample efficient but slow (compared to other model-free methods). We implement the SAC version.

Code: deep_control.redq Examples: examples/dmc/redq.py
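To make the "random subsets of a larger critic ensemble" idea concrete, here is a minimal sketch of a REDQ-style target in its SAC form for a single transition. It is written in plain Python for readability; `redq_target` and all of its parameters are hypothetical names for this example and are not the API of `deep_control.redq`.

```python
import random


def redq_target(reward, done, next_q_values, next_log_prob,
                subset_size=2, gamma=0.99, alpha=0.2, rng=random):
    """Hypothetical helper: REDQ-style soft target for one transition.

    next_q_values: one target Q estimate per ensemble member, all evaluated
                   at the same next state and sampled next action.
    next_log_prob: log-probability of that next action under the actor.
    """
    # Draw a random subset of the ensemble (without replacement).
    subset = rng.sample(next_q_values, subset_size)

    # Double-Q trick generalized to the subset: take its minimum, then
    # subtract the SAC entropy term.
    soft_value = min(subset) - alpha * next_log_prob

    return reward + gamma * (1.0 - done) * soft_value
```

With `subset_size` equal to the ensemble size this reduces to a minimum over every critic; smaller subsets give a less pessimistic target on average.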

Distributional Correction (DisCor)

Paper: DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction, Kumar et al., 2020.

Description: Reduces the effect of inaccurate target values propagating through the Q-function by learning to estimate the target networks' inaccuracies and adjusting the TD error accordingly. Implemented on top of standard SAC.

Code: deep_control.discor Examples: examples/dmc/discor.py
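One way to picture the "adjusting the TD error" step is as a per-sample weight that shrinks when the estimated error of the bootstrap target is large. The sketch below assumes an exponential weighting of the form exp(-gamma * error / temperature), normalized over the batch; the function name `discor_weights`, the fixed `temperature` argument, and the batch normalization are assumptions for illustration and may differ from what `deep_control.discor` does.

```python
import math


def discor_weights(next_error_estimates, gamma=0.99, temperature=10.0):
    """Hypothetical helper: down-weight transitions with unreliable targets.

    next_error_estimates: estimated target-value error at each transition's
                          next state-action pair (one float per sample).
    Returns weights that sum to 1 over the batch.
    """
    logits = [-gamma * err / temperature for err in next_error_estimates]

    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Each weight would then multiply the corresponding squared TD error in the critic loss, so samples whose targets are estimated to be inaccurate contribute less to the update.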

Simple Unified Framework for Ensemble Learning (SUNRISE)

Paper: SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning, Lee et al., 2020.

Description: Extends SAC using an ensemble of actors and critics. Adds UCB-based exploration, ensembled inference, and a simpler weighted Bellman backup. This version does not use the replay buffer masks from the original.

Code: deep_control.sunrise Examples: examples/dmc/sunrise.py
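The UCB-based exploration mentioned above can be sketched as follows: each actor in the ensemble proposes a candidate action, and the agent picks the candidate with the highest mean critic value plus a bonus for critic disagreement. `ucb_select`, its arguments, and the use of the population standard deviation are assumptions for this example, not the interface of `deep_control.sunrise`.

```python
import statistics


def ucb_select(candidate_actions, critics, state, ucb_lambda=1.0):
    """Hypothetical helper: pick the candidate with the best upper confidence bound.

    candidate_actions: one proposed action per actor in the ensemble.
    critics: callables mapping (state, action) to a Q-value.
    """
    def ucb_score(action):
        q_values = [critic(state, action) for critic in critics]
        # Mean is the exploitation term; the spread across critics is the
        # uncertainty bonus that encourages exploration.
        return statistics.mean(q_values) + ucb_lambda * statistics.pstdev(q_values)

    return max(candidate_actions, key=ucb_score)
```

Setting `ucb_lambda` to 0 recovers purely greedy selection with respect to the ensemble mean.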

Stochastic Behavioral Cloning (SBC)

Description: A simple approach to offline RL that trains the actor network to emulate the action choices of the demonstration dataset. Uses the stochastic actor from SAC and some basic ensembling to make this a reasonable baseline.

Code: deep_control.sbc Examples: examples/d4rl/sbc.py

Advantage Weighted Actor Critic (AWAC) and Critic Regularized Regression (CRR)

Paper: Accelerating Online Reinforcement Learning with Offline Datasets, Nair et al., 2020. & Critic Regularized Regression, Wang et al., 2020.

Description: TD3 with a stochastic policy and a modified actor update that makes better use of offline experience before finetuning in the online environment. The current implementation is a mix between AWAC and CRR. We allow for online finetuning and use standard critic networks as in AWAC, but add the binary advantage function and the max/mean advantage estimates from CRR.

Code: deep_control.awac Examples: examples/d4rl/awac.py

Model Based Policy Optimization (MBPO)

Paper: When to Trust Your Model: Model-Based Policy Optimization, Janner et al., 2019.

Warning: this implementation is in alpha.

Description: Improves SAC's sample efficiency by training the policy on transitions generated by a learned world model.

Code: deep_control.mbpo

Other References: author's implementation.

Installation

git clone https://github.com/jakegrigsby/deep_control.git
cd deep_control
pip install -e .

Examples

See the examples folder for a look at how to train agents in environments like the DeepMind Control Suite and OpenAI Gym.

Roadmap

Things that will hopefully be included by the end of 2020:

  1. CAQL
  2. Quick setup support for Robosuite and CARLA.