PyTorch and TensorFlow 2.0 implementations of state-of-the-art model-free reinforcement learning algorithms on both OpenAI Gym environments and a self-implemented Reacher environment.
Algorithms include Soft Actor-Critic (SAC), Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), Actor-Critic (AC/A2C), Proximal Policy Optimization (PPO), QT-Opt (including Cross-entropy (CE) Method), PointNet, Transporter, Recurrent Policy Gradient, Soft Decision Tree, etc.
Please note that this repo is more of a personal collection of algorithms I implemented and tested during my research and study period, rather than an official open-source library/package for general usage. However, I think it could be helpful to share with others, and I welcome useful discussions on my implementations. I did not spend much time cleaning or structuring the code, so you may notice that there are several versions of implementation for some algorithms; I intentionally show all of them here for you to refer to and compare. Also, this repo contains only the PyTorch implementation.
For official libraries of RL algorithms, I provide the following two, implemented with TensorFlow 2.0 + TensorLayer 2.0:
- RL Tutorial (Status: Released): RL algorithm implementations as tutorials with simple structures.
- RLzoo (Status: Released): a baseline implementation with a high-level API supporting a variety of popular environments, with more hierarchical structures for simple usage.
Since TensorFlow 2.0 has adopted dynamic graph construction instead of the static graph, it becomes trivial to transfer RL code between TensorFlow and PyTorch.
- Soft Actor-Critic (SAC): two versions are implemented (a minimal sketch of the version-2 update follows this item).
  - SAC Version 1: sac.py, using the state-value function. Paper: https://arxiv.org/pdf/1801.01290.pdf
  - SAC Version 2: sac_v2.py, using the target Q-value function instead of the state-value function.
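For reference, a minimal sketch of the entropy-regularized Q-target in the second (state-value-free) style, assuming twin target critics `q1_target`/`q2_target` and a policy object with a `sample` method; the names are illustrative, not the exact ones used in sac_v2.py:

```python
import torch

def soft_q_target(reward, next_state, done, policy, q1_target, q2_target,
                  alpha=0.2, gamma=0.99):
    # Sample the next action from the current policy and get its log-probability.
    next_action, next_log_prob = policy.sample(next_state)
    # Clipped double-Q: take the minimum of the twin target critics.
    next_q = torch.min(q1_target(next_state, next_action),
                       q2_target(next_state, next_action))
    # Entropy-regularized Bellman backup; no separate state-value network is needed.
    return reward + gamma * (1 - done) * (next_q - alpha * next_log_prob)
```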
- Deep Deterministic Policy Gradient (DDPG): ddpg.py, an implementation of DDPG (a minimal update sketch follows this item).
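A minimal sketch of the DDPG losses, assuming actor/critic networks with matching target networks and a batch of transitions; names are illustrative:

```python
import torch
import torch.nn.functional as F

def ddpg_losses(batch, actor, critic, actor_target, critic_target, gamma=0.99):
    state, action, reward, next_state, done = batch
    # Critic: regress Q(s, a) toward the bootstrapped target from the target networks.
    with torch.no_grad():
        target_q = reward + gamma * (1 - done) * critic_target(next_state, actor_target(next_state))
    critic_loss = F.mse_loss(critic(state, action), target_q)
    # Actor: deterministic policy gradient, i.e. maximize Q(s, actor(s)).
    actor_loss = -critic(state, actor(state)).mean()
    return critic_loss, actor_loss
```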
- Twin Delayed DDPG (TD3): td3.py, an implementation of TD3 (a minimal target-computation sketch follows this item).
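A minimal sketch of the TD3 critic target, showing target policy smoothing and clipped double-Q; names are illustrative, not the exact ones in td3.py:

```python
import torch

def td3_target(reward, next_state, done, actor_target, q1_target, q2_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0):
    # Target policy smoothing: add clipped noise to the target action.
    next_action = actor_target(next_state)
    noise = (torch.randn_like(next_action) * noise_std).clamp(-noise_clip, noise_clip)
    next_action = (next_action + noise).clamp(-max_action, max_action)
    # Clipped double-Q: take the minimum of the twin target critics.
    next_q = torch.min(q1_target(next_state, next_action),
                       q2_target(next_state, next_action))
    # The actor and target networks are typically updated only every few critic steps (delayed updates).
    return reward + gamma * (1 - done) * next_q
```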
- Proximal Policy Optimization (PPO) (a minimal clipped-objective/GAE sketch follows this item):
  - For continuous environments, two versions are implemented:
    - Version 1: ppo_continuous.py and ppo_continuous_multiprocess.py
    - Version 2: ppo_continuous2.py and ppo_continuous_multiprocess2.py
  - For discrete environments: ppo_gae_discrete.py, with Generalized Advantage Estimation (GAE).
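A minimal sketch of GAE and the clipped surrogate loss used in PPO; function and argument names are illustrative:

```python
import torch

def gae(rewards, values, next_value, dones, gamma=0.99, lam=0.95):
    # Generalized Advantage Estimation over one collected rollout (lists of floats).
    advantages, running = [], 0.0
    values = values + [next_value]
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] * (1 - dones[t]) - values[t]
        running = delta + gamma * lam * (1 - dones[t]) * running
        advantages.insert(0, running)
    return advantages

def ppo_clip_loss(log_prob, old_log_prob, advantage, eps=0.2):
    # Clipped surrogate objective: discourage the new policy from moving too far from the old one.
    ratio = torch.exp(log_prob - old_log_prob)
    return -torch.min(ratio * advantage,
                      torch.clamp(ratio, 1 - eps, 1 + eps) * advantage).mean()
```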
- Actor-Critic (AC) / A2C: ac.py, a very extensible version of vanilla AC/A2C that supports all continuous/discrete and deterministic/non-deterministic cases and is easy to change into DDPG, etc. (a minimal update sketch follows this item).
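A minimal sketch of the advantage actor-critic losses for a batch of transitions; names are illustrative:

```python
import torch
import torch.nn.functional as F

def a2c_losses(log_prob, value, reward, next_value, done, gamma=0.99):
    # One-step TD target and advantage (the critic's TD error).
    td_target = reward + gamma * next_value * (1 - done)
    advantage = (td_target - value).detach()
    actor_loss = -(log_prob * advantage).mean()           # policy gradient weighted by the advantage
    critic_loss = F.mse_loss(value, td_target.detach())   # value-function regression
    return actor_loss, critic_loss
```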
- QT-Opt: two versions are implemented here (a minimal sketch of the Cross-Entropy Method action selection follows this item).
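QT-Opt selects continuous actions by approximating argmax_a Q(s, a) with the Cross-Entropy Method; a minimal sketch of that inner loop, assuming a Q-network `q_net(state, action)` and actions in [-1, 1] (illustrative names):

```python
import torch

def cem_action(q_net, state, action_dim, iters=3, samples=64, elites=6):
    # Iteratively refit a Gaussian over actions toward the highest-Q samples.
    # state: 1-D tensor of shape (state_dim,).
    mean, std = torch.zeros(action_dim), torch.ones(action_dim)
    for _ in range(iters):
        actions = (mean + std * torch.randn(samples, action_dim)).clamp(-1.0, 1.0)
        states = state.unsqueeze(0).expand(samples, -1)      # repeat the state for each candidate action
        q_values = q_net(states, actions).squeeze(-1)
        elite_actions = actions[q_values.topk(elites).indices]
        mean, std = elite_actions.mean(dim=0), elite_actions.std(dim=0) + 1e-6
    return mean                                              # approximate argmax_a Q(s, a)
```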
- PointNet for landmark generation from images with unsupervised learning is implemented here. This method is also used for image-based reinforcement learning as a SOTA algorithm, called Transporter (a minimal keypoint-extraction sketch follows this item).
  - Original paper: Unsupervised Learning of Object Landmarks through Conditional Image Generation
  - Paper for RL: Unsupervised Learning of Object Keypoints for Perception and Control
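A step shared by these keypoint methods is turning per-keypoint feature maps into 2D coordinates with a spatial softmax; a minimal sketch of that operation only (not the full Transporter model), with illustrative names:

```python
import torch

def spatial_softmax_keypoints(feature_maps):
    # feature_maps: (batch, K, H, W), one map per keypoint.
    # Returns (batch, K, 2) expected (x, y) coordinates in [-1, 1].
    b, k, h, w = feature_maps.shape
    probs = torch.softmax(feature_maps.view(b, k, -1), dim=-1).view(b, k, h, w)
    xs = torch.linspace(-1.0, 1.0, w)
    ys = torch.linspace(-1.0, 1.0, h)
    x = (probs.sum(dim=2) * xs).sum(dim=-1)   # marginalize over rows, expectation over columns
    y = (probs.sum(dim=3) * ys).sum(dim=-1)   # marginalize over columns, expectation over rows
    return torch.stack([x, y], dim=-1)
```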
- Recurrent Policy Gradient: recurrent (LSTM) versions of the off-policy algorithms (a minimal LSTM-policy sketch follows this item).
  - rdpg.py: DDPG with LSTM policy.
  - td3_lstm.py: TD3 with LSTM policy.
  - sac_v2_lstm.py: SAC with LSTM policy.
  - References: Memory-based control with recurrent neural networks; Sim-to-Real Transfer of Robotic Control with Dynamics Randomization.
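A minimal sketch of what such an LSTM policy looks like: the recurrent hidden state summarizes the observation history, which is what makes these variants useful under partial observability (illustrative, not the exact network in the repo):

```python
import torch
import torch.nn as nn

class LSTMPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, action_dim)

    def forward(self, state_seq, hidden=None):
        # state_seq: (batch, seq_len, state_dim); hidden carries memory across time steps.
        out, hidden = self.lstm(state_seq, hidden)
        action = torch.tanh(self.head(out))   # squash actions to [-1, 1]
        return action, hidden
```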
- Soft Decision Tree as function approximator for PPO: sdt_ppo_gae_discrete.py replaces the policy network in PPO with a Soft Decision Tree, for achieving explainable RL (a minimal tree sketch follows this item).
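A minimal sketch of a soft decision tree policy of fixed depth, where each inner node routes probability mass with a sigmoid gate and each leaf holds a distribution over discrete actions; this is an illustrative simplification, not the code in sdt_ppo_gae_discrete.py:

```python
import torch
import torch.nn as nn

class SoftDecisionTree(nn.Module):
    def __init__(self, state_dim, n_actions, depth=3):
        super().__init__()
        self.depth = depth
        n_inner, n_leaves = 2 ** depth - 1, 2 ** depth
        self.gates = nn.Linear(state_dim, n_inner)                     # one sigmoid gate per inner node
        self.leaves = nn.Parameter(torch.zeros(n_leaves, n_actions))   # per-leaf action logits

    def forward(self, state):
        gate = torch.sigmoid(self.gates(state))         # (batch, n_inner)
        path_prob = torch.ones_like(gate[:, :1])        # probability of reaching each node; root = 1
        idx = 0
        for d in range(self.depth):
            g = gate[:, idx: idx + 2 ** d]              # gates for this level
            # Split each node's probability mass between its left (1 - g) and right (g) child.
            path_prob = torch.stack([path_prob * (1 - g), path_prob * g], dim=-1).flatten(1)
            idx += 2 ** d
        leaf_dists = torch.softmax(self.leaves, dim=-1) # (n_leaves, n_actions)
        return path_prob @ leaf_dists                   # (batch, n_actions) action probabilities
```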
- Maximum a Posteriori Policy Optimisation (MPO): todo.
- Advantage-Weighted Regression (AWR): todo. Paper: Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning.
To train or test an algorithm, run the corresponding script with one of these flags:

```
python ***.py --train
python ***.py --test
```
If you meet the error "NotImplementedError", it may be due to a wrong Gym version: the newest gym==0.14 won't work. Install gym==0.7 or gym==0.10 with pip install -r requirements.txt.
- SAC for gym Pendulum-v0:
  - SAC with automatically updated temperature (alpha) for entropy:
  - SAC without automatically updated temperature (alpha) for entropy:
  The comparison shows that the automatic entropy (alpha) update helps the agent learn faster (a minimal sketch of that update follows this item).
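For reference, the automatic temperature adjustment typically optimizes alpha so that the policy's entropy stays near a target entropy (commonly -action_dim); a minimal sketch with illustrative names, not the exact code in sac_v2.py:

```python
import torch

log_alpha = torch.zeros(1, requires_grad=True)            # learnable log-temperature
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_prob, target_entropy):
    # Increase alpha when the policy's entropy falls below the target, decrease it otherwise.
    alpha_loss = -(log_alpha * (log_prob + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()
```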
- TD3 for gym Pendulum-v0:
  - TD3 with deterministic policy:
  - TD3 with non-deterministic/stochastic policy:
  TD3 with the deterministic policy works slightly better, but the two are basically similar.
- AC for gym CartPole-v0:
  However, vanilla AC/A2C cannot handle continuous cases like gym Pendulum-v0 well.
To cite this repository:
```
@misc{rlalgorithms,
  author = {Zihan Ding},
  title = {SOTA-RL-Algorithms},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/quantumiracle/SOTA-RL-Algorithms}},
}
```