What follows is a list of papers in deep RL that are worth reading. This is far from comprehensive, but should provide a useful starting point for someone looking to do research in the field.
Content from: https://spinningup.openai.com/en/latest/spinningup/keypapers.html
- Model-Free RL
- Exploration
- Transfer and Multitask RL
- Hierarchy
- Memory
- Model-Based RL
- Meta-RL
- Scaling RL
- RL in the Real World
- Safety
- Imitation Learning and Inverse Reinforcement Learning
- Reproducibility, Analysis, and Critique
- Bonus: Classic Papers in RL Theory or Review
**Model-Free RL**

- 1. Playing Atari with Deep Reinforcement Learning, Mnih et al, 2013. Algorithm: DQN.
- 2. Deep Recurrent Q-Learning for Partially Observable MDPs, Hausknecht and Stone, 2015. Algorithm: Deep Recurrent Q-Learning.
- 3. Dueling Network Architectures for Deep Reinforcement Learning, Wang et al, 2015. Algorithm: Dueling DQN.
- 4. Deep Reinforcement Learning with Double Q-learning, van Hasselt et al, 2015. Algorithm: Double DQN. (A sketch of the double-DQN target follows this section.)
- 5. Prioritized Experience Replay, Schaul et al, 2015. Algorithm: Prioritized Experience Replay (PER).
- 6. Rainbow: Combining Improvements in Deep Reinforcement Learning, Hessel et al, 2017. Algorithm: Rainbow DQN.
- 7. Asynchronous Methods for Deep Reinforcement Learning, Mnih et al, 2016. Algorithm: A3C.
- 8. Trust Region Policy Optimization, Schulman et al, 2015. Algorithm: TRPO.
- 9. High-Dimensional Continuous Control Using Generalized Advantage Estimation, Schulman et al, 2015. Algorithm: GAE.
- 10. Proximal Policy Optimization Algorithms, Schulman et al, 2017. Algorithm: PPO-Clip, PPO-Penalty. (A sketch of the clipped objective follows this section.)
- 11. Emergence of Locomotion Behaviours in Rich Environments, Heess et al, 2017. Algorithm: PPO-Penalty.
- 12. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation, Wu et al, 2017. Algorithm: ACKTR.
- 13. Sample Efficient Actor-Critic with Experience Replay, Wang et al, 2016. Algorithm: ACER.
- 14. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, Haarnoja et al, 2018. Algorithm: SAC.
- 15. Deterministic Policy Gradient Algorithms, Silver et al, 2014. Algorithm: DPG.
- 16. Continuous Control With Deep Reinforcement Learning, Lillicrap et al, 2015. Algorithm: DDPG.
- 17. Addressing Function Approximation Error in Actor-Critic Methods, Fujimoto et al, 2018. Algorithm: TD3.
- 18. A Distributional Perspective on Reinforcement Learning, Bellemare et al, 2017. Algorithm: C51.
- 19. Distributional Reinforcement Learning with Quantile Regression, Dabney et al, 2017. Algorithm: QR-DQN.
- 20. Implicit Quantile Networks for Distributional Reinforcement Learning, Dabney et al, 2018. Algorithm: IQN.
- 21. Dopamine: A Research Framework for Deep Reinforcement Learning, Anonymous, 2018. Contribution: Introduces Dopamine, a code repository containing implementations of DQN, C51, IQN, and Rainbow.
- 22. Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic, Gu et al, 2016. Algorithm: Q-Prop.
- 23. Action-dependent Control Variates for Policy Optimization via Stein’s Identity, Liu et al, 2017. Algorithm: Stein Control Variates.
- 24. The Mirage of Action-Dependent Baselines in Reinforcement Learning, Tucker et al, 2018. Contribution: Critiques and reevaluates claims from earlier papers (including Q-Prop and Stein control variates) and finds important methodological errors in them.
- 25. Bridging the Gap Between Value and Policy Based Reinforcement Learning, Nachum et al, 2017. Algorithm: PCL.
- 26. Trust-PCL: An Off-Policy Trust Region Method for Continuous Control, Nachum et al, 2017. Algorithm: Trust-PCL.
- 27. Combining Policy Gradient and Q-learning, O’Donoghue et al, 2016. Algorithm: PGQL.
- 28. The Reactor: A Fast and Sample-Efficient Actor-Critic Agent for Reinforcement Learning, Gruslys et al, 2017. Algorithm: Reactor.
- 29. Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning, Gu et al, 2017. Algorithm: IPG.
- 30. Equivalence Between Policy Gradients and Soft Q-Learning, Schulman et al, 2017. Contribution: Reveals a theoretical link between these two families of RL algorithms.
- 31. Evolution Strategies as a Scalable Alternative to Reinforcement Learning, Salimans et al, 2017. Algorithm: ES.
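To make two of the core ideas above concrete, here is a minimal NumPy sketch, not any paper's reference implementation: the Double DQN target from entry 4 and the PPO clipped surrogate objective from entry 10. All function and argument names are illustrative.

```python
import numpy as np

def double_dqn_target(q_online_next, q_target_next, rewards, dones, gamma=0.99):
    """Double DQN target (entry 4): the online network selects the next action,
    the target network evaluates it, which reduces overestimation bias.
    q_online_next, q_target_next: arrays of shape (batch, n_actions)."""
    best_actions = np.argmax(q_online_next, axis=1)
    next_values = q_target_next[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * (1.0 - dones) * next_values

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-Clip surrogate (entry 10), to be maximized: the probability ratio is
    clipped so a single update cannot move the policy too far."""
    ratio = np.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```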
**Exploration**

- 32. VIME: Variational Information Maximizing Exploration, Houthooft et al, 2016. Algorithm: VIME.
- 33. Unifying Count-Based Exploration and Intrinsic Motivation, Bellemare et al, 2016. Algorithm: CTS-based Pseudocounts.
- 34. Count-Based Exploration with Neural Density Models, Ostrovski et al, 2017. Algorithm: PixelCNN-based Pseudocounts.
- 35. #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning, Tang et al, 2016. Algorithm: Hash-based Counts. (A sketch of a count-based bonus follows this section.)
- 36. EX2: Exploration with Exemplar Models for Deep Reinforcement Learning, Fu et al, 2017. Algorithm: EX2.
- 37. Curiosity-driven Exploration by Self-supervised Prediction, Pathak et al, 2017. Algorithm: Intrinsic Curiosity Module (ICM).
- 38. Large-Scale Study of Curiosity-Driven Learning, Burda et al, 2018. Contribution: Systematic analysis of how surprisal-based intrinsic motivation performs in a wide variety of environments.
- 39. Exploration by Random Network Distillation, Burda et al, 2018. Algorithm: RND.
- 40. Variational Intrinsic Control, Gregor et al, 2016. Algorithm: VIC.
- 41. Diversity is All You Need: Learning Skills without a Reward Function, Eysenbach et al, 2018. Algorithm: DIAYN.
- 42. Variational Option Discovery Algorithms, Achiam et al, 2018. Algorithm: VALOR.
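As a minimal illustration of the count-based exploration idea in entries 33-35 (hash-based variant, entry 35), the sketch below assigns each state a bonus that decays with its visit count. The class and parameter names are my own, not the paper's.

```python
import numpy as np
from collections import defaultdict

class HashCountBonus:
    """Exploration bonus beta / sqrt(N(s)), where N(s) counts visits to the
    (hashed) state, in the spirit of entry 35 (#Exploration)."""

    def __init__(self, beta=0.01):
        self.counts = defaultdict(int)
        self.beta = beta

    def bonus(self, state_hash):
        # Increment the visit count, then return a bonus that shrinks
        # as the state becomes familiar.
        self.counts[state_hash] += 1
        return self.beta / np.sqrt(self.counts[state_hash])
```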
**Transfer and Multitask RL**

- 43. Progressive Neural Networks, Rusu et al, 2016. Algorithm: Progressive Networks.
- 44. Universal Value Function Approximators, Schaul et al, 2015. Algorithm: UVFA.
- 45. Reinforcement Learning with Unsupervised Auxiliary Tasks, Jaderberg et al, 2016. Algorithm: UNREAL.
- 46. The Intentional Unintentional Agent: Learning to Solve Many Continuous Control Tasks Simultaneously, Cabi et al, 2017. Algorithm: IU Agent.
- 47. PathNet: Evolution Channels Gradient Descent in Super Neural Networks, Fernando et al, 2017. Algorithm: PathNet.
- 48. Mutual Alignment Transfer Learning, Wulfmeier et al, 2017. Algorithm: MATL.
- 49. Learning an Embedding Space for Transferable Robot Skills, Hausman et al, 2018.
- 50. Hindsight Experience Replay, Andrychowicz et al, 2017. Algorithm: Hindsight Experience Replay (HER). (A sketch of hindsight relabeling follows this section.)
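The key trick in Hindsight Experience Replay (entry 50) is relabeling failed trajectories with goals that were actually achieved. Below is a minimal sketch of the "final" relabeling strategy, assuming a hypothetical compute_reward(achieved_goal, goal) interface.

```python
def her_final_relabel(episode, compute_reward):
    """Relabel an episode with the goal achieved at its final step, so even a
    'failed' trajectory yields useful reward signal (entry 50, HER).
    episode: list of (state, action, next_state, achieved_goal) tuples.
    compute_reward(achieved_goal, goal): hypothetical sparse-reward function."""
    final_goal = episode[-1][3]  # goal actually reached at the end
    relabeled = []
    for state, action, next_state, achieved in episode:
        reward = compute_reward(achieved, final_goal)
        relabeled.append((state, final_goal, action, reward, next_state))
    return relabeled
```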
**Hierarchy**

- 51. Strategic Attentive Writer for Learning Macro-Actions, Vezhnevets et al, 2016. Algorithm: STRAW.
- 52. FeUdal Networks for Hierarchical Reinforcement Learning, Vezhnevets et al, 2017. Algorithm: Feudal Networks.
- 53. Data-Efficient Hierarchical Reinforcement Learning, Nachum et al, 2018. Algorithm: HIRO.
**Memory**

- 54. Model-Free Episodic Control, Blundell et al, 2016. Algorithm: MFEC. (A sketch of an episodic value estimate follows this section.)
- 55. Neural Episodic Control, Pritzel et al, 2017. Algorithm: NEC.
- 56. Neural Map: Structured Memory for Deep Reinforcement Learning, Parisotto and Salakhutdinov, 2017. Algorithm: Neural Map.
- 57. Unsupervised Predictive Memory in a Goal-Directed Agent, Wayne et al, 2018. Algorithm: MERLIN.
- 58. Relational Recurrent Neural Networks, Santoro et al, 2018. Algorithm: RMC.
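For flavor, the core lookup in Model-Free Episodic Control (entry 54) can be approximated as a k-nearest-neighbor average over stored returns; a rough sketch follows, with names of my own choosing and details simplified relative to the paper.

```python
import numpy as np

def episodic_q_estimate(query, keys, returns, k=5):
    """Estimate Q(s, a) by averaging the Monte Carlo returns of the k stored
    state embeddings nearest to the query, roughly as in MFEC (entry 54).
    keys: (n, d) array of stored embeddings for this action; returns: (n,)."""
    if len(keys) == 0:
        return 0.0
    distances = np.linalg.norm(keys - query, axis=1)
    nearest = np.argsort(distances)[:k]
    return float(np.mean(returns[nearest]))
```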
**Model-Based RL**

- 59. Imagination-Augmented Agents for Deep Reinforcement Learning, Weber et al, 2017. Algorithm: I2A.
- 60. Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning, Nagabandi et al, 2017. Algorithm: MBMF. (A sketch of random-shooting MPC follows this section.)
- 61. Model-Based Value Expansion for Efficient Model-Free Reinforcement Learning, Feinberg et al, 2018. Algorithm: MVE.
- 62. Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion, Buckman et al, 2018. Algorithm: STEVE.
- 63. Model-Ensemble Trust-Region Policy Optimization, Kurutach et al, 2018. Algorithm: ME-TRPO.
- 64. Model-Based Reinforcement Learning via Meta-Policy Optimization, Clavera et al, 2018. Algorithm: MB-MPO.
- 65. Recurrent World Models Facilitate Policy Evolution, Ha and Schmidhuber, 2018.
- 66. Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, Silver et al, 2017. Algorithm: AlphaZero.
- 67. Thinking Fast and Slow with Deep Learning and Tree Search, Anthony et al, 2017. Algorithm: ExIt.
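A simple planner used with learned dynamics models (e.g., in entry 60) is random-shooting MPC: sample candidate action sequences, roll them through the model, and execute the first action of the best sequence. The sketch assumes batched, hypothetical dynamics_fn and reward_fn interfaces.

```python
import numpy as np

def random_shooting_mpc(state, dynamics_fn, reward_fn, action_dim,
                        horizon=10, n_candidates=1000, rng=None):
    """Random-shooting MPC in the spirit of MBMF (entry 60).
    dynamics_fn(states, actions) -> next states; reward_fn(states, actions)
    -> rewards; both assumed to operate on batches (hypothetical names)."""
    rng = rng or np.random.default_rng()
    actions = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    states = np.repeat(state[None, :], n_candidates, axis=0)
    returns = np.zeros(n_candidates)
    for t in range(horizon):
        returns += reward_fn(states, actions[:, t])
        states = dynamics_fn(states, actions[:, t])
    return actions[np.argmax(returns), 0]  # execute only the first action
```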
**Meta-RL**

- 68. RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning, Duan et al, 2016. Algorithm: RL^2.
- 69. Learning to Reinforcement Learn, Wang et al, 2016.
- 70. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, Finn et al, 2017. Algorithm: MAML. (A sketch of a first-order MAML step follows this section.)
- 71. A Simple Neural Attentive Meta-Learner, Mishra et al, 2018. Algorithm: SNAIL.
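MAML (entry 70) adapts to each task with inner gradient steps and updates the meta-parameters through that adaptation. The exact second-order version needs autodiff; below is a minimal first-order approximation (FOMAML) in NumPy, with a hypothetical loss_grad(theta, task) interface.

```python
import numpy as np

def fomaml_step(theta, tasks, loss_grad, inner_lr=0.01, outer_lr=0.001):
    """One first-order MAML step (a common simplification of entry 70).
    loss_grad(theta, task): gradient of that task's loss at theta
    (hypothetical interface)."""
    meta_grad = np.zeros_like(theta)
    for task in tasks:
        adapted = theta - inner_lr * loss_grad(theta, task)  # inner adaptation
        meta_grad += loss_grad(adapted, task)                # post-adaptation gradient
    return theta - outer_lr * meta_grad / len(tasks)
```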
**Scaling RL**

- 72. Accelerated Methods for Deep Reinforcement Learning, Stooke and Abbeel, 2018. Contribution: Systematic analysis of parallelization in deep RL across algorithms.
- 73. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, Espeholt et al, 2018. Algorithm: IMPALA.
- 74. Distributed Prioritized Experience Replay, Horgan et al, 2018. Algorithm: Ape-X.
- 75. Recurrent Experience Replay in Distributed Reinforcement Learning, Anonymous, 2018. Algorithm: R2D2.
- 76. RLlib: Abstractions for Distributed Reinforcement Learning, Liang et al, 2017. Contribution: A scalable library of RL algorithm implementations.
**RL in the Real World**

- 77. Benchmarking Reinforcement Learning Algorithms on Real-World Robots, Mahmood et al, 2018.
- 78. Learning Dexterous In-Hand Manipulation, OpenAI, 2018.
- 79. QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation, Kalashnikov et al, 2018. Algorithm: QT-Opt.
- 80. Horizon: Facebook’s Open Source Applied Reinforcement Learning Platform, Gauci et al, 2018.
**Safety**

- 81. Concrete Problems in AI Safety, Amodei et al, 2016. Contribution: Establishes a taxonomy of safety problems, serving as an important jumping-off point for future research. We need to solve these!
- 82. Deep Reinforcement Learning From Human Preferences, Christiano et al, 2017. Algorithm: LFP.
- 83. Constrained Policy Optimization, Achiam et al, 2017. Algorithm: CPO.
- 84. Safe Exploration in Continuous Action Spaces, Dalal et al, 2018. Algorithm: DDPG+Safety Layer.
- 85. Trial without Error: Towards Safe Reinforcement Learning via Human Intervention, Saunders et al, 2017. Algorithm: HIRL.
- 86. Leave No Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning, Eysenbach et al, 2017. Algorithm: Leave No Trace.
**Imitation Learning and Inverse Reinforcement Learning**

- 87. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy, Ziebart, 2010. Contribution: Crisp formulation of maximum entropy IRL.
- 88. Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization, Finn et al, 2016. Algorithm: GCL.
- 89. Generative Adversarial Imitation Learning, Ho and Ermon, 2016. Algorithm: GAIL. (A sketch of the discriminator-based reward follows this section.)
- 90. DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills, Peng et al, 2018. Algorithm: DeepMimic.
- 91. Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow, Peng et al, 2018. Algorithm: VAIL.
- 92. One-Shot High-Fidelity Imitation: Training Large-Scale Deep Nets with RL, Le Paine et al, 2018. Algorithm: MetaMimic.
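In GAIL (entry 89), a discriminator trained to separate expert from policy state-action pairs supplies the policy's reward. One common surrogate is r = -log(1 - D(s, a)); computed from raw logits, this reduces to a softplus, as the sketch below shows (a schematic, not the paper's exact formulation).

```python
import numpy as np

def gail_reward(discriminator_logits):
    """Discriminator-based surrogate reward in the spirit of GAIL (entry 89).
    With D = sigmoid(logits), -log(1 - D) equals softplus(logits), which is
    numerically stable to compute via logaddexp."""
    return np.logaddexp(0.0, discriminator_logits)
```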
**Reproducibility, Analysis, and Critique**

- 93. Benchmarking Deep Reinforcement Learning for Continuous Control, Duan et al, 2016. Contribution: rllab.
- 94. Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control, Islam et al, 2017.
- 95. Deep Reinforcement Learning that Matters, Henderson et al, 2017.
- 96. Where Did My Optimum Go?: An Empirical Analysis of Gradient Descent Optimization in Policy Gradient Methods, Henderson et al, 2018.
- 97. Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms?, Ilyas et al, 2018.
- 98. Simple Random Search Provides a Competitive Approach to Reinforcement Learning, Mania et al, 2018. (A sketch of the random-search update follows this section.)
- 99. Benchmarking Model-Based Reinforcement Learning, Wang et al, 2019.
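Entry 98's point is that even derivative-free random search over simple policies can be competitive with deep RL. A sketch of one basic-random-search update, assuming a hypothetical rollout_return(theta) evaluator:

```python
import numpy as np

def random_search_step(theta, rollout_return, step_size=0.02, noise=0.03,
                       n_directions=8, rng=None):
    """One step of basic random search in the spirit of entry 98: probe random
    parameter-space directions and move along the return differences.
    rollout_return(theta): episodic return of the policy (hypothetical)."""
    rng = rng or np.random.default_rng()
    deltas = rng.standard_normal((n_directions,) + theta.shape)
    update = np.zeros_like(theta)
    for delta in deltas:
        r_plus = rollout_return(theta + noise * delta)
        r_minus = rollout_return(theta - noise * delta)
        update += (r_plus - r_minus) * delta
    return theta + (step_size / n_directions) * update
```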
**Bonus: Classic Papers in RL Theory or Review**

- 100. Policy Gradient Methods for Reinforcement Learning with Function Approximation, Sutton et al, 2000. Contribution: Established the policy gradient theorem (stated after this list) and showed convergence of policy gradient algorithms for arbitrary policy classes.
- 101. An Analysis of Temporal-Difference Learning with Function Approximation, Tsitsiklis and Van Roy, 1997. Contribution: A variety of convergence results and counterexamples for value-learning methods in RL.
- 102. Reinforcement Learning of Motor Skills with Policy Gradients, Peters and Schaal, 2008. Contribution: Thorough review of policy gradient methods at the time, many of which are still serviceable descriptions of deep RL methods.
- 103. Approximately Optimal Approximate Reinforcement Learning, Kakade and Langford, 2002. Contribution: Early roots of monotonic improvement theory, which later provided the theoretical justification for TRPO and other algorithms.
- 104. A Natural Policy Gradient, Kakade, 2002. Contribution: Brought natural gradients into RL, later leading to TRPO, ACKTR, and several other methods in deep RL.
- 105. Algorithms for Reinforcement Learning, Szepesvari, 2009. Contribution: An unbeatable reference on RL before deep RL, containing foundations and theoretical background.
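For reference, the policy gradient theorem established in entry 100 can be stated as:

```latex
\nabla_\theta J(\pi_\theta)
  = \mathbb{E}_{s \sim d^{\pi_\theta},\; a \sim \pi_\theta(\cdot \mid s)}
    \big[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \big]
```

where d^{pi_theta} is the discounted state visitation distribution and Q^{pi_theta} the action-value function; this identity underlies REINFORCE, A3C, TRPO, and PPO alike.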