This repository provides the RL learning roadmap mentioned in the blog post How to Learn Reinforcement Learning: A Step-by-step Guide.
For complimentary MATLAB coding exercises with solutions, see RL Course MATLAB.
I highly recommend working through the roadmap in order. After the first 4 chapters, you should have enough of a foundation to take the remaining chapters in any order. For each chapter:
- Make sure you fully understand the required concepts through the learning materials.
- Implement the algorithm in your favorite framework. Learning happens when you implement and debug it yourself.
- Test it on some RL problems. My favorites are cart-pole, inverted pendulum, walking robot, and Pong.
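
As a rough illustration of the implement-then-test loop above, here is a minimal sketch of tabular Q-learning (Chapter 2) with an epsilon-greedy policy. The 5-state chain environment, the `step`, `train`, and `greedy_policy` helpers, and all hyperparameters are hypothetical choices made for this example; they are not taken from the roadmap's course materials.

```python
import random

N_STATES = 5      # states 0..4; state 4 is the terminal goal (assumed toy setup)
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Hypothetical chain environment: reward 1 only when the goal is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                # exploit, breaking ties at random so early episodes still move
                best = max(q[state])
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Off-policy TD target: bootstrap from the best next action
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])  # TD error update
            state = next_state
    return q


def greedy_policy(q):
    """Best action in each non-terminal state."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]


if __name__ == "__main__":
    print(greedy_policy(train()))
```

On this toy chain the learned greedy policy should move right in every state. Swapping the chain for cart-pole or another benchmark changes only the environment and the state representation, which is where most of the debugging effort tends to go.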
Chapter | Algorithm | Required Concepts | Learning Materials |
---|---|---|---|
1 | Dynamic Programming<br>• Policy Evaluation<br>• Policy Improvement<br>• Value Iteration | • Markov Decision Process<br>• Expected return<br>• Discount factor<br>• State, Observation<br>• Action<br>• Reward<br>• State value function V(s)<br>• State-action value function Q(s,a) | • MATLAB Tech Talk Part 1: What is RL?<br>• MATLAB Tech Talk Part 2: Understanding the Environment and Rewards<br>• RL Textbook – Chapter 3+4: Finite MDP + Dynamic Programming<br>• WildML – Dynamic Programming exercises<br>• David Silver’s Lecture 1+2 |
2 | Temporal-Difference (TD) Learning<br>• Q-Learning<br>• SARSA | • TD Error<br>• On-policy vs off-policy<br>• Epsilon greedy | • RL Textbook – Chapter 6: Temporal Difference Learning<br>• WildML – SARSA, Q-Learning exercises |
3 | Function Approximation (replace table with neural network)<br>• Deep Q-Learning | **RL**<br>• Why tables cannot scale<br>• Relationship with supervised learning<br>• Replay memory<br>• Target network<br>• Partially observable environment<br>• Frame stacking for Atari game environment<br>• Typical DQN network<br>• Double Q-Learning<br>**Deep Learning**<br>• Supervised Learning<br>• Feedforward network<br>• Convolutional neural network | **RL**<br>• David Silver’s Lecture 6: Value function approximation<br>• WildML – Q-Learning with Linear Function Approximation<br>• DeepMind DQN paper<br>• WildML – Deep Q-Learning for Atari Games<br>• Arthur Juliani’s series Part 4 – Deep Q-Networks<br>• PyTorch DQN Tutorial<br>**Deep Learning**<br>• Deep Learning Specialization Course 1+2 |
4 | Policy gradient<br>• REINFORCE (vanilla policy gradient)<br>• Actor Critic | • Actor<br>• Critic<br>• Stochastic policy<br>• Statistics: distribution (focus on normal/Gaussian distribution), sample from a distribution, entropy, probability density function<br>• How to model discrete stochastic policy vs continuous stochastic policy<br>• Importance sampling<br>• KL divergence | • RL Textbook – Chapter 13: Policy Gradient Methods<br>• WildML – Policy Gradient exercises<br>• OpenAI Spinning Up – Vanilla Policy Gradient<br>• Deep RL Berkeley – Policy Gradients + Actor-Critic Algorithms |
5 | Advanced Policy Gradient<br>• Deep Deterministic Policy Gradient (DDPG)<br>• Twin Delayed DDPG (TD3)<br>• Proximal Policy Optimization (PPO)<br>• Trust Region Policy Optimization (TRPO) | • Continuous action space<br>• Deterministic policy<br>• Deterministic policy gradient | • Deep RL Berkeley – Advanced Policy Gradients<br>• Original papers<br>• OpenAI Spinning Up – PPO, TRPO, DDPG and TD3 |
6 | Partially Observable Environment<br>• Modify existing algorithms to work with recurrent neural network (RNN) | • Recurrent neural network (RNN)<br>• Backpropagation through time<br>• Observation stacking<br>• How to sample data out of replay memory for RNN update | • Arthur Juliani’s series Part 6 – Partial Observability and DRQN<br>• Deep Recurrent Q-Learning for Partially Observable MDPs<br>• Memory-based control with recurrent neural networks |
7 | Model-based<br>• Modify existing algorithms to utilize a model of the environment to simulate and plan | • Motivation: environment can be on actual hardware (high cost)<br>• Model: an approximation of the environment<br>• Environment step vs model step<br>• Model-based planning<br>• Model-based learning<br>• Parallelization for on-policy vs off-policy algorithms<br>• Gradient parallelization<br>• Experience parallelization | • RL Textbook – Chapter 8: Planning and Learning with Tabular Methods (8.1-8.4)<br>• Deep RL Berkeley – Model-based Planning<br>• Deep RL Berkeley – Model-based Reinforcement Learning |
8 | Parallelization<br>• A2C<br>• A3C<br>• IMPALA | • Parallelization for on-policy vs off-policy algorithms<br>• Gradient parallelization<br>• Experience parallelization | • Deep RL Berkeley – Distributed RL |
9 | Exploration | • Explore through sampling<br>• Intrinsic motivation<br>• Imitation learning | • Deep RL Berkeley – Exploration |
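
To make the Chapter 1 entries in the table above more concrete (discount factor, state value function V(s), value iteration), here is a minimal sketch of value iteration. The 3-state MDP, its transition probabilities and rewards, and the `value_iteration` helper are all assumptions invented for this example rather than anything from the listed learning materials.

```python
def value_iteration(transitions, gamma=0.9, tol=1e-8):
    """Compute optimal state values for a small finite MDP.

    transitions[s][a] is a list of (probability, next_state, reward) outcomes.
    A state with no actions is treated as terminal and keeps value 0.
    """
    values = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            if not actions:
                continue  # terminal state
            # Bellman optimality backup: best expected one-step return
            best = max(
                sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:  # stop once a full sweep barely changes any value
            return values


# Hypothetical MDP: from A you can stay or go to B; from B, "go" reaches the
# terminal state C with probability 0.8 (reward 1) or slips back to A.
mdp = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 0.0)]},
    "B": {"go": [(0.8, "C", 1.0), (0.2, "A", 0.0)]},
    "C": {},
}

if __name__ == "__main__":
    for state, value in value_iteration(mdp).items():
        print(state, round(value, 4))
```

For this MDP the fixed point can be checked by hand: V(B) = 0.8 + 0.2 · 0.9 · V(A) and V(A) = 0.9 · V(B), which gives V(B) = 0.8 / 0.838. Comparing code against a hand-solved case like this is a cheap way to catch bugs before moving on to larger problems.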
• Reinforcement Learning Toolbox, The MathWorks
• Reinforcement Learning: An Introduction (textbook), Sutton and Barto
• Deep Reinforcement Learning (course), UC Berkeley
• OpenAI Spinning Up (textbook/blog)
• WildML Learning Reinforcement Learning (python course with exercises/solutions), Denny Britz
• MATLAB RL Tech Talks (videos), The MathWorks
• David Silver’s RL course
• Simple Reinforcement Learning (blog), Arthur Juliani
• Deep Learning Specialization Coursera (course), Andrew Ng (you can audit for free, highly recommend course 1 + 2 to get Deep Learning foundations)