
Deep Reinforcement Learning

|                         | Q-Value Methods | Policy Methods |
|-------------------------|-----------------|----------------|
| Network input           | State           | State          |
| Network output          | Predicted value (expected reward) of every possible action | The action itself (probabilities for discrete actions, values for continuous actions) |
| Large action space      |                 | ✔️ |
| Continuous action space |                 | ✔️ |
| Stochastic policies     |                 | ✔️ |
| Training objective      | Temporal Difference (TD) loss | Expected return U(θ), maximized with gradient ascent |
| Training speed          | TD is faster 🙂 | Slower 🙁 |

Output neurons for Atari

Q-Value Methods (18 output neurons):

  1. Do nothing
  2. ⬆️
  3. ↗️
  4. ➡️
  5. ↘️
  6. ⬇️
  7. ↙️
  8. ⬅️
  9. ↖️
  10. 🔴
  11. ⬆️+🔴
  12. ↗️+🔴
  13. ➡️+🔴
  14. ↘️+🔴
  15. ⬇️+🔴
  16. ↙️+🔴
  17. ⬅️+🔴
  18. ↖️+🔴

Policy Methods (3 output neurons):

  1. 🔴 Probability of pressing the button (between 0 and 1)
  2. ↔️ Action value on the x axis (between -1 and 1)
  3. ↕️ Action value on the y axis (between -1 and 1)
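
A minimal PyTorch sketch (not from the notebooks) of the two output heads above, assuming a shared feature extractor that already produces a 512-dimensional vector; the layer names and sizes are illustrative:

```python
import torch
import torch.nn as nn

N_FEATURES = 512        # assumed size of the shared feature vector
N_ATARI_ACTIONS = 18    # full Atari discrete action set

class QValueHead(nn.Module):
    """Q-value methods: one estimated action value per discrete action."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(N_FEATURES, N_ATARI_ACTIONS)

    def forward(self, features):
        return self.fc(features)  # shape (batch, 18), unbounded Q-values

class PolicyHead(nn.Module):
    """Policy methods: 3 outputs (button probability, x axis, y axis)."""
    def __init__(self):
        super().__init__()
        self.button = nn.Linear(N_FEATURES, 1)  # button press probability
        self.xy = nn.Linear(N_FEATURES, 2)      # continuous x/y actions

    def forward(self, features):
        button_prob = torch.sigmoid(self.button(features))  # in [0, 1]
        xy = torch.tanh(self.xy(features))                   # in [-1, 1]
        return button_prob, xy
```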

Source: What is the relation between Q-learning and policy gradient methods?

Part 1: Q-Value Methods

|           | Name                      | Paper |
|-----------|---------------------------|-------|
| Baseline  | DQN: Deep Q-Learning      | 2013  |
| Improv. 1 | Double DQN (DDQN)         | 2015  |
| Improv. 2 | Prioritized DQN           | 2015  |
| Improv. 3 | Dueling DQN               | 2015  |
| Improv. 4 | A3C                       | 2016  |
| Improv. 5 | Noisy DQN                 | 2017  |
| Improv. 6 | Distributional DQN (C51)  | 2017  |
| Combine 6 | Rainbow                   | 2017  |
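
To make the Temporal Difference loss from the comparison table concrete, here is a minimal sketch of the baseline DQN update, assuming a replay-buffer batch of (state, action, reward, next state, done) tensors; names are illustrative, not the repository's code:

```python
import torch
import torch.nn.functional as F

def dqn_td_loss(q_net, target_net, batch, gamma=0.99):
    """One-step TD loss for a baseline DQN (sketch).

    batch: (states, actions, rewards, next_states, dones) tensors,
    with actions as int64 indices and dones as 0/1 floats.
    """
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) for the actions that were actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped target: r + gamma * max_a' Q_target(s', a'), 0 if terminal
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1 - dones)

    # Huber loss is commonly used instead of plain MSE
    return F.smooth_l1_loss(q_values, targets)
```

Double DQN (Improv. 1) only changes the target: the next action is chosen with q_net but evaluated with target_net.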

Part 2: Policy Methods

| Name | Paper |
|------|-------|
| VPG: Vanilla Policy Gradient (aka REINFORCE) | 1992 |
| TRPO: Trust Region Policy Optimization | 2015 |
| DDPG: Deep Deterministic Policy Gradients | 2015 |
| A2C: Advantage Actor Critic ⭐ | |
| A3C: Asynchronous Advantage Actor Critic ⭐ | 2016 |
| PPO: Proximal Policy Optimization | 2017 |
| TD3: Twin Delayed Deep Deterministic Policy Gradients | 2018 |
| SAC: Soft Actor-Critic | 2018 |
| SAC-Discrete: Soft Actor-Critic for Discrete Actions | 2019 |

What are Policy Gradient Methods?

  • Policy methods search directly for the optimal policy, without simultaneously maintaining a value function.
  • Policy gradient methods are a subtype of policy methods that estimate the optimal policy through gradient ascent.

Problem: Maximize the expected return U(θ) = ∑_τ P(τ,θ) R(τ)

  • τ: a trajectory, i.e. a state-action sequence.
  • R(τ): the return of the trajectory, the sum of the rewards collected along it. (How good the actions were)
  • P(τ,θ): the probability of that trajectory under the policy with parameters θ. (How likely the policy is to pick those actions)

It is like the loss in deep learning, but instead of minimizing it, you maximize it with gradient ascent.
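
Since summing over every possible trajectory is intractable, the gradient of U(θ) is estimated from sampled trajectories. A standard form of this likelihood-ratio (REINFORCE) estimator, assuming N sampled trajectories of horizon H, is:

$$\nabla_\theta U(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{H} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right) R\!\left(\tau^{(i)}\right)$$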

VPG: Vanilla Policy Gradient (aka REINFORCE)

  1. Use the policy π (network) to collect N trajectories τ (episodes).
  2. Use the trajectories to estimate the gradient of the expected return U(θ).
  3. Update the weights of the network (gradient ascent: θ ← θ + α∇U(θ)); see the sketch below.
  4. Repeat steps 1-3.
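
A minimal PyTorch sketch of one such update (illustrative names, simplified to undiscounted returns, not the notebook code): the surrogate loss is the negative of the estimator above, so minimizing it with a standard optimizer performs gradient ascent on U(θ).

```python
import torch

def reinforce_update(optimizer, trajectories):
    """One VPG/REINFORCE update from N collected trajectories (sketch).

    trajectories: list of (log_probs, rewards) pairs, where log_probs is a
    tensor of log pi_theta(a_t | s_t) produced by the policy network (so that
    gradients flow back to its weights) and rewards is the list of per-step
    rewards of that episode. optimizer holds the policy network's parameters.
    """
    losses = []
    for log_probs, rewards in trajectories:
        episode_return = sum(rewards)          # R(tau): total return
        # Negative sign: minimizing this loss == gradient ascent on U(theta)
        losses.append(-log_probs.sum() * episode_return)

    loss = torch.stack(losses).mean()          # average over the N trajectories
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```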

PPO: Proximal Policy Optimization
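
The key idea of PPO is a clipped surrogate objective that keeps the updated policy close to the policy that collected the data, by clipping the probability ratio rₜ(θ) = πθ(aₜ|sₜ) / πθ_old(aₜ|sₜ). A minimal sketch, assuming the advantages have already been estimated (illustrative, not the notebook code):

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate loss (sketch, advantages assumed precomputed).

    L = -E[ min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t) ]
    where r_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t).
    """
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negative sign: minimizing this loss maximizes the clipped objective
    return -torch.min(unclipped, clipped).mean()
```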

Part 3: Multi-Agent RL

Extra: AlphaGo → AlphaGo Zero → AlphaZero → MuZero

Extra 2: Dopamine

A fruitful relationship between neuroscience and AI

Actions

  • Discrete: (action probabilities)
    • Only one action at a time: Softmax
    • Multiple simultaneous actions: Sigmoid
    • Action picking (see the sketch after this list):
      • Deterministic: always the most probable action.
      • Stochastic: sampled randomly according to the probabilities.
  • Continuous: (action values)
    • [0, 1]: Sigmoid (e.g. throttle)
    • [-1, 1]: Tanh (e.g. steering wheel)
    • [0, inf]: ReLU
    • [-inf, inf]: Nothing (no activation)
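
A small sketch of how these activations and the deterministic vs. stochastic picking could look in PyTorch (illustrative helper functions, not the notebook code):

```python
import torch

def pick_discrete_action(logits, stochastic=True):
    """Single discrete action: softmax probabilities over the logits."""
    probs = torch.softmax(logits, dim=-1)
    if stochastic:
        # Sample randomly according to the probabilities
        return torch.distributions.Categorical(probs=probs).sample()
    # Deterministic: always the most probable action
    return probs.argmax(dim=-1)

def squash_continuous_action(raw, low_high):
    """Continuous action value squashed into the desired range."""
    if low_high == (0, 1):
        return torch.sigmoid(raw)   # e.g. throttle
    if low_high == (-1, 1):
        return torch.tanh(raw)      # e.g. steering wheel
    if low_high == (0, float("inf")):
        return torch.relu(raw)
    return raw                       # unbounded: no activation
```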

References