Deep Reinforcement Learning

|                          | Q-Value Methods                                  | Policy Methods                                                        |
|--------------------------|--------------------------------------------------|-----------------------------------------------------------------------|
| Network input            | State                                            | State                                                                 |
| Network output           | Predicted value (expected return) of every possible action | Predicted action probabilities (discrete) or action values (continuous) |
| Large action space       | ❌                                               | ✔️                                                                    |
| Continuous action space  | ❌                                               | ✔️                                                                    |
| Stochastic policies      | ❌                                               | ✔️                                                                    |
| Training loss function   | Temporal Difference loss                         | Policy gradient loss (maximize the expected return U(θ))              |
| Training speed           | TD is faster 🙂                                  | Slower 🙁                                                             |
Q-Value Methods (18 discrete actions):

- Do nothing
- ⬆️
- ↗️
- ➡️
- ↘️
- ⬇️
- ↙️
- ⬅️
- ↖️
- 🔴
- ⬆️+🔴
- ↗️+🔴
- ➡️+🔴
- ↘️+🔴
- ⬇️+🔴
- ↙️+🔴
- ⬅️+🔴
- ↖️+🔴

Policy Methods (3 outputs):

- 🔴 Probability of pressing the button (between 0 and 1)
- ↔️ Action on the x axis (between -1 and 1)
- ↕️ Action on the y axis (between -1 and 1)
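As a concrete illustration of this output difference, here is a minimal sketch assuming PyTorch (the class names, `state_dim`, and layer sizes are placeholders, not from the source): a Q-value network emits one value per discrete action, while a policy network for the 3-output example above emits a button probability plus two bounded continuous values.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q-value head: one estimated return per discrete action (18 in the example above)."""
    def __init__(self, state_dim, n_actions=18):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),              # raw Q-values, no activation
        )

    def forward(self, state):
        return self.net(state)                      # shape: (batch, 18)

class PolicyNetwork(nn.Module):
    """Policy head for the 3-output example: button probability + x/y axes."""
    def __init__(self, state_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU())
        self.button = nn.Linear(128, 1)             # 🔴 press or not
        self.axes = nn.Linear(128, 2)               # ↔️ and ↕️

    def forward(self, state):
        h = self.body(state)
        button_prob = torch.sigmoid(self.button(h))  # in [0, 1]
        xy = torch.tanh(self.axes(h))                # in [-1, 1]
        return button_prob, xy
```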
Source: What is the relation between Q-learning and policy gradients methods?
|
| Name      | Paper                     | Year |
|-----------|---------------------------|------|
| Baseline  | DQN: Deep Q-Learning      | 2013 |
| Improv. 1 | Double DQN (DDQN)         | 2015 |
| Improv. 2 | Prioritized DQN           | 2015 |
| Improv. 3 | Dueling DQN               | 2015 |
| Improv. 4 | A3C                       | 2016 |
| Improv. 5 | Noisy DQN                 | 2017 |
| Improv. 6 | Distributional DQN (C51)  | 2017 |
| Combine 6 | Rainbow                   | 2017 |

![](https://github.com/img/rainbow.png)
| Name                                                   | Paper (year) |
|--------------------------------------------------------|--------------|
| VPG: Vanilla Policy Gradient (aka REINFORCE)           | 1992         |
| TRPO: Trust Region Policy Optimization                 | 2015         |
| DDPG: Deep Deterministic Policy Gradients              | 2015         |
| A2C: Advantage Actor-Critic ⭐                          |              |
| A3C: Asynchronous Advantage Actor-Critic ⭐             | 2016         |
| PPO: Proximal Policy Optimization                      | 2017         |
| TD3: Twin Delayed Deep Deterministic Policy Gradients  | 2018         |
| SAC: Soft Actor-Critic                                 | 2018         |
| SAC-Discrete: Soft Actor-Critic for Discrete Actions   | 2019         |
What are Policy Gradient Methods?
- Policy methods search directly for the optimal policy, without simultaneously maintaining a value function.
- Policy gradient methods are a subtype of policy methods that estimate the optimal policy through gradient ascent.
Problem: maximize the expected return U(θ) = ∑_τ P(τ;θ) R(τ)

τ
: The trajectory, a state-action sequence.

R(τ)
: The return of the trajectory, i.e. its accumulated reward. (How good were my actions.)

P(τ;θ)
: The probability of that trajectory under the policy with parameters θ. (How confident I was in those actions.)

U(θ) is like the loss in deep learning, but instead of minimizing it, you maximize it with gradient ascent.
VPG: Vanilla Policy Gradient (aka REINFORCE)
1. Use the policy π (the network) to collect N trajectories τ (episodes).
2. Use the trajectories to estimate the gradient of the expected return U(θ): ∇U(θ) ≈ (1/N) ∑ᵢ ∇ log P(τᵢ;θ) R(τᵢ), as sketched in the code below.
3. Update the weights of the network (gradient ascent: θ ← θ + α∇U(θ)).
4. Loop over steps 1-3.
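A minimal REINFORCE-style sketch of one iteration of steps 1-3, assuming PyTorch, a Gymnasium-style environment API (`reset` returning `(obs, info)`, `step` returning a 5-tuple), and a `policy` module that returns a probability distribution over discrete actions; the function name and hyperparameters are illustrative.

```python
import torch

def reinforce_update(policy, optimizer, env, n_trajectories=10, gamma=0.99):
    """One VPG iteration: collect trajectories, estimate ∇U(θ), take an ascent step."""
    losses = []
    for _ in range(n_trajectories):
        state, _ = env.reset()
        log_probs, rewards = [], []
        done = False
        while not done:
            probs = policy(torch.as_tensor(state, dtype=torch.float32))
            dist = torch.distributions.Categorical(probs)
            action = dist.sample()                    # stochastic action picking
            state, reward, terminated, truncated, _ = env.step(action.item())
            done = terminated or truncated
            log_probs.append(dist.log_prob(action))
            rewards.append(reward)
        # Discounted return R(τ) of the whole trajectory
        R = sum(gamma ** t * r for t, r in enumerate(rewards))
        # Minimizing -Σ log π(a|s) · R(τ) has the same gradient as maximizing U(θ)
        losses.append(-torch.stack(log_probs).sum() * R)
    optimizer.zero_grad()
    torch.stack(losses).mean().backward()
    optimizer.step()    # gradient ascent on U(θ) via descent on the negated loss
```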
PPO: Proximal Policy Optimization
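PPO constrains each policy update by clipping the probability ratio between the new and the old policy, so the policy cannot move too far from the one that collected the data. A minimal sketch of that clipped surrogate loss (the function and tensor names here are illustrative, not tied to any specific library):

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO-Clip surrogate: limit how far the policy ratio can move in one update."""
    ratio = torch.exp(new_log_probs - old_log_probs)           # π_new(a|s) / π_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()               # negate: optimizers minimize
```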
Extra: AlphaGo → AlphaGo Zero → AlphaZero → MuZero
A fruitful relationship between neuroscience and AI
- Discrete (action probabilities):
  - Only one action at a time: Softmax
  - Multiple simultaneous actions: Sigmoid
  - Action picking:
    - Deterministic: always the most probable action.
    - Stochastic: sampled at random according to the probabilities.
- Continuous (action values):
  - [0, 1]: Sigmoid (e.g. throttle)
  - [-1, 1]: Tanh (e.g. steering wheel)
  - [0, inf]: ReLU
  - [-inf, inf]: Nothing (no activation)
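Roughly how these output choices look in PyTorch (a sketch: the tensors, sizes, and variable names are placeholders):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 4)                     # raw network output for 4 discrete actions

# Discrete, only one action at a time: softmax over the action logits
probs = F.softmax(logits, dim=-1)
deterministic_action = probs.argmax(dim=-1)                          # always the most probable
stochastic_action = torch.distributions.Categorical(probs).sample()  # sampled by probability

# Discrete, several independent actions at once: one sigmoid per action
multi_probs = torch.sigmoid(logits)

# Continuous, squashed into the required range
raw = torch.randn(1, 1)
throttle = torch.sigmoid(raw)   # [0, 1]
steering = torch.tanh(raw)      # [-1, 1]
positive = F.relu(raw)          # [0, inf]
unbounded = raw                 # [-inf, inf]: no activation
```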