Deep Reinforcement Learning

|                          | Q-Value Methods                                  | Policy Methods                                                        |
|--------------------------|--------------------------------------------------|-----------------------------------------------------------------------|
| Network input            | State                                            | State                                                                 |
| Network output           | Predicted value (expected return) of every possible action | Predicted action probabilities (discrete) or action values (continuous) |
| Large action space       | ❌                                               | ✔️                                                                    |
| Continuous action space  | ❌                                               | ✔️                                                                    |
| Stochastic policies      | ❌                                               | ✔️                                                                    |
| Training loss function   | Temporal Difference loss                         | Policy gradient loss (maximize the expected return U(θ))              |
| Training speed           | TD is faster 🙂                                  | Slower 🙁                                                             |
Q-Value Methods (18 discrete actions):

- Do nothing
- ⬆️
- ↗️
- ➡️
- ↘️
- ⬇️
- ↙️
- ⬅️
- ↖️
- 🔴
- ⬆️+🔴
- ↗️+🔴
- ➡️+🔴
- ↘️+🔴
- ⬇️+🔴
- ↙️+🔴
- ⬅️+🔴
- ↖️+🔴

Policy Methods (3 outputs):

- 🔴 Probability of pressing the button (between 0 and 1)
- ↔️ Action on the x axis (between -1 and 1)
- ↕️ Action on the y axis (between -1 and 1)
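As a concrete illustration of this output difference, here is a minimal sketch assuming PyTorch (the class names, `state_dim`, and layer sizes are placeholders, not from the source): a Q-value network emits one value per discrete action, while a policy network for the 3-output example above emits a button probability plus two bounded continuous values.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q-value head: one estimated return per discrete action (18 in the example above)."""
    def __init__(self, state_dim, n_actions=18):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),              # raw Q-values, no activation
        )

    def forward(self, state):
        return self.net(state)                      # shape: (batch, 18)

class PolicyNetwork(nn.Module):
    """Policy head for the 3-output example: button probability + x/y axes."""
    def __init__(self, state_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU())
        self.button = nn.Linear(128, 1)             # 🔴 press or not
        self.axes = nn.Linear(128, 2)               # ↔️ and ↕️

    def forward(self, state):
        h = self.body(state)
        button_prob = torch.sigmoid(self.button(h))  # in [0, 1]
        xy = torch.tanh(self.axes(h))                # in [-1, 1]
        return button_prob, xy
```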
Source: What is the relation between Q-learning and policy gradients methods?
|
| Name      | Paper                     | Year |
|-----------|---------------------------|------|
| Baseline  | DQN: Deep Q-Learning      | 2013 |
| Improv. 1 | Double DQN (DDQN)         | 2015 |
| Improv. 2 | Prioritized DQN           | 2015 |
| Improv. 3 | Dueling DQN               | 2015 |
| Improv. 4 | A3C                       | 2016 |
| Improv. 5 | Noisy DQN                 | 2017 |
| Improv. 6 | Distributional DQN (C51)  | 2017 |
| Combine 6 | Rainbow                   | 2017 |

![](https://github.com/img/rainbow.png)
| Name                                                   | Paper (year) |
|--------------------------------------------------------|--------------|
| VPG: Vanilla Policy Gradient (aka REINFORCE)           | 1992         |
| TRPO: Trust Region Policy Optimization                 | 2015         |
| DDPG: Deep Deterministic Policy Gradients              | 2015         |
| A2C: Advantage Actor-Critic ⭐                          |              |
| A3C: Asynchronous Advantage Actor-Critic ⭐             | 2016         |
| PPO: Proximal Policy Optimization                      | 2017         |
| TD3: Twin Delayed Deep Deterministic Policy Gradients  | 2018         |
| SAC: Soft Actor-Critic                                 | 2018         |
| SAC-Discrete: Soft Actor-Critic for Discrete Actions   | 2019         |
What are Policy Gradient Methods?
- Policy methods search directly for the optimal policy, without simultaneously maintaining a value function.
- Policy gradient methods are a subtype of policy methods that estimate the optimal policy through gradient ascent.
Problem: maximize the expected return U(θ) = ∑_τ P(τ;θ) R(τ)

τ
: The trajectory, a state-action sequence.

R(τ)
: The return of the trajectory, i.e. its accumulated reward. (How good were my actions.)

P(τ;θ)
: The probability of that trajectory under the policy with parameters θ. (How confident I was in those actions.)

U(θ) is like the loss in deep learning, but instead of minimizing it, you maximize it with gradient ascent.
VPG: Vanilla Policy Gradient (aka REINFORCE)
1. Use the policy π (the network) to collect N trajectories τ (episodes).
2. Use the trajectories to estimate the gradient of the expected return U(θ): ∇U(θ) ≈ (1/N) ∑ᵢ ∇ log P(τᵢ;θ) R(τᵢ), as sketched in the code below.
3. Update the weights of the network (gradient ascent: θ ← θ + α∇U(θ)).
4. Loop over steps 1-3.
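A minimal REINFORCE-style sketch of one iteration of steps 1-3, assuming PyTorch, a Gymnasium-style environment API (`reset` returning `(obs, info)`, `step` returning a 5-tuple), and a `policy` module that returns a probability distribution over discrete actions; the function name and hyperparameters are illustrative.

```python
import torch

def reinforce_update(policy, optimizer, env, n_trajectories=10, gamma=0.99):
    """One VPG iteration: collect trajectories, estimate ∇U(θ), take an ascent step."""
    losses = []
    for _ in range(n_trajectories):
        state, _ = env.reset()
        log_probs, rewards = [], []
        done = False
        while not done:
            probs = policy(torch.as_tensor(state, dtype=torch.float32))
            dist = torch.distributions.Categorical(probs)
            action = dist.sample()                    # stochastic action picking
            state, reward, terminated, truncated, _ = env.step(action.item())
            done = terminated or truncated
            log_probs.append(dist.log_prob(action))
            rewards.append(reward)
        # Discounted return R(τ) of the whole trajectory
        R = sum(gamma ** t * r for t, r in enumerate(rewards))
        # Minimizing -Σ log π(a|s) · R(τ) has the same gradient as maximizing U(θ)
        losses.append(-torch.stack(log_probs).sum() * R)
    optimizer.zero_grad()
    torch.stack(losses).mean().backward()
    optimizer.step()    # gradient ascent on U(θ) via descent on the negated loss
```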
PPO: Proximal Policy Optimization
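PPO constrains each policy update by clipping the probability ratio between the new and the old policy, so the policy cannot move too far from the one that collected the data. A minimal sketch of that clipped surrogate loss (the function and tensor names here are illustrative, not tied to any specific library):

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO-Clip surrogate: limit how far the policy ratio can move in one update."""
    ratio = torch.exp(new_log_probs - old_log_probs)           # π_new(a|s) / π_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()               # negate: optimizers minimize
```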
Extra: AlphaGo → AlphaGo Zero → AlphaZero → MuZero
A fruitful relationship between neuroscience and AI
- Discrete (action probabilities):
  - Only one action at a time: Softmax
  - Multiple simultaneous actions: Sigmoid
  - Action picking:
    - Deterministic: always the most probable action.
    - Stochastic: sampled at random according to the probabilities.
- Continuous (action values):
  - [0, 1]: Sigmoid (e.g. throttle)
  - [-1, 1]: Tanh (e.g. steering wheel)
  - [0, inf]: ReLU
  - [-inf, inf]: Nothing (no activation)
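Roughly how these output choices look in PyTorch (a sketch: the tensors, sizes, and variable names are placeholders):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 4)                     # raw network output for 4 discrete actions

# Discrete, only one action at a time: softmax over the action logits
probs = F.softmax(logits, dim=-1)
deterministic_action = probs.argmax(dim=-1)                          # always the most probable
stochastic_action = torch.distributions.Categorical(probs).sample()  # sampled by probability

# Discrete, several independent actions at once: one sigmoid per action
multi_probs = torch.sigmoid(logits)

# Continuous, squashed into the required range
raw = torch.randn(1, 1)
throttle = torch.sigmoid(raw)   # [0, 1]
steering = torch.tanh(raw)      # [-1, 1]
positive = F.relu(raw)          # [0, inf]
unbounded = raw                 # [-inf, inf]: no activation
```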