
GAC²EIR

Clean and self-contained implementations of Reinforcement Learning agents for solving GAC²E.

  • PPO (Probabilistic)
  • TD3 (Deterministic)
  • SAC (Probabilistic)

Usage

Cleanup:

$ make clean wipe kill

This cleans the model directory, wipes the run directory and kills all running Spectre instances.

Notes on Algorithms

My personal notes on implementing the algorithms.

Proximal Policy Optimization (PPO)

Proximal Policy Optimization Algorithm

  • Keep track of a small, fixed-length batch of trajectories (s, a, r, d, v, l), i.e. state, action, reward, done flag, value estimate and action log-probability
  • Train for multiple epochs on each batch
  • Split the stored memories into batch-sized chunks
  • The critic only criticises states (it estimates V(s), not Q(s, a))
  • The actor outputs probabilities for taking an action (probabilistic policy); see the memory sketch after this list
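A minimal sketch of such a fixed-length memory, assuming NumPy and that the tuple elements are state, action, reward, done flag, value estimate and log-probability (the class and method names are illustrative, not the repository's actual API):

    import numpy as np

    class PPOMemory:
        """Fixed-length rollout memory of (s, a, r, d, v, l) tuples."""

        def __init__(self, batch_size: int):
            self.batch_size = batch_size
            self.clear()

        def clear(self):
            self.states, self.actions, self.rewards = [], [], []
            self.dones, self.values, self.log_probs = [], [], []

        def store(self, s, a, r, d, v, l):
            # Append one transition to every buffer.
            for buf, x in zip((self.states, self.actions, self.rewards,
                               self.dones, self.values, self.log_probs),
                              (s, a, r, d, v, l)):
                buf.append(x)

        def batches(self):
            # Shuffled, batch-sized index chunks over the stored trajectories.
            idx = np.random.permutation(len(self.states))
            return [idx[i:i + self.batch_size]
                    for i in range(0, len(idx), self.batch_size)]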

Hyper Parameters

  • Memory Size
  • Batch size
  • Number of Epochs

Network Updates

Actor

Conservative Policy Iteration (CPI):

$$ L^{CPI} (\theta) = E_{t} \left( \frac{\pi_{\theta} (a_{t} | s_{t})}{\pi_{\theta_{old}} (a_{t} | s_{t})} \cdot A_{t} \right) = E_{t} (r_{t}(\theta) \cdot A_{t}) $$

Where

  • A: Advantage
  • E: Expectation
  • π: Actor network returning the probability of an action a for a given state s at a given time t
  • θ: Current network parameters

$$ L^{CLIP} = E_{t} ( min(r_{t}(\theta) \cdot A_{t}, clip(r_{t}(\theta), 1 - \epsilon, 1 + \epsilon) \cdot A_{t} ) ) $$

Where

  • ε ≈ 0.2

The clipped objective is a pessimistic lower bound of the unclipped one.
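A minimal sketch of the clipped surrogate objective, assuming PyTorch tensors of new and old action log-probabilities and precomputed advantages (the function name and signature are illustrative):

    import torch

    def clipped_surrogate(log_probs_new, log_probs_old, advantages, eps=0.2):
        """Pessimistic PPO actor objective L^CLIP (to be maximized)."""
        ratio = torch.exp(log_probs_new - log_probs_old)              # r_t(theta)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
        return torch.min(unclipped, clipped).mean()                   # E_t[min(...)]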

Advantage

Estimates how much better the observed outcome is than the critic's value estimate V(s_t).

$$ A_{t} = \delta_{t} + (\gamma \lambda) \cdot \delta_{t + 1} + ... + (\gamma \lambda)^{T - (t + 1)} \cdot \delta_{T - 1} $$ with

$$ \delta_{t} = r_{t} + \gamma \cdot V(s_{t + 1}) - V(s_{t}) $$

Where

  • V(s_t): Critic output, aka Estimated Value (stored in memory)
  • γ ≈ 0.95
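A sketch of the advantage computation, assuming NumPy arrays of rewards, done flags and value estimates (with V(s_T) appended as the last entry); the GAE smoothing factor λ is an assumption since the notes only give γ:

    import numpy as np

    def advantages_and_returns(rewards, values, dones, gamma=0.95, lam=0.95):
        """Generalized advantage estimates A_t, accumulated backwards in time."""
        T = len(rewards)
        adv = np.zeros(T)
        last = 0.0
        for t in reversed(range(T)):
            non_terminal = 1.0 - dones[t]
            delta = rewards[t] + gamma * values[t + 1] * non_terminal - values[t]
            last = delta + gamma * lam * non_terminal * last
            adv[t] = last
        returns = adv + values[:-1]      # return = advantage + value (critic target)
        return adv, returns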

Critic

return = advantage + value

Where value is critic output stored in memory

$$ L^{VF} = \text{MSE}(\text{return}, \text{value}) $$

Total Loss

$$ L^{CLIP + VF + S}_{t} (\theta) = E_{t} [ L^{CLIP}_{t} (\theta) - c_{1} \cdot L^{VF}_{t} (\theta) + c_{2} \cdot S[\pi_{\theta}](s_{t}) ] $$

Gradient Ascent, not Descent!

  • S: entropy bonus, only used with a shared actor-critic network
  • c1 = 0.5
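A sketch of how the terms can be combined; c2 is an assumed entropy coefficient since the notes only give c1 = 0.5, and the sign is flipped because optimizers minimize:

    def ppo_total_loss(l_clip, l_vf, entropy, c1=0.5, c2=0.01):
        """Combined PPO objective; negated so a minimizing optimizer performs gradient ascent."""
        return -(l_clip - c1 * l_vf + c2 * entropy)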

Twin Delayed Deep Deterministic Policy Gradient (TD3)

Addressing Function Approximation Error in Actor-Critic Methods

Hyper Parameters

  • Update Interval
  • Number of Epochs
  • Number of Samples

Loss

$$ \nabla_{\phi} J(\phi) = \frac{1}{N} \sum \nabla_{a} Q_{\theta_{1}} (s,a) |_{a = \pi_{\phi} (s)} \cdot \nabla_{\phi} \pi_{\phi} (s) $$

Where

  • π: Policy network with parameters φ
  • Gradient of the first critic w.r.t. the actions chosen by the actor
  • Gradient of the policy network w.r.t. its own parameters

Chain rule applied to loss function
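A minimal sketch of this objective in PyTorch, assuming `actor(states)` returns actions and `critic_1(states, actions)` returns Q-values (the module interfaces are assumptions):

    import torch

    def td3_actor_loss(actor, critic_1, states):
        """Deterministic policy gradient objective: maximize Q1(s, pi(s)).
        Negated because the optimizer performs gradient descent."""
        actions = actor(states)                    # actions from the online actor, not from memory
        return -critic_1(states, actions).mean()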

Network Updates

Initialize Target Networks with parameters from online networks.

$$ \theta_{i}' \leftarrow \tau \cdot \theta_{i} + (1 - \tau) \cdot \theta_{i}' $$ $$ \phi' \leftarrow \tau \cdot \phi + (1 - \tau) \cdot \phi' $$

Where

  • τ ≈ 0.005

Soft update: the target networks keep most of their current parameters and only mix in a small fraction (τ) of the online networks' parameters.

This is not done every step, only after each (delayed) actor update.
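A sketch of the soft (Polyak) update for PyTorch modules; the function name is illustrative:

    import torch

    @torch.no_grad()
    def soft_update(online: torch.nn.Module, target: torch.nn.Module, tau: float = 0.005):
        """target <- tau * online + (1 - tau) * target, parameter by parameter."""
        for p, p_t in zip(online.parameters(), target.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)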

Actor

  • Randomly sample trajectories from replay buffer (s,a,r,s')
  • Use actor to determine actions for sampled states (don't use actions from memory)
  • Use sampled states and newly found actions to get values from critic
    • Only the first critic, never the second!
  • Take gradient w.r.t. actor network parameters
  • Only every n-th step (a hyperparameter of the algorithm)

Critic

  • Randomly sample trajectories from replay buffer (s,a,r,s')
  • Run the new states through π'(s'), where π' is the target actor
  • Add noise and clip

$$ \tilde{a} \leftarrow \pi_{\phi'} (s') + \epsilon $$

with

$$ \epsilon \sim \text{clip}(\mathcal{N}(0, \sigma), -c, c) $$

Where

  • σ ≈ 0.2, noise standard deviation
  • c ≈ 0.5, noise clipping
  • γ ≈ 0.99, discount factor

$$ y \leftarrow r + \gamma \cdot \min( Q_{\theta_{1}}'(s', \tilde{a}), Q_{\theta_{2}}'(s', \tilde{a}) ) $$

$$ \theta_{i} \leftarrow \text{argmin}_{\theta_{i}} ( N^{-1} \cdot \sum ( y - Q_{\theta_{i}} (s,a))^{2} ) $$
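A sketch of the target computation and critic loss, assuming PyTorch target networks `actor_t`, `critic_1_t`, `critic_2_t` with the interfaces used above; the terminal-state masking via `dones` is a common addition not spelled out in the notes:

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def td3_target(actor_t, critic_1_t, critic_2_t, rewards, next_states, dones,
                   gamma=0.99, sigma=0.2, c=0.5):
        """y = r + gamma * min(Q'_1, Q'_2)(s', pi'(s') + eps), with clipped noise."""
        next_actions = actor_t(next_states)
        noise = (torch.randn_like(next_actions) * sigma).clamp(-c, c)
        next_actions = next_actions + noise                # optionally clip to the action bounds
        q_min = torch.min(critic_1_t(next_states, next_actions),
                          critic_2_t(next_states, next_actions))
        return rewards + gamma * (1.0 - dones) * q_min

    def td3_critic_loss(critic, states, actions, y):
        """Mean squared error between the critic estimate and the target y."""
        return F.mse_loss(critic(states, actions), y)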

Soft Actor Critic (SAC)

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Note: Entropy here roughly means the randomness of the actions; its influence is controlled via reward scaling.

$$ \log( \pi (a|s) ) = \log (\mu (a|s)) - \sum_{i=1}^{D} \log ( 1 - \tanh^{2} (a_{i}) ) $$

Where

  • μ: Density of the sampled (pre-squash) action under the policy distribution (NOT the mean)
  • π: Probability of selecting this particular action a given state s
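A sketch of this correction with a reparameterized, tanh-squashed Gaussian in PyTorch (the small epsilon is an assumed numerical guard):

    import torch

    def squashed_sample(mean, log_std, eps=1e-6):
        """Sample an action and return its tanh-corrected log-probability."""
        std = log_std.exp()
        u = mean + std * torch.randn_like(mean)             # reparameterized pre-squash sample
        action = torch.tanh(u)
        log_mu = torch.distributions.Normal(mean, std).log_prob(u).sum(dim=-1)
        correction = torch.log(1.0 - action.pow(2) + eps).sum(dim=-1)
        return action, log_mu - correction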

Hyper Parameters

  • Target smoothing coefficient
  • Target update interval
  • Replay buffer size
  • Gradient steps

Actor Update

$$ J_{\pi} = N^{-1} \cdot \sum ( \log(\pi (a_{t} | s_{t})) - Q_{min}(s_{t}, a_{t}) ) $$

Where

  • s_t is sampled from replay buffer / memory
  • a_t is generated with actor network given sampled states
  • Qmin is minimum of 2 critics
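A sketch of the actor objective, assuming the log-probabilities and both critic outputs have already been computed for the freshly sampled actions:

    import torch

    def sac_actor_loss(log_probs, q1, q2):
        """J = mean( log pi(a|s) - min(Q1, Q2)(s, a) ), minimized by the actor optimizer."""
        return (log_probs - torch.min(q1, q2)).mean()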

Value Update

$$ J_{V} = N^{-1} \cdot \sum \frac{1}{2} \cdot ( V(s_{t}) - ( Q_{min} (s_{t}, a_{t}) - \log(\pi (a_{t} | s_{t})) ) )^{2} $$

Where

  • V(s_t): value network output for the sampled states
  • s_t: sampled states from memory
  • a_t: newly computed actions
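A sketch of the value objective; here V(s_t) is recomputed by the value network on the sampled states so gradients can flow into ψ (an assumption about the implementation), and the target is detached:

    import torch
    import torch.nn.functional as F

    def sac_value_loss(value_net, states, q_min, log_probs):
        """J_V = 1/2 * mean( (V(s) - (Q_min(s, a) - log pi(a|s)))^2 )."""
        target = (q_min - log_probs).detach()
        return 0.5 * F.mse_loss(value_net(states), target)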

Critic

$$ J_{1} = N^{-1} \sum \frac{1}{2} \cdot ( Q_{1}(s_{t}, a_{t}) - Q_{1}'(s_{t}, a_{t}))^{2} $$

$$ J_{2} = N^{-1} \sum \frac{1}{2} \cdot ( Q_{2}(s_{t}, a_{t}) - Q_{2}'(s_{t}, a_{t}))^{2} $$

$$ Q' = r_{scaled} + \gamma \cdot V'(s_{t + 1}) $$

Where

  • Both critics get updated
  • Both actions and states are sampled from memory
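A sketch of the critic updates; γ, the reward scale and the terminal-state masking are assumed defaults, since the notes only say the reward is scaled:

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def sac_q_target(value_target_net, rewards, next_states, dones,
                     gamma=0.99, reward_scale=5.0):
        """Q' = r_scaled + gamma * V'(s_{t+1}), zeroed at terminal states."""
        return reward_scale * rewards + gamma * (1.0 - dones) * value_target_net(next_states)

    def sac_critic_loss(q_net, states, actions, q_target):
        """J_i = 1/2 * mean( (Q_i(s, a) - Q')^2 ), applied to both critics."""
        return 0.5 * F.mse_loss(q_net(states, actions), q_target)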

Network Updates

$$ \psi' \leftarrow \tau \cdot \psi + (1 - \tau) \cdot \psi' $$

Where

  • τ ≈ 0.005

Things TODO

  • Implement replay buffer / memory as algebraic data type
  • Include step count in reward
  • Try Discrete action spaces
  • Normalize and/or reduce observation space
  • Consider previous reward
  • Return trained models instead of loss
  • Handle done flags for parallel environments
  • Give higher reward for finishing earlier