Writing code to reproduce the results of the simulations from Karpathy's .../reinforcejs/gridworld_dp.html.
Component | Description |
---|---|
$s$, $S_t$ | State in the environment |
$a$, $A_t$ | Action of a policy |
$Q(s, a)$ | Estimated value of taking action $a$ in state $s$ |
$Q(s', a')$ | Estimated value of taking the next action $a'$ in the next state $s'$ |
$\pi(s)$ | Policy of the RL agent, which maps states to actions |
$\pi(a \mid s)$ | Probability of selecting action $a$ in state $s$ |
$r$ | Immediate reward received from taking action $a$ in state $s$ |
$r_{t+1}$ | Reward received after taking action $A_t$ in state $S_t$ |
$\alpha$ | Learning rate |
$\gamma$ | Discount factor |
$\sum_a$ | Sum over all possible actions $a$ |
$$ V(S_t) \leftarrow \mathbb{E}_{\pi} [ r_{t+1} + \gamma V(S_{t+1}) ] $$
Implementing code for the simulations on GridWorld: DP
Policy Iteration
- converges to $v_*(s)$

Steps (applied iteratively):
1. Get $v_i$ by doing Policy Evaluation of policy $\pi_{i-1}$ using $v_{i-1}$.
2. Update the policy to $\pi_i$ by doing Policy Improvement using $v_i$.
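A minimal Python sketch of this loop, assuming a tabular GridWorld whose model is exposed as `transitions[s][a] -> (next_state, reward)` together with a set of terminal states and a discount factor `gamma` (these names are illustrative, not the repo's actual API):

```python
import numpy as np

def policy_evaluation(policy, transitions, terminals, gamma=0.9, theta=1e-6):
    """Sweep the Bellman expectation backup until the value function stabilises."""
    V = np.zeros(len(transitions))
    while True:
        V_new = V.copy()                               # synchronous sweep
        for s in range(len(transitions)):
            if s in terminals:
                continue
            s_next, r = transitions[s][policy[s]]      # deterministic policy and model
            V_new[s] = r + gamma * V[s_next]
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new

def policy_improvement(V, transitions, terminals, gamma=0.9):
    """Make the policy greedy with respect to the current value function."""
    policy = np.zeros(len(transitions), dtype=int)
    for s in range(len(transitions)):
        if s in terminals:
            continue
        policy[s] = int(np.argmax([r + gamma * V[s_next]
                                   for s_next, r in transitions[s]]))
    return policy

def policy_iteration(transitions, terminals, gamma=0.9):
    policy = np.zeros(len(transitions), dtype=int)     # arbitrary initial policy
    while True:
        V = policy_evaluation(policy, transitions, terminals, gamma)
        new_policy = policy_improvement(V, transitions, terminals, gamma)
        if np.array_equal(new_policy, policy):         # policy stable => optimal
            return policy, V
        policy = new_policy
```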
Value Iteration
- converges to $v_*(s)$

Steps:
1. Find the optimal value function $v_*$ using only the action which leads to the successor state with maximal value.
2. One update of the policy using the optimal value function found in step 1.
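A minimal sketch of Value Iteration under the same illustrative model assumptions (`transitions[s][a] -> (next_state, reward)`, a set of terminal states, discount `gamma`):

```python
import numpy as np

def value_iteration(transitions, terminals, gamma=0.9, theta=1e-6):
    V = np.zeros(len(transitions))
    while True:
        V_new = V.copy()
        for s in range(len(transitions)):
            if s in terminals:
                continue
            # Bellman optimality backup: keep only the best action's value.
            V_new[s] = max(r + gamma * V[s_next] for s_next, r in transitions[s])
        converged = np.max(np.abs(V_new - V)) < theta
        V = V_new
        if converged:
            break
    # Single policy update: act greedily with respect to the optimal values.
    policy = [0 if s in terminals else
              int(np.argmax([r + gamma * V[s_next] for s_next, r in transitions[s]]))
              for s in range(len(transitions))]
    return policy, V
```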
Investigations: Iteratively doing one policy evaluation step followed by policy improvement, and then using that improved policy in the next evaluation step, yields the same value function at each iteration as value iteration. In other words, directly updating the policy and then evaluating it is equivalent to always taking only the action which leads to the successor state with maximal value.
Monte Carlo Methods
- learn the state-value function for a given policy
- update only when the end of an episode is reached
- converges to $v_{\pi}(s)$ as the number of visits to $s$ goes to infinity
- only sample experience is required; no complete probability distributions of all possible transitions are needed

Idea: Estimate the value of a state from experience by averaging the returns observed after visiting that state. As more returns are observed, the average should converge to the expected value.

- Visit to $s$ = each occurrence of state $s$ in an episode.
- First-visit MC method = estimates $v_{\pi}(s)$ as the average of the returns following first visits to $s$.
- Every-visit MC method = estimates $v_{\pi}(s)$ as the average of the returns following all visits to $s$.
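A first-visit MC prediction sketch, assuming an episodic environment exposing `reset()` and `step(action) -> (next_state, reward, done)` and a fixed `policy(state) -> action` (again, illustrative names, not the repo's API):

```python
from collections import defaultdict

def first_visit_mc(env, policy, n_episodes=1000, gamma=0.9):
    returns = defaultdict(list)            # state -> observed returns
    V = defaultdict(float)
    for _ in range(n_episodes):
        # Generate one complete episode following the given policy.
        episode, state, done = [], env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state
        # Walk backwards, accumulating the return G that follows each visit.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:        # only the first visit contributes
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return V
```

Every-visit MC is the same loop without the `first_visit[s] == t` check: every occurrence of a state contributes the return that follows it.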
Temporal-Difference (TD) Learning
- update online, after each step
Implementing code for the simulations on GridWorld: TD
TD(0)
- converges to $v_{\pi}(s)$
- learns the state-value function for a given policy

Investigations: TD(0) converges to the correct answer (the estimated value function for the given policy) but not to the optimal policy!
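A TD(0) prediction sketch under the same assumed environment interface; the value of a state is nudged towards the bootstrapped target after every single step:

```python
from collections import defaultdict

def td0(env, policy, n_episodes=1000, alpha=0.1, gamma=0.9):
    V = defaultdict(float)
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # Bootstrapped target; terminal states have value 0.
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])    # online update after each step
            state = next_state
    return V
```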
SARSA
- used to estimate the optimal action-value function $Q_*(s, a)$
- converges to $Q_*(s, a)$
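An on-policy SARSA sketch, assuming the same environment interface plus a discrete `actions` list; the `epsilon_greedy` helper is defined here for illustration:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, n_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(n_episodes):
        state, done = env.reset(), False
        action = epsilon_greedy(Q, state, actions, epsilon)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            # On-policy target: uses the action that will actually be taken next.
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```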
Q-Learning
- used to estimate the optimal action-value function $Q_*(s, a)$
- converges to $Q_*(s, a)$
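A Q-learning sketch; it reuses the assumed environment interface and the `epsilon_greedy` helper from the SARSA sketch above, but bootstraps from the greedy action rather than the action actually taken next:

```python
from collections import defaultdict

def q_learning(env, actions, n_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            action = epsilon_greedy(Q, state, actions, epsilon)  # behaviour policy
            next_state, reward, done = env.step(action)
            # Off-policy target: uses the greedy (max) action, not the one taken next.
            best_next = max(Q[(next_state, a)] for a in actions)
            target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```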
Expected SARSA
- similar to SARSA
- but takes the expected value over all possible actions $a$, instead of using the actual next action, to estimate the next state-action value
- converges to $Q_*(s, a)$
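An Expected SARSA sketch (same assumptions as the SARSA sketch): the target averages $Q$ over the $\epsilon$-greedy action probabilities in the next state instead of sampling one next action:

```python
from collections import defaultdict

def expected_sarsa(env, actions, n_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            action = epsilon_greedy(Q, state, actions, epsilon)  # helper from the SARSA sketch
            next_state, reward, done = env.step(action)
            # Expected value of Q(next_state, .) under the epsilon-greedy policy.
            greedy = max(actions, key=lambda a: Q[(next_state, a)])
            probs = {a: epsilon / len(actions) for a in actions}
            probs[greedy] += 1.0 - epsilon
            expected = sum(p * Q[(next_state, a)] for a, p in probs.items())
            target = reward + (0.0 if done else gamma * expected)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```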
Problem | Goal | Examples |
---|---|---|
Prediction | Evaluate a given policy: how much reward are we going to get for a given policy? | Iterative Policy Evaluation, TD(lambda), First-Visit MC, Every-Visit MC |
Control | Find the optimal policy: what is the most total reward we can get out of our MDP? | Policy Iteration, Value Iteration, SARSA, Q-Learning, MC Exploring Starts, On-policy first-visit MC control |
Algorithm | Update Equation | Type | Description |
---|---|---|---|
Iterative Policy Evaluation | $V(S_t) \leftarrow \mathbb{E}_{\pi} [ r_{t+1} + \gamma V(S_{t+1}) ]$ | Synchronous DP | evaluates a given policy $\pi$ - there is an explicit policy |
Policy Iteration | 1. Policy Evaluation 2. Policy Improvement | Synchronous DP | evaluates and improves a given policy $\pi$ - there is an explicit policy |
Value Iteration | $V(S_t) \leftarrow \max_a \mathbb{E} [ r_{t+1} + \gamma V(S_{t+1}) \mid A_t = a ]$ | Synchronous DP | finds the optimal value function $v_*$ - there is no explicit policy |
First-Visit-MC | ... | MC-Learning | estimates $v_{\pi}(s)$ as the average of the returns following first visits to $s$ |
Every-Visit-MC | ... | MC-Learning | estimates $v_{\pi}(s)$ as the average of the returns following all visits to $s$ |
TD(0) | $V(S_t) \leftarrow V(S_t) + \alpha [ r_{t+1} + \gamma V(S_{t+1}) - V(S_t) ]$ | TD-Learning | estimates $v_{\pi}(s)$ for a given policy - updates online, after each step |
n-step TD | ... | TD-Learning | |
SARSA | $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [ r_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) ]$ | TD-Learning | estimates $Q_*(s, a)$ - performs on-policy updates - randomly selects an action with probability $\epsilon$ ($\epsilon$-greedy) |
Q-Learning | $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [ r_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) ]$ | TD-Learning | estimates $Q_*(s, a)$ - performs off-policy updates (directly approximates $Q_*$, independent of the policy being followed) - selects the greedy action for the update target |
Expected SARSA | $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [ r_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t) ]$ | TD-Learning | estimates $Q_*(s, a)$ - performs off-policy updates - randomly selects an action with probability $\epsilon$ ($\epsilon$-greedy) - moves deterministically in the same direction as SARSA moves in expectation |
- Policy Iteration
- Value Iteration
- First-Visit Monte Carlo
- Every-Visit Monte Carlo
- TD-0
- TD(lambda)
- SARSA
- Q-Learning
- Include eligibility traces
- Update README
- Include formulas and descriptions
- Include a picture of the grid world
- Make a separate main which runs the specific agent simulation
- Investigate the slowness of SARSA
Optional
- Write unit tests for:
  - Policy Iteration
  - Value Iteration
  - First- and Every-Visit Monte Carlo
  - TD-0
  - SARSA
  - Q-Learning