
🤖 Basic RL Algorithms

Implementation and short explanation of basic RL algorithms in Python, reproducing the simulations from Andrej Karpathy's REINFORCEjs library (.../reinforcejs/gridworld_dp.html).

📐 Description of the Mathematical Components

| Component | Description |
| --- | --- |
| $S$ | State in the environment |
| $A$ | Action of a policy |
| $Q(S,A)$ | Estimated value of taking action $A$ in state $S$ |
| $Q(S_{t+1},A_{t+1})$ | Estimated value of taking the next action $A_{t+1}$ in the next state $S_{t+1}$ under the current policy $\pi$ |
| $\pi$ | Policy of the RL agent, which maps states $S$ to actions $A$ |
| $\pi(A_{t+1} \mid S_{t+1})$ | Probability of selecting action $A_{t+1}$ in state $S_{t+1}$ under the current policy $\pi$ |
| $R_t$ | Return observed after time $t$ (the target of the Monte Carlo update) |
| $R_{t+1}$ | Reward received after taking action $A_t$ in state $S_t$ |
| $\alpha$ | Learning rate |
| $\gamma$ | Discount factor |
| $\sum_a$ | Sum over all possible actions $a$ in state $S$ |

🚜 Dynamic Programming

$$ V(S_t) \leftarrow \mathbb{E}_{\pi} \left[ r_{t+1} + \gamma V(S_{t+1}) \right] $$

Implementation of the simulations from GridWorld: DP.

Policy Iteration

  • converges to $v_*(s)$

Steps (repeated until the policy is stable):

  1. Policy Evaluation: compute $v_i$ for the current policy $\pi_{i-1}$, starting from $v_{i-1}$.

  2. Policy Improvement: update the policy to $\pi_i$ by acting greedily with respect to $v_i$ (a sketch follows this list).
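
A minimal sketch of these two steps, assuming the grid world is represented as a deterministic transition table `P[s][a] = (next_state, reward)`; the function name and interface are illustrative, not the repository's actual code:

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-6):
    """Alternate (1) policy evaluation and (2) greedy policy improvement."""
    V = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)            # arbitrary initial policy
    while True:
        # 1. Policy evaluation: sweep all states until V stops changing under the current policy
        while True:
            delta = 0.0
            for s in range(n_states):
                s_next, r = P[s][policy[s]]
                v_new = r + gamma * V[s_next]
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # 2. Policy improvement: act greedily with respect to the evaluated V
        stable = True
        for s in range(n_states):
            q = [P[s][a][1] + gamma * V[P[s][a][0]] for a in range(n_actions)]
            best = int(np.argmax(q))
            stable = stable and best == policy[s]
            policy[s] = best
        if stable:                                    # policy unchanged -> converged
            return V, policy
```

Evaluation here runs to convergence; truncating it to a single sweep gives the variant discussed under Investigations below.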

Value Iteration

  • converges to $v_*(s)$

Steps:

  1. Find the optimal value function $v_*$ by backing up, in each state, only the action that leads to the successor state with maximal value (Bellman optimality update).

  2. Perform one final policy update, acting greedily with respect to the $v_*$ found in step 1 (see the sketch below).
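
A corresponding value-iteration sketch under the same assumed transition-table representation `P[s][a] = (next_state, reward)`:

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-6):
    """Bellman optimality backups until V stops changing, then one greedy policy extraction."""
    def q_values(s, V):
        # one-step look-ahead over all actions in state s
        return [P[s][a][1] + gamma * V[P[s][a][0]] for a in range(n_actions)]

    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_new = max(q_values(s, V))               # back up only the best action (step 1)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # step 2: single greedy policy update from the converged values
    policy = np.array([int(np.argmax(q_values(s, V))) for s in range(n_states)])
    return V, policy
```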

Investigations: Iterating policy evaluation followed by policy improvement, and then using the improved policy in the next evaluation step, produces the same value function in each iteration as value iteration. In other words, updating the policy directly and then evaluating it is equivalent to always backing up only the action that leads to the successor state with maximal value.

🚙 Monte Carlo

$$ V(S_t) \leftarrow V(S_t) + \alpha [ R_t -V(S_t) ] $$

  • learn state-value function for a given policy
  • update only when end of episode is reached
  • converges to $v_{\pi}(s)$ as the number of visits to $s$ goes to infinity
  • learns from sample experience only; complete probability distributions of all possible transitions are not required

Idea: Estimate the value of a state from experience by averaging the returns observed after visits to that state. As more returns are observed, the average converges to the expected value.

  • Visit to $s$ = each occurrence of state $s$ in an episode.

  • First-visit MC method = estimates $v_{\pi}(s)$ as the average of the returns following first visits to $s$.

  • Every-visit MC method = estimates $v_{\pi}(s)$ as the average of the returns following all visits to $s$ (a first-visit sketch follows this list).
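
A first-visit MC prediction sketch, assuming episodes are already collected as lists of `(state, reward)` pairs, where the reward is the one received after leaving that state; the every-visit variant would simply average the return at every occurrence of $s$ instead of only the first:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """Estimate v_pi(s) as the average return observed after first visits to s."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = {}
    for episode in episodes:                          # episode: [(S_t, R_{t+1}), ...]
        # accumulate discounted returns backwards: G_t = R_{t+1} + gamma * G_{t+1}
        G = 0.0
        returns_at = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns_at[t] = G
        # remember the time step of the first occurrence of each state
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # average the return following each first visit
        for s, t in first_visit.items():
            returns_sum[s] += returns_at[t]
            returns_count[s] += 1
            V[s] = returns_sum[s] / returns_count[s]
    return V
```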

🚃 Temporal Difference (TD) Learning

  • update online, after each step

Implementation of the simulations from GridWorld: TD.

TD(0)

  • converges to $v_{\pi}(s)$

$$ V(S_t) \leftarrow V(S_t) + \alpha \left[ r_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right] $$

  • learn state-value function for a given policy

Investigations: TD(0) converges to the correct answer (the value function of the given policy), but not to the optimal policy!
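
A TD(0) prediction sketch; the `env` interface (`reset()` and `step(a)` returning `(next_state, reward, done)`) and the callable `policy` are illustrative assumptions:

```python
from collections import defaultdict

def td0_prediction(env, policy, n_episodes=1000, alpha=0.1, gamma=0.9):
    """After every step, move V(S_t) toward the target r_{t+1} + gamma * V(S_{t+1})."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                             # action sampled from the policy being evaluated
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])           # TD(0) update
            s = s_next
    return V
```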

SARSA

  • used to estimate the optimal action-value function $Q_*(s, a)$
  • converges to $Q_*(s, a)$

$$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) ] $$
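
A SARSA sketch with $\epsilon$-greedy action selection; the environment interface is the same illustrative assumption as in the TD(0) sketch above:

```python
import random
from collections import defaultdict

def sarsa(env, n_actions, n_episodes=1000, alpha=0.1, gamma=0.9, eps=0.1):
    """On-policy TD control: bootstrap from the action that is actually taken next."""
    Q = defaultdict(float)                            # Q[(state, action)]

    def eps_greedy(s):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)               # A_{t+1} chosen by the same policy
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```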

Expected SARSA

  • used to estimate the optimal action-value function $Q_*(s, a)$
  • converges to $Q_*(s, a)$

$$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [ R_{t+1} + \gamma \sum_a \pi(a | S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t) ] $$

  • similar to SARSA
  • but uses the expected value over all possible actions $a$, instead of the actually selected next action, to estimate the next state-action value (see the sketch below)
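
An Expected SARSA sketch under the same assumed environment interface; the only change from SARSA is that the target averages $Q(S_{t+1}, \cdot)$ over the $\epsilon$-greedy action probabilities instead of using the sampled next action:

```python
import random
from collections import defaultdict

def expected_sarsa(env, n_actions, n_episodes=1000, alpha=0.1, gamma=0.9, eps=0.1):
    """TD control with an expectation over next actions in the bootstrap target."""
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    def expected_q(s):
        # expectation of Q(s, .) under the epsilon-greedy policy
        greedy = max(range(n_actions), key=lambda a: Q[(s, a)])
        return sum(
            (eps / n_actions + (1.0 - eps if a == greedy else 0.0)) * Q[(s, a)]
            for a in range(n_actions)
        )

    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = eps_greedy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * expected_q(s_next))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```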

Q-Learning

  • converges to $Q_*(s, a)$

$$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) ] $$
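
A Q-learning sketch, again assuming the same illustrative environment interface; the behaviour policy is $\epsilon$-greedy, but the target always bootstraps from the greedy action, which makes the update off-policy:

```python
import random
from collections import defaultdict

def q_learning(env, n_actions, n_episodes=1000, alpha=0.1, gamma=0.9, eps=0.1):
    """Off-policy TD control: learn about the greedy policy while behaving epsilon-greedily."""
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # behave epsilon-greedily ...
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)
            # ... but learn from the best action in the next state
            best_next = max(Q[(s_next, a_)] for a_ in range(n_actions))
            target = r + (0.0 if done else gamma * best_next)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```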


🎯 Summaries

Problems

| Problem | Goal | Examples |
| --- | --- | --- |
| Prediction | Evaluate a given policy: how much reward are we going to get for a given policy? | Iterative Policy Evaluation, TD(lambda), First-Visit MC, Every-Visit MC |
| Control | Find the optimal policy: what is the maximal total reward we can get out of our MDP? | Policy Iteration, Value Iteration, SARSA, Q-Learning, MC Exploring Starts, On-Policy First-Visit MC Control |

Algorithms

| Algorithm | Update Equation | Type | Description |
| --- | --- | --- | --- |
| Iterative Policy Evaluation | $V(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) [ r + \gamma V(s') ]$ | Synchronous DP | evaluates a given $\pi$ (there is an explicit policy) |
| Policy Iteration | 1. Policy Evaluation: $V(s) \leftarrow \sum_{s', r} p(s', r \mid s, \pi(s)) [ r + \gamma V(s') ]$ <br> 2. Policy Improvement: $\pi(s) \leftarrow \arg\max_a \sum_{s', r} p(s', r \mid s, a) [ r + \gamma V(s') ]$ | Synchronous DP | evaluates a given $\pi$ via the Bellman expectation equation, then updates $\pi$ (there is an explicit policy) |
| Value Iteration | $V(s) \leftarrow \max_a \sum_{s', r} p(s', r \mid s, a) [ r + \gamma V(s') ]$ | Synchronous DP | finds $v_*$ via the Bellman optimality equation (there is no explicit policy) |
| First-Visit MC | ... | MC learning | estimates $v_{\pi}(s)$ as the average of the returns following first visits to $s$ |
| Every-Visit MC | ... | MC learning | estimates $v_{\pi}(s)$ as the average of the returns following all visits to $s$ |
| TD(0) | $V(S_t) \leftarrow V(S_t) + \alpha \left[ r_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$ | TD learning | estimates $v_{\pi}$ for a given policy, updating after every step |
| n-step TD | ... | TD learning | |
| SARSA | $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) ]$ | TD learning | estimates $q_{\pi}$ while following $\pi$ and updates $\pi$; on-policy updates; $A_{t+1}$ is selected by the current (e.g. $\epsilon$-greedy) policy |
| Q-Learning | $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) ]$ | TD learning | estimates $q_*$ using the greedy next action; off-policy updates (approximates $q_*$ independently of the behaviour policy); bootstraps from $\max_a Q(S_{t+1}, a)$ |
| Expected SARSA | $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [ R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t) ]$ | TD learning | estimates $q_{\pi}$ using the expected value over next actions; can perform on- or off-policy updates; the sampled $A_{t+1}$ does not enter the target (only its expectation does); moves deterministically in the direction SARSA moves in expectation |

📆 ToDo

  • Policy Iteration
  • Value Iteration
  • First-Visit Monte Carlo
  • Every-Visit Monte Carlo
  • TD-0
  • TD(lambda)
  • SARSA
  • Q-Learning
  • Include eligibility traces
  • Update readme
    • include formulas and descriptions
  • Include picture of the grid world
  • Make a separate main that runs the specified agent simulation
  • Investigate the slowness of SARSA

Optional

  • Write unit-tests
    • Policy Iteration
    • Value Iteration
    • First- and Every-Visit Monte-Carlo
    • TD-0
    • SARSA
    • Q-Learning