Writing code to reproduce the results of the simulations from Karpathy's .../reinforcejs/gridworld_dp.html.
Component | Description |
---|---|
$s$, $S_t$ | State in the environment |
$a$, $A_t$ | Action of a policy |
$Q(s, a)$ | Estimated value of taking action $a$ in state $s$ |
$Q(s', a')$ | Estimated value of taking the next action $a'$ in the next state $s'$ |
$\pi(s)$ | Policy of the RL agent, which maps states to actions |
$\pi(a \mid s)$ | Probability of selecting action $a$ in state $s$ |
$r$ | Immediate reward received from taking action $a$ in state $s$ |
$r_{t+1}$ | Reward received after taking action $A_t$ in state $S_t$ |
$\alpha$ | Learning rate |
$\gamma$ | Discount factor |
$\sum_a$ | Sum over all possible actions $a$ |
$$ V(S_t) \leftarrow \mathbb{E}_{\pi} [ r_{t+1} + \gamma V(S_{t+1}) ] $$
Implementing code for the simulations on GridWorld: DP
Policy Iteration
- converges to $v_*(s)$

Steps (applied iteratively):
1. Get $v_i$ by doing Policy Evaluation of policy $\pi_{i-1}$ using $v_{i-1}$.
2. Update the policy to $\pi_i$ by doing Policy Improvement using $v_i$.
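A minimal Python sketch of this loop, assuming a tabular GridWorld whose model is exposed as `transitions[s][a] -> (next_state, reward)` together with a set of terminal states and a discount factor `gamma` (these names are illustrative, not the repo's actual API):

```python
import numpy as np

def policy_evaluation(policy, transitions, terminals, gamma=0.9, theta=1e-6):
    """Sweep the Bellman expectation backup until the value function stabilises."""
    V = np.zeros(len(transitions))
    while True:
        V_new = V.copy()                               # synchronous sweep
        for s in range(len(transitions)):
            if s in terminals:
                continue
            s_next, r = transitions[s][policy[s]]      # deterministic policy and model
            V_new[s] = r + gamma * V[s_next]
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new

def policy_improvement(V, transitions, terminals, gamma=0.9):
    """Make the policy greedy with respect to the current value function."""
    policy = np.zeros(len(transitions), dtype=int)
    for s in range(len(transitions)):
        if s in terminals:
            continue
        policy[s] = int(np.argmax([r + gamma * V[s_next]
                                   for s_next, r in transitions[s]]))
    return policy

def policy_iteration(transitions, terminals, gamma=0.9):
    policy = np.zeros(len(transitions), dtype=int)     # arbitrary initial policy
    while True:
        V = policy_evaluation(policy, transitions, terminals, gamma)
        new_policy = policy_improvement(V, transitions, terminals, gamma)
        if np.array_equal(new_policy, policy):         # policy stable => optimal
            return policy, V
        policy = new_policy
```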
Value Iteration
- converges to $v_*(s)$

Steps:
1. Find the optimal value function $v_*$ using only the action which leads to the successor state with maximal value.
2. One update of the policy using the optimal value function found in step 1.
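A minimal sketch of Value Iteration under the same illustrative model assumptions (`transitions[s][a] -> (next_state, reward)`, a set of terminal states, discount `gamma`):

```python
import numpy as np

def value_iteration(transitions, terminals, gamma=0.9, theta=1e-6):
    V = np.zeros(len(transitions))
    while True:
        V_new = V.copy()
        for s in range(len(transitions)):
            if s in terminals:
                continue
            # Bellman optimality backup: keep only the best action's value.
            V_new[s] = max(r + gamma * V[s_next] for s_next, r in transitions[s])
        converged = np.max(np.abs(V_new - V)) < theta
        V = V_new
        if converged:
            break
    # Single policy update: act greedily with respect to the optimal values.
    policy = [0 if s in terminals else
              int(np.argmax([r + gamma * V[s_next] for s_next, r in transitions[s]]))
              for s in range(len(transitions))]
    return policy, V
```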
Investigations: Iteratively doing one policy evaluation step followed by policy improvement, and then using that improved policy in the next evaluation step, yields the same value function at each iteration as value iteration. In other words, directly updating the policy and then evaluating it is equivalent to always taking only the action which leads to the successor state with maximal value.
Monte Carlo Methods
- learn the state-value function for a given policy
- update only when the end of an episode is reached
- converges to $v_{\pi}(s)$ as the number of visits to $s$ goes to infinity
- only sample experience is required; no complete probability distributions of all possible transitions are needed

Idea: Estimate the value of a state from experience by averaging the returns observed after visiting that state. As more returns are observed, the average should converge to the expected value.

- Visit to $s$ = each occurrence of state $s$ in an episode.
- First-visit MC method = estimates $v_{\pi}(s)$ as the average of the returns following first visits to $s$.
- Every-visit MC method = estimates $v_{\pi}(s)$ as the average of the returns following all visits to $s$.
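A first-visit MC prediction sketch, assuming an episodic environment exposing `reset()` and `step(action) -> (next_state, reward, done)` and a fixed `policy(state) -> action` (again, illustrative names, not the repo's API):

```python
from collections import defaultdict

def first_visit_mc(env, policy, n_episodes=1000, gamma=0.9):
    returns = defaultdict(list)            # state -> observed returns
    V = defaultdict(float)
    for _ in range(n_episodes):
        # Generate one complete episode following the given policy.
        episode, state, done = [], env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state
        # Walk backwards, accumulating the return G that follows each visit.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:        # only the first visit contributes
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return V
```

Every-visit MC is the same loop without the `first_visit[s] == t` check: every occurrence of a state contributes the return that follows it.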
Temporal-Difference (TD) Learning
- update online, after each step
Implementing code for the simulations on GridWorld: TD
TD(0)
- converges to $v_{\pi}(s)$
- learns the state-value function for a given policy

Investigations: TD(0) converges to the correct answer (the estimated value function for the given policy) but not to the optimal policy!
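A TD(0) prediction sketch under the same assumed environment interface; the value of a state is nudged towards the bootstrapped target after every single step:

```python
from collections import defaultdict

def td0(env, policy, n_episodes=1000, alpha=0.1, gamma=0.9):
    V = defaultdict(float)
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # Bootstrapped target; terminal states have value 0.
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])    # online update after each step
            state = next_state
    return V
```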
SARSA
- used to estimate the optimal action-value function $Q_*(s, a)$
- converges to $Q_*(s, a)$
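An on-policy SARSA sketch, assuming the same environment interface plus a discrete `actions` list; the `epsilon_greedy` helper is defined here for illustration:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, n_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(n_episodes):
        state, done = env.reset(), False
        action = epsilon_greedy(Q, state, actions, epsilon)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            # On-policy target: uses the action that will actually be taken next.
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```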
Q-Learning
- used to estimate the optimal action-value function $Q_*(s, a)$
- converges to $Q_*(s, a)$
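A Q-learning sketch; it reuses the assumed environment interface and the `epsilon_greedy` helper from the SARSA sketch above, but bootstraps from the greedy action rather than the action actually taken next:

```python
from collections import defaultdict

def q_learning(env, actions, n_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            action = epsilon_greedy(Q, state, actions, epsilon)  # behaviour policy
            next_state, reward, done = env.step(action)
            # Off-policy target: uses the greedy (max) action, not the one taken next.
            best_next = max(Q[(next_state, a)] for a in actions)
            target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```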
Expected SARSA
- similar to SARSA
- but takes the expected value over all possible actions $a$, instead of using the actual next action, to estimate the next state-action value
- converges to $Q_*(s, a)$
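An Expected SARSA sketch (same assumptions as the SARSA sketch): the target averages $Q$ over the $\epsilon$-greedy action probabilities in the next state instead of sampling one next action:

```python
from collections import defaultdict

def expected_sarsa(env, actions, n_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            action = epsilon_greedy(Q, state, actions, epsilon)  # helper from the SARSA sketch
            next_state, reward, done = env.step(action)
            # Expected value of Q(next_state, .) under the epsilon-greedy policy.
            greedy = max(actions, key=lambda a: Q[(next_state, a)])
            probs = {a: epsilon / len(actions) for a in actions}
            probs[greedy] += 1.0 - epsilon
            expected = sum(p * Q[(next_state, a)] for a, p in probs.items())
            target = reward + (0.0 if done else gamma * expected)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```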
Problem | Goal | Examples |
---|---|---|
Prediction | Evaluate a given policy: how much reward are we going to get for a given policy? | Iterative Policy Evaluation, TD(lambda), First-Visit MC, Every-Visit MC |
Control | Find the optimal policy: what is the most total reward we can get out of our MDP? | Policy Iteration, Value Iteration, SARSA, Q-Learning, MC Exploring Starts, On-policy first-visit MC control |
Algorithm | Update Equation | Type | Description |
---|---|---|---|
Iterative Policy Evaluation | $V(S_t) \leftarrow \mathbb{E}_{\pi} [ r_{t+1} + \gamma V(S_{t+1}) ]$ | Synchronous DP | evaluates a given policy $\pi$ - there is an explicit policy |
Policy Iteration | 1. Policy Evaluation 2. Policy Improvement | Synchronous DP | evaluates and improves a given policy $\pi$ - there is an explicit policy |
Value Iteration | $V(S_t) \leftarrow \max_a \mathbb{E} [ r_{t+1} + \gamma V(S_{t+1}) \mid A_t = a ]$ | Synchronous DP | finds the optimal value function $v_*$ - there is no explicit policy |
First-Visit-MC | ... | MC-Learning | estimates $v_{\pi}(s)$ as the average of the returns following first visits to $s$ |
Every-Visit-MC | ... | MC-Learning | estimates $v_{\pi}(s)$ as the average of the returns following all visits to $s$ |
TD(0) | $V(S_t) \leftarrow V(S_t) + \alpha [ r_{t+1} + \gamma V(S_{t+1}) - V(S_t) ]$ | TD-Learning | estimates $v_{\pi}(s)$ for a given policy - updates online, after each step |
n-step TD | ... | TD-Learning | |
SARSA | $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [ r_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) ]$ | TD-Learning | estimates $Q_*(s, a)$ - performs on-policy updates - randomly selects an action with probability $\epsilon$ ($\epsilon$-greedy) |
Q-Learning | $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [ r_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) ]$ | TD-Learning | estimates $Q_*(s, a)$ - performs off-policy updates (directly approximates $Q_*$, independent of the policy being followed) - selects the greedy action for the update target |
Expected SARSA | $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [ r_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t) ]$ | TD-Learning | estimates $Q_*(s, a)$ - performs off-policy updates - randomly selects an action with probability $\epsilon$ ($\epsilon$-greedy) - moves deterministically in the same direction as SARSA moves in expectation |
- Policy Iteration
- Value Iteration
- First-Visit Monte Carlo
- Every-Visit Monte Carlo
- TD-0
- TD(lambda)
- SARSA
- Q-Learning
- Include eligibility traces
- Update README
- Include formulas and descriptions
- Include a picture of the grid world
- Make a separate main which runs the specific agent simulation
- Investigate the slowness of SARSA
Optional
- Write unit tests for:
  - Policy Iteration
  - Value Iteration
  - First- and Every-Visit Monte Carlo
  - TD-0
  - SARSA
  - Q-Learning