/rl-book-challenge

self-studying the Sutton & Barto the hard way

Primary LanguagePythonMIT LicenseMIT

In this repo

  1. Python replication of all the plots from Reinforcement Learning: An Introduction
  2. Solution for all of the exercises
  3. Anki flashcards summary of the book

1. Replicate all the figures

To reproduce a figure, say figure 2.2, do:

cd chapter2
python figures.py 2.2

Chapter 2

  1. Figure 2.2: Average performance of epsilon-greedy action-value methods on the 10-armed testbed
  2. Figure 2.3: Optimistic initial action-value estimates
  3. Figure 2.4: Average performance of UCB action selection on the 10-armed testbed
  4. Figure 2.5: Average performance of the gradient bandit algorithm
  5. Figure 2.6: A parameter study of the various bandit algorithms

Chapter 4

  1. Figure 4.2: Jack’s car rental problem (value function, policy)
  2. Figure 4.3: The solution to the gambler’s problem (value function, policy)

Chapter 5

  1. Figure 5.1: Approximate state-value functions for the blackjack policy
  2. Figure 5.2: The optimal policy and state-value function for blackjack found by Monte Carlo ES
  3. Figure 5.3: Weighted importance sampling
  4. Figure 5.4: Ordinary importance sampling with surprisingly unstable estimates
  5. Figure 5.5: A couple of right turns for the racetrack task (1, 2, 3)

Chapter 6

  1. Figure 6.1: Changes recommended in the driving home example by Monte Carlo methods (left) and TD methods (right)
  2. Example 6.2: Random walk (comparison)
  3. Figure 6.2: Performance of TD(0) and constant MC under batch training on the random walk task
  4. Example 6.5: Windy Gridworld
  5. Example 6.6: Cliff Walking
  6. Figure 6.3: Interim and asymptotic performance of TD control methods (comparison)
  7. Figure 6.5: Comparison of Q-learning and Double Q-learning

Chapter 7

  1. Figure 7.2: Performance of n-step TD methods on 19-state random walk (comparison)
  2. Figure 7.4: Gridworld example of the speedup of policy learning due to the use of n-step methods

Chapter 8

  1. Figure 8.2: Average learning curves for Dyna-Q agents varying in their number of planning steps
  2. Figure 8.3: Policies found by planning and nonplanning Dyna-Q agents
  3. Figure 8.4: Average performance of Dyna agents on a blocking task
  4. Figure 8.5: Average performance of Dyna agents on a shortcut task
  5. Example 8.4: Prioritized sweeping significantly shortens learning time on the Dyna maze task
  6. Figure 8.7: Comparison of efficiency of expected and sample updates
  7. Figure 8.8: Relative efficiency of different update distributions

Chapter 9

  1. Figure 9.1: Gradient Monte Carlo algorithm on the 1000-state random walk task
  2. Figure 9.2: Semi-gradient n-steps TD algorithm on the 1000-state random walk task
  3. Figure 9.5: Fourier basis vs polynomials on the 1000-state random walk task (comparison)
  4. Figure 9.10: State aggregation vs. Tile coding on 1000-state random walk task (comparison)

Chapter 10

  1. Figure 10.1: The cost-to-go function for Mountain Car task in one run (428 steps; 12, 104, 1000, 9000 episodes)
  2. Figure 10.2: Learning curves for semi-gradient Sarsa on Mountain Car task
  3. Figure 10.3: One-step vs multi-step performance of semi-gradient Sarsa on the Mountain Car task
  4. Figure 10.4: Effect of the alpha and n on early performance of n-step semi-gradient Sarsa
  5. Figure 10.5: Differential semi-gradient Sarsa on the access-control queuing task

Chapter 11

  1. Figure 11.2: Demonstration of instability on Baird’s counterexample
  2. Figure 11.5: The behavior of the TDC algorithm on Baird’s counterexample
  3. Figure 11.6: The behavior of the ETD algorithm in expectation on Baird’s counterexample

Chapter 12

  1. Figure 12.3: Off-line λ-return algorithm on 19-state random walk
  2. Figure 12.6: TD(λ) algorithm on 19-state random walk
  3. Figure 12.8: True online TD(λ) algorithm on 19-state random walk
  4. Figure 12.10: Sarsa(λ) with replacing traces on Mountain Car
  5. Figure 12.11: Summary comparison of Sarsa(λ) algorithms on Mountain Car

Chapter 13

  1. Figure 13.1: REINFORCE on the short-corridor grid world
  2. Figure 13.2: REINFORCE with baseline on the short-corridor grid-world

2. Solution to all of the exercises (text answers)

To reproduce the results of an exercise, say exercise 2.5 do:

cd chapter2
python figures.py ex2.5

Chapter 2

  1. Exercise2.5: Difficulties that sample-average methods have for nonstationary problems

  2. Exercise2.11: Figure analogous to Figure 2.6 for the nonstationary case

Chapter 4

  1. Exercise 4.7: Modified Jack's car rental problem (value function, policy)

  2. Exercise 4.9: Gambler’s problem with ph = 0.25 (value function, policy) and ph = 0.55 (value function, policy)

Chapter 5

  1. Exercise 5.14: Modified MC Control on the racetrack (1, 2)

Chapter 6

  1. Exercise 6.4: Wider range of values alpha
  2. Exercise 6.5: High alpha, 99ffect of initialization
  3. Exercise 6.9: Windy Gridworld with King’s Moves
  4. Exercise 6.10: Stochastic Wind
  5. Exercise 6.13: Double Expected Sarsa vs. Expected Sarsa

Chapter 7

  1. Exercise7.2: Sum of TD error vs. n-step TD on 19-states random walk
  2. Exercise7.3: 19 states vs. 5 states, left-side outcome of -1
  3. Exercise7.7: Off-policy action-value prediction on a not-so-random walk
  4. Exercise7.10: Off-policy action-value prediction on a not-so-random walk

Chapter 8

  1. Exercise8.1: n-step sarsa on the maze task
  2. Exercise8.4: Gridworld experiment to test the exploration bonus

Chapter 11

  1. Exercise11.3: One-step semi-gradient Q-learning to Baird’s counterexample

Appendix

Dependencies

numpy
matplotlib
seaborn

Credits

All of the code and answers are mine, except for mountain car's tile coding (url in the book).

This README is inspired from ShangtongZhang's repo.

Design choices

  1. All of the chapters are self-contained.
  2. The environments use a gym-like API with methods:
s = env.reset()
s_p, r, d, dict = env.step(a)

How long did it take

The entire thing (plots, exercises, anki cards (including reviewing)) took about 400h of focused work.