RL Sketchpad

Implementations of various RL Algorithms

Contents:

Reinforcement Learning: An Introduction (2nd ed, 2018) by Sutton and Barto

Implementation of selected algorithms from the book. I tried to make code snippets minimal and faithful to the book.

Part I: Tabular Solution Methods
Chapter 2: Multi-armed Bandits
- 2.4: Simple Bandit - fig. 2.1, 2.2
- 2.6: Tracking Bandit - fig. 2.3
Chapter 4: Dynamic Programming
- 4.1: Iterative Policy Evaluation - FrozenLake-v0
- 4.3: Policy Iteration - FrozenLake-v0
- 4.4: Value Iteration - FrozenLake-v0
Chapter 5: Monte Carlo Methods
- 5.1: First-Visit MC Prediction - Blackjack-v0, fig. 5.1
- 5.3: Monte Carlo ES Control - Blackjack-v0, fig. 5.2
- 5.4: On-Policy First-Visit MC Control - Blackjack-v0
Chapter 6: Temporal-Difference Learning
- 6.1: TD Prediction - Blackjack-v0, example 6.2
- Also: Running-Mean MC Prediction
- 6.4: Sarsa - WindyGridworld, example 6.5
- 6.5: Q-Learning - CliffWalking, example 6.6
Part II: Approximate Solution Methods
Chapter 9: On-Policy Prediction with Approximation
- 9.3a: Gradient Monte Carlo - example 9.1, fig. 9.1
- 9.3b: Semi-Gradient TD - example 9.2, fig. 9.2 (left)
- 9.5a: Linear Models - Polynomial and Fourier Bases - fig. 9.5
- 9.5b: Linear Models - Tile Coding - fig. 9.10
- 9.7: Neural Network with Memory Replay
Chapter 10: On-Policy Control with Approximation
- 10.1: Episodic Semi-Gradient Sarsa - MountainCar, fig. 10.1, 10.2
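
To give a flavour of the tabular methods listed above, here is a minimal sketch of the Q-learning update (section 6.5). It is not the code from this repository: the 4-state chain environment, the `step` helper, and the hyperparameters are all made up for illustration, and the behaviour policy is uniformly random rather than epsilon-greedy, which Q-learning tolerates because it is off-policy.

```python
import random

N_STATES = 4              # hypothetical toy chain: states 0..3, state 3 is terminal
LEFT, RIGHT = 0, 1
ACTIONS = (LEFT, RIGHT)


def step(state, action):
    """Deterministic chain: RIGHT moves toward the goal, LEFT moves away (floor at 0)."""
    next_state = state + 1 if action == RIGHT else max(state - 1, 0)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0   # reward only on reaching the terminal state
    return next_state, reward, done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning with a uniformly random behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            # Bootstrap from the greedy value of the next state (zero if terminal).
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    """Greedy action for every non-terminal state."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
```

Calling `greedy_policy(q_learning())` should return `RIGHT` for every non-terminal state, since moving right is the only way to reach the reward.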

More in-depth explanations of selected concepts from David Silver's lectures and the Sutton and Barto book.

Lecture 3 - Dynamic Programming
- Dynamic Programming - Iterative Policy Evaluation, Policy Iteration, Value Iteration
Lecture 4 - Model-Free Prediction
- MC and TD Prediction
- N-Step and TD(λ) Prediction - Forward TD(λ) and Backward TD(λ) with Eligibility Traces
Lecture 5 - Model-Free Control
- On-Policy Control - MC, TD, N-Step, Forward TD(λ), Backward TD(λ) with Eligibility Traces
- Off-Policy Control - Expectation Based - Q-Learning, Expected SARSA, Tree Backup
- Off-Policy Control - Importance Sampling - I.S. SARSA, N-Step I.S. SARSA, Off-Policy MC Control
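
The n-step and forward-view TD(λ) targets mentioned above can be sketched in a few lines. This is an illustrative sketch, not the notebooks' code: the function names and the indexing convention (`rewards[k]` is the reward received after leaving state `k`, `values[k]` is the current estimate for state `k`, and the terminal state has value 0) are assumptions made here.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return from time t: n discounted rewards plus a bootstrapped value.

    If the episode ends within n steps, no bootstrapping happens and the result
    is the plain Monte Carlo return.
    """
    T = len(rewards)                      # episode length
    end = min(t + n, T)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:                           # episode not finished: bootstrap from V
        g += gamma ** (end - t) * values[end]
    return g


def lambda_return(rewards, values, t, gamma, lam):
    """Forward-view lambda-return: geometric mixture of all n-step returns."""
    T = len(rewards)
    g = 0.0
    for n in range(1, T - t):
        g += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    # The remaining weight goes to the full Monte Carlo return.
    g += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return g
```

With `lam=0` the result reduces to the one-step TD target, and with `lam=1` it reduces to the Monte Carlo return, which is a handy sanity check for any implementation.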

ANN and Correlated Data - simplest possible example showing why memory replay is necessary
Minimal TF Keras - fit sine wave
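
For readers unfamiliar with memory replay, the mechanism itself is small. The sketch below is a generic illustration, not the code from the notebook above: the `ReplayBuffer` class and its method names are hypothetical. It stores transitions in a bounded queue and draws training batches uniformly at random, so consecutive (and therefore correlated) samples are unlikely to end up in the same batch.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)   # oldest items are evicted first
        self.rng = random.Random(seed)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Sampling without replacement breaks the temporal ordering of the data.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

A typical training loop would call `add` after every environment step and, once the buffer holds enough transitions, fit the network on `sample(batch_size)` instead of on the most recent transition.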