Overview

Implementations of reinforcement learning algorithms.

Multi-armed Bandits
- Bandits environment implementation (k-armed w/ optional non-stationarity)
- ε-greedy policy
- Upper Confidence Bound policy
- Policy gradient
Dynamic Programming
- Policy Iteration (Policy Evaluation + Policy Improvement)
- Value Iteration
Monte-Carlo
- MC on-policy value function estimation
- MC on-policy first-visit ε-greedy
- MC off-policy every-visit w/ weighted importance sampling
Temporal Difference
- SARSA
- Q-Learning
- Expected SARSA
- Double Q-Learning
n-step Bootstrapping (TODO)
Planning and Learning
- Maze environment implementation
- Dyna-Q
- Dyna-Q w/ prioritized sweeping
- Dyna-Q+
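
As a taste of the first item in the list, here is a minimal sketch of an ε-greedy agent on a stationary k-armed bandit. It is not taken from this repository: the function name `run_epsilon_greedy`, its parameters, and the unit-variance Gaussian rewards are all assumptions made for illustration, and Python is assumed as the language.

```python
import random


def run_epsilon_greedy(true_means, steps=2000, epsilon=0.1, seed=0):
    """Run an epsilon-greedy agent on a stationary k-armed Gaussian bandit.

    Hypothetical helper for illustration only. Rewards are assumed to be
    drawn from a normal distribution with unit variance around each arm's
    true mean.
    """
    rng = random.Random(seed)
    k = len(true_means)
    q = [0.0] * k  # estimated value of each arm
    n = [0] * k    # number of times each arm was pulled
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(k)
        else:
            # Exploit: pick the arm with the highest estimate
            # (ties resolve to the lowest index).
            arm = max(range(k), key=lambda a: q[a])

        reward = rng.gauss(true_means[arm], 1.0)
        n[arm] += 1
        # Incremental sample-mean update: Q <- Q + (R - Q) / N
        q[arm] += (reward - q[arm]) / n[arm]
        total_reward += reward

    return q, n, total_reward


if __name__ == "__main__":
    estimates, pulls, total = run_epsilon_greedy([0.0, 0.5, 2.0])
    print("estimates:", [round(v, 2) for v in estimates])
    print("pulls:", pulls)
```

With a fixed ε, the agent keeps spending a fraction of its pulls on exploration, which is what lets it track the best arm but also caps how often it can exploit.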

Prerequisites