My solutions to Practical Reinforcement Learning by National Research University Higher School of Economics via Coursera.
The GitHub version of the course.
- Reinforcement Learning
- Multi-armed bandit
- Decision process & applications
- Black box optimization
- Markov decision process (MDP)
- Crossentropy method
- rollout experiences with the stochastic policy $\pi_i$
- store the <s,a> pairs from the better-rewarded sessions as elites
- update the policy: $\pi_{i+1}$ proportional to the occurrences of each <s,a> pair in the elite group
- rollout experiences with the updated policy $\pi_{i+1}$ and repeat (see the sketch after the practice list below)
- Approximate crossentropy method *
- Evolution strategies
- Practiced the basic interface of OpenAI Gym environments.
- Solved 'Taxi-v3' with the crossentropy method, using a tabular representation of the policy.
- Approximated the policy for the continuous state space game 'CartPole-v0' with a multi-layer perceptron (MLP).
- Trained the MLP on elite <s,a> pairs selected by the crossentropy method.
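A minimal sketch of the crossentropy loop above, for a tabular policy on 'Taxi-v3' (assuming the classic Gym API where `reset()` returns an observation and `step()` returns a 4-tuple; the session count and elite percentile are illustrative choices):

```python
import gym
import numpy as np

env = gym.make("Taxi-v3")
n_states, n_actions = env.observation_space.n, env.action_space.n
policy = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform stochastic policy pi_0

def play_session(t_max=10_000):
    """Roll out one episode with the current policy."""
    states, actions, total_reward = [], [], 0.0
    s = env.reset()
    for _ in range(t_max):
        a = np.random.choice(n_actions, p=policy[s])
        next_s, r, done, _ = env.step(a)
        states.append(s)
        actions.append(a)
        total_reward += r
        s = next_s
        if done:
            break
    return states, actions, total_reward

for iteration in range(100):
    sessions = [play_session() for _ in range(250)]
    rewards = [r for _, _, r in sessions]
    threshold = np.percentile(rewards, 50)  # elite cutoff
    counts = np.zeros_like(policy)
    for states, actions, r in sessions:
        if r >= threshold:  # keep <s,a> pairs of the better-rewarded sessions
            for s, a in zip(states, actions):
                counts[s, a] += 1
    # pi_{i+1}: proportional to elite <s,a> occurrences; unseen states keep pi_i
    seen = counts.sum(axis=1) > 0
    policy[seen] = counts[seen] / counts[seen].sum(axis=1, keepdims=True)
```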
- Reward
- Reward design
- Discounted reward
- Bellman equations
- Value functions
- State-action value function
- Generalized policy iteration
- Policy evaluation & improvement
- Policy and value iteration
- Implemented the value iteration algorithm to solve an MDP (sketched below)
$V_{i+1}(s) = \max_a \sum_{s'} P(s'|s,a)\cdot [r(s,a,s')+\gamma V_i(s')]$
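A minimal sketch of this backup, assuming the MDP is given as a hypothetical nested dict `P[s][a] = [(prob, next_state, reward), ...]`:

```python
def value_iteration(P, gamma=0.9, n_iter=1000, tol=1e-8):
    """V_{i+1}(s) = max_a sum_{s'} P(s'|s,a) * [r(s,a,s') + gamma * V_i(s')]."""
    V = {s: 0.0 for s in P}
    for _ in range(n_iter):
        new_V = {
            s: max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in transitions)
                for transitions in actions.values()
            )
            for s, actions in P.items()
        }
        converged = max(abs(new_V[s] - V[s]) for s in P) < tol
        V = new_V
        if converged:
            break
    return V
```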
- Model-free learning
- Monte-Carlo & temporal difference; Q-learning
- Exploration vs exploitation
- On-policy vs off-policy
- implemented the vanilla Q-learning algorithm (sketched after this list)
$Q(s,a) \leftarrow (1-\alpha)Q(s,a) + \alpha(r+\gamma V(s'))$, where $V(s') = \max_{a'} Q(s',a')$
- on both the discrete state space environment 'Taxi-v3' and a discretized version of the continuous state space environment 'CartPole-v0'
- implemented the Expected Value SARSA algorithm
- sample <s,a,r,s'> from the environment
- compute the target $\hat{Q}(s,a) = r(s,a) + \gamma\mathbb{E}_{a_i \sim \pi(a|s')}Q(s',a_i)$, where $\pi(a|s')$ is $\epsilon$-greedy: the optimal action with probability $(1-\epsilon)$, a random action with probability $\epsilon$
- update $Q(s,a) \leftarrow (1-\alpha)Q(s,a) + \alpha\hat{Q}(s,a)$
- compared the learning curves of Q-learning and Expected Value SARSA
- implemented experience replay for off-policy learning (Q-learning in this implementation)
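A minimal sketch of the two tabular updates above, using a plain `defaultdict` in place of the assignment's accessor methods (function and parameter names are hypothetical; `expected=True` gives Expected Value SARSA, `expected=False` gives Q-learning):

```python
from collections import defaultdict

q_values = defaultdict(float)  # (state, action) -> Q(s,a)

def td_update(s, a, r, next_s, actions, alpha=0.5, gamma=0.99,
              epsilon=0.1, expected=True):
    """One tabular TD update: Expected Value SARSA or Q-learning."""
    qs = [q_values[(next_s, a2)] for a2 in actions]
    if expected:
        # E_{a'~pi}[Q(s',a')] under the eps-greedy policy:
        # greedy action with prob (1 - eps), uniform random with prob eps
        v_next = (1 - epsilon) * max(qs) + epsilon * sum(qs) / len(qs)
    else:
        v_next = max(qs)  # Q-learning uses the greedy value of the next state
    q_hat = r + gamma * v_next
    q_values[(s, a)] = (1 - alpha) * q_values[(s, a)] + alpha * q_hat
```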
- Limitations of tabular methods
- Deep Q-network
- implemented a neural network to approximate the action-value function $Q(s,a)$, solving the continuous state space environment 'CartPole-v0'
- implemented DQN for the 'BreakoutNoFrameskip-v4' environment, combining (core pieces sketched after this list):
- image processing
- frame buffer
- deep Q-network
- experience replay
- target networks
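A minimal PyTorch sketch of the experience replay buffer, target network, and TD loss (the network sizes are hypothetical and CartPole-sized; the Atari agent instead uses a convolutional network over the stacked frame buffer):

```python
import random
from collections import deque

import torch
import torch.nn as nn

n_obs, n_actions = 4, 2  # hypothetical dimensions for illustration
policy_net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(policy_net.state_dict())  # periodic hard copy stabilizes targets

replay = deque(maxlen=10_000)  # experience replay buffer of (s, a, r, s', done) tuples

def td_loss(batch, gamma=0.99):
    s, a, r, next_s, done = zip(*batch)
    s = torch.tensor(s, dtype=torch.float32)
    next_s = torch.tensor(next_s, dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64)
    r = torch.tensor(r, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)
    q = policy_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s,a) of the taken actions
    with torch.no_grad():  # no gradient flows through the target network
        target = r + gamma * (1 - done) * target_net(next_s).max(1).values
    return nn.functional.mse_loss(q, target)

# usage: batch = random.sample(replay, 32); loss = td_loss(batch); loss.backward(); ...
```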
- Policy-based RL vs value-based RL
- policy gradient
- REINFORCE
$\nabla_\theta\hat{J}(\theta) \approx \frac{1}{N}\sum_{s_i,a_i}\nabla_\theta \log\pi_\theta(a_i|s_i) \cdot G_t(s_i,a_i)$
- Actor-critic method
- advantage actor-critic
- implemented the policy gradient method REINFORCE to solve 'CartPole-v0' (sketched below)
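A minimal sketch of the REINFORCE loss implied by the gradient estimate above (the policy network and its sizes are hypothetical):

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # CartPole-sized

def reinforce_loss(states, actions, rewards, gamma=0.99):
    """-(1/N) * sum_i log pi(a_i|s_i) * G_t; minimizing it ascends the objective J."""
    # cumulative discounted returns G_t, computed backwards over the episode
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(returns[::-1], dtype=torch.float32)
    states = torch.tensor(states, dtype=torch.float32)
    actions = torch.tensor(actions, dtype=torch.int64)
    log_probs = nn.functional.log_softmax(policy(states), dim=1)
    log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    return -(log_pi_a * returns).mean()
```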
- Measuring the quality of exploration
- Regret
- Uncertainty-based exploration
- Thompson sampling: sample each action's reward estimate from its posterior, so exploration takes the actual distribution of rewards into account
- Optimism in the face of uncertainty
- UCB-1: select among actions that are uncertain or have the potential to be optimal (UCB-1 and Thompson sampling are sketched below)
- Bayesian UCB
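A minimal sketch of UCB-1 and Thompson sampling for a Bernoulli bandit (class and method names are hypothetical):

```python
import numpy as np

class BernoulliBanditAgent:
    """Tracks per-arm success/failure counts for a Bernoulli bandit."""

    def __init__(self, n_arms):
        self.wins = np.zeros(n_arms)
        self.losses = np.zeros(n_arms)

    def update(self, arm, reward):  # reward is 0 or 1
        self.wins[arm] += reward
        self.losses[arm] += 1 - reward

    def ucb1(self):
        n = self.wins + self.losses
        if (n == 0).any():
            return int(np.argmax(n == 0))  # pull each arm once before using the bound
        means = self.wins / n
        bonus = np.sqrt(2 * np.log(n.sum()) / n)  # uncertainty bonus shrinks with pulls
        return int(np.argmax(means + bonus))

    def thompson(self):
        # sample each arm's mean reward from its Beta posterior, act greedily on samples
        samples = np.random.beta(self.wins + 1, self.losses + 1)
        return int(np.argmax(samples))
```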
- Planning with Monte Carlo tree search
- MCTS
- selection: from the root state, recursively select a child node (the next state reached by an action) according to an exploration strategy (the tree policy)
- expansion: expand the selected node with its feasible actions
- simulation: estimate the value of the new node with a rollout policy or some estimation model
- backpropagation: propagate the estimated action-state value up through its ancestors until the root
- implemented exploration strategies for the Bernoulli bandit
- epsilon-greedy agent
- UCB-1 agent
- Thompson sampling agent
- implemented the Monte Carlo tree search algorithm to plan and solve the 'CartPole-v0' task (sketched below)
- selection (tree policy) guided by UCB-1
- simulation by rollouts with randomly selected actions
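A minimal sketch of the four MCTS phases above, assuming the environment can be snapshotted with `copy.deepcopy` (true for classic-control tasks like 'CartPole-v0'); terminal-state bookkeeping is omitted for brevity:

```python
import copy
import math
import random

class Node:
    def __init__(self, env, parent=None):
        self.env = env          # environment snapshot at this node
        self.parent = parent
        self.children = {}      # action -> Node
        self.visits = 0
        self.value_sum = 0.0

    def ucb1(self, c=2.0):
        bonus = math.sqrt(math.log(self.parent.visits) / self.visits)
        return self.value_sum / self.visits + c * bonus

def mcts_plan(root, n_actions, n_iter=100, gamma=0.99, rollout_len=100):
    for _ in range(n_iter):
        # 1. selection: descend with UCB-1 while the node is fully expanded
        node = root
        while len(node.children) == n_actions:
            node = max(node.children.values(), key=Node.ucb1)
        # 2. expansion: add a child for one untried action
        a = random.choice([a for a in range(n_actions) if a not in node.children])
        child_env = copy.deepcopy(node.env)
        _, r, done, _ = child_env.step(a)
        child = Node(child_env, parent=node)
        node.children[a] = child
        # 3. simulation: random rollout from the new node
        G, discount = r, gamma
        sim_env = copy.deepcopy(child_env)
        steps_left = rollout_len
        while not done and steps_left > 0:
            _, r, done, _ = sim_env.step(sim_env.action_space.sample())
            G += discount * r
            discount *= gamma
            steps_left -= 1
        # 4. backpropagation: push the estimate up through the ancestors to the root
        while child is not None:
            child.visits += 1
            child.value_sum += G
            child = child.parent
    # act on the most-visited root action
    return max(root.children, key=lambda a: root.children[a].visits)

# usage: best_action = mcts_plan(Node(copy.deepcopy(env)), env.action_space.n)
```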