An introduction to observable, model-free, single-agent, gradient-based, and non-hierarchical RL algorithms.
This course combines UCL's (D. Silver) and UC Berkeley's lectures.
- Class intro:
- RL vs. ML
- RL history
- MDP: basic framework (lecture 2)
- DP: model-based solution (lecture 3)
- Model-based prediction / control
- Model-free prediction
- Model-free control
- Function approximation
- Policy gradient
- Inverse RL
- Deep RL
- Basic concepts: state, action, model, policy, reward, return, value
- Classification of agents: value-based, policy-based, model-based vs. model-free
- Markov Decision Process
- Markov property: $<S, P> \rightarrow <S, P, R, \gamma> \rightarrow <S, P, R, \gamma, A>$ (Markov process $\rightarrow$ Markov reward process $\rightarrow$ MDP)
- Example: students taking a class
- Example: sweeping robot
- Policy, state value ($V$), action value ($Q$)
- Bellman expectation equation, and ways to solve it (a worked sketch follows this list):
- Linear equation: analytic solution
- DP: BFS
- MC: DFS
- TD: partial DFS
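To make the "analytic solution" item concrete, here is a minimal sketch that solves the Bellman expectation equation $v = R + \gamma P v$ with a linear solve. The 3-state chain and its numbers are made up for illustration; this is not the lecture's student example.

```python
import numpy as np

# Made-up 3-state Markov reward process (illustrative numbers only).
P = np.array([[0.5, 0.5, 0.0],   # P[s, s']: transition probabilities
              [0.2, 0.3, 0.5],
              [0.0, 0.0, 1.0]])  # state 2 is absorbing
R = np.array([1.0, -2.0, 0.0])   # expected immediate reward in each state
gamma = 0.9

# Bellman expectation equation: v = R + gamma * P v  <=>  (I - gamma * P) v = R
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)  # exact state values, no iteration required
```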
- Principle of Optimality: the theoretical foundation for DP
- Bellman optimality equation: defines the target for optimal control (the equations follow below)
- Markov property
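For reference, the two optimality equations in the notation used above:

$$
v_*(s) = \max_a \Big( R_s^a + \gamma \sum_{s'} P_{ss'}^a \, v_*(s') \Big),
\qquad
q_*(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a \, \max_{a'} q_*(s',a')
$$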
- Dynamic Programming: model-based; justified by the Principle of Optimality
- Value iteration: basic idea; converges in the limit by the contraction-mapping theorem (sketched after this list)
- Policy Iteration: evaluation + improvement
- Generalized policy iteration: interleave one improvement step with several evaluation sweeps
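A minimal value-iteration sketch. The array shapes (`P[s, a, s']`, `R[s, a]`) are assumptions for illustration, not the lectures' code; the loop applies the Bellman optimality backup until the sup-norm change falls below a tolerance, which the contraction argument guarantees will happen.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P[s, a, s']: transition probabilities; R[s, a]: expected rewards (assumed shapes)."""
    n_states = P.shape[0]
    v = np.zeros(n_states)
    while True:
        # Bellman optimality backup: v(s) <- max_a [R(s,a) + gamma * sum_s' P(s,a,s') v(s')]
        q = R + gamma * (P @ v)              # shape (n_states, n_actions)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)   # optimal values and a greedy policy
        v = v_new                            # gamma-contraction: the gap shrinks geometrically
```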
- Model-free prediction learning: learn to estimate $V$ or $Q$ under a given policy
- Monte Carlo: high variance, unbiased
- Incremental MC with a fixed update step size
- Temporal Difference: low variance, biased; TD target, TD error (both update rules sketched below)
- Batch learning of MC and TD
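The two update rules side by side, as a tabular sketch; `V` is any dict-like value table, and the `(state, reward)` episode format is an assumption for illustration.

```python
def mc_update(V, episode, alpha=0.1, gamma=1.0):
    """Incremental every-visit MC: step toward the full sampled return G_t.
    `episode` is a list of (state, reward) pairs (assumed format)."""
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G               # complete return from this step onward
        V[state] += alpha * (G - V[state])   # unbiased target, high variance

def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    """TD(0): step toward the bootstrapped one-step target."""
    td_target = r + gamma * V[s_next]        # the TD target (biased: reuses the estimate V)
    V[s] += alpha * (td_target - V[s])       # (td_target - V[s]) is the TD error
```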
- DP vs. MC vs. TD: bias-variance tradeoff; partial vs. complete trajectories; reliance on the Markov property
- DP: BFS, bootstrapping
- MC: DFS, sampling
- TD: partial DFS, sampling + bootstrapping
- $TD(n) \rightarrow TD(\lambda)$: change the TD target from the one-step return to a weighted mean of multi-step returns
- Eligibility traces: the backward view is what makes $TD(\lambda)$ online; I do not fully understand how yet (see the sketch below)
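A tabular backward-view sketch, aimed at the part the note above flags as unclear: each step's TD error is immediately distributed to all recently visited states in proportion to their decaying traces, so no end-of-episode pass is needed. The `env`/`policy` interfaces (`reset()`, `step()` returning `(next_state, reward, done)`) are assumptions.

```python
from collections import defaultdict

def td_lambda_episode(env, policy, V, alpha=0.1, gamma=1.0, lam=0.9):
    """One episode of backward-view TD(lambda) on a dict-like table V (e.g. defaultdict(float))."""
    E = defaultdict(float)           # eligibility trace per state
    s = env.reset()
    done = False
    while not done:
        s_next, r, done = env.step(policy(s))
        delta = r + gamma * V[s_next] * (not done) - V[s]   # one-step TD error
        E[s] += 1.0                  # accumulating trace for the current state
        for state in E:
            # every recently visited state absorbs a share of this step's TD error;
            # applying it immediately, instead of waiting for G_t, is what makes it online
            V[state] += alpha * delta * E[state]
            E[state] *= gamma * lam  # traces decay geometrically
        s = s_next
```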
- Model-free control learning: learn the optimal $Q$ or $V$
- Online learning: does not require a complete trajectory
- $\epsilon$-greedy + GLIE: MC control
- Sarsa: on-policy TD control (sketched below)
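A Sarsa sketch with an $\epsilon$-greedy behavior policy and a simple GLIE schedule ($\epsilon_k = 1/k$, a standard choice assumed here); the env interface is the same assumption as above.

```python
import random
from collections import defaultdict

def sarsa(env, n_actions, episodes=1000, alpha=0.1, gamma=1.0):
    """On-policy TD control: the action actually taken next feeds the target."""
    Q = defaultdict(float)

    def eps_greedy(s, eps):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for k in range(1, episodes + 1):
        eps = 1.0 / k                # GLIE: exploration decays toward zero, never stops abruptly
        s = env.reset()
        a = eps_greedy(s, eps)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = eps_greedy(s2, eps)             # next action from the SAME policy (on-policy)
            target = r + gamma * Q[(s2, a2)] * (not done)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```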
- Importance sampling (the correction ratio is shown below):
- Off-policy vs. on-policy
- Principle: reweight samples from the behavior policy by the importance-sampling ratio
- Off-policy MC, off-policy TD
- Q-Learning: off-policy TD control (update sketched below)
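The Q-learning update differs from Sarsa only in its target, which bootstraps from the greedy action rather than the action the behavior policy will actually take; same Q-table convention as the Sarsa sketch.

```python
def q_learning_update(Q, s, a, r, s2, done, n_actions, alpha=0.1, gamma=1.0):
    """One off-policy TD control step on a dict-like table Q[(state, action)]."""
    # Bootstrap from the best action in s2, whatever the behavior policy does next:
    # this is exactly what makes Q-learning off-policy.
    best_next = max(Q[(s2, a2)] for a2 in range(n_actions))
    Q[(s, a)] += alpha * (r + gamma * best_next * (not done) - Q[(s, a)])
```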
- Double Q-Learning:
- Why? Jensen's inequality: $\mathbb{E}[\max_a \hat{Q}(s,a)] \geq \max_a \mathbb{E}[\hat{Q}(s,a)]$, so the single-estimator max is biased upward
- Solution: keep two Q estimates; one selects the maximizing action, the other evaluates it (sketched below)
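A minimal sketch of van Hasselt's fix, with the same table conventions as above: a coin flip decides which table gets updated, the updated table picks the argmax action, and the other table supplies its value, decoupling selection from evaluation.

```python
import random

def double_q_update(QA, QB, s, a, r, s2, done, n_actions, alpha=0.1, gamma=1.0):
    """One Double Q-learning step on two dict-like tables."""
    if random.random() < 0.5:
        QA, QB = QB, QA              # with probability 1/2, swap which table is updated
    # QA selects the action, QB evaluates it: errors in QA's argmax no longer
    # feed back into the same table's value estimate, removing the max bias.
    a_star = max(range(n_actions), key=lambda a2: QA[(s2, a2)])
    td_target = r + gamma * QB[(s2, a_star)] * (not done)
    QA[(s, a)] += alpha * (td_target - QA[(s, a)])
```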
- Value gradient: value function approximation
- Why? Large-scale RL: too many states for tabular methods
- Approximator types
- DP with an approximator
- Prediction learning with an approximator:
- LMS value iteration: linear function approximation only
- SGD value iteration (semi-gradient TD sketch below)
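A semi-gradient TD(0) step for a linear approximator $\hat{v}(s, \mathbf{w}) = \phi(s)^\top \mathbf{w}$; the feature map `phi` is an assumed, user-supplied function. "Semi" because the bootstrapped target is treated as a constant when differentiating.

```python
import numpy as np

def semi_gradient_td0(phi, w, s, r, s_next, done, alpha=0.01, gamma=0.99):
    """One SGD-style TD(0) step; phi: state -> feature vector (np.ndarray)."""
    v_s = phi(s) @ w
    v_next = 0.0 if done else phi(s_next) @ w
    td_error = r + gamma * v_next - v_s
    # Semi-gradient: the target r + gamma * v_next is held fixed, so the
    # gradient of v_hat with respect to w is just the feature vector phi(s).
    w += alpha * td_error * phi(s)
    return w
```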
- Control learning = prediction learning + $\epsilon$-greedy
- Policy gradient
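The identity this topic rests on, the policy gradient theorem in its score-function form:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \right]
$$

REINFORCE is the Monte Carlo instance: it substitutes the sampled return $G_t$ for $Q^{\pi_\theta}(s, a)$.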
- GLIE Monte Carlo control learning
- Temporal Difference: Richard Sutton
- Sarsa: Rummery & Niranjan, "On-line Q-learning Using Connectionist Systems"
- Q-Learning: Chris Watkins
- Double Q-Learning: Hado van Hasselt