/Practical_RL

A course in reinforcement learning in the wild

Primary LanguageJupyter NotebookMIT LicenseMIT

Practical_RL

A course on reinforcement learning in the wild. Taught on-campus at HSE and YSDA(russian) and maintained to be friendly to online students (both english and russian).

Manifesto:

  • Optimize for the curious. For all the materials that aren’t covered in detail there are links to more information and related materials (D.Silver/Sutton/blogs/whatever). Assignments will have bonus sections if you want to dig deeper.
  • Practicality first. Everything essential to solving reinforcement learning problems is worth mentioning. We won't shun away from covering tricks and heuristics. For every major idea there should be a lab that makes you to “feel” it on a practical problem.
  • Git-course. Know a way to make the course better? Noticed a typo in a formula? Found a useful link? Made the code more readable? Made a version for alternative framework? You're awesome! Pull-request it!

Course info

Additional materials

Syllabus

The syllabus is approximate: the lectures may occur in a slightly different order and some topics may end up taking two weeks.

  • week1 RL as blackbox optimization

    • Lecture: RL problems around us. Decision processes. Stochastic optimization, Crossentropy method. Parameter space search vs action space search.
    • Seminar: Welcome into openai gym. Tabular CEM for Taxi-v0, deep CEM for box2d environments.
    • Homework description - see week1/README.md.
    • ** YSDA Deadline: 2018.02.26 23.59**
    • ** HSE Deadline: 2018.01.28 23:59**
  • week2 Value-based methods

    • Lecture: Discounted reward MDP. Value-based approach. Value iteration. Policy iteration. Discounted reward fails.
    • Seminar: Value iteration.
    • Homework description - see week2/README.md.
    • ** HSE Deadline: 2018.02.11 23:59**
  • week3 Model-free reinforcement learning

    • Lecture: Q-learning. SARSA. Off-policy Vs on-policy algorithms. N-step algorithms. TD(Lambda).
    • Seminar: Qlearning Vs SARSA Vs Expected Value SARSA
    • Homework description - see week3/README.md.
    • HSE Deadline: 2018.02.15 23:59
  • week4_recap - deep learning recap

    • Lecture: Deep learning 101
    • Seminar: Simple image classification with convnets
    • HSE Deadline: 2018.02.26 23:59
  • week4 Approximate reinforcement learning

    • Lecture: Infinite/continuous state space. Value function approximation. Convergence conditions. Multiple agents trick; experience replay, target networks, double/dueling/bootstrap DQN, etc.
    • Seminar: Approximate Q-learning with experience replay. (CartPole, Atari)
  • week5 Exploration in reinforcement learning

    • Lecture: Contextual bandits. Thompson Sampling, UCB, bayesian UCB. Exploration in model-based RL, MCTS. "Deep" heuristics for exploration.
    • Seminar: bayesian exploration for contextual bandits. UCB for MCTS.
  • week6 Policy gradient methods I

    • Lecture: Motivation for policy-based, policy gradient, logderivative trick, REINFORCE/crossentropy method, variance reduction(baseline), advantage actor-critic (incl. GAE)
    • Seminar: REINFORCE, advantage actor-critic
  • week7_recap Recurrent neural networks recap

    • Lecture: Problems with sequential data. Recurrent neural netowks. Backprop through time. Vanishing & exploding gradients. LSTM, GRU. Gradient clipping
    • Seminar: character-level RNN language model
  • week7 Partially observable MDPs

    • Lecture: POMDP intro. POMDP learning (agents with memory). POMDP planning (POMCP, etc)
    • Seminar: Deep kung-fu & doom with recurrent A3C and DRQN
  • week8 Applications II

    • Lecture: Reinforcement Learning as a general way to optimize non-differentiable loss. G2P, machine translation, conversation models, image captioning, discrete GANs. Self-critical sequence training.
    • Seminar: Simple neural machine translation with self-critical sequence training
  • week9 Policy gradient methods II

    • Lecture: Trust region policy optimization. NPO/PPO. Deterministic policy gradient. DDPG. Bonus: DPG for discrete action spaces.
    • Seminar: Approximate TRPO for simple robotic tasks.
  • Some after-course bonus materials

Course staff

Course materials and teaching by

Contributions