A course on reinforcement learning in the wild. Taught on-campus at HSE and YSDA (in Russian) and maintained to be friendly to online students (both English and Russian).
- Optimize for the curious. All materials that aren't covered in detail come with links to more information and related materials (D. Silver / Sutton / blogs / whatever). Assignments have bonus sections if you want to dig deeper.
- Practicality first. Everything essential to solving reinforcement learning problems is worth mentioning, and we won't shy away from covering tricks and heuristics. Every major idea comes with a lab that lets you "feel" it on a practical problem.
- Git-course. Know a way to make the course better? Noticed a typo in a formula? Made the code more readable? Made a version for an alternative framework? You're awesome! Pull-request it!
- Lectures: https://yadi.sk/d/loPpY45J3EAYfU
- HSE classes are on Mondays at 18:10 in Room 505
- YSDA classes are on Thursdays at 18:00 in the "Princeton" classroom
- Online student survival guide
- Installing the libraries - guide and issues thread
- Magical button that creates a VM (may be down from time to time)
- Telegram chat room (russian)
- English chat -
- How to submit homework [HSE and YSDA only]: anytask instructions and grading rules
- E-mail for everything else: practicalrl17@gmail.com (please don't submit homework via e-mail)
- Anonymous feedback form for everything that didn't go through e-mail.
- About the course
- A large list of RL materials - awesome-rl
- 16.02.17 - Lectures moved
- 16.02.17 - HSE homework 3 added
- 14.02.17 - HSE deadlines for weeks 1-2 extended!
- 14.02.17 - anytask invites moved here
- 14.02.17 - if you're on the HSE track and we didn't reply to your week0 homework submission, raise panic!
- 13.02.17 - Added invites for anytask.org
- 11.02.17 - week2 success thresholds are now easier: score above +50 for LunarLander or above -180 for MountainCar. Fully solving the env will yield bonus points.
- 10.02.17 - from now on, we'll formally describe homework and add useful links via ./week*/README.md files. Example.
- 9.02.17 - YSDA track started
- 7.02.17 - HW submissions checked
- 6.02.17 - week2 uploaded
- 27.01.17 - merged fix by omtcyfz, thanks!
- 27.01.17 - added course mail for homework submission: practicalrl17@gmail.com
- 23.01.17 - first class happened
- 23.01.17 - created repo
- week0 Welcome to the MDP
- Lecture: RL problems around us. Markov decision processes. Simple solutions through combinatorial optimization.
- Seminar: FrozenLake with genetic algorithms (policy evaluation is sketched below)
- Homework description - ./week0/README.md
- HSE homework deadline: 23.59 1.02.17
- YSDA homework deadline: 23.59 19.02.17
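The genetic-algorithm seminar boils down to repeatedly scoring candidate policies. A minimal sketch of that evaluation step, assuming the classic gym API; the `evaluate` helper is illustrative, not the seminar code:

```python
# Scoring a deterministic policy on FrozenLake - the evaluation step
# a genetic algorithm would call on each candidate.
import gym
import numpy as np

env = gym.make("FrozenLake-v0")
n_states, n_actions = env.observation_space.n, env.action_space.n

def evaluate(policy, n_episodes=100):
    """Average undiscounted return of policy[state] -> action."""
    total = 0.0
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            s, r, done, _ = env.step(policy[s])
            total += r
    return total / n_episodes

random_policy = np.random.randint(n_actions, size=n_states)
print(evaluate(random_policy))
```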
- week1 Monte-carlo methods
- Lecture: Crossentropy method, in general and for RL. Extension to continuous state & action spaces. Limitations.
- Seminar: Tabular CEM for Taxi-v0, deep CEM for Box2D environments (the CEM update is sketched below)
- HSE homework deadline: 23.59 15.02.17
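For intuition, here is what one crossentropy-method iteration looks like for a tabular policy; `cem_update` and the session format are illustrative assumptions, not the course notebook:

```python
# One CEM iteration: keep the elite sessions, refit the policy to them.
import numpy as np

def cem_update(policy, sessions, percentile=50, smoothing=1e-3):
    """policy: (n_states, n_actions) probabilities.
    sessions: list of (states, actions, total_reward) triples."""
    rewards = np.array([r for (_, _, r) in sessions])
    threshold = np.percentile(rewards, percentile)
    counts = np.zeros_like(policy)
    for states, actions, r in sessions:
        if r >= threshold:                 # keep only elite sessions
            for s, a in zip(states, actions):
                counts[s, a] += 1
    counts += smoothing                    # avoid zero rows for unvisited states
    return counts / counts.sum(axis=1, keepdims=True)
```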
- week2 Temporal Difference
- Lecture: Discounted reward MDPs. Value iteration. Q-learning. Temporal difference vs Monte-Carlo.
- Seminar: Tabular Q-learning (the update rule is sketched below)
- Homework description - see ./week2/README.md
- HSE homework deadline: 23.59 15.02.17
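The whole tabular algorithm hangs on one line, the TD update. A minimal sketch; `alpha`, `gamma` and `n_actions` are illustrative values:

```python
# Tabular Q-learning: nudge Q(s, a) toward the one-step bootstrap target.
from collections import defaultdict

Q = defaultdict(float)          # Q[(state, action)] -> value
alpha, gamma = 0.1, 0.99        # learning rate, discount (illustrative)
n_actions = 4                   # depends on the environment

def q_update(s, a, r, s_next, done):
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in range(n_actions))
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```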
- week3 Value-based algorithms
- Lecture: SARSA. Off-policy vs on-policy algorithms. N-step algorithms. Eligibility traces.
- Seminar: Q-learning vs SARSA vs Expected SARSA in the wild (the three targets are sketched below)
- Homework description
- HSE homework deadline 23.59 22.02.17
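The three algorithms differ only in the bootstrap target. A side-by-side sketch under an epsilon-greedy policy (function name and signature are illustrative):

```python
# Q-learning vs SARSA vs Expected SARSA, reduced to their targets.
import numpy as np

def targets(Q_next, a_next, r, gamma=0.99, eps=0.1):
    """Q_next: Q(s', a) for all actions; a_next: action actually taken in s'."""
    q_learning = r + gamma * Q_next.max()              # off-policy: greedy max
    sarsa = r + gamma * Q_next[a_next]                 # on-policy: sampled action
    n = len(Q_next)
    pi = np.full(n, eps / n)                           # epsilon-greedy probabilities
    pi[Q_next.argmax()] += 1 - eps
    expected_sarsa = r + gamma * (pi * Q_next).sum()   # expectation over the policy
    return q_learning, sarsa, expected_sarsa
```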
- week4 Approximate reinforcement learning
- Lecture: Infinite/continuous state spaces. Value function approximation. Convergence conditions. The multiple-agents trick.
- Seminar: Approximate Q-learning (CartPole, MountainCar, Breakout); a linear version is sketched below
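A sketch of the approximate case with a linear model over state features, written in plain numpy rather than Theano to stay self-contained; sizes and names are assumed:

```python
# Approximate Q-learning: semi-gradient update of a linear Q-function.
import numpy as np

n_features, n_actions = 8, 2        # illustrative sizes
W = np.zeros((n_features, n_actions))

def q_values(phi):
    """phi: feature vector of the state -> vector of Q(s, a) for all a."""
    return phi @ W

def sgd_step(phi, a, r, phi_next, done, alpha=0.01, gamma=0.99):
    target = r + (0.0 if done else gamma * q_values(phi_next).max())
    td_error = target - q_values(phi)[a]
    W[:, a] += alpha * td_error * phi   # semi-gradient: target treated as constant
```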
Somewhere here: an introduction to Theano.
- week i+1 Deep reinforcement learning
- Lecture: Deep Q-learning/SARSA/whatever. Heuristics and the motivation behind them: experience replay, target networks, double/dueling/bootstrap DQN, etc.
- Seminar: Playing Atari with deep reinforcement learning. Experience replay (a buffer is sketched below). (classwork = doombasic)
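A minimal experience-replay buffer, the core DQN trick named above; a sketch, not the course implementation:

```python
# Experience replay: store transitions, train on random past minibatches.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10**5):
        self.storage = deque(maxlen=capacity)   # oldest transitions get evicted

    def add(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        """Uniform minibatch of past transitions."""
        return random.sample(self.storage, batch_size)
```

Sampling uniformly from old transitions decorrelates consecutive updates and reuses data, which is why DQN needs it.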
- week i+1 Policy-based methods
- Lecture: Motivation for policy-based methods, policy gradient, log-derivative trick, REINFORCE/crossentropy method, variance reduction (advantage), advantage actor-critic (incl. n-step advantage), off-policy actor-critic (Off-PAC), natural gradients (briefly), continuous action spaces (teaser).
- Seminar: A2C vs Q-learning for MountainCar/Doom, entropy regularization & tricks (the REINFORCE estimator is sketched below)
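The log-derivative trick fits in a few lines of numpy. A sketch of the REINFORCE gradient with respect to per-step logits, using a crude mean baseline; names and shapes are illustrative:

```python
# REINFORCE: push up log-probabilities of actions, weighted by advantage.
import numpy as np

def reinforce_grad_logits(logits, actions, returns):
    """Ascent direction for sum_t log pi(a_t|s_t) * A_t w.r.t. per-step logits.

    logits: (T, n_actions); actions: (T,) ints; returns: (T,) discounted returns.
    """
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)     # softmax policy
    advantage = returns - returns.mean()          # crude constant baseline
    grad = -probs * advantage[:, None]
    grad[np.arange(len(actions)), actions] += advantage
    return grad   # row t equals advantage[t] * (onehot(a_t) - probs[t])
```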
- week i+1 Trust Region Policy Optimization
- Lecture: Trust region policy optimization in detail (its objective is given below).
- Seminar: Approximate TRPO vs approximate Q-learning for gym Box2D envs (robotics-themed)
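For reference, the constrained problem TRPO solves at each step (Schulman et al., 2015): maximize the surrogate advantage subject to a KL trust region,

```latex
\max_\theta \; \mathbb{E}_{s,a \sim \pi_{\theta_\text{old}}}
\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_\text{old}}(a \mid s)}
       A^{\pi_{\theta_\text{old}}}(s,a) \right]
\quad \text{s.t.} \quad
\mathbb{E}_s \left[ D_{\mathrm{KL}}\!\left(\pi_{\theta_\text{old}}(\cdot \mid s)
\,\big\|\, \pi_\theta(\cdot \mid s)\right) \right] \le \delta
```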
- week i+1 Large/continuous action spaces. Case study: recsys
- Lecture: Continuous action space MDPs. Model-based approach (NAF). Actor-critic approach (DPG, SVG). Trust region policy optimization. The large discrete action space problem. Action embeddings.
- Seminar: Classic Control and BipedalWalker with DDPG vs Q-NAF (the deterministic policy gradient is given below). https://gym.openai.com/envs/BipedalWalker-v2
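The actor-critic route rests on the deterministic policy gradient theorem: the actor simply follows the critic's action-gradient,

```latex
\nabla_\theta J \approx \mathbb{E}_{s}\left[
\left. \nabla_a Q(s, a) \right|_{a = \mu_\theta(s)} \, \nabla_\theta \mu_\theta(s)
\right]
```

DDPG is this update plus the DQN stabilizers (experience replay, target networks) from the deep RL week.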
Somewhere here: an RNN crash-course.
- week i+1 Partially observable MDPs
- Lecture: POMDP intro. Model-based solvers. RNN solvers. RNN tricks: attention, problems with normalization methods, pre-training.
- Seminar: Deep kung-fu with recurrent A2C vs feedforward A2C (a recurrent policy step is sketched below)
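Why an RNN helps in a POMDP, in one step: the hidden state accumulates the observation history that a feedforward agent never sees. A toy sketch with illustrative shapes and names:

```python
# One step of a recurrent policy: update memory, emit action logits.
import numpy as np

def rnn_policy_step(obs, h, Wx, Wh, b, Wout):
    h_next = np.tanh(obs @ Wx + h @ Wh + b)   # memory summarizes the history
    logits = h_next @ Wout                    # action preferences
    return logits, h_next

rng = np.random.RandomState(0)                # toy usage with random weights
Wx, Wh = rng.randn(3, 16), rng.randn(16, 16)
b, Wout = np.zeros(16), rng.randn(16, 4)
logits, h = rnn_policy_step(rng.randn(3), np.zeros(16), Wx, Wh, b, Wout)
```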
- week i+1 Advanced exploration methods: intrinsic motivation
- Lecture: Augmented rewards. Heuristics (UNREAL, density-based models); the formal approach: information-maximizing exploration. Model-based tricks (see also MCTS). A simple count-based bonus is sketched below.
- Seminar: VIME vs epsilon-greedy for Go 9x9 (bonus: 19x19)
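As a taste of the heuristics family, a count-based curiosity bonus; VIME itself is considerably more involved. The `beta` coefficient and helper name are illustrative:

```python
# Augmented reward: rarely visited states pay an exploration bonus.
import math
from collections import Counter

visit_counts = Counter()

def augmented_reward(r, state, beta=0.1):
    """state must be hashable (e.g. a tuple of discretized features)."""
    visit_counts[state] += 1
    bonus = beta / math.sqrt(visit_counts[state])   # decays with familiarity
    return r + bonus
```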
- week i+1 Advanced exploration methods: the probabilistic approach
- Lecture: Improved exploration methods (quantile-based, etc.). The Bayesian approach. Case study: contextual bandits for RTB.
- Seminar: Bandits (a Thompson-sampling baseline is sketched below)
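A minimal Bernoulli bandit with Thompson sampling, the Bayesian approach the lecture builds toward; the setup below is illustrative, not the seminar notebook:

```python
# Thompson sampling: sample a plausible world from the posterior, act greedily.
import numpy as np

def thompson_run(true_probs, n_steps=1000, seed=0):
    rng = np.random.RandomState(seed)
    k = len(true_probs)
    wins, losses = np.ones(k), np.ones(k)   # Beta(1, 1) priors per arm
    total = 0
    for _ in range(n_steps):
        theta = rng.beta(wins, losses)      # posterior sample of arm means
        a = theta.argmax()                  # greedy in the sampled world
        r = rng.rand() < true_probs[a]
        wins[a] += r
        losses[a] += 1 - r
        total += r
    return total

print(thompson_run([0.3, 0.5, 0.7]))
```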
- week i+1 Case studies I
- Lecture: Reinforcement learning as a general way to optimize non-differentiable loss. KL(p||q) vs KL(q||p). Case studies: machine translation, speech synthesis, conversation models.
- Seminar: Optimizing Levenshtein distance for word transcription (the distance itself is sketched below)
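Levenshtein distance is a textbook example of a loss you cannot backpropagate through, which is exactly why RL applies. A standard dynamic-programming implementation (not the seminar code):

```python
# Edit distance between two strings via the classic two-row DP.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

assert levenshtein("kitten", "sitting") == 3
```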
- week i+1 Hierarchical MDP
- Lecture: MDPs vs the real world. Sparse and delayed rewards. When Q-learning fails. Hierarchical MDPs. Hierarchy as temporal abstraction. MDPs with symbolic reasoning.
- Seminar: Hierarchical RL for Atari games with rare rewards (starting from a pre-trained DQN)
- week i+1 Case studies II
- Lecture: Direct policy optimization: finance. Inverse reinforcement learning: personalized medical treatment, robotics.
- Seminar: Portfolio optimization as a POMDP.
Course materials and teaching by
- Fedor Ratnikov - lectures, seminars, HW checkups
- Alexander Fritsler - lectures, seminars, HW checkups
- Oleg Vasilev - seminars, HW checkups, technical stuff
- Pavel Shvechikov - lectures, seminars, HW checkups
- Using pictures from http://ai.berkeley.edu/home.html
- Other contributions: omtcyfz, dmittov, arogozhnikov