Deep Reinforcement Learning
Lectures
0 - Introduction
1 - Tabular RL
1.1 - Sampling and Bandits
n-armed bandits, the simplest RL setting, which can be solved by sampling alone.
Slide, Book
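
As a first taste of sampling-based RL, here is a minimal sketch (not from the course material; the arm distributions, ε and seed are arbitrary assumptions) of an ε-greedy agent estimating the action values of a 5-armed bandit by incremental averaging:

```python
import numpy as np

rng = np.random.default_rng(42)
true_means = rng.normal(0.0, 1.0, size=5)  # hidden mean reward of each arm

Q = np.zeros(5)   # sampled estimates of the action values
N = np.zeros(5)   # how often each arm was pulled
epsilon = 0.1     # exploration rate (arbitrary choice)

for t in range(1000):
    # epsilon-greedy action selection
    if rng.random() < epsilon:
        a = int(rng.integers(5))
    else:
        a = int(np.argmax(Q))
    r = rng.normal(true_means[a], 1.0)  # sample a noisy reward
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]           # incremental mean update

print("estimated:", np.round(Q, 2))
print("true     :", np.round(true_means, 2))
```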
1.2 - Markov Decision Processes and Dynamic Programming
MDPs are the basic RL framework. The value functions and the Bellman equations fully characterize an MDP. Dynamic programming is a model-based method that solves the Bellman equations iteratively.
Slide, Book
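
To make the Bellman optimality backup concrete, a minimal value-iteration sketch on an invented deterministic chain MDP (states, rewards and γ are arbitrary choices):

```python
import numpy as np

# Toy chain MDP: states 0..4, actions 0 (left) and 1 (right),
# reward +1 when entering the last state.
n_states, gamma = 5, 0.9
V = np.zeros(n_states)

def step(s, a):
    """Deterministic transition model of the toy MDP."""
    s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

# Value iteration: sweep the Bellman optimality backup until convergence
for _ in range(100):
    for s in range(n_states):
        V[s] = max(r + gamma * V[s2] for s2, r in (step(s, a) for a in (0, 1)))

print(np.round(V, 3))  # values grow toward the rewarding end of the chain
```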
1.3 - Monte Carlo control
Monte Carlo control estimates value functions by sampling complete episodes and derives the optimal policy through action selection, either on- or off-policy.
Slide, Book
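
A possible minimal sketch of on-policy, every-visit MC control on a toy chain (the environment and hyperparameters are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma, eps = 5, 0.9, 0.1
Q = np.zeros((n_states, 2))   # action-value estimates
N = np.zeros((n_states, 2))   # visit counts

def step(s, a):
    """Toy chain: action 0 moves left, action 1 moves right;
    the episode ends with reward +1 when the last state is reached."""
    s2 = max(0, s - 1) if a == 0 else s + 1
    done = (s2 == n_states - 1)
    return s2, (1.0 if done else 0.0), done

def policy(s):
    """Epsilon-greedy action selection with random tie-breaking."""
    if rng.random() < eps:
        return int(rng.integers(2))
    return int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))

for _ in range(2000):
    # 1. sample one complete episode with the current policy
    s, traj, done = 0, [], False
    while not done:
        a = policy(s)
        s2, r, done = step(s, a)
        traj.append((s, a, r))
        s = s2
    # 2. every-visit MC: propagate the return backwards and average
    G = 0.0
    for s, a, r in reversed(traj):
        G = r + gamma * G
        N[s, a] += 1
        Q[s, a] += (G - Q[s, a]) / N[s, a]

print(Q.argmax(axis=1)[:-1])  # greedy policy: should always move right
```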
1.4 - Temporal Difference
TD algorithms learn value functions from single transitions. Q-learning is the best-known off-policy variant.
Slide, Book
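
In contrast to MC, the Q-learning update needs only single transitions instead of complete episodes. A minimal sketch on the same kind of invented toy chain:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma, alpha, eps = 5, 0.9, 0.1, 0.1
Q = np.zeros((n_states, 2))

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else s + 1
    done = (s2 == n_states - 1)
    return s2, (1.0 if done else 0.0), done

for _ in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy behavior policy (random tie-breaking)
        if rng.random() < eps:
            a = int(rng.integers(2))
        else:
            a = int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))
        s2, r, done = step(s, a)
        # Q-learning: bootstrap with the greedy target max_a' Q(s', a'),
        # regardless of the action the behavior policy will take next
        target = r if done else r + gamma * Q[s2].max()
        Q[s, a] += alpha * (target - Q[s, a])
        s = s2

print(np.round(Q, 2))
```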
1.5 - Function Approximation
Value functions can be approximated by any function approximator, making it possible to apply RL to continuous state or action spaces.
Slide, Book
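
A minimal sketch of semi-gradient TD(0) with a linear approximator V(s) ≈ w·x(s); one-hot features are used here only for readability, and any feature map could be substituted (the random-walk task is invented):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma, alpha = 10, 0.9, 0.05
w = np.zeros(n_states)  # V(s) is approximated as w . x(s)

def features(s):
    """One-hot features for readability; tiles, RBFs or a network's
    last layer could be plugged in here instead."""
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

# policy evaluation of a fixed random walk with semi-gradient TD(0)
for _ in range(500):
    s = n_states // 2
    while 0 < s < n_states - 1:
        s2 = s + (1 if rng.random() < 0.5 else -1)
        r = 1.0 if s2 == n_states - 1 else 0.0
        done = s2 in (0, n_states - 1)
        target = r + (0.0 if done else gamma * (w @ features(s2)))
        td_error = target - w @ features(s)
        w += alpha * td_error * features(s)  # semi-gradient update on w
        s = s2

print(np.round(w, 2))  # values increase toward the rewarding terminal state
```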
1.6 - Deep Neural Networks
Quick overview of the main neural network architectures needed for the rest of the course.
Slide, Book
2 - Model-free RL
2.1 - DQN: Deep Q-Network
DQN (Mnih et al., 2013) was the first successful application of deep networks to the RL problem. It was applied to Atari video games and sparked the current interest in deep RL methods.
Slide, Book
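
A skeleton of the two ingredients that stabilize DQN, experience replay and a frozen target network, with a linear Q-function standing in for the deep network (a sketch under invented assumptions, not the original algorithm):

```python
from collections import deque
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha = 5, 2, 0.9, 0.05
W = np.zeros((n_states, n_actions))  # "online network" (linear stand-in)
W_target = W.copy()                  # frozen target network
buffer = deque(maxlen=10_000)        # experience replay buffer

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else s + 1
    done = (s2 == n_states - 1)
    return s2, (1.0 if done else 0.0), done

frames = 0
for _ in range(200):
    s, done = 0, False
    while not done:
        if rng.random() < 0.1:
            a = int(rng.integers(n_actions))
        else:
            a = int(rng.choice(np.flatnonzero(W[s] == W[s].max())))
        s2, r, done = step(s, a)
        buffer.append((s, a, r, s2, done))
        s = s2
        frames += 1
        if len(buffer) >= 32:
            # replay: a random minibatch breaks temporal correlations
            for i in rng.integers(len(buffer), size=32):
                bs, ba, br, bs2, bdone = buffer[i]
                # targets come from the frozen network for stability
                y = br if bdone else br + gamma * W_target[bs2].max()
                W[bs, ba] += alpha * (y - W[bs, ba])
        if frames % 100 == 0:
            W_target = W.copy()  # periodically sync the target network

print(np.round(W, 2))
```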
2.2 - Beyond DQN
Various extensions of the DQN algorithm have been proposed in the following years: distributional learning, parameter noise, distributed learning and recurrent architectures.
Slide, Book
2.3 - PG: Policy Gradient
Policy gradient methods learn the policy directly, without requiring action selection over value functions.
Slide, Book
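
A minimal REINFORCE sketch with a softmax policy over tabular logits, standing in for a policy network (task and hyperparameters are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, lr = 5, 2, 0.9, 0.1
theta = np.zeros((n_states, n_actions))  # policy logits

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else s + 1
    done = (s2 == n_states - 1)
    return s2, (1.0 if done else 0.0), done

for _ in range(2000):
    # sample a full episode from the current stochastic policy
    s, traj, done = 0, [], False
    while not done:
        a = int(rng.choice(n_actions, p=softmax(theta[s])))
        s2, r, done = step(s, a)
        traj.append((s, a, r))
        s = s2
    # REINFORCE: ascend grad log pi(a|s) * G_t for each visited step
    G = 0.0
    for s, a, r in reversed(traj):
        G = r + gamma * G
        grad_log = -softmax(theta[s])   # gradient of log softmax
        grad_log[a] += 1.0              # is one_hot(a) - pi(.|s)
        theta[s] += lr * G * grad_log

print(softmax(theta[0]))  # moving right should dominate in state 0
```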
2.4 - AC: Actor-Critic
A3C (Mnih et al., 2016) is an actor-critic architecture estimating the policy gradient from multiple parallel workers.
Slide, Book
2.5 - DPG: Deterministic Policy Gradient
DDPG (Lillicrap et al., 2015) is an off-policy actor-critic architecture particularly suited for continuous control problems such as robotics.
Slide, Book
2.6 - PPO: Proximal Policy Optimization
PPO (Schulman et al., 2017) allows stable learning by restricting policy updates to an approximate trust region.
Slide, Book
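
The heart of PPO is its clipped surrogate objective; a small sketch of just that loss, evaluated on an imaginary batch of probability ratios and advantages:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective L^CLIP (to be maximized):
    min(r * A, clip(r, 1 - eps, 1 + eps) * A), averaged over the batch."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()

# ratios pi_new(a|s) / pi_old(a|s) and advantages from an imaginary batch
ratio = np.array([0.8, 1.0, 1.5, 2.5])
adv = np.array([1.0, -0.5, 2.0, 1.0])
print(ppo_clip_objective(ratio, adv))
# the ratio 2.5 is clipped to 1.2: a large policy step gains nothing extra
```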
2.7 - ACER: Actor-Critic with Experience Replay
The natural gradient methods presented previously (TRPO, PPO) are stochastic actor-critic methods, and therefore strictly on-policy. ACER adds an experience replay buffer with importance sampling corrections in order to reuse off-policy data.
Slide, Book
2.8 - SAC: Soft Actor-Critic
Maximum Entropy RL modifies the RL objective: it learns optimal policies that also explore the environment as much as possible. SAC (Haarnoja et al., 2018) is an off-policy actor-critic architecture for soft RL.
Slide, Book
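
In the discrete-action case, the soft value function and the corresponding Boltzmann policy fit in a few lines; a sketch illustrating how the temperature α trades off return against entropy (the Q-values are invented):

```python
import numpy as np

def soft_value(q, alpha):
    """Soft state value V(s) = alpha * log sum_a exp(Q(s,a) / alpha),
    computed with the usual log-sum-exp trick for numerical stability.
    For alpha -> 0 it tends to max_a Q(s,a); larger alpha rewards entropy."""
    z = q / alpha
    m = z.max()
    return alpha * (m + np.log(np.exp(z - m).sum()))

def soft_policy(q, alpha):
    """The optimal soft policy is a Boltzmann distribution over Q/alpha."""
    z = q / alpha
    e = np.exp(z - z.max())
    return e / e.sum()

q = np.array([1.0, 0.9, -2.0])
for alpha in (0.01, 0.1, 1.0):
    print(f"alpha={alpha}: V={soft_value(q, alpha):.3f}, "
          f"pi={np.round(soft_policy(q, alpha), 3)}")
```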
3 - Model-based RL
3.1 - Model-based RL
Model-based RL uses a world model to emulate the future. Dyna-like architectures use these model rollouts to augment model-free (MF) algorithms.
Slide, Book
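
A minimal Dyna-Q sketch: each real transition trains Q directly and is also stored in a tabular world model, which then generates n additional imagined updates (environment and hyperparameters are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma, alpha, eps, n_plan = 5, 0.9, 0.1, 0.1, 10
Q = np.zeros((n_states, 2))
model = {}  # learned world model: (s, a) -> (r, s', done)

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else s + 1
    done = (s2 == n_states - 1)
    return s2, (1.0 if done else 0.0), done

def q_update(s, a, r, s2, done):
    target = r if done else r + gamma * Q[s2].max()
    Q[s, a] += alpha * (target - Q[s, a])

for _ in range(100):
    s, done = 0, False
    while not done:
        if rng.random() < eps:
            a = int(rng.integers(2))
        else:
            a = int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))
        s2, r, done = step(s, a)
        q_update(s, a, r, s2, done)    # direct RL on the real transition
        model[(s, a)] = (r, s2, done)  # memorize the transition in the model
        # planning: n extra updates on transitions replayed by the model
        keys = list(model)
        for _ in range(n_plan):
            ps, pa = keys[rng.integers(len(keys))]
            q_update(ps, pa, *model[(ps, pa)])
        s = s2

print(np.round(Q, 2))
```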
3.2 - Planning with learned world models
Learning a world model from data and planning the optimal sequence of actions using model-predictive control is much easier than learning the optimal policy directly. Modern model-based algorithms (World Models, PlaNet, Dreamer) exploit this property to reduce sample complexity.
Slide, Book
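
A minimal random-shooting MPC sketch: sample candidate action sequences, roll them out through the model, and execute only the first action of the cheapest sequence before replanning. Here a hand-written point-mass model stands in for a learned one, and all numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

def model(state, action):
    """Stands in for a learned world model: a 1D point mass.
    state = (position, velocity); action = acceleration in [-1, 1]."""
    pos, vel = state
    vel = vel + 0.1 * action
    pos = pos + 0.1 * vel
    return np.array([pos, vel])

def cost(state):
    return state[0] ** 2 + 0.1 * state[1] ** 2  # drive the mass to the origin

def mpc_action(state, horizon=10, n_candidates=100):
    """Random shooting: return the first action of the cheapest rollout."""
    best_a, best_c = 0.0, np.inf
    for _ in range(n_candidates):
        seq = rng.uniform(-1.0, 1.0, size=horizon)
        s, c = state.copy(), 0.0
        for a in seq:
            s = model(s, a)
            c += cost(s)
        if c < best_c:
            best_a, best_c = float(seq[0]), c
    return best_a

state = np.array([1.0, 0.0])
for t in range(50):                        # receding horizon: replan each step
    state = model(state, mpc_action(state))
print(np.round(state, 3))                  # should have moved toward the origin
```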
3.3 - Planning (MPC, TDM)
Model-based learning is not only used to augment MF methods with imaginary rollouts: learned models can also be used for planning directly, as in model-predictive control (MPC) and temporal difference models (TDM).
Slide, Book
3.4 - World models, Dreamer
The neural networks used in deep RL are usually small, as rewards do not contain enough information to train huge networks.
Slide, Book
3.5 - AlphaGo
AlphaGo surprised the world in 2016 by beating Lee Sedol, the world champion of Go. It combines model-free learning through policy gradient and self-play with model-based planning using MCTS (Monte Carlo Tree Search).
Slide, Book
4 - Outlook
Recommended readings
- Richard Sutton and Andrew Barto (2018). Reinforcement Learning: An Introduction (2nd edition). MIT Press.
http://incompleteideas.net/book/the-book-2nd.html
- CS294 course by Sergey Levine at UC Berkeley.
http://rll.berkeley.edu/deeprlcourse/
- Reinforcement Learning course by David Silver at UCL.