Deep Reinforcement Learning

Lectures

0 - Introduction

Introduction to the main concepts of reinforcement learning and an overview of its current applications.
Slide, Book

1 - Tabular RL

1.1 - Sampling and Bandits
n-armed bandits, the simplest RL setting that can be solved by sampling.
Slide, Book
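
As a quick illustration (a minimal sketch, not taken from the slides or the book), an ε-greedy agent for an n-armed bandit can estimate the action values with incremental sample averages; the environment itself is left abstract here:

    import random

    class EpsilonGreedyBandit:
        """Minimal epsilon-greedy agent with incremental sample-average estimates."""
        def __init__(self, n_arms, epsilon=0.1):
            self.epsilon = epsilon
            self.values = [0.0] * n_arms   # estimated value Q(a) of each arm
            self.counts = [0] * n_arms     # number of times each arm was selected

        def select_arm(self):
            # Explore with probability epsilon, otherwise exploit the current best estimate.
            if random.random() < self.epsilon:
                return random.randrange(len(self.values))
            return max(range(len(self.values)), key=lambda a: self.values[a])

        def update(self, arm, reward):
            # Incremental mean: Q_{n+1} = Q_n + (r - Q_n) / (n + 1)
            self.counts[arm] += 1
            self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
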
1.2 - Markov Decision Processes and Dynamic Programming
MDPs are the basic RL framework. The value functions and the Bellman equations fully characterize an MDP. Dynamic programming is a model-based method that iteratively solves the Bellman equations.
Slide, Book
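
For reference, the Bellman optimality equation that dynamic programming solves iteratively, in standard notation ($p$ is the transition model, $\gamma$ the discount factor):

    $$V^*(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \, \big[ r + \gamma \, V^*(s') \big]$$
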
1.3 - Monte Carlo control
Monte Carlo control estimates value functions by sampling complete episodes and derives the optimal policy through action selection, either on- or off-policy.
Slide, Book
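
For reference, the Monte Carlo estimate simply averages the returns observed after each visited state-action pair:

    $$Q^\pi(s, a) \approx \frac{1}{N(s, a)} \sum_{i=1}^{N(s, a)} R_i(s, a) \qquad \text{with} \qquad R_t = \sum_{k=0}^{T - t - 1} \gamma^k \, r_{t+k+1}$$
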
1.4 - Temporal Difference
TD algorithms learn value functions from single transitions. Q-learning is the most famous off-policy variant.
Slide, Book
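
For reference, the Q-learning update applied after each transition $(s, a, r, s')$:

    $$Q(s, a) \leftarrow Q(s, a) + \alpha \, \big[ r + \gamma \, \max_{a'} Q(s', a') - Q(s, a) \big]$$
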
1.5 - Function Approximation
Value functions can be approximated by any function approximator, making it possible to apply RL to continuous state or action spaces.
Slide, Book
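
With a parameterized approximator $Q_\theta(s, a)$, the tabular Q-learning update becomes a semi-gradient step on the parameters $\theta$:

    $$\theta \leftarrow \theta + \alpha \, \big[ r + \gamma \, \max_{a'} Q_\theta(s', a') - Q_\theta(s, a) \big] \, \nabla_\theta Q_\theta(s, a)$$
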
1.6 - Deep Neural Networks
Quick overview of the main neural network architectures needed for the rest of the course.
Slide, Book

2 - Model-free RL

2.1 - DQN: Deep Q-Network
DQN (Mnih et al., 2013) was the first successful application of deep networks to the RL problem. It was applied to Atari video games and sparked the current interest in deep RL methods.
Slide, Book
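
For reference, DQN minimizes the following loss over minibatches sampled from an experience replay buffer $\mathcal{D}$, using a separate target network $\theta^-$ (introduced in the 2015 Nature version):

    $$\mathcal{L}(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \Big[ \big( r + \gamma \, \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a) \big)^2 \Big]$$
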
2.2 - Beyond DQN
Various extensions to the DQN algorithm have been proposed in the following years: distributional learning, parameter noise, distributed learning and recurrent architectures.
Slide, Book
2.3 - PG: Policy Gradient
Policy gradient methods learn the policy directly, without requiring action selection over value functions.
Slide, Book
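
For reference, the policy gradient theorem on which these methods (e.g. REINFORCE) are built:

    $$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \big[ \nabla_\theta \log \pi_\theta(a \mid s) \, Q^{\pi_\theta}(s, a) \big]$$
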
2.4 - AC: Actor-Critic
A3C (Mnih et al., 2016) is an actor-critic architecture estimating the policy gradient from multiple parallel workers.
Slide, Book
2.5 - DPG: Deterministic Policy Gradient
DDPG (Lillicrap et al., 2015) is an off-policy actor-critic architecture particularly suited for continuous control problems such as robotics.
Slide, Book
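
For reference, the deterministic policy gradient used by DDPG for the actor $\mu_\theta$, with a learned critic $Q_\varphi$ and a replay buffer $\mathcal{D}$:

    $$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \mathcal{D}} \big[ \nabla_a Q_\varphi(s, a) \big|_{a = \mu_\theta(s)} \, \nabla_\theta \mu_\theta(s) \big]$$
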
2.6 - PPO: Proximal Policy Optimization
PPO (Schulman et al., 2017) allows stable learning by estimating trust regions for the policy updates.
Slide, Book
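
For reference, PPO's clipped surrogate objective, with the probability ratio $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ and the advantage $A_t$:

    $$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t \big[ \min\big( \rho_t(\theta) \, A_t, \; \text{clip}(\rho_t(\theta), 1 - \epsilon, 1 + \epsilon) \, A_t \big) \big]$$
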
2.7 - ACER: Actor-Critic with Experience Replay
The natural gradient methods presented previously (TRPO, PPO) are stochastic actor-critic methods and therefore strictly on-policy. ACER extends them with experience replay, using importance sampling to correct for the off-policy distribution of the samples.
Slide, Book
2.8 - SAC: Soft Actor-Critic
Maximum Entropy RL modifies the RL objective by learning optimal policies that also explore the environment as much as possible. SAC (Haarnoja et al., 2018) is an off-policy actor-critic architecture for soft RL.
Slide, Book
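
For reference, the maximum entropy objective optimized by SAC, where the temperature $\alpha$ trades off return and policy entropy:

    $$J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \pi} \big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big]$$
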

3 - Model-based RL

3.1 - Model-based RL
Model-based RL uses a world model to emulate the future. Dyna-like architectures use these imagined rollouts to augment model-free (MF) algorithms.
Slide, Book
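
A minimal, illustrative Dyna-Q loop (a sketch, not taken from the slides or the book): each real transition updates both the Q-values and a tabular world model, and the model then generates additional simulated updates. The `env` interface (`reset`, `step`, `actions`) and the hyperparameters are hypothetical placeholders:

    import random

    def dyna_q_episode(env, Q, model, alpha=0.1, gamma=0.99, n_planning=10, epsilon=0.1):
        """One episode of Dyna-Q. Q: defaultdict(float) over (state, action); model: dict."""
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection from the current Q-values.
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # Direct (model-free) Q-learning update from the real transition.
            bootstrap = 0.0 if done else max(Q[(s_next, act)] for act in env.actions)
            Q[(s, a)] += alpha * (r + gamma * bootstrap - Q[(s, a)])
            # Store the transition in a deterministic tabular world model.
            model[(s, a)] = (r, s_next, done)
            # Planning: replay n_planning simulated transitions sampled from the model.
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next, pdone) = random.choice(list(model.items()))
                pbootstrap = 0.0 if pdone else max(Q[(ps_next, act)] for act in env.actions)
                Q[(ps, pa)] += alpha * (pr + gamma * pbootstrap - Q[(ps, pa)])
            s = s_next
        return Q, model
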
3.2 - Planning with learned world models
Learning a world model from data and planning the optimal sequence of actions using model-predictive control is much easier than learning the optimal policy directly. Modern model-based algorithms (World models, PlaNet, Dreamer) make use of this property to reduce the sample complexity.
Slide, Book
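
An illustrative random-shooting MPC planner (a sketch, not from the course materials), assuming a learned model exposing a hypothetical `predict(state, action) -> (next_state, reward)` method and a `sample_action()` helper:

    def random_shooting_mpc(model, state, sample_action, horizon=15, n_candidates=200, gamma=0.99):
        """Return the first action of the best random action sequence under the learned model."""
        best_return, best_first_action = float("-inf"), None
        for _ in range(n_candidates):
            # Roll out one random candidate action sequence entirely inside the model.
            s, total, discount, first_action = state, 0.0, 1.0, None
            for t in range(horizon):
                a = sample_action()
                if t == 0:
                    first_action = a
                s, r = model.predict(s, a)   # imagined transition, no real environment step
                total += discount * r
                discount *= gamma
            if total > best_return:
                best_return, best_first_action = total, first_action
        # Only the first action is executed; planning is repeated at the next step (receding horizon).
        return best_first_action
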
3.3 - Planning (MPC, TDM)
Model-based learning is not only used to augment MF methods with imaginary rollouts: the learned model can also be used directly to plan actions, for example with model-predictive control (MPC) or temporal difference models (TDM).
Slide, Book
3.4 - World models, Dreamer
The neural networks used in deep RL are usually small, as rewards do not contain enough information to train huge networks. World models and Dreamer instead train the world model with self-supervised (reconstruction and prediction) losses and learn the policy from trajectories imagined in the latent space.
Slide, Book
3.5 - AlphaGo
AlphaGo surprised the world in 2016 by beating Lee Sedol, the world champion of Go. It combines model-free learning through policy gradient and self-play with model-based planning using MCTS (Monte Carlo Tree Search).
Slide, Book

4 - Outlook

4.1 - Outlook
Current RL research investigates many different directions: inverse RL, intrinsic motivation, hierarchical RL, meta RL, offline RL, multi-agent RL (MARL), etc.
Slide, Book