[Survey] Safe-Optimal Control for Motion Planning based on RL
May 18, 2022
—
Table of contents
Optimal Control:
Dynamic Programming
Linear Programming
Tree-Based Planning
ExpectiMinimax
Optimal strategy in games with chance nodes, Melkó E., Nagy B. (2007).
Sparse sampling
A sparse sampling algorithm for near-optimal planning in large Markov decision processes, Kearns M. et al. (2002).
MCTS
Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search, Rémi Coulom, SequeL (2006).
UCT
Bandit based Monte-Carlo Planning, Kocsis L., Szepesvári C. (2006).
- Bandit Algorithms for Tree Search, Coquelin P-A., Munos R. (2007).
OPD
Optimistic Planning for Deterministic Systems, Hren J., Munos R. (2008).
OLOP
Open Loop Optimistic Planning, Bubeck S., Munos R. (2010).
SOOP
Optimistic Planning for Continuous-Action Deterministic Systems, Buşoniu L. et al. (2011).
OPSS
Optimistic planning for sparsely stochastic systems, Buşoniu L., Munos R., De Schutter B., Babuska R. (2011).
HOOT
Sample-Based Planning for Continuous Action Markov Decision Processes, Mansley C., Weinstein A., Littman M. (2011).
HOLOP
Bandit-Based Planning and Learning in Continuous-Action Markov Decision Processes, Weinstein A., Littman M. (2012).
BRUE
Simple Regret Optimization in Online Planning for Markov Decision Processes, Feldman Z., Domshlak C. (2014).
LGP
Logic-Geometric Programming: An Optimization-Based Approach to Combined Task and Motion Planning, Toussaint M. (2015). 🎞️
AlphaGo
Mastering the game of Go with deep neural networks and tree search, Silver D. et al. (2016).
AlphaGo Zero
Mastering the game of Go without human knowledge, Silver D. et al. (2017).
AlphaZero
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, Silver D. et al. (2017).
TrailBlazer
Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning, Grill J. B., Valko M., Munos R. (2017).
MCTSnets
Learning to search with MCTSnets, Guez A. et al. (2018).
ADI
Solving the Rubik’s Cube Without Human Knowledge, McAleer S. et al. (2018).
OPC/SOPC
Continuous-action planning for discounted infinite-horizon nonlinear optimal control with Lipschitz values, Buşoniu L., Pall E., Munos R. (2018).
- Real-time tree search with pessimistic scenarios: Winning the NeurIPS 2018 Pommerman Competition, Osogami T., Takahashi T. (2019).
Control Theory
- (book) The Mathematical Theory of Optimal Processes, Pontryagin L. S., Boltyanskii V. G., Gamkrelidze R. V., Mishchenko E. F. (1962).
- (book) Constrained Control and Estimation, Goodwin G. (2005).
PI²
A Generalized Path Integral Control Approach to Reinforcement Learning, Theodorou E. et al. (2010).
PI²-CMA
Path Integral Policy Improvement with Covariance Matrix Adaptation, Stulp F., Sigaud O. (2010).
iLQG
A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems, Todorov E. (2005). :octocat:
iLQG+
Synthesis and stabilization of complex behaviors through online trajectory optimization, Tassa Y. (2012).
Model Predictive Control
Safe Control:
Robust Control
- Minimax analysis of stochastic problems, Shapiro A., Kleywegt A. (2002).
Robust DP
Robust Dynamic Programming, Iyengar G. (2005).
- Robust Planning and Optimization, Laumanns M. (2011). (lecture notes)
- Robust Markov Decision Processes, Wiesemann W., Kuhn D., Rustem B. (2012).
- Safe and Robust Learning Control with Gaussian Processes, Berkenkamp F., Schoellig A. (2015). 🎞️
Tube-MPPI
Robust Sampling Based Model Predictive Control with Sparse Objective Information, Williams G. et al. (2018). 🎞️
- Safe Learning in Robotics: From Learning-Based Control to Safe Reinforcement Learning, Brunke L. et al. (2021). :octocat:
Risk-Averse Control
- A Comprehensive Survey on Safe Reinforcement Learning, García J., Fernández F. (2015).
RA-QMDP
Risk-averse Behavior Planning for Autonomous Driving under Uncertainty, Naghshvar M. et al. (2018).
StoROO
X-Armed Bandits: Optimizing Quantiles and Other Risks, Torossian L., Garivier A., Picheny V. (2019).
- Worst Cases Policy Gradients, Tang Y. C. et al. (2019).
- Model-Free Risk-Sensitive Reinforcement Learning, Delétang G. et al. (2021).
- Optimal Thompson Sampling strategies for support-aware CVaR bandits, Baudry D., Gautron R., Kaufmann E., Maillard O. (2021).
Value-Constrained Control
ICS
Will the Driver Seat Ever Be Empty?, Fraichard T. (2014).
SafeOPT
Safe Controller Optimization for Quadrotors with Gaussian Processes, Berkenkamp F., Schoellig A., Krause A. (2015). 🎞️ :octocat:
SafeMDP
Safe Exploration in Finite Markov Decision Processes with Gaussian Processes, Turchetta M., Berkenkamp F., Krause A. (2016). :octocat:
RSS
On a Formal Model of Safe and Scalable Self-driving Cars, Shalev-Shwartz S. et al. (2017).
CPO
Constrained Policy Optimization, Achiam J., Held D., Tamar A., Abbeel P. (2017). :octocat:
RCPO
Reward Constrained Policy Optimization, Tessler C., Mankowitz D., Mannor S. (2018).
BFTQ
A Fitted-Q Algorithm for Budgeted MDPs, Carrara N. et al. (2018).
SafeMPC
Learning-based Model Predictive Control for Safe Exploration, Koller T., Berkenkamp F., Turchetta M., Krause A. (2018).
CCE
Constrained Cross-Entropy Method for Safe Reinforcement Learning, Wen M., Topcu U. (2018). :octocat:
LTL-RL
Reinforcement Learning with Probabilistic Guarantees for Autonomous Driving, Bouton M. et al. (2019).
- Safe Reinforcement Learning with Scene Decomposition for Navigating Complex Urban Environments, Bouton M. et al. (2019). :octocat:
- Batch Policy Learning under Constraints, Le H., Voloshin C., Yue Y. (2019).
- Value constrained model-free continuous control, Bohez S. et al. (2019). 🎞️
- Safely Learning to Control the Constrained Linear Quadratic Regulator, Dean S. et al. (2019).
- Learning to Walk in the Real World with Minimal Human Effort, Ha S. et al. (2020). 🎞️
- Responsive Safety in Reinforcement Learning by PID Lagrangian Methods, Stooke A., Achiam J., Abbeel P. (2020). :octocat:
Envelope MOQ-Learning
A Generalized Algorithm for Multi-Objective Reinforcement Learning and Policy Adaptation, Yang R. et al. (2019).
State-Constrained Control and Stability
Uncertain Dynamical Systems
Game Theory:
Sequential Learning:
Multi-Armed Bandit:
TS
On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples, Thompson W. (1933).
- Exploration and Exploitation in Organizational Learning, March J. (1991).
UCB1 / UCB2
Finite-time Analysis of the Multiarmed Bandit Problem, Auer P., Cesa-Bianchi N., Fischer P. (2002).
Empirical Bernstein / UCB-V
Exploration-exploitation tradeoff using variance estimates in multi-armed bandits, Audibert J-Y., Munos R., Szepesvari C. (2009).
- Empirical Bernstein Bounds and Sample Variance Penalization, Maurer A., Pontil M. (2009).
- An Empirical Evaluation of Thompson Sampling, Chapelle O., Li L. (2011).
kl-UCB
The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond, Garivier A., Cappé O. (2011).
KL-UCB
Kullback-Leibler Upper Confidence Bounds for Optimal Sequential Allocation, Cappé O. et al. (2013).
IDS
Information Directed Sampling and Bandits with Heteroscedastic Noise, Kirschner J., Krause A. (2018).
Contextual
Best Arm Identification:
Successive Elimination
Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems, Even-Dar E. et al. (2006).
LUCB
PAC Subset Selection in Stochastic Multi-armed Bandits, Kalyanakrishnan S. et al. (2012).
UGapE
Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence, Gabillon V., Ghavamzadeh M., Lazaric A. (2012).
Sequential Halving
Almost Optimal Exploration in Multi-Armed Bandits, Karnin Z. et al. (2013).
M-LUCB / M-Racing
Maximin Action Identification: A New Bandit Framework for Games, Garivier A., Kaufmann E., Koolen W. (2016).
Track-and-Stop
Optimal Best Arm Identification with Fixed Confidence, Garivier A., Kaufmann E. (2016).
LUCB-micro
Structured Best Arm Identification with Fixed Confidence, Huang R. et al. (2017).
Black-box Optimization:
GP-UCB
Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design, Srinivas N., Krause A., Kakade S., Seeger M. (2009).
HOO
X–Armed Bandits, Bubeck S., Munos R., Stoltz G., Szepesvari C. (2009).
DOO/SOO
Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness, Munos R. (2011).
StoOO
From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning, Munos R. (2014).
StoSOO
Stochastic Simultaneous Optimistic Optimization, Valko M., Carpentier A., Munos R. (2013).
POO
Black-box optimization of noisy functions with unknown smoothness, Grill J-B., Valko M., Munos R. (2015).
EI-GP
Bayesian Optimization in AlphaGo, Chen Y. et al. (2018).
Reinforcement Learning:
Theory:
- Expected mistake bound model for on-line reinforcement learning, Fiechter C-N. (1997).
UCRL2
Near-optimal Regret Bounds for Reinforcement Learning, Jaksch T. (2010).
PSRL
Why is Posterior Sampling Better than Optimism for Reinforcement Learning?, Osband I., Van Roy B. (2016).
UCBVI
Minimax Regret Bounds for Reinforcement Learning, Azar M., Osband I., Munos R. (2017).
Q-Learning-UCB
Is Q-Learning Provably Efficient?, Jin C., Allen-Zhu Z., Bubeck S., Jordan M. (2018).
LSVI-UCB
Provably Efficient Reinforcement Learning with Linear Function Approximation, Jin C., Yang Z., Wang Z., Jordan M. (2019).
- Lipschitz Continuity in Model-based Reinforcement Learning, Asadi K. et al. (2018).
- On Function Approximation in Reinforcement Learning: Optimism in the Face of Large State Spaces, Yang Z., Jin C., Wang Z., Wang M., Jordan M. (2021)
Generative Model
Policy Gradient
Linear Systems
- PAC Adaptive Control of Linear Systems, Fiechter C.-N. (1997)
OFU-LQ
Regret Bounds for the Adaptive Control of Linear Quadratic Systems, Abbasi-Yadkori Y., Szepesvari C. (2011).
TS-LQ
Improved Regret Bounds for Thompson Sampling in Linear Quadratic Control Problems, Abeille M., Lazaric A. (2018).
- Exploration-Exploitation with Thompson Sampling in Linear Systems, Abeille M. (2017). (phd thesis)
Coarse-Id
On the Sample Complexity of the Linear Quadratic Regulator, Dean S., Mania H., Matni N., Recht B., Tu S. (2017).
- Regret Bounds for Robust Adaptive Control of the Linear Quadratic Regulator, Dean S. et al. (2018).
- Robust exploration in linear quadratic reinforcement learning, Umenberger J. et al. (2019).
- Online Control with Adversarial Disturbances, Agarwal N. et al. (2019).
- Logarithmic Regret for Online Control, Agarwal N. et al. (2019).
Value-based:
Policy-based:
Policy gradient
Actor-critic
AC
Policy Gradient Methods for Reinforcement Learning with Function Approximation, Sutton R. et al. (1999).
NAC
Natural Actor-Critic, Peters J. et al. (2005).
DPG
Deterministic Policy Gradient Algorithms, Silver D. et al. (2014).
DDPG
Continuous Control With Deep Reinforcement Learning, Lillicrap T. et al. (2015). 🎞️ 1 | 2 | 3 | 4
MACE
Terrain-Adaptive Locomotion Skills Using Deep Reinforcement Learning, Peng X., Berseth G., van de Panne M. (2016). 🎞️ | 🎞️
A3C
Asynchronous Methods for Deep Reinforcement Learning, Mnih V. et al. (2016). 🎞️ 1 | 2 | 3
SAC
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, Haarnoja T. et al. (2018). 🎞️
MPO
Maximum a Posteriori Policy Optimisation, Abdolmaleki A. et al. (2018).
- A Deeper Look at Discounting Mismatch in Actor-Critic Algorithms, Zhang S., Laroche R. et al. (2020).
Derivative-free
Model-based:
Dyna
Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming, Sutton R. (1990).
PILCO
PILCO: A Model-Based and Data-Efficient Approach to Policy Search, Deisenroth M., Rasmussen C. (2011). (talk)
DBN
Probabilistic MDP-behavior planning for cars, Brechtel S. et al. (2011).
GPS
End-to-End Training of Deep Visuomotor Policies, Levine S. et al. (2015). 🎞️
DeepMPC
DeepMPC: Learning Deep Latent Features for Model Predictive Control, Lenz I. et al. (2015). 🎞️
SVG
Learning Continuous Control Policies by Stochastic Value Gradients, Heess N. et al. (2015). 🎞️
FARNN
Nonlinear Systems Identification Using Deep Dynamic Neural Networks, Ogunmolu O. et al. (2016). :octocat:
- Optimal control with learned local models: Application to dexterous manipulation, Kumar V. et al. (2016). 🎞️
BPTT
Long-term Planning by Short-term Prediction, Shalev-Shwartz S. et al. (2016). 🎞️ 1 | 2
- Deep visual foresight for planning robot motion, Finn C., Levine S. (2016). 🎞️
VIN
Value Iteration Networks, Tamar A. et al. (2016). 🎞️
VPN
Value Prediction Network, Oh J. et al. (2017).
DistGBP
Model-Based Planning with Discrete and Continuous Actions, Henaff M. et al. (2017). 🎞️ 1 | 2
- Prediction and Control with Temporal Segment Models, Mishra N. et al. (2017).
Predictron
The Predictron: End-To-End Learning and Planning, Silver D. et al. (2017). 🎞️
MPPI
Information Theoretic MPC for Model-Based Reinforcement Learning, Williams G. et al. (2017). :octocat: 🎞️
- Learning Real-World Robot Policies by Dreaming, Piergiovanni A. et al. (2018).
- Coupled Longitudinal and Lateral Control of a Vehicle using Deep Learning, Devineau G., Polack P., Altché F., Moutarde F. (2018). 🎞️
PlaNet
Learning Latent Dynamics for Planning from Pixels, Hafner D. et al. (2018). 🎞️
NeuralLander
Neural Lander: Stable Drone Landing Control using Learned Dynamics, Shi G. et al. (2018). 🎞️
DBN+POMCP
Towards Human-Like Prediction and Decision-Making for Automated Vehicles in Highway Scenarios, Sierra Gonzalez D. (2019).
- Planning with Goal-Conditioned Policies, Nasiriany S. et al. (2019). 🎞️
MuZero
Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model, Schrittwieser J. et al. (2019). :octocat:
BADGR
BADGR: An Autonomous Self-Supervised Learning-Based Navigation System, Kahn G., Abbeel P., Levine S. (2020). 🎞️ :octocat:
H-UCRL
Efficient Model-Based Reinforcement Learning through Optimistic Policy Search and Planning, Curi S., Berkenkamp F., Krause A. (2020). :octocat:
Exploration:
- Combating Reinforcement Learning’s Sisyphean Curse with Intrinsic Fear, Lipton Z. et al. (2016).
Pseudo-count
Unifying Count-Based Exploration and Intrinsic Motivation, Bellemare M. et al. (2016). 🎞️
HER
Hindsight Experience Replay, Andrychowicz M. et al. (2017). 🎞️
VHER
Visual Hindsight Experience Replay, Sahni H. et al. (2019).
RND
Exploration by Random Network Distillation, Burda Y. et al. (OpenAI) (2018). 🎞️
Go-Explore
Go-Explore: a New Approach for Hard-Exploration Problems, Ecoffet A. et al. (Uber) (2018). 🎞️
C51-IDS
Information-Directed Exploration for Deep Reinforcement Learning, Nikolov N., Kirschner J., Berkenkamp F., Krause A. (2019). :octocat:
Plan2Explore
Planning to Explore via Self-Supervised World Models, Sekar R. et al. (2020). 🎞️ :octocat:
RIDE
RIDE: Rewarding Impact-Driven Exploration for Procedurally-Generated Environments, Raileanu R., Rocktäschel T. (2020). :octocat:
Hierarchy and Temporal Abstraction:
- Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning, Sutton R. et al. (1999).
- Intrinsically motivated learning of hierarchical collections of skills, Barto A. et al. (2004).
OC
The Option-Critic Architecture, Bacon P-L., Harb J., Precup D. (2016).
- Learning and Transfer of Modulated Locomotor Controllers, Heess N. et al. (2016). 🎞️
- Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving, Shalev-Shwartz S. et al. (2016).
FuNs
FeUdal Networks for Hierarchical Reinforcement Learning, Vezhnevets A. et al. (2017).
- Combining Neural Networks and Tree Search for Task and Motion Planning in Challenging Environments, Paxton C. et al. (2017). 🎞️
DeepLoco
DeepLoco: Dynamic Locomotion Skills Using Hierarchical Deep Reinforcement Learning, Peng X. et al. (2017). 🎞️ | 🎞️
- Hierarchical Policy Design for Sample-Efficient Learning of Robot Table Tennis Through Self-Play, Mahjourian R. et al. (2018). 🎞️
DAC
DAC: The Double Actor-Critic Architecture for Learning Options, Zhang S., Whiteson S. (2019).
- Multi-Agent Manipulation via Locomotion using Hierarchical Sim2Real, Nachum O. et al. (2019). 🎞️
- SoftCon: Simulation and Control of Soft-Bodied Animals with Biomimetic Actuators, Min S. et al. (2020). 🎞️ :octocat:
H-REIL
Reinforcement Learning based Control of Imitative Policies for Near-Accident Driving, Cao Z. et al. (2020). 🎞️ 1, 2
Partial Observability:
PBVI
Point-based Value Iteration: An anytime algorithm for POMDPs, Pineau J. et al. (2003).
cPBVI
Point-Based Value Iteration for Continuous POMDPs, Porta J. et al. (2006).
POMCP
Monte-Carlo Planning in Large POMDPs, Silver D., Veness J. (2010).
- A POMDP Approach to Robot Motion Planning under Uncertainty, Du Y. et al. (2010).
- Probabilistic Online POMDP Decision Making for Lane Changes in Fully Automated Driving, Ulbrich S., Maurer M. (2013).
- Solving Continuous POMDPs: Value Iteration with Incremental Learning of an Efficient Space Representation, Brechtel S. et al. (2013).
- Probabilistic Decision-Making under Uncertainty for Autonomous Driving using Continuous POMDPs, Brechtel S. et al. (2014).
MOMDP
Intention-Aware Motion Planning, Bandyopadhyay T. et al. (2013).
DNC
Hybrid computing using a neural network with dynamic external memory, Graves A. et al. (2016). 🎞️
- The value of inferring the internal state of traffic participants for autonomous freeway driving, Sunberg Z. et al. (2017).
- Belief State Planning for Autonomously Navigating Urban Intersections, Bouton M., Cosgun A., Kochenderfer M. (2017).
- Scalable Decision Making with Sensor Occlusions for Autonomous Driving, Bouton M. et al. (2018).
- Probabilistic Decision-Making at Road Intersections: Formulation and Quantitative Evaluation, Barbier M., Laugier C., Simonin O., Ibanez J. (2018).
- Beauty and the Beast: Optimal Methods Meet Learning for Drone Racing, Kaufmann E. et al. (2018). 🎞️
social perception
Behavior Planning of Autonomous Cars with Social Perception, Sun L. et al. (2019).
Transfer:
IT&E
Robots that can adapt like animals, Cully A., Clune J., Tarapore D., Mouret J-B. (2014). 🎞️
MAML
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, Finn C., Abbeel P., Levine S. (2017). 🎞️
- Virtual to Real Reinforcement Learning for Autonomous Driving, Pan X. et al. (2017). 🎞️
- Sim-to-Real: Learning Agile Locomotion For Quadruped Robots, Tan J. et al. (2018). 🎞️
ME-TRPO
Model-Ensemble Trust-Region Policy Optimization, Kurutach T. et al. (2018). 🎞️
- Kickstarting Deep Reinforcement Learning, Schmitt S. et al. (2018).
- Learning Dexterous In-Hand Manipulation, OpenAI (2018). 🎞️
GrBAL / ReBAL
Learning to Adapt in Dynamic, Real-World Environments Through Meta-Reinforcement Learning, Nagabandi A. et al. (2018). 🎞️
- Learning agile and dynamic motor skills for legged robots, Hwangbo J. et al. (ETH Zurich / Intel ISL) (2019). 🎞️
- Robust Recovery Controller for a Quadrupedal Robot using Deep Reinforcement Learning, Lee J., Hwangbo J., Hutter M. (ETH Zurich RSL) (2019)
IT&E
Learning and adapting quadruped gaits with the “Intelligent Trial & Error” algorithm, Dalin E., Desreumaux P., Mouret J-B. (2019). 🎞️
FAMLE
Fast Online Adaptation in Robotics through Meta-Learning Embeddings of Simulated Priors, Kaushik R., Anne T., Mouret J-B. (2020). 🎞️
- Robust Deep Reinforcement Learning against Adversarial Perturbations on Observations, Zhang H. et al. (2020). :octocat:
- Learning quadrupedal locomotion over challenging terrain, Lee J. et al. (2020). 🎞️
PACOH
PACOH: Bayes-Optimal Meta-Learning with PAC-Guarantees, Rothfuss J., Fortuin V., Josifoski M., Krause A. (2021).
- Model-Based Domain Generalization, Robey A. et al. (2021).
SimGAN
SimGAN: Hybrid Simulator Identification for Domain Adaptation via Adversarial Reinforcement Learning, Jiang Y. et al. (2021). 🎞️ :octocat:
- Learning robust perceptive locomotion for quadrupedal robots in the wild, Miki T. et al. (2022).
Multi-agent:
Minimax-Q
Markov games as a framework for multi-agent reinforcement learning, Littman M. (1994).
- Autonomous Agents Modelling Other Agents: A Comprehensive Survey and Open Problems, Albrecht S., Stone P. (2017).
MILP
Time-optimal coordination of mobile robots along specified paths, Altché F. et al. (2016). 🎞️
MIQP
An Algorithm for Supervised Driving of Cooperative Semi-Autonomous Vehicles, Altché F. et al. (2017). 🎞️
SA-CADRL
Socially Aware Motion Planning with Deep Reinforcement Learning, Chen Y. et al. (2017). 🎞️
- Multipolicy decision-making for autonomous driving via changepoint-based behavior prediction: Theory and experiment, Galceran E. et al. (2017).
- Online decision-making for scalable autonomous systems, Wray K. et al. (2017).
MAgent
MAgent: A Many-Agent Reinforcement Learning Platform for Artificial Collective Intelligence, Zheng L. et al. (2017). 🎞️
- Cooperative Motion Planning for Non-Holonomic Agents with Value Iteration Networks, Rehder E. et al. (2017).
MPPO
Towards Optimally Decentralized Multi-Robot Collision Avoidance via Deep Reinforcement Learning, Long P. et al. (2017). 🎞️
COMA
Counterfactual Multi-Agent Policy Gradients, Foerster J. et al. (2017).
MADDPG
Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments, Lowe R. et al. (2017). :octocat:
FTW
Human-level performance in first-person multiplayer games with population-based deep reinforcement learning, Jaderberg M. et al. (2018). 🎞️
- Towards Learning Multi-agent Negotiations via Self-Play, Tang Y. C. (2020).
MAPPO
The Surprising Effectiveness of MAPPO in Cooperative, Multi-Agent Games, Yu C. et al. (2021). :octocat: (https://github.com/marlbenchmark/on-policy)
- Many-agent Reinforcement Learning, Yang Y. (2021).
Representation Learning
- Variable Resolution Discretization in Optimal Control, Munos R., Moore A. (2002). 🎞️
DeepDriving
DeepDriving: Learning Affordance for Direct Perception in Autonomous Driving, Chen C. et al. (2015). 🎞️
- On the Sample Complexity of End-to-end Training vs. Semantic Abstraction Training, Shalev-Shwartz S. et al. (2016).
- Learning sparse representations in reinforcement learning with sparse coding, Le L., Kumaraswamy M., White M. (2017).
- World Models, Ha D., Schmidhuber J. (2018). 🎞️ :octocat:
- Learning to Drive in a Day, Kendall A. et al. (2018). 🎞️
MERLIN
Unsupervised Predictive Memory in a Goal-Directed Agent, Wayne G. et al. (2018). 🎞️ 1 | 2 | 3 | 4 | 5 | 6
- Variational End-to-End Navigation and Localization, Amini A. et al. (2018). 🎞️
- Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks, Lee M. et al. (2018). 🎞️
- Deep Neuroevolution of Recurrent and Discrete World Models, Risi S., Stanley K.O. (2019). 🎞️ :octocat:
FERM
A Framework for Efficient Robotic Manipulation, Zhan A., Zhao R. et al. (2021). :octocat:
S4RL
S4RL: Surprisingly Simple Self-Supervision for Offline Reinforcement Learning, Sinha S. et al. (2021).
Offline
Other
- Is the Bellman residual a bad proxy?, Geist M., Piot B., Pietquin O. (2016).
- Deep Reinforcement Learning that Matters, Henderson P. et al. (2017).
- Automatic Bridge Bidding Using Deep Reinforcement Learning, Yeh C. and Lin H. (2016).
- Shared Autonomy via Deep Reinforcement Learning, Reddy S. et al. (2018). 🎞️
- Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review, Levine S. (2018).
- The Value Function Polytope in Reinforcement Learning, Dadashi R. et al. (2019).
- On Value Functions and the Agent-Environment Boundary, Jiang N. (2019).
- How to Train Your Robot with Deep Reinforcement Learning; Lessons We’ve Learned, Ibarz J. et al. (2021).
Learning from Demonstrations:
Imitation Learning
DAgger
A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning, Ross S., Gordon G., Bagnell J. A. (2011).
QMDP-RCNN
Reinforcement Learning via Recurrent Convolutional Neural Networks, Shankar T. et al. (2016). (talk)
DQfD
Learning from Demonstrations for Real World Reinforcement Learning, Hester T. et al. (2017). 🎞️
- Find Your Own Way: Weakly-Supervised Segmentation of Path Proposals for Urban Autonomy, Barnes D., Maddern W., Posner I. (2016). 🎞️
GAIL
Generative Adversarial Imitation Learning, Ho J., Ermon S. (2016).
- From perception to decision: A data-driven approach to end-to-end motion planning for autonomous ground robots, Pfeiffer M. et al. (2017). 🎞️
Branched
End-to-end Driving via Conditional Imitation Learning, Codevilla F. et al. (2017). 🎞️ | talk
UPN
Universal Planning Networks, Srinivas A. et al. (2018). 🎞️
DeepMimic
DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills, Peng X. B. et al. (2018). 🎞️
R2P2
Deep Imitative Models for Flexible Inference, Planning, and Control, Rhinehart N. et al. (2018). 🎞️
- Learning Agile Robotic Locomotion Skills by Imitating Animals, Peng X. B. et al. (2020). 🎞️
- Deep Imitative Models for Flexible Inference, Planning, and Control, Rhinehart N., McAllister R., Levine S. (2020).
Applications to Autonomous Driving:
- ALVINN, an autonomous land vehicle in a neural network, Pomerleau D. (1989).
- End to End Learning for Self-Driving Cars, Bojarski M. et al. (2016). 🎞️
- End-to-end Learning of Driving Models from Large-scale Video Datasets, Xu H., Gao Y. et al. (2016). 🎞️
- End-to-End Deep Learning for Steering Autonomous Vehicles Considering Temporal Dependencies, Eraqi H. et al. (2017).
- Driving Like a Human: Imitation Learning for Path Planning using Convolutional Neural Networks, Rehder E. et al. (2017).
- Imitating Driver Behavior with Generative Adversarial Networks, Kuefler A. et al. (2017).
PS-GAIL
Multi-Agent Imitation Learning for Driving Simulation, Bhattacharyya R. et al. (2018). 🎞️ :octocat:
- Deep Imitation Learning for Autonomous Driving in Generic Urban Scenarios with Enhanced Safety, Chen J. et al. (2019).
Inverse Reinforcement Learning
Projection
Apprenticeship learning via inverse reinforcement learning, Abbeel P., Ng A. (2004).
MMP
Maximum margin planning, Ratliff N. et al. (2006).
BIRL
Bayesian inverse reinforcement learning, Ramachandran D., Amir E. (2007).
MEIRL
Maximum Entropy Inverse Reinforcement Learning, Ziebart B. et al. (2008).
LEARCH
Learning to search: Functional gradient techniques for imitation learning, Ratliff N., Silver D., Bagnell A. (2009).
CIOC
Continuous Inverse Optimal Control with Locally Optimal Examples, Levine S., Koltun V. (2012). 🎞️
MEDIRL
Maximum Entropy Deep Inverse Reinforcement Learning, Wulfmeier M. (2015).
GCL
Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization, Finn C. et al. (2016). 🎞️
RIRL
Repeated Inverse Reinforcement Learning, Amin K. et al. (2017).
- Bridging the Gap Between Imitation Learning and Inverse Reinforcement Learning, Piot B. et al. (2017).
Applications to Autonomous Driving:
- Apprenticeship Learning for Motion Planning, with Application to Parking Lot Navigation, Abbeel P. et al. (2008).
- Navigate like a cabbie: Probabilistic reasoning from observed context-aware behavior, Ziebart B. et al. (2008).
- Planning-based Prediction for Pedestrians, Ziebart B. et al. (2009). 🎞️
- Learning for autonomous navigation, Bagnell A. et al. (2010).
- Learning Autonomous Driving Styles and Maneuvers from Expert Demonstration, Silver D. et al. (2012).
- Learning Driving Styles for Autonomous Vehicles from Demonstration, Kuderer M. et al. (2015).
- Learning to Drive using Inverse Reinforcement Learning and Deep Q-Networks, Sharifzadeh S. et al. (2016).
- Watch This: Scalable Cost-Function Learning for Path Planning in Urban Environments, Wulfmeier M. (2016). 🎞️
- Planning for Autonomous Cars that Leverage Effects on Human Actions, Sadigh D. et al. (2016).
- A Learning-Based Framework for Handling Dilemmas in Urban Automated Driving, Lee S., Seo S. (2017).
- Learning Trajectory Prediction with Continuous Inverse Optimal Control via Langevin Sampling of Energy-Based Models, Xu Y. et al. (2019).
- Analyzing the Suitability of Cost Functions for Explaining and Imitating Human Driving Behavior based on Inverse Reinforcement Learning, Naumann M. et al. (2020).
Motion Planning:
Search
Dijkstra
A Note on Two Problems in Connexion with Graphs, Dijkstra E. W. (1959).
A*
A Formal Basis for the Heuristic Determination of Minimum Cost Paths, Hart P. et al. (1968).
- Planning Long Dynamically-Feasible Maneuvers For Autonomous Vehicles, Likhachev M., Ferguson D. (2008).
- Optimal Trajectory Generation for Dynamic Street Scenarios in a Frenet Frame, Werling M., Kammel S. (2010). 🎞️
- 3D perception and planning for self-driving and cooperative automobiles, Stiller C., Ziegler J. (2012).
- Motion Planning under Uncertainty for On-Road Autonomous Driving, Xu W. et al. (2014).
- Monte Carlo Tree Search for Simulated Car Racing, Fischer J. et al. (2015). 🎞️
Sampling
Optimization
Reactive
Architecture and applications