1 Sequential Learning
This section covers sequential learning approaches including multi-armed bandits, contextual bandits, best arm identification, and black-box optimization.
- Prediction, Learning, and Games, Cesa-Bianchi N., Lugosi G. (2006).
1.1 Multi-Armed Bandit
TS
On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples, Thompson W. (1933).
- Exploration and Exploitation in Organizational Learning, March J. (1991).
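As a rough illustration of the Thompson sampling idea referenced above, here is a minimal Bernoulli sketch with Beta(1, 1) priors; the `reward_fn`, arm means, and horizon are illustrative placeholders, not anything prescribed by the papers.

```python
import numpy as np

def thompson_sampling_bernoulli(reward_fn, n_arms, horizon, rng=None):
    """Bernoulli Thompson sampling with Beta(1, 1) priors on each arm's mean."""
    rng = np.random.default_rng() if rng is None else rng
    alpha = np.ones(n_arms)   # posterior successes + 1
    beta = np.ones(n_arms)    # posterior failures + 1
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)   # one posterior sample per arm
        arm = int(np.argmax(theta))     # play the arm that looks best under the sample
        reward = reward_fn(arm)         # 0/1 reward from the environment
        alpha[arm] += reward
        beta[arm] += 1 - reward
    return alpha, beta

# Example: three arms with unknown means 0.2, 0.5, 0.7 (illustrative values).
rng = np.random.default_rng(0)
means = [0.2, 0.5, 0.7]
alpha, beta = thompson_sampling_bernoulli(lambda a: rng.binomial(1, means[a]), 3, 2000, rng)
```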
UCB1 / UCB2
Finite-time Analysis of the Multiarmed Bandit Problem, Auer P., Cesa-Bianchi N., Fischer P. (2002).
Empirical Bernstein / UCB-V
Exploration-exploitation tradeoff using variance estimates in multi-armed bandits, Audibert J-Y., Munos R., Szepesvari C. (2009).
- Empirical Bernstein Bounds and Sample Variance Penalization, Maurer A., Pontil M. (2009).
- An Empirical Evaluation of Thompson Sampling, Chapelle O., Li L. (2011).
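A minimal sketch of the UCB1 index from the Auer et al. (2002) reference above, assuming rewards lie in [0, 1]; the `reward_fn` callback and the play-each-arm-once initialisation are illustrative choices.

```python
import math

def ucb1(reward_fn, n_arms, horizon):
    """UCB1: after playing each arm once, pull the arm maximising the
    empirical mean plus the exploration bonus sqrt(2 ln t / n_i)."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1   # initialisation: play every arm once
        else:
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        r = reward_fn(arm)   # reward assumed to be in [0, 1]
        counts[arm] += 1
        sums[arm] += r
    return [s / c for s, c in zip(sums, counts)]
```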
kl-UCB
The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond, Garivier A., Cappé O. (2011).
KL-UCB
Kullback-Leibler Upper Confidence Bounds for Optimal Sequential Allocation, Cappé O. et al. (2013).
IDS
Information Directed Sampling and Bandits with Heteroscedastic Noise, Kirschner J., Krause A. (2018).
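For the kl-UCB entries above, a small sketch of the Bernoulli kl-UCB index, computed by bisection on the binary KL divergence; the helper names, tolerance, and the configurable constant `c` are illustrative simplifications.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """Binary KL divergence kl(p, q), clipped away from 0 and 1 for stability."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, count, t, c=0.0, tol=1e-6):
    """Largest q >= mean such that count * kl(mean, q) <= log t + c * log log t."""
    level = (math.log(t) + c * math.log(max(math.log(t), 1.0))) / count
    lo, hi = mean, 1.0
    while hi - lo > tol:   # bisection works because kl(mean, .) is increasing on [mean, 1]
        mid = (lo + hi) / 2.0
        if kl_bernoulli(mean, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo
```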
1.1.1 Contextual Bandits
LinUCB
A Contextual-Bandit Approach to Personalized News Article Recommendation, Li L. et al. (2010).
OFUL
Improved Algorithms for Linear Stochastic Bandits, Abbasi-Yadkori Y., Pál D., Szepesvári C. (2011).
- Contextual Bandits with Linear Payoff Functions, Chu W. et al. (2011).
- Self-normalization techniques for streaming confident regression, Maillard O.-A. (2017).
- Learning from Delayed Outcomes via Proxies with Applications to Recommender Systems, Mann T. et al. (2018). (prediction setting)
- Weighted Linear Bandits for Non-Stationary Environments, Russac Y. et al. (2019).
- Linear bandits with Stochastic Delayed Feedback, Vernade C. et al. (2020).
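A minimal sketch of disjoint LinUCB (one ridge-regression model per arm), assuming contexts and rewards come from caller-supplied `get_contexts` and `reward_fn` callbacks; the explicit matrix inversion is kept for clarity rather than efficiency.

```python
import numpy as np

def linucb(get_contexts, reward_fn, n_arms, dim, horizon, alpha=1.0, lam=1.0):
    """Disjoint LinUCB: score each arm by x^T theta_hat + alpha * sqrt(x^T A^{-1} x),
    i.e. the ridge-regression mean plus an exploration width."""
    A = [lam * np.eye(dim) for _ in range(n_arms)]   # regularised design matrices
    b = [np.zeros(dim) for _ in range(n_arms)]
    for _ in range(horizon):
        x = get_contexts()                           # array of shape (n_arms, dim)
        scores = np.empty(n_arms)
        for a in range(n_arms):
            A_inv = np.linalg.inv(A[a])
            theta = A_inv @ b[a]
            scores[a] = x[a] @ theta + alpha * np.sqrt(x[a] @ A_inv @ x[a])
        arm = int(np.argmax(scores))
        r = reward_fn(arm, x[arm])
        A[arm] += np.outer(x[arm], x[arm])           # rank-one update of the design matrix
        b[arm] += r * x[arm]
    return A, b
```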
1.2 Best Arm Identification
Successive Elimination
Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems, Even-Dar E. et al. (2006).
LUCB
PAC Subset Selection in Stochastic Multi-armed Bandits, Kalyanakrishnan S. et al. (2012).
UGapE
Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence, Gabillon V., Ghavamzadeh M., Lazaric A. (2012).
Sequential Halving
Almost Optimal Exploration in Multi-Armed Bandits, Karnin Z. et al. (2013).
M-LUCB / M-Racing
Maximin Action Identification: A New Bandit Framework for Games, Garivier A., Kaufmann E., Koolen W. (2016).
Track-and-Stop
Optimal Best Arm Identification with Fixed Confidence, Garivier A., Kaufmann E. (2016).
LUCB-micro
Structured Best Arm Identification with Fixed Confidence, Huang R. et al. (2017).
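A short fixed-budget sketch of Sequential Halving from the entries above: spread the budget over roughly log2(K) elimination rounds, sample each surviving arm equally, and discard the empirically worse half; the budget arithmetic here is a simplification.

```python
import math
import numpy as np

def sequential_halving(reward_fn, n_arms, budget):
    """Sequential Halving: repeated equal sampling and elimination of the worse half."""
    arms = list(range(n_arms))
    n_rounds = max(1, math.ceil(math.log2(n_arms)))
    for _ in range(n_rounds):
        if len(arms) == 1:
            break
        pulls = max(1, budget // (len(arms) * n_rounds))   # per-arm budget this round
        means = [np.mean([reward_fn(a) for _ in range(pulls)]) for a in arms]
        keep = np.argsort(means)[::-1][: max(1, len(arms) // 2)]   # best arms first
        arms = [arms[i] for i in keep]
    return arms[0]   # recommended arm after the budget is spent
```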
1.3 Black-box Optimization
GP-UCB
Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design, Srinivas N., Krause A., Kakade S., Seeger M. (2009).
HOO
X-Armed Bandits, Bubeck S., Munos R., Stoltz G., Szepesvari C. (2009).
DOO/SOO
Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness, Munos R. (2011).
StoOO
From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning, Munos R. (2014).
StoSOO
Stochastic Simultaneous Optimistic Optimization, Valko M., Carpentier A., Munos R. (2013).
POO
Black-box optimization of noisy functions with unknown smoothness, Grill J-B., Valko M., Munos R. (2015).
EI-GP
Bayesian Optimization in AlphaGo, Chen Y. et al. (2018).
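A rough GP-UCB sketch over a finite candidate grid, using scikit-learn's Gaussian process regressor purely for convenience (an assumption, not what the papers above use); the kernel, `beta`, grid, and noisy test function are all illustrative.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def gp_ucb(objective, candidates, n_init=3, n_iter=20, beta=2.0, seed=0):
    """GP-UCB on a finite set: fit a GP on past evaluations, then query the
    candidate maximising posterior mean + sqrt(beta) * posterior std."""
    rng = np.random.default_rng(seed)
    X = list(rng.choice(candidates, size=n_init))   # random initial design
    y = [objective(x) for x in X]
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-3)
        gp.fit(np.array(X).reshape(-1, 1), np.array(y))
        mu, sigma = gp.predict(np.array(candidates).reshape(-1, 1), return_std=True)
        x_next = candidates[int(np.argmax(mu + np.sqrt(beta) * sigma))]
        X.append(x_next)
        y.append(objective(x_next))
    return X[int(np.argmax(y))], max(y)

# Example on a noisy 1-D test function over a grid of candidate points.
grid = np.linspace(0.0, 2 * np.pi, 200)
best_x, best_y = gp_ucb(lambda x: np.sin(x) + 0.1 * np.random.randn(), grid)
```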