1 Sequential Learning
This section covers sequential learning approaches including multi-armed bandits, contextual bandits, best arm identification, and black-box optimization.
- Prediction, Learning, and Games, Cesa-Bianchi N., Lugosi G. (2006).
1.1 Multi-Armed Bandit
TS
On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples, Thompson W. (1933).
- Exploration and Exploitation in Organizational Learning, March J. (1991).
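As a rough illustration of the Thompson sampling idea referenced above, here is a minimal Bernoulli sketch with Beta(1, 1) priors; the `reward_fn`, arm means, and horizon are illustrative placeholders, not anything prescribed by the papers.

```python
import numpy as np

def thompson_sampling_bernoulli(reward_fn, n_arms, horizon, rng=None):
    """Bernoulli Thompson sampling with Beta(1, 1) priors on each arm's mean."""
    rng = np.random.default_rng() if rng is None else rng
    alpha = np.ones(n_arms)   # posterior successes + 1
    beta = np.ones(n_arms)    # posterior failures + 1
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)   # one posterior sample per arm
        arm = int(np.argmax(theta))     # play the arm that looks best under the sample
        reward = reward_fn(arm)         # 0/1 reward from the environment
        alpha[arm] += reward
        beta[arm] += 1 - reward
    return alpha, beta

# Example: three arms with unknown means 0.2, 0.5, 0.7 (illustrative values).
rng = np.random.default_rng(0)
means = [0.2, 0.5, 0.7]
alpha, beta = thompson_sampling_bernoulli(lambda a: rng.binomial(1, means[a]), 3, 2000, rng)
```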
UCB1 / UCB2
Finite-time Analysis of the Multiarmed Bandit Problem, Auer P., Cesa-Bianchi N., Fischer P. (2002).
Empirical Bernstein / UCB-V
Exploration-exploitation tradeoff using variance estimates in multi-armed bandits, Audibert J-Y., Munos R., Szepesvari C. (2009).
- Empirical Bernstein Bounds and Sample Variance Penalization, Maurer A., Pontil M. (2009).
- An Empirical Evaluation of Thompson Sampling, Chapelle O., Li L. (2011).
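A minimal sketch of the UCB1 index from the Auer et al. (2002) reference above, assuming rewards lie in [0, 1]; the `reward_fn` callback and the play-each-arm-once initialisation are illustrative choices.

```python
import math

def ucb1(reward_fn, n_arms, horizon):
    """UCB1: after playing each arm once, pull the arm maximising the
    empirical mean plus the exploration bonus sqrt(2 ln t / n_i)."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1   # initialisation: play every arm once
        else:
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        r = reward_fn(arm)   # reward assumed to be in [0, 1]
        counts[arm] += 1
        sums[arm] += r
    return [s / c for s, c in zip(sums, counts)]
```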
kl-UCB
The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond, Garivier A., Cappé O. (2011).
KL-UCB
Kullback-Leibler Upper Confidence Bounds for Optimal Sequential Allocation, Cappé O. et al. (2013).
IDS
Information Directed Sampling and Bandits with Heteroscedastic Noise, Kirschner J., Krause A. (2018).
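For the kl-UCB entries above, a small sketch of the Bernoulli kl-UCB index, computed by bisection on the binary KL divergence; the helper names, tolerance, and the configurable constant `c` are illustrative simplifications.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """Binary KL divergence kl(p, q), clipped away from 0 and 1 for stability."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, count, t, c=0.0, tol=1e-6):
    """Largest q >= mean such that count * kl(mean, q) <= log t + c * log log t."""
    level = (math.log(t) + c * math.log(max(math.log(t), 1.0))) / count
    lo, hi = mean, 1.0
    while hi - lo > tol:   # bisection works because kl(mean, .) is increasing on [mean, 1]
        mid = (lo + hi) / 2.0
        if kl_bernoulli(mean, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo
```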
1.1.1 Contextual Bandits
LinUCB
A Contextual-Bandit Approach to Personalized News Article Recommendation, Li L. et al. (2010).
OFUL
Improved Algorithms for Linear Stochastic Bandits, Abbasi-Yadkori Y., Pál D., Szepesvári C. (2011).
- Contextual Bandits with Linear Payoff Functions, Chu W. et al. (2011).
- Self-normalization techniques for streaming confident regression, Maillard O.-A. (2017).
- Learning from Delayed Outcomes via Proxies with Applications to Recommender Systems, Mann T. et al. (2018). (prediction setting)
- Weighted Linear Bandits for Non-Stationary Environments, Russac Y. et al. (2019).
- Linear bandits with Stochastic Delayed Feedback, Vernade C. et al. (2020).
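A minimal sketch of disjoint LinUCB (one ridge-regression model per arm), assuming contexts and rewards come from caller-supplied `get_contexts` and `reward_fn` callbacks; the explicit matrix inversion is kept for clarity rather than efficiency.

```python
import numpy as np

def linucb(get_contexts, reward_fn, n_arms, dim, horizon, alpha=1.0, lam=1.0):
    """Disjoint LinUCB: score each arm by x^T theta_hat + alpha * sqrt(x^T A^{-1} x),
    i.e. the ridge-regression mean plus an exploration width."""
    A = [lam * np.eye(dim) for _ in range(n_arms)]   # regularised design matrices
    b = [np.zeros(dim) for _ in range(n_arms)]
    for _ in range(horizon):
        x = get_contexts()                           # array of shape (n_arms, dim)
        scores = np.empty(n_arms)
        for a in range(n_arms):
            A_inv = np.linalg.inv(A[a])
            theta = A_inv @ b[a]
            scores[a] = x[a] @ theta + alpha * np.sqrt(x[a] @ A_inv @ x[a])
        arm = int(np.argmax(scores))
        r = reward_fn(arm, x[arm])
        A[arm] += np.outer(x[arm], x[arm])           # rank-one update of the design matrix
        b[arm] += r * x[arm]
    return A, b
```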
1.2 Best Arm Identification
Successive Elimination
Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems, Even-Dar E. et al. (2006).
LUCB
PAC Subset Selection in Stochastic Multi-armed Bandits, Kalyanakrishnan S. et al. (2012).
UGapE
Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence, Gabillon V., Ghavamzadeh M., Lazaric A. (2012).
Sequential Halving
Almost Optimal Exploration in Multi-Armed Bandits, Karnin Z. et al. (2013).
M-LUCB / M-Racing
Maximin Action Identification: A New Bandit Framework for Games, Garivier A., Kaufmann E., Koolen W. (2016).
Track-and-Stop
Optimal Best Arm Identification with Fixed Confidence, Garivier A., Kaufmann E. (2016).
LUCB-micro
Structured Best Arm Identification with Fixed Confidence, Huang R. et al. (2017).
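A short fixed-budget sketch of Sequential Halving from the entries above: spread the budget over roughly log2(K) elimination rounds, sample each surviving arm equally, and discard the empirically worse half; the budget arithmetic here is a simplification.

```python
import math
import numpy as np

def sequential_halving(reward_fn, n_arms, budget):
    """Sequential Halving: repeated equal sampling and elimination of the worse half."""
    arms = list(range(n_arms))
    n_rounds = max(1, math.ceil(math.log2(n_arms)))
    for _ in range(n_rounds):
        if len(arms) == 1:
            break
        pulls = max(1, budget // (len(arms) * n_rounds))   # per-arm budget this round
        means = [np.mean([reward_fn(a) for _ in range(pulls)]) for a in arms]
        keep = np.argsort(means)[::-1][: max(1, len(arms) // 2)]   # best arms first
        arms = [arms[i] for i in keep]
    return arms[0]   # recommended arm after the budget is spent
```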
1.3 Black-box Optimization
GP-UCB
Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design, Srinivas N., Krause A., Kakade S., Seeger M. (2009).
HOO
X-Armed Bandits, Bubeck S., Munos R., Stoltz G., Szepesvari C. (2009).
DOO/SOO
Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness, Munos R. (2011).
StoOO
From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning, Munos R. (2014).
StoSOO
Stochastic Simultaneous Optimistic Optimization, Valko M., Carpentier A., Munos R. (2013).
POO
Black-box optimization of noisy functions with unknown smoothness, Grill J-B., Valko M., Munos R. (2015).
EI-GP
Bayesian Optimization in AlphaGo, Chen Y. et al. (2018).
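A rough GP-UCB sketch over a finite candidate grid, using scikit-learn's Gaussian process regressor purely for convenience (an assumption, not what the papers above use); the kernel, `beta`, grid, and noisy test function are all illustrative.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def gp_ucb(objective, candidates, n_init=3, n_iter=20, beta=2.0, seed=0):
    """GP-UCB on a finite set: fit a GP on past evaluations, then query the
    candidate maximising posterior mean + sqrt(beta) * posterior std."""
    rng = np.random.default_rng(seed)
    X = list(rng.choice(candidates, size=n_init))   # random initial design
    y = [objective(x) for x in X]
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-3)
        gp.fit(np.array(X).reshape(-1, 1), np.array(y))
        mu, sigma = gp.predict(np.array(candidates).reshape(-1, 1), return_std=True)
        x_next = candidates[int(np.argmax(mu + np.sqrt(beta) * sigma))]
        X.append(x_next)
        y.append(objective(x_next))
    return X[int(np.argmax(y))], max(y)

# Example on a noisy 1-D test function over a grid of candidate points.
grid = np.linspace(0.0, 2 * np.pi, 200)
best_x, best_y = gp_ucb(lambda x: np.sin(x) + 0.1 * np.random.randn(), grid)
```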