Bubeck & Cesa-Bianchi - 2012 - Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

Chap 1 - Introduction

3 formalizations depending on the assumed nature of the reward process

stochastic MAB with i.i.d. payoffs

UCB algo

nonstochastic MAB with adversarial payoffs - a game-theoretic formulation - a different gain vector g_t at each time t

Exp3 randomised algo

Markovian

Gittins indices

oblivious adversary - payoffs are set independently of the agent's actions

nonoblivious adversary - payoffs are set adaptively based on the agent's past behaviour - the regret compares the player's cumulative gain to that obtained by playing the single best arm for the first n rounds

types of averaged regret

expected regret

pseudo-regret
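
Concretely, with gains g_{i,t}, the arm I_t played at time t, and horizon n (following the gain-vector notation above), the definitions as I read them in the monograph are:

    R_n = \max_{i=1,\dots,K} \sum_{t=1}^{n} g_{i,t} - \sum_{t=1}^{n} g_{I_t,t}                                  (regret)
    \mathbb{E}[R_n]                                                                                              (expected regret)
    \bar{R}_n = \max_{i=1,\dots,K} \mathbb{E}\Big[ \sum_{t=1}^{n} g_{i,t} - \sum_{t=1}^{n} g_{I_t,t} \Big]       (pseudo-regret)

Since the max sits outside the expectation in the expected regret, \bar{R}_n \le \mathbb{E}[R_n].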

Chap 2 - Stochastic bandits

solve the exploration-exploitation dilemma using the heuristic principle of "optimism in the face of uncertainty"

upper confidence bound (UCB) strategies - alpha-UCB
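
A minimal sketch of an alpha-UCB style index policy for rewards in [0, 1]; the confidence-term form assumes the usual sub-Gaussian case, and the function name, arm interface and default alpha are illustrative, not the monograph's:

    import math
    import random

    def alpha_ucb(arms, n_rounds, alpha=2.5):
        # arms: list of callables, each returning a stochastic reward in [0, 1]
        K = len(arms)
        counts = [0] * K      # T_i(t): number of times arm i has been played
        sums = [0.0] * K      # cumulative observed reward of arm i
        total = 0.0
        for t in range(1, n_rounds + 1):
            if t <= K:
                i = t - 1     # play each arm once to initialise the estimates
            else:
                # index: empirical mean + sqrt(alpha * ln t / (2 * T_i))
                i = max(range(K), key=lambda a: sums[a] / counts[a]
                        + math.sqrt(alpha * math.log(t) / (2 * counts[a])))
            reward = arms[i]()
            counts[i] += 1
            sums[i] += reward
            total += reward
        return total

    # hypothetical usage with two Bernoulli arms
    arms = [lambda p=p: float(random.random() < p) for p in (0.4, 0.6)]
    alpha_ucb(arms, 10_000)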

lower bound on the regret, expressed through the KL divergence between the arms' reward distributions

refinements of UCB

improved constants

second-order bounds

distribution-free bounds

high probability bounds

epsilon-greedy

Thompson sampling

heavy-tailed distributions

a stochastic assumption is made on the generation of rewards

maximise the payoff/rewards; equivalently, minimise the agent's regret for not always playing optimally

Chap 3 - Adversarial bandits

no stochastic assumption is made on the generation of rewards

measure the performance of the player against the performance of the best arm over the past n rounds through the regret - the loss & gain versions are symmetric

pseudo-regret bounds

randomized agent Exp3

use an exponential reweighting of the cumulative estimated losses to define the probability distribution over the arms at each time t

Exp3 mixes the exponential weights with a uniform distribution over the arms
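
A minimal sketch of that scheme (loss version): exponential weights on importance-weighted cumulative loss estimates, mixed with a uniform distribution; eta, gamma and the get_loss callback are placeholders I'm assuming, not the monograph's tuned parameters:

    import math
    import random

    def exp3(K, n_rounds, get_loss, eta=0.05, gamma=0.01):
        # get_loss(t, i): loss in [0, 1] of arm i at round t (only the played arm is observed)
        cum_est_loss = [0.0] * K                         # estimated cumulative loss of each arm
        for t in range(1, n_rounds + 1):
            m = min(cum_est_loss)                        # shift for numerical stability (does not change p)
            w = [math.exp(-eta * (L - m)) for L in cum_est_loss]
            W = sum(w)
            p = [(1 - gamma) * wi / W + gamma / K for wi in w]
            i = random.choices(range(K), weights=p)[0]   # draw arm I_t ~ p_t
            cum_est_loss[i] += get_loss(t, i) / p[i]     # importance-weighted loss estimate
        return cum_est_loss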

high-probability & expected regret bounds

Exp3.P - introduces a bias in the gain estimates
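
If I recall the monograph's Exp3.P correctly, the biased gain estimate has the form (beta > 0 the bias parameter)

    \tilde{g}_{i,t} = \frac{g_{i,t}\,\mathbf{1}_{\{I_t = i\}} + \beta}{p_{i,t}}

so that the cumulative estimate upper-bounds every arm's true cumulative gain with high probability, which is what turns pseudo-regret guarantees into high-probability regret bounds.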

lower bound

minimax lower bound

refinements of Exp3 & Exp3.P

log-free upper bounds

Implicitly normalized forecaster (INF) - weighted average forecaster

adaptive bounds

competing w the best switching strategy

stochastic vs adversarial bandits

alternative feedback structures

Chap 4 - Contextual bandits

"contextual regret" is introduced where optimality is defined w respect to the best policy (i.e. mapping from contexts to arms)

S-Exp3 - runs an instance of Exp3 on each context s

the expert case

there is a finite set of randomised policies/experts

each expert's advice is a probability distribution over arms, representing the randomised play of expert j at time t

Exp4 - runs Exp3 over the N experts using estimates of the experts' losses; in order to draw arms, Exp4 mixes the expert advice with the probability distribution over experts maintained by Exp3
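
A minimal sketch of that combination (loss version); N, eta and the get_advice/get_loss callbacks are placeholder names I'm using for illustration:

    import math
    import random

    def exp4(K, N, n_rounds, get_advice, get_loss, eta=0.05):
        # get_advice(t): list of N probability vectors over the K arms (the expert advice)
        # get_loss(t, i): loss in [0, 1] of arm i at round t
        cum_est_loss = [0.0] * N                  # estimated cumulative loss of each expert
        for t in range(1, n_rounds + 1):
            advice = get_advice(t)
            m = min(cum_est_loss)
            w = [math.exp(-eta * (L - m)) for L in cum_est_loss]
            W = sum(w)
            q = [wj / W for wj in w]              # Exp3-style distribution over experts
            # arm distribution = expert advice mixed with the distribution over experts
            p = [sum(q[j] * advice[j][i] for j in range(N)) for i in range(K)]
            i = random.choices(range(K), weights=p)[0]
            loss_est = get_loss(t, i) / p[i]      # importance-weighted estimate of arm i's loss
            for j in range(N):
                cum_est_loss[j] += advice[j][i] * loss_est   # expert j charged its expected estimated loss
        return q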

competing against the best context set - a scenario in which Exp4 uses instances of S-Exp3 as experts

Stochastic contextual bandits

each policy is a function f mapping the context space to the arm space {1, ..., K}, and the set F of policies is given as an input parameter to the forecaster

a bandit variant of supervised learning

the multiclass case

a bandit variant of the online protocol for multiclass classification, where the goal is to sequentially learn a mapping from the context space to the label space {1, ..., K}

Banditron - a bandit variant of the multiclass Perceptron
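
A minimal sketch of the Banditron as I understand it from Kakade et al. (2008): a greedy multiclass Perceptron prediction, uniform exploration with rate gamma, and an importance-weighted surrogate for the full-information update; the array interface and the gamma value are assumptions for illustration:

    import numpy as np

    def banditron(X, y, K, gamma=0.05):
        # X: (n, d) array of contexts; y: length-n array of labels in {0, ..., K-1}
        n, d = X.shape
        W = np.zeros((K, d))                          # one weight vector per label
        mistakes = 0
        for t in range(n):
            x = X[t]
            y_hat = int(np.argmax(W @ x))             # greedy prediction
            p = np.full(K, gamma / K)
            p[y_hat] += 1.0 - gamma                   # exploration distribution over labels
            y_tilde = int(np.random.choice(K, p=p))   # label actually played
            if y_tilde != y[t]:
                mistakes += 1
            # estimate of the Perceptron update, unbiased over the draw of y_tilde
            U = np.zeros((K, d))
            if y_tilde == y[t]:
                U[y_tilde] += x / p[y_tilde]
            U[y_hat] -= x
            W += U
        return W, mistakes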

Chap 5 - Linear bandits

a generalization of adversarial bandits - assumes that the loss at each time step is a linear function of the arm played

online linear optimization is still far from being completely understood

Exp2 (Expanded Exp) with John's exploration

Online mirror descent (OMD)

online stochastic mirror descent (OSMD)

online combinatorial optimization

improved regret bounds for bandit feedback

Chap 6 - Nonlinear bandits

Two-point bandit feedback

One-point bandit feedback
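
One-point bandit feedback is usually handled with a single-evaluation gradient estimate in the style of Flaxman et al.; delta and u_t below are my notation, not necessarily the monograph's:

    \hat{g}_t = \frac{d}{\delta}\, f_t(x_t + \delta u_t)\, u_t, \qquad u_t \sim \mathrm{Unif}(\mathbb{S}^{d-1}),

which is an unbiased estimate of the gradient of the smoothed loss \hat{f}_t(x) = \mathbb{E}_{v \sim \mathrm{Unif}(B^d)}\, f_t(x + \delta v).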

nonlinear stochastic bandits

Chap 7 - Variants

in both the stochastic & adversarial frameworks, the variants discussed in previous chapters basically revolved around a single principle: by adding constraints on the losses/rewards, it is possible to compete against larger sets of arms

Markov decision processes, restless & sleeping bandits

pure exploration problems - the goal is not maximising cumulative rewards

dueling bandits

discovery with probabilistic expert advice

many-armed bandits

truthful bandits

stochastic bandits are also known as stationary stochastic bandits - the distributions used to generate the rewards remain fixed

a statistical problem

a deterministic game theory problem

stationarity

stationary MAB problem - the distributions of rewards remain constant

non-stationary MAB problem - the distributions of rewards change over time; the policy needs to track changes of the best arm