Bubeck & Cesa-Bianchi - 2012 - Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems
Chap 1 - Introduction
3 formalizations depending on the assumed nature of the reward process
stochastic MAB with i.i.d. payoffs
UCB algorithm
also known as stationary stochastic bandits - the distributions used to generate the rewards remain fixed
a statistical problem
nonstochastic MAB with adversarial payoffs - a game-theoretic formulation - a different gain vector g_t at each round t
Exp3 randomised algorithm
oblivious adversary - payoffs are set independently of the agent's actions
nonoblivious adversary - payoffs are set adaptively based on the agent's past behaviour; the regret compares the player's cumulative gain to the gain obtained by playing the single best arm over the first n rounds
a game-theoretic problem - the gains are chosen by an adversary (deterministically, not sampled from fixed distributions)
Markovian
Gittins indices
types of averaged regret
expected regret
pseudo-regret
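For reference, with g_{i,t} the gain of arm i at round t, I_t the arm played, K arms and horizon n (the survey's definitions, written in the gain notation used above):

```latex
% pseudo-regret vs expected regret
\bar{R}_n \;=\; \max_{i=1,\dots,K} \mathbb{E}\!\left[\sum_{t=1}^{n} g_{i,t} - \sum_{t=1}^{n} g_{I_t,t}\right]
\qquad
\mathbb{E}[R_n] \;=\; \mathbb{E}\!\left[\max_{i=1,\dots,K} \sum_{t=1}^{n} g_{i,t} - \sum_{t=1}^{n} g_{I_t,t}\right]
\qquad
\bar{R}_n \;\le\; \mathbb{E}[R_n]
```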
maximise the cumulative payoff/rewards; equivalently, minimise the agent's regret for not always playing optimally
stationarity
stationary MAB problem - the distributions of rewards remain constant over time
non-stationary MAB problem - the distributions of rewards change over time; the policy needs to track changes of the best arm
Chap 5 - Linear bandits
Exp2 (Expanded Exp) with John's exploration
Online mirror descent (OMD) - see the sketch at the end of this chapter's notes
online stochastic mirror descent (OSMD)
online combinatorial optimization
improved regret bounds for bandit feedback
a generalization of adversarial bandits - assume that the loss at each time step is a linear function of the chosen arm
online linear optimization is still far from being completely understood
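To make the OMD item above concrete, here is a minimal sketch assuming full-information linear losses on the probability simplex and the negative-entropy regularizer (in which case the mirror step reduces to exponential weights); OSMD would additionally replace the loss vector with an estimate built from bandit feedback. Names, the learning rate and the toy losses are illustrative, not the survey's notation.

```python
import numpy as np

def omd_negative_entropy(loss_vectors, eta=0.1):
    """Online mirror descent on the probability simplex with the
    negative-entropy regularizer (equivalent to exponential weights).

    loss_vectors: iterable of length-K arrays of full-information linear
    losses; eta: learning rate.  Returns the sequence of played distributions.
    """
    plays = []
    w = None
    for loss in loss_vectors:
        if w is None:
            w = np.ones(len(loss)) / len(loss)   # start from the uniform distribution
        plays.append(w.copy())
        # Mirror step: gradient step in the dual (log) space, then
        # renormalize, i.e. Bregman projection back onto the simplex.
        w = w * np.exp(-eta * loss)
        w /= w.sum()
    return plays

# toy usage: 3 arms, arbitrary linear losses
rng = np.random.default_rng(0)
losses = [rng.uniform(size=3) for _ in range(100)]
print(omd_negative_entropy(losses)[-1])
```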
Chap 2 - Stochastic bandits
solve the exploration-exploitation dilemma using the heuristic principle of "optimism in the face of uncertainty"
upper confidence bound (UCB) strategies - α-UCB (see the sketch after the refinements list below)
lower bound on the regret expressed via the KL divergence between the reward distributions
refinements of UCB
improved constants
second-order bounds
distribution-free bounds
high probability bounds
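A minimal sketch of the α-UCB index mentioned above, assuming rewards in [0, 1]; the variable names, the value α = 3 and the toy Bernoulli arms are illustrative, not the survey's notation.

```python
import math, random

def alpha_ucb(arms, horizon, alpha=3.0):
    """alpha-UCB for stochastic bandits with rewards in [0, 1].

    arms: list of no-argument callables, each returning one random reward.
    After trying every arm once, play the arm maximizing the index
    mean + sqrt(alpha * ln(t) / (2 * pulls)); alpha > 2 for the survey's
    logarithmic regret bound.
    """
    K = len(arms)
    pulls = [0] * K
    means = [0.0] * K
    for t in range(1, horizon + 1):
        if t <= K:                                   # initialization round
            i = t - 1
        else:
            i = max(range(K), key=lambda j: means[j] +
                    math.sqrt(alpha * math.log(t) / (2 * pulls[j])))
        r = arms[i]()
        pulls[i] += 1
        means[i] += (r - means[i]) / pulls[i]        # running mean update
    return means, pulls

# toy usage: three Bernoulli arms with means 0.3, 0.5, 0.7
arms = [lambda p=p: float(random.random() < p) for p in (0.3, 0.5, 0.7)]
print(alpha_ucb(arms, 5000))
```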
ε-greedy
Thompson sampling (see the sketch below)
heavy-tailed distributions
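A minimal Beta-Bernoulli sketch of the Thompson sampling idea mentioned above; the uniform Beta(1, 1) priors and 0/1 rewards are one common instantiation, not the only one discussed in the literature.

```python
import random

def thompson_bernoulli(arms, horizon):
    """Thompson sampling for Bernoulli rewards with Beta(1, 1) priors.

    arms: list of no-argument callables returning a 0/1 reward.  Each round,
    sample a mean from every arm's posterior and play the largest sample.
    """
    K = len(arms)
    alpha = [1.0] * K    # 1 + number of observed successes
    beta = [1.0] * K     # 1 + number of observed failures
    for _ in range(horizon):
        samples = [random.betavariate(alpha[i], beta[i]) for i in range(K)]
        i = max(range(K), key=samples.__getitem__)
        r = arms[i]()
        alpha[i] += r
        beta[i] += 1 - r
    return alpha, beta

# toy usage: three Bernoulli arms with means 0.3, 0.5, 0.7
arms = [lambda p=p: float(random.random() < p) for p in (0.3, 0.5, 0.7)]
print(thompson_bernoulli(arms, 5000))
```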
a stochastic assumption is made on the generation of rewards
Chap 3 - Adversarial bandits
no stochastic assumption is made on the generation of rewards
measure the performance of the player against the performance of the best arm over the n rounds through the regret; the loss and gain versions are symmetric
pseudo-regret bounds
randomised forecaster Exp3 (see the sketch after the Exp3.P item below)
uses an exponential reweighting of the cumulative estimated losses to define the probability distribution over arms at each time t
Exp3 mixes the exponential weights with a uniform distribution over the arms
high-probability & expected regret bounds
Exp3.P - introduces a bias in the gain estimates (needed to obtain bounds that hold with high probability)
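A minimal sketch of Exp3 in the classical gain-based formulation with uniform mixing described above: exponential weights, a mixing parameter γ, and importance-weighted gain estimates for the played arm. Parameter names and the toy usage are illustrative.

```python
import math, random

def exp3(gain_of_arm, K, horizon, gamma=0.1):
    """Exp3: exponential weights mixed with uniform exploration.

    gain_of_arm(t, i) returns the gain in [0, 1] of arm i at round t;
    only the played arm's gain is ever observed.
    """
    weights = [1.0] * K
    total_gain = 0.0
    for t in range(horizon):
        wsum = sum(weights)
        probs = [(1 - gamma) * w / wsum + gamma / K for w in weights]
        i = random.choices(range(K), weights=probs)[0]   # draw an arm
        g = gain_of_arm(t, i)
        total_gain += g
        g_hat = g / probs[i]                             # importance-weighted gain estimate
        weights[i] *= math.exp(gamma * g_hat / K)        # exponential reweighting
    return total_gain

# toy usage: 3 arms with Bernoulli gains of means 0.2, 0.5, 0.8
print(exp3(lambda t, i: float(random.random() < (0.2, 0.5, 0.8)[i]), K=3, horizon=5000))
```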
lower bound
minimax lower bound
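For context, the minimax lower bound says that no strategy can guarantee pseudo-regret better than order sqrt(nK) against all adversaries (constant omitted here):

```latex
% minimax (distribution-free) lower bound, constants omitted
\inf_{\text{strategies}} \; \sup_{\text{adversaries}} \; \bar{R}_n \;=\; \Omega\!\left(\sqrt{nK}\right)
```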
refinements of Exp3 & Exp3.P
log-free upper bounds
Implicitly Normalized Forecaster (INF) - a weighted average forecaster
adaptive bounds
competing with the best switching strategy
stochastic vs adversarial bandits
alternative feedback structures
Chap 4 - Contextual bandits
"contextual regret" is introduced where optimality is defined w respect to the best policy (i.e. mapping from contexts to arms)
S-Exp3 - runs an instance of Exp3 on each context s
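A minimal sketch of the S-Exp3 idea: keep one independent Exp3-style learner per observed context value s (the per-context learner below is a compact rewrite of the Exp3 sketch in the Chapter 3 notes; names and the toy stream are illustrative).

```python
import math, random
from collections import defaultdict

class Exp3Instance:
    """Minimal per-context Exp3 learner for gains in [0, 1]."""
    def __init__(self, K, gamma=0.1):
        self.K, self.gamma = K, gamma
        self.w = [1.0] * K

    def draw(self):
        wsum = sum(self.w)
        self.p = [(1 - self.gamma) * wi / wsum + self.gamma / self.K for wi in self.w]
        return random.choices(range(self.K), weights=self.p)[0]

    def update(self, arm, gain):
        self.w[arm] *= math.exp(self.gamma * gain / (self.p[arm] * self.K))

def s_exp3(stream, K):
    """S-Exp3: one Exp3 instance per context value s."""
    learners = defaultdict(lambda: Exp3Instance(K))
    for s, gain_of in stream:          # gain_of(arm) -> observed gain in [0, 1]
        arm = learners[s].draw()
        learners[s].update(arm, gain_of(arm))
    return learners

# toy usage: the context s in {0, 1} decides which of 3 arms is best
means_by_context = {0: (0.8, 0.2, 0.2), 1: (0.2, 0.2, 0.8)}
stream = [(t % 2, lambda arm, s=t % 2: float(random.random() < means_by_context[s][arm]))
          for t in range(4000)]
learners = s_exp3(stream, K=3)
```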
the expert case
there is a finite set of randomised policies/experts
each expert's advice is a probability distribution over arms, representing the randomised play of expert j at time t
Exp4 - runs Exp3 over the N experts using estimates of the experts' losses; to draw arms, Exp4 mixes the expert advice with the probability distribution over experts maintained by Exp3 (see the sketch below)
competing against the best context set - a scenario in which Exp4 uses instances of S-Exp3 as experts
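A minimal sketch of Exp4 as described above: exponential weights over the N experts, arms drawn from the mixture of expert advice (plus uniform exploration), and importance-weighted gain estimates fed back to every expert. Names, the mixing parameter and the toy experts are illustrative.

```python
import math, random

def exp4(advice_at, gain_of_arm, N, K, horizon, gamma=0.1):
    """Exp4: Exp3-style exponential weights over N experts.

    advice_at(t) returns an N x K list of probability vectors (expert advice);
    gain_of_arm(t, i) returns the gain in [0, 1] of arm i at round t.
    """
    q = [1.0] * N                                    # weights over experts
    for t in range(horizon):
        xi = advice_at(t)
        qsum = sum(q)
        # mix the expert advice according to q, then mix with uniform exploration
        p = [(1 - gamma) * sum(q[j] * xi[j][i] for j in range(N)) / qsum + gamma / K
             for i in range(K)]
        i = random.choices(range(K), weights=p)[0]
        g_hat = gain_of_arm(t, i) / p[i]             # importance-weighted gain estimate
        for j in range(N):
            y_hat = xi[j][i] * g_hat                 # estimated gain of expert j
            q[j] *= math.exp(gamma * y_hat / K)
    return q

# toy usage: 2 constant experts (always arm 0 vs always arm 2) over 3 arms
advice = lambda t: [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(exp4(advice, lambda t, i: float(random.random() < (0.2, 0.5, 0.8)[i]),
           N=2, K=3, horizon=2000))
```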
Stochastic contextual bandits
each policy is a function f mapping the context space to the arm space {1, ..., K}, and the set F of policies is given as an input parameter to the forecaster
a bandit variant of supervised learning
the multiclass case
a bandit variant of the online protocol for multiclass classification, where the goal is to sequentially learn a mapping from the context space to the label space {1, ..., K}
Banditron - a bandit variant of the multiclass Perceptron
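A minimal sketch in the spirit of the Banditron update: a multiclass Perceptron where the learner only observes whether its predicted label was correct, so it explores with probability γ and uses an importance-weighted update. The label encoding {0, ..., K-1}, the value of γ and the toy data are illustrative.

```python
import numpy as np

def banditron(stream, K, d, gamma=0.05, rng=None):
    """Banditron sketch: multiclass Perceptron with bandit feedback.

    stream yields (x, y) pairs with x a feature vector in R^d and y the true
    label in {0, ..., K-1}; only the indicator "was the played label correct?"
    is used for learning.
    """
    rng = rng or np.random.default_rng(0)
    W = np.zeros((K, d))
    for x, y in stream:
        scores = W @ x
        y_hat = int(np.argmax(scores))              # greedy prediction
        P = np.full(K, gamma / K)
        P[y_hat] += 1 - gamma                       # exploration mixture
        y_tilde = int(rng.choice(K, p=P))           # label actually played
        correct = float(y_tilde == y)               # the only feedback observed
        U = np.zeros((K, d))
        U[y_tilde] += x * correct / P[y_tilde]      # importance-weighted positive part
        U[y_hat] -= x                               # Perceptron-style negative part
        W += U
    return W

# toy usage: 3 classes in 2-D, labels given by the nearest of fixed prototypes
protos = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
rng = np.random.default_rng(1)
data = [(x, int(np.argmax(protos @ x))) for x in rng.normal(size=(2000, 2))]
W = banditron(data, K=3, d=2)
```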
Chap 6 - Nonlinear bandits
Two-point bandit feedback
One-point bandit feedback
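A minimal sketch of the classical one-point gradient estimate underlying one-point bandit feedback: the gradient of an unknown function is estimated from a single evaluation at a randomly perturbed point. The smoothing parameter δ and the toy quadratic below are illustrative.

```python
import numpy as np

def one_point_gradient_estimate(f, x, delta=0.25, rng=None):
    """One-point bandit feedback: estimate the gradient of f at x from a
    single function evaluation, via (d / delta) * f(x + delta * u) * u with
    u uniform on the unit sphere.  In general this is unbiased only for a
    delta-smoothed version of f.
    """
    rng = rng or np.random.default_rng()
    d = len(x)
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)                 # uniform direction on the unit sphere
    return (d / delta) * f(x + delta * u) * u

# toy check: average many one-point estimates and compare to the true gradient
# (exact in expectation here because f is quadratic)
f = lambda z: float(np.sum((z - 1.0) ** 2))
x = np.array([0.5, -0.5, 2.0])
rng = np.random.default_rng(0)
est = np.mean([one_point_gradient_estimate(f, x, rng=rng) for _ in range(20000)], axis=0)
print(est, 2 * (x - 1.0))                  # the average should be close to 2 * (x - 1)
```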
nonlinear stochastic bandits
Chap 7 - Variants
in both the stochastic & adversarial frameworks, the variants discussed in previous chapters basically revolve around a single principle: by adding constraints on the losses/rewards, it is possible to compete against larger sets of arms
Markov decision processes, restless & sleeping bandits
pure exploration problems - the goal is not maximising cumulative rewards
dueling bandits
discovery w probabilistic expert advice
many-armed bandits
truthful bandits