COMP 579 Reinforcement Learning
K-Armed Bandit
Exploration vs. Exploitation
Action Selection
Optimistic Initial Values
Upper Confidence Bound (UCB): score each action by its estimated value plus an uncertainty bonus, and pick the action with the largest upper bound
\( \epsilon \)-Greedy: act greedily most of the time, but pick a random action with probability \( \epsilon \) (a code sketch of these selection rules follows this list)
Gradient Bandit Algorithm
Decaying \( \epsilon \)-Greedy: the value of \( \epsilon \) decreases over time
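A minimal sketch of \( \epsilon \)-greedy and UCB action selection for a k-armed bandit with sample-average value estimates; the function names, the exploration constant c, and the Gaussian reward model are illustrative assumptions, not taken from the course:

import numpy as np

def epsilon_greedy(Q, epsilon, rng):
    # With probability epsilon explore uniformly at random, otherwise exploit the best estimate.
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))

def ucb(Q, N, t, c=2.0):
    # Score = estimated value + an exploration bonus that shrinks as an arm is pulled more.
    # Arms never pulled get an infinite bonus, so each arm is tried at least once.
    bonus = c * np.sqrt(np.log(t + 1) / np.maximum(N, 1e-12))
    bonus[N == 0] = np.inf
    return int(np.argmax(Q + bonus))

def run_bandit(true_means, steps=1000, epsilon=0.1, use_ucb=False, seed=0):
    rng = np.random.default_rng(seed)
    k = len(true_means)
    Q = np.zeros(k)   # sample-average value estimates
    N = np.zeros(k)   # pull counts
    for t in range(steps):
        a = ucb(Q, N, t) if use_ucb else epsilon_greedy(Q, epsilon, rng)
        r = rng.normal(true_means[a], 1.0)   # reward R_t from a Gaussian arm (assumption)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]            # incremental mean update
    return Q, N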
At each step you choose an action \( A_t \) from \( k \) possibilities and receive a reward \( R_t \)
Regret: the opportunity loss for one step (value of the best action minus value of the action taken)
Total regret: the opportunity loss accumulated over all steps
The UCB algorithm achieves logarithmic asymptotic total regret
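In standard bandit notation (assumed here: \( q_*(a) \) is the true value of action \( a \) and \( v_* = \max_a q_*(a) \)), these definitions read
\[ \ell_t = \mathbb{E}\big[v_* - q_*(A_t)\big], \qquad L_T = \sum_{t=1}^{T} \mathbb{E}\big[v_* - q_*(A_t)\big], \]
so the UCB result above says \( L_T \) grows as \( O(\log T) \).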
Markov Decision Process
Markov Property: the next state and reward depend only on the current state and action, not on the rest of the history
MDP: a task with the Markov property and finite state and action sets (a finite MDP)
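In the usual notation (assumed here, following Sutton & Barto), the Markov property means the dynamics are fully described by a single function
\[ p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}. \]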
Contextual Bandits
Have states (contexts), but the next state is not determined by previous states or actions
No delayed reward
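A minimal, self-contained sketch of the contextual-bandit loop (the reward table, \( \epsilon \) value, and i.i.d. context distribution are made-up assumptions for illustration): each step the context is given to the agent rather than caused by its past actions, and the reward is immediate.

import numpy as np

rng = np.random.default_rng(0)
n_contexts, n_actions = 3, 2
true_means = rng.normal(size=(n_contexts, n_actions))  # hypothetical reward table

Q = np.zeros((n_contexts, n_actions))  # one set of value estimates per context
N = np.zeros((n_contexts, n_actions))
epsilon = 0.1

for t in range(5000):
    s = rng.integers(n_contexts)              # context arrives i.i.d., not caused by the agent
    if rng.random() < epsilon:
        a = int(rng.integers(n_actions))
    else:
        a = int(np.argmax(Q[s]))
    r = rng.normal(true_means[s, a], 1.0)     # immediate reward only, nothing delayed
    N[s, a] += 1
    Q[s, a] += (r - Q[s, a]) / N[s, a]        # effectively one bandit per context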
Two types of tasks
Episodic Tasks: the return is the simple (undiscounted) sum of rewards up to the end of the episode
Continuing Tasks: discounted return, Bellman equation
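For reference (standard notation assumed, not copied from the course notes), the discounted return for a continuing task and the Bellman equation for the state-value function under a policy \( \pi \) are
\[ G_t \doteq \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad 0 \le \gamma < 1, \]
\[ v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma\, v_\pi(s')\big]. \]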