RL
Explore-exploit
dilemma
epsilon-greedy (see the bandit sketch below)
optimistic initial value
Upper Confidence Bound 1
(UCB1)
Bayesian or
Thompson sampling
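A minimal sketch of epsilon-greedy action selection on a multi-armed bandit; the arm probabilities, epsilon, and step count are made-up illustration values, not from the map. A UCB1 selection rule is shown in a comment for comparison.

```python
import numpy as np

# Epsilon-greedy on a made-up 5-armed Bernoulli bandit (all numbers are
# illustrative assumptions).
rng = np.random.default_rng(0)
true_p = np.array([0.2, 0.5, 0.6, 0.3, 0.4])   # hypothetical arm win probabilities
n_arms, eps, steps = len(true_p), 0.1, 5000

Q = np.zeros(n_arms)   # estimated value of each arm
N = np.zeros(n_arms)   # number of pulls per arm

for t in range(1, steps + 1):
    if rng.random() < eps:
        a = int(rng.integers(n_arms))          # explore: random arm
    else:
        a = int(np.argmax(Q))                  # exploit: current best estimate
    # UCB1 alternative: a = int(np.argmax(Q + np.sqrt(2 * np.log(t) / (N + 1e-9))))
    r = float(rng.random() < true_p[a])        # Bernoulli reward
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]                  # incremental sample-average update

print("estimated arm values:", np.round(Q, 2))
```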
MDP (Markov Decision Process)
solutions
DP
(Dynamic Programming)
pi-eval --> V(s)
pi-iteration: double loop --> opt
value-iteration: single loop --> opt
direct application of Bellman's equation (see the DP sketch below)
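A minimal value-iteration sketch, i.e. the single-loop direct application of the Bellman optimality equation; the chain MDP here (5 states, left/right actions, +1 on reaching the rightmost terminal state, gamma = 0.9) is an assumption chosen for illustration. Policy iteration would instead alternate a full policy-evaluation loop with a greedy policy-improvement step, hence the double loop.

```python
import numpy as np

# Value iteration on a tiny made-up chain MDP (assumptions: 5 states, actions
# left/right, +1 only on reaching the rightmost terminal state, gamma = 0.9).
n_states, gamma, theta = 5, 0.9, 1e-8
V = np.zeros(n_states)                 # V[4] stays 0: terminal state

def step(s, a):
    """Deterministic transition; a is -1 (left) or +1 (right)."""
    s2 = min(max(s + a, 0), n_states - 1)
    r = 1.0 if s2 == n_states - 1 else 0.0
    return s2, r

# Single loop: sweep the states, applying the Bellman optimality equation directly.
while True:
    delta = 0.0
    for s in range(n_states - 1):      # skip the terminal state
        q_values = []
        for a in (-1, +1):
            s2, r = step(s, a)
            q_values.append(r + gamma * V[s2])
        v_new = max(q_values)
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < theta:                  # converged to V*
        break

print("V* ~", np.round(V, 3))
```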
MC (Monte Carlo)
First-Visit
pi-eval with Exploring Starts (ES)
opt with ES
opt without ES, using epsilon-soft policies
learns from experience,
but not fully online (see the MC sketch below)
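A first-visit Monte Carlo policy-evaluation sketch for the equiprobable random policy on a made-up 5-state random walk (terminals at both ends, +1 only at the right terminal); every update waits for the episode to finish, which is why MC is not fully online.

```python
import numpy as np

# First-visit MC policy evaluation of the equiprobable random policy on a
# made-up 5-state random walk (states 0 and 4 terminal, +1 only when the
# right terminal is reached, gamma = 1). All updates wait for episode end.
rng = np.random.default_rng(0)
n_states, gamma, n_episodes = 5, 1.0, 10_000
V = np.zeros(n_states)
visit_count = np.zeros(n_states)

for _ in range(n_episodes):
    s, episode = 2, []                        # start in the middle state
    while 0 < s < n_states - 1:
        a = rng.choice([-1, 1])               # random policy: left or right
        s2 = s + a
        r = 1.0 if s2 == n_states - 1 else 0.0
        episode.append((s, r))
        s = s2
    first_visit = {}
    for t, (st, _) in enumerate(episode):     # index of the first visit per state
        first_visit.setdefault(st, t)
    G = 0.0
    for t in reversed(range(len(episode))):   # accumulate returns backwards
        st, r = episode[t]
        G = gamma * G + r
        if first_visit[st] == t:              # first-visit rule
            visit_count[st] += 1
            V[st] += (G - V[st]) / visit_count[st]

print("V ~", np.round(V, 2))  # true values for states 1..3: 0.25, 0.5, 0.75
```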
TD (Temporal Difference)
prediction
TD(0)
TD(1)
TD(lambda)
control (opt)
SARSA
Q-Learning (off-policy)
learns from experience,
fully online,
with bootstrapping
(estimates are made from other estimates; see the TD sketch below)
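A tabular Q-learning sketch (off-policy TD control with bootstrapping) on the same made-up chain MDP as in the DP sketch; alpha, epsilon, and the episode count are illustrative assumptions. SARSA would differ only in the target: it would use the Q-value of the action actually chosen next by the epsilon-greedy behavior policy instead of the max.

```python
import numpy as np

# Tabular Q-learning on the made-up chain MDP from the DP sketch
# (assumptions: 5 states, actions 0 = left / 1 = right, +1 on reaching the
# rightmost state, gamma = 0.9, alpha = 0.1, epsilon = 0.1).
rng = np.random.default_rng(0)
n_states, gamma, alpha, eps, n_episodes = 5, 0.9, 0.1, 0.1, 2000
Q = np.zeros((n_states, 2))

def step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
    r = 1.0 if s2 == n_states - 1 else 0.0
    return s2, r, s2 == n_states - 1          # next state, reward, done flag

for _ in range(n_episodes):
    s, done = 0, False
    while not done:
        if rng.random() < eps:                # epsilon-greedy behavior policy
            a = int(rng.integers(2))
        else:                                 # greedy with random tie-breaking
            a = int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))
        s2, r, done = step(s, a)
        # bootstrapped, off-policy target: max over next actions
        target = r if done else r + gamma * np.max(Q[s2])
        Q[s, a] += alpha * (target - Q[s, a])
        s = s2                                # fully online: update every step

print("greedy policy:", ["left" if np.argmax(q) == 0 else "right" for q in Q[:-1]])
```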
opt
on-policy (e.g. DP, MC, TD-SARSA)
off-policy (e.g. TD-Q-Learning)
behavior policy may not coincide with the optimal policy opt_pi
function-approximation
(generalization, no LUT)
MC-GD
(Gradient Descent)
TD(0)-SGD (semi-gradient descent)
target depends on the prediction (see the sketch below)
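A semi-gradient TD(0) sketch with linear function approximation, where a weight vector replaces the lookup table; the continuous-state environment, the features, and the step sizes are made-up assumptions. "Semi-gradient" means the bootstrapped target r + gamma*v(s') is treated as a constant, so the gradient flows only through the prediction v(s).

```python
import numpy as np

# Semi-gradient TD(0) with linear function approximation (assumptions: a
# continuous state in [0, 1] with made-up random-walk dynamics, reward equal
# to the next state, simple polynomial features, gamma = 0.95, alpha = 0.01).
rng = np.random.default_rng(0)
gamma, alpha, n_steps = 0.95, 0.01, 50_000

def features(s):
    return np.array([1.0, s, s * s])          # generalizes across states, no LUT

w = np.zeros(3)                               # weight vector replaces the table
s = rng.random()
for _ in range(n_steps):
    s2 = float(np.clip(s + rng.normal(0.0, 0.1), 0.0, 1.0))  # illustrative dynamics
    r = s2                                    # made-up reward signal
    v_s, v_s2 = w @ features(s), w @ features(s2)
    # "semi-gradient": the target r + gamma * v(s2) is treated as a constant,
    # so only the prediction v(s) is differentiated with respect to w
    w += alpha * (r + gamma * v_s2 - v_s) * features(s)
    s = s2

print("learned weights:", np.round(w, 3))
```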
prediction: find V(s) for given policy
(policy evaluation)
Control: find the optimal policy and V
(policy iteration)
LUT (lookup table; not feasible for
large/continuous state spaces)
Continuous state (e.g. light intensity)
Continuous action (e.g. force value)