Reinforcement Learning
Resources
David Silver's lectures & slides (http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html; https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ&index=1)
Reinforcement Learning: An Introduction, 2nd edition, Sutton & Barto (http://incompleteideas.net/book/the-book-2nd.html)
What?
At each step \(t\)
the agent
receives observation \(O_t\)
receives scalar reward \(R_t\)
executes action \(A_t\)
the environment
receives action \(A_t\)
emits observation \(O_{t+1}\)
emits scalar reward \(R_{t+1}\)
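A minimal sketch of this interaction loop, assuming hypothetical agent and env objects with act/reset/step methods (not any specific library's API):

```python
# Minimal agent-environment loop (hypothetical Agent/Environment interfaces,
# not a specific library API).
def run_episode(agent, env, max_steps=1000):
    obs, reward = env.reset(), 0.0
    for t in range(max_steps):
        action = agent.act(obs, reward)        # agent receives O_t, R_t, executes A_t
        obs, reward, done = env.step(action)   # environment receives A_t, emits O_{t+1}, R_{t+1}
        if done:
            break
```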
The history is the sequence of observations, rewards, actions: \(H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t\)
The agent state is the information used to determine what happens next: \(S_t^a = f(H_t)\)
The environment state \(S_t^e\) is the environment's private representation. \(S_t^e\) is usually not visible to the agent. Even if \(S_t^e\) is visible, it may contain irrelevant information.
An information state (a.k.a. Markov state) \(S_t\) is Markov if and only if:
\(\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, ..., S_t]\)
i.e.: "The future is independent of the past given the present."
Full observability: the agent directly observes the environment state, i.e.: \(O_t = S_t^a = S_t^e\)
An agent may include
a value function: how good is each state and/or action.
A value function is a prediction of future reward.
\(v_{\pi}(s) = \mathbb{E}_{\pi} [ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... \mid S_t = s ]\)
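A concrete illustration of the discounted return being estimated; a minimal sketch assuming a list of rewards from one trajectory (not tied to any particular environment):

```python
def discounted_return(rewards, gamma=0.9):
    """Compute R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... for one trajectory."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# v_pi(s) can be estimated by averaging such returns over trajectories starting in s under pi.
print(discounted_return([1.0, 0.0, 2.0]))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```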
a model: agent's representation of the environment.
A model predicts what the environment will do next.
transition model: predicts the next state
reward model: predicts the next (immediate) reward
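A tabular sketch of such a model, estimated from counts of observed transitions; the state and action sets are assumed to be small and discrete (illustrative only):

```python
from collections import defaultdict

# Tabular model learned from experience (illustrative sketch, discrete S and A).
counts = defaultdict(lambda: defaultdict(int))  # counts[(s, a)][s'] = visit count
reward_sum = defaultdict(float)                 # summed rewards per (s, a)
visits = defaultdict(int)                       # visits per (s, a)

def update_model(s, a, r, s_next):
    counts[(s, a)][s_next] += 1
    reward_sum[(s, a)] += r
    visits[(s, a)] += 1

def model(s, a):
    n = visits[(s, a)]
    p = {s2: c / n for s2, c in counts[(s, a)].items()}  # transition model P(s'|s,a)
    r = reward_sum[(s, a)] / n                            # reward model R(s,a)
    return p, r
```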
a policy: agent's behaviour function
stochastic: \(\pi(a \mid s) = \mathbb{P} [A_t = a \mid S_t = s] \)
deterministic: \(a = \pi(s)\)
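A small sketch contrasting the two forms for a discrete action set (state and action names are illustrative):

```python
import random

# Stochastic policy: pi(a|s) is a probability distribution over actions per state.
pi = {"s0": {"left": 0.8, "right": 0.2}}

def sample_action(state):
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

# Deterministic policy: a = pi(s), a direct mapping from states to actions.
pi_det = {"s0": "left"}
```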
balances
exploration: finding more information about the environment
exploitation: exploiting known information to maximise reward
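One common way to trade these off is epsilon-greedy action selection; a minimal sketch, assuming tabular action values Q[(s, a)] (illustrative, not the only scheme):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the greedy action."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```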
fundamental decision-making problems
Prediction (evaluate the future) vs. Control (optimise the future)
Learning vs. Planning
How?
Temporal Difference
Distributional RL
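A sketch of the tabular TD(0) value update mentioned above, assuming a value table V, step size alpha, and discount gamma (illustrative, not a complete algorithm):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """Tabular TD(0): move V(s) toward the one-step target r + gamma * V(s')."""
    td_target = r + gamma * V.get(s_next, 0.0)
    td_error = td_target - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return V
```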