AI
Informed Search
- New Idea: Heuristic
- Consistent \(…
MDP
- \( s \in S \) States
- \( a \in A \) Actions (previously implicit)
New:
- Transition function \( T(s, a, s') \) -> non-deterministic successor function
- Reward \( R(s,a,s') \)
- New: policy (map from states to actions) \( \pi : S \rightarrow A \)
- Utility = sum of future rewards ( :red_cross: relation to optimality )
- Goal: find a policy \( \pi^* \) which maximizes expected utility
- Offline planning: T and R are known
- Assumption: state space is countable and small
Value Iteration (Goal: calculating \( V^*(s) \))
- Idea: view the Bellman equations as an iterative update rule (sketch below)
- initialise with \( V_0(s) = 0 \)
\( V_{t+1}(s) = \max_a \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V_t (s')] \)
- Problem: slow, \( \mathcal{O}(S^2 A) \) per iteration
- to get the optimal policy \( \pi^* \rightarrow \) policy extraction
- Observation: the policy converges before the values -> considering all actions wastes compute -> Policy Iteration
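A minimal value-iteration sketch, assuming the model is stored as `transitions[(s, a)] = [(s', probability, reward), ...]`; the names and the tiny example MDP are illustrative only, not from these notes.

```python
# Value iteration sketch (illustrative names; assumes a small, enumerable state space).
# transitions[(s, a)] is a list of (s_next, probability, reward) triples,
# encoding both T(s, a, s') and R(s, a, s').

def value_iteration(states, actions, transitions, gamma=0.9, iterations=100):
    V = {s: 0.0 for s in states}                      # V_0(s) = 0
    for _ in range(iterations):
        # V_{t+1}(s) = max_a sum_{s'} T(s,a,s') [R(s,a,s') + gamma * V_t(s')]
        V = {
            s: max(
                sum(p * (r + gamma * V[s2]) for s2, p, r in transitions[(s, a)])
                for a in actions
            )
            for s in states
        }
    return V

# Tiny illustrative MDP: two states, two actions.
states = ["A", "B"]
actions = ["stay", "go"]
transitions = {
    ("A", "stay"): [("A", 1.0, 0.0)],
    ("A", "go"):   [("B", 0.8, 1.0), ("A", 0.2, 0.0)],
    ("B", "stay"): [("B", 1.0, 2.0)],
    ("B", "go"):   [("A", 1.0, 0.0)],
}
print(value_iteration(states, actions, transitions))
```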
Policy Iteration:
- start with a random policy \( \pi \)
- do many steps of policy evaluation
- do one step of policy extraction (improvement); repeat until the policy no longer changes (sketch below)
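A rough policy-iteration sketch, reusing the assumed `(s, a) -> [(s', p, r)]` model encoding from the value-iteration sketch above; `sweeps` and the stopping test are illustrative choices.

```python
import random

def policy_evaluation(policy, states, transitions, gamma=0.9, sweeps=50):
    """Many sweeps of policy evaluation: V^pi(s) = sum_s' T(s,pi(s),s') [R + gamma V^pi(s')]."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {
            s: sum(p * (r + gamma * V[s2]) for s2, p, r in transitions[(s, policy[s])])
            for s in states
        }
    return V

def policy_iteration(states, actions, transitions, gamma=0.9):
    policy = {s: random.choice(actions) for s in states}   # start with a random policy
    while True:
        V = policy_evaluation(policy, states, transitions, gamma)
        # one step of policy extraction / improvement w.r.t. V^pi
        improved = {
            s: max(actions, key=lambda a: sum(p * (r + gamma * V[s2])
                                              for s2, p, r in transitions[(s, a)]))
            for s in states
        }
        if improved == policy:      # policy converged (often before the values do)
            return policy, V
        policy = improved
```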
Q-Value Iteration (Goal: calculating \( Q^*(s, a) \)) :question:
- Idea: view the Bellman equations as an iterative update rule (sketch below)
- initialise with \( Q_0(s, a) = 0 \) :question:
- \( Q_{t+1}(s, a) = \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma \max_{a'} Q_t(s', a')] \)
- Problem: slow, \( \mathcal{O}(S^2 A) \) per iteration :question:
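The corresponding Q-value-iteration sketch, same assumed `(s, a) -> [(s', p, r)]` encoding:

```python
def q_value_iteration(states, actions, transitions, gamma=0.9, iterations=100):
    Q = {(s, a): 0.0 for s in states for a in actions}    # Q_0(s, a) = 0
    for _ in range(iterations):
        # Q_{t+1}(s,a) = sum_{s'} T(s,a,s') [R(s,a,s') + gamma * max_{a'} Q_t(s',a')]
        Q = {
            (s, a): sum(p * (r + gamma * max(Q[(s2, a2)] for a2 in actions))
                        for s2, p, r in transitions[(s, a)])
            for s in states for a in actions
        }
    return Q
```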
Policy Extraction (Goal: get a policy consistent with the optimal values; sketches below)
- from values \( V^* \)
- \( \pi^*(s) = \arg\max_a \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V^*(s')] \)
- \( \mathcal{O}(S^2 A) \)
- from \( Q^* \) values :red_cross:
- \( \pi^*(s) = \arg\max_a Q^*(s, a) \)
- \( \mathcal{O}(S A) \)
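Sketches of both extraction variants (again assuming the `(s, a) -> [(s', p, r)]` model encoding); extraction from \( Q^* \) needs no model, which is why it is only \( \mathcal{O}(S A) \).

```python
def extract_policy_from_v(V, states, actions, transitions, gamma=0.9):
    # pi*(s) = argmax_a sum_{s'} T(s,a,s') [R(s,a,s') + gamma V*(s')]   -> O(S^2 A)
    return {
        s: max(actions, key=lambda a: sum(p * (r + gamma * V[s2])
                                          for s2, p, r in transitions[(s, a)]))
        for s in states
    }

def extract_policy_from_q(Q, states, actions):
    # pi*(s) = argmax_a Q*(s, a)   -> O(S A), no transition model needed
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```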
Policy Evaluation (Goal: calculate \( V^\pi \) for a fixed policy \( \pi \)):
- \( V^\pi(s) = \sum_{s'} T(s, \pi(s), s') [R(s, \pi(s), s') + \gamma V^\pi(s')] \)
- \( \mathcal{O}(S^2) \) per iteration
Basics:
Value of a state:
- \(V^*(s) = \max_a Q^*(s,a) \)
- \( V^*(s) = \max_a \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V^*(s')] \)
Q-Value of a state:
- \( Q^*(s, a) = \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V^*(s')] \)
- \( Q^*(s, a) = \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma \max_{a'} Q^*(s', a')] \)
Reinforcement Learning:
New:
- T and R are unknown
- Learn by interacting with the environment
Model-based
- Estimate T and R by interacting with the environment
- Do offline planning based on the estimates (sketch below)
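A possible sketch of the model-estimation step, assuming experience is logged as `(s, a, r, s')` tuples (names are illustrative); the estimates \( \hat{T}, \hat{R} \) can then be plugged into the offline-planning sketches above.

```python
from collections import defaultdict

def estimate_model(experience):
    """experience: iterable of (s, a, r, s') tuples collected while acting."""
    counts = defaultdict(lambda: defaultdict(int))    # (s, a) -> {s': visit count}
    reward_sum = defaultdict(float)                   # (s, a, s') -> summed reward
    for s, a, r, s2 in experience:
        counts[(s, a)][s2] += 1
        reward_sum[(s, a, s2)] += r
    T_hat, R_hat = {}, {}
    for (s, a), successors in counts.items():
        total = sum(successors.values())
        for s2, n in successors.items():
            T_hat[(s, a, s2)] = n / total                     # empirical T(s, a, s')
            R_hat[(s, a, s2)] = reward_sum[(s, a, s2)] / n    # empirical R(s, a, s')
    return T_hat, R_hat
```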
Model-free
Policy Evaluation (Goal: evaluate \( V^\pi \)):
- act according to some policy \( \pi \)
- after visiting s, transitioning into s', and receiving reward r (sketch below):
\( V^\pi(s) \leftarrow V^\pi(s) + \alpha ([r + \gamma V^\pi(s')] - V^\pi(s)) \)
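A one-line sketch of this temporal-difference update on a dictionary of values (names illustrative):

```python
def td_update(V, s, r, s2, alpha=0.1, gamma=0.9):
    # V^pi(s) <- V^pi(s) + alpha * ([r + gamma V^pi(s')] - V^pi(s))
    V[s] = V.get(s, 0.0) + alpha * ((r + gamma * V.get(s2, 0.0)) - V.get(s, 0.0))
```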
Q-Learning (Goal: learn the optimal policy \( \pi^* \); sketch below)
- act according to some policy \( \pi \)
- after visiting the state-action pair (s, a), transitioning into state s' and receiving reward r:
\( Q(s,a) \leftarrow Q(s,a) + \alpha ([r + \gamma \max_{a'} Q(s',a')] - Q(s,a)) \)
- if \( \pi \) fulfills mild conditions (every state-action pair is visited sufficiently often), \( Q \rightarrow Q^* \)
- extract the optimal policy with policy extraction
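A compact Q-learning sketch with an epsilon-greedy behaviour policy; the `env.reset()` / `env.step(a) -> (s', r, done)` interface and all parameter values are assumptions for illustration, not something defined in these notes.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                            # Q(s, a), initialised to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # behaviour policy: explore with prob. epsilon, otherwise act greedily w.r.t. Q
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env.step(a)                 # assumed environment interface
            # Q(s,a) <- Q(s,a) + alpha * ([r + gamma max_a' Q(s',a')] - Q(s,a))
            target = r + gamma * max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q

# The optimal policy is then read off with policy extraction: pi(s) = argmax_a Q[(s, a)].
```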
DeepMind
Concepts
Definitions
Typically solve prediction to solve control
Three fundamental problems in sequential decision making
Scaling
New:
- model environments that go on forever -> need a discounting factor \( \gamma \)
- Expectimax only calculates the action at the root
Relations:
- policy <-> path (since we have uncertainty: learn a map from states to actions instead of a fixed path) :red_cross:
Agents
Reflex Agent
- consider how the world is
- choose an action based on the current percept and maybe memory
- may have memory or a model of the world's current state
- do not consider future consequences
Planning Agents:
- ask "what if?" (decisions based on consequences)
- requires a model of how the world works (necessary for considering consequences)
- goal
Bayes Net