Reinforcement Learning :checkered_flag:
Methods
Model Free
Temporal Difference (TD)
R(s,a) + γ Σ_{s′} P(s,a,s′) max_{a′} Q(s′,a′) − Q(s,a)
On-Policy (SARSA)
Q(S,A) ← Q(S,A) + α[R + γ Q(S′,A′) − Q(S,A)]
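A minimal tabular sketch of this update, assuming a hypothetical toy environment whose `reset()` returns a state index and whose `step(a)` returns `(next_state, reward, done)`:

```python
import numpy as np

def epsilon_greedy(Q, s, n_actions, eps=0.1):
    # Behavior policy: explore with probability eps, otherwise act greedily.
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def sarsa(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99):
    # env is an assumed interface: reset() -> s, step(a) -> (s', r, done)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, n_actions)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, n_actions)
            # On-policy TD target uses the action actually taken next: A'.
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] * (not done) - Q[s, a])
            s, a = s_next, a_next
    return Q
```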
Off-Policy (Q-Learning)
Q(S_t, A_t) ← Q(S_t, A_t) + α[R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)]
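The same loop becomes off-policy by bootstrapping from the greedy action rather than the action the behavior policy actually takes next; a sketch under the same assumed environment interface:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    # env is an assumed interface: reset() -> s, step(a) -> (s', r, done)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behavior policy
            a = np.random.randint(n_actions) if np.random.rand() < eps else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Off-policy target: bootstrap from max_a Q(S', a),
            # not from the action the behavior policy will take.
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) * (not done) - Q[s, a])
            s = s_next
    return Q
```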
Monte Carlo (MC)
π′(s) = argmax_a Σ_{s′,r} p(s′,r|s,a)[r + γ v(s′)]
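A sketch of first-visit Monte Carlo evaluation of v, from which the greedy improvement above can then derive π′; the `(state, reward)` trajectory format is an assumption of this sketch:

```python
import numpy as np
from collections import defaultdict

def mc_first_visit_V(episodes, gamma=0.99):
    # `episodes` is assumed to be a list of trajectories [(s0, r1), (s1, r2), ...],
    # where each reward follows the state it is paired with.
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r  # discounted return from time t
            earlier_states = [episode[k][0] for k in range(t)]
            if s not in earlier_states:  # first visit to s in this episode
                returns[s].append(G)
    return {s: float(np.mean(g)) for s, g in returns.items()}
```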
Function Approximation
Incremental Methods
Batch Methods
DQN (Deep Q-Network): a reinforcement-learning algorithm that trains a deep neural network to approximate the action-value function, from which the agent picks its action in each state; training on minibatches sampled from a replay buffer makes it a batch method.
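A minimal sketch of one DQN training step in PyTorch, assuming CartPole-like shapes (4-dimensional states, 2 actions) and a `batch` of tensors already sampled from a replay buffer; real implementations also refresh the target network periodically:

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def dqn_step(batch):
    # Assumed batch layout: states (B,4) float, actions (B,) long,
    # rewards (B,) float, next_states (B,4) float, dones (B,) float.
    states, actions, rewards, next_states, dones = batch
    q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target from a frozen target network stabilizes training.
        q_next = target_net(next_states).max(dim=1).values
        target = rewards + gamma * q_next * (1 - dones)
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```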
Least Squares for Control
Least Squares: fit f to minimize the squared residuals d_n = y_n − f(x_n)
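A toy NumPy sketch of a linear least-squares fit (the data here is synthetic):

```python
import numpy as np

# Fit a linear approximation f(x) = w.x by minimizing sum_n (y_n - f(x_n))^2.
X = np.random.randn(100, 3)          # feature vectors x_n (toy data)
y = X @ np.array([1.0, -2.0, 0.5])   # targets y_n (noise-free, so a perfect fit exists)
w, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ w                # d_n = y_n - f(x_n), ~0 here
```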
Policy Gradients
Policy Gradient Theorem: It describes the gradient of the expected discounted return with respect to an agent's policy parameters.
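One standard statement of the theorem, for reference:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\pi_\theta}\!\left[
      \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)
    \right]
```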
No Value Function (MC Policy Gradient = REINFORCE)
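A sketch of a REINFORCE update on one episode, reusing the PyTorch shapes assumed in the DQN sketch (`states` is `(T,4)` float, `actions` is `(T,)` long); the sampled Monte Carlo returns stand in for any learned value function:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, rewards, gamma=0.99):
    # rewards: list of T floats from one complete episode.
    # Monte Carlo returns G_t computed backwards -- no value function needed.
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))
    log_probs = F.log_softmax(policy(states), dim=1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Gradient ascent on E[log pi(a|s) * G_t] (so minimize the negative).
    loss = -(chosen * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```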
Actor-Critic (approximates both value and policy): a reinforcement-learning technique in which you simultaneously learn a policy function (the actor) and a value function (the critic).
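A sketch of a one-step actor-critic update under the same assumed shapes (`s`, `s_next` are `(4,)` float tensors, `a` an action index, `r` and `done` floats); the critic's TD error drives both networks:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

actor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def actor_critic_step(s, a, r, s_next, done, gamma=0.99):
    # One-step TD error from the critic serves as the actor's learning signal.
    v = critic(s).squeeze(-1)
    with torch.no_grad():
        v_next = critic(s_next).squeeze(-1)
        td_target = r + gamma * v_next * (1.0 - done)
    td_error = td_target - v
    log_pi = F.log_softmax(actor(s), dim=-1)[a]
    # Critic regresses toward the TD target; actor follows the policy
    # gradient weighted by the (detached) TD error.
    loss = td_error.pow(2) - td_error.detach() * log_pi
    opt.zero_grad()
    loss.backward()
    opt.step()
```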
Model Based
Basic idea: a training method based on rewarding desired behaviors and/or punishing undesired ones.
Planning: deciding on future actions without actually performing them.
Integration (Dyna): combines learning from real experience with planning over a learned model; the Dyna-Q+ variant is designed for changing environments and gives a bonus reward to insufficiently explored state-action pairs to drive the agent to explore.
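A sketch of tabular Dyna-Q, the simplest such integration (the Dyna-Q+ exploration bonus mentioned above is omitted for brevity); the environment interface is the same assumption as in the TD sketches:

```python
import numpy as np

def dyna_q(env, n_states, n_actions, episodes=200, planning_steps=10,
           alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((n_states, n_actions))
    model = {}  # (s, a) -> (r, s') learned from real experience
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = np.random.randint(n_actions) if np.random.rand() < eps else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Direct RL update from real experience.
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) * (not done) - Q[s, a])
            model[(s, a)] = (r, s_next)
            # Planning: replay simulated transitions from the learned model.
            for _ in range(planning_steps):
                (ps, pa), (pr, ps_next) = list(model.items())[np.random.randint(len(model))]
                Q[ps, pa] += alpha * (pr + gamma * np.max(Q[ps_next]) - Q[ps, pa])
            s = s_next
    return Q
```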
Search: a subfield of reinforcement learning that focuses on finding good parameters for a given policy parametrization.
Introduction
Markov Decision Process
Value Iteration
V_{k+1}(s) = max_a Σ_{s′} P^a(s,s′)[R^a(s,s′) + γ V_k(s′)]
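A NumPy sketch of this backup, assuming the model is given as arrays `P[a, s, s′]` (transition probabilities) and `R[a, s]` (expected rewards):

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, theta=1e-8):
    # P: (n_actions, n_states, n_states), R: (n_actions, n_states) -- assumed layout.
    n_actions, n_states = R.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: V_{k+1}(s) = max_a sum_{s'} P [R + gamma V_k(s')]
        Q = R + gamma * np.einsum('ast,t->as', P, V)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new
```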
Policy Iteration
V(s) = Σ_{s′,r} p(s′,r|s,π(s))[r + γ V(s′)]
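A sketch of the iterative policy-evaluation step, reusing the model arrays assumed in the value-iteration sketch and a deterministic policy `pi[s]` (an assumption):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.99, theta=1e-8):
    # Repeatedly apply V(s) = R[pi(s), s] + gamma * sum_{s'} P[pi(s), s, s'] V(s').
    n_states = len(pi)
    V = np.zeros(n_states)
    while True:
        V_new = np.array([
            R[pi[s], s] + gamma * P[pi[s], s] @ V
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new
```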
Value: v(s)
Policy: π(s)
Reward: r
Exploration vs Exploitation
Bandits: a machine-learning framework in which an agent must select actions (arms) in order to maximize its cumulative reward in the long term.
Greedy Algorithms: the agent always performs the action currently believed to yield the highest expected reward.
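A sketch of an ε-greedy agent on a toy Gaussian bandit, illustrating both nodes above: pure greed (ε = 0) can lock onto a suboptimal arm, while occasional exploration keeps every arm's estimate fresh (the arm means here are synthetic):

```python
import numpy as np

def epsilon_greedy_bandit(true_means, steps=1000, eps=0.1):
    # true_means: unknown expected reward of each arm (synthetic for this demo).
    k = len(true_means)
    Q, N = np.zeros(k), np.zeros(k)
    total = 0.0
    for _ in range(steps):
        # Explore with probability eps, otherwise exploit the best estimate.
        a = np.random.randint(k) if np.random.rand() < eps else int(np.argmax(Q))
        r = np.random.normal(true_means[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]  # incremental sample mean
        total += r
    return Q, total

Q, total = epsilon_greedy_bandit([0.1, 0.5, 0.9])
```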
Dynamic Programming: used for planning in an MDP, to solve either the policy-evaluation or the control problem.
Bellman's Equation
V(s) = max_a [R(s,a) + γ V(s′)]