RL (Model Unknown Transition function and Reward functions are not…

- - - - SARSA
        Similar to TD prediction, we bootstrap the action-value estimate for the current state-action pair from the reward obtained and the action value of the next state-action pair. The idea is exactly the same as in TD prediction; the only difference is that TD prediction estimates the state-value function, whereas here we estimate the action-value function.
      - Q-learning
        Instead of bootstrapping from the action value of the next state-action pair actually taken, as in SARSA, this algorithm uses the action value of the best action in the next state. The intuition is that we want to know the action value of the current state-action pair if we take the best action in the next state. This is called off-policy learning, because the policy being learned (the greedy one) is different from the policy used to generate the interactions.
      - MC Control
        Similar to MC prediction, a whole episode is generated by interacting with the environment, and the return is then calculated for the state-action pairs that were visited. The calculation is similar to MC prediction; the only difference is that both state and action are considered when estimating the action-value function, whereas only the state was considered when estimating the state-value function.
      - Double Q-learning
        The main disadvantage of Q-learning is that we always select the best action in the next state with an arg max over the current estimates. Early estimation errors are therefore favoured by the max and can cascade and compound as they propagate through later updates. To solve this problem, two copies of the Q-function are maintained and, at each interaction, one of them is chosen at random to be updated, using one copy to select the best next action and the other to evaluate it. The final Q-value is the average of the two, and this appears to reduce the error propagation.
      - SARSA lambda
        Similar to TD lambda, we now use a lambda-weighted sum of the n-step returns, each bootstrapped from the Q-function, to estimate the current action value. Calculating every intermediate n-step return and then weighting them is computationally expensive, so eligibility traces are used instead. The idea is that every state-action pair encountered on the trajectory keeps a decaying trace, and each new TD error is used to update the action-value estimates of all those pairs in proportion to their traces.
      - Watkins's Q(lambda)
        The idea of the lambda calculation is applied to Q-learning: the update is still based on the best action at the next step, and eligibility traces are used to obtain the estimate of the action-value function.
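The tabular updates described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the dictionary Q-table keyed by (state, action), the accumulating-trace variant, and the default values of alpha, gamma and lam are all hypothetical choices, and terminal states are not handled.

```python
import random
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action actually chosen in s_next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the best action available in s_next."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def double_q_update(QA, QB, s, a, r, s_next, actions,
                    alpha=0.1, gamma=0.9, rng=random):
    """Update one of two tables, picked at random.

    The updated table selects the best next action; the other table
    evaluates it, which decouples selection from evaluation.
    """
    if rng.random() < 0.5:
        QA, QB = QB, QA  # swap roles; the dicts themselves are mutated
    best = max(actions, key=lambda b: QA[(s_next, b)])
    target = r + gamma * QB[(s_next, best)]
    QA[(s, a)] += alpha * (target - QA[(s, a)])


def sarsa_lambda_update(Q, E, s, a, r, s_next, a_next,
                        alpha=0.1, gamma=0.9, lam=0.8):
    """SARSA(lambda) with accumulating eligibility traces (assumed variant)."""
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    E[(s, a)] += 1.0
    # The single TD error updates every pair that still has a trace.
    for key in list(E):
        Q[key] += alpha * delta * E[key]
        E[key] *= gamma * lam


if __name__ == "__main__":
    actions = ["left", "right"]
    Q = defaultdict(float)
    Q[("s1", "left")], Q[("s1", "right")] = 1.0, 2.0
    sarsa_update(Q, "s0", "left", 1.0, "s1", "left")
    q_learning_update(Q, "s0", "right", 1.0, "s1", actions)
    print(Q[("s0", "left")], Q[("s0", "right")])
```

With the same transition, SARSA bootstraps from the action that was taken next while Q-learning bootstraps from the larger of the two next action values, so the two entries printed at the end differ.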
    - - Dyna Q
        It is a modified Q-learning algorithm. In addition to updating the action-value function after each real interaction, a second loop is started that randomly selects a previously visited state and an action taken there, predicts the reward and next state with the learned state-transition and reward functions, and updates the action-value function with that simulated experience. This loop of random selection and simulation continues for a predefined number of planning steps. In this way the agent learns from both actual and simulated interactions.
      - Trajectory Sampling
        The main disadvantage of Dyna-Q is that the states used in simulation are selected at random. This is inefficient: rather than caring about random states, we should care about the states we are most likely to encounter while interacting with the environment. The simulation loop is therefore improved so that actions are taken based on the current action-value function. Instead of picking a randomly visited state, the trajectory starts from the current state and continues through the learned model, and the action-value function is updated along the way.
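The two planning loops described above can be sketched as follows. This is a minimal illustration under stated assumptions: the model is a plain dictionary that stores the last observed (reward, next state) for each state-action pair, which only suits deterministic environments; the simulated trajectory follows the greedy action; and the names ACTIONS, q_update, dyna_q_step and trajectory_sampling are hypothetical.

```python
import random
from collections import defaultdict

ACTIONS = ["left", "right"]  # hypothetical action set


def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update."""
    target = r + gamma * max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def dyna_q_step(Q, model, s, a, r, s_next, n_planning=10, rng=random):
    """Learn from one real transition, then replay random remembered ones."""
    q_update(Q, s, a, r, s_next)   # real experience
    model[(s, a)] = (r, s_next)    # learned model (assumed deterministic)
    visited = list(model)
    for _ in range(n_planning):
        ps, pa = rng.choice(visited)     # random visited state-action pair
        pr, ps_next = model[(ps, pa)]    # simulated outcome
        q_update(Q, ps, pa, pr, ps_next)


def trajectory_sampling(Q, model, start, n_steps=10):
    """Simulate from the current state, acting greedily on Q."""
    s = start
    for _ in range(n_steps):
        a = max(ACTIONS, key=lambda b: Q[(s, b)])
        if (s, a) not in model:
            break  # the model has never seen this pair, so stop simulating
        r, s_next = model[(s, a)]
        q_update(Q, s, a, r, s_next)
        s = s_next


if __name__ == "__main__":
    Q, model = defaultdict(float), {}
    dyna_q_step(Q, model, "s0", "right", 1.0, "s1", n_planning=5)
    trajectory_sampling(Q, model, "s0", n_steps=3)
    print(dict(Q))
```

The random replay in dyna_q_step spreads updates over everything in the model, while trajectory_sampling concentrates them on the states reachable from where the agent currently is.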