Intro to RL (Sutton)
tabular solution methods
Chap3 - finite MDPs
MAB estimates q*(a); an MDP agent estimates q*(s, a) and v*(s)
3 signals: actions and states (can be structured, e.g. vectors), rewards (always single scalar numbers)
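A minimal sketch of this agent-environment loop, assuming a made-up two-state MDP (the `SimpleMDP` class and its dynamics are illustrative, not from the book):

```python
import random

class SimpleMDP:
    """Toy two-state finite MDP; purely illustrative."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Dynamics: action 1 toggles the state, action 0 keeps it.
        if action == 1:
            self.state = 1 - self.state
        # The reward signal is a single scalar number.
        reward = 1.0 if self.state == 1 else 0.0
        return self.state, reward

env = SimpleMDP()
state = env.state
for t in range(5):
    action = random.choice([0, 1])          # agent emits an action
    next_state, reward = env.step(action)   # environment returns state + reward
    print(t, state, action, reward, next_state)
    state = next_state
```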
Chap6 - TD learning
one-step, tabular, model-free TD methods
off-policy (two policies, as in Q-learning: a greedy target policy and an epsilon-greedy behavior policy; see the sketch below)
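A minimal tabular Q-learning sketch of this off-policy setup: an epsilon-greedy behavior policy picks actions while the update bootstraps on the greedy target policy. The environment interface (`env.reset()`, `env.step(a)` returning `(next_state, reward, done)`) and the hyperparameter defaults are assumptions:

```python
import random
from collections import defaultdict

def q_learning(env, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """One-step, tabular, model-free, off-policy TD control."""
    Q = defaultdict(float)  # Q[(state, action)] -> estimated action value

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Behavior policy: epsilon-greedy w.r.t. the current Q.
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            # Target policy: greedy -- bootstrap on max_a' Q(s', a').
            best_next = max(Q[(s2, x)] for x in range(n_actions))
            target = r + (0.0 if done else gamma * best_next)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```

Replacing the max in the target with the value of the action the behavior policy actually takes next would turn this into on-policy Sarsa.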
Overview
all RL methods can be viewed as two interacting processes revolving around an approximate policy (policy improvement) and an approximate value function (policy evaluation); Sutton calls this interplay generalized policy iteration (GPI)
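A sketch of that interplay as policy iteration on a known finite MDP; the transition representation `P[s][a]` as a list of `(prob, next_state, reward)` triples is an assumed encoding:

```python
def policy_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    """Alternate policy evaluation and policy improvement until stable."""
    V = [0.0] * n_states   # approximate value function
    pi = [0] * n_states    # approximate (deterministic) policy

    def q(s, a):
        # Expected one-step return of taking a in s, then following V.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    while True:
        # Policy evaluation: make V consistent with the current policy.
        while True:
            delta = 0.0
            for s in range(n_states):
                v_new = q(s, pi[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: make the policy greedy w.r.t. V.
        stable = True
        for s in range(n_states):
            best = max(range(n_actions), key=lambda a: q(s, a))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:
            return V, pi
```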
RL methods
DP - model-based, w/ bootstrapping
Monte Carlo methods - model-free, w/o bootstrapping
TD learning - model-free, w/ bootstrapping (the MC vs TD contrast is sketched after this list)
one-step, tabular, model-free TD
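The bootstrapping split above shows up directly in the two prediction updates; a minimal sketch, with `alpha` and `gamma` as assumed constants:

```python
alpha, gamma = 0.1, 0.99  # assumed step size and discount

def mc_update(V, state, G):
    """Monte Carlo: move V(s) toward the full observed return G.
    No bootstrapping -- G is computed from a complete episode."""
    V[state] += alpha * (G - V[state])

def td0_update(V, state, reward, next_state):
    """TD(0): move V(s) toward reward + gamma * V(s').
    Bootstrapping -- the target leans on the current estimate V(s')."""
    V[state] += alpha * (reward + gamma * V[next_state] - V[state])
```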
variations/extensions
model-based - planning (e.g. Dyna-Q, Chap8), link to DP; see the sketch below
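One concrete form of the planning link is Dyna-Q: a learned model replays simulated transitions between real steps. A sketch under the same assumed environment interface as above; `n_planning` and the deterministic-model assumption are illustrative choices:

```python
import random
from collections import defaultdict

def dyna_q(env, n_actions, episodes=100, n_planning=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Dyna-Q: direct RL (Q-learning) plus planning from a learned model."""
    Q = defaultdict(float)
    model = {}  # (s, a) -> (reward, next_state); assumes deterministic dynamics

    def greedy(s):
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = random.randrange(n_actions) if random.random() < epsilon else greedy(s)
            s2, r, done = env.step(a)
            # Direct RL: one-step Q-learning update from real experience.
            target = r + (0.0 if done else gamma * Q[(s2, greedy(s2))])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            model[(s, a)] = (r, s2)
            # Planning: replay simulated transitions sampled from the model
            # (terminal states keep Q = 0, so bootstrapping through them is safe).
            for _ in range(n_planning):
                ps, pa = random.choice(list(model))
                pr, ps2 = model[(ps, pa)]
                Q[(ps, pa)] += alpha * (pr + gamma * Q[(ps2, greedy(ps2))] - Q[(ps, pa)])
            s = s2
    return Q
```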