Reinforcement Learning
Resources
David Silver's lectures & slides (http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html; https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ&index=1)
Reinforcement Learning: An Introduction, 2nd edition, Sutton & Barto (http://incompleteideas.net/book/the-book-2nd.html)
What?
At each step \(t\)
the agent
receives observation \(O_t\)
receives scalar reward \(R_t\)
executes action \(A_t\)
the environment
receives action \(A_t\)
emits observation \(O_{t+1}\)
emits scalar reward \(R_{t+1}\)
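A minimal sketch of this interaction loop, assuming hypothetical agent and env objects with act/reset/step methods (not any specific library's API):

```python
# Minimal agent-environment loop (hypothetical Agent/Environment interfaces,
# not a specific library API).
def run_episode(agent, env, max_steps=1000):
    obs, reward = env.reset(), 0.0
    for t in range(max_steps):
        action = agent.act(obs, reward)        # agent receives O_t, R_t, executes A_t
        obs, reward, done = env.step(action)   # environment receives A_t, emits O_{t+1}, R_{t+1}
        if done:
            break
```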
The history is the sequence of observations, rewards, actions: \(H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t\)
The agent state is the information used to determine what happens next: \(S_t^a = f(H_t)\)
The environment state \(S_t^e\) is the environment's private representation. \(S_t^e\) is usually not visible to the agent. Even if \(S_t^e\) is visible, it may contain irrelevant information.
An information state (a.k.a. Markov state) \(S_t\) is Markov if and only if:
\(\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, ..., S_t]\)
i.e.: "The future is independent of the past given the present."
Full observability: the agent directly observes the environment state, i.e.: \(O_t = S_t^a = S_t^e\)
An agent may include
a value function: how good is each state and/or action.
A value function is a prediction of future reward.
\(v_{\pi}(s) = \mathbb{E}_{\pi} [ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... \mid S_t = s ]\)
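A concrete illustration of the discounted return being estimated; a minimal sketch assuming a list of rewards from one trajectory (not tied to any particular environment):

```python
def discounted_return(rewards, gamma=0.9):
    """Compute R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... for one trajectory."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# v_pi(s) can be estimated by averaging such returns over trajectories starting in s under pi.
print(discounted_return([1.0, 0.0, 2.0]))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```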
a model: agent's representation of the environment.
A model predicts what the environment will do next.
transition model: predicts the next state
reward model: predicts the next (immediate) reward
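A tabular sketch of such a model, estimated from counts of observed transitions; the state and action sets are assumed to be small and discrete (illustrative only):

```python
from collections import defaultdict

# Tabular model learned from experience (illustrative sketch, discrete S and A).
counts = defaultdict(lambda: defaultdict(int))  # counts[(s, a)][s'] = visit count
reward_sum = defaultdict(float)                 # summed rewards per (s, a)
visits = defaultdict(int)                       # visits per (s, a)

def update_model(s, a, r, s_next):
    counts[(s, a)][s_next] += 1
    reward_sum[(s, a)] += r
    visits[(s, a)] += 1

def model(s, a):
    n = visits[(s, a)]
    p = {s2: c / n for s2, c in counts[(s, a)].items()}  # transition model P(s'|s,a)
    r = reward_sum[(s, a)] / n                            # reward model R(s,a)
    return p, r
```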
a policy: agent's behaviour function
stochastic: \(\pi(a \mid s) = \mathbb{P} [A_t = a \mid S_t = s] \)
deterministic: \(a = \pi(s)\)
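A small sketch contrasting the two forms for a discrete action set (state and action names are illustrative):

```python
import random

# Stochastic policy: pi(a|s) is a probability distribution over actions per state.
pi = {"s0": {"left": 0.8, "right": 0.2}}

def sample_action(state):
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

# Deterministic policy: a = pi(s), a direct mapping from states to actions.
pi_det = {"s0": "left"}
```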
balances
exploration: finding more information about the environment
exploitation: exploiting known information to maximise reward
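One common way to trade these off is epsilon-greedy action selection; a minimal sketch, assuming tabular action values Q[(s, a)] (illustrative, not the only scheme):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the greedy action."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```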
fundamental decision-making problems
Prediction (evaluate the future) vs. Control (optimise the future)
Learning vs. Planning
How?
Temporal Difference
Distributional RL
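A sketch of the tabular TD(0) value update mentioned above, assuming a value table V, step size alpha, and discount gamma (illustrative, not a complete algorithm):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """Tabular TD(0): move V(s) toward the one-step target r + gamma * V(s')."""
    td_target = r + gamma * V.get(s_next, 0.0)
    td_error = td_target - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return V
```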