RL Reinforcement_learning_diagram.svg
MDP
Dynamic programming
Value Iteration
function approximation
Q value
NN
MC
TD
a table records state, action, reward
function approximation
policy iteration
learning examples
transition matrix
Model-free
Value Iteration
expected value
MC
Variance
TD
biased
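The MC/TD contrast above (Monte Carlo targets are unbiased but high-variance; TD targets are biased by the current value estimate but lower-variance) can be made concrete. A minimal sketch, assuming one recorded episode and a value table; all names and numbers are illustrative, not taken from the original notes.

```python
import numpy as np

gamma = 0.99
# hypothetical episode: states visited and rewards received at each step
states  = [0, 1, 2, 3]
rewards = [1.0, 0.0, 2.0, 1.0]
V = np.zeros(5)  # current value estimates, one entry per state

# Monte Carlo target for s_t: the full discounted return G_t (unbiased, high variance)
def mc_target(t):
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

# TD(0) target for s_t: r_t + gamma * V(s_{t+1}) (biased by V, lower variance)
def td_target(t):
    bootstrap = V[states[t + 1]] if t + 1 < len(states) else 0.0
    return rewards[t] + gamma * bootstrap

print(mc_target(0), td_target(0))
```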
learning
Q learning
DQN
Dueling DQN
Policy learning
gradient
classification
label * entropy
y log(pi)
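The classification analogy above treats the policy-gradient loss as a cross-entropy term y·log(π) weighted by a return or advantage. A minimal sketch assuming a softmax policy over discrete actions; the arrays and numbers are illustrative.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

logits = np.array([0.2, -0.1, 0.5])   # policy output for one state
pi = softmax(logits)                  # action probabilities
action = 1                            # action actually taken (the "label" y)
advantage = 2.3                       # weight for this sample

# classification-style loss: -y * log(pi), scaled by the advantage;
# minimizing it pushes probability toward actions with positive advantage
loss = -advantage * np.log(pi[action])
print(loss)
```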
error estimator
value estimator
MC
TD
n-step return
TD(λ)
Implementation
gym
tensorflow
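For the implementation branch above, a minimal environment-loop sketch with gym (TensorFlow is listed but not needed for a bare rollout). This assumes the classic Gym API where step() returns a 4-tuple; newer Gymnasium versions return 5 values.

```python
import gym  # classic Gym API assumed; Gymnasium's step() returns 5 values instead

env = gym.make("CartPole-v1")
obs = env.reset()
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()          # random policy as a placeholder
    obs, reward, done, info = env.step(action)  # one environment transition
    total_reward += reward
env.close()
print("episode return:", total_reward)
```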
MDP
model
transition matrix
value iteration
Q value
table
Policy iteration
Dynamic programming
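The model-based chain above (transition matrix → value iteration → tabular Q values → greedy policy) in a minimal sketch on a tiny hypothetical MDP with a known model; sizes and random numbers are illustrative.

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)

# hypothetical known model: transition matrix P[s, a, s'] and rewards R[s, a]
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.random((n_states, n_actions))

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s,a,s') V(s')
    Q = R + gamma * P @ V            # tabular Q values, shape (n_states, n_actions)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)            # greedy policy read off the converged Q table
print(V, policy)
```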
model-free
value estimator
function approximation
NN
DQN
classification
Q target
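The DQN nodes above frame learning as regression of the network toward a Q target. A minimal sketch of the target computation for a batch of transitions with a separate target network; NumPy is used to keep it self-contained, and all values are illustrative.

```python
import numpy as np

gamma = 0.99
# hypothetical batch of transitions (s, a, r, s', done)
rewards = np.array([1.0, 0.0, 0.5])
dones   = np.array([0.0, 0.0, 1.0])       # 1.0 when s' is terminal
q_next  = np.array([[1.2, 0.7],           # target-network outputs Q_target(s', ·)
                    [0.3, 0.9],
                    [0.0, 0.0]])

# Q target: r + gamma * max_a' Q_target(s', a'), cut off at terminal states
q_target = rewards + gamma * (1.0 - dones) * q_next.max(axis=1)
print(q_target)  # the online network is regressed toward these values
```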
policy: doing the right thing
on-policy
off-policy
learning
MC
baseline
TD
AD
n-step return
TD(λ)
policy gradient
env
gym trading
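Among the learning targets listed above (MC return, TD, n-step return, TD(λ)), the n-step return is easy to show directly. A minimal sketch assuming a stored trajectory of rewards and a bootstrapped value for the state n steps ahead; the numbers are illustrative.

```python
gamma, n = 0.99, 3
rewards = [1.0, 0.0, 2.0, 1.0, 0.5]   # r_t, r_{t+1}, ...
V_boot = 4.2                          # V(s_{t+n}) from the current value estimator

# n-step return: the first n discounted rewards plus a bootstrapped tail
G_n = sum(gamma ** k * rewards[k] for k in range(n)) + gamma ** n * V_boot
print(G_n)
```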
TRPO: find a suitable learning rate
Monotonic improvement guarantee
lower bound
total variation distance
KL divergence
Objective function
advantage
action space
\[\Huge \tilde{\pi}\]
transition matrix
\[\Huge \rho_{\pi}\]
\[\huge \rho_{\pi}(s)=P(s_{0}=s)+\gamma P(s_{1}=s)+\gamma^{2}P(s_{2}=s)+\cdots\]
the advantage of the action-value function over the current state's value function
\[\eta(\tilde{\pi})=E_{\tau|\tilde{\pi}}\left[ \sum_{t=0}^{\infty}\gamma^{t}r(s_{t}) \right]\]
\[ \eta(\tilde{\pi})=\eta(\pi)+E_{s_{0},a_{0},\dots\sim\tilde{\pi}}\left[ \sum_{t=0}^{\infty}\gamma^{t}A_{\pi}(s_{t},a_{t}) \right] \]
\[ \eta(\tilde{\pi})=\eta(\pi)+\sum_{t=0}^{\infty}\sum_{s}P(s_{t}=s|\tilde{\pi})\sum_{a}\tilde{\pi}(a|s)\gamma^{t}A_{\pi}(s,a) \]
s is generated by the new policy, so this depends heavily on the new policy
\[ \eta(\tilde{\pi})=\eta(\pi)+\sum_{s}\rho_{\tilde{\pi}}(s)\sum_{a}\tilde{\pi}(a|s)A_{\pi}(s,a) \]
π̃ denotes the new policy and π the old policy; the return function is decomposed accordingly
Optimizing
Intuition
\[\eta(\tilde{\pi})=E_{\tilde{\pi}}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t})\right]\]
Surrogate function
action a is also generated by the new policy, but the new policy's parameters are unknown
importance sampling
\[\begin{gather*} \sum_{a}\tilde{\pi}(a|s_{n})A_{\theta_{old}}(s_{n},a)=E_{a\sim q}\left[ \frac{\tilde{\pi}(a|s_{n})}{q(a|s_{n})}A_{\theta_{old}}(s_{n},a) \right] \\ \text{let } q(a|s_{n})=\pi_{\theta_{old}}(a|s_{n}) \\ \frac{1}{1-\gamma}E_{s\sim \rho_{\theta_{old}}}[\cdot] \text{ approximates } \sum_{s}\rho_{\theta_{old}}(s)[\cdot] \end{gather*}\]
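The importance-sampling identity above rewrites an expectation under the new policy π̃ as an expectation under the old sampling policy q, reweighted by the ratio π̃/q. A minimal numeric sketch with two fixed discrete distributions; all probabilities and advantages are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
actions = np.arange(3)
pi_new = np.array([0.5, 0.3, 0.2])    # target policy  \tilde{pi}(a|s)
pi_old = np.array([0.2, 0.5, 0.3])    # behaviour policy q(a|s) used for sampling
A = np.array([1.0, -0.5, 2.0])        # advantage of each action in this state

exact = np.sum(pi_new * A)            # sum_a pi_new(a|s) A(s,a)

# sample actions from the old policy and reweight by pi_new / pi_old
samples = rng.choice(actions, size=100_000, p=pi_old)
estimate = np.mean(pi_new[samples] / pi_old[samples] * A[samples])
print(exact, estimate)                # the reweighted estimate matches in expectation
```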
\[ L_{\pi}(\tilde{\pi})=\eta(\pi)+\sum_{s}\rho_{\pi}(s)\sum_{a}\tilde{\pi}(a|s)A_{\pi}(s,a) \]
learning rate issue
\[\eta(\tilde{\pi})\geq L_{\pi}(\tilde{\pi}) - CD_{KL}^{\max}(\pi,\tilde{\pi})\quad \text{ where } C=\frac{2\epsilon\gamma}{(1-\gamma)^{2}}\]
\[\max_{\theta}[L_{\theta_{old}}(\theta)-CD_{KL}^{\max}]\]
Simplification
\[\begin{split} & \max_{\theta}E_{s\sim \rho_{\theta_{old}}, a\sim\pi_{\theta_{old}}}\left[ \frac{\tilde{\pi}_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}A_{\theta_{old}}(s,a) \right] \\ & \text{subject to } E_{s\sim \rho_{\theta_{old}}}[D_{KL}( \pi_{\theta_{old}}(\cdot | s) \| \pi_{\theta}(\cdot | s) )] \leq \delta \end{split}\]
Approximation
conjugate gradient method
Proof
TRPO/PPO
Surrogate function
\[\eta(\pi)=E_{\tau\sim\pi_{\theta}}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\right]\]
Monotonic improvement guarantee
Objective function
advantage function
expected rewards minus a baseline
constraint
new policy cannot be too different from the old policy
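The constraint above keeps the new policy close to the old one. PPO's clipped surrogate is one common way to enforce this without TRPO's explicit KL constraint; the sketch below shows that clipped loss with illustrative ratios and advantages, not the notes' own implementation.

```python
import numpy as np

eps = 0.2                                     # PPO clipping parameter
ratio = np.array([1.3, 0.7, 1.05])            # pi_theta(a|s) / pi_old(a|s) per sample
advantage = np.array([2.0, -1.0, 0.5])

# clipped surrogate: take the pessimistic minimum of the unclipped and clipped terms,
# so the update gains nothing from pushing the ratio far outside [1-eps, 1+eps]
surrogate = np.minimum(ratio * advantage,
                       np.clip(ratio, 1 - eps, 1 + eps) * advantage)
loss = -surrogate.mean()                      # maximizing the surrogate == minimizing its negative
print(loss)
```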
MDP
*model-free
*value estimator
TD
MC
AD
policy iteration
q value
value iteration
*model-based
value-based
MC
policy iteration
value iteration
model-based
value-driven
policy iteration
eval. q
to the end state
improve policy at the same time
policy search
greedy
value iteration
evaluate V
eval action space
to the end state
policy search
greedy
model-free
value-based
Q learning
DQN
Q target
Replay buffer
Prioritized replay
policy
greedy
epsilon-greedy
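The value-based, model-free branch above (Q learning with a greedy / epsilon-greedy behaviour policy) in a minimal tabular sketch. It assumes the classic Gym API (step() returning a 4-tuple) and the "FrozenLake-v1" environment id; hyperparameters are illustrative.

```python
import numpy as np
import gym   # classic Gym (pre-0.26) API assumed, matching the gym node above

env = gym.make("FrozenLake-v1")
Q = np.zeros((env.observation_space.n, env.action_space.n))   # tabular Q values
alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(2000):
    s, done = env.reset(), False
    while not done:
        # epsilon-greedy behaviour policy over the current Q table
        if np.random.rand() < epsilon:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(Q[s]))
        s2, r, done, _ = env.step(a)
        # off-policy Q target: r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * (0.0 if done else Q[s2].max()) - Q[s, a])
        s = s2
```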
policy-based
adjust π to maximize the expected reward
baseline
PPO
Q-V (how the action compares to the average)
continuous space
classification problem (max likelihood)
Actor-Critic
replace the reward with the Q value
MC
policy
on-policy
acting policy == evaluated policy
off-policy
acting policy != evaluated policy
Q target
importance sampling (PPO)
Visit
first
only the return G from the first visit to that state is used
every
the returns G from every visit are used
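First-visit and every-visit Monte Carlo differ only in which returns G are kept for a state that is visited more than once in an episode. A minimal sketch on one hypothetical episode (state/reward pairs are illustrative).

```python
gamma = 1.0
episode = [("A", 1.0), ("B", 0.0), ("A", 2.0), ("C", 1.0)]   # (state, reward) pairs

# discounted return G_t from every time step, computed backwards
G, returns = 0.0, []
for state, reward in reversed(episode):
    G = reward + gamma * G
    returns.append((state, G))
returns.reverse()

target = "A"
every_visit = [g for s, g in returns if s == target]   # use the G from all visits
first_visit = every_visit[:1]                          # only the first visit's G
print(first_visit, every_visit)
```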
MC control
soft epsilon
TD
TD(λ)
forward
backward
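The forward view of TD(λ) mixes n-step returns with λ-weights; the backward view implements the same update online with eligibility traces. A minimal tabular sketch of the backward view; the transitions and hyperparameters are illustrative.

```python
import numpy as np

n_states, alpha, gamma, lam = 4, 0.1, 0.99, 0.9
V = np.zeros(n_states)
E = np.zeros(n_states)                 # eligibility traces

# hypothetical trajectory of (s, r, s') transitions
trajectory = [(0, 1.0, 1), (1, 0.0, 2), (2, 2.0, 3)]

for s, r, s2 in trajectory:
    delta = r + gamma * V[s2] - V[s]   # TD error
    E[s] += 1.0                        # accumulating trace for the visited state
    V += alpha * delta * E             # every state updated in proportion to its trace
    E *= gamma * lam                   # traces decay by gamma * lambda each step
print(V)
```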
value estimator
parametric
linear function approx.
NN
non-parametric
table
kernel method
SVM
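The parametric branch above lists linear function approximation as the simplest alternative to a table. A minimal sketch of semi-gradient TD(0) with a linear value estimator V(s) = w·φ(s); the feature vectors and transitions are illustrative.

```python
import numpy as np

alpha, gamma = 0.05, 0.99
phi = np.array([[1.0, 0.0],            # hypothetical feature vector phi(s) per state
                [0.5, 0.5],
                [0.0, 1.0]])
w = np.zeros(2)                         # linear value estimator: V(s) = w . phi(s)

# hypothetical (s, r, s') transitions
for s, r, s2 in [(0, 1.0, 1), (1, 0.0, 2), (2, 2.0, 0)]:
    td_error = r + gamma * phi[s2] @ w - phi[s] @ w
    w += alpha * td_error * phi[s]      # semi-gradient TD(0) update of the weights
print(w)
```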