RL Reinforcement_learning_diagram.svg
MDP
Dynamic programming
Value Iteration
function approximation
Q value
NN
MC
TD
a table records state, action, reward
function approximation
policy iteration
learning examples
transition matrix
Model-free
Value Iteration
expected value
MC
Variance
TD
biased
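The MC/TD contrast above (Monte Carlo targets are unbiased but high-variance; TD targets are biased by the current value estimate but lower-variance) can be made concrete. A minimal sketch, assuming one recorded episode and a value table; all names and numbers are illustrative, not taken from the original notes.

```python
import numpy as np

gamma = 0.99
# hypothetical episode: states visited and rewards received at each step
states  = [0, 1, 2, 3]
rewards = [1.0, 0.0, 2.0, 1.0]
V = np.zeros(5)  # current value estimates, one entry per state

# Monte Carlo target for s_t: the full discounted return G_t (unbiased, high variance)
def mc_target(t):
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

# TD(0) target for s_t: r_t + gamma * V(s_{t+1}) (biased by V, lower variance)
def td_target(t):
    bootstrap = V[states[t + 1]] if t + 1 < len(states) else 0.0
    return rewards[t] + gamma * bootstrap

print(mc_target(0), td_target(0))
```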
learning
Q learning
DQN
Dueling DQN
Policy learning
gradient
classification
label * entropy
y log(pi)
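The classification analogy above treats the policy-gradient loss as a cross-entropy term y·log(π) weighted by a return or advantage. A minimal sketch assuming a softmax policy over discrete actions; the arrays and numbers are illustrative.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

logits = np.array([0.2, -0.1, 0.5])   # policy output for one state
pi = softmax(logits)                  # action probabilities
action = 1                            # action actually taken (the "label" y)
advantage = 2.3                       # weight for this sample

# classification-style loss: -y * log(pi), scaled by the advantage;
# minimizing it pushes probability toward actions with positive advantage
loss = -advantage * np.log(pi[action])
print(loss)
```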
error estimator
value estimator
MC
TD
n-step return
TD(λ)
Implementation
gym
tensorflow
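For the implementation branch above, a minimal environment-loop sketch with gym (TensorFlow is listed but not needed for a bare rollout). This assumes the classic Gym API where step() returns a 4-tuple; newer Gymnasium versions return 5 values.

```python
import gym  # classic Gym API assumed; Gymnasium's step() returns 5 values instead

env = gym.make("CartPole-v1")
obs = env.reset()
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()          # random policy as a placeholder
    obs, reward, done, info = env.step(action)  # one environment transition
    total_reward += reward
env.close()
print("episode return:", total_reward)
```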
MDP
model
transition matrix
value iteration
Q value
table
Policy iteration
Dynamic programming
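The model-based chain above (transition matrix → value iteration → tabular Q values → greedy policy) in a minimal sketch on a tiny hypothetical MDP with a known model; sizes and random numbers are illustrative.

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)

# hypothetical known model: transition matrix P[s, a, s'] and rewards R[s, a]
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.random((n_states, n_actions))

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s,a,s') V(s')
    Q = R + gamma * P @ V            # tabular Q values, shape (n_states, n_actions)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)            # greedy policy read off the converged Q table
print(V, policy)
```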
model-free
value estimator
function approximation
NN
DQN
classification
Q target
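The DQN nodes above frame learning as regression of the network toward a Q target. A minimal sketch of the target computation for a batch of transitions with a separate target network; NumPy is used to keep it self-contained, and all values are illustrative.

```python
import numpy as np

gamma = 0.99
# hypothetical batch of transitions (s, a, r, s', done)
rewards = np.array([1.0, 0.0, 0.5])
dones   = np.array([0.0, 0.0, 1.0])       # 1.0 when s' is terminal
q_next  = np.array([[1.2, 0.7],           # target-network outputs Q_target(s', ·)
                    [0.3, 0.9],
                    [0.0, 0.0]])

# Q target: r + gamma * max_a' Q_target(s', a'), cut off at terminal states
q_target = rewards + gamma * (1.0 - dones) * q_next.max(axis=1)
print(q_target)  # the online network is regressed toward these values
```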
policy: doing the right thing
on-policy
off-policy
learning
MC
baseline
TD
AD
n-step return
TD(λ)
policy gradient
env
gym trading
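Among the learning targets listed above (MC return, TD, n-step return, TD(λ)), the n-step return is easy to show directly. A minimal sketch assuming a stored trajectory of rewards and a bootstrapped value for the state n steps ahead; the numbers are illustrative.

```python
gamma, n = 0.99, 3
rewards = [1.0, 0.0, 2.0, 1.0, 0.5]   # r_t, r_{t+1}, ...
V_boot = 4.2                          # V(s_{t+n}) from the current value estimator

# n-step return: the first n discounted rewards plus a bootstrapped tail
G_n = sum(gamma ** k * rewards[k] for k in range(n)) + gamma ** n * V_boot
print(G_n)
```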
TRPO: find a suitable learning rate
Monotonic improvement guarantee
lower bound
total variation distance
KL divergence
Objective function
advantage
action space
\[\Huge \tilde{\pi}\]
transition matrix
\[\Huge \rho_{\pi}\]
\[\huge \rho_{\pi}(s)=P(s_{0}=s)+\gamma P(s_{1}=s)+\gamma^{2}P(s_{2}=s)+\cdots\]
the advantage of the action-value function over the current state's value function
\[\eta(\tilde{\pi})=E_{\tau|\tilde{\pi}}\left[ \sum_{t=0}^{\infty}\gamma^{t}r(s_{t}) \right]\]
\[ \eta(\tilde{\pi})=\eta(\pi)+E_{s_{0},a_{0},\dots\sim\tilde{\pi}}\left[ \sum_{t=0}^{\infty}\gamma^{t}A_{\pi}(s_{t},a_{t}) \right] \]
\[ \eta(\tilde{\pi})=\eta(\pi)+\sum_{t=0}^{\infty}\sum_{s}P(s_{t}=s|\tilde{\pi})\sum_{a}\tilde{\pi}(a|s)\gamma^{t}A_{\pi}(s,a) \]
s is generated by the new policy, so this depends heavily on the new policy
\[ \eta(\tilde{\pi})=\eta(\pi)+\sum_{s}\rho_{\tilde{\pi}}(s)\sum_{a}\tilde{\pi}(a|s)A_{\pi}(s,a) \]
π̃ denotes the new policy and π the old policy; the return function is decomposed accordingly
Optimizing
Intuition
\[\eta(\tilde{\pi})=E_{\tilde{\pi}}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t})\right]\]
Surrogate function
action a is also generated by the new policy, but the new policy's parameters are unknown
importance sampling
\[\begin{gather*} \sum_{a}\tilde{\pi}(a|s_{n})A_{\theta_{old}}(s_{n},a)=E_{a\sim q}\left[ \frac{\tilde{\pi}(a|s_{n})}{q(a|s_{n})}A_{\theta_{old}}(s_{n},a) \right] \\ \text{let } q(a|s_{n})=\pi_{\theta_{old}}(a|s_{n}) \\ \frac{1}{1-\gamma}E_{s\sim \rho_{\theta_{old}}}[\cdot] \text{ approximates } \sum_{s}\rho_{\theta_{old}}(s)[\cdot] \end{gather*}\]
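The importance-sampling identity above rewrites an expectation under the new policy π̃ as an expectation under the old sampling policy q, reweighted by the ratio π̃/q. A minimal numeric sketch with two fixed discrete distributions; all probabilities and advantages are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
actions = np.arange(3)
pi_new = np.array([0.5, 0.3, 0.2])    # target policy  \tilde{pi}(a|s)
pi_old = np.array([0.2, 0.5, 0.3])    # behaviour policy q(a|s) used for sampling
A = np.array([1.0, -0.5, 2.0])        # advantage of each action in this state

exact = np.sum(pi_new * A)            # sum_a pi_new(a|s) A(s,a)

# sample actions from the old policy and reweight by pi_new / pi_old
samples = rng.choice(actions, size=100_000, p=pi_old)
estimate = np.mean(pi_new[samples] / pi_old[samples] * A[samples])
print(exact, estimate)                # the reweighted estimate matches in expectation
```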
\[ L_{\pi}(\tilde{\pi})=\eta(\pi)+\sum_{s}\rho_{\pi}(s)\sum_{a}\tilde{\pi}(a|s)A_{\pi}(s,a) \]
learning rate issue
\[\eta(\tilde{\pi})\geq L_{\pi}(\tilde{\pi}) - CD_{KL}^{\max}(\pi,\tilde{\pi})\quad \text{ where } C=\frac{2\epsilon\gamma}{(1-\gamma)^{2}}\]
\[\max_{\theta}[L_{\theta_{old}}(\theta)-CD_{KL}^{\max}]\]
Simplification
\[\begin{split} & \max_{\theta}E_{s\sim \rho_{\theta_{old}}, a\sim\pi_{\theta_{old}}}\left[ \frac{\tilde{\pi}_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}A_{\theta_{old}}(s,a) \right] \\ & \text{subject to } E_{s\sim \rho_{\theta_{old}}}[D_{KL}( \pi_{\theta_{old}}(\cdot | s) \| \pi_{\theta}(\cdot | s) )] \leq \delta \end{split}\]
Approximation
conjugate gradient method
Proof
TRPO/PPO
Surrogate function
\[\eta(\pi)=E_{\tau\sim\pi_{\theta}}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\right]\]
Monotonic improvement guarantee
Objective function
advantage function
expected rewards minus a baseline
constraint
new policy cannot be too different from the old policy
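The constraint above keeps the new policy close to the old one. PPO's clipped surrogate is one common way to enforce this without TRPO's explicit KL constraint; the sketch below shows that clipped loss with illustrative ratios and advantages, not the notes' own implementation.

```python
import numpy as np

eps = 0.2                                     # PPO clipping parameter
ratio = np.array([1.3, 0.7, 1.05])            # pi_theta(a|s) / pi_old(a|s) per sample
advantage = np.array([2.0, -1.0, 0.5])

# clipped surrogate: take the pessimistic minimum of the unclipped and clipped terms,
# so the update gains nothing from pushing the ratio far outside [1-eps, 1+eps]
surrogate = np.minimum(ratio * advantage,
                       np.clip(ratio, 1 - eps, 1 + eps) * advantage)
loss = -surrogate.mean()                      # maximizing the surrogate == minimizing its negative
print(loss)
```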
MDP
*model-free
*value estimator
TD
MC
AD
policy iteration
q value
value iteration
*model-based
value-based
MC
policy iteration
value iteration
model-based
value-driven
policy iteration
eval. q
to the end state
improve policy at the same time
policy search
greedy
value iteration
evaluate V
eval action space
to the end state
policy search
greedy
model-free
value-based
Q learning
DQN
Q target
Replay buffer
Prioritized replay
policy
greedy
epsilon-greedy
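The value-based, model-free branch above (Q learning with a greedy / epsilon-greedy behaviour policy) in a minimal tabular sketch. It assumes the classic Gym API (step() returning a 4-tuple) and the "FrozenLake-v1" environment id; hyperparameters are illustrative.

```python
import numpy as np
import gym   # classic Gym (pre-0.26) API assumed, matching the gym node above

env = gym.make("FrozenLake-v1")
Q = np.zeros((env.observation_space.n, env.action_space.n))   # tabular Q values
alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(2000):
    s, done = env.reset(), False
    while not done:
        # epsilon-greedy behaviour policy over the current Q table
        if np.random.rand() < epsilon:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(Q[s]))
        s2, r, done, _ = env.step(a)
        # off-policy Q target: r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * (0.0 if done else Q[s2].max()) - Q[s, a])
        s = s2
```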
policy-based
adjust π to maximize the expected reward
baseline
PPO
Q-V (how the action compares to the average)
continuous space
classification problem (max likelihood)
Actor-Critic
replace the reward with the Q value
MC
policy
on-policy
acting policy == evaluated policy
off-policy
acting policy != evaluated policy
Q target
importance sampling (PPO)
Visit
first
only the return G from the first visit to that state is used
every
the returns G from every visit are used
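First-visit and every-visit Monte Carlo differ only in which returns G are kept for a state that is visited more than once in an episode. A minimal sketch on one hypothetical episode (state/reward pairs are illustrative).

```python
gamma = 1.0
episode = [("A", 1.0), ("B", 0.0), ("A", 2.0), ("C", 1.0)]   # (state, reward) pairs

# discounted return G_t from every time step, computed backwards
G, returns = 0.0, []
for state, reward in reversed(episode):
    G = reward + gamma * G
    returns.append((state, G))
returns.reverse()

target = "A"
every_visit = [g for s, g in returns if s == target]   # use the G from all visits
first_visit = every_visit[:1]                          # only the first visit's G
print(first_visit, every_visit)
```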
MC control
soft epsilon
TD
TD(λ)
forward
backward
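The forward view of TD(λ) mixes n-step returns with λ-weights; the backward view implements the same update online with eligibility traces. A minimal tabular sketch of the backward view; the transitions and hyperparameters are illustrative.

```python
import numpy as np

n_states, alpha, gamma, lam = 4, 0.1, 0.99, 0.9
V = np.zeros(n_states)
E = np.zeros(n_states)                 # eligibility traces

# hypothetical trajectory of (s, r, s') transitions
trajectory = [(0, 1.0, 1), (1, 0.0, 2), (2, 2.0, 3)]

for s, r, s2 in trajectory:
    delta = r + gamma * V[s2] - V[s]   # TD error
    E[s] += 1.0                        # accumulating trace for the visited state
    V += alpha * delta * E             # every state updated in proportion to its trace
    E *= gamma * lam                   # traces decay by gamma * lambda each step
print(V)
```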
value estimator
parametric
linear function approx.
NN
non-parametric
table
kernel method
SVM
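The parametric branch above lists linear function approximation as the simplest alternative to a table. A minimal sketch of semi-gradient TD(0) with a linear value estimator V(s) = w·φ(s); the feature vectors and transitions are illustrative.

```python
import numpy as np

alpha, gamma = 0.05, 0.99
phi = np.array([[1.0, 0.0],            # hypothetical feature vector phi(s) per state
                [0.5, 0.5],
                [0.0, 1.0]])
w = np.zeros(2)                         # linear value estimator: V(s) = w . phi(s)

# hypothetical (s, r, s') transitions
for s, r, s2 in [(0, 1.0, 1), (1, 0.0, 2), (2, 2.0, 0)]:
    td_error = r + gamma * phi[s2] @ w - phi[s] @ w
    w += alpha * td_error * phi[s]      # semi-gradient TD(0) update of the weights
print(w)
```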