RL (figure: Reinforcement_learning_diagram.svg)
MDP
Dynamic programming
Value Iteration
Table recording state, action, reward
policy iteration
function approximation
Learning examples
transition matrix
learning
Q learning
DQN
Q value
NN
Policy learning
error estimator
value estimator
MC
TD
Dueling DQN
Implementation
gym
tensorflow
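A minimal tabular Q-learning sketch with gym, assuming a discrete environment (FrozenLake-v1 here, purely as an example) and the gym >= 0.26 reset/step API; hyperparameters are placeholder values.

```python
import gym
import numpy as np

# Minimal tabular Q-learning sketch; FrozenLake-v1 and the gym >= 0.26
# API ((obs, info) reset, 5-tuple step) are assumptions for illustration.
env = gym.make("FrozenLake-v1")
Q = np.zeros((env.observation_space.n, env.action_space.n))  # table of state-action values
alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy behavior policy over the tabular Q values
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # off-policy TD target: r + gamma * max_a Q(s', a)
        target = reward + gamma * np.max(Q[next_state]) * (not terminated)
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state
```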
n-step return
TD lambda
gradient
classification
label × cross entropy
y log(pi)
Model-free
MC
TD
Value Iteration
Expected value
MC
TD
Variance
biased
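For reference, the two targets behind this comparison (standard definitions; V denotes the current value estimate):

\[G_{t}=\sum_{k=0}^{\infty}\gamma^{k}r_{t+k}\ \text{(MC target: unbiased, high variance)},\qquad r_{t}+\gamma V(s_{t+1})\ \text{(TD(0) target: biased, low variance)}\]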
function approximation
MDP
model
transition matrix
value iteration
model-free
Q value
table
Dynamic programming
value estimator
function approximation
NN
learning
MC
TD
n-step return
TD lambda
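Standard definitions of the n-step return and the λ-return (forward view of TD(λ)), for reference:

\[G_{t}^{(n)}=r_{t}+\gamma r_{t+1}+\cdots+\gamma^{n-1}r_{t+n-1}+\gamma^{n}V(s_{t+n}),\qquad G_{t}^{\lambda}=(1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_{t}^{(n)}\]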
DQN
classification
baseline
AD
policy gradient
Q target
policy: do the right thing
on-policy
off-policy
Policy iteration
env
gym trading
TRPO: find a suitable learning rate
Monotonically improving guarantee
Objective function
boundary
advantage
action space
transition matrix
π̃
total variation distance
\[\huge\eta(\tilde{\pi})=E_{\tilde{\pi}}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t})\right]\]
\[\Huge \rho_{\pi}\]
The advantage of the action-value function over the state-value function at the current state
\[\eta(\tilde{\pi})=E_{\tau|\tilde{\pi}}\left[ \sum_{t=0}^{\infty}\gamma^{t}r(s_{t}) \right]\]
\[ \eta(\tilde{\pi})=\eta(\pi)+E_{s_{0},a_{0},\dots\sim\tilde{\pi}}\left[ \sum_{t=0}^{\infty}\gamma^{t}A_{\pi}(s_{t},a_{t}) \right] \]
π̃ denotes the new policy; let π denote the old policy, then decompose the return function:
\[ \eta(\tilde{\pi})=\eta(\pi)+\sum_{t=0}^{\infty}\sum_{s}P(s_{t}=s|\tilde{\pi})\sum_{a}\tilde{\pi}(a|s)\gamma^{t}A_{\pi}(s,a) \]
s is generated by the new policy, so this depends heavily on the new policy
\[ \eta(\tilde{\pi})=\eta(\pi)+\sum_{s}\rho_{\tilde{\pi}}(s)\sum_{a}\tilde{\pi}(a|s)A_{\pi}(s,a) \]
Surrogate function
Action a is also generated by the new policy, but the new policy's parameters are unknown
Importance sampling
learning rate issue
\[ L_{\pi}(\tilde{\pi})=\eta(\pi)+\sum_{s}\rho_{\pi}(s)\sum_{a}\tilde{\pi}(a|s)A_{\pi}(s,a) \]
\[\begin{gather*} \sum_{a}\tilde{\pi}(a|s_{n})A_{\theta_{old}}(s_{n},a)=E_{a\sim q}\left[ \frac{\tilde{\pi}(a|s_{n})}{q(a|s_{n})}A_{\theta_{old}}(s_{n},a) \right] \\ \text{let } q(a|s_{n})=\pi_{\theta_{old}}(a|s_{n}) \\ \frac{1}{1-\gamma}E_{s\sim \rho_{\theta_{old}}}[\,\cdot\,] \text{ approximates } \sum_{s}\rho_{\theta_{old}}(s)[\,\cdot\,] \end{gather*}\]
\[\eta(\tilde{\pi})\geq L_{\pi}(\tilde{\pi}) - CD_{KL}^{\max}(\pi,\tilde{\pi})\quad \text{ where } C=\frac{2\epsilon\gamma}{(1-\gamma)^{2}}\]
\[\max_{\theta}\left[L_{\theta_{old}}(\theta)-C\,D_{KL}^{\max}(\pi_{\theta_{old}},\pi_{\theta})\right]\]
Simplification
\[\begin{split} & \max_{\theta}E_{s\sim \rho_{\theta_{old}},\, a\sim\pi_{\theta_{old}}}\left[ \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}A_{\theta_{old}}(s,a) \right] \\ & \text{subject to } E_{s\sim \rho_{\theta_{old}}}\left[D_{KL}\big( \pi_{\theta_{old}}(\cdot | s)\,\|\,\pi_{\theta}(\cdot | s) \big)\right] \leq \delta \end{split}\]
Approximation
Conjugate gradient method
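A minimal sketch of the conjugate gradient solve used here to approximate the natural gradient direction x in F x = g; `fvp` (a Fisher-vector-product callback) and `g` (the policy gradient vector) are assumed inputs.

```python
import numpy as np

# Conjugate gradient sketch for solving F x = g without forming F;
# fvp(v) is an assumed callback returning the product F v.
def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    x = np.zeros_like(g)
    r = g.copy()                 # residual g - F x with x = 0
    p = r.copy()
    rs_old = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```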
Proof
\[\huge \rho_{\pi}(s)=P(s_{0}=s)+\gamma P(s_{1}=s)+\gamma^{2}P(s_{2}=s)+\cdots\]
KL divergence
Optimizing
Intuition
TRPO/PPO
Surrogate function
Monotonically improving guarantee
Objective function
\[\eta(\pi_{\theta})=E_{\tau\sim\pi_{\theta}}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\right]\]
advantage function
expected rewards minus a baseline
constraint
new policy cannot be too different from the old policy
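PPO keeps the new policy close to the old one by clipping the importance ratio instead of enforcing the KL constraint explicitly; a minimal sketch, with `logp_new`, `logp_old`, and `advantages` as assumed per-sample arrays.

```python
import numpy as np

# PPO clipped surrogate sketch; the objective below is to be maximized.
def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    ratio = np.exp(logp_new - logp_old)            # pi_theta / pi_theta_old
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # pessimistic (element-wise minimum) of the two surrogate terms
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```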
MDP
*model-free
*value estimator
TD
MC
AD
policy iteration
*model-based
value-based
MC
policy iteration
value iteration
model-based
value driven
policy iteration
evaluate Q
to the end state
improve policy at the same time
Policy search
greedy
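A minimal policy-iteration sketch under a known model: evaluate the current policy, then improve it greedily, repeating until the policy is stable. P[s, a, s'] (transition matrix) and R[s, a] (expected reward) are assumed, hypothetical arrays.

```python
import numpy as np

# Policy iteration sketch: alternate policy evaluation and greedy improvement.
# P[s, a, s'] and R[s, a] are assumed model arrays (transition matrix, reward).
def policy_iteration(P, R, gamma=0.99, eval_iters=100):
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    idx = np.arange(n_states)
    while True:
        # policy evaluation: iterate the Bellman expectation backup
        V = np.zeros(n_states)
        for _ in range(eval_iters):
            V = R[idx, policy] + gamma * (P[idx, policy] @ V)
        # policy improvement: act greedily with respect to the evaluated Q
        Q = R + gamma * P @ V
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```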
value iteration
evaluate V
evaluate over the action space
to the end state
Policy search
greedy
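A minimal value-iteration sketch under the same assumed model arrays: back up V over the full action space with the Bellman optimality operator, then extract the greedy policy.

```python
import numpy as np

# Value iteration sketch; P[s, a, s'] and R[s, a] are assumed model arrays.
def value_iteration(P, R, gamma=0.99, tol=1e-6):
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * P @ V        # Bellman optimality backup, shape (S, A)
        V_new = Q.max(axis=1)        # greedy over the action space
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # value function and greedy policy
        V = V_new
```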
model-free
value-based
Q learning
DQN
Q target
Replay buffer
Prioritized replay
policy
greedy
epsilon greedy
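A minimal sketch of the two DQN ingredients listed above, an experience replay buffer and an epsilon-greedy policy; the class and function names here are assumptions.

```python
import random
from collections import deque
import numpy as np

# Experience replay buffer: store transitions, sample uniform mini-batches.
class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

# Epsilon-greedy action selection over a vector of Q values.
def epsilon_greedy(q_values, epsilon=0.1):
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))   # explore
    return int(np.argmax(q_values))               # exploit greedily
```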
policy-based
modify π to maximize the expected reward
baseline
PPO
Q-V (the action relative to the average)
continuous space
classification problem (max likelihood)
Actor-Critic
replace the reward with the Q value
MC
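The standard policy-gradient form behind these items, with a baseline b(s); choosing b(s) = V(s) turns Q - V into the advantage:

\[\nabla_{\theta}J(\theta)=E_{\pi_{\theta}}\!\left[\nabla_{\theta}\log\pi_{\theta}(a|s)\,\big(Q^{\pi}(s,a)-b(s)\big)\right],\qquad b(s)=V^{\pi}(s)\ \Rightarrow\ Q^{\pi}(s,a)-V^{\pi}(s)=A^{\pi}(s,a)\]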
policy
on
the policy that takes actions == the policy being evaluated
off
the policy that takes actions != the policy being evaluated
Q target
Importance sampling, PPO
Visit
first
only the return G from the first visit to that state is used
every
the G from every visit is used
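A minimal first-visit Monte Carlo evaluation sketch; `episodes` is an assumed list of (state, reward) trajectories.

```python
# First-visit MC: per episode, only the return G from the first visit
# to each state is averaged into that state's value estimate.
def first_visit_mc(episodes, gamma=0.99):
    values, counts = {}, {}
    for episode in episodes:                   # episode: list of (state, reward)
        G, first_returns = 0.0, {}
        for state, reward in reversed(episode):
            G = reward + gamma * G
            first_returns[state] = G           # overwritten until only the first-visit G remains
        for state, G in first_returns.items():
            counts[state] = counts.get(state, 0) + 1
            # incremental average of the collected first-visit returns
            values[state] = values.get(state, 0.0) + (G - values.get(state, 0.0)) / counts[state]
    return values
```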
MC control
ε-soft
TD
TD lambda
forward
backward
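Forward view (the λ-return) and backward view (eligibility traces) of TD(λ), standard forms for reference:

\[G_{t}^{\lambda}=(1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_{t}^{(n)},\qquad e_{t}(s)=\gamma\lambda\,e_{t-1}(s)+\mathbf{1}[s_{t}=s],\qquad V(s)\leftarrow V(s)+\alpha\,\delta_{t}\,e_{t}(s),\ \ \delta_{t}=r_{t}+\gamma V(s_{t+1})-V(s_{t})\]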
value estimator
parametric
linear function approx.
NN
non-parametric
table
kernel method
SVM
value iteration
q value
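A minimal semi-gradient TD(0) update with a parametric linear value estimator V(s) ≈ w · φ(s); the feature map `phi` is an assumption.

```python
import numpy as np

# Semi-gradient TD(0) with linear function approximation: V(s) = w . phi(s).
def td0_linear_update(w, phi, s, r, s_next, done, alpha=0.01, gamma=0.99):
    v_s = w @ phi(s)
    v_next = 0.0 if done else w @ phi(s_next)
    td_error = r + gamma * v_next - v_s
    return w + alpha * td_error * phi(s)   # gradient of V(s) w.r.t. w is phi(s)
```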