RL

MDP

Dynamic programming

Value Iteration

Table records state, action, reward
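A minimal sketch of tabular value iteration, assuming the transition model is available in the P[s][a] = [(prob, next_state, reward, done), ...] layout used by gym's toy-text environments; function and variable names are illustrative.

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, tol=1e-6):
    """Tabular value iteration on a known MDP.

    P[s][a] is a list of (prob, next_state, reward, done) tuples,
    the layout gym's toy-text environments expose via env.P.
    """
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            q = np.zeros(n_actions)
            for a in range(n_actions):
                for prob, s2, r, done in P[s][a]:
                    q[a] += prob * (r + gamma * V[s2] * (not done))
            best = q.max()
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # greedy policy extracted from the converged value table
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q = np.zeros(n_actions)
        for a in range(n_actions):
            for prob, s2, r, done in P[s][a]:
                q[a] += prob * (r + gamma * V[s2] * (not done))
        policy[s] = q.argmax()
    return V, policy
```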

policy iteration

function approximation

Learning examples

transition matrix

learning

Q learning

DQN

Q value

NN
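A minimal sketch of the tabular Q-learning update; the Q target r + γ max Q(s', ·) is the same in DQN, which replaces the table with a neural network. Names are illustrative.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: Q(s,a) += alpha * (target - Q(s,a)).

    The target r + gamma * max_a' Q(s', a') is the 'Q target'; DQN keeps
    the same target but estimates Q with a neural network instead of a table.
    """
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```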

Policy learning

error estimator

value estimator

MC

TD

Duel DQN

Implementation

gym

tensorflow

n-step return

TD lambda

gradient

classification

label * entropy

y log(pi)
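A sketch of the policy-gradient loss written like a weighted classification loss, assuming the trajectory's log-probabilities, actions, and discounted returns are already collected; the action taken plays the role of the one-hot label y, and the return G_t acts as the per-sample weight.

```python
import numpy as np

def reinforce_loss(log_probs, actions, returns):
    """REINFORCE loss written like a weighted classification loss.

    log_probs: (T, n_actions) array of log pi(a|s) for each visited state
    actions:   (T,) actions actually taken (the 'labels')
    returns:   (T,) discounted returns G_t used as per-sample weights

    Ordinary cross entropy would be -sum(y * log pi); here the one-hot
    label picks out log pi(a_t|s_t) and G_t scales it.
    """
    chosen = log_probs[np.arange(len(actions)), actions]
    return -np.sum(returns * chosen)
```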

Model-free

MC

TD

Value Iteration

Expected value

MC

TD

Variance

biased

function approximation

MDP

model

transition matrix

value iteration

model-free

Q value

table

Dynamic programming

value estimator

function approximation

NN

learning

MC

TD

n-step return

TD lambda
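A sketch of the n-step return and the forward-view TD(lambda) return built from it, assuming per-step rewards and value estimates for one episode are given; names are illustrative.

```python
import numpy as np

def n_step_return(rewards, values, t, n, gamma=0.99):
    """n-step return: n discounted rewards plus a bootstrapped value."""
    T = len(rewards)
    end = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:  # bootstrap only if the episode has not ended
        G += gamma ** (end - t) * values[end]
    return G

def lambda_return(rewards, values, t, lam=0.9, gamma=0.99):
    """Forward-view TD(lambda): (1-lambda)-weighted mix of all n-step returns."""
    T = len(rewards)
    G_lam, weight = 0.0, (1.0 - lam)
    for n in range(1, T - t):
        G_lam += weight * n_step_return(rewards, values, t, n, gamma)
        weight *= lam
    # the remaining weight goes to the full Monte Carlo return
    G_lam += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return G_lam
```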

DQN

classification

baseline

AD

policy gradient

Q target

policy: do the right thing

on-policy

off policy


Policy iteration

env

gym trading
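A minimal gym interaction loop using the classic 4-tuple step API (older gym versions); 'CartPole-v1' stands in here for whatever environment (e.g. a trading env) is actually used.

```python
import gym

# Illustrative interaction loop with a random policy as a placeholder.
env = gym.make("CartPole-v1")
for episode in range(3):
    obs = env.reset()
    done, total = False, 0.0
    while not done:
        action = env.action_space.sample()       # random action
        obs, reward, done, info = env.step(action)
        total += reward
    print(f"episode {episode}: return {total}")
env.close()
```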

TRPO: find a suitable learning rate

Monotonic improvement guarantee

Objective function

boundary

advantage

action space

transition matrix

\[\tilde{\pi}\]

total variation distance

\[\huge\eta(\tilde{\pi})=E_{\tilde{\pi}}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t})\right]\]


\[\Huge \rho_{\pi}\]

The advantage of the action-value function over the current state's value function

\[\eta(\tilde{\pi})=E_{\tau|\tilde{\pi}}\left[ \sum_{t=0}^{\infty}\gamma^{t}r(s_{t}) \right]\]

\[ \eta(\tilde{\pi})=\eta(\pi)+E_{s_{0},a_{0},\dots\sim\tilde{\pi}}\left[ \sum_{t=0}^{\infty}\gamma^{t}A_{\pi}(s_{t},a_{t}) \right] \]

π̃ denotes the new policy; let π denote the old policy, then decompose the return function

\[ \eta(\tilde{\pi})=\eta(\pi)+\sum_{t=0}^{\infty}\sum_{s}P(s_{t}=s|\tilde{\pi})\sum_{a}\tilde{\pi}(a|s)\gamma^{t}A_{\pi}(s,a) \]

s is generated by the new policy, so this depends heavily on the new policy

\[ \eta(\tilde{\pi})=\eta(\pi)+\sum_{s}\rho_{\tilde{\pi}}(s)\sum_{a}\tilde{\pi}(a|s)A_{\pi}(s,a) \]

Surrogate function

The action a is also generated by the new policy, but the new policy's parameters are unknown

Importance sampling

learning rate issue

\[ L_{\pi}(\tilde{\pi})=\eta(\pi)+\sum_{s}\rho_{\pi}(s)\sum_{a}\tilde{\pi}(a|s)A_{\pi}(s,a) \]

\[\begin{gather*} \sum_{a}\tilde{\pi}(a|s_{n})A_{\theta_{old}}(s_{n},a)=E_{a\sim q}\left[ \frac{\tilde{\pi}(a|s_{n})}{q(a|s_{n})}A_{\theta_{old}}(s_{n},a) \right] \\ \text{let } q(a|s_{n})=\pi_{\theta_{old}}(a|s_{n}) \\ \frac{1}{1-\gamma}E_{s\sim \rho_{\theta_{old}}}[\cdot] \text{ approximates } \sum_{s}\rho_{\theta_{old}}(s)[\cdot] \end{gather*}\]

\[\eta(\tilde{\pi})\geq L_{\pi}(\tilde{\pi}) - CD_{KL}^{\max}(\pi,\tilde{\pi})\quad \text{ where } C=\frac{2\epsilon\gamma}{(1-\gamma)^{2}}\]

\[\max_{\theta}\left[L_{\theta_{old}}(\theta)-CD_{KL}^{\max}(\theta_{old},\theta)\right]\]

Simplification

\[\begin{split} & \max_{\theta}E_{s\sim \rho_{\theta_{old}}, a\sim\pi_{\theta_{old}}}\left[ \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}A_{\theta_{old}}(s,a) \right] \\ & \text{subject to } E_{s\sim \rho_{\theta_{old}}}\left[D_{KL}\left( \pi_{\theta_{old}}(\cdot | s) \,\|\, \pi_{\theta}(\cdot | s) \right)\right] \leq \delta \end{split}\]

Approximation

Conjugate gradient method
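TRPO uses the conjugate gradient method to solve F x = g (Fisher matrix times step equals policy gradient) using only matrix-vector products. A generic sketch, with Avp standing in for the Fisher-vector product.

```python
import numpy as np

def conjugate_gradient(Avp, b, iters=10, tol=1e-10):
    """Solve A x = b given only the matrix-vector product Avp(v) = A v.

    In TRPO, A is the Fisher information matrix of the KL constraint and
    b is the policy gradient, so x approximates the natural gradient step.
    """
    x = np.zeros_like(b)
    r = b.copy()              # residual b - A x (x starts at 0)
    p = r.copy()
    rdotr = r @ r
    for _ in range(iters):
        Ap = Avp(p)
        alpha = rdotr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        new_rdotr = r @ r
        if new_rdotr < tol:
            break
        p = r + (new_rdotr / rdotr) * p
        rdotr = new_rdotr
    return x
```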

Proof

\[\huge \rho_{\pi}(s)=P(s_{0}=s)+\gamma P(s_{1}=s)+\gamma^{2}P(s_{2}=s)+\cdots\]

KL divergence

Optimizing

Intuition

TRPO/PPO

Surrogate function

Monotonic improvement guarantee

Objective function

\[\eta(\pi)=E_{\tau\sim\pi_{\theta}}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\right]\]

advantage function

expected rewards minus a baseline

constraint

new policy cannot be too different from the old policy
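PPO's clipped surrogate is one practical way to enforce this: instead of a hard KL constraint it clips the importance ratio. A minimal numpy sketch, assuming per-sample log-probabilities and advantages are already computed; names are illustrative.

```python
import numpy as np

def ppo_clip_objective(log_pi_new, log_pi_old, advantages, eps=0.2):
    """PPO clipped surrogate: keep the new policy close to the old one
    by clipping the importance ratio instead of enforcing a KL bound."""
    ratio = np.exp(log_pi_new - log_pi_old)          # pi_new / pi_old
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```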

MDP

*model free

*value estimator

TD

MC

AD

policy iteration

*model base

value base

MD

policy iteration

value iteration

model-based

value driven

policy iteration




eval. q

to the end state

improve policy at the same time

Policy search

greedy

value iteration


evaluate V (state value)

eval action space

to the end state

Policy search

greedy

model free

value base

Q learning

DQN

Q target

Replay buffer

Prioritized replay
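A minimal uniform replay buffer sketch; prioritized replay would instead sample transitions in proportion to their TD error. Names are illustrative.

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform experience replay for DQN-style training."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```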

policy

greedy

epsilon greedy
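A sketch of epsilon-greedy action selection over a Q table; names are illustrative.

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    """With probability epsilon explore uniformly, otherwise act greedily."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))
```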

policy base

Adjust π to maximize the expected reward

baseline

PPO

Q-V (the action relative to the average)

continuous action space

Classification problem (maximum likelihood)

Actor-Critic

Replace reward with Q value
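A sketch of the one-step actor-critic signals, assuming V(s) and V(s') come from the critic: the TD error serves as the advantage estimate that replaces the raw return as the weight on log π(a|s) in the actor's gradient. Names are illustrative.

```python
def actor_critic_targets(reward, value_s, value_s_next, done, gamma=0.99):
    """One-step actor-critic signals.

    The TD target trains the critic V(s); the TD error (an estimate of
    Q - V) weights log pi(a|s) in the actor's gradient.
    """
    td_target = reward + gamma * value_s_next * (0.0 if done else 1.0)
    td_error = td_target - value_s
    return td_target, td_error
```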

MC

policy

On

acting policy == policy being evaluated

off

acting policy != policy being evaluated

Q target

Importance sampling: PPO

Visit

first

Only the return G from the first visit to the state is used

every

The returns G from all visits are used
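A sketch of first-visit Monte Carlo prediction, assuming episodes are lists of (state, reward) pairs; every-visit MC would record G_t at every occurrence of a state instead of only the first. Names are illustrative.

```python
from collections import defaultdict

def mc_first_visit_V(episodes, gamma=0.99):
    """First-visit Monte Carlo prediction.

    episodes: list of episodes, each a list of (state, reward) pairs where
    the reward is the one received after leaving that state.
    """
    returns = defaultdict(list)
    for episode in episodes:
        # discounted return G_t at every step, computed backwards
        G, G_t = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            G_t[t] = G
        # index of the first visit to each state
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # first-visit rule: only the first G per state is averaged
        for s, t in first_visit.items():
            returns[s].append(G_t[t])
    return {s: sum(g) / len(g) for s, g in returns.items()}
```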

MC control

soft epsilon

TD

TD lambda

forward

backward
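A sketch of backward-view TD(lambda) with accumulating eligibility traces over a tabular V, assuming an episode is a list of (s, r, s_next, done) transitions; names are illustrative.

```python
import numpy as np

def td_lambda_backward(episode, V, alpha=0.1, gamma=0.99, lam=0.9):
    """Backward-view TD(lambda): every state keeps an eligibility trace,
    and each TD error updates all recently visited states at once."""
    E = np.zeros_like(V)                      # eligibility traces
    for (s, r, s_next, done) in episode:
        td_error = r + gamma * V[s_next] * (not done) - V[s]
        E *= gamma * lam                      # decay all traces
        E[s] += 1.0                           # accumulate trace for current state
        V += alpha * td_error * E
    return V
```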

value estimator

Parametric

linear function approx.

NN

Non-parametric

table

kernel method

SVM

value iteration

q value