Sample Inefficiency in on-policy DRL
small learning rates
why are these small?
small changes in param space can cause huge (unwanted) changes in the action distribution
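A minimal numerical sketch of this effect, assuming a hypothetical sigmoid policy and an arbitrarily large state feature (all values here are illustrative, not from the source):

```python
import numpy as np

def action_prob(w, s):
    # Bernoulli policy: pi(a=1|s) = sigmoid(w * s)
    return 1.0 / (1.0 + np.exp(-w * s))

s = 100.0                     # hypothetical state feature with large magnitude
w_old, w_new = 0.05, 0.01     # a step of only 0.04 in parameter space

print(action_prob(w_old, s))  # ~0.993
print(action_prob(w_new, s))  # ~0.731 -> the action distribution shifted drastically
```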
how to overcome this?
Trust Region Updates
choose the largest learning rate for the update that does not change the policy too much
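A sketch of how such a trust-region update is commonly formalised (TRPO-style constrained surrogate objective, with δ the allowed policy change):

```latex
\max_{\theta}\;
\mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}
\!\left[\frac{\pi_{\theta}(a\mid s)}{\pi_{\theta_{\text{old}}}(a\mid s)}\,
A^{\pi_{\theta_{\text{old}}}}(s,a)\right]
\quad\text{s.t.}\quad
\mathbb{E}_{s}\!\left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot\mid s)\,\big\|\,\pi_{\theta}(\cdot\mid s)\big)\right]\le\delta
```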
Second order Optimization Techniques
trust regions limit the change in policy
continuous, stable, but slow improvements
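A sketch of the second-order (natural-gradient) step behind such trust-region methods; F is the Fisher information matrix and the step length is chosen so that a KL trust region of size δ is respected:

```latex
\theta_{k+1} = \theta_k + \sqrt{\frac{2\delta}{g^{\top} F^{-1} g}}\; F^{-1} g,
\qquad
g = \nabla_{\theta} J(\theta_k),
\qquad
F = \mathbb{E}\!\left[\nabla_{\theta}\log \pi_{\theta}(a\mid s)\,\nabla_{\theta}\log \pi_{\theta}(a\mid s)^{\top}\right]
```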
policy initialization far from the optimum
better approaches
pretrain the policy to output specific values
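A minimal sketch of such pretraining, assuming hypothetical network sizes and target actions: the policy network is fit with supervised regression so it starts out producing the chosen outputs.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

states = torch.randn(256, 4)          # placeholder states
target_actions = torch.zeros(256, 2)  # e.g. "do nothing" or expert actions (assumption)

for _ in range(200):
    # supervised regression onto the desired outputs before RL training starts
    loss = nn.functional.mse_loss(policy(states), target_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```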
stochastic policies
the gradient is an expectation w.r.t. states and actions
Deterministic Policy Gradient is only an expectation w.r.t. states
needs exploration
optimistic Q(s,a) initialization
Noisy states/actions
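The difference in formulas, as a sketch (standard forms of the stochastic policy gradient vs. the deterministic policy gradient):

```latex
\nabla_{\theta} J(\pi_{\theta})
= \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_{\theta}}
\!\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi}(s,a)\right]
\qquad\text{vs.}\qquad
\nabla_{\theta} J(\mu_{\theta})
= \mathbb{E}_{s \sim \rho^{\mu}}
\!\left[\nabla_{\theta}\mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s,a)\big|_{a=\mu_{\theta}(s)}\right]
```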
exploration (continuous spaces)
stochastic policy
noisy nets
optimistic value initialization
intrinsic motivation
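A minimal sketch of one of the options above, Gaussian noise on a deterministic policy's actions (noise scale and action bounds are assumptions):

```python
import numpy as np

def noisy_action(mu_action, sigma=0.1, low=-1.0, high=1.0):
    # add zero-mean Gaussian noise to the deterministic action, then clip to the action bounds
    noise = np.random.normal(0.0, sigma, size=np.shape(mu_action))
    return np.clip(mu_action + noise, low, high)

print(noisy_action(np.array([0.3, -0.7])))
```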
hyperparameters
learning rate
batch size
discount factor
high dimensionality
state space
action space
high number of NN parameters
throwing away experiences
what could they be used for?
improving policy
learning a model
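A minimal sketch of keeping experiences in a replay buffer instead of throwing them away, so they remain available for further policy improvement or for fitting a model:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # drops the oldest experiences when full

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sample of stored transitions for reuse
        return random.sample(self.buffer, batch_size)
```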
absence of a model
a model would allow saving samples by planning
learned models are imprecise and not sample efficient
why are models imprecise?
use the model only when it's precise
partially precise: only for certain (s, a, s', r) transitions
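One way to use the model only where it is precise is to gate on ensemble disagreement; a hypothetical sketch (ensemble, env_step and the threshold are assumptions, not from the source):

```python
import numpy as np

def model_or_env_step(ensemble, env_step, state, action, threshold=0.05):
    # query every learned dynamics model in the ensemble for its next-state prediction
    predictions = np.stack([model(state, action) for model in ensemble])
    disagreement = predictions.std(axis=0).mean()
    if disagreement < threshold:
        # models agree -> treat this (s, a) region as "precise" and plan with the model
        return predictions.mean(axis=0), True
    # models disagree -> fall back to a real environment sample
    return env_step(state, action), False
```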
learning objective
maximization of the return expectation
the considered time horizon is too long -> n-step return
optimizing the expectation might not be optimal w.r.t. the full return distribution (e.g. it ignores variance/risk)
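The objective and the n-step return referred to above, in standard notation:

```latex
J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1}\right],
\qquad
G_{t}^{(n)} = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k+1} + \gamma^{n} V(s_{t+n})
```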
1st order optimization techniques are inefficient
architectures
Linear Regression in the output layer
Linear combination of inputs
no notion of symmetry and periodicity
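A minimal sketch of giving a linear output layer a notion of periodicity by encoding an angle as sin/cos features (the angle variable is a hypothetical example of a periodic state dimension):

```python
import numpy as np

def periodic_features(angle):
    # an angle and the same angle plus 2*pi map to the same features,
    # so a linear combination of these inputs respects the periodicity
    return np.array([np.sin(angle), np.cos(angle)])

print(periodic_features(0.0), periodic_features(2 * np.pi))  # (numerically) identical
```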
reward function