Reasons for Sample Inefficiency of On-Policy Deep RL
- stochastic policies
  - the Deterministic Policy Gradient is an expectation w.r.t. states only and therefore requires fewer samples to estimate
  - the stochastic policy gradient is an expectation over both states and actions (compare the two forms below)
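For reference, the two gradient forms side by side, in standard notation (the symbols ρ, Q, μ are introduced here, not in the original map):

```latex
\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_\theta(\cdot \mid s)}
  \big[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s,a) \big]
  \quad \text{(stochastic: over states and actions)}

\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim \rho^{\mu}}
  \big[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s,a) \big|_{a=\mu_\theta(s)} \big]
  \quad \text{(deterministic: over states only)}
```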
- (random) policy initialization far from the optimum
- trust regions
  - limit the change in policy per update step (see the TRPO constraint below)
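For concreteness, the trust-region update as formulated in TRPO (standard form, added here for reference; δ is the step-size bound):

```latex
\max_\theta \; \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}
  \left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A^{\pi_{\theta_{\text{old}}}}(s,a) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{s}\!\left[ D_{\mathrm{KL}}\!\big( \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \big) \right] \le \delta
```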
- big batch sizes
  - lead to a precise estimate of the policy gradient but reduce the number of policy updates (see the trade-off below)
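A one-line way to see the trade-off, assuming i.i.d. per-sample gradient estimates with variance σ² and a fixed sample budget T (notation not from the original map):

```latex
\operatorname{Var}\big[\hat{g}_N\big] \approx \frac{\sigma^2}{N},
\qquad \text{number of updates} = \frac{T}{N}
```

A larger batch size N cuts gradient noise but proportionally cuts how many update steps the same data can support.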
- inefficient architectures
  - linear regression in the output layer: just a linear combination of inputs
  - no notion of symmetry and periodicity
- absence of a model
  - a model would allow saving samples by planning
  - a differentiable model would even allow backpropagating through it (see the sketch below)
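A minimal sketch of the second point, assuming PyTorch and hypothetical network shapes (none of these names come from the original map): with differentiable dynamics and reward models, the return of an imagined rollout can be backpropagated directly into the policy without collecting new environment samples.

```python
import torch
import torch.nn as nn

state_dim, action_dim, horizon = 4, 2, 5

# hypothetical learned components: policy, dynamics model, reward model
policy = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, action_dim))
dynamics = nn.Sequential(nn.Linear(state_dim + action_dim, 32), nn.Tanh(), nn.Linear(32, state_dim))
reward = nn.Sequential(nn.Linear(state_dim + action_dim, 32), nn.Tanh(), nn.Linear(32, 1))

s = torch.zeros(1, state_dim)       # imagined start state
ret = torch.zeros(1, 1)
for _ in range(horizon):            # imagined rollout: no environment interaction
    a = policy(s)
    sa = torch.cat([s, a], dim=-1)
    ret = ret + reward(sa)
    s = dynamics(sa)                # differentiable transition keeps the graph intact

ret.mean().backward()               # gradients flow through the model into the policy
print(policy[0].weight.grad.shape)  # policy parameters received a gradient
```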
- high dimensionality
  - of the parameter space: requires many samples for generalization
  - of the state and action spaces: requires many samples for sufficient exploration
- small learning rates
  - needed because small changes in parameter space can cause huge (unwanted) changes in the action distribution (see the numerical example below)
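A tiny numerical illustration of this point (hypothetical numbers, NumPy): with large-magnitude features, changing a single weight by 0.02 flips which action a softmax policy prefers.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

s = np.array([100.0, -100.0])             # large-magnitude state features
W = np.array([[1.00, 1.00],               # action logits are W @ s
              [1.01, 1.00]])

p_old = softmax(W @ s)                    # ~[0.27, 0.73]: action 1 preferred

W[1, 0] -= 0.02                           # tiny step in parameter space
p_new = softmax(W @ s)                    # ~[0.73, 0.27]: preference flipped

print(p_old, p_new)
print("KL(old || new) =", np.sum(p_old * np.log(p_old / p_new)))  # ~0.46
```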
- complicated tasks
  - might require hierarchical control, a curriculum, and prior knowledge
- reward function
  - doesn't provide a good enough training signal (e.g. sparse or delayed rewards)
- bad credit assignment
  - requires many samples to determine the true value of individual states and actions
- throwing away experiences
  - to estimate the gradient of a policy, we need trajectories sampled from that same policy, so data collected under earlier policies has to be discarded (see the expectation below)
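Written out (standard on-policy gradient, added for reference): the expectation is taken over trajectories from the current policy π_θ, so samples collected under an earlier policy no longer match this distribution and must be thrown away or reweighted by importance sampling.

```latex
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}
  \Big[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \Big]
```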
LEGEND
- Causes for Sample Inefficiency (top-level items)
- details (indented items)