DL(17) Vanishing/Exploding of Gradients remedies
Architectural
Gated Recurrent Units (GRU)
it simplifies the LSTM by using a single gating unit that simultaneously controls the forgetting factor and the decision to update the state unit
h(t) = z(t) ⊙ h(t-1) + (1 - z(t)) ⊙ 𝜎(Ux(t) + W(r(t) ⊙ h(t-1)))
the update gate z: selects whether the hidden state is to be updated with a new hidden state
the reset gate r: decides whether the previous hidden state is ignored
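A minimal NumPy sketch of this update; the gate computations, weight shapes and values are illustrative assumptions, and the note's 𝜎 is kept for the candidate state:

    import numpy as np

    def gru_step(x, h_prev, U, W, Uz, Wz, Ur, Wr):
        # one step of h(t) = z(t) ⊙ h(t-1) + (1 - z(t)) ⊙ 𝜎(Ux(t) + W(r(t) ⊙ h(t-1)))
        sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
        z = sigmoid(Uz @ x + Wz @ h_prev)   # update gate (standard GRU form, assumed)
        r = sigmoid(Ur @ x + Wr @ h_prev)   # reset gate (standard GRU form, assumed)
        # candidate state uses 𝜎 as written in the note (tanh is also common in practice)
        return z * h_prev + (1.0 - z) * sigmoid(U @ x + W @ (r * h_prev))

    # illustrative shapes only
    rng = np.random.default_rng(0)
    n_in, n_h = 3, 5
    U, Uz, Ur = (rng.standard_normal((n_h, n_in)) * 0.1 for _ in range(3))
    W, Wz, Wr = (rng.standard_normal((n_h, n_h)) * 0.1 for _ in range(3))
    h = gru_step(rng.standard_normal(n_in), np.zeros(n_h), U, W, Uz, Wz, Ur, Wr)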
Long Short-Term Memory
allows the networks to "remember relevant information" for a long period of time
idea: replace the sigmoid unit with something that is easier to deal with when performing gradient descent, in order to preserve information
structure
3 gate units with sigmoid activation
output gate ON: lets the current value stored in the memory cell be read (fed as input to the rest of the network)
forget gate OFF: lets the current value stored in the memory cell be reset to 0; this is crucial for LSTM performance
input gate ON: lets the input flow into the memory cell
peephole connections
allow the memory cell to directly control all the gates, allowing easier learning of precise timing
linear memory cell: integrates input information through time
memory obtained by a self-loop
gradient not down-sized by the Jacobian of a sigmoidal function → no vanishing gradient (see the sketch below)
Full BPTT (backpropagation through time) is used for training
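A minimal NumPy sketch of the structure above, without peephole connections; the weight shapes, the tanh squashing of the cell output and all parameter names are simplifying assumptions:

    import numpy as np

    def lstm_step(x, h_prev, c_prev, Uf, Wf, Ui, Wi, Uo, Wo, Uc, Wc):
        # 3 sigmoid gate units plus a linear memory cell with a self-loop
        sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
        f = sigmoid(Uf @ x + Wf @ h_prev)   # forget gate: OFF -> cell value reset towards 0
        i = sigmoid(Ui @ x + Wi @ h_prev)   # input gate:  ON  -> input flows into the cell
        o = sigmoid(Uo @ x + Wo @ h_prev)   # output gate: ON  -> cell value is read out
        c_new = f * c_prev + i * np.tanh(Uc @ x + Wc @ h_prev)   # linear self-loop integrates input over time
        h_new = o * np.tanh(c_new)
        return h_new, c_new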
Reservoir Computing
fix the input-to-hidden and hidden-to-hidden connections at random values and learn only the output connections
in this way there is no backpropagation ⇒ no exploding/vanishing gradient problem
Echo State Networks (ESN)
standard recurrent neurons
leaky integrator unit: h(t) = (1 - a) h(t-1) + 𝜎(Ux(t) + Wh(t-1)), where a is the leaky decay rate (a < 1)
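A one-line NumPy sketch of the leaky-integrator update, assuming tanh as the sigmoidal nonlinearity 𝜎 and an illustrative decay rate:

    import numpy as np

    def leaky_esn_step(x, h_prev, U, W, a=0.2):
        # h(t) = (1 - a) h(t-1) + 𝜎(Ux(t) + Wh(t-1)); tanh plays the role of 𝜎 here
        return (1.0 - a) * h_prev + np.tanh(U @ x + W @ h_prev)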
Liquid State Machines (LSM)
spiking integrate-and-fire neurons
neurons switch off for some time after activation
these methods try to reproduce the behaviour of real brain neurons
Reservoir Computing: additional details
the reservoir is randomly created and remains unchanged during training
the hidden state h(t) maintains a nonlinear version of the input history
to produce a rich set of dynamics, the reservoir should:
be big
be sparsely (W with up to 20% of the possible connections) and randomly connected
satisfy the echo state property: ρ(W) < 1, i.e. the spectral radius of W is below 1 (a construction sketch follows below)
it is passively excited by the input x(t)
on the contrary, the input matrix U and the output matrix O are dense
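A NumPy sketch of how such a reservoir matrix W could be built; the size, density and target spectral radius are illustrative assumptions:

    import numpy as np

    def make_reservoir(n=500, density=0.1, spectral_radius=0.9, seed=0):
        # big, sparse, random W, rescaled so that ρ(W) < 1 (echo state property)
        rng = np.random.default_rng(seed)
        W = rng.uniform(-1.0, 1.0, size=(n, n))
        W *= rng.random((n, n)) < density            # keep ~10% of the possible connections
        rho = np.max(np.abs(np.linalg.eigvals(W)))   # current spectral radius
        return W * (spectral_radius / rho)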
the output is computed as a linear combination of the input-excited reservoir; the linear combination is obtained by linear regression
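A NumPy sketch of the readout: excite the fixed reservoir with the input, collect the states, and fit the output weights by least-squares linear regression. The tanh update and the array shapes are assumptions; ridge regression is often used instead of plain least squares for numerical stability.

    import numpy as np

    def train_readout(xs, ys, U, W):
        # drive the fixed reservoir with the input sequence, then fit the
        # output weights by least-squares linear regression on the states
        h = np.zeros(W.shape[0])
        states = []
        for x in xs:                       # the reservoir is only excited, never trained
            h = np.tanh(U @ x + W @ h)
            states.append(h.copy())
        H = np.asarray(states)             # (time, n) matrix of reservoir states
        O, *_ = np.linalg.lstsq(H, np.asarray(ys), rcond=None)
        return O                           # predictions are H @ O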
a simple cycle reservoir obtains performance comparable to an ESN
the memory capacity of a simple linear cyclic reservoir can be made close to the proven optimal memory capacity value
on the other hand, very simple topologies can be very effective
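A sketch of a simple cycle reservoir matrix, assuming a ring in which every unit feeds the next one with the same weight r; the concrete values are illustrative:

    import numpy as np

    def simple_cycle_reservoir(n=100, r=0.9):
        # ring topology: unit i feeds only unit (i + 1) mod n, all weights equal to r
        W = np.zeros((n, n))
        for i in range(n):
            W[(i + 1) % n, i] = r
        return W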
Intrinsic Plasticity (IP)
an efficient online learning rule to adjust the threshold and gain of sigmoid reservoir neurons:
it drives the neurons' output activities to approximate exponential distributions
the exponential distribution maximizes the entropy of a non-negative random variable with a fixed mean, thus enabling the neurons to transmit maximal information
alternative topologies for the reservoir were compared with no significant improvement
Deep Echo State Networks
deep version of the reservoir
mix
reservoirs in parallel
deep feed-forward stacking
Memory Capacity
task: reconstruct the input with increasing delay
memory capacity: ∑ₖ r²(x(t-k), oᵏ(t))
where r²(x(t-k), oᵏ(t)) is the squared correlation coefficient between:
input: x(t-k) with delay k
corresponding output: oᵏ(t), generated by the net at time t for delay k
target: yᵏ(t) = x(t-k), ∀k ∈ [0, ..., ∞]
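A sketch of an empirical estimate of this quantity, assuming a finite maximum delay k_max, a 1-D input signal x, a precomputed (time, n) matrix H of reservoir states driven by x, and one least-squares readout per delay:

    import numpy as np

    def memory_capacity(x, H, k_max=50):
        # MC ≈ ∑ₖ r²(x(t-k), oᵏ(t)), with oᵏ a readout trained to reproduce x(t-k)
        mc = 0.0
        for k in range(1, k_max + 1):
            Hk, target = H[k:], x[:-k]                 # align the state at time t with x(t-k)
            w, *_ = np.linalg.lstsq(Hk, target, rcond=None)
            r = np.corrcoef(Hk @ w, target)[0, 1]      # correlation between oᵏ(t) and x(t-k)
            mc += r ** 2
        return mc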