Recurrent Neural Networks
Examples
Speech recognition
Music generation
Sentiment classification
DNA source analysis
Machine translation
Video activity recognition
Named entity recognition
Notation
input: Harry Potter and Hermione Granger invented a new spell
x<1> x<2>... x<t>... x<9>
output: 1 1 0 1 1 0 0 0 0
y<1> y<2> ... y<9>
t-th element of sequence
Tx^(i) = 9 (length of the input sequence for example i)
How to represent words?
Vocabulary
Let's say we have
Dictionary: ["a", "aaron", "Harry", ... "zulu"]
length = 10k
One-hot representation
[0...., 1, ...0]
the 1 is placed at the word's index in the dictionary
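As a sketch, the one-hot encoding above could look like this in numpy (the index 4075 for "Harry" is made up for illustration):

```python
import numpy as np

def one_hot(index, vocab_size=10000):
    """Return a one-hot vector with a 1 at the word's dictionary index."""
    v = np.zeros(vocab_size)
    v[index] = 1.0
    return v

# e.g. if "Harry" sat at (hypothetical) index 4075 in the 10k-word dictionary:
harry = one_hot(4075)
```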
RNN Model
A standard network doesn't work well:
inputs and outputs can have different lengths in different examples
it doesn't share features learned across different positions in the text
NN which tries to predict y<t>
also gets the activation a<t-1> from the previous timestep's NN
Downside:
can't get info from words later in the sentence
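A minimal numpy sketch of one forward step, using the course's Waa/Wax/Wya weight naming; the helper itself is illustrative, not the course's code:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the output vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, a_prev, Waa, Wax, Wya, ba, by):
    """One RNN timestep: a<t> depends on both the current input x<t>
    and the previous activation a<t-1>; weights are shared across t."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    y_t = softmax(Wya @ a_t + by)    # prediction y^<t>
    return a_t, y_t
```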
Backpropagation
Backpropagation through time
L<t>(y^<t>, y<t> = -y<t> log y^<t> - (1 - y<t>)*log(1 - y^<t>)
L(y^, y) = sum from 1 to Ty L<t>(y^<t>, y<t>)
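The two loss formulas above, sketched in numpy (helper names are illustrative):

```python
import numpy as np

def timestep_loss(y_hat, y):
    """Binary cross-entropy at a single timestep t."""
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

def sequence_loss(y_hats, ys):
    """Total loss: sum of per-timestep losses over t = 1..Ty."""
    return sum(timestep_loss(yh, y) for yh, y in zip(y_hats, ys))
```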
RNN types
Many-to-many
x<1> x<2>... x<t>
y<1> y<2>... y<t>
Like in RNN model above
Many-to-one
Sentiment classification
x<1> x<2>... x<t>
y^
x=text
y=0,1 or 1..5
One to one
this one is pretty much an ordinary NN
One-to-many
Music generation
x
y<1> y<2>... y<t>
Many-to-many where Tx != Ty
Translation between languages
here the NN is split into an encoder and a decoder
"attention based architectures"
Language models
Language model
given a sentence, outputs the probability of that sentence
Training: take a large corpus of English text
break each sentence down into tokens (tokenize)
unknown words get the token <UNK>
y^<1> gives, via softmax, the probability of each vocabulary word being the first word of the sentence
y^<2> conditions on the first chosen word and gives the probability of each word coming next
something like that
Sampling novel sequences
take your trained network
input x<1>=0
see what it's outputting
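A hedged sketch of the sampling loop: feed x<1>=0, sample a word from the softmax output, feed the chosen word back in as the next input. The `step_fn(x, a)` stand-in (returning the new activation and the softmax distribution) is hypothetical:

```python
import numpy as np

def sample_sequence(step_fn, vocab_size, a0, max_len=20, eos_idx=0, rng=None):
    """Sample a novel sequence from a trained language model."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = np.zeros(vocab_size)          # x<1> = 0 (zero vector)
    a = a0
    indices = []
    for _ in range(max_len):
        a, probs = step_fn(x, a)      # probs: softmax over the vocabulary
        idx = rng.choice(vocab_size, p=probs)
        indices.append(idx)
        if idx == eos_idx:            # stop on an <EOS>-style token
            break
        x = np.zeros(vocab_size)
        x[idx] = 1.0                  # one-hot of the sampled word becomes next input
    return indices
```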
Character level language model
vocabulary = [a, b, c... 0..9, A...Z]
Vanishing gradients
RNNs are bad at capturing long range dependencies
due to vanishing gradients!
"The cat, which already ate a lot of food, was full."
"The cats, which already ate a lot of food, were full."
Exploding gradients can also be bad,
but they're less of a problem for RNNs (and gradient clipping is a simple fix)
Gated Recurrent Unit (GRU)
can be used to improve
long range dependency
we introduce c = memory cell
c<t> = a<t>
every time we'll consider overwriting c<t> with
~c<t> = tanh (Wc [c<t-1>, x<t>] + bc)
Γu = sigmoid(Wu [c<t-1>, x<t>] + bu)
c<t> = Γu * ~c<t> + (1 - Γu) * c<t-1>
Basically the memory sits in a cell
and can capture dependencies
and remember them
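The simplified (update-gate-only) GRU above could be sketched in numpy like this; the helper names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, Wc, Wu, bc, bu):
    """Simplified GRU step, matching the equations in the notes."""
    concat = np.concatenate([c_prev, x_t])
    c_tilde = np.tanh(Wc @ concat + bc)      # candidate memory ~c<t>
    gamma_u = sigmoid(Wu @ concat + bu)      # update gate Γu, in (0, 1)
    c_t = gamma_u * c_tilde + (1 - gamma_u) * c_prev
    return c_t
```

When Γu is near 0 the old memory c<t-1> passes through almost unchanged, which is what lets the cell carry information across long ranges.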
LSTM
Long short-term memory
Historically more proven,
usually tried first
<a lot of equations>
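For reference, the standard LSTM equations in the same notation as the GRU above (~c is the candidate memory; Γu, Γf, Γo are the update, forget, and output gates); this is the usual textbook formulation, not the exact slides:

```latex
\tilde{c}^{\langle t \rangle} = \tanh(W_c [a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c)
\Gamma_u = \sigma(W_u [a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u)
\Gamma_f = \sigma(W_f [a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_f)
\Gamma_o = \sigma(W_o [a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_o)
c^{\langle t \rangle} = \Gamma_u * \tilde{c}^{\langle t \rangle} + \Gamma_f * c^{\langle t-1 \rangle}
a^{\langle t \rangle} = \Gamma_o * \tanh(c^{\langle t \rangle})
```

Note that, unlike the GRU, the LSTM has a separate forget gate Γf instead of using (1 - Γu), and the activation a<t> is no longer equal to c<t>.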
Bidirectional RNNs
BRNN
Downside of RNN, LSTM, GRU is
they don't capture dependencies
from words later in the sentence
we have
x<1>...x<t>
y<1>...y<t>
as before, but we double the number of activations a<t>
so that each timestep has a forward activation and a backward activation
Downside:
in speech recognition you have to wait until the user stops talking to get the whole phrase
for speech - there are additional things people use (don't know which)
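A sketch of how the doubled activations combine into y<t>, assuming the forward and backward activations have already been computed (names are illustrative):

```python
import numpy as np

def brnn_outputs(a_fwd, a_bwd, Wy, by):
    """Each y<t> combines the forward activation and the backward
    activation at the same timestep t."""
    ys = []
    for af, ab in zip(a_fwd, a_bwd):
        ys.append(Wy @ np.concatenate([af, ab]) + by)
    return ys
```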
Deep RNNs
x<t> -> a<t> -> y<t>
in Deep RNN becomes
x<t> -> a[1]<t> -> a[2]<t> -> a[3]<t> -> ... -> a[l]<t> -> y<t>
or maybe
x<t> -> a[1]<t> -> a[2]<t> -> a[3]<t> -> DeepNN -> y<t>
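One timestep of the stacked version, as a sketch: layer l takes the activation from the layer below at this timestep plus its own activation from t-1. The per-layer (W, U, b) weight triples are an assumption about how the layers are parameterized:

```python
import numpy as np

def deep_rnn_step(x_t, a_prev_layers, weights):
    """One timestep of a deep RNN with one (W, U, b) triple per layer:
    W acts on the input from below, U on the layer's own previous activation."""
    inp = x_t
    new_as = []
    for (W, U, b), a_prev in zip(weights, a_prev_layers):
        a = np.tanh(W @ inp + U @ a_prev + b)
        new_as.append(a)
        inp = a                      # feed upward to the next layer
    return new_as
```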