Speech synthesis with deep neural networks
Other general links
https://youtu.be/nsrSrYtKkT8
A video from Heiga Zen describing different speech synthesis approaches
A comparison of different techniques:
https://www.reddit.com/user/kkastner/
(halfway down the page)
https://github.com/zzw922cn/awesome-speech-recognition-speech-synthesis-papers/blob/master/README.md#Speech-Synthesis
Generative Adversarial Networks
MelGAN
Examples
Official repo
Unofficial repo
Slides from NeurIPS
Main contribution
GAN-TTS
Main contribution
WaveNet
Simplified explanations
https://youtu.be/CqFIVCD1WWo
https://youtu.be/nsrSrYtKkT8
Some interesting future work ideas
Blog post
Key ideas (from Heiga Zen's talk)
Causal dilated convolution: captures long-term dependencies
Gated convolution with residual and skip connections: powerful non-linearity
Softmax at the output: classification rather than regression
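To make these three ideas concrete, here is a minimal PyTorch sketch (not DeepMind's implementation; channel sizes and dilations are illustrative) of one WaveNet-style residual block combining a causal dilated convolution, a gated activation unit, and residual/skip connections:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """One WaveNet-style block: causal dilated conv + gated activation + residual/skip."""
    def __init__(self, channels, skip_channels, dilation):
        super().__init__()
        self.dilation = dilation
        # Two parallel dilated convolutions; causality is enforced by left-padding below.
        self.filter_conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)
        self.residual_conv = nn.Conv1d(channels, channels, kernel_size=1)
        self.skip_conv = nn.Conv1d(channels, skip_channels, kernel_size=1)

    def forward(self, x):
        # Pad only on the left so the convolution never sees future samples (causal).
        padded = F.pad(x, (self.dilation, 0))
        # Gated activation unit: tanh(filter) * sigmoid(gate) -- the "powerful non-linearity".
        out = torch.tanh(self.filter_conv(padded)) * torch.sigmoid(self.gate_conv(padded))
        skip = self.skip_conv(out)                 # skip connections are summed across blocks
        residual = self.residual_conv(out) + x     # residual connection eases training deep stacks
        return residual, skip

# Doubling dilations grow the receptive field exponentially with depth.
blocks = [ResidualBlock(64, 128, d) for d in (1, 2, 4, 8, 16)]
x, skips = torch.randn(1, 64, 4000), 0            # (batch, channels, time) toy input
for block in blocks:
    x, skip = block(x)
    skips = skips + skip                           # fed to the softmax output head in WaveNet
```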
Further reading
Dilated convolutional neural networks (PixelCNN)
Summary of paper
WaveNet
Dilated Causal Convolutions
Softmax distributions
Gated Activation Units
Residual and Skip Connections
Conditional WaveNets
Context stacks
Intro
Inspired by neural autoregressive models for images and text
Can a similar approach work for audio?
Architecture based on PixelCNN
Experiments
Multi-speaker speech generation
Text-to-speech
Music
Speech recognition
Conclusion
When applied to TTS, it achieves the best subjective naturalness
Appendix
Nice text-to-speech background
Main contribution
Uses dilated convolutional neural networks to generate audio one sample at a time and achieves state of the art results for naturalness on TTS
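As a concrete illustration of the "classification rather than regression" output: the paper companding-quantises raw audio to 256 mu-law levels so each sample becomes a softmax class. A minimal numpy sketch of that encoding (my own, not the paper's code):

```python
import numpy as np

def mu_law_encode(audio, mu=255):
    """Map float audio in [-1, 1] to 256 discrete classes (8-bit mu-law companding)."""
    compressed = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)   # class indices in [0, 255]

def mu_law_decode(labels, mu=255):
    """Inverse mapping from class indices back to float audio."""
    signal = 2 * (labels.astype(np.float64) / mu) - 1
    return np.sign(signal) * np.expm1(np.abs(signal) * np.log1p(mu)) / mu

audio = np.sin(np.linspace(0, 100, 16000))   # one second of a toy 16 kHz signal
labels = mu_law_encode(audio)                # one softmax target per sample
reconstructed = mu_law_decode(labels)        # quantisation error is perceptually small
```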
Remaining issues
It is a neural backend (or neural vocoder) rather than an end-to-end TTS system: it still relies on a frontend with handcrafted features and F0 tracking
The high audio sample rate combined with the auto-regressive, sample-by-sample generation makes it much slower than real-time
SampleRNN
Key ideas
Predicts one sample at a time using multiple recurrent neural networks
In contrast to WaveNet, it has modules running at different clock rates (sample level, frame level) to capture information at different timescales (e.g. sample vs phoneme)
Similarly to WaveNet, still not end-to-end, Char2Wav takes on this challenge
Compares results to their own implementation of WaveNet
Unconditional, not conditioned on text. Just audio synthesis
Implements a combination of autoregressive multilayer perceptrons and a hierarchy of recurrent neural networks running at different clock rates to generate audio one sample at a time whilst capturing longer-term temporal variations
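A rough sketch of the multi-clock-rate idea, with made-up sizes; the real SampleRNN upsamples the frame-level context to every sample and runs its sample-level module once per sample, which this sketch collapses for brevity:

```python
import torch
import torch.nn as nn

FRAME_SIZE = 16   # illustrative: the sample-level tier ticks FRAME_SIZE times per frame-level tick

class TwoTierSampleRNN(nn.Module):
    """Sketch of the hierarchy: a slow frame-level RNN and a fast sample-level predictor."""
    def __init__(self, hidden=256, quant_levels=256):
        super().__init__()
        self.frame_rnn = nn.GRU(FRAME_SIZE, hidden, batch_first=True)   # one step per frame
        self.sample_mlp = nn.Sequential(                                 # one step per sample
            nn.Linear(hidden + FRAME_SIZE, hidden), nn.ReLU(),
            nn.Linear(hidden, quant_levels),
        )

    def forward(self, frames, recent_samples):
        # frames: (batch, n_frames, FRAME_SIZE) -- coarse context at the slow clock rate
        # recent_samples: (batch, n_frames, FRAME_SIZE) -- fine context at the fast clock rate
        frame_context, _ = self.frame_rnn(frames)
        logits = self.sample_mlp(torch.cat([frame_context, recent_samples], dim=-1))
        return logits   # categorical distribution over the next quantised sample

model = TwoTierSampleRNN()
frames = torch.randn(2, 10, FRAME_SIZE)       # made-up batch of 10 frames
logits = model(frames, frames)                # (2, 10, 256); frames reused as a stand-in here
```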
Summary of paper
Remaining issues
This is an unconditional audio synthesis model; no recognisable speech is generated, only mumbling-like sounds
Char2Wav
Key ideas
Builds on SampleRNN and attempts a full end-to-end system (producing audio directly from text, which was quite novel at the time)
Uses an encoder-decoder model with attention as the reader and SampleRNN as the vocoder
Summary of paper
Introduction
Describing how text-to-speech systems work (frontend, backend) and explaining how this system combines two models to achieve both
Related work
Discussion about attention based models and the work of Alex Graves
Model description
Description of the attention-based reader (including diagrams), heavily inspired by this paper (see the attention sketch below)
Good summary of the SampleRNN neural vocoder
Training details
Pretrained the reader and neural vocoder separately, using normalized WORLD vocoder features as targets for the reader and inputs for the vocoder. Code available online.
Results
No quantitative analysis of results; instead, samples and example images are available online
Remaining issues
No quantitative analysis of results
An end-to-end model for text-to-speech synthesis made up of two components - a reader (front end) that learns acoustic features from text and a neural vocoder (SampleRNN) that takes these features and generates audio.
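The reader uses the location-based Gaussian-mixture attention of Graves (2013), referenced in the model description above. A minimal sketch, with all shapes and the helper name invented for illustration:

```python
import torch

def gmm_attention(kappa_prev, params, encoder_states):
    """Sketch of Graves-style attention: a mixture of Gaussians over character positions."""
    # params: (batch, 3 * n_mix), predicted by the decoder at each output step
    alpha, beta, kappa_delta = torch.chunk(torch.exp(params), 3, dim=-1)
    kappa = kappa_prev + kappa_delta          # means only move forward -> monotonic alignment
    positions = torch.arange(encoder_states.size(1), dtype=torch.float32)
    # Weight of each encoder position under the mixture of Gaussians.
    phi = (alpha.unsqueeze(-1)
           * torch.exp(-beta.unsqueeze(-1) * (kappa.unsqueeze(-1) - positions) ** 2)).sum(dim=1)
    context = torch.bmm(phi.unsqueeze(1), encoder_states).squeeze(1)   # weighted sum of states
    return context, kappa

# Usage with made-up shapes: batch=2, 20 encoder states of dimension 32, 5 mixture components.
enc = torch.randn(2, 20, 32)
context, kappa = gmm_attention(torch.zeros(2, 5), torch.randn(2, 15), enc)
```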
Further reading
Attention-based models, in particular the one used in this model
WORLD vocoder
Deep Voice
Main ideas
Lays the groundwork for a truly end-to-end system
Faster than WaveNet and production ready
Is standalone, unlike WaveNet, which relied on a TTS frontend from Zen et al. (2013), and Char2Wav, which depends on the WORLD vocoder
Uses neural networks for every stage of the TTS system, meaning that we don't need hand engineered features
Discusses production requirements and optimisation a lot
A TTS system where every stage is implemented using neural networks and where a modified WaveNet is used to generate speech at faster than real-time in a production-ready manner
Summary of paper
Introduction
Good summary of TTS
Description of how this model uses neural networks for every stage of the system and can be retrained on new data without any hyperparameter changes
Related Work
Discusses previous uses of NNs as substitutes for elements of a TTS system, noting that none of them combine to solve the entire TTS problem
Comparing to WaveNet, SampleRNN and Char2Wav and stating that DeepVoice does not need an external system (Zen et al 2013 for WaveNet, WORLD vocoder for Char2Wav)
TTS System components
Description of the five models used in the system
3.1 Grapheme-to-Phoneme Model
3.2 Segmentation Model
3.3 Phoneme duration and Fundamental Frequency Model
3.4 Audio Synthesis Model
Results
Results for each of the models
Seems like the audio synthesis part is good, but the phoneme duration and fundamental frequency models need to be improved
Optimizing inference
Technical details on speeding it up and making it production-ready
Conclusion
All neural networks, simplified system
Idea of removing separation between stages, turning it into a full sequence-to-sequence model
Remaining challenges
Seems like the audio synthesis part is good, but the phoneme duration and fundamental frequency models need to be improved
Idea of removing separation between stages, turning it into a full sequence-to-sequence model
Deep Voice 2
Key ideas
Improves the single-speaker model compared to Deep Voice 1
Experiments with speaker embedding (new voices trained with 30 minutes of data)
Adds a WaveNet vocoder to the Tacotron model and compares it to Griffin-Lim (also an improvement)
Trainable speaker embedding idea taken from speech recognition
Deep Voice 3
Key ideas
Faster training due to the use of CNNs rather than RNNs (parallel)
Trained on a huge amount of data
Scaling improvements and fast inference; however, it uses WORLD rather than WaveNet (which, by their own results, sounds better than WORLD)
Improves issues commonly found with attention based systems
Compares different vocoders (Griffin-Lim, WORLD and WaveNet)
A "fully-convolutional attention-based neural text-to-speech system" that trains much faster than comparable systems, is trained on a huge amounts of data, demonstrates ways to overcome common issues with attention-based systems and describes how the system is implemented in production setting
Tacotron
Further reading
Sequence to sequence learning with neural networks
(Sutskever et al)
Requires understanding of LSTMs. Read "Understanding LSTM Networks" by Chris Olah
Attention:
Neural Machine Translation by Jointly Learning to Align and Translate
(Bahdanau et al 2015)
Grammar as a Foreign Language
(Vinyals et al 2015)
Wang et al 2016
Main ideas
"end-to-end generative text-to-speech model that synthesizes speech directly from characters"
Faster than sample level auto-regressive models because it generates at the frame level
Outperforms a production parametric system in terms of naturalness
Can train the whole model from scratch, not lots of separate models (like Deep Voice 1)
An end-to-end TTS system that modifies existing sequence-to-sequence architectures to synthesize speech directly from characters, achieving a high mean opinion score for naturalness and - due to generating speech at the frame level - doing so faster than sample-level autoregressive approaches
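A quick back-of-the-envelope sketch of why frame-level generation is so much faster; the sample rate, hop size and reduction factor below are illustrative assumptions, not the paper's exact numbers:

```python
# Back-of-the-envelope comparison of sequential decoding steps for 5 seconds of audio.
sample_rate = 24_000        # audio samples per second (assumed)
hop_ms = 12.5               # spectrogram frame hop in milliseconds (assumed)
reduction_factor = 2        # frames emitted per decoder step (Tacotron's r; value assumed)

seconds = 5.0
sample_level_steps = int(seconds * sample_rate)                        # sample-level autoregression
frame_level_steps = int(seconds * 1000 / hop_ms / reduction_factor)    # frame-level autoregression

print(sample_level_steps)   # 120000 sequential steps
print(frame_level_steps)    # 200 sequential steps
```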
VoiceLoop
Simplified explanations
http://kbullaughey.github.io/lstm-play/2017/10/27/voice-loop-summary.html
Key ideas
Samples voices in the wild
Unconstrained voice samples without aligned linguistic and phonetic features
Much simpler model than others mentioned here
Heavily inspired by Phonological Loop
Compares and contrasts with Deep Voice 2, Tacotron and Char2Wav
Very fast at inference time
Attention model taken from Graves
A text-to-speech system with a much simpler network architecture than other so far that is able to create speech from voices sampled 'in-the-wild' that do not have aligned linguistic features or phonemes
Parallel WaveNet
Simplified explanations
https://youtu.be/hzpxXZJQNFg
Blog post
Key ideas
Much faster than WaveNet
Trains a new network on an existing WaveNet (teacher - student)
Uses Inverse Autoregressive Flows (IAFs)
Experiments include multi-speaker generation
Used for Google Assistant
WaveNets are quick to train as they use CNNs as opposed to RNNs but are slow to produce samples due to their autoregressive method (which prevents samples from being generated in parallel). This paper aims to solve this problem by introducing "Probability Density Distillation", a method for marrying the fast training of WaveNet with the fast generation of Inverse Autoregressive Flows (IAFs). The autoregressive WaveNet acts as a teacher and a parallel WaveNet acts as a student. The result is audio generation three orders of magnitude faster than the original WaveNet model with no loss of quality or naturalness.
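A toy sketch of the inverse autoregressive flow idea (not the paper's architecture; the conditioning network is a placeholder): because the shift and scale at each timestep depend only on the already-known noise, the whole waveform can be transformed in one parallel pass, unlike the teacher's sample-by-sample loop:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyIAFLayer(nn.Module):
    """Toy inverse autoregressive flow step: x_t = z_t * s(z_<t) + m(z_<t)."""
    def __init__(self, kernel_size=3):
        super().__init__()
        self.kernel_size = kernel_size
        # Placeholder standing in for the student WaveNet's conditioning stack.
        self.net = nn.Conv1d(1, 2, kernel_size)

    def forward(self, z):
        # z: (batch, 1, time) white noise -- all of it is known up front.
        past = F.pad(z, (self.kernel_size, 0))      # shift/scale see strictly-past noise only
        m, log_s = self.net(past)[:, :, :z.size(-1)].chunk(2, dim=1)
        # Every timestep is transformed at once: no sample-by-sample loop at generation time.
        return z * torch.exp(log_s) + m

z = torch.randn(4, 1, 16000)      # noise for one second of audio at an assumed 16 kHz
x = ToyIAFLayer()(z)              # waveform sketch produced in a single parallel pass
```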
Summary of paper
Introduction
Introduces WaveNet
Discusses problems caused by using an autoregressive model
Introduces Inverse Autoregressive Flows (IAFs)
Briefly explains how the paper proposes to marry the best features of WaveNet and IAFs
WaveNet
Recaps how WaveNet works
Discusses a higher-fidelity WaveNet (16-bit, 24 kHz audio as opposed to the 8-bit, 16 kHz used in the original paper)
Parallel WaveNet
Explains how Inverse Autoregressive Flows work
Probability Density Distillation
Introduces the student-teacher approach and contrasts it with Generative Adversarial Networks
Discusses the use of Kullback-Leibler divergence as the loss function
Discusses the additional loss terms that were used
Experiments
Uses linguistic features and pitch information
Gives details of hyperparameters
Gives detail on audio generation speed (three orders of magnitude faster)
Discusses improved audio fidelity
Discusses multi-speaker generation
Ablation study on different loss functions
Conclusion
Significantly faster than WaveNet with no loss of quality
Also applied this algorithm to other languages and speakers
Used in production for the Google Assistant
Tacotron 2
Key ideas
Combines the end-to-end system from Tacotron with WaveNet
Replaces the complex linguistic features used in WaveNet with mel-scaled spectrograms only
Fairly simple building blocks for predicting the mel-spectrogram
The WaveNet architecture is simplified: conditioning on mel-spectrogram frames rather than text features means it doesn't need such a large receptive field
Produces very high quality speech that comes close to human quality (and most issues seem to be regarding pronunciation and prosody)
Combines a text to mel-spectrogram prediction network and a modified WaveNet to produce more natural sounding results than any previous TTS system
Summary of paper
Introduction
A very good summary of all the major neural TTS work to date
Model architecture
2.1 Intermediate Feature Representation
Explanation of mel-spectrograms
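For reference, this is roughly how mel-spectrogram features can be computed with librosa; the STFT and hop parameters below are illustrative choices rather than the paper's exact values (the paper does use an 80-channel mel filterbank):

```python
import numpy as np
import librosa

# Load a waveform; the path, sample rate and STFT parameters are illustrative choices.
y, sr = librosa.load("example.wav", sr=22050)

mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=1024,       # analysis window length
    hop_length=256,   # frame shift
    n_mels=80,        # number of mel bands (the paper uses an 80-channel mel filterbank)
)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))   # log dynamic range compression
print(log_mel.shape)   # (80, n_frames) -- the prediction target / WaveNet conditioning
```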
2.2 Spectrogram Prediction Network
Description of this part of the model - a combination of fairly simple building blocks (in contrast to Tacotron 1)
2.3 WaveNet Vocoder
Explanation of why fewer layers are needed
Similar to PixelCNN++ and Parallel WaveNet, a mixture of logistic distributions (MoL) is used rather than discretized buckets
Experiments and Results
Hyperparameters
Evaluation - MOS and side-by-side evaluation
Comparison to Deep Voice 3
Brief discussion of pronunciation and prosody issues
Ablation studies to verify that the model architecture and intermediate features (mel-spectrograms) are appropriate. Interesting discussion of receptive field for the WaveNet part
Conclusion
High quality prosody coupled with audio quality, state-of-the-art naturalness
Further reading
WaveRNN
Key ideas
"reducing sampling time while maintaining high output quality"
Weight pruning
Subscaling
Takes a tensor of length L and folds it into B sub-tensors of length L/B (see the folding sketch below)
This allows it to predict 16 samples at a time
Large but sparse model
Can run on mobile devices
Introduces a recurrent neural network that uses weight pruning and subscaling to produce high quality audio from a small model that can be run on low power CPUs, such as those found on mobile devices
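A tiny numpy sketch of the subscale folding described above, with made-up sizes (B = 16 sub-tensors, matching the "16 samples at a time" note):

```python
import numpy as np

L, B = 64, 16                  # illustrative length; B = 16 sub-tensors as in the notes above
x = np.arange(L)               # stand-in for a waveform of L samples

# Fold the length-L tensor into B sub-tensors of length L/B: sub-tensor b holds samples
# b, b+B, b+2B, ... so that, given a short horizon of earlier sub-tensors, B samples can
# be generated per step.
subtensors = x.reshape(L // B, B).T   # shape (B, L // B)

print(subtensors[0])   # [ 0 16 32 48]
print(subtensors[1])   # [ 1 17 33 49]
```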
(VQ-VAE)
(Neural Discrete Representation Learning)
Videos
https://youtu.be/HqaIkq3qH40
https://www.youtube.com/watch?v=QoCyQBzi7us&t=71s
Key ideas
Learning discrete representations
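A minimal sketch of the vector-quantisation step at the heart of VQ-VAE: each encoder output is snapped to its nearest codebook vector, with the straight-through trick copying gradients past the non-differentiable lookup. Codebook size and dimensions are invented for illustration:

```python
import torch

def vector_quantize(z_e, codebook):
    """Snap each encoder vector to its nearest codebook entry (VQ-VAE sketch)."""
    # z_e: (batch, dim) encoder outputs; codebook: (num_codes, dim) learned embeddings
    distances = torch.cdist(z_e, codebook)     # pairwise L2 distances
    indices = distances.argmin(dim=-1)         # nearest discrete code per vector
    z_q = codebook[indices]                    # quantised latents
    # Straight-through estimator: forward pass uses z_q, gradients flow back to z_e.
    z_q = z_e + (z_q - z_e).detach()
    return z_q, indices

codebook = torch.randn(512, 64, requires_grad=True)   # 512 discrete codes of dimension 64
z_e = torch.randn(8, 64, requires_grad=True)          # encoder outputs for 8 inputs
z_q, indices = vector_quantize(z_e, codebook)
```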
(ClariNet)
Key ideas
Alternative parallel approach to WaveNet
Improves over Deep Voice 3
(Sample Efficient Adaptive Text-to-Speech)
Key ideas
WaveGlow
Key ideas
"combines insights from Glow and WaveNet in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression"
Simple model
Fast audio creation
Open source code
MOS as good as WaveNet
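A toy affine coupling layer of the kind Glow and WaveGlow are built from (the conditioning network here is a simple placeholder, not WaveGlow's conditioned WN module): half the channels pass through unchanged and predict a scale and bias for the other half, which keeps the transform cheap to invert:

```python
import torch
import torch.nn as nn

class ToyAffineCoupling(nn.Module):
    """Glow-style affine coupling sketch: invertible because x_a passes through untouched."""
    def __init__(self, channels):
        super().__init__()
        self.net = nn.Conv1d(channels // 2, channels, kernel_size=3, padding=1)

    def forward(self, x):
        x_a, x_b = x.chunk(2, dim=1)               # split channels in half
        log_s, b = self.net(x_a).chunk(2, dim=1)   # scale and bias predicted from x_a only
        return torch.cat([x_a, x_b * torch.exp(log_s) + b], dim=1)

    def inverse(self, y):
        y_a, y_b = y.chunk(2, dim=1)
        log_s, b = self.net(y_a).chunk(2, dim=1)
        return torch.cat([y_a, (y_b - b) * torch.exp(-log_s)], dim=1)

x = torch.randn(2, 8, 100)                          # made-up (batch, channels, time) shape
layer = ToyAffineCoupling(8)
assert torch.allclose(layer.inverse(layer(x)), x, atol=1e-5)   # cheap exact-ish inversion
```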
(FastSpeech)
Main contribution
Transformer TTS
Main contribution
Capacitron
Main contribution
Read Taylor (2009) for an overview of text-to-speech systems; Zen's video is also good
Other useful resources
Taylor (2009)
Zen's video
CNNs
http://cs231n.github.io/convolutional-networks/
https://github.com/vdumoulin/conv_arithmetic
http://colah.github.io/posts/2014-07-Understanding-Convolutions/
https://distill.pub/2016/deconv-checkerboard/
RNNs
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
LSTMs
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Grokking Deep Learning
Zach Hodari's slides
http://www.speech.zone/courses/speech-synthesis/module-8-speech-synthesis-using-neural-networks/
http://www.jordipons.me/google-speech-summit-2018/
Sequence-to-sequence
Sequence to Sequence Learning with Neural Networks
(Ilya Sutskever, Oriol Vinyals, Quoc V. Le)
arXiv vanity link
With attention
Neural Machine Translation by Jointly Learning to Align and Translate
(Bahdanau et al 2015)
Measuring a decade of progress in Text-to-Speech
(Simon King 2014)