Speech synthesis with deep neural networks
Other general links
https://youtu.be/nsrSrYtKkT8
A video from Heiga Zen describing different speech synthesis approaches
A comparison of different techniques:
https://www.reddit.com/user/kkastner/
(halfway down the page)
https://github.com/zzw922cn/awesome-speech-recognition-speech-synthesis-papers/blob/master/README.md#Speech-Synthesis
Generative Adversarial Networks
MelGAN
Examples
Official repo
Unofficial repo
Slides from NeurIPS
Main contribution
GAN-TTS
Main contribution
WaveNet
Simplified explanations
https://youtu.be/CqFIVCD1WWo
https://youtu.be/nsrSrYtKkT8
Some interesting future work ideas
Blog post
Key ideas (from Heiga Zen's talk)
Causal dilated convolution: captures long-term dependencies
Gated convolution with residual and skip connections: powerful non-linearity
Softmax at the output: classification rather than regression
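To make these three ideas concrete, here is a minimal PyTorch sketch (not DeepMind's implementation; channel sizes and dilations are illustrative) of one WaveNet-style residual block combining a causal dilated convolution, a gated activation unit, and residual/skip connections:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """One WaveNet-style block: causal dilated conv + gated activation + residual/skip."""
    def __init__(self, channels, skip_channels, dilation):
        super().__init__()
        self.dilation = dilation
        # Two parallel dilated convolutions; causality is enforced by left-padding below.
        self.filter_conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)
        self.residual_conv = nn.Conv1d(channels, channels, kernel_size=1)
        self.skip_conv = nn.Conv1d(channels, skip_channels, kernel_size=1)

    def forward(self, x):
        # Pad only on the left so the convolution never sees future samples (causal).
        padded = F.pad(x, (self.dilation, 0))
        # Gated activation unit: tanh(filter) * sigmoid(gate) -- the "powerful non-linearity".
        out = torch.tanh(self.filter_conv(padded)) * torch.sigmoid(self.gate_conv(padded))
        skip = self.skip_conv(out)                 # skip connections are summed across blocks
        residual = self.residual_conv(out) + x     # residual connection eases training deep stacks
        return residual, skip

# Doubling dilations grow the receptive field exponentially with depth.
blocks = [ResidualBlock(64, 128, d) for d in (1, 2, 4, 8, 16)]
x, skips = torch.randn(1, 64, 4000), 0            # (batch, channels, time) toy input
for block in blocks:
    x, skip = block(x)
    skips = skips + skip                           # fed to the softmax output head in WaveNet
```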
Further reading
Dilated convolutional neural networks (PixelCNN)
Summary of paper
WaveNet
Dilated Causal Convolutions
Softmax distributions
Gated Activation Units
Residual and Skip Connections
Conditional WaveNets
Context stacks
Intro
Inspired by neural autoregressive models for images and text
Can a similar approach work for audio?
Architecture based on PixelCNN
Experiments
Multi-speaker speech generation
Text-to-speech
Music
Speech recognition
Conclusion
When applied to TTS, it achieves the best subjective naturalness
Appendix
Nice text-to-speech background
Main contribution
Uses dilated convolutional neural networks to generate audio one sample at a time and achieves state of the art results for naturalness on TTS
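As a concrete illustration of the "classification rather than regression" output: the paper companding-quantises raw audio to 256 mu-law levels so each sample becomes a softmax class. A minimal numpy sketch of that encoding (my own, not the paper's code):

```python
import numpy as np

def mu_law_encode(audio, mu=255):
    """Map float audio in [-1, 1] to 256 discrete classes (8-bit mu-law companding)."""
    compressed = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)   # class indices in [0, 255]

def mu_law_decode(labels, mu=255):
    """Inverse mapping from class indices back to float audio."""
    signal = 2 * (labels.astype(np.float64) / mu) - 1
    return np.sign(signal) * np.expm1(np.abs(signal) * np.log1p(mu)) / mu

audio = np.sin(np.linspace(0, 100, 16000))   # one second of a toy 16 kHz signal
labels = mu_law_encode(audio)                # one softmax target per sample
reconstructed = mu_law_decode(labels)        # quantisation error is perceptually small
```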
Remaining issues
It is a neural backend (or neural vocoder) rather than an end-to-end TTS system: it still relies on a frontend with handcrafted features and F0 tracking
The high audio sample rate combined with the auto-regressive, sample-by-sample generation makes it much slower than real-time
SampleRNN
Key ideas
Predicts one sample at a time using multiple recurrent neural networks
In contrast to WaveNet, it has modules running at different clock rates (sample level, frame level) to capture information at different timescales (e.g. sample vs phoneme)
Similarly to WaveNet, still not end-to-end, Char2Wav takes on this challenge
Compares results to their own implementation of WaveNet
Unconditional, not conditioned on text. Just audio synthesis
Implements a combination of autoregressive multilayer perceptrons and a hierarchy of recurrent neural networks running at different clock rates to generate audio one sample at a time whilst capturing longer-term temporal variations
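A rough sketch of the multi-clock-rate idea, with made-up sizes; the real SampleRNN upsamples the frame-level context to every sample and runs its sample-level module once per sample, which this sketch collapses for brevity:

```python
import torch
import torch.nn as nn

FRAME_SIZE = 16   # illustrative: the sample-level tier ticks FRAME_SIZE times per frame-level tick

class TwoTierSampleRNN(nn.Module):
    """Sketch of the hierarchy: a slow frame-level RNN and a fast sample-level predictor."""
    def __init__(self, hidden=256, quant_levels=256):
        super().__init__()
        self.frame_rnn = nn.GRU(FRAME_SIZE, hidden, batch_first=True)   # one step per frame
        self.sample_mlp = nn.Sequential(                                 # one step per sample
            nn.Linear(hidden + FRAME_SIZE, hidden), nn.ReLU(),
            nn.Linear(hidden, quant_levels),
        )

    def forward(self, frames, recent_samples):
        # frames: (batch, n_frames, FRAME_SIZE) -- coarse context at the slow clock rate
        # recent_samples: (batch, n_frames, FRAME_SIZE) -- fine context at the fast clock rate
        frame_context, _ = self.frame_rnn(frames)
        logits = self.sample_mlp(torch.cat([frame_context, recent_samples], dim=-1))
        return logits   # categorical distribution over the next quantised sample

model = TwoTierSampleRNN()
frames = torch.randn(2, 10, FRAME_SIZE)       # made-up batch of 10 frames
logits = model(frames, frames)                # (2, 10, 256); frames reused as a stand-in here
```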
Summary of paper
Remaining issues
This is an unconditional audio synthesis model; no recognisable speech is generated, only mumbling-like sounds
Char2Wav
Key ideas
Builds on SampleRNN and attempts a full end-to-end system (producing audio directly from text, which was quite novel at the time)
Uses an encoder-decoder model with attention as the reader and SampleRNN as the vocoder
Summary of paper
Introduction
Describing how text-to-speech systems work (frontend, backend) and explaining how this system combines two models to achieve both
Related work
Discussion about attention based models and the work of Alex Graves
Model description
Description of the attention-based reader (including diagrams), heavily inspired by this paper (see the attention sketch below)
Good summary of the SampleRNN neural vocoder
Training details
Pretrained the reader and neural vocoder separately, using normalized WORLD vocoder features as targets for the reader and inputs for the vocoder. Code available online.
Results
No quantitative analysis of results; instead, samples and example images are available online
Remaining issues
No quantitative analysis of results
An end-to-end model for text-to-speech synthesis made up of two components - a reader (front end) that learns acoustic features from text and a neural vocoder (SampleRNN) that takes these features and generates audio.
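The reader uses the location-based Gaussian-mixture attention of Graves (2013), referenced in the model description above. A minimal sketch, with all shapes and the helper name invented for illustration:

```python
import torch

def gmm_attention(kappa_prev, params, encoder_states):
    """Sketch of Graves-style attention: a mixture of Gaussians over character positions."""
    # params: (batch, 3 * n_mix), predicted by the decoder at each output step
    alpha, beta, kappa_delta = torch.chunk(torch.exp(params), 3, dim=-1)
    kappa = kappa_prev + kappa_delta          # means only move forward -> monotonic alignment
    positions = torch.arange(encoder_states.size(1), dtype=torch.float32)
    # Weight of each encoder position under the mixture of Gaussians.
    phi = (alpha.unsqueeze(-1)
           * torch.exp(-beta.unsqueeze(-1) * (kappa.unsqueeze(-1) - positions) ** 2)).sum(dim=1)
    context = torch.bmm(phi.unsqueeze(1), encoder_states).squeeze(1)   # weighted sum of states
    return context, kappa

# Usage with made-up shapes: batch=2, 20 encoder states of dimension 32, 5 mixture components.
enc = torch.randn(2, 20, 32)
context, kappa = gmm_attention(torch.zeros(2, 5), torch.randn(2, 15), enc)
```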
Further reading
Attention-based models, in particular the one used in this model
WORLD vocoder
Deep Voice
Main ideas
Lays the groundwork for a truly end-to-end system
Faster than WaveNet and production ready
Is standalone, unlike WaveNet, which relied on a TTS frontend from Zen et al. (2013), and Char2Wav, which depends on the WORLD vocoder
Uses neural networks for every stage of the TTS system, meaning that we don't need hand engineered features
Discusses production requirements and optimisation a lot
A TTS system where every stage is implemented using neural networks and where a modified WaveNet is used to generate speech at faster than real-time in a production-ready manner
Summary of paper
Introduction
Good summary of TTS
Description of how this model uses neural networks for every stage of the system and can be retrained on new data without any hyperparameter changes
Related Work
Discusses previous uses of NNs as substitutes for elements of a TTS system, noting that none of them combine to solve the entire TTS problem
Comparing to WaveNet, SampleRNN and Char2Wav and stating that DeepVoice does not need an external system (Zen et al 2013 for WaveNet, WORLD vocoder for Char2Wav)
TTS System components
Description of the five models used in the system
3.1 Grapheme-to-Phoneme Model
3.2 Segmentation Model
3.3 Phoneme duration and Fundamental Frequency Model
3.4 Audio Synthesis Model
Results
Results for each of the models
Seems like the audio synthesis part is good, but the phoneme duration and fundamental frequency models need to be improved
Optimizing inference
Technical details on speeding it up and making it production-ready
Conclusion
All neural networks, simplified system
Idea of removing separation between stages, turning it into a full sequence-to-sequence model
Remaining challenges
Seems like the audio synthesis part is good, but the phoneme duration and fundamental frequency models need to be improved
Idea of removing separation between stages, turning it into a full sequence-to-sequence model
Deep Voice 2
Key ideas
Improves the single-speaker model compared to Deep Voice 1
Experiments with speaker embedding (new voices trained with 30 minutes of data)
Adds a WaveNet vocoder to the Tacotron model and compares it to Griffin-Lim (also an improvement)
Trainable speaker embedding idea taken from speech recognition
Deep Voice 3
Key ideas
Faster training due to the use of CNNs rather than RNNs (parallel)
Trained on a huge amount of data
Scaling improvements and fast inference; however, it uses WORLD rather than WaveNet (which, by their own results, sounds better than WORLD)
Improves issues commonly found with attention based systems
Compares different vocoders (Griffin-Lim, WORLD and WaveNet)
A "fully-convolutional attention-based neural text-to-speech system" that trains much faster than comparable systems, is trained on a huge amounts of data, demonstrates ways to overcome common issues with attention-based systems and describes how the system is implemented in production setting
Tacotron
Further reading
Sequence to sequence learning with neural networks
(Sutskever et al)
Requires understanding of LSTMs. Read "Understanding LSTM Networks" by Chris Olah
Attention:
Neural Machine Translation by Jointly Learning to Align and Translate
(Bahdanau et al 2015)
Grammar as a Foreign Language
(Vinyals et al 2015)
Wang et al 2016
Main ideas
"end-to-end generative text-to-speech model that synthesizes speech directly from characters"
Faster than sample level auto-regressive models because it generates at the frame level
Outperforms a production parametric system in terms of naturalness
Can train the whole model from scratch, not lots of separate models (like Deep Voice 1)
An end-to-end TTS system that modifies existing sequence-to-sequence architectures to synthesize speech directly from characters, achieving a high mean opinion score for naturalness and - due to generating speech at the frame level - doing so faster than sample-level autoregressive approaches
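A quick back-of-the-envelope sketch of why frame-level generation is so much faster; the sample rate, hop size and reduction factor below are illustrative assumptions, not the paper's exact numbers:

```python
# Back-of-the-envelope comparison of sequential decoding steps for 5 seconds of audio.
sample_rate = 24_000        # audio samples per second (assumed)
hop_ms = 12.5               # spectrogram frame hop in milliseconds (assumed)
reduction_factor = 2        # frames emitted per decoder step (Tacotron's r; value assumed)

seconds = 5.0
sample_level_steps = int(seconds * sample_rate)                        # sample-level autoregression
frame_level_steps = int(seconds * 1000 / hop_ms / reduction_factor)    # frame-level autoregression

print(sample_level_steps)   # 120000 sequential steps
print(frame_level_steps)    # 200 sequential steps
```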
VoiceLoop
Simplified explanations
http://kbullaughey.github.io/lstm-play/2017/10/27/voice-loop-summary.html
Key ideas
Samples voices in the wild
Unconstrained voice samples without aligned linguistic and phonetic features
Much simpler model than others mentioned here
Heavily inspired by Phonological Loop
Compares and contrasts with Deep Voice 2, Tacotron and Char2Wav
Very fast at inference time
Attention model taken from Graves
A text-to-speech system with a much simpler network architecture than other so far that is able to create speech from voices sampled 'in-the-wild' that do not have aligned linguistic features or phonemes
Parallel WaveNet
Simplified explanations
https://youtu.be/hzpxXZJQNFg
Blog post
Key ideas
Much faster than WaveNet
Trains a new network on an existing WaveNet (teacher - student)
Uses Inverse Autoregressive Flows (IAFs)
Experiments include multi-speaker generation
Used for Google Assistant
WaveNets are quick to train as they use CNNs as opposed to RNNs but are slow to produce samples due to their autoregressive method (which prevents samples from being generated in parallel). This paper aims to solve this problem by introducing "Probability Density Distillation", a method for marrying the fast training of WaveNet with the fast generation of Inverse Autoregressive Flows (IAFs). The autoregressive WaveNet acts as a teacher and a parallel WaveNet acts as a student. The result is audio generation three orders of magnitude faster than the original WaveNet model with no loss of quality or naturalness.
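A toy sketch of the inverse autoregressive flow idea (not the paper's architecture; the conditioning network is a placeholder): because the shift and scale at each timestep depend only on the already-known noise, the whole waveform can be transformed in one parallel pass, unlike the teacher's sample-by-sample loop:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyIAFLayer(nn.Module):
    """Toy inverse autoregressive flow step: x_t = z_t * s(z_<t) + m(z_<t)."""
    def __init__(self, kernel_size=3):
        super().__init__()
        self.kernel_size = kernel_size
        # Placeholder standing in for the student WaveNet's conditioning stack.
        self.net = nn.Conv1d(1, 2, kernel_size)

    def forward(self, z):
        # z: (batch, 1, time) white noise -- all of it is known up front.
        past = F.pad(z, (self.kernel_size, 0))      # shift/scale see strictly-past noise only
        m, log_s = self.net(past)[:, :, :z.size(-1)].chunk(2, dim=1)
        # Every timestep is transformed at once: no sample-by-sample loop at generation time.
        return z * torch.exp(log_s) + m

z = torch.randn(4, 1, 16000)      # noise for one second of audio at an assumed 16 kHz
x = ToyIAFLayer()(z)              # waveform sketch produced in a single parallel pass
```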
Summary of paper
Introduction
Introduces WaveNet
Discusses problems caused by using an autoregressive model
Introduces Inverse Autoregressive Flows (IAFs)
Briefly explains how the paper proposes to marry the best features of WaveNet and IAFs
WaveNet
Recaps how WaveNet works
Discusses a higher-fidelity WaveNet (16-bit, 24 kHz audio as opposed to the 8-bit, 16 kHz used in the original paper)
Parallel WaveNet
Explains how Inverse Autoregressive Flows work
Probability Density Distillation
Introduces the student-teacher approach and contrasts it with Generative Adversarial Networks
Discusses the use of Kullback-Leibler divergence as the loss function
Discusses the additional loss terms that were used
Experiments
Uses linguistic features and pitch information
Gives details of hyperparameters
Gives detail on audio generation speed (three orders of magnitude faster)
Discusses improved audio fidelity
Discusses multi-speaker generation
Ablation study on different loss functions
Conclusion
Significantly faster than WaveNet with no loss of quality
Also applied this algorithm to other languages and speakers
Used in production for the Google Assistant
Tacotron 2
Key ideas
Combines the end-to-end system from Tacotron with WaveNet
Replaces the complex linguistic features used in WaveNet with mel-scaled spectrograms only
Fairly simple building blocks for predicting the mel-spectrogram
The WaveNet architecture is simplified: conditioning on mel-spectrogram frames rather than text features means it doesn't need such a large receptive field
Produces very high quality speech that comes close to human quality (and most issues seem to be regarding pronunciation and prosody)
Combines a text to mel-spectrogram prediction network and a modified WaveNet to produce more natural sounding results than any previous TTS system
Summary of paper
Introduction
A very good summary of all the major neural TTS work to date
Model architecture
2.1 Intermediate Feature Representation
Explanation of mel-spectrograms
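For reference, this is roughly how mel-spectrogram features can be computed with librosa; the STFT and hop parameters below are illustrative choices rather than the paper's exact values (the paper does use an 80-channel mel filterbank):

```python
import numpy as np
import librosa

# Load a waveform; the path, sample rate and STFT parameters are illustrative choices.
y, sr = librosa.load("example.wav", sr=22050)

mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=1024,       # analysis window length
    hop_length=256,   # frame shift
    n_mels=80,        # number of mel bands (the paper uses an 80-channel mel filterbank)
)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))   # log dynamic range compression
print(log_mel.shape)   # (80, n_frames) -- the prediction target / WaveNet conditioning
```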
2.2 Spectrogram Prediction Network
Description of this part of the model - a combination of fairly simple building blocks (in contrast to Tacotron 1)
2.3 WaveNet Vocoder
Explanation of why fewer layers are needed
Similar to PixelCNN++ and Parallel WaveNet, a mixture of logistic distributions (MoL) is used rather than discretized buckets
Experiments and Results
Hyperparameters
Evaluation - MOS and side-by-side evaluation
Comparison to Deep Voice 3
Brief discussion of pronunciation and prosody issues
Ablation studies to verify that the model architecture and intermediate features (mel-spectrograms) are appropriate. Interesting discussion of receptive field for the WaveNet part
Conclusion
High quality prosody coupled with audio quality, state-of-the-art naturalness
Further reading
WaveRNN
Key ideas
"reducing sampling time while maintaining high output quality"
Weight pruning
Subscaling
Takes a tensor of length L and folds it into B sub-tensors of length L/B (see the folding sketch below)
This allows it to predict 16 samples at a time
Large but sparse model
Can run on mobile devices
Introduces a recurrent neural network that uses weight pruning and subscaling to produce high quality audio from a small model that can be run on low power CPUs, such as those found on mobile devices
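A tiny numpy sketch of the subscale folding described above, with made-up sizes (B = 16 sub-tensors, matching the "16 samples at a time" note):

```python
import numpy as np

L, B = 64, 16                  # illustrative length; B = 16 sub-tensors as in the notes above
x = np.arange(L)               # stand-in for a waveform of L samples

# Fold the length-L tensor into B sub-tensors of length L/B: sub-tensor b holds samples
# b, b+B, b+2B, ... so that, given a short horizon of earlier sub-tensors, B samples can
# be generated per step.
subtensors = x.reshape(L // B, B).T   # shape (B, L // B)

print(subtensors[0])   # [ 0 16 32 48]
print(subtensors[1])   # [ 1 17 33 49]
```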
(VQ-VAE)
(Neural Discrete Representation Learning)
Videos
https://youtu.be/HqaIkq3qH40
https://www.youtube.com/watch?v=QoCyQBzi7us&t=71s
Key ideas
Learning discrete representations
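A minimal sketch of the vector-quantisation step at the heart of VQ-VAE: each encoder output is snapped to its nearest codebook vector, with the straight-through trick copying gradients past the non-differentiable lookup. Codebook size and dimensions are invented for illustration:

```python
import torch

def vector_quantize(z_e, codebook):
    """Snap each encoder vector to its nearest codebook entry (VQ-VAE sketch)."""
    # z_e: (batch, dim) encoder outputs; codebook: (num_codes, dim) learned embeddings
    distances = torch.cdist(z_e, codebook)     # pairwise L2 distances
    indices = distances.argmin(dim=-1)         # nearest discrete code per vector
    z_q = codebook[indices]                    # quantised latents
    # Straight-through estimator: forward pass uses z_q, gradients flow back to z_e.
    z_q = z_e + (z_q - z_e).detach()
    return z_q, indices

codebook = torch.randn(512, 64, requires_grad=True)   # 512 discrete codes of dimension 64
z_e = torch.randn(8, 64, requires_grad=True)          # encoder outputs for 8 inputs
z_q, indices = vector_quantize(z_e, codebook)
```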
(ClariNet)
Key ideas
Alternative parallel approach to WaveNet
Improves over Deep Voice 3
(Sample Efficient Adaptive Text-to-Speech)
Key ideas
WaveGlow
Key ideas
"combines insights from Glow and WaveNet in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression"
Simple model
Fast audio creation
Open source code
MOS as good as WaveNet
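A toy affine coupling layer of the kind Glow and WaveGlow are built from (the conditioning network here is a simple placeholder, not WaveGlow's conditioned WN module): half the channels pass through unchanged and predict a scale and bias for the other half, which keeps the transform cheap to invert:

```python
import torch
import torch.nn as nn

class ToyAffineCoupling(nn.Module):
    """Glow-style affine coupling sketch: invertible because x_a passes through untouched."""
    def __init__(self, channels):
        super().__init__()
        self.net = nn.Conv1d(channels // 2, channels, kernel_size=3, padding=1)

    def forward(self, x):
        x_a, x_b = x.chunk(2, dim=1)               # split channels in half
        log_s, b = self.net(x_a).chunk(2, dim=1)   # scale and bias predicted from x_a only
        return torch.cat([x_a, x_b * torch.exp(log_s) + b], dim=1)

    def inverse(self, y):
        y_a, y_b = y.chunk(2, dim=1)
        log_s, b = self.net(y_a).chunk(2, dim=1)
        return torch.cat([y_a, (y_b - b) * torch.exp(-log_s)], dim=1)

x = torch.randn(2, 8, 100)                          # made-up (batch, channels, time) shape
layer = ToyAffineCoupling(8)
assert torch.allclose(layer.inverse(layer(x)), x, atol=1e-5)   # cheap exact-ish inversion
```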
(FastSpeech)
Main contribution
Transformer TTS
Main contribution
Capacitron
Main contribution
Read Taylor (2009) for an overview of text-to-speech systems; Zen's video is also good
Other useful resources
Taylor (2009)
Zen's video
CNNs
http://cs231n.github.io/convolutional-networks/
https://github.com/vdumoulin/conv_arithmetic
http://colah.github.io/posts/2014-07-Understanding-Convolutions/
https://distill.pub/2016/deconv-checkerboard/
RNNs
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
LSTMs
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Grokking Deep Learning
Zach Hodari's slides
http://www.speech.zone/courses/speech-synthesis/module-8-speech-synthesis-using-neural-networks/
http://www.jordipons.me/google-speech-summit-2018/
Sequence-to-sequence
Sequence to Sequence Learning with Neural Networks
(Ilya Sutskever, Oriol Vinyals, Quoc V. Le)
arXiv vanity link
With attention
Neural Machine Translation by Jointly Learning to Align and Translate
(Bahdanau et al 2015)
Measuring a decade of progress in Text-to-Speech
(Simon King 2014)