
PyTorch implementation of the deep convolutional text-to-speech (DCTTS) model from the paper "Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention" (ETTTS): https://arxiv.org/abs/1710.08969


ETTTS_pytorch

PyTorch implementation of https://arxiv.org/abs/1710.08969

TODO

High level

ETTTS - convolutional TTS

  • https://arxiv.org/abs/1710.08969
    • read
    • understand math
    • draw architecture
    • implement in pytorch
      • get data
      • preprocess data
      • char embed
      • 1d conv
        • fix causality
      • 1d transpose conv (deconvolution)
      • highway connection/highway convolution
      • weights initialize
      • textenc
      • audioenc
      • attention
        • guided
        • forcibly incremental
      • audiodec
      • ssrn
      • impl loss functions
      • train text2Mel
      • train SSRN
      • get GPU training working
        • collab
        • google cloud
        • make backwards compatible w/ CPU
      • bigger batch size - GPU mem usage at < 10%
        • might have to increase cores for dataloader - 5 cores about saturates the GPU at batch size 16
        • pretty sure the model is limited by fetcher speed
      • checkpoint models during training
        • remember to call model.eval() after loading a checkpoint to make sure layers are in evaluation (as opposed to training) mode
        • combine checkpointing logic for text2mel and ssrn by combining the text2Mel,audioDec,attention models into one class
        • save model results also
          • plots of attention,mel,fft
          • generated sound
          • model speed it/sec on cpu and gpu
      • different checkpoint paths for different model params
        • work something out that prevents loading models w/ conflicting hyperparams
          • doesn't this already happen?
        • incorporate hash of model structure into model name
        • automatic cold start i.e. don't have to specify load = 1|0
      • implement model params
        • nonsep vs sep vs super sep
        • batch vs layer vs weight vs instance vs group norm
        • alpha
        • learning rate
        • chunk size (1 default for paper)
        • sample rate
        • method to migrate checkpoints w/ different model param sets
      • check if calling contiguous after transpose/permute speeds up model
      • support different sample rates
        • recalculate hop length and fft window size
        • down/up sample in data fetcher
        • add as hyperparam in tunable model params
      • abstract class/fun for training/checkpointing/loss monitoring
      • test out if concatenating mel and text enc makes sense
        • probably does - common in most attention mechanisms
      • combine the text2Mel,audioDec,attention models into one class
      • generate text2Mel
      • generate SSRN
      • fix inference memory leak
        • with ch.no_grad()
      • train text2Mel and SSRN together
      • chunked generation - train network to encode multiple timesteps at a time
      • hyperparams
        • hyperparams class
          • add initialization?
        • have models take hyperparams class as arguments
        • hyperparameter optimization package
        • create train dispatcher to train different hyperparameter combinations on different gpus
          • request gpu limit increase -> 4
          • hyperparam queue?
          • more cores for dataloader? - maybe not for layer norm
      • multi GPU speedup
      • separate training code from model code
      • separate eval code from training code
      • set behavior at preempt to restart and resume training
      • split train test
      • separability
        • non sep
        • sep
        • super sep
        • check if non-torch lambda function slowing down network
        • get rid of unnecessary separability params for separable convolutions
        • try 2dconv over 1 channel instead of 1d conv over multiple channels
        • bottleneck conv layers
          • figure out why model not detecting bottleneck weights
          • training really slow
            • try layer norm between all bottleneck layers?
            • increase lr
          • ssrn not training
            • gradient clipping
          • try not bottlenecking when channel depth changes
            • doesn't work because of the highway conv definition
        • see if there's a way to decompose non separated weights into separated convolutions then finetune w/ separated architecture
        • some stuff here: https://arxiv.org/pdf/1706.07156.pdf
          • try 1 channel 2d conv w/ stride and perhaps dilation
      • normalization
        • batch norm
        • layer norm
          • channels <-- best so far
          • weights
        • instance norm
        • group norm
        • get idea for learning rate
      • decay
      • gradient clipping
      • residual connections vs highway connections
      • try different padding - found long sentences not spoken well at end
        • pad from other direction?
        • pad both ends of spectrogram randomly
        • modify guided attention loss function
      • get some NULL character going for padding - alternatively modify c2i to not map any character to 0
      • account for equal loudness perception envelope
        • mfcc vs mel spectrogram
        • equal loudness loss on WFT
        • equal loudness loss on MSB
  • use as reference
  • citations
    • main inspiration: https://arxiv.org/abs/1705.03122
  • cited by
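The "1d conv / fix causality" items above boil down to left-padding the time axis so the output at step t never sees inputs after t. A minimal sketch (class and argument names are mine, not the repo's):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D conv made causal by padding (kernel_size - 1) * dilation
    zeros on the left of the time axis only."""
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # left padding only
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))      # (left, right) pad on last dim
        return self.conv(x)
```

Because all padding goes on the left, changing a future input cannot change any earlier output, which is what the audio-side convolutions in text2Mel need for autoregressive generation.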
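For the "highway connection/highway convolution" item, the DCTTS-style layer has the conv emit 2d channels, split into a gate and a candidate, then mixes the candidate with the input. A rough sketch (non-causal "same" padding for brevity; names are illustrative):

```python
import torch
import torch.nn as nn

class HighwayConv1d(nn.Module):
    """Highway-gated 1-D conv: conv produces 2*d channels, split into
    gate h1 and candidate h2; output = sigmoid(h1)*h2 + (1-sigmoid(h1))*x."""
    def __init__(self, d, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2
        self.conv = nn.Conv1d(d, 2 * d, kernel_size,
                              dilation=dilation, padding=pad)

    def forward(self, x):                      # x: (batch, d, time)
        h1, h2 = self.conv(x).chunk(2, dim=1)  # gate, candidate
        g = torch.sigmoid(h1)
        return g * h2 + (1.0 - g) * x
```

Note the `(1 - g) * x` term requires input and output channel counts to match, which is why the "try not bottlenecking when channel depth changes" idea above runs into the highway conv definition.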
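The "guided" attention item refers to the paper's guided attention loss: penalize attention mass far from the text/time diagonal with a Gaussian-shaped weight. A sketch under my own naming (the paper normalizes positions by N and T; I use N-1/T-1, a minor variation):

```python
import torch

def guided_attention_loss(A, g=0.2):
    """A: attention matrix (batch, N_text, T_mel).
    W[n, t] = 1 - exp(-((n/N - t/T)^2) / (2 g^2)) weights off-diagonal
    attention; the loss is the mean of A * W."""
    _, N, T = A.shape
    n = torch.arange(N, dtype=torch.float32).unsqueeze(1) / max(N - 1, 1)
    t = torch.arange(T, dtype=torch.float32).unsqueeze(0) / max(T - 1, 1)
    W = 1.0 - torch.exp(-((n - t) ** 2) / (2.0 * g * g))  # (N, T)
    return (A * W.unsqueeze(0)).mean()
```

A roughly diagonal attention map incurs little loss; attention that wanders far from the diagonal is pushed back, which speeds up alignment learning. The "modify guided attention loss function" padding idea above would change how W is built near the sequence ends.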
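The checkpointing items (save during training, call model.eval() on load, stay CPU-compatible) can be sketched roughly like this; function names and the checkpoint dict layout are mine, not the repo's:

```python
import torch
import torch.nn as nn

def save_checkpoint(model, optimizer, step, path):
    """Save model + optimizer state together so training can resume."""
    torch.save({"step": step,
                "model": model.state_dict(),
                "optim": optimizer.state_dict()}, path)

def load_checkpoint(model, optimizer, path, for_eval=True):
    """map_location='cpu' keeps GPU-trained checkpoints loadable on CPU."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    if optimizer is not None:
        optimizer.load_state_dict(ckpt["optim"])
    if for_eval:
        model.eval()  # switch dropout/norm layers out of training mode
    return ckpt["step"]
```

Combining text2Mel, audioDec, and attention into one `nn.Module` (as the TODO suggests) means a single `state_dict()` covers all of them, so this pair of functions is the only checkpointing logic needed.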
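On the "fix inference memory leak" item: autoregressive generation without `torch.no_grad()` keeps extending the autograd graph every timestep, which looks like a leak. A sketch with a hypothetical model interface (the real text2Mel signature in the repo may differ):

```python
import torch
import torch.nn as nn

def generate(model, text_emb, T=20):
    """Autoregressive mel generation sketch: model(text_emb, mel_so_far)
    returns per-frame predictions; we append the last predicted frame each
    step. torch.no_grad() stops the graph growing across timesteps.
    The 80-bin mel size and calling convention are assumptions."""
    mel = torch.zeros(text_emb.size(0), 80, 1)  # seed with a zero frame
    with torch.no_grad():
        for _ in range(T):
            frame = model(text_emb, mel)[:, :, -1:]  # next-frame prediction
            mel = torch.cat([mel, frame], dim=2)
    return mel[:, :, 1:]  # drop the zero seed frame
```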
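The "separability" items (non sep vs sep vs super sep) compare full convolutions against depthwise-separable ones. The basic "sep" variant factors a conv into a per-channel (depthwise) conv plus a 1x1 (pointwise) conv, cutting the parameter count sharply; a sketch, not the repo's exact layer:

```python
import torch
import torch.nn as nn

class SeparableConv1d(nn.Module):
    """Depthwise-separable 1-D conv: groups=in_ch gives one filter per
    channel; the 1x1 pointwise conv then mixes channels."""
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2
        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size,
                                   dilation=dilation, padding=pad,
                                   groups=in_ch)
        self.pointwise = nn.Conv1d(in_ch, out_ch, 1)

    def forward(self, x):                # x: (batch, in_ch, time)
        return self.pointwise(self.depthwise(x))
```

For in_ch=16, out_ch=32, kernel 3, this uses roughly 600 parameters versus about 1,600 for the full conv, at the cost of a less expressive layer, which matches the quality/size trade-off the TODO is exploring.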

Further work