ETTTS_pytorch
pytorch implementation of https://arxiv.org/abs/1710.08969
TODO
High level
- get any NLP network working
- get any audio network working
- try
- chainer - https://docs.chainer.org/en/stable/
- gluon - https://medium.com/apache-mxnet/mxnet-gluon-in-60-minutes-3d49eccaf266
- pytorch
ETTTS - convolutional TTS
- https://arxiv.org/abs/1710.08969
- read
- understand math
- draw architecture
- implement in pytorch
- get data
- preprocess data
- char embed
- 1d conv
- fix causality
- 1d transpose conv oooooh
- highway connection/highway convolution
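A minimal sketch of what the causal 1-D conv and the highway-gated conv block could look like in PyTorch (module and argument names are my own, not from any reference implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D conv that pads only on the left, so output[t] never sees input beyond t."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                  # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))   # pad the past, never the future
        return self.conv(x)

class HighwayConv1d(nn.Module):
    """Highway-gated conv: out = sigmoid(H1) * H2 + (1 - sigmoid(H1)) * x."""
    def __init__(self, channels, kernel_size, dilation=1):
        super().__init__()
        # a single conv produces both the gate (H1) and the candidate (H2)
        self.conv = CausalConv1d(channels, 2 * channels, kernel_size, dilation=dilation)

    def forward(self, x):
        h1, h2 = self.conv(x).chunk(2, dim=1)
        gate = torch.sigmoid(h1)
        return gate * h2 + (1.0 - gate) * x
```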
- weights initialize
- textenc
- audioenc
- attention
- guided
- forcibly incremental
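The paper's guided attention loss penalizes attention mass far from the diagonal, W[n,t] = 1 - exp(-(n/N - t/T)^2 / (2 g^2)) with g = 0.2. A sketch assuming an attention matrix of shape (batch, N_text, T_mel):

```python
import torch

def guided_attention_loss(A, g=0.2):
    """A: attention weights of shape (batch, N_text, T_mel); penalty grows away from the diagonal."""
    _, N, T = A.shape
    n = torch.arange(N, dtype=A.dtype, device=A.device).unsqueeze(1) / N  # (N, 1)
    t = torch.arange(T, dtype=A.dtype, device=A.device).unsqueeze(0) / T  # (1, T)
    W = 1.0 - torch.exp(-((n - t) ** 2) / (2.0 * g ** 2))                 # (N, T)
    return (A * W).mean()
```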
- audiodec
- ssrn
- impl loss functions
- train text2Mel
- train SSRN
- get GPU training working
- colab
- google cloud
- make backwards compatible w/ CPU
- bigger batch size - GPU mem usage at < 10%
- might have to increase cores for dataloader - 5 cores roughly saturates the GPU at batch size 16
- pretty sure the model is limited by fetcher speed
- checkpoint models periodically during training (sketch below)
- remember to call model.eval() when loading a checkpoint to make sure layers are in evaluation (as opposed to training) mode
- combine checkpointing logic for text2mel and ssrn by combining the text2Mel, audioDec, attention models into one class
- save model results also
- plots of attention,mel,fft
- generated sound
- model speed it/sec on cpu and gpu
- different checkpoint paths for different model params
- work something out that prevents loading models w/ conflicting hyperparams
- doesn't this already happen?
- incorporate hash of model structure into model name
- automatic cold start i.e. don't have to specify load = 1|0
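A sketch of the checkpointing idea above: fold a hash of the hyperparameters into the file name, save optimizer state alongside the weights, and call model.eval() when loading for inference (paths and field names are illustrative):

```python
import hashlib, json, torch

def ckpt_path(hp_dict, tag="text2mel", root="checkpoints"):
    """Hash the hyperparameters into the file name so checkpoints with
    conflicting hyperparams can never be loaded into the wrong model."""
    h = hashlib.md5(json.dumps(hp_dict, sort_keys=True).encode()).hexdigest()[:8]
    return f"{root}/{tag}_{h}.pt"

def save_ckpt(path, model, optimizer, step):
    torch.save({"model": model.state_dict(),
                "optim": optimizer.state_dict(),
                "step": step}, path)

def load_ckpt(path, model, optimizer=None, evaluate=True):
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    if optimizer is not None:
        optimizer.load_state_dict(state["optim"])
    if evaluate:
        model.eval()  # put dropout / norm layers into evaluation mode
    return state["step"]
```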
- implement model params
- nonsep vs sep vs super sep
- batch vs layer vs weight vs instance vs group norm
- alpha
- learning rate
- chunk size (1 default for paper)
- sample rate
- method to migrate checkpoints w/ different model param sets
- check if calling contiguous after transpose/permute speeds up model
- support different sample rates
- recalculate hop length and fft window size
- down/up sample in data fetcher
- add as hyperparam in tunable model params
- abstract class/fun for training/checkpointing/loss monitoring
- test out if concatenating mel and text enc makes sense
- probably does - common in most attention mechanisms
- combine the text2Mel, audioDec, attention models into one class
- generate text2Mel
- generate SSRN
- fix inference memory leak
- with ch.no_grad()
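The leak comes from keeping the autograd graph alive across the autoregressive loop; a sketch of inference wrapped in torch.no_grad() (the generation function and tensor layout are assumptions, not the actual interface):

```python
import torch

@torch.no_grad()   # no autograd graph is built, so activations are freed each step
def generate_mel(text2mel, text_ids, n_mels=80, max_steps=200):
    """Illustrative autoregressive loop; text2mel is assumed to map
    (text_ids, mel_so_far) -> mel prediction with layout (batch, mels, time)."""
    mel = torch.zeros(text_ids.size(0), n_mels, 1)
    for _ in range(max_steps):
        frame = text2mel(text_ids, mel)[:, :, -1:]   # keep only the newest frame
        mel = torch.cat([mel, frame], dim=2)
    return mel
```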
- train text2Mel and SSRN together
- chunked generation - train network to encode multiple timesteps at a time
- hyperparams
- hyperparams class
- add initialization?
- have models take hyperparams class as arguments
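A minimal sketch of such a hyperparams class (fields and defaults are illustrative, not the values actually used):

```python
from dataclasses import dataclass, asdict

@dataclass
class HParams:
    # model
    separability: str = "nonsep"   # "nonsep" | "sep" | "supersep"
    norm: str = "layer"            # "batch" | "layer" | "weight" | "instance" | "group"
    alpha: float = 1.0             # channel thinning multiplier
    # optimization
    lr: float = 2e-4
    # data
    sample_rate: int = 22050
    chunk_size: int = 1            # frames generated per step (1 in the paper)

hp = HParams(norm="layer", lr=1e-4)
hp_dict = asdict(hp)  # can also feed the checkpoint-name hash sketched earlier
```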
- hyperparameter optimization package
- create train dispatcher to train different hyperparameter combinations on different gpus
- request gpu limit increase -> 4
- hyperparam queue?
- more cores for dataloader? - maybe not for layer norm
- hyperparams class
- multi GPU speedup
- separate training code from model code
- separate eval code from training code
- set behavior at preempt to restart and resume training
- split train test
- separability
- non sep
- sep
- super sep
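How I read the three modes, as a sketch (naming is mine): non sep is a plain conv, sep is depthwise + full pointwise, super sep also groups the pointwise step:

```python
import torch.nn as nn

def conv1d_block(in_ch, out_ch, kernel_size, mode="nonsep", groups=4, **kw):
    """kw is forwarded to Conv1d (padding, dilation, ...).
    For "supersep", in_ch and out_ch must both be divisible by `groups`."""
    if mode == "nonsep":
        return nn.Conv1d(in_ch, out_ch, kernel_size, **kw)
    depthwise = nn.Conv1d(in_ch, in_ch, kernel_size, groups=in_ch, **kw)
    if mode == "sep":          # depthwise + full pointwise
        pointwise = nn.Conv1d(in_ch, out_ch, 1)
    elif mode == "supersep":   # depthwise + grouped pointwise
        pointwise = nn.Conv1d(in_ch, out_ch, 1, groups=groups)
    else:
        raise ValueError(mode)
    return nn.Sequential(depthwise, pointwise)
```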
- check if non-torch lambda function slowing down network
- get rid of unnecessary separability params for separable convolutions
- try 2dconv over 1 channel instead of 1d conv over multiple channels
- bottleneck conv layers
- figure out why model not detecting bottleneck weights
- training really slow
- try layer norm between all bottleneck layers?
- increase lr
- ssrn not training
- gradient clipping
- try not bottlenecking when channel depth changes
- doesn't work because of the highway conv definition
- see if there's a way to decompose non separated weights into separated convolutions then finetune w/ separated architecture
- some stuff here: https://arxiv.org/pdf/1706.07156.pdf
- try 1 channel 2d conv w/ stride and perhaps dilation
- normalization
- batch norm
- layer norm
- channels <-- best so far
- weights
- instance norm
- group norm
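nn.LayerNorm normalizes the last dimension, so layer norm over channels ("channels" above) for (batch, channels, time) features needs a transpose; a minimal sketch:

```python
import torch.nn as nn

class ChannelLayerNorm(nn.Module):
    """LayerNorm over the channel dimension for (batch, channels, time) tensors."""
    def __init__(self, channels):
        super().__init__()
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):  # x: (B, C, T)
        return self.norm(x.transpose(1, 2)).transpose(1, 2)
```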
- get idea for learning rate
- decay
- gradient clipping
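A sketch of how the decay and clipping could be wired together (schedule and clip value are placeholders, not tuned):

```python
import torch

def train_loop(model, loader, compute_loss, lr=2e-4, gamma=0.999, clip=1.0):
    """Illustrative wiring of Adam + exponential LR decay + gradient-norm clipping."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)
    for batch in loader:
        loss = compute_loss(model, batch)
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip)
        optimizer.step()
        scheduler.step()   # per-step decay; per-epoch is the other common choice
```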
- residual connections vs highway connections
- try different padding - found long sentences not spoken well at end
- pad from other direction?
- pad both ends of spectrogram randomly
- modify guided attention loss function
- get some NULL character going for padding - alternatively modify c2i to not map any character to 0
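One way to get the NULL/padding character: reserve index 0 in c2i so no real character ever maps to it (character set below is illustrative):

```python
PAD = "\0"                                        # reserved padding character
vocab = PAD + " abcdefghijklmnopqrstuvwxyz'.,?"   # illustrative character set
c2i = {c: i for i, c in enumerate(vocab)}         # PAD -> 0, real characters -> 1..
i2c = {i: c for c, i in c2i.items()}

def encode(text, length):
    """Map text to indices and right-pad with the NULL index (0) up to `length`."""
    ids = [c2i[c] for c in text.lower() if c in c2i][:length]
    return ids + [c2i[PAD]] * (length - len(ids))
```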
- account for equal loudness perception envelope
- mfcc vs mel spectrogram
- equal loudness loss on WFT
- equal loudness loss on MSB
- use as reference
- citations - main inspiration: https://arxiv.org/abs/1705.03122
- cited by
Further work
- waveRNN
- https://arxiv.org/pdf/1802.08435.pdf
- /Users/aduriseti/Documents/2018spring/tesla/WaveRNN-master
- /Users/aduriseti/Documents/2018spring/tesla/TensorFlow-Efficient-Neural-Audio-Synthesis-master
- waveNet
- similar to dctts but for speech recognition
- streaming spectrogram generation
- gan TTS/voice conversion (VC)
- adversarial audio synthesis: https://arxiv.org/abs/1802.04208
- https://github.com/r9y9/gantts
- https://www.youtube.com/watch?v=nsrSrYtKkT8
- styleNN - if only for the dataset
- deepVoice3
- tacotron:
- general optimization
- squeezenet
- mobilenet
- https://arxiv.org/abs/1704.04861
- depthwise separable w/ memory management opt and op vectorizing opt
- understand depthwise sep & complexity
- understand memory management opt
- understand op opt
- look at related papers
- try out channel thinning parameter $\alpha$
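The MobileNet width multiplier $\alpha$ just scales every layer's channel count; a sketch (rounding to a multiple of 8 is a common convention, not something from the paper):

```python
def thin(channels, alpha=1.0, divisor=8):
    """Scale a channel count by the width multiplier alpha, keeping it a multiple of `divisor`."""
    return max(divisor, int(round(channels * alpha / divisor)) * divisor)

# e.g. a base width of 256 channels becomes 128 at alpha = 0.5
assert thin(256, alpha=0.5) == 128
```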
- xception:
- https://arxiv.org/abs/1610.02357
- pure depthwise separable w/ residual connections - demonstrated state of the art performance on ImageNet and faster training
- depthwise sep convolution for NMT
- https://openreview.net/forum?id=S1jBcueAb
- super separability: group the pointwise ops also
- they found parameter savings from separation/super-separation to be superior to param savings from dilation
- try it out
- bottleneck convolution layers
- sparsity constraints w/ pruning
- for RNN i.e. waveRNN
- for CNN - saw package online - https://github.com/jacobgil/pytorch-pruning
- Note: supposedly not equally efficient to train
- network weight decomposition