NVIDIA/flowtron

Published Flowtron LibriTTS2K model does not include iteration or optimizer

ttt733 opened this issue · 5 comments

Unless I'm missing something, the fine-tuning instructions in the README do not work with this checkpoint. In train.py:

iteration = checkpoint_dict['iteration']
...
    if len(ignore_layers) > 0:
        ...
    else:
        optimizer.load_state_dict(checkpoint_dict['optimizer'])

Working around the missing iteration value by setting iteration = 1 has been mentioned in previous issues, and loading the optimizer can be skipped by putting a dummy value into ignore_layers. Still, it seems like making the published model match what the code expects would be the better fix.
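The same workaround can be expressed as a small helper instead of editing values by hand. This is a minimal sketch. It assumes the checkpoint is a plain dict like `checkpoint_dict` in the train.py excerpt above. The name `read_resume_state` and the fallback iteration of 1 (the value mentioned in earlier issues) are illustrative, not part of train.py:

```python
from typing import Optional, Tuple


def read_resume_state(checkpoint_dict: dict,
                      ignore_layers: list,
                      default_iteration: int = 1) -> Tuple[int, Optional[dict]]:
    """Hypothetical helper: read the resume fields train.py expects,
    tolerating checkpoints (like the published one) that omit them."""
    # The published checkpoint has no 'iteration' key, so fall back to a default
    iteration = checkpoint_dict.get('iteration', default_iteration)

    optimizer_state = None
    if len(ignore_layers) == 0:
        # Mirrors the else-branch in train.py, but returns None instead of
        # raising KeyError when 'optimizer' is missing
        optimizer_state = checkpoint_dict.get('optimizer')
    return iteration, optimizer_state


if __name__ == "__main__":
    published = {"model": "weights only"}          # stand-in for the released file
    print(read_resume_state(published, []))        # (1, None)
    full = {"iteration": 5000, "optimizer": {"lr": 1e-3}}
    print(read_resume_state(full, []))             # (5000, {'lr': 0.001})
```

The caller would then call `optimizer.load_state_dict(optimizer_state)` only when `optimizer_state is not None`, so no dummy `ignore_layers` entry is needed.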

I cannot get the LibriTTS2K model to work for inference either. I don't think the model is inheriting the weights properly, since it seems to generate only random noise. If you see anything I'm doing wrong, let me know; if I can get it working, I'll open a PR to update the README instructions.
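One way to test the suspicion that the weights are not being inherited is to compare parameter names between the model and the checkpoint. PyTorch's `model.load_state_dict(state_dict, strict=False)` returns the missing and unexpected keys. The sketch below does the same comparison on plain key lists. The function name `diff_state_dict_keys` and the example key names are hypothetical:

```python
from typing import Iterable, List, Tuple


def diff_state_dict_keys(model_keys: Iterable[str],
                         checkpoint_keys: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: return (missing, unexpected) parameter names,
    the same two lists PyTorch reports for load_state_dict(strict=False)."""
    model_set = set(model_keys)
    ckpt_set = set(checkpoint_keys)
    # Expected by the model but absent from the checkpoint: these keep their initial values
    missing = sorted(model_set - ckpt_set)
    # Present in the checkpoint but unknown to the model: these are ignored
    unexpected = sorted(ckpt_set - model_set)
    return missing, unexpected


if __name__ == "__main__":
    # Made-up key names, for illustration only
    model_keys = ["speaker_embedding.weight", "encoder.weight", "flows.0.weight"]
    ckpt_keys = ["speaker_embedding.weight", "encoder.weight", "flows.1.weight"]
    missing, unexpected = diff_state_dict_keys(model_keys, ckpt_keys)
    print("missing:", missing)        # missing: ['flows.0.weight']
    print("unexpected:", unexpected)  # unexpected: ['flows.1.weight']
```

With real objects, the call would be `diff_state_dict_keys(model.state_dict().keys(), state_dict.keys())`. Note that `load_state_dict` with its default `strict=True` already raises on any mismatch. This check therefore matters mainly where keys are filtered or loaded non-strictly.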
config.json

{
    "train_config": {
        "output_directory": "/outdir",
        "epochs": 10000000,
        "optim_algo": "RAdam",
        "learning_rate": 1e-3,
        "weight_decay": 1e-6,
        "grad_clip_val": 1,
        "sigma": 1.0,
        "iters_per_checkpoint": 1000,
        "batch_size": 1,
        "seed": 1234,
        "checkpoint_path": "models/flowtron_libritts2p3k.pt",
        "ignore_layers": [],
        "finetune_layers": [],
        "include_layers": ["speaker", "encoder", "embedding"],
        "warmstart_checkpoint_path": "",
        "with_tensorboard": true,
        "fp16_run": true,
        "gate_loss": true,
        "use_ctc_loss": true,
        "ctc_loss_weight": 0.01,
        "blank_logprob": -8,
        "ctc_loss_start_iter": 10000
    },
    "data_config": {
        "training_files": "filelists/libritts_train_clean_100_audiopath_text_sid_shorterthan10s_atleast5min_train_filelist.txt",
        "validation_files": "filelists/libritts_train_clean_100_audiopath_text_sid_atleast5min_val_filelist.txt",
        "text_cleaners": ["flowtron_cleaners"],
        "p_arpabet": 0.5,
        "cmudict_path": "data/cmudict_dictionary",
        "sampling_rate": 22050,
        "filter_length": 1024,
        "hop_length": 256,
        "win_length": 1024,
        "mel_fmin": 0.0,
        "mel_fmax": 8000.0,
        "max_wav_value": 32768.0,
        "use_attn_prior": true,
        "attn_prior_threshold": 0.0,
        "prior_cache_path": "/attention_prior_cache",
        "betab_scaling_factor": 1.0,
        "keep_ambiguous": false
    },
    "dist_config": {
        "dist_backend": "nccl",
        "dist_url": "tcp://localhost:54321"
    },
    "model_config": {
        "n_speakers": 123,
        "n_speaker_dim": 128,
        "n_text": 185,
        "n_text_dim": 512,
        "n_flows": 2,
        "n_mel_channels": 80,
        "n_attn_channels": 640,
        "n_hidden": 1024,
        "n_lstm_layers": 2,
        "mel_encoder_n_hidden": 512,
        "n_components": 0,
        "mean_scale": 0.0,
        "fixed_gaussian": true,
        "dummy_speaker_embedding": false,
        "use_gate_layer": true,
        "use_cumm_attention": false
    }
}

Command:
python inference.py -o ./outdir -c config.json -f models/flowtron_libritts2p3k.pt -w models/waveglow_256channels_universal_v5.pt -t "It is well known that deep generative models have a rich latent space!" -i 1088
Output:
[Attention plots: sid1088_sigma0.5_attnlayer1, sid1088_sigma0.5_attnlayer0]
Plus a 410 KB WAV file of static. The WaveGlow model (v5) is the one linked in that repo's README. Since it was mentioned in #74: my torch version is torch==1.8.1+cu111, though I wasn't sure exactly what was meant by "try inference in fp32."

waveglow_256channels_universal_v5.pt gives me nothing but noise as well. I couldn't figure out what was happening for a long time, but then I switched to v4 and everything worked.

@ttt733, are you able to produce spectrograms with the pre-trained model?

No. I'm trying to use the LibriTTS2K model linked in the repo, and I've tried WaveGlow v5 and v4 without success. In my latest attempt I'm also getting warnings from PyTorch:

~/dev/flowtron$ python inference.py -o ./outdir -c config.json -f models/flowtron_libritts2p3k.pt -w models/waveglow_256channels_universal_v4.pt -t "It is well known that deep generative models have a rich latent space!" -i 1088
/home/trevor/anaconda3/envs/blitz/lib/python3.8/site-packages/torch/serialization.py:671: SourceChangeWarning: source code of class 'torch.nn.modules.conv.ConvTranspose1d' has changed. Saved a reverse patch to ConvTranspose1d.patch. Run `patch -p0 < ConvTranspose1d.patch` to revert your changes.
  warnings.warn(msg, SourceChangeWarning)
/home/trevor/anaconda3/envs/blitz/lib/python3.8/site-packages/torch/serialization.py:671: SourceChangeWarning: source code of class 'torch.nn.modules.container.ModuleList' has changed. Tried to save a patch, but couldn't create a writable file ModuleList.patch. Make sure it doesn't exist and your working directory is writable.
  warnings.warn(msg, SourceChangeWarning)
/home/trevor/anaconda3/envs/blitz/lib/python3.8/site-packages/torch/serialization.py:671: SourceChangeWarning: source code of class 'torch.nn.modules.conv.Conv1d' has changed. Saved a reverse patch to Conv1d.patch. Run `patch -p0 < Conv1d.patch` to revert your changes.
  warnings.warn(msg, SourceChangeWarning)

The result is the same as what I posted above. The PyTorch version is still 1.10.0.dev20210609.

It works with waveglow_256channels_ljs_v3.pt.

To download:

curl -LO 'https://api.ngc.nvidia.com/v2/models/nvidia/waveglow_ljs_256channels/versions/3/files/waveglow_256channels_ljs_v3.pt'
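Since the results in this thread depend on which WaveGlow checkpoint is used (v5, v4, and ljs_v3 behaved differently for different people), one way to compare them is to run the same inference.py command once per vocoder. This is a minimal sketch that reuses only the flags from the commands above. The `models/` paths, the per-vocoder output directories, and the helper names are assumptions:

```python
import os
import subprocess
import sys
from typing import List

# Vocoder checkpoints discussed in this thread; assumes all three were downloaded into models/
WAVEGLOW_CHECKPOINTS = [
    "models/waveglow_256channels_universal_v5.pt",
    "models/waveglow_256channels_universal_v4.pt",
    "models/waveglow_256channels_ljs_v3.pt",
]


def build_inference_cmd(waveglow_path: str, out_dir: str, text: str, speaker_id: int,
                        config: str = "config.json",
                        flowtron_path: str = "models/flowtron_libritts2p3k.pt") -> List[str]:
    """Rebuild the inference.py call from this thread, varying only -w and -o."""
    return [sys.executable, "inference.py",
            "-o", out_dir, "-c", config, "-f", flowtron_path,
            "-w", waveglow_path, "-t", text, "-i", str(speaker_id)]


def compare_vocoders(text: str, speaker_id: int, root_out: str = "./outdir") -> None:
    for path in WAVEGLOW_CHECKPOINTS:
        # Separate output directory per vocoder so results do not overwrite each other
        name = os.path.splitext(os.path.basename(path))[0]
        out_dir = os.path.join(root_out, name)
        os.makedirs(out_dir, exist_ok=True)
        cmd = build_inference_cmd(path, out_dir, text, speaker_id)
        print(" ".join(cmd))
        # check=False: keep going to the next vocoder even if one run fails
        subprocess.run(cmd, check=False)


if __name__ == "__main__":
    if os.path.isfile("inference.py"):
        compare_vocoders("It is well known that deep generative models have a rich latent space!", 1088)
    else:
        print("inference.py not found; run this from the root of the flowtron checkout")
```

Because only `-w` and `-o` change between runs, comparing the outputs side by side shows whether the static follows the vocoder rather than the Flowtron checkpoint.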