152334H/DL-Art-School

Figure out the best training hyperparameters

152334H opened this issue · 20 comments

The numbers written in ./experiments/EXAMPLE_gpt.yml were picked completely at random! It is very likely the numbers can be better, so long as people are willing to test and see what works.

Please post results here if you change any of the parameters, even if it completely fails!

experiment 2

For my 2nd experiment (the first one being the one on the README page), I:

  • trained with the entirety of my Disco Elysium dataset, with a mix of many speakers, rather than with a single speaker.
  • raised the learning rate to 1.5e-5, up from 1e-5.

Both of these moves appear to have been mistakes. My mixed dataset was highly imbalanced, with >70% of the speech going to a single narrator alone (in a dataset of 100 speakers); this caused all voice outputs to be severely biased towards the most common speaker. I also observed much more noise in the resultant outputs, which might have to do with the dataset or with the higher learning rate or with the lack of other model fine-tunes.

Might commit results later, but my conclusions here are:

  • ensure multispeaker datasets are well-balanced between all speakers,
  • need to work on pipelines for training other model parts ASAP
  • maybe (?) don't touch the learning rate right now, not exactly sure on this
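A minimal sketch of how the imbalance described above could be measured, and of one possible way to reduce it (capping lines per speaker). The `(speaker, text)` tuple layout, the function names, and the toy numbers are all assumptions for illustration, not the repo's actual dataset format.

```python
from collections import Counter, defaultdict
import random

def speaker_shares(samples):
    """Return {speaker: fraction of all lines} for (speaker, text) pairs."""
    counts = Counter(speaker for speaker, _ in samples)
    total = sum(counts.values())
    return {spk: n / total for spk, n in counts.items()}

def cap_per_speaker(samples, max_lines, seed=0):
    """Keep at most max_lines randomly chosen lines for each speaker."""
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    for speaker, text in samples:
        by_speaker[speaker].append((speaker, text))
    capped = []
    for lines in by_speaker.values():
        rng.shuffle(lines)
        capped.extend(lines[:max_lines])
    return capped

# Toy data (hypothetical): one narrator with 70 lines, three others with 10 each.
toy = [("narrator", f"line {i}") for i in range(70)]
for name in ("a", "b", "c"):
    toy += [(name, f"line {i}") for i in range(10)]

before = speaker_shares(toy)
after = speaker_shares(cap_per_speaker(toy, max_lines=10))
print(max(before.values()), max(after.values()))  # 0.7 0.25
```

Capping throws data away, so it is only one option; whether it is enough to stop the bias towards the dominant speaker is something that would still need testing.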

experiment 3

This was the one where I first used the colab notebook.

image

It went pretty well, which was surprising because the dataset had <200 samples.

However, this only really worked because I manually adjusted a whole bunch of parameters down. That led me to develop automatic calculations for some parameters based on the dataset.

experiment 4

image

This was just a redo of the previous experiment with the new automatic parameter system. Worked well enough.

experiment 5

image

This was my 2nd attempt at a multispeaker training session. This time, I capped samples for every character at a maximum of 1000 lines (in the training DS). I learned a few things:

  • loss logs are averaged based on the batches used since the last log printed. This means that the individual logged points can only really be compared well if the print log frequency is equivalent to the size of the dataset (in batch steps). Logging more often leads to more noisy up-down spikes, which will be displayed more smoothly on tensorboard
  • 20 epochs to decay is almost certainly too large for a dataset as large as mine (~20k lines). I set the decay at 3+ epochs (500 steps) initially, switched it to 6+ epochs at the 600th step, and saw a pretty bumpy curve between steps 800 and 1000, so the right decay step for this dataset probably lies somewhere between 500 and 1000 steps.
  • the results were honestly still pretty bad. Many speakers do not sound like what they ought to. I am not sure if CLVP/diffusion fine-tuning would solve this. Something to check: are all the 16 AR samples similar, or do they vary substantially?
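The first point above (each logged loss is an average over the batches since the previous log) can be illustrated with synthetic numbers. This is only a sketch: the loss curve is made up, and `steps_per_epoch = 50` and the logging intervals are hypothetical values, not the ones used in the experiment.

```python
import random
import statistics

def windowed_means(losses, log_every):
    """Average per-step losses over consecutive windows of log_every steps."""
    means = []
    for start in range(0, len(losses), log_every):
        window = losses[start:start + log_every]
        means.append(sum(window) / len(window))
    return means

def roughness(points):
    """Mean absolute change between consecutive logged points."""
    return statistics.mean(abs(b - a) for a, b in zip(points, points[1:]))

rng = random.Random(0)
steps_per_epoch = 50  # hypothetical
# Synthetic per-step loss: a slow downward trend plus per-batch noise.
losses = [3.0 - 0.0005 * step + rng.gauss(0, 0.3) for step in range(500)]

frequent = windowed_means(losses, log_every=5)
per_epoch = windowed_means(losses, log_every=steps_per_epoch)

# Frequent logging gives many jumpy points; per-epoch logging gives few, smoother ones.
print(len(frequent), roughness(frequent))
print(len(per_epoch), roughness(per_epoch))
```

Both series average the same underlying losses, so the overall level is the same; only the point-to-point noise differs.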

experiment 6

Testing on a different dataset this time. Single speaker, female, emotional, fairly large dataset with maybe 1-2k samples.

Now that I've gotten the validation metrics working, I can use those as graphs:

image

This was a disastrous outcome, and the voices were all garbled when I tested them. I don't know why; maybe the speaker is too different. I didn't change anything about the training process.

experiment 7

First case of diffusion fine-tuning! It looked amazingly good on the tensorboard graphs:

image

But the results were absolutely horrible! Sounded like random noise, incredible how bad it was.

I've been running tests on small datasets (15 total samples) and I notice the result sounds stepped, like he's talking through a broken speaker kind of thing. Even weirder, the higher the preset you go the more it clears up and sounds good. I tried various combinations and believe it's mainly influenced by the autoregressive sample amount; not sure why it's getting this sort of effect.

Comparison:

All are using the same seed and the same candidate is being compared

Standard Preset - fine-tuned:
https://vocaroo.com/1mhvTk3mpPXt

Standard Preset - Original:
https://vocaroo.com/1jjsMCN54BZU

Ultra-fast - fine-tuned (notice the hiss and stepping):
https://vocaroo.com/1mYxZIWlHhZb

Ultra-fast - Original:
https://vocaroo.com/14VuHwH3s5Vw

Training curves && params would be good. It probably overfit on the small amount of data included, which could be made less bad if I manage to fix the conditioning latents problem.

I wonder if this can be further fixed using AudioLDM once they release their audio super-resolution, voicefixer completely destroys the speech

in regards to the diffusion model, I talked to the dev that wrote on reddit he retrained the vqvae, he said he didn't retrain the diffusion model at all

btw, check out this thread where neonbjb discusses the gpt training
neonbjb#10

I'm aware of the cheater latents problem, I discuss the problems with fixing that here, but thanks for the link nonetheless

I wonder if this can be further fixed using AudioLDM once they release their audio super-resolution, voicefixer completely destroys the speech

I haven't checked it out, I'll go do that later

in regards to the diffusion model, I talked to the dev that wrote on reddit he retrained the vqvae, he said he didn't retrain the diffusion model at all

Did he mean to recreate the VQVAE from scratch, or to fine-tune?

I'm not sure tbh
BTW

These configs were shared by neon in some random discussion a while back. They're different from the ones in the original DL repo; perhaps they could help make sense of how he trained his GPT.
tts_flat_autoregressive_inputs_r2.zip

These are all very interesting... they look like the exact configs he used to train the actual tortoise model. This is the first time I've seen the real filepaths to his larger ocotillo transcribed dataset. I can already see some errors I made regarding the diffusion model trainer, like layer drop or lr decay.

This is good. Where was it from?

Regarding finding good hyperparameters, I think this might be useful.
https://github.com/optuna/optuna

I ran a bunch of experiments reducing lr (with its helpful bold comment "you should experiment with this value"). Reducing it seems to resolve the "stepped, like he's talking through a broken speaker kind of thing, even weirder, the higher the preset you go the more it clears up" situation. I found that values between 1e-7 and 5e-8 worked best (kinda hard to tell within that range which is best), avoiding both the unsmooth robot-like tonality of zero-shot (i.e., original model) and the stepped sound of 1e-5. I'm using ~180 samples, .85/.15 train/validate, niter (I'm assuming this is "number of iterations" and synonymous with "steps") of 1800 so ~12 epochs, and then gen_lr_steps [462, 924, 1386, 1618] so stepping down the lr every four epochs. At least, that's what I think I'm doing anyway (not an ML genius), and it sounds pretty good. I'm training on a pretty normal voice that isn't that far off of the libritts-ish voices so may not need as much training as other voices would.
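For reference, a small sketch of what a step-decay schedule with those milestones would do, assuming `gen_lr_steps` behaves like a list of step-decay milestones (the rule PyTorch's `MultiStepLR` scheduler applies). The decay factor `gamma = 0.5` is an assumption, since the thread does not state it, and `lr_at_step` is a hypothetical helper, not part of the trainer.

```python
from bisect import bisect_right

def lr_at_step(step, base_lr, milestones, gamma):
    """Step decay: multiply base_lr by gamma once for every milestone reached."""
    return base_lr * gamma ** bisect_right(milestones, step)

gen_lr_steps = [462, 924, 1386, 1618]  # milestones quoted in the comment above
base_lr = 1e-7                         # upper end of the range tried above
gamma = 0.5                            # ASSUMPTION: decay factor is not given in the thread

for step in (0, 461, 462, 924, 1386, 1618, 1799):
    print(step, lr_at_step(step, base_lr, gen_lr_steps, gamma))
```

Under these assumptions the learning rate at the end of the run is one sixteenth of the starting value.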

The thing that still plagues me is issue 237 in the original tortoise repo: repeats (so, an inference issue, not Hparams). Posting on that in #61 to keep topics clean.

Sorry, ignore last comment. I hadn't comprehended steps well enough (i.e., "one batch of batch_size is 1 unit/step" in the example yml). I had a batch size of 77 (so two batches per epoch with 154 training samples), so 1800 steps was hundreds of epochs. Interesting to experiment with low learning rate, lots of iterations, I guess; nothing good enough to recommend. Works much better with lr 1e-5 and 5 to 8 epochs.
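The steps-versus-epochs arithmetic in that correction can be written out as below. The helper names are hypothetical, and rounding up assumes a partial final batch still counts as one step (it makes no difference for 154 samples at batch size 77).

```python
import math

def steps_per_epoch(num_train_samples, batch_size):
    """Steps needed to see every training sample once (one batch = one step)."""
    # ASSUMPTION: a partial last batch still counts as a step.
    return math.ceil(num_train_samples / batch_size)

def epochs_for_steps(steps, num_train_samples, batch_size):
    """How many passes over the data a given number of steps amounts to."""
    return steps / steps_per_epoch(num_train_samples, batch_size)

def steps_for_epochs(epochs, num_train_samples, batch_size):
    """How many steps to request in order to train for a given number of epochs."""
    return epochs * steps_per_epoch(num_train_samples, batch_size)

train_samples, batch_size = 154, 77  # numbers from the comment above

print(steps_per_epoch(train_samples, batch_size))         # 2
print(epochs_for_steps(1800, train_samples, batch_size))  # 900.0
print(steps_for_epochs(5, train_samples, batch_size),
      steps_for_epochs(8, train_samples, batch_size))     # 10 16
```

So with this batch size, the 5 to 8 epochs mentioned above correspond to only 10 to 16 steps, while 1800 steps is 900 passes over the data.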

Are there units to the y-axis val_loss_text_ce? Is that just an arbitrary loss function? Trying to figure out if one can infer anything from the difference between it converging on, say 1.31 in experiment 6 here versus on 4.4 here in one of my recent experiments (or other future graphs), or if it is just more about the shape of the curve.
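One hedged way to read those numbers: if `val_loss_text_ce` is a mean token-level cross-entropy computed with the natural log (the default behaviour of PyTorch's `cross_entropy`, but an assumption about this metric), then its unit is nats per token, and `exp(loss)` is the perplexity. The sketch below just does that conversion for the two values mentioned.

```python
import math

def perplexity(mean_ce_nats):
    """Perplexity for a mean cross-entropy given in nats per token."""
    return math.exp(mean_ce_nats)

def nats_to_bits(mean_ce_nats):
    """Convert a cross-entropy from nats per token to bits per token."""
    return mean_ce_nats / math.log(2)

# ASSUMPTION: both values are mean natural-log cross-entropies over text tokens.
for loss in (1.31, 4.4):
    print(loss, round(perplexity(loss), 2), round(nats_to_bits(loss), 2))
```

Even under that assumption, values from different validation sets are not directly comparable, so the shape of each curve is probably the safer thing to read.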

image

are you changing any temperature or top p when using tortoise fast? so lower learning rate works better?

Caveat that I'm just a hobbyist here so my theoretical conceptions of these things are of a "I read a blog post about them" level. But I can report I have done experiments and can't discern any meaningful difference when moving the temperature or top_p dials (from .5 to .95 in each case). Or repetition_penalty or length_penalty for that matter -- nothing. At first I thought low top_p made the sound more "boring" (less prosody) but listening again now I think maybe that's just a bias from having read the docs, which say that's what it does.
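For anyone else poking at these dials, here is a generic sketch of what temperature and top_p do to a next-token distribution. It is a textbook illustration with made-up logits, not tortoise's actual sampling code, and the function names are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered

logits = [4.0, 2.0, 1.0, 0.5, 0.1]  # hypothetical logits for five tokens

for temperature in (0.5, 0.95):
    probs = softmax_with_temperature(logits, temperature)
    for top_p in (0.5, 0.95):
        survivors = sum(1 for p in top_p_filter(probs, top_p) if p > 0)
        print(temperature, top_p, round(max(probs), 3), survivors)
```

When one token already dominates, as in this toy example, moving either dial changes very little. That might be one reason no difference was audible, though that is a guess and not something tested here.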