facebookarchive/loop

How many minutes of audio data are needed to train a single speaker using the Blizzard model?

jaxlinksync opened this issue · 47 comments


The output is very different from my orig.wav file.
output.zip

Did you use the Blizzard 2011 dataset?

@enk100 No, I used my own dataset.

sj_017.gen_0.wav sounds like the Blizzard voice.
Are you sure you trained it on your data?
Did you change the data path to your own dataset?

Did you change the data path to your own dataset?

Which part do you mean? In training?

Yes, in train.py.

Here's what I did, @enk100:

  1. Extract my own dataset using extract_feats.py
  2. Override data/blizzard/* with the features extracted in step 1
  3. Run the first training stage: python train.py --noise 1 --expName blizzard_init --seq-len 1600 --max-seq-len 1600 --data data/blizzard --nspk 1 --lr 1e-5 --epochs 10
  4. Run the second training stage: python train.py --noise 1 --expName blizzard --seq-len 1600 --max-seq-len 1600 --data data/blizzard --nspk 1 --lr 1e-4 --checkpoint checkpoints/blizzard_init/bestmodel.pth --epochs 90
  5. Then generate: python generate.py --npz data/blizzard/numpy_features_valid/sj_017.npz --checkpoint models/blizzard/bestmodel.pth

Are you sure you didn't mix your dataset with Blizzard?
Can you look into data/blizzard/ and check that it contains only your dataset?

It is very odd that you hear Blizzard when you didn't train on it... maybe you started from a Blizzard model checkpoint?
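
If it helps, here is a minimal sketch (not part of the repo) for listing the distinct filename prefixes under data/blizzard/, to spot any leftover Blizzard files mixed in with your own recordings; the prefix-before-underscore convention is only an assumption based on the file names in this thread:

import glob
import os

# Collect every extracted feature file and group by filename prefix (e.g. 'sj').
files = glob.glob('data/blizzard/numpy_features*/*.npz')
prefixes = sorted({os.path.basename(f).split('_')[0] for f in files})
print(len(files), 'npz files, prefixes:', prefixes)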

data/blizzard only contains my dataset. I use the models/blizzard model for training. Is that okay, or do I need to create a model from my own dataset?

You need to train the model from scratch.
Does the '--checkpoint' argument in train.py stay an empty string, or did you insert the Blizzard model checkpoint?

In the first stage of training the --checkpoint is empty. In the second stage of training the --checkpoint I use is checkpoints/blizzard_init/bestmodel.pth

Please check the '--checkpoint' argument in train.py. If it contains a Blizzard checkpoint, then the first stage trains on a pretrained Blizzard model.

I'm sorry, I'm confused by this statement:

If it contains a Blizzard checkpoint, then the first stage trains on a pretrained Blizzard model.

For example, if the 'default' value of the argument in train.py is -
parser.add_argument('--checkpoint', default='checkpoints/blizzard_init/bestmodel.pth', metavar='C', type=str, help='Checkpoint path')
then your training is initialized with the Blizzard model.

If the 'default' value is empty then it is OK -
parser.add_argument('--checkpoint', default='', metavar='C', type=str, help='Checkpoint path')
and you start to train your model from scratch.
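
For context, this is roughly what an empty vs. non-empty --checkpoint string usually means in a PyTorch training script; a minimal sketch, not the repo's actual loading code:

import torch

def maybe_load_checkpoint(model, checkpoint_path):
    # Non-empty string: warm-start from the saved weights (e.g. a Blizzard model).
    if checkpoint_path:
        model.load_state_dict(torch.load(checkpoint_path, map_location='cpu'))
    # Empty string: keep the random initialization and train from scratch.
    return model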

Somehow your model got Blizzard samples; you should look for a Blizzard data leak.

OK, I've done that. But what about the 2nd stage of training? Do I need to execute it?

python train.py --noise 1 --expName blizzard --seq-len 1600 --max-seq-len 1600 --data data/blizzard --nspk 1 --lr 1e-4 --checkpoint checkpoints/blizzard_init/bestmodel.pth --epochs 90

Yes, you should execute it with the checkpoint argument
--checkpoint checkpoints/blizzard_init/bestmodel.pth

So it should give me a generated file with the same voice as my dataset, right?

Yes, of course.

Thank you so much for the clarification @enk100

Hi @enk100, I trained on the data and generated an output, but the generated wav file doesn't have any sound. See the attachment below.
output2.zip

1/ How many files do you have in your dataset for each speaker?
2/ Are you sure that you extracted the features correctly? You can check it by generating from the npz files (a quick inspection sketch follows below).
3/ How long did you train? Did you see convergence? Can you share the learning curve?
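
As a lighter sanity check than running the full generation, a hedged sketch that just prints whatever arrays an extracted npz holds; the path is one of the files mentioned in this thread:

import numpy as np

feats = np.load('data/blizzard/numpy_features_valid/sj_017.npz')
for key in feats.files:
    print(key, feats[key].shape, feats[key].dtype)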

  1. How many files do you have in your dataset for each speaker?
    A: I have 140 wav files of 1 speaker in my dataset and 140 txt files.

  2. Are you sure that you extracted the features correctly? You can check it by generating from the npz files.
    A: Yes, I extracted them correctly. You can hear the audio generated from the npz in the zip file I attached (the file ending with 'orig.wav').

  3. How long did you train? Did you see convergence? Can you share the learning curve?
    A: First stage of training: 10 epochs. Second stage of training: 90 epochs. Where can I see the convergence and the learning curve?

  1. What is the duration of these 140 files? I think you should train with more data; for the vctk experiments each speaker has 20-25 min (see the duration check after this list). Alternatively, you can try to fit the model to the new speaker as we describe in the new version of the paper (just note that you need to use a model trained on a large number of speakers).
  2. Good.
  3. You can check it in the logger.
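
A quick way to run the duration check in point 1 is to sum the wav lengths; a minimal sketch (not from the repo), with a placeholder directory path:

import glob
import wave

# Add up the length of every wav file in the dataset folder (path is an example).
total_sec = 0.0
for path in glob.glob('my_dataset/wav/*.wav'):
    with wave.open(path) as w:
        total_sec += w.getnframes() / float(w.getframerate())
print('total minutes: %.1f' % (total_sec / 60))
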
  1. The total duration is 23 mins. What do you mean by "just note that you need to use a model trained on a large number of speakers"? Does that mean that I don't have to train from scratch and can just use the model from your paper instead?

Thanks.

Hi @enk100, I used the data in the vctk corpus for a single speaker. After generation there is no sound.

Hi, you can choose -

  1. Combine your data with the vctk data and train the model from scratch
  2. Take the vctk model and fine-tune it to your new identity - add an embedding vector for your new speaker (a rough sketch follows after this comment)

good luck.
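
For option 2, the idea is to grow the speaker lookup table by one row and then fine-tune. A hedged sketch of that step; this is not the repo's code, and initializing the new row from the mean of the existing speakers is just one reasonable choice:

import torch
import torch.nn as nn

def add_speaker(old_emb):
    # Grow an existing speaker embedding table by one identity.
    n_spk, dim = old_emb.weight.shape
    new_emb = nn.Embedding(n_spk + 1, dim)
    with torch.no_grad():
        new_emb.weight[:n_spk] = old_emb.weight          # keep the trained speakers
        new_emb.weight[n_spk] = old_emb.weight.mean(0)   # start the new speaker at the mean
    return new_emb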

You mean train it as a multi-speaker model?

Yes. Train it on vctk with the 22 speakers + your data.

So I have to run extract_feats.py with the 22 speakers + my data, right?

@enk100 if I train it on vctk with the 22 speakers + my data, should I set --nspk to 23 in train.py?

@lvenoxi - yes
@jaxlinksync - no, run extract_feats.py only for your data and then combine the vctk22 features with your data (see the sketch below)
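
Combining here just means putting the new npz files next to the vctk ones before training. A minimal sketch, assuming your extracted features live under data/sj/ (that path is hypothetical):

import glob
import shutil

# Copy the newly extracted features into the vctk train and validation folders.
for split in ('numpy_features', 'numpy_features_valid'):
    for path in glob.glob('data/sj/%s/*.npz' % split):
        shutil.copy(path, 'data/vctk/%s/' % split)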

What about the norm.dat of the extracted data? Do I have to add it to the norm_info directory as well, and can I name it anything? In my case I named it sj_norm.dat,
so inside my norm_info directory there are:

  1. norm.dat (included when downloading the data in voiceloop)
  2. sj_norm.dat (norm file generated after extracting my dataset)

It is only relevant when you are going to generate samples. So when you generate vctk, use the vctk norm.dat; when you generate sj, use sj_norm.dat.

Hi @enk100, thank you so much for your help. One last thing:
by generating samples, do you mean this command?
python generate.py --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 13 --checkpoint models/vctk/bestmodel.pth
How can I pass the sj_norm.dat as a parameter?

Thank you so much @enk100

You're welcome!

By the way @enk100, how do I know which speaker ID is my new speaker?
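
One hedged way to answer this yourself, assuming speaker IDs are assigned from the sorted filename prefixes in the training features directory (an assumption about the data loader, not something documented in this thread):

import glob
import os

# Map each distinct filename prefix (e.g. 'p318', 'sj') to its sorted index.
prefixes = sorted({os.path.basename(p).split('_')[0]
                   for p in glob.glob('data/vctk/numpy_features/*.npz')})
for idx, name in enumerate(prefixes):
    print(idx, name)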

Hi @enk100, you're awesome 😄 thanks.

One last thing: when I generate the voice, which checkpoint should I use?
a. models/vctk/bestmodel.pth
b. checkpoint/(name of expName)/bestmodel.pth

Thank you so much for your help.

Try both:
checkpoint/(name of expName)/bestmodel.pth
checkpoint/(name of expName)/lastmodel.pth

Thank you so much @enk100

After I generated the data, this is what I get.
output.zip
The generated output does not match the original wav file.
Here's the command I used to generate:
sudo python generate.py --npz data/vctk/numpy_features_valid/sj_014.npz --spkr 21 --checkpoint checkpoints/vctk_noise_2/bestmodel.pth
The same goes for lastmodel.pth.

Did I miss something? The other speakers are OK, but not ours.

Are you sure your speaker is 21? I guess it should be 22, as vctk has 22 speakers.
Can you get more data for your speaker?

Hi @enk100, I tried spkr 22 but it said that the speaker does not exist. So I printed the list of speakers as per your suggestion above and got this:

[screenshot of the speaker list]

As you can see, the speaker sj is 21.

@enk100 can you please confirm whether our datasets are valid? Please PM me at jax@upskill.com so that I can send you a link to our corpus, if that's OK with you.

Hi, @jaxlinksync! Can you please give me some advice: did you succeed in fine-tuning an existing vector to your new identity?