How many minutes of audio data are needed to train a single speaker using the Blizzard model?
jaxlinksync opened this issue · 47 comments
The output is very different from my orig.wav file.
output.zip
Did you use the Blizzard 2011 dataset?
@enk100 no, I used my own dataset.
sj_017.gen_0.wav - it sounds like Blizzard.
Are you sure you trained it on your data?
Did you change the data path to your own dataset?
Which part do you mean? During training?
Here's what I did, @enk100:
- Extracted features from my own dataset using extract_feats.py
- Overwrote data/blizzard/* with the features extracted in step 1
- Ran the first stage of training:
python train.py --noise 1 --expName blizzard_init --seq-len 1600 --max-seq-len 1600 --data data/blizzard --nspk 1 --lr 1e-5 --epochs 10
- Ran the second stage of training:
python train.py --noise 1 --expName blizzard --seq-len 1600 --max-seq-len 1600 --data data/blizzard --nspk 1 --lr 1e-4 --checkpoint checkpoints/blizzard_init/bestmodel.pth --epochs 90
- Then generated:
python generate.py --npz data/blizzard/numpy_features_valid/sj_017.npz --checkpoint models/blizzard/bestmodel.pth
Are you sure you didn't mix your dataset with Blizzard?
Can you look into data/blizzard/ and check that it contains only your dataset?
It is very odd that you hear Blizzard when you didn't train on it... maybe you started from a checkpoint of the Blizzard model?
data/blizzard contains only my dataset. I used models/blizzard for training. Is that okay, or do I need to create a model from my own dataset?
You need to train the model from scratch.
Does the '--checkpoint' argument in train.py stay an empty string, or did you insert the Blizzard model checkpoint?
On the first stage of training, --checkpoint is empty. On the second stage, the --checkpoint I use is checkpoints/blizzard_init/bestmodel.pth.
Please check the '--checkpoint' argument in train.py. If it contains a Blizzard checkpoint, then the first stage trains on a pretrained Blizzard model.
I'm sorry, I'm confused by this statement:
"If it contains a Blizzard checkpoint, then the first stage trains on a pretrained Blizzard model."
For example, if the 'default' argument in train.py is set like this -
parser.add_argument('--checkpoint', default='checkpoints/blizzard_init/bestmodel.pth', metavar='C', type=str, help='Checkpoint path')
then your training is initialized with the Blizzard model.
If the 'default' argument is empty, then it is OK -
parser.add_argument('--checkpoint', default='', metavar='C', type=str, help='Checkpoint path')
and you train your model from scratch.
Somehow your model got Blizzard samples; you should search for a Blizzard data leak.
OK, I've done that. But what about the second stage of training? Do I need to execute it?
python train.py --noise 1 --expName blizzard --seq-len 1600 --max-seq-len 1600 --data data/blizzard --nspk 1 --lr 1e-4 --checkpoint checkpoints/blizzard_init/bestmodel.pth --epochs 90
Yes, you should execute it with the checkpoint argument
--checkpoint checkpoints/blizzard_init/bestmodel.pth
so it should give me the generated file with the same voice as my datasets right?
Yes, of course.
Thank you so much for the clarification @enk100
Hi @enk100, I trained on the data and generated an output, but the generated wav file doesn't have any sound. See the attachment below.
output2.zip
1/ How many files do you have in your dataset for each speaker?
2/ Are you sure that you extracted the features correctly? You can check this by generating from the npz files.
3/ How long did you train? Did you see convergence? Can you share the learning curve?
- How many files do you have in your dataset for each speaker?
A: I have 140 wav files of one speaker in my dataset, and 140 txt files.
- Are you sure that you extracted the features correctly? You can check this by generating from the npz files.
A: Yes, I extracted them correctly. You can hear the file generated from the npz in the zip I attached (the file ending with 'orig.wav').
- How long did you train? Did you see convergence? Can you share the learning curve?
A: First stage of training: 10 epochs. Second stage of training: 90 epochs. Where can I see the convergence and the learning curve?
- What is the total duration of these 140 files? I think you should train with more data; in the VCTK experiments each speaker has 20-25 minutes. Alternatively, you can fit the model to the new speaker as we describe in the new version of the paper (just note that you need a model trained on a large number of speakers).
- Good.
- You can check it in the logger; a plotting sketch follows below.
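(Not from the original thread: a minimal sketch of plotting the learning curve from the training log. It assumes the logger writes a plain-text log under checkpoints/<expName>/ with lines that contain a loss value; the file name train.log and the line format are assumptions, so adapt the path and regex to what the logger actually emits.)
```python
# Hedged sketch: extract loss values from a plain-text training log and plot
# them. The log path and the 'loss: <number>' line format are assumptions.
import re
import matplotlib.pyplot as plt

losses = []
with open('checkpoints/blizzard_init/train.log') as f:  # assumed file name
    for line in f:
        match = re.search(r'loss[:=]\s*([0-9.]+)', line, re.IGNORECASE)
        if match:
            losses.append(float(match.group(1)))

plt.plot(losses)
plt.xlabel('logged step')
plt.ylabel('loss')
plt.title('Learning curve')
plt.savefig('learning_curve.png')
```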
- The total duration is 23 minutes. What do you mean by "(just note that you need a model trained on a large number of speakers)"? Does that mean that I don't have to train from scratch and can just use the model from your paper instead?
Thanks.
Hi @enk100, I used the data from the VCTK corpus for a single speaker. After generation there is no sound.
Hi, you can choose one of the following:
- Combine your data with the VCTK data and train the model from scratch.
- Take the VCTK model and fine-tune it to your new identity: add an embedding vector for your new speaker (see the sketch after this list).
Good luck.
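(Not from the original reply: a minimal PyTorch sketch of what "add an embedding vector for your new speaker" could look like. The attribute name spkr_embedding is hypothetical; adapt it to wherever the checkpointed model actually stores its speaker lookup table.)
```python
# Hedged sketch: extend a trained model's speaker embedding with one new,
# randomly initialized row for a new identity, keeping the trained rows.
import torch
import torch.nn as nn

def add_speaker(old_emb: nn.Embedding) -> nn.Embedding:
    """Return a copy of old_emb with one extra row for the new speaker."""
    n_spk, dim = old_emb.weight.shape
    new_emb = nn.Embedding(n_spk + 1, dim)
    with torch.no_grad():
        new_emb.weight[:n_spk] = old_emb.weight  # keep trained speakers
        new_emb.weight[n_spk].normal_(0, 0.02)   # initialize the new one
    return new_emb

# model = torch.load('models/vctk/bestmodel.pth')           # however you load it
# model.spkr_embedding = add_speaker(model.spkr_embedding)  # hypothetical name
```
Fine-tuning would then update (at least) this new row on the new speaker's data.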
You mean train it as multi-speaker?
Yes. Train it on VCTK with the 22 speakers + your data.
So I have to run extract_feats.py on the 22 speakers + my data, right?
@enk100 if I train it on VCTK with the 22 speakers + my data, should I set --nspk to 23 in train.py?
@lvenoxi - yes
@jaxlinksync - no, run extract_feats.py only on your data and then combine the 22 VCTK speakers' features with your data.
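(Not from the original reply: a minimal sketch of the "combine" step, assuming the features live in per-dataset folders like the paths used elsewhere in this thread; the exact directory names are assumptions.)
```python
# Hedged sketch: copy your extracted npz features into the VCTK feature
# folder so train.py sees all 23 speakers. Paths are assumptions based on
# the commands in this thread; repeat for the valid split if you have one.
import glob
import os
import shutil

src = 'data/sj/numpy_features'    # output of extract_feats.py for your data
dst = 'data/vctk/numpy_features'  # existing VCTK training features
for path in glob.glob(os.path.join(src, '*.npz')):
    shutil.copy(path, dst)
```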
What about the norm.dat of the extracted data? Do I have to add it to the norm_info directory as well, and can I name it anything? In my case I named it sj_norm.dat.
So inside my norm_info directory there are:
- norm.dat (included when downloading the data for VoiceLoop)
- sj_norm.dat (the norm file generated after extracting my dataset)
It is only relevant when you are going to generate samples. So when you generate VCTK, use the VCTK norm.dat; when you generate sj, use sj_norm.dat.
Hi @enk100, thank you so much for your help. One last thing.
By generating samples, do you mean this command?
python generate.py --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 13 --checkpoint models/vctk/bestmodel.pth
How can I pass sj_norm.dat as a parameter?
Modify this line, or add a new argument to the function; a sketch follows below.
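(Not from the original reply, which linked to a specific line in generate.py: a hedged sketch of what "add a new argument" could look like. The flag name --norm is hypothetical, not an existing option in the repo.)
```python
# Hedged sketch: expose the normalization file as a CLI argument instead of
# a hard-coded path. The flag name --norm is hypothetical.
import argparse

parser = argparse.ArgumentParser(description='generate samples')
parser.add_argument('--norm', type=str, metavar='N',
                    default='data/vctk/norm_info/norm.dat',
                    help='norm.dat used to denormalize generated features')
args = parser.parse_args()
# ...then pass args.norm to whatever code reads the normalization file,
# e.g. --norm data/vctk/norm_info/sj_norm.dat when generating the sj voice.
```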
Thank you so much @enk100
You're welcome!
By the way @enk100, how do I know which speaker ID my new speaker is?
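(The suggestion that followed here did not survive the export; judging from the later comment about printing the list of speakers, here is a hedged sketch of one way to do it, assuming speaker IDs are assigned from the sorted filename prefixes in the training folder. Verify against the repo's actual data loader.)
```python
# Hedged sketch: print a speaker-name -> integer-ID mapping, assuming IDs
# come from sorting the unique filename prefixes (e.g. 'p318' from
# 'p318_212.npz'). Verify against how the repo's loader assigns IDs.
import os

train_dir = 'data/vctk/numpy_features'  # path assumed from this thread
prefixes = sorted({name.split('_')[0] for name in os.listdir(train_dir)
                   if name.endswith('.npz')})
for spkr_id, speaker in enumerate(prefixes):
    print(spkr_id, speaker)
```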
Hi @enk100, you're awesome, thanks.
One last thing: when I generate the voice, which checkpoint should I use?
a. models/vctk/bestmodel.pth
b. checkpoints/(expName)/bestmodel.pth
Thank you so much for your help.
Try both:
checkpoints/(expName)/bestmodel.pth
checkpoints/(expName)/lastmodel.pth
Thank you so much @enk100
After I generate, this is what I get.
output.zip
The generated output does not match the original wav file.
Here's the command I used to generate:
sudo python generate.py --npz data/vctk/numpy_features_valid/sj_014.npz --spkr 21 --checkpoint checkpoints/vctk_noise_2/bestmodel.pth
The same goes for lastmodel.pth.
Did I miss something? Other speakers are OK, but ours is not.
Are you sure your speaker is 21? I guess it should be 22, as VCTK has 22 speakers.
Can you get more data for your speaker?
Hi @enk100, I tried --spkr 22 but it said that speaker did not exist. So I printed the list of speakers as per your suggestion above and got this.
As you can see, the speaker sj is 21.
@enk100 can you please confirm whether our dataset is valid? Please PM me at jax@upskill.com so that I can send you a link to our corpus, if that's OK with you.
Hi @jaxlinksync! Can you please give me some advice: did you succeed in fine-tuning an existing model to your new identity?