Tony607/Keras-Trigger-Word

How many samples required for custom keyword

Opened this issue · 21 comments

Hi, I have created a dataset for my own custom keyword with 15 backgrounds, about 60 positive keywords, and 60 negatives. I generated 4000 samples by randomly overlaying the keywords on the backgrounds. I ran about 500 epochs from scratch and the accuracy is 0.77. It is not even recognizing my keyword.
So I tried to create a much larger dataset of about 40000 samples on an Ubuntu instance with 32 GB RAM, and it ran out of memory.
How many samples do I need?
If a large number of samples is required, what file format should I use?
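For anyone following along, the random-overlay step described above can be sketched in plain NumPy. This is a minimal sketch, not the repo's actual pipeline (which works on audio files); the signal lengths and amplitudes here are illustrative:

```python
import numpy as np

def overlay_clip(background, clip, start):
    """Mix a short keyword clip into a longer background waveform at
    sample offset `start`. Both are float arrays in [-1, 1]."""
    out = background.copy()
    out[start:start + len(clip)] += clip
    return np.clip(out, -1.0, 1.0)          # guard against overflow past full scale

rng = np.random.default_rng(0)
bg = 0.01 * rng.standard_normal(441000)     # 10 s of quiet noise at 44.1 kHz
kw = 0.5 * np.sin(2 * np.pi * 440 * np.arange(30870) / 44100)  # ~0.7 s stand-in keyword
start = int(rng.integers(0, len(bg) - len(kw)))
mixed = overlay_clip(bg, kw, start)
print(mixed.shape)  # (441000,)
```

Repeating this with random offsets over many backgrounds is how the 4000 samples above would be generated.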

Hello! Did you solve this problem? Could you tell me how you split your training, validation, and test samples? And was it overfitting when you ran 500 epochs? I have 40 positives and 40 negatives in total, which I partition into 20 training samples, 10 validation samples, and 10 test samples. But the result I get is bad: the accuracy on positives in the test set is 0.51. Could you do me a favor?

Hi @nicole-zhao . Please mail me.

Hi,
Has anyone used this model to train on a customized command? I created data and the related labels and trained the model, but did not get good results. I started with one sequence of noise plus the keyword and created 100 examples and labels from it; then I tried a much larger dataset. The accuracy metric goes up to 0.93, but when I test on the very dataset used to train the network, the results are nowhere near the existing pre-trained model, which detects words correctly.
Any help?

I trained with 4000 samples and got only 77 percent accuracy. How many samples did you create?

@jennings1716,
Thanks for the reply.
Initially I tried with an 8 GB dataset, thinking it would converge. The first time I started from the existing model, and the second time I used an entirely new model. I'm afraid this accuracy metric gives very little insight into the network's performance. In one simple experiment I trained a fresh model: accuracy moved from 0.5 up to 0.93, yet when I predicted over a sequence used to train the model, the prediction varied randomly wherever a keyword appeared in the sequence.
With 77 percent accuracy, did you try it on a sequence used to train the network? The more the network is trained, the worse it performs compared to the existing model. Something is wrong in the dataset or in the model, I feel.

Hello, how did you make your training data? Have you checked that your labelings are okay?

Hi Petrimmz,
Yes, it was my own customized dataset. I do not see any issue with the labelling.
But I wonder whether there is a bug in the code: the whole pipeline talks about sampling data at 44.1 kHz, yet when it computes the spectrogram it uses an 8 kHz sample rate, which does not make sense to me. I corrected the sample rate, and after training with a high number of epochs I can see my customized commands being recognized fairly well.
Thanks
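If it helps others hitting the same mismatch, one way to apply that correction is to read the real sample rate from the wav file instead of hardcoding 8000 Hz. This is a sketch from memory of the repo's matplotlib-based graph_spectrogram, not its exact code, with a self-contained demo tone:

```python
import numpy as np
from scipy.io import wavfile
from matplotlib.mlab import specgram

def graph_spectrogram_fixed(wav_file, nfft=200, noverlap=120):
    """Like the example's graph_spectrogram, but uses the file's real
    sample rate instead of a hardcoded 8000 Hz."""
    rate, data = wavfile.read(wav_file)
    if data.ndim > 1:                      # stereo -> mono
        data = data[:, 0]
    pxx, freqs, bins = specgram(data, NFFT=nfft, Fs=rate, noverlap=noverlap)
    return pxx

# demo: a 1-second 440 Hz tone written at 44.1 kHz
rate = 44100
t = np.arange(rate) / rate
tone = (0.5 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)
wavfile.write("tone.wav", rate, tone)
pxx = graph_spectrogram_fixed("tone.wav")
print(pxx.shape[0])  # 101 frequency bins, regardless of rate
```

Note that using the real rate changes the number of time frames, so Tx (and the label length Ty) must change with it.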

OK, what are the output dimensions of your spectrogram? I am using the example's def graph_spectrogram(wav_file) method, and it outputs (101, 998) when I use 10-second audio with a sampling rate of 8000 Hz, nfft = 200, noverlap = 120. Hmm, in the example it outputs 5511... or is this just the bug you mentioned?

I see this 8 kHz sample rate as a bug. The number of frequency bins would not change with the sample rate, since nfft sets how many points you want, and that stays the same at 8 kHz or 44.1 kHz. But treating audio recorded at 44.1 kHz as if it were sampled at a lower rate introduces aliasing, which can cause problems in training.
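To be precise about which dimension changes: with nfft = 200 the one-sided spectrogram always has 101 frequency bins, but the number of time frames grows with the number of samples (rate × duration), assuming the usual no-padding framing that matplotlib's specgram uses. A quick sanity check of the numbers in this thread:

```python
def spectrogram_dims(n_samples, nfft=200, noverlap=120):
    """Frequency bins and time frames for a one-sided STFT with these
    settings, assuming no padding (matplotlib specgram's framing)."""
    freq_bins = nfft // 2 + 1
    frames = (n_samples - nfft) // (nfft - noverlap) + 1
    return freq_bins, frames

print(spectrogram_dims(10 * 8000))    # (101, 998)  -> 10 s read at 8 kHz
print(spectrogram_dims(10 * 44100))   # (101, 5511) -> 10 s at 44.1 kHz
```

So both dimensions reported in the thread are consistent; only the sample rate differs.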

Yes, I understand. How big is your training data? How many unique positive keywords did you use? Or did you use just a few unique ones (maybe from one person only?) and insert those few keywords into hundreds of 10-second samples? Are you using the example's network structure, or did you make it deeper or wider, etc.?

I recorded a decent dataset in various rooms, repeatedly saying the customized word. I have only one positive word to train against. The positive/negative words were inserted randomly over the 10-second background noise. But note that I am testing on the vectors the network was trained against. Yes, I am using the example network. Where are you stuck?

Well, I am kind of stuck. It might just be the amount of training data. But here is my setup:

64 audio files at 8000 Hz, each with 2 positive keywords inserted. The keywords are randomly chosen from 90 positive keywords said by one person. Each 10-second clip is the same person saying something other than the positive keyword. I am not overlaying the keyword on top of that existing speech; instead I automatically cut the 10-second audio at a random point and insert the randomly chosen 700 ms keyword there. So I am not using those existing 10-second clips as background noise. I label everything 0 except where the keyword is said.
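One detail worth checking here: as far as I recall, the original notebook does not label the keyword region itself; it sets the label to 1 for roughly 50 timesteps immediately after the keyword ends. A minimal sketch of that convention (the numbers are illustrative):

```python
import numpy as np

def make_labels(ty, keyword_end_step, width=50):
    """Length-`ty` label vector: zeros everywhere, ones for `width`
    steps immediately after the keyword ends."""
    y = np.zeros(ty, dtype=np.float32)
    y[keyword_end_step + 1 : keyword_end_step + 1 + width] = 1.0
    return y

y = make_labels(ty=999, keyword_end_step=400)
print(int(y.sum()))  # 50
```

Labeling during the keyword instead of after it would change what the network is asked to learn, so it is worth matching the example's convention exactly.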

I am using the example's network structure, except my first line is different:
X = Conv1D(32, kernel_size=3, strides=2)(X_input)

My input dimensions are (64, 2000, 51) and the output is (64, 999).

When I train it with Adam and LR = 0.0001 it just doesn't learn: acc is about 0.7 and val_acc is 0.899 (maybe it is learning only the 0's).

Should I just increase the training data size, or what is your opinion?

Seems you are doing many things right... but I would have tried to get the model working with its own params and network structure before making modifications.
It is always better to start from the existing trained model, as such a model would not take long to converge to a new command set. You may converge to a new command set with only 5-6 good samples, a training set of 1000, and 100 epochs. If you have to start from scratch, you may spend a week waiting for anything useful to come up. I didn't try training from scratch, so I can't advise how long you would have to wait or how many epochs you should use. If you get it working, let us know how long it took you to train from scratch.
Anyway, is your dataset normalized? In my experience, if you don't get an accuracy of 0.98-0.99, the network will not do any good.
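On the normalization point: one common approach is to standardize each frequency bin over the whole training set. A sketch (the shapes follow this thread's (num_examples, Tx, n_freq) convention; this is not the repo's code):

```python
import numpy as np

def normalize_features(X):
    """Standardize each frequency bin over all examples and timesteps.
    X has shape (num_examples, Tx, n_freq)."""
    mean = X.mean(axis=(0, 1), keepdims=True)
    std = X.std(axis=(0, 1), keepdims=True) + 1e-8   # avoid divide-by-zero
    return (X - mean) / std

X = np.random.rand(4, 998, 101).astype(np.float32)
Xn = normalize_features(X)
print(Xn.shape)  # (4, 998, 101)
```

Whatever statistics are computed on the training set should be reused as-is for validation and test data.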

Don't we have transfer learning for the hotword detection model?

@jennings1716 btw, how is your keyword spotting doing? Have you managed to build your own custom keyword detection system?

I created a .npy file with 10000 samples; the accuracy was 0.77. System RAM is not enough to hold an array larger than 10000 samples. Writing to a .h5 file consumes huge space and time.
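If RAM is the bottleneck, one option is to write examples to HDF5 one at a time so the full array never has to be in memory, and then feed slices to Keras with a generator. A small sketch with h5py; make_example is a hypothetical stand-in for the real spectrogram pipeline, and the sizes are shrunk for illustration:

```python
import h5py
import numpy as np

def make_example(i, tx=100, n_freq=101):
    # Hypothetical stand-in for the real "overlay + spectrogram" pipeline.
    return np.random.rand(tx, n_freq).astype("float32")

n_examples, tx, n_freq = 32, 100, 101
with h5py.File("train.h5", "w") as f:
    dset = f.create_dataset("X", shape=(n_examples, tx, n_freq),
                            dtype="float32", compression="gzip")
    for i in range(n_examples):
        dset[i] = make_example(i)   # only one example in RAM at a time
```

gzip compression trades write time for disk space; dropping it makes writing faster at the cost of larger files.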

Google has patent rights for transfer learning for hotword detection

So you have 10000 samples with inserted keywords + negative words and you end up with only 0.77 accuracy... something's wrong. Have you tried making the network deeper/wider?

No, I didn't try... Would deepening the network with the same samples increase accuracy?

Hi! So did anyone make it work? I have the same problem as many of you. I created 5k examples (10 positives, 9 negatives, 10 backgrounds), fed the 5k examples in batches of 1k, and after 850 epochs it doesn't work.

Can you provide the probability graph from here?
plt.subplot(2, 1, 2)
plt.plot(predictions[0,:,0])
plt.ylabel('probability')
plt.show()
My graph does not start from zero, and I have a very high loss value, starting from 0.9956 and dropping to 0.70xx. Is this okay?
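About reading that probability curve: the raw per-timestep probability is noisy, so a common trick (similar in spirit to the notebook's chime-triggering logic, as far as I recall) is to fire a detection only when the probability stays above a threshold for a run of consecutive steps. The threshold and run length here are illustrative:

```python
import numpy as np

def detect_trigger(probs, threshold=0.5, consecutive=20):
    """Return the first index where `consecutive` successive steps all
    exceed `threshold`, or -1 if the keyword is never detected."""
    run = 0
    for i, p in enumerate(probs):
        run = run + 1 if p > threshold else 0
        if run >= consecutive:
            return i
    return -1

probs = np.concatenate([np.full(100, 0.1), np.full(30, 0.9), np.full(50, 0.1)])
print(detect_trigger(probs))  # 119
```

A curve that never sustains a high run like this would match the "not even recognizing my keyword" symptom reported earlier in the thread.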