Is it convenient for you to share the pretrained model with me?
wangjianfly2003 opened this issue · 15 comments
Hi Dr. Xu,
Is it convenient for you to share the pretrained model with me?
Hi, the initialized model was not pre-trained; it just used random initialization.
OK, it is here: https://github.com/yongxuUSTC/DNN-for-speech-enhancement/tree/master/toolbox/weights
This is the source code for initializing your model weights randomly and for converting the weights back for MATLAB decoding.
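For example, random initialization can be as simple as the following rough sketch (the exact ranges and the weight layout used in the released toolbox may differ):

#include <cstdlib>
#include <cmath>

// Illustrative only: fill an (n_in x n_out) weight matrix with small
// random values drawn uniformly from [-r, r], with r scaled by the
// fan-in and fan-out of the layer.
void random_init_weights(float *w, int n_in, int n_out) {
    float r = sqrtf(6.0f / (float)(n_in + n_out));
    for (int i = 0; i < n_in * n_out; i++) {
        float u = (float)rand() / (float)RAND_MAX;  // uniform in [0, 1]
        w[i] = (2.0f * u - 1.0f) * r;               // uniform in [-r, r]
    }
}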
Thank you very much for your kind reply, Dr. Xu.
You mean I don't need to do the pre-training process, and can get the same speech enhancement effect as you show in DNN_speech_enhancement_tool using only the fine-tuning process?
Yes, correct. Just use the fine-tuning process with random initialization. I once tried RBM-based pre-training, which did not work.
OK. I will try to train a new model on my collected noisy data using the fine-tuning process you provide. Thank you very much.
From reading your decoding code, I guess you use the noisy speech and the noise as input features, and the clean speech and the noise as output features to train the model you provide. Am I right?
Besides, you use the normalized "timit_aurora4_115NT_7SNRS_each190_80uuts_noisy_lsp_be_random_linux_global_mv.mat" file to process the input noisy speech; however, I don't understand why you use this file for DNN decoding. Why not use the normalized output feature for decoding?
The direct mapping is from noisy speech log-power spectra to clean speech log-power spectra. Additionally, you can also predict noise log-power spectra, ideal binary mask, or ideal ratio mask to do some post-processing.
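For illustration, those alternative targets can be computed per time-frequency bin roughly like this (a minimal sketch; the exact definitions and thresholds vary across papers, so treat the formulas below as one common variant):

#include <cmath>

// One common variant of the ideal ratio mask (IRM): the fraction of the
// mixture power attributed to clean speech in each time-frequency bin.
float ideal_ratio_mask(float clean_pow, float noise_pow) {
    return clean_pow / (clean_pow + noise_pow + 1e-12f);
}

// One common variant of the ideal binary mask (IBM): 1 if the local SNR
// exceeds a threshold (in dB), otherwise 0.
float ideal_binary_mask(float clean_pow, float noise_pow, float thr_db) {
    float snr_db = 10.0f * log10f((clean_pow + 1e-12f) / (noise_pow + 1e-12f));
    return (snr_db > thr_db) ? 1.0f : 0.0f;
}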
The norm file is used both for training and decoding. In the decoding, you should normalize the input noisy feature, and transform the enhanced feature back to the normal scale using the norm file.
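Roughly, the decoding side looks like the following sketch (assuming the norm file stores a global mean and standard deviation per dimension; the variable names here are just illustrative, the real code is in "step1_DNNenh_for 16kHz.m"):

// Normalize a noisy input frame before the DNN, and map the enhanced
// output back to the original log-power-spectrum scale afterwards,
// using the same global mean/std loaded from the norm file.
void normalize_input(float *x, const float *mean, const float *stdv, int dim) {
    for (int k = 0; k < dim; k++)
        x[k] = (x[k] - mean[k]) / stdv[k];
}

void denormalize_output(float *y, const float *mean, const float *stdv, int dim) {
    for (int k = 0; k < dim; k++)
        y[k] = y[k] * stdv[k] + mean[k];   // inverse of the training-time normalization
}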
Where did you find "timit_aurora4_115NT_7SNRS_each190_80uuts_noisy_lsp_be_random_linux_global_mv.mat"?
I think I use a different one: https://drive.google.com/file/d/0B5r5bvRpQ5DRR1lIV1hpZ0RLQ0E/view
Hi Dr. Xu. I made a mistake about the norm file; I actually used the same norm file as you did.
In the "BP_GPU.cu" file, I think the code should be modified as below to make the output unit linear, i.e., change the second parameter from "cur_layer_y" to "cur_layer_x":

cudaMemcpy(dev[0].out, cur_layer_x, n_frames*cur_layer_units*sizeof(float), cudaMemcpyDeviceToDevice);

Am I right?
You are right. cudaMemcpy(dev[0].out, cur_layer_x, n_frames*cur_layer_units*sizeof(float), cudaMemcpyDeviceToDevice);
I think I uploaded the code for ideal binary mask prediction. I commented out the sigmoid code, but forgot to change "cur_layer_y" to "cur_layer_x".
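In other words, the choice of output unit depends on the target; a small sketch of the idea (not the actual kernel code):

#include <cmath>

// cur_layer_x is the pre-activation (linear) output of the last layer,
// and its sigmoid corresponds to cur_layer_y. For log-power-spectrum
// regression the linear value is the prediction; for ideal-binary-mask
// prediction the sigmoid value (in [0, 1]) is the prediction.
float output_unit(float x_linear, bool predict_mask) {
    if (predict_mask)
        return 1.0f / (1.0f + expf(-x_linear));  // sigmoid output for mask targets
    return x_linear;                             // linear output for LPS regression
}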
I have updated the code.
Please also update the "cv_bunch_single" function.
Hi Dr. Xu. Today I trained a model using noisy speech log-power spectra as the input feature (50 TIMIT clean utterances corrupted with 100 environment noise types at -5 dB SNR) and clean speech log-power spectra as the target feature. The learning rate is 0.0005, the layer sizes are 2827 (257*11), 2048, 2048, 2048, 257, the weights are randomly initialized, and the number of epochs is 35 (the squared_err value kept decreasing). Then I used the trained model for decoding, but got a very poor result; I can't even hear the speech.
Could you tell me how to determine the cause of the problem?
Is the training set too small?
Is there a mistake in my decoding?
...
Could you update your "finetune_DNN_speech_enhancement_dropout_NAT.pl", "interface.cc", and "step1_DNNenh_for 16kHz.m" files for the direct-mapping model from noisy speech log-power spectra to clean speech log-power spectra? I think I only changed those three files.
If you want to check your code, you can map from clean to clean; if that still does not work, it means your code has a problem. You should do the inverse feature normalization as I did in "step1_DNNenh_for 16kHz.m". Please refer to "step1_DNNenh_for 16kHz.m" for decoding. There is no problem in the decoding code.
Hi Dr. Xu. I mapped from clean to clean, and it still does not work. So I started to check the code, and found that the mapping from 11 frames of input features to one frame of target features is correct, but the input data of frame 5 and frame 10 in para->indata are the same, while frame 5 and frame 10 in dataori are not the same. So I think there may be something wrong in the following code:
for (j = 0; j <= cur_frame_of_sent - para->fea_context; j++) {
    for (i = 0; i < para->fea_context; i++) {
        for (k = 0; k < para->fea_dim; k++) {
            para->indata[sample_index[cur_sample] * para->layersizes[0] + k + i * para->fea_dim] = dataori[(frames_processed + j + i) * (2 + para->fea_dim) + k + 2];
        }
    }
I think the statement

para->indata[sample_index[cur_sample] * para->layersizes[0] + k + i * para->fea_dim] = dataori[(frames_processed + j + i) * (2 + para->fea_dim) + k + 2];

should be changed to

para->indata[sample_index[cur_sample] * para->layersizes[0] + k + i * para->fea_dim] = dataori[(frames_processed + j * para->fea_context + i) * (2 + para->fea_dim) + k + 2];

Am I right?
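To spell out the difference, here is a small standalone sketch of the two indexings, with frames_processed taken as 0 and the dataori layout hidden behind a plain frame array (this is just my illustration, not the repo's code):

// Splice fea_context frames of fea_dim features into one input sample
// for target frame j. 'frames' is frame-major: frames[t*fea_dim + k].
// Current code:       source frame = j + i              (overlapping, stride-1 windows)
// My proposed change: source frame = j*fea_context + i  (non-overlapping blocks)
void splice_sample(const float *frames, int fea_dim, int fea_context,
                   int j, bool use_proposed_indexing, float *sample) {
    for (int i = 0; i < fea_context; i++) {
        int src = use_proposed_indexing ? (j * fea_context + i) : (j + i);
        for (int k = 0; k < fea_dim; k++)
            sample[i * fea_dim + k] = frames[src * fea_dim + k];
    }
}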
I commented out the following code in the interface.cc file:
/* i=i-1;
for(k=129;k< 2*(para->fea_dim);k++){
para->indata[sample_index[cur_sample]* para->layersizes[0] +k +i *para->fea_dim] = (dataori[(frames_processed + 0) *(2+para->fea_dim) +(k-129)+2]+dataori[(frames_processed + 1) *(2+para->fea_dim) +(k-129)+2]+dataori[(frames_processed + 2) *(2+para->fea_dim) +(k-129)+2]+dataori[(frames_processed + 3) *(2+para->fea_dim) +(k-129)+2]+dataori[(frames_processed + 4) *(2+para->fea_dim) +(k-129)+2]+dataori[(frames_processed + 5) *(2+para->fea_dim) +(k-129)+2])/6.0f;
}
*/
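For context, my understanding of that commented-out block is that it appends a noise estimate, the average of the first 6 noisy frames, to the spliced input, as in the noise-aware training (NAT) setup suggested by the script name. A rough standalone sketch (the function name and offset here are illustrative):

// Append the average of the first n_avg noisy frames as an extra
// noise-estimate block starting at 'offset' in the spliced input vector.
// 'frames' is frame-major: frames[t*fea_dim + k].
void append_noise_estimate(const float *frames, int fea_dim, int n_avg,
                           float *sample, int offset) {
    for (int k = 0; k < fea_dim; k++) {
        float sum = 0.0f;
        for (int t = 0; t < n_avg; t++)
            sum += frames[t * fea_dim + k];
        sample[offset + k] = sum / (float)n_avg;   // average over the first n_avg frames
    }
}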