Is it convenient for you to share the pretrained model with me?
wangjianfly2003 opened this issue · 15 comments
Hi Dr. Xu,
Is it convenient for you to share the pretrained model with me?
Hi, the initialized model was not pre-trained; it just used random initialization.
OK, it is here: https://github.com/yongxuUSTC/DNN-for-speech-enhancement/tree/master/toolbox/weights
This is the source code for initializing your model weights randomly and for converting the weights back for MATLAB decoding.
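For example, random initialization can be as simple as the following rough sketch (the exact ranges and the weight layout used in the released toolbox may differ):

#include <cstdlib>
#include <cmath>

// Illustrative only: fill an (n_in x n_out) weight matrix with small
// random values drawn uniformly from [-r, r], with r scaled by the
// fan-in and fan-out of the layer.
void random_init_weights(float *w, int n_in, int n_out) {
    float r = sqrtf(6.0f / (float)(n_in + n_out));
    for (int i = 0; i < n_in * n_out; i++) {
        float u = (float)rand() / (float)RAND_MAX;  // uniform in [0, 1]
        w[i] = (2.0f * u - 1.0f) * r;               // uniform in [-r, r]
    }
}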
Thank you very much for your kind reply, Dr. Xu.
You mean I don't need to do the pre-training process, and can get the same speech enhancement effect as you show in DNN_speech_enhancement_tool using only the fine-tuning process?
Yes, correct. Just use the fine-tuning process with random initialization. I once tried RBM-based pre-training, which did not work.
OK. I will try to train a new model on my collected noisy data using the fine-tuning process you provide. Thank you very much.
From reading your decoding code, I guess you use the noisy speech and the noise as input features, and the clean speech and the noise as output features to train the model you provide. Am I right?
Besides, you use the normalized "timit_aurora4_115NT_7SNRS_each190_80uuts_noisy_lsp_be_random_linux_global_mv.mat" file to process the input noisy speech; however, I don't understand why you use this file for DNN decoding. Why not use the normalized output feature for decoding?
The direct mapping is from noisy speech log-power spectra to clean speech log-power spectra. Additionally, you can also predict noise log-power spectra, ideal binary mask, or ideal ratio mask to do some post-processing.
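For illustration, those alternative targets can be computed per time-frequency bin roughly like this (a minimal sketch; the exact definitions and thresholds vary across papers, so treat the formulas below as one common variant):

#include <cmath>

// One common variant of the ideal ratio mask (IRM): the fraction of the
// mixture power attributed to clean speech in each time-frequency bin.
float ideal_ratio_mask(float clean_pow, float noise_pow) {
    return clean_pow / (clean_pow + noise_pow + 1e-12f);
}

// One common variant of the ideal binary mask (IBM): 1 if the local SNR
// exceeds a threshold (in dB), otherwise 0.
float ideal_binary_mask(float clean_pow, float noise_pow, float thr_db) {
    float snr_db = 10.0f * log10f((clean_pow + 1e-12f) / (noise_pow + 1e-12f));
    return (snr_db > thr_db) ? 1.0f : 0.0f;
}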
The norm file is used both for training and decoding. In the decoding, you should normalize the input noisy feature, and transform the enhanced feature back to the normal scale using the norm file.
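Roughly, the decoding side looks like the following sketch (assuming the norm file stores a global mean and standard deviation per dimension; the variable names here are just illustrative, the real code is in "step1_DNNenh_for 16kHz.m"):

// Normalize a noisy input frame before the DNN, and map the enhanced
// output back to the original log-power-spectrum scale afterwards,
// using the same global mean/std loaded from the norm file.
void normalize_input(float *x, const float *mean, const float *stdv, int dim) {
    for (int k = 0; k < dim; k++)
        x[k] = (x[k] - mean[k]) / stdv[k];
}

void denormalize_output(float *y, const float *mean, const float *stdv, int dim) {
    for (int k = 0; k < dim; k++)
        y[k] = y[k] * stdv[k] + mean[k];   // inverse of the training-time normalization
}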
Where did you find "timit_aurora4_115NT_7SNRS_each190_80uuts_noisy_lsp_be_random_linux_global_mv.mat"?
I think I use a different one: https://drive.google.com/file/d/0B5r5bvRpQ5DRR1lIV1hpZ0RLQ0E/view
Hi Dr. Xu. I made a mistake about the norm file; I actually used the same norm file as you did.
In the "BP_GPU.cu" file, I think the code should be modified as below to make the output unit linear, i.e., change the second parameter from "cur_layer_y" to "cur_layer_x":

cudaMemcpy(dev[0].out, cur_layer_x, n_frames*cur_layer_units*sizeof(float), cudaMemcpyDeviceToDevice);

Am I right?
You are right. cudaMemcpy(dev[0].out, cur_layer_x, n_frames*cur_layer_units*sizeof(float), cudaMemcpyDeviceToDevice);
I think I uploaded the code for ideal binary mask prediction. I commented out the sigmoid code, but forgot to change "cur_layer_y" to "cur_layer_x".
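In other words, the choice of output unit depends on the target; a small sketch of the idea (not the actual kernel code):

#include <cmath>

// cur_layer_x is the pre-activation (linear) output of the last layer,
// and its sigmoid corresponds to cur_layer_y. For log-power-spectrum
// regression the linear value is the prediction; for ideal-binary-mask
// prediction the sigmoid value (in [0, 1]) is the prediction.
float output_unit(float x_linear, bool predict_mask) {
    if (predict_mask)
        return 1.0f / (1.0f + expf(-x_linear));  // sigmoid output for mask targets
    return x_linear;                             // linear output for LPS regression
}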
I have updated the code.
Please also update the "cv_bunch_single" function.
Hi Dr. Xu. Today I trained a model using noisy speech log-power spectra as the input feature (50 TIMIT clean utterances corrupted with 100 environment noise types at -5 dB SNR) and clean speech log-power spectra as the target feature. The learning rate is 0.0005, the layer sizes are 2827 (257*11), 2048, 2048, 2048, 257, the weights are randomly initialized, and the number of epochs is 35 (the squared_err value kept decreasing). Then I used the trained model for decoding, but got a very poor result; I can't even hear the speech.
Could you tell me how to determine the cause of the problem?
Is the training set too small?
Is there a mistake in my decoding?
...
Could you update your "finetune_DNN_speech_enhancement_dropout_NAT.pl", "interface.cc", and "step1_DNNenh_for 16kHz.m" files for the direct-mapping model from noisy speech log-power spectra to clean speech log-power spectra? I think I only changed those three files.
If you want to check your code, you can map from clean to clean; if that still does not work, it means your code has a problem. You should do the inverse feature normalization as I did in "step1_DNNenh_for 16kHz.m". Please refer to "step1_DNNenh_for 16kHz.m" for decoding. There is no problem in the decoding code.
Hi Dr. Xu. I mapped from clean to clean, and it still does not work. So I started to check the code, and found that the mapping from 11 frames of input features to one frame of target features is correct, but the input data of frame 5 and frame 10 in para->indata are the same, while frame 5 and frame 10 in dataori are not the same. So I think there may be something wrong in the following code:
for (j = 0; j <= cur_frame_of_sent - para->fea_context; j++) {
    for (i = 0; i < para->fea_context; i++) {
        for (k = 0; k < para->fea_dim; k++) {
            para->indata[sample_index[cur_sample] * para->layersizes[0] + k + i * para->fea_dim] = dataori[(frames_processed + j + i) * (2 + para->fea_dim) + k + 2];
        }
    }
I think the statement

para->indata[sample_index[cur_sample] * para->layersizes[0] + k + i * para->fea_dim] = dataori[(frames_processed + j + i) * (2 + para->fea_dim) + k + 2];

should be changed to

para->indata[sample_index[cur_sample] * para->layersizes[0] + k + i * para->fea_dim] = dataori[(frames_processed + j * para->fea_context + i) * (2 + para->fea_dim) + k + 2];

Am I right?
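To spell out the difference, here is a small standalone sketch of the two indexings, with frames_processed taken as 0 and the dataori layout hidden behind a plain frame array (this is just my illustration, not the repo's code):

// Splice fea_context frames of fea_dim features into one input sample
// for target frame j. 'frames' is frame-major: frames[t*fea_dim + k].
// Current code:       source frame = j + i              (overlapping, stride-1 windows)
// My proposed change: source frame = j*fea_context + i  (non-overlapping blocks)
void splice_sample(const float *frames, int fea_dim, int fea_context,
                   int j, bool use_proposed_indexing, float *sample) {
    for (int i = 0; i < fea_context; i++) {
        int src = use_proposed_indexing ? (j * fea_context + i) : (j + i);
        for (int k = 0; k < fea_dim; k++)
            sample[i * fea_dim + k] = frames[src * fea_dim + k];
    }
}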
I commented out the following code in the interface.cc file:
/* i=i-1;
for(k=129;k< 2*(para->fea_dim);k++){
para->indata[sample_index[cur_sample]* para->layersizes[0] +k +i *para->fea_dim] = (dataori[(frames_processed + 0) *(2+para->fea_dim) +(k-129)+2]+dataori[(frames_processed + 1) *(2+para->fea_dim) +(k-129)+2]+dataori[(frames_processed + 2) *(2+para->fea_dim) +(k-129)+2]+dataori[(frames_processed + 3) *(2+para->fea_dim) +(k-129)+2]+dataori[(frames_processed + 4) *(2+para->fea_dim) +(k-129)+2]+dataori[(frames_processed + 5) *(2+para->fea_dim) +(k-129)+2])/6.0f;
}
*/
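For context, my understanding of that commented-out block is that it appends a noise estimate, the average of the first 6 noisy frames, to the spliced input, as in the noise-aware training (NAT) setup suggested by the script name. A rough standalone sketch (the function name and offset here are illustrative):

// Append the average of the first n_avg noisy frames as an extra
// noise-estimate block starting at 'offset' in the spliced input vector.
// 'frames' is frame-major: frames[t*fea_dim + k].
void append_noise_estimate(const float *frames, int fea_dim, int n_avg,
                           float *sample, int offset) {
    for (int k = 0; k < fea_dim; k++) {
        float sum = 0.0f;
        for (int t = 0; t < n_avg; t++)
            sum += frames[t * fea_dim + k];
        sample[offset + k] = sum / (float)n_avg;   // average over the first n_avg frames
    }
}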