DNN-based speech enhancement optimized by a maximum likelihood criterion rather than the conventional MMSE criterion
This repository contains the code and demos for the paper "Using Generalized Gaussian Distributions to Improve Regression Error Modeling for Deep-Learning-Based Speech Enhancement" (submitted to IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING).
1. `cd SourceCode_Wav2LogSpec_be` and execute `make` to generate the executable file "Wav2LPS_be".
Note: Set the sampling frequency, frame length, and frame shift in the source file "Wav2LogSpec_be.c" according to your own needs. In our paper, we set them to 16 kHz, 32 ms, and 16 ms, respectively.
2. Extract LPS features with the command below:
matlab -nodesktop -nosplash -r LPS_extract
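For orientation, the LPS (log power spectrum) feature extraction performed by "Wav2LPS_be" can be sketched as follows in Python/NumPy. This is an illustrative re-implementation, not the C code; the function name and windowing choice are assumptions, while the 16 kHz / 32 ms / 16 ms settings and the 257-dimensional output match the paper.

```python
import numpy as np

def extract_lps(wav, fs=16000, frame_ms=32, shift_ms=16):
    """Illustrative log-power-spectrum extraction (sketch, not the C implementation)."""
    frame_len = fs * frame_ms // 1000          # 512 samples at 16 kHz / 32 ms
    shift = fs * shift_ms // 1000              # 256 samples at 16 kHz / 16 ms
    n_frames = 1 + max(0, (len(wav) - frame_len) // shift)
    window = np.hamming(frame_len)             # window choice is an assumption
    feats = np.empty((n_frames, frame_len // 2 + 1))   # 257 dims, matching the .scp
    for m in range(n_frames):
        frame = wav[m * shift: m * shift + frame_len] * window
        spec = np.fft.rfft(frame)
        feats[m] = np.log(np.abs(spec) ** 2 + 1e-12)   # small floor avoids log(0)
    return feats
```

The 257 feature dimensions (512-point FFT, one-sided spectrum) correspond to the dimension passed to `GetLenForFeaScp.pl` below.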
1. Calculate the number of frames per sentence with the command below:
perl GetLenForFeaScp.pl train_noisy.scp frame_numbers.len 257 1
2. Use the QuickNet toolset to prepare PFiles as the input and output files with the commands below:
perl pfile_noisy.pl
perl pfile_clean.pl
A PFile is a single large archive containing all the training features.
3. Calculate the mean and standard deviation for Z-score normalization with the command below:
perl get_norm.pl
Note that this command stores the reciprocal of the standard deviation rather than the standard deviation itself.
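The normalization step amounts to the following (an illustrative Python sketch; `get_norm.pl` itself operates on the PFile):

```python
import numpy as np

def compute_norm(feats):
    """Per-dimension mean and *reciprocal* standard deviation, as stored by get_norm.pl."""
    mean = feats.mean(axis=0)
    inv_std = 1.0 / feats.std(axis=0)   # reciprocal stored so normalization is a multiply
    return mean, inv_std

def apply_zscore(feats, mean, inv_std):
    """Z-score normalization using the precomputed statistics."""
    return (feats - mean) * inv_std
```

Storing the reciprocal turns the per-frame normalization into a subtract-and-multiply, avoiding a division for every feature value.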
1. CUDA
2. g++
This code trains regression feed-forward DNNs, where the optimization criterion can be either the conventional MMSE or our proposed objective function with a GGD error model derived according to the ML criterion.
1. Execute `make` to generate the executable file "BPtrain_Sigmoid".
2. You can train the feed-forward DNNs in the paper by calling "BPtrain_Sigmoid" with the command below:
perl finetune.pl
Note: The parameters "MLflag" and "shapefactor" in "finetune.pl" control the choice of the objective function.
- When MLflag≠1, the classic β-norm function is selected as the objective function, where β=1 corresponds to the L1-norm, namely the least absolute deviation (LAD), and β=2 corresponds to the L2-norm, namely the MMSE.
- When MLflag=1, the GGD error model based log-likelihood function is selected as the objective function, where "shapefactor" refers to the shape factor β in GGD.
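The two objective choices can be sketched as follows (an illustrative Python/NumPy version for a minibatch of M frames; function names are ours, and additive constants independent of the parameters are dropped from the GGD negative log-likelihood):

```python
import numpy as np

def beta_norm_loss(out, targ, beta):
    """MLflag != 1: mean beta-norm error; beta=2 -> MMSE, beta=1 -> LAD."""
    return np.mean(np.sum(np.abs(out - targ) ** beta, axis=1))

def ggd_nll_loss(out, targ, beta):
    """MLflag == 1: negative log-likelihood under a zero-mean GGD error model,
    with the scale alpha set per dimension to its closed-form minimizer
    (the 'scalefactor' of the training code)."""
    err = np.abs(out - targ)
    alpha = (beta * np.mean(err ** beta, axis=0)) ** (1.0 / beta)
    # up to constants: sum_d [ log(alpha_d) + mean_m (|e_d^m| / alpha_d)^beta ]
    return np.sum(np.log(alpha) + np.mean((err / alpha) ** beta, axis=0))
```

With the closed-form alpha plugged in, the data term of the GGD loss reduces to a constant per dimension, so training effectively pushes down log(alpha), i.e. the spread of the prediction errors.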
In this paper, we propose a new objective function. The code for our proposed ML-GGD-DNN can be obtained by making minor modifications to the code for MMSE-DNN. More specifically, we only need to modify the gradient of the objective function with respect to the output in the backpropagation part of the MMSE-DNN code.
The following code is from lines 408 to 423 of "BP_GPU.cu" and calculates the gradient of the objective function with respect to the output. The called functions are defined in "DevFunc.cu".
// Beta-norm gradient: beta * sign(e) * |e|^(beta-1), with e = output - target
DevSubClean2(streams, n_frames, cur_layer_units, shapefactor, dev.out, targ, cur_layer_dedx);
// Average over the minibatch
DevVecMulNum(streams, cur_layer_units * n_frames, cur_layer_dedx, 1.0f / n_frames, cur_layer_dedx);
if (MLflag == 1)
{
    // Error e = output - target and its absolute value
    Deverror(streams, n_frames, cur_layer_units, dev.out, targ, realerror);
    Devabsolutevalus(streams, cur_layer_units * n_frames, realerror, errorabsolute);
    // |e|^beta, summed over the frames of the minibatch and divided by n_frames
    Devindex2(streams, n_frames * cur_layer_units, errorabsolute, shapefactor, errorabsolute2);
    DevSumcol(streams, n_frames, cur_layer_units, errorabsolute2, vec1);
    DevDivide(streams, cur_layer_units, vec1, vec1, n_frames);
    // Closed-form scale factor: alpha = (beta * mean(|e|^beta))^(1/beta)
    DevVecMulNum(streams, cur_layer_units, vec1, shapefactor, vec2);
    float ppp = 1.0f / shapefactor;
    Devindex2(streams, cur_layer_units, vec2, ppp, scalefactor);
    // ML gradient: beta * sign(e) * |e|^(beta-1) / alpha^beta
    Devfunc2(streams, n_frames, cur_layer_units, realerror, scalefactor, newobj, shapefactor);
    // Average over the minibatch
    DevVecMulNum(streams, cur_layer_units * n_frames, newobj, 1.0f / n_frames, cur_layer_dedx);
}
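For readability, the sequence of device calls above corresponds to the following NumPy sketch. Variable names mirror the CUDA code; the elementwise formulas for "DevSubClean2" and "Devfunc2" are inferred from the GGD formulation in the paper, so treat this as an illustration rather than the implementation.

```python
import numpy as np

def output_gradient(out, targ, shapefactor, MLflag):
    """Gradient of the objective w.r.t. the DNN output for one minibatch (sketch)."""
    n_frames = out.shape[0]
    e = out - targ
    # DevSubClean2: beta-norm gradient beta * sign(e) * |e|^(beta-1)
    dedx = shapefactor * np.sign(e) * np.abs(e) ** (shapefactor - 1)
    dedx /= n_frames                                   # DevVecMulNum
    if MLflag == 1:
        errorabsolute = np.abs(e)                      # Devabsolutevalus
        errorabsolute2 = errorabsolute ** shapefactor  # Devindex2
        vec1 = errorabsolute2.sum(axis=0) / n_frames   # DevSumcol + DevDivide
        vec2 = shapefactor * vec1                      # DevVecMulNum
        scalefactor = vec2 ** (1.0 / shapefactor)      # Devindex2: closed-form alpha
        # Devfunc2: beta * sign(e) * |e|^(beta-1) / alpha^beta
        newobj = (shapefactor * np.sign(e) * errorabsolute ** (shapefactor - 1)
                  / scalefactor ** shapefactor)
        dedx = newobj / n_frames                       # DevVecMulNum
    return dedx
```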
- When MLflag≠1, the β-norm function is selected as the objective function, where β=2 corresponds to the MMSE criterion and β=1 corresponds to the LAD criterion. The backpropagation procedure with an SGD method is then used to update the DNN parameters W in the minibatch mode of M sample frames (in this paper, M=128). The function "DevSubClean2" computes the gradient of this objective with respect to the output.
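Consistent with the gradient computed by "DevSubClean2", the β-norm objective over a minibatch of M frames with D output units, and its gradient with respect to the enhanced output x̂, can be sketched as (a reconstruction under the paper's notation, not quoted from it):

```latex
E_{\beta} = \frac{1}{M}\sum_{m=1}^{M}\sum_{d=1}^{D}
\left|\hat{x}_{d}^{m}-x_{d}^{m}\right|^{\beta},
\qquad
\frac{\partial E_{\beta}}{\partial \hat{x}_{d}^{m}}
= \frac{\beta}{M}\,\mathrm{sgn}\!\left(\hat{x}_{d}^{m}-x_{d}^{m}\right)
  \left|\hat{x}_{d}^{m}-x_{d}^{m}\right|^{\beta-1}
```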
- When MLflag=1, the GGD error model based log-likelihood function is selected as the objective function. We adopt the maximum likelihood criterion to optimize both the DNN parameters W and the GGD scale parameters α. The paper proposes two optimization algorithms; here we provide only the one adopted in all the ML-GGD-DNN experiments, namely the alternating two-step optimization algorithm. Maximizing the log-likelihood is equivalent to minimizing the corresponding negative log-likelihood error function, and W and α are alternately optimized in each minibatch (M=128). First, a closed-form solution for α (the "scalefactor" in the code) is derived by fixing W and minimizing the error over the M sample frames of the minibatch. Second, W is optimized by the backpropagation procedure with the SGD method while α is fixed. The function "Devfunc2" computes the gradient of this objective with respect to the output.
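Consistent with the "scalefactor" and "Devfunc2" computations in the code, the minimized error function, the closed-form solution for α, and the output gradient can be sketched as follows (a reconstruction; constants independent of W and α are dropped):

```latex
E_{\mathrm{ML}} = \sum_{d=1}^{D}\left[\frac{1}{M}\sum_{m=1}^{M}
\left(\frac{\left|\hat{x}_{d}^{m}-x_{d}^{m}\right|}{\alpha_{d}}\right)^{\beta}
+\log\alpha_{d}\right],
\qquad
\alpha_{d} = \left(\frac{\beta}{M}\sum_{m=1}^{M}
\left|\hat{x}_{d}^{m}-x_{d}^{m}\right|^{\beta}\right)^{1/\beta},
\qquad
\frac{\partial E_{\mathrm{ML}}}{\partial \hat{x}_{d}^{m}}
= \frac{\beta}{M\,\alpha_{d}^{\beta}}\,
\mathrm{sgn}\!\left(\hat{x}_{d}^{m}-x_{d}^{m}\right)
\left|\hat{x}_{d}^{m}-x_{d}^{m}\right|^{\beta-1}
```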
Select one well-trained model and change its suffix from 'wts' to 'mat'. Then execute the following command:
matlab -nodesktop -nosplash -r decode
Section 1: Related waveforms referred to in the submitted paper
[Spectrogram grid: four example sets, each row showing (a) Clean, (b) Noisy, (c) MMSE, and (d) ML.]
Fig. 9. The spectrograms of utterances corrupted by N3 (Destroyer Operations), N5 (Factory1), N10 (Machine Gun), and N13 (Speech Babble) at 5 dB. Each row corresponds to one example set with the clean speech, noisy speech, MMSE-DNN and ML-GGD-DNN (β=0.9) enhanced speech.
Section 2: More enhanced speech demos
Selected results on the remaining unseen noise types:
| Noise, SNR | Clean | Noisy | MMSE | ML (β=0.9) |
| --- | --- | --- | --- | --- |
| JetCockpit2, 5 dB | | | | |
| Destroyer Engine, 0 dB | | | | |
| F-16 Cockpit, 10 dB | | | | |
| Factory2, -5 dB | | | | |
| HF Channel, 15 dB | | | | |
| Military Vehicle, 0 dB | | | | |
| M109 Tank, -5 dB | | | | |
| Pink, -5 dB | | | | |
| Volvo, -5 dB | | | | |
| White, 5 dB | | | | |