DNN-based speech enhancement optimized by a maximum likelihood criterion rather than the conventional MMSE criterion

This repository contains the code and demos for the paper "Using Generalized Gaussian Distributions to Improve Regression Error Modeling for Deep-Learning-Based Speech Enhancement" (submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing).

Step 1: Prepare the input and output files.

Step 1.1: Extract log-power spectrum (LPS) features

cd Feature_prepare

1.cd SourceCode_Wav2LogSpec_be

Execute make to generate the executable file "Wav2LPS_be"

Note: Set the sampling frequency, frame length, and frame shift in the source code "Wav2LogSpec_be.c" according to your own needs. In our paper, we set them to 16 kHz, 32 ms, and 16 ms, respectively.

2.Extract LPS features with the command below:

matlab -nodesktop -nosplash -r LPS_extract
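For reference, the LPS computation can be sketched outside MATLAB as follows (a minimal NumPy sketch using the paper's 16 kHz / 32 ms / 16 ms setup; the analysis window is an assumption, as the C source may use a different one):

```python
import numpy as np

def extract_lps(wave, sr=16000, frame_ms=32, shift_ms=16, eps=1e-12):
    """Log-power spectrum (LPS) features: frame, window, FFT, log|X|^2."""
    frame_len = sr * frame_ms // 1000          # 512 samples at 16 kHz / 32 ms
    shift = sr * shift_ms // 1000              # 256 samples at 16 ms shift
    n_frames = 1 + (len(wave) - frame_len) // shift
    frames = np.stack([wave[i * shift: i * shift + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)    # assumed analysis window
    spec = np.fft.rfft(frames, n=frame_len, axis=1)  # 512-point FFT -> 257 bins
    return np.log(np.abs(spec) ** 2 + eps)     # shape (n_frames, 257)

feats = extract_lps(np.random.randn(16000))    # one second of noise as a stand-in
print(feats.shape)
```

The 257 frequency bins match the feature dimension passed to "GetLenForFeaScp.pl" in Step 1.2.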

Step 1.2: Package the features into a Pfile

cd tools_pfile

1.Calculate the number of frames per sentence with the command below:

perl GetLenForFeaScp.pl train_noisy.scp frame_numbers.len 257 1

2.Use the quicknet toolset to prepare Pfiles as the input and output files with the commands below:

perl pfile_noisy.pl

perl pfile_clean.pl

A Pfile is a single large file that aggregates all training features.

3.Calculate the mean and standard deviation with the command below for normalization by Z-score:

perl get_norm.pl

Note that this command outputs the reciprocal of the standard deviation, not the standard deviation itself.
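Storing the reciprocal means the normalization is applied by multiplication rather than division (an illustrative NumPy sketch; the on-disk format written by "get_norm.pl" is not shown here):

```python
import numpy as np

rng = np.random.default_rng(0)
feats = rng.normal(5.0, 3.0, size=(1000, 257))   # stand-in for training LPS features

mean = feats.mean(axis=0)
inv_std = 1.0 / feats.std(axis=0)                # the reciprocal, as get_norm.pl outputs
normalized = (feats - mean) * inv_std            # Z-score via multiplication

print(normalized.mean(), normalized.std())
```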

Step 2: Training

Installation

1.CUDA

2.g++

cd Train_code_ML_GGD

This code is for regression feed-forward DNN training, where the optimization criterion can be either the conventional MMSE or our proposed objective function with a GGD error model derived according to the ML criterion.

1.Execute make to generate the executable file "BPtrain_Sigmoid"

2.You can train the feed-forward DNNs in the paper by calling "BPtrain_Sigmoid" with the command below:

perl finetune.pl

Note: The parameters "MLflag" and "shapefactor" in "finetune.pl" control the choice of the objective function.

  • When MLflag≠1, the classic β-norm function is selected as the objective function, where β=1 corresponds to the L1-norm, namely the least absolute deviation (LAD), and β=2 corresponds to the L2-norm, namely the MMSE.
  • When MLflag=1, the GGD error model based log-likelihood function is selected as the objective function, where "shapefactor" refers to the shape factor β in GGD.
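The two objective choices can be written out numerically as follows (a minimal NumPy sketch; function and variable names are illustrative, and the GGD scale α is passed in explicitly rather than estimated):

```python
import numpy as np
from math import lgamma

def beta_norm_loss(out, target, beta):
    """MLflag != 1: beta-norm objective; beta=2 is MMSE, beta=1 is LAD."""
    return np.mean(np.sum(np.abs(out - target) ** beta, axis=1))

def ggd_nll(out, target, beta, alpha):
    """MLflag == 1: negative log-likelihood of the regression errors under a
    zero-mean GGD with per-dimension scale alpha and shape factor beta."""
    e = np.abs(out - target)
    n_frames = e.shape[0]
    # per-frame, per-dimension normalizer: log(2 * alpha * Gamma(1/beta) / beta)
    log_norm = np.log(2.0 * alpha) + lgamma(1.0 / beta) - np.log(beta)
    return np.sum(n_frames * log_norm + np.sum((e / alpha) ** beta, axis=0))

out = np.array([[1.0, 2.0]])
target = np.zeros((1, 2))
print(beta_norm_loss(out, target, 2.0))   # L2 case: 1 + 4 = 5
```

With β=2 and α=√2·σ the GGD reduces to a Gaussian, so in that case the two objectives differ only by constants.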

Implementation details

In this paper, we propose a new objective function. The code for our proposed ML-GGD-DNN can be obtained through minor modifications to the MMSE-DNN code: we only need to change the gradient of the objective function with respect to the output in the backpropagation part.

The following code is from lines 408 to 423 of "BP_GPU.cu" and calculates the gradient of the objective function with respect to the output. The called functions are defined in "DevFunc.cu".

DevSubClean2(streams, n_frames, cur_layer_units, shapefactor, dev.out, targ, cur_layer_dedx);    // beta-norm gradient (MLflag != 1)
DevVecMulNum(streams, cur_layer_units * n_frames, cur_layer_dedx, 1.0f / n_frames, cur_layer_dedx);  // average over minibatch
if (MLflag == 1)
{
    Deverror(streams, n_frames, cur_layer_units, dev.out, targ, realerror);                      // e = out - target
    Devabsolutevalus(streams, cur_layer_units * n_frames, realerror, errorabsolute);             // |e|
    Devindex2(streams, n_frames * cur_layer_units, errorabsolute, shapefactor, errorabsolute2);  // |e|^beta
    DevSumcol(streams, n_frames, cur_layer_units, errorabsolute2, vec1);                         // sum over frames
    DevDivide(streams, cur_layer_units, vec1, vec1, n_frames);                                   // mean over frames
    DevVecMulNum(streams, cur_layer_units, vec1, shapefactor, vec2);                             // beta * mean
    float ppp = 1.0f / shapefactor;
    Devindex2(streams, cur_layer_units, vec2, ppp, scalefactor);                                 // alpha = (beta * mean)^(1/beta)
    Devfunc2(streams, n_frames, cur_layer_units, realerror, scalefactor, newobj, shapefactor);   // dE/dx from e and alpha
    DevVecMulNum(streams, cur_layer_units * n_frames, newobj, 1.0f / n_frames, cur_layer_dedx);  // average over minibatch
}
  • When MLflag≠1, the β-norm function is selected as the objective function as follows:

    $E_{\beta}=\frac{1}{M}\sum_{m=1}^{M}\sum_{d=1}^{D}\left|\hat{x}_{d}^{m}-x_{d}^{m}\right|^{\beta}$

    where $\hat{x}_{d}^{m}$ and $x_{d}^{m}$ denote the d-th component of the DNN output and of the clean reference for the m-th frame, and D is the feature dimension. β=2 corresponds to the MMSE criterion and β=1 corresponds to the LAD criterion.
    Then the backpropagation procedure with an SGD method is used to update the DNN parameters W in the minibatch mode of M sample frames (in this paper, M=128).

    The function "DevSubClean2" achieves the calculation of the gradient of $E_{\beta}$ with respect to the output $\hat{x}_{d}^{m}$ as follows:

    $\frac{\partial E_{\beta}}{\partial\hat{x}_{d}^{m}}=\frac{\beta}{M}\left|\hat{x}_{d}^{m}-x_{d}^{m}\right|^{\beta-1}\operatorname{sign}\left(\hat{x}_{d}^{m}-x_{d}^{m}\right)$

  • When MLflag=1, the GGD error model based log-likelihood function is selected as the objective function. Writing the regression error as $e_{d}^{m}=\hat{x}_{d}^{m}-x_{d}^{m}$ and modeling it with a zero-mean GGD with scale $\alpha_{d}$ and shape factor β, the log-likelihood is:

    $\mathcal{L}(\mathbf{W},\boldsymbol{\alpha})=\sum_{m=1}^{M}\sum_{d=1}^{D}\log\left[\frac{\beta}{2\alpha_{d}\Gamma(1/\beta)}\exp\left(-\left(\frac{\left|e_{d}^{m}\right|}{\alpha_{d}}\right)^{\beta}\right)\right]$

    We adopt the maximum likelihood criterion to optimize both the DNN parameters W and the GGD scale parameters α. In this paper, two optimization algorithms are proposed. Here, we only provide the one adopted in all the ML-GGD-DNN experiments in our paper, namely the alternating two-step optimization algorithm.

    Maximizing the log-likelihood $\mathcal{L}(\mathbf{W},\boldsymbol{\alpha})$ is equivalent to minimizing the following error function (constant terms dropped):

    $E(\mathbf{W},\boldsymbol{\alpha})=\sum_{d=1}^{D}\left(M\log\alpha_{d}+\sum_{m=1}^{M}\left(\frac{\left|e_{d}^{m}\right|}{\alpha_{d}}\right)^{\beta}\right)$

    Then W and α are alternately optimized in each minibatch (M=128).
    First, a closed-form solution for α, referred to as "scalefactor" in the code, is derived by fixing W and minimizing $E(\mathbf{W},\boldsymbol{\alpha})$ over the minibatch of M sample frames:

    $\alpha_{d}=\left(\frac{\beta}{M}\sum_{m=1}^{M}\left|e_{d}^{m}\right|^{\beta}\right)^{1/\beta}$

    Second, W is optimized by the backpropagation procedure with the SGD method while fixing α. The function "Devfunc2" achieves the calculation of the gradient of $E(\mathbf{W},\boldsymbol{\alpha})$ with respect to the output $\hat{x}_{d}^{m}$ as follows:

    $\frac{\partial E}{\partial\hat{x}_{d}^{m}}=\frac{\beta}{\alpha_{d}^{\beta}}\left|e_{d}^{m}\right|^{\beta-1}\operatorname{sign}\left(e_{d}^{m}\right)$
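The CUDA sequence above can be cross-checked with a small NumPy sketch of one alternating step (illustrative; it mirrors the order of the Dev* calls, not their exact signatures):

```python
import numpy as np

def ml_ggd_step(out, target, beta):
    """One alternating two-step update: closed-form alpha (the 'scalefactor'),
    then the gradient of the GGD objective w.r.t. the output, averaged over
    the minibatch. For beta < 1, exact-zero errors would need an eps guard."""
    n_frames = out.shape[0]                                  # minibatch size (M=128 in the paper)
    err = out - target                                       # Deverror
    abs_err = np.abs(err)                                    # Devabsolutevalus
    mean_pow = np.sum(abs_err ** beta, axis=0) / n_frames    # DevSumcol + DevDivide
    alpha = (beta * mean_pow) ** (1.0 / beta)                # closed-form scale per dimension
    # Devfunc2: dE/dx = (beta / alpha^beta) * |e|^(beta-1) * sign(e)
    grad = beta * abs_err ** (beta - 1.0) * np.sign(err) / alpha ** beta
    return grad / n_frames, alpha                            # final DevVecMulNum(1/n_frames)

out = np.array([[2.0, 0.0], [0.0, 2.0]])
grad, alpha = ml_ggd_step(out, np.zeros((2, 2)), beta=2.0)
print(alpha, grad)
```

For β=2 this reduces to a rescaled MMSE gradient, which is a quick sanity check when porting the CUDA kernels.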

Step 3: Testing

cd Test_code

Select a well-trained model and change its suffix from 'wts' to 'mat'. Then execute the following command:

matlab -nodesktop -nosplash -r decode
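The decoding forward pass can be sketched as follows (an illustrative NumPy sketch assuming sigmoid hidden layers and a linear output layer, with input and output features sharing the Z-score statistics from Step 1; the actual decode script operates on the converted .mat weights):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def enhance_lps(noisy_lps, weights, biases, mean, inv_std):
    """Forward normalized noisy LPS through a trained feed-forward DNN and
    de-normalize the estimate back to the LPS domain."""
    h = (noisy_lps - mean) * inv_std              # Z-score using mean / reciprocal std
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(h @ W + b)                    # sigmoid hidden layers
    est = h @ weights[-1] + biases[-1]            # linear output layer
    return est / inv_std + mean                   # undo the Z-score

rng = np.random.default_rng(0)
dims = [257, 2048, 2048, 257]                     # example topology, not from the paper
weights = [rng.normal(0, 0.01, (a, b)) for a, b in zip(dims[:-1], dims[1:])]
biases = [np.zeros(b) for b in dims[1:]]
enhanced = enhance_lps(rng.normal(size=257), weights, biases,
                       mean=np.zeros(257), inv_std=np.ones(257))
print(enhanced.shape)
```

The enhanced LPS is then typically combined with the noisy phase and an inverse STFT with overlap-add to synthesize the output waveform.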

Demos:

cd Enh_demos

Section 1: Related waveforms referred to in the submitted paper

[Spectrogram figures: four example sets, each showing (a) Clean, (b) Noisy, (c) MMSE, (d) ML]

Fig. 9. The spectrograms of utterances corrupted by N3 (Destroyer Operations), N5 (Factory1), N10 (Machine Gun), and N13 (Speech Babble) at 5 dB. Each row corresponds to one example set with the clean speech, noisy speech, MMSE-DNN and ML-GGD-DNN (β=0.9) enhanced speech.

Section 2: More enhanced speech demos

Selected results on the remaining unseen noise types:

 
Each demo set below provides four audio versions: Clean, Noisy, MMSE, and ML (β=0.9).

  • JetCockpit2, SNR 5 dB
  • Destroyer Engine, SNR 0 dB
  • F-16 Cockpit, SNR 10 dB
  • Factory2, SNR −5 dB
  • HF Channel, SNR 15 dB
  • Military Vehicle, SNR 0 dB
  • M109 Tank, SNR −5 dB
  • Pink, SNR −5 dB
  • Volvo, SNR −5 dB
  • White, SNR 5 dB