audio-super-resolution-tf2.4.1

This is an UNOFFICIAL implementation of the audio super-resolution model proposed in H. Wang and D. Wang, "Towards Robust Speech Super-Resolution" (IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021).

The code is based heavily on https://github.com/kuleshov/audio-super-res

Environment

!pip install https://github.com/schmiph2/pysepm/archive/master.zip
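
The repo targets TensorFlow 2.4.1 (as the name suggests), and the preprocessing relies on librosa, SciPy, h5py, and NumPy. The lines below are an assumed way to reproduce that environment, not an official requirements list:

!pip install tensorflow==2.4.1
!pip install librosa scipy h5py numpy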

Network

Parameters: 10,281,363. Kernel size = 11.

D-Block:  (None, None, 64)
D-Block:  (None, None, 64)
D-Block:  (None, None, 64)
D-Block:  (None, None, 128)
D-Block:  (None, None, 128)
D-Block:  (None, None, 128)
D-Block:  (None, None, 256)
D-Block:  (None, None, 256)
B-Block:  (None, None, 256)
U-Block:  (None, None, 512)
U-Block:  (None, None, 512)
U-Block:  (None, None, 256)
U-Block:  (None, None, 256)
U-Block:  (None, None, 256)
U-Block:  (None, None, 128)
U-Block:  (None, None, 128)
U-Block:  (None, None, 128)
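
For orientation, the sketch below shows what one downsampling (D) block looks like under the settings listed above and in the change notes further down (kernel size 11, stride-2 convolution, PReLU shared over time, dropout on every third block). The function name and exact layer arguments are illustrative assumptions, not the repo's actual code.

import tensorflow as tf
from tensorflow.keras.layers import Conv1D, PReLU, Dropout

def d_block(x, n_filters, block_index, kernel_size=11):
    # Illustrative D-block: Conv -> PReLU -> (Dropout on every 3rd block)
    x = Conv1D(n_filters, kernel_size, strides=2, padding='same')(x)
    x = PReLU(shared_axes=[1])(x)   # activation parameters shared along the time axis
    if block_index % 3 == 2:        # dropout placement used in this repo (see change (6))
        x = Dropout(rate=0.2)(x)
    return x

# Stacking the eight D-blocks listed above (encoder path only):
inputs = tf.keras.Input(shape=(None, 1))
x = inputs
for l, nf in enumerate([64, 64, 64, 128, 128, 128, 256, 256]):
    x = d_block(x, nf, l)
encoder = tf.keras.Model(inputs, x)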

Dataset

Balanced ("Mixed") corpus: 100% of VCTKS, 10% of TIMIT and IEEE, and 2% of VCTKM, WSJ, and LIBRI.

Hyperparameter

Dropout rate = 0.2. Optimizer: Adam with an initial learning rate of 0.0003, which is halved if the validation loss has not improved for 3 consecutive epochs. Training stops early if the validation loss has not improved for 6 consecutive epochs.
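
A minimal sketch of this schedule using standard Keras callbacks (the repo itself implements the same behaviour with its own deque-based monitoring, see change (8) below, so this is only an equivalent illustration):

import tensorflow as tf

callbacks = [
    # halve the learning rate if val_loss has not improved for 3 consecutive epochs
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3),
    # stop training if val_loss has not improved for 6 consecutive epochs
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=6),
]
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0003)
# model.compile(optimizer=optimizer, loss=...); model.fit(..., callbacks=callbacks)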

Results

Qualitative examples from several datasets are provided at https://tan90xx.github.io/SR-display.github.io/

T1. EXPERIMENTAL RESULTS FOR CROSS-CORPUS SR USING THE FOUR BASELINES AND THE PROPOSED MODEL

                         TIMIT              WSJ                LIBRI              IEEE
Model/Training set       SNR  LSD  PESQ     SNR  LSD  PESQ     SNR  LSD  PESQ     SNR  LSD  PESQ
Spline 18.27 2.07 3.51 9.59 2.27 2.83 19.73 2.22 3.43 21.12 2.14 3.94
DNN-BWE/TIMIT 17.38 1.64 1.82 7.69 1.41 1.37 19.20 1.27 1.93 20.46 1.57 1.81
DNN-BWE/WSJ 17.40 1.56 1.91 7.70 1.31 1.42 19.14 1.23 2.00 20.39 1.49 1.91
DNN-BWE/LIBRI 17.52 1.57 1.89 7.75 1.37 1.43 19.37 1.21 2.03 20.68 1.53 1.86
DNN-BWE/IEEE 11.55 2.52 1.20 -1.12 2.40 1.06 15.39 1.96 1.31 15.92 2.31 1.23
DNN-Cepstral/TIMIT 17.23 1.02 2.43 11.18 0.81 1.79 18.97 0.85 2.68 19.84 1.02 2.59
DNN-Cepstral/WSJ 17.23 1.02 2.42 11.18 0.82 1.78 18.97 0.85 2.69 19.84 1.02 2.57
DNN-Cepstral/LIBRI 16.31 1.54 1.68 9.78 1.19 1.32 18.83 1.13 2.39 19.63 1.37 1.91
DNN-Cepstral/IEEE 17.22 1.06 2.34 11.11 0.84 1.72 18.96 0.87 2.64 19.83 1.03 2.56
AudioUNet/TIMIT 18.41 1.69 3.13 10.19 2.11 3.16 20.49 1.60 3.17 22.16 1.73 3.55
AudioUNet/WSJ 18.35 1.92 3.40 9.90 2.23 2.72 20.49 1.76 3.47 22.19 1.94 3.74
AudioUNet/LIBRI 18.44 2.01 3.62 10.19 2.27 2.96 20.82 2.01 3.85 22.62 2.20 4.01
AudioUNet/IEEE 18.37 1.80 3.28 10.01 2.23 3.00 20.32 1.73 3.36 21.92 1.84 3.67
TFNet/TIMIT 17.08 1.18 2.91 10.18 1.29 2.38 14.86 1.30 2.35 21.04 1.33 2.71
TFNet/WSJ 15.17 1.18 2.40 9.27 1.26 2.27 15.36 1.28 2.22 20.31 1.45 2.09
TFNet/LIBRI 15.82 1.17 2.33 10.97 1.20 2.22 16.84 1.26 2.76 22.97 1.24 3.04
TFNet/IEEE 13.48 1.39 1.82 9.39 1.36 1.80 14.89 1.36 2.08 21.37 1.33 2.46
Proposed/TIMIT 18.58 1.64 3.82 10.68 1.91 3.58 21.33 1.58 3.76 23.27 1.70 4.22
Proposed/WSJ 18.48 1.50 3.02 10.39 1.68 2.44 21.00 1.41 3.36 17.44 1.53 2.69
Proposed/LIBRI 18.55 1.87 3.41 10.57 2.09 2.59 21.31 1.83 3.79 23.22 1.98 3.82
Proposed/IEEE 18.29 1.50 2.54 -5.97 1.70 1.11 17.11 1.32 2.27 1.47 1.46 1.27
Proposed/Mixed 18.48 1.74 3.21 5.64 1.94 1.41 21.29 1.69 3.63 23.09 1.84 3.68

T2. COMPARISON OF VARIOUS LOSS FUNCTIONS ON THE TIMIT DATASET

LOSS SNR LSD PESQ
MAE 18.48 1.70 3.33
MSE 18.48 1.49 2.78
F 18.50 1.61 3.11
RI 18.58 1.73 3.53
TF 18.44 1.69 3.00
RI-MAG 18.49 1.73 3.20
PCM 18.41 1.48 2.77
T-PCM 18.58 1.64 3.82

T3. MODEL TRAINED ON ORIGINAL TIMIT UTTERANCES, TESTED ON DATA CONVOLVED WITH DIFFERENT MIRs

Model SNR LSD PESQ
Spline 18.27 2.07 3.51
Original 18.58 1.64 3.82
Test on MIR1 11.70 1.66 3.50
Test on MIR2 14.53 1.51 3.71
Average of 20 MIRs 15.33 1.58 3.70

T4. EXPERIMENTAL RESULTS FOR SR MODELS EVALUATED ON VCTK WITH DOWNSAMPLING FACTORS OF 2 AND 4

                     VCTKS               VCTKM
Model          R     SNR  LSD  PESQ      SNR  LSD  PESQ
Spline 2.00 19.42 2.13 3.06 22.40 1.96 3.92
DNN-BWE 2.00 19.12 1.49 2.07 17.48 2.23 1.79
DNN-Cepstral 2.00 18.59 0.89 3.08 19.62 1.45 2.69
AudioUNet 2.00 20.21 1.52 2.79 22.54 1.77 3.85
TFNet 2.00 22.00 1.50 2.41 -2.22 2.59 1.05
Proposed 2.00 14.98 1.44 2.77 23.01 1.73 4.06
Spline 4.00 15.28 3.01 3.16 19.28 2.64 3.36
DNN-BWE 4.00 15.01 1.72 1.72 18.61 2.22 1.66
DNN-Cepstral 4.00 15.16 1.36 2.62 18.28 1.66 2.30
AudioUNet 4.00 15.50 2.08 2.30 19.47 2.26 2.56
TFNet 4.00 13.41 1.74 1.87 15.20 1.61 2.15
Proposed 4.00 15.43 2.12 2.62 20.30 2.12 3.41

Changelog

The code is adapted from https://github.com/kuleshov/audio-super-res; the main changes are listed below:

Setup

  • (1) Set the random seeds and generate train/valid/test txt lists (containing the audio file paths) for each corpus, so that the experiments can be repeated as exactly as possible.
  • (2) Apply MVN and a silence filter during dataset preparation: the author observes that MVN improves cross-corpus generalization, and the silence filter stabilizes training and speeds up convergence.
  • (3) Extend the down-sampling schemes in dataset preparation, which previously used only the SciPy decimate function. In summary, the preprocessing pipeline looks like this (a code sketch of the full pipeline follows this list):
-->MVN (mean and variance normalization)
-->Silence filter (discard samples below an energy threshold of 0.05)
  >>>librosa.effects.trim(x, top_db=-20*np.log10(0.05/1.0))
-->Four down-sampling schemes to choose from: (1) low-pass filter (Chebyshev Type I IIR filter of order 8) --> subsampling (discarding samples at fixed intervals), (2) decimate, (3) resample, (4) mix
-->Upsampling (cubic spline interpolation)
-->Padding --> generate patches (frames of 2048 samples with an overlap of 1024 samples)
  • (4) Generalize prep_vctk.py to prep_dataset.py so it works with other corpora, and record the arguments in dataset.json for batch generation of h5 files.
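
The sketch below strings the steps above together for a single waveform (MVN, silence trimming, decimation, cubic-spline upsampling, patch extraction). Function and variable names are mine; padding and the alternative down-sampling schemes are omitted for brevity.

import numpy as np
import librosa
from scipy import signal, interpolate

def preprocess(x, r=2, dim=2048, stride=1024):
    # MVN (mean and variance normalization)
    x = (x - x.mean()) / (x.std() + 1e-8)
    # silence filter: trim leading/trailing samples below an energy threshold of 0.05
    x, _ = librosa.effects.trim(x, top_db=-20 * np.log10(0.05 / 1.0))
    # down-sample by r with scipy's decimate (Chebyshev Type I IIR low-pass of order 8)
    x_lr = signal.decimate(x, r)
    # upsample back to the original rate with cubic spline interpolation
    t_lr = np.arange(len(x_lr)) * r
    spline = interpolate.splrep(t_lr, x_lr, k=3)
    x_up = interpolate.splev(np.arange(len(x_lr) * r), spline)
    # cut into patches of `dim` samples with a hop of `stride` samples
    patches = [x_up[i:i + dim] for i in range(0, len(x_up) - dim + 1, stride)]
    return np.stack(patches) if patches else np.empty((0, dim))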

Slight changes in the prep_dataset.py script, which now works as follows.

optional arguments: 
-h, --help                  
--corpus
--state                              
--scale                        
--dimension        
--stride                         
--interpolate                 
--low-pass                    
--batch-size           
--sr               
--sam             	

example:
python prep_dataset.py \
  --corpus vctk-speaker1 \
  --state train \
  --scale 2 \
  --dimension 2048 \
  --stride 1024 \
  --interpolate \
  --low-pass decimating \
  --batch-size 32 \
  --sr 16000 \
  --sam 0.25
deleted original arguments (these are now auto-generated from the args above; see the small helper sketch below):
--file-list
--in-dir
--out

For the example above, the auto-generated values are:
--file-list ./Corpus/vctk-speaker1-train-files.txt
--in-dir ./Corpus/
--out vctk-speaker1-decimating-train.2.2048.1024.h5
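
A small illustration (hypothetical helper, not the repo's exact code) of how those values can be derived from the prep_dataset.py arguments:

def auto_paths(corpus, low_pass, state, scale, dimension, stride):
    # Reproduce the auto-generated paths shown above from the CLI args.
    file_list = f'./Corpus/{corpus}-{state}-files.txt'
    in_dir = './Corpus/'
    out_h5 = f'{corpus}-{low_pass}-{state}.{scale}.{dimension}.{stride}.h5'
    return file_list, in_dir, out_h5

# auto_paths('vctk-speaker1', 'decimating', 'train', 2, 2048, 1024)
# -> ('./Corpus/vctk-speaker1-train-files.txt', './Corpus/',
#     'vctk-speaker1-decimating-train.2.2048.1024.h5')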

Running the model

  • (5) Make the code run under TensorFlow 2.4.1 by updating renamed or removed TF/Keras APIs:
# Errors fixed (old --> new):
# Conv1D: init --> kernel_initializer, subsample_length --> strides
# Convolution1D --> Conv1D
# stdev --> stddev
# merge --> add
# Warnings fixed:
# initialize_all_variables --> global_variables_initializer
  • (6) Build the structure of the proposed model, adapted from audiounet, exactly following the sequence Conv --> ReLU --> Dropout:
# set:
n_filters = [64, 64, 64, 128, 128, 128, 256, 256]
n_filtersizes = [11, 11, 11, 11, 11, 11, 11, 11, 11]
x = LeakyReLU(0.2)(x)-->x = PReLU(shared_axes=[1])(x)
# add dropout in U and D blocks:
if l%3 == 2:
  x = Dropout(rate=0.2)(x)
(Architecture figure: see p. 2060 of the paper.)
  • (7) Implement the custom loss functions (e.g., T-PCM). OLA step: "Framed segments are first divided into frames of 512 samples with a frame shift of 256 samples. Then we multiply these frames with a Hamming window." A fuller, self-contained sketch of the spectral part of the loss follows this list.
# OLA output: [batch, length]
FRAME, SHIFT = 512, 256   # frame length and shift from the quote above
x_ola = tf.signal.overlap_and_add(X, 1024)
x_ola = tf.cast(x_ola, tf.float32)
X_spec = tf.signal.stft(signals=x_ola, frame_length=FRAME, frame_step=SHIFT, fft_length=FRAME,
                        window_fn=tf.signal.hamming_window)
(Figures in the repo: Fig. 1 "Flat" framing vs. Fig. 2 "OLA" framing of the patches.)
  • (8) Use a deque to monitor val_loss and implement early stopping, learning-rate halving, etc.
  • (9) Write the logdir and CSV records to the same path for quick inspection.
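
Building on change (7), here is a minimal self-contained sketch of a spectral loss term based on the OLA + STFT steps above. The magnitude MSE at the end is a placeholder: the actual T-PCM loss also contains time-domain and phase-constrained terms that are not restated here, and the input shape is an assumption taken from the "[batch, length]" comment in the snippet.

import tensorflow as tf

FRAME, SHIFT = 512, 256   # loss-side framing, from the quote in change (7)
PATCH_SHIFT = 1024        # network patches: 2048 samples with 1024-sample overlap

def _to_spectrogram(x):
    # x assumed to be [batch, n_patches, 2048] patches of one utterance
    x_ola = tf.signal.overlap_and_add(x, PATCH_SHIFT)   # -> [batch, length]
    x_ola = tf.cast(x_ola, tf.float32)
    return tf.signal.stft(signals=x_ola, frame_length=FRAME, frame_step=SHIFT,
                          fft_length=FRAME, window_fn=tf.signal.hamming_window)

def spectral_loss(y_true, y_pred):
    # Placeholder magnitude-MSE on the OLA'd spectrograms.
    S_true = _to_spectrogram(y_true)
    S_pred = _to_spectrogram(y_pred)
    return tf.reduce_mean(tf.square(tf.abs(S_true) - tf.abs(S_pred)))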

Slight changes in the run.py training script, which now works as follows.

optional arguments:
-h, --help   
--model
--loss_func 
--train          
--val             
-e --epochs 
--logname                                                               
--r                                  
--pool_size
--strides             
--full               	

example:
python run.py train \
  --model proposed \
  --loss_func T_PCM \
  --train ../data/vctk/vctk-speaker1-train.2.2048.1024.h5 \
  --val ../data/vctk/vctk-speaker1-val.2.2048.1024.h5 \
  -e 100 \
  --logname tmp-run \
  --r 2 \
  --pool_size 2 \
  --strides 2 \
  --full true
deleted original arguments (the hyperparameters among them are now set in code; see below):
--alg        
--batch-size  
--layers      
--lr      
--piano   default = false           
--grocery  default = false         
--speaker  default = single	

set in code:
if args.model == 'proposed':
  opt_params = {'loss_func':args.loss_func, 'alg': 'adam', 'lr': 0.0003, 
                'b1': 0.9, 'b2': 0.999, 'batch_size': 32, 'layers': 8}
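
For reference, an illustrative way to turn opt_params into the corresponding Keras Adam optimizer (not the repo's exact code):

import tensorflow as tf

opt_params = {'loss_func': 'T_PCM', 'alg': 'adam', 'lr': 0.0003,
              'b1': 0.9, 'b2': 0.999, 'batch_size': 32, 'layers': 8}
optimizer = tf.keras.optimizers.Adam(learning_rate=opt_params['lr'],
                                     beta_1=opt_params['b1'],
                                     beta_2=opt_params['b2'])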

Testing the model

  • (10) Write the eval output files to the same path as the logdir and CSV records.
  • (11) Automatically calculate the evaluation metrics (SNR, LSD, PESQ); a sketch of the per-file computation follows this list.
  • (12) Define visualization functions to display spectrograms and the training process.
  • (13)Build a web page [code].
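
For change (11), the sketch below shows one reasonable way to compute SNR and LSD per file with NumPy; PESQ is assumed to come from pysepm (installed in the Environment section), whose exact function name should be checked against its README. These helpers are illustrative, not the repo's code.

import numpy as np

def snr(ref, est):
    # Signal-to-noise ratio in dB between the reference and the estimate.
    return 10 * np.log10(np.sum(ref ** 2) / np.sum((ref - est) ** 2))

def lsd(ref, est, n_fft=2048, hop=512):
    # Log-spectral distance: RMS over frequency of the log-power difference,
    # averaged over frames (a common definition; check the repo for the exact one).
    def log_power(x):
        frames = np.stack([x[i:i + n_fft] for i in range(0, len(x) - n_fft + 1, hop)])
        spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2
        return np.log10(spec + 1e-10)
    a, b = log_power(ref), log_power(est)
    n = min(len(a), len(b))
    return np.mean(np.sqrt(np.mean((a[:n] - b[:n]) ** 2, axis=1)))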

Slight changes in the run.py eval script, which now works as follows.

optional arguments:
-h, --help              
--logname 
--out-label 
--wav-file-list 
--r R                 
--sr SR          
--ola false

example:
python run.py eval \
  --logname ./singlespeaker.lr0.000300.1.g4.b64/model.ckpt-20101 \
  --out-label singlespeaker-out \
  --wav-file-list ../data/vctk/speaker1/speaker1-val-files.txt \
  --r 4 \
  --pool_size 2 \
  --strides 2 \
  --model audiotfilm

ACKNOWLEDGMENT

I would like to thank the author Heming Wang, who was kind enough to answer my questions about the DFT dimensions and the silence filter.

zayd/deep-audio-super-resolution - for the DNN baseline
https://github.com/moodoki/tfnet - for the TFNet baseline

Some individual items may not have been added yet. If you notice anything missing, please let me know, and thanks for your understanding.