This is an UNOFFICIAL implementation of the audio super-resolution model proposed in H.M. Wang and D.L. Wang, "Towards Robust Speech Super-Resolution".
The code is heavily based on https://github.com/kuleshov/audio-super-res.

```
!pip install https://github.com/schmiph2/pysepm/archive/master.zip
```
Model summary:

```
Parameters: 10,281,363
Kernel size: 11
D-Block: (None, None, 64)
D-Block: (None, None, 64)
D-Block: (None, None, 64)
D-Block: (None, None, 128)
D-Block: (None, None, 128)
D-Block: (None, None, 128)
D-Block: (None, None, 256)
D-Block: (None, None, 256)
B-Block: (None, None, 256)
U-Block: (None, None, 512)
U-Block: (None, None, 512)
U-Block: (None, None, 256)
U-Block: (None, None, 256)
U-Block: (None, None, 256)
U-Block: (None, None, 128)
U-Block: (None, None, 128)
U-Block: (None, None, 128)
```
Balanced corpus:
- 100% VCTKS
- 10% TIMIT, IEEE
- 2% VCTKM, WSJ, LIBRI, and Mixed
Dropout rate = 0.2. Optimizer: Adam with a learning rate of 0.0003, which is halved if the validation loss has not improved for 3 consecutive epochs. Training stops early if the validation loss has not improved for 6 successive epochs.
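As a minimal sketch, this schedule maps onto standard Keras callbacks; the repository may implement the same logic by hand (see change (8) below):

```python
# Sketch of the schedule above with standard Keras callbacks; the repository
# may implement the same logic manually (see change (8) below).
from tensorflow import keras

optimizer = keras.optimizers.Adam(learning_rate=0.0003)
callbacks = [
    # Halve the LR if val loss has not improved for 3 consecutive epochs.
    keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3),
    # Stop if val loss has not improved for 6 successive epochs.
    keras.callbacks.EarlyStopping(monitor='val_loss', patience=6),
]
```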
Qualitative examples from several datasets are available at https://tan90xx.github.io/SR-display.github.io/.
T1. EXPERIMENTAL RESULTS FOR CROSS-CORPUS SR USING THE FOUR BASELINES AND THE PROPOSED MODEL

| Model / Training set | TIMIT SNR | TIMIT LSD | TIMIT PESQ | WSJ SNR | WSJ LSD | WSJ PESQ | LIBRI SNR | LIBRI LSD | LIBRI PESQ | IEEE SNR | IEEE LSD | IEEE PESQ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Spline | 18.27 | 2.07 | 3.51 | 9.59 | 2.27 | 2.83 | 19.73 | 2.22 | 3.43 | 21.12 | 2.14 | 3.94 |
| DNN-BWE/TIMIT | 17.38 | 1.64 | 1.82 | 7.69 | 1.41 | 1.37 | 19.20 | 1.27 | 1.93 | 20.46 | 1.57 | 1.81 |
| DNN-BWE/WSJ | 17.40 | 1.56 | 1.91 | 7.70 | 1.31 | 1.42 | 19.14 | 1.23 | 2.00 | 20.39 | 1.49 | 1.91 |
| DNN-BWE/LIBRI | 17.52 | 1.57 | 1.89 | 7.75 | 1.37 | 1.43 | 19.37 | 1.21 | 2.03 | 20.68 | 1.53 | 1.86 |
| DNN-BWE/IEEE | 11.55 | 2.52 | 1.20 | -1.12 | 2.40 | 1.06 | 15.39 | 1.96 | 1.31 | 15.92 | 2.31 | 1.23 |
| DNN-Cepstral/TIMIT | 17.23 | 1.02 | 2.43 | 11.18 | 0.81 | 1.79 | 18.97 | 0.85 | 2.68 | 19.84 | 1.02 | 2.59 |
| DNN-Cepstral/WSJ | 17.23 | 1.02 | 2.42 | 11.18 | 0.82 | 1.78 | 18.97 | 0.85 | 2.69 | 19.84 | 1.02 | 2.57 |
| DNN-Cepstral/LIBRI | 16.31 | 1.54 | 1.68 | 9.78 | 1.19 | 1.32 | 18.83 | 1.13 | 2.39 | 19.63 | 1.37 | 1.91 |
| DNN-Cepstral/IEEE | 17.22 | 1.06 | 2.34 | 11.11 | 0.84 | 1.72 | 18.96 | 0.87 | 2.64 | 19.83 | 1.03 | 2.56 |
| AudioUNet/TIMIT | 18.41 | 1.69 | 3.13 | 10.19 | 2.11 | 3.16 | 20.49 | 1.60 | 3.17 | 22.16 | 1.73 | 3.55 |
| AudioUNet/WSJ | 18.35 | 1.92 | 3.40 | 9.90 | 2.23 | 2.72 | 20.49 | 1.76 | 3.47 | 22.19 | 1.94 | 3.74 |
| AudioUNet/LIBRI | 18.44 | 2.01 | 3.62 | 10.19 | 2.27 | 2.96 | 20.82 | 2.01 | 3.85 | 22.62 | 2.20 | 4.01 |
| AudioUNet/IEEE | 18.37 | 1.80 | 3.28 | 10.01 | 2.23 | 3.00 | 20.32 | 1.73 | 3.36 | 21.92 | 1.84 | 3.67 |
| TFNet/TIMIT | 17.08 | 1.18 | 2.91 | 10.18 | 1.29 | 2.38 | 14.86 | 1.30 | 2.35 | 21.04 | 1.33 | 2.71 |
| TFNet/WSJ | 15.17 | 1.18 | 2.40 | 9.27 | 1.26 | 2.27 | 15.36 | 1.28 | 2.22 | 20.31 | 1.45 | 2.09 |
| TFNet/LIBRI | 15.82 | 1.17 | 2.33 | 10.97 | 1.20 | 2.22 | 16.84 | 1.26 | 2.76 | 22.97 | 1.24 | 3.04 |
| TFNet/IEEE | 13.48 | 1.39 | 1.82 | 9.39 | 1.36 | 1.80 | 14.89 | 1.36 | 2.08 | 21.37 | 1.33 | 2.46 |
| Proposed/TIMIT | 18.58 | 1.64 | 3.82 | 10.68 | 1.91 | 3.58 | 21.33 | 1.58 | 3.76 | 23.27 | 1.70 | 4.22 |
| Proposed/WSJ | 18.48 | 1.50 | 3.02 | 10.39 | 1.68 | 2.44 | 21.00 | 1.41 | 3.36 | 17.44 | 1.53 | 2.69 |
| Proposed/LIBRI | 18.55 | 1.87 | 3.41 | 10.57 | 2.09 | 2.59 | 21.31 | 1.83 | 3.79 | 23.22 | 1.98 | 3.82 |
| Proposed/IEEE | 18.29 | 1.50 | 2.54 | -5.97 | 1.70 | 1.11 | 17.11 | 1.32 | 2.27 | 1.47 | 1.46 | 1.27 |
| Proposed/Mixed | 18.48 | 1.74 | 3.21 | 5.64 | 1.94 | 1.41 | 21.29 | 1.69 | 3.63 | 23.09 | 1.84 | 3.68 |
T2. COMPARISON OF VARIOUS LOSS FUNCTIONS ON THE TIMIT DATASET

| Loss | SNR | LSD | PESQ |
|---|---|---|---|
| MAE | 18.48 | 1.70 | 3.33 |
| MSE | 18.48 | 1.49 | 2.78 |
| F | 18.50 | 1.61 | 3.11 |
| RI | 18.58 | 1.73 | 3.53 |
| TF | 18.44 | 1.69 | 3.00 |
| RI-MAG | 18.49 | 1.73 | 3.20 |
| PCM | 18.41 | 1.48 | 2.77 |
| T-PCM | 18.58 | 1.64 | 3.82 |
T3. MODEL TRAINED ON ORIGINAL TIMIT UTTERANCES, TESTED ON DATA CONVOLVED WITH DIFFERENT MIRs

| Model | SNR | LSD | PESQ |
|---|---|---|---|
| Spline | 18.27 | 2.07 | 3.51 |
| Original | 18.58 | 1.64 | 3.82 |
| Test on MIR1 | 11.70 | 1.66 | 3.50 |
| Test on MIR2 | 14.53 | 1.51 | 3.71 |
| Average of 20 MIRs | 15.33 | 1.58 | 3.70 |
T4. EXPERIMENTAL RESULTS FOR SR MODELS EVALUATED ON VCTK WITH DOWNSAMPLING FACTORS OF 2 AND 4

| Model | R | VCTKS SNR | VCTKS LSD | VCTKS PESQ | VCTKM SNR | VCTKM LSD | VCTKM PESQ |
|---|---|---|---|---|---|---|---|
| Spline | 2 | 19.42 | 2.13 | 3.06 | 22.40 | 1.96 | 3.92 |
| DNN-BWE | 2 | 19.12 | 1.49 | 2.07 | 17.48 | 2.23 | 1.79 |
| DNN-Cepstral | 2 | 18.59 | 0.89 | 3.08 | 19.62 | 1.45 | 2.69 |
| AudioUNet | 2 | 20.21 | 1.52 | 2.79 | 22.54 | 1.77 | 3.85 |
| TFNet | 2 | 22.00 | 1.50 | 2.41 | -2.22 | 2.59 | 1.05 |
| Proposed | 2 | 14.98 | 1.44 | 2.77 | 23.01 | 1.73 | 4.06 |
| Spline | 4 | 15.28 | 3.01 | 3.16 | 19.28 | 2.64 | 3.36 |
| DNN-BWE | 4 | 15.01 | 1.72 | 1.72 | 18.61 | 2.22 | 1.66 |
| DNN-Cepstral | 4 | 15.16 | 1.36 | 2.62 | 18.28 | 1.66 | 2.30 |
| AudioUNet | 4 | 15.50 | 2.08 | 2.30 | 19.47 | 2.26 | 2.56 |
| TFNet | 4 | 13.41 | 1.74 | 1.87 | 15.20 | 1.61 | 2.15 |
| Proposed | 4 | 15.43 | 2.12 | 2.62 | 20.30 | 2.12 | 3.41 |
The code is adapted from https://github.com/kuleshov/audio-super-res; the main changes are:
- (1) Set the random seeds and generate txt file lists of the train, valid, and test audio paths for each corpus, so that the experiments can be repeated as exactly as possible.
- (2) Apply MVN and a silence filter during dataset preparation: the author observes that MVN improves cross-corpus generalization, and the silence filter stabilizes training and ensures faster convergence.
- (3) Extend the down-sampling schemes in dataset preparation, which previously used only the SciPy `decimate` function. In summary, the preprocessing pipeline is (see the sketch below):
  1. MVN (mean and variance normalization)
  2. Silence filter (discard segments below an energy threshold of 0.05): `librosa.effects.trim(x, top_db=-20*np.log10(0.05/1.0))`
  3. One of four down-sampling schemes: (1) low-pass filter (Chebyshev Type I IIR filter of order 8) followed by subsampling (discarding samples at fixed intervals); (2) decimate; (3) resample; (4) mix
  4. Upsampling (cubic spline interpolation)
  5. Padding, then patch generation (frames of 2048 samples with an overlap of 1024 samples)
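Below is a minimal sketch of this pipeline, assuming scipy/librosa equivalents of the steps above; the function name, defaults, and the 'mix' behavior are illustrative assumptions, not the repository's exact code:

```python
# Illustrative preprocessing sketch; names and the 'mix' behavior are
# assumptions; only the steps follow the pipeline described above.
import numpy as np
import librosa
from scipy import signal
from scipy.interpolate import CubicSpline

def preprocess(x, r=2, dim=2048, stride=1024, scheme='decimate'):
    # MVN: mean and variance normalization.
    x = (x - x.mean()) / (x.std() + 1e-8)
    # Silence filter: trim parts below an energy threshold of 0.05.
    x, _ = librosa.effects.trim(x, top_db=-20 * np.log10(0.05 / 1.0))
    if scheme == 'mix':  # assumed: pick one of the three schemes at random
        scheme = np.random.choice(['subsample', 'decimate', 'resample'])
    # Down-sample by factor r.
    if scheme == 'subsample':
        # Order-8 Chebyshev Type I low-pass, then discard samples at fixed intervals.
        b, a = signal.cheby1(8, 0.05, 0.8 / r)
        x_lr = signal.filtfilt(b, a, x)[::r]
    elif scheme == 'decimate':
        x_lr = signal.decimate(x, r)
    else:  # 'resample'
        x_lr = signal.resample(x, len(x) // r)
    # Upsample back to the original length with a cubic spline.
    t_lr = np.linspace(0, len(x) - 1, num=len(x_lr))
    x_up = CubicSpline(t_lr, x_lr)(np.arange(len(x)))
    # Cut into overlapping patches (assumes len(x_up) >= dim).
    n = (len(x_up) - dim) // stride + 1
    return np.stack([x_up[i * stride : i * stride + dim] for i in range(n)])
```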
- (4) Generalize prep_vctk.py to prep_dataset.py for the other corpora, and record the arguments in dataset.json for batch generation of h5 files.

The script prep_dataset.py works as follows.
```
optional arguments:
  -h, --help
  --corpus
  --state
  --scale
  --dimension
  --stride
  --interpolate
  --low-pass
  --batch-size
  --sr
  --sam
```
example:

```
python prep_dataset.py \
    --corpus vctk-speaker1 \
    --state train \
    --scale 2 \
    --dimension 2048 \
    --stride 1024 \
    --interpolate \
    --low-pass decimating \
    --batch-size 32 \
    --sr 16000 \
    --sam 0.25
```
Removed original arguments (`--file-list`, `--in-dir`, `--out`), which are now generated automatically from the arguments above:

```
--file-list ./Corpus/vctk-speaker1-train-files.txt \
--in-dir ./Corpus/ \
--out vctk-speaker1-decimating-train.2.2048.1024.h5
```
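A sketch of how those removed arguments can be derived; this reconstruction just follows the pattern of the example above:

```python
# Hypothetical reconstruction following the example above; the exact pattern
# lives in prep_dataset.py.
corpus, state, scale = 'vctk-speaker1', 'train', 2
dimension, stride, low_pass = 2048, 1024, 'decimating'

file_list = './Corpus/{}-{}-files.txt'.format(corpus, state)
in_dir = './Corpus/'
out = '{}-{}-{}.{}.{}.{}.h5'.format(corpus, low_pass, state,
                                    scale, dimension, stride)
# -> vctk-speaker1-decimating-train.2.2048.1024.h5
```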
- (5) Fix API breakage so the code runs under current TensorFlow/Keras:

```
# Errors:
#   Conv1D: init --> kernel_initializer, subsample_length --> strides
#   Convolution1D --> Conv1D
#   stdev --> stddev
#   merge --> add
# Warnings:
#   initialize_all_variables --> global_variables_initializer
```
- (6) Build the structure of the proposed model, adapted from AudioUNet, exactly following the sequence Conv --> ReLU --> Dropout (a block sketch follows below):

```python
# settings:
n_filters = [64, 64, 64, 128, 128, 128, 256, 256]
n_filtersizes = [11, 11, 11, 11, 11, 11, 11, 11]
# activation: x = LeakyReLU(0.2)(x)  -->  x = PReLU(shared_axes=[1])(x)
# add dropout in U- and D-blocks:
if l % 3 == 2:
    x = Dropout(rate=0.2)(x)
```
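For illustration, one downsampling (D) block under these settings might look like the following; this is a hedged sketch adapted from AudioUNet's block structure, not the repository's exact code:

```python
# Hedged sketch of one downsampling (D) block; layer order follows the
# Conv --> ReLU --> Dropout sequence noted above, with PReLU as activation.
from tensorflow.keras.layers import Conv1D, Dropout, PReLU

N_FILTERS = [64, 64, 64, 128, 128, 128, 256, 256]

def d_block(x, l, kernel_size=11):
    # Strided convolution halves the time axis; orthogonal init as in AudioUNet.
    x = Conv1D(N_FILTERS[l], kernel_size, strides=2, padding='same',
               kernel_initializer='orthogonal')(x)
    x = PReLU(shared_axes=[1])(x)   # replaces LeakyReLU(0.2)
    if l % 3 == 2:                  # dropout on every third block
        x = Dropout(rate=0.2)(x)
    return x
```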
- (7) Define a custom loss function. (OLA) "Framed segments are first divided into frames of 512 samples with a frame shift of 256 samples. Then we multiply these frames with a Hamming window."

```python
# OLA: reconstruct the waveform [batch, length] from overlapping patches X.
x_ola = tf.signal.overlap_and_add(X, 1024)
x_ola = tf.cast(x_ola, tf.float32)
# STFT with FRAME = 512, SHIFT = 256, and a Hamming window, per the quote above.
X_spec = tf.signal.stft(signals=x_ola, frame_length=FRAME, frame_step=SHIFT,
                        fft_length=FRAME, window_fn=tf.signal.hamming_window)
```
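Building on that snippet, here is a hedged sketch of a combined time + spectral-magnitude loss. The exact T-PCM definition is given in the paper; this only illustrates the general shape of such a loss:

```python
# Illustrative time + spectral-magnitude loss built from the OLA/STFT above;
# this is NOT the paper's exact T-PCM formula, only the general shape.
import tensorflow as tf

FRAME, SHIFT = 512, 256

def ola_and_stft(X):
    # X: [batch, n_patches, 2048] overlapping patches with a 1024-sample shift.
    x = tf.cast(tf.signal.overlap_and_add(X, 1024), tf.float32)
    spec = tf.signal.stft(signals=x, frame_length=FRAME, frame_step=SHIFT,
                          fft_length=FRAME, window_fn=tf.signal.hamming_window)
    return x, spec

def time_plus_magnitude_loss(Y_true, Y_pred, alpha=0.5):
    y_t, S_t = ola_and_stft(Y_true)
    y_p, S_p = ola_and_stft(Y_pred)
    l_time = tf.reduce_mean(tf.abs(y_t - y_p))                  # time-domain MAE
    l_mag = tf.reduce_mean(tf.abs(tf.abs(S_t) - tf.abs(S_p)))   # magnitude MAE
    return alpha * l_time + (1.0 - alpha) * l_mag
```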
- (8) Apply a deque to monitor val_loss and implement early stopping, etc. (see the sketch below).
- (9) Write the logdir and csv records to the same path for quick checks.
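A minimal sketch of the deque-based early stopping in (8); `validate()` is a hypothetical stand-in for one pass over the validation set:

```python
# Hedged sketch of deque-based early stopping; validate() is hypothetical.
from collections import deque

num_epochs = 100
recent = deque(maxlen=6)            # validation losses of the last 6 epochs
best = float('inf')
for epoch in range(num_epochs):
    val_loss = validate()           # hypothetical helper
    recent.append(val_loss)
    best = min(best, val_loss)
    # Stop if the best loss so far is older than the last 6 epochs.
    if len(recent) == recent.maxlen and min(recent) > best:
        break
```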
The training mode of run.py works as follows.

```
optional arguments:
  -h, --help
  --model
  --loss_func
  --train
  --val
  -e, --epochs
  --logname
  --r
  --pool_size
  --strides
  --full
```
example:

```
python run.py train \
    --model proposed \
    --loss_func T_PCM \
    --train ../data/vctk/vctk-speaker1-train.2.2048.1024.h5 \
    --val ../data/vctk/vctk-speaker1-val.2.2048.1024.h5 \
    -e 100 \
    --logname tmp-run \
    --r 2 \
    --pool_size 2 \
    --strides 2 \
    --full true
```
Removed original arguments, which are now set in code:

```
--alg
--batch-size
--layers
--lr
--piano    (default = false)
--grocery  (default = false)
--speaker  (default = single)
```

set in code:

```python
if args.model == 'proposed':
    opt_params = {'loss_func': args.loss_func, 'alg': 'adam', 'lr': 0.0003,
                  'b1': 0.9, 'b2': 0.999, 'batch_size': 32, 'layers': 8}
```
- (10) Write the eval output files to the same path as the logdir and csv records.
- (11) Auto-calculate the metrics (see the sketch after this list).
- (12) Define visualization functions to display spectrograms and the training process.
- (13) Build a web page [code].
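For (11), a hedged sketch of the metric computation: SNR and LSD use their common definitions here, and PESQ comes from pysepm (installed at the top of this README), whose exact API should be checked against its documentation:

```python
# Common SNR/LSD definitions; not necessarily the repository's exact code.
# PESQ is computed with pysepm (see the pip install at the top of this README).
import numpy as np
import librosa

def snr(ref, est):
    # Signal-to-noise ratio in dB between a reference and an estimate.
    return 10 * np.log10(np.sum(ref ** 2) / np.sum((ref - est) ** 2))

def lsd(ref, est, n_fft=2048):
    # Log-spectral distance: per-frame RMS difference of log power spectra.
    S = np.log10(np.abs(librosa.stft(ref, n_fft=n_fft)) ** 2 + 1e-10)
    S_hat = np.log10(np.abs(librosa.stft(est, n_fft=n_fft)) ** 2 + 1e-10)
    return float(np.mean(np.sqrt(np.mean((S - S_hat) ** 2, axis=0))))
```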
The eval mode of run.py works as follows.

```
optional arguments:
  -h, --help
  --logname
  --out-label
  --wav-file-list
  --r R
  --sr SR
  --ola    (default = false)
```
example:

```
python run.py eval \
    --logname ./singlespeaker.lr0.000300.1.g4.b64/model.ckpt-20101 \
    --out-label singlespeaker-out \
    --wav-file-list ../data/vctk/speaker1/speaker1-val-files.txt \
    --r 4 \
    --pool_size 2 \
    --strides 2 \
    --model audiotfilm
```
I would like to thank the author Heming Wang, who was kind enough to answer my questions about the dimensions of the DFT and the silence filter.
- https://github.com/zayd/deep-audio-super-resolution - for the DNN baseline
- https://github.com/moodoki/tfnet - for the TFNet baseline
Individual items may not have been updated in time. If you find anything missing or wrong, please let me know. Thank you for your understanding.