HiFi-GAN: High-Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks
This is an unofficial PyTorch implementation of the above-mentioned paper by Su et al. (2020).
librosa 0.8.0
numpy 1.18.1
pandas 1.0.1
scipy 1.4.1
soundfile 0.10.3
torch 1.6.0
torchaudio 0.6.0
tqdm 4.54.1
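If these packages are not already available, they could be installed in one step, for example with pip (version pins taken from the list above; newer versions may also work):
pip install librosa==0.8.0 numpy==1.18.1 pandas==1.0.1 scipy==1.4.1 soundfile==0.10.3 torch==1.6.0 torchaudio==0.6.0 tqdm==4.54.1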
Data for training can be supplied in several ways. In hparams.py (hparams.files), you can specify the paths to your data. In all cases, each path must point to either a directory containing audio files (.wav) or a .pkl file of a pandas DataFrame. All audio data should have a sample rate of 16 kHz or above.
When specifying directories, the audio files can be contained directly in the specified directory.
When specifying .pkl files, the DataFrames for speakers, IRs and noises must each contain a column labeled path, with paths to audio files as its rows (see the sketch below for one way to build such a file).
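As an illustration, here is a minimal sketch of how a compatible .pkl file could be created with pandas. The directory name and output file name are hypothetical; only the path column is required by this repository.

from pathlib import Path
import pandas as pd

# Hypothetical example: collect all .wav files below a local data directory
# (replace "my_speech_data" with the location of your own data).
wav_paths = sorted(str(p) for p in Path("my_speech_data").rglob("*.wav"))

# The repository only requires a column labeled "path" whose rows are
# paths to audio files.
df = pd.DataFrame({"path": wav_paths})

# Save the DataFrame as a .pkl file and point hparams.files at it.
df.to_pickle("speakers.pkl")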
Training can be performed on multiple GPUs. Run
python -m torch.distributed.launch --nproc_per_node=<DEVICE_COUNT> train.py [--checkpoint]
in the command line, replacing <DEVICE_COUNT> with the number of CUDA devices in your system and optionally providing the path to a checkpoint file when resuming training from an earlier checkpoint.
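For example, on a machine with two CUDA devices and resuming from an earlier run, the call could look like this (the checkpoint path is purely illustrative):
python -m torch.distributed.launch --nproc_per_node=2 train.py --checkpoint path/to/checkpoint.pt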
You can monitor training using TensorBoard. Pass the path to runs/<RUN_DIR>/logs as the --logdir parameter.
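For example (assuming TensorBoard is installed and <RUN_DIR> is replaced with your actual run directory):
tensorboard --logdir runs/<RUN_DIR>/logs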
Run
python inference.py --checkpoint <CHECKPOINT> --input <INPUT> --output_dir <OUTPUT_DIR> [--device <DEVICE>] [--hparams <HPARAMS>]
in the command line, replacing <CHECKPOINT> with the path to a checkpoint file, <INPUT> with the path to either a single audio file or a directory of audio files you wish to perform inference on, and <OUTPUT_DIR> with the path to the directory in which to store the outputs (it will be created automatically). Optionally, specify a <DEVICE> to run inference on (e.g. cpu or cuda:0) and/or the path to an <HPARAMS> file if you want to use hparams other than the ones specified in hparams.py.
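For example, to run inference on every audio file in a directory using the first GPU (all paths shown here are purely illustrative):
python inference.py --checkpoint path/to/checkpoint.pt --input path/to/noisy_audio --output_dir path/to/outputs --device cuda:0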
In our experiments, we have not yet been able to reproduce the results reported in the original paper in terms of the subjectively perceived audio quality of the predictions. For training, we used the following datasets:
Nautilus Speaker Characterization (NSC) Corpus. We chose NSC over DAPS (the dataset used in the original paper) since NSC features 300 individual speakers compared to DAPS's 20. Also, for our application, German speakers are preferable for training.
We were unable to reliably perform the RT60 augmentation described by Bryan (2019). To ensure sufficient variety in the IR data, we instead used a selection from a series of IR datasets, resulting in a custom collection of ~100,000 individual IRs.
As described in the original paper, we used noise data from the REVERB Challenge database, contained in the Room Impulse Response and Noise Database.