
Supporting scripts for a Master's thesis research project.


Denoiser for Russian speech

Preview

This repository contains forks of two implemented denoisers, DTLN and Demucs, a fork of the DNS repository whose scripts were used as a base for creating noised datasets with Russian speech, and additional supporting scripts for dataset preparation.

Below are the steps used for fine-tuning the DTLN denoiser on Russian speech. Fine-tuning is necessary because, as our experiments showed, both denoisers perform poorly without it.

Dataset Overview

For model robustness we train the denoiser on two datasets: Open-STT YouTube speech and Common Voice Version 3.0. In total we generate 100 hours of audio (40 h of noised wavs and 10 h of clean speech for each dataset).

To create the noised dataset we used three types of noise from the DEMAND collection:

For proper quality we also add reverb to the clean audio dataset; please download the RIRs provided in the DNS Challenge:
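Applying a RIR amounts to convolving the clean signal with the room impulse response. A minimal sketch (the function name and normalization choice are ours, not from the repo's scripts):

```python
import numpy as np

def add_reverb(clean, rir):
    """Convolve speech with a room impulse response, trimmed to the input length."""
    rir = rir / (np.max(np.abs(rir)) + 1e-9)  # normalize the RIR peak to ~1
    return np.convolve(clean, rir)[: len(clean)]
```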

Data downloading scripts

  1. Noise, Reverberation and YouTube speech:

Run the download.sh script with two arguments: <absolute_path_to_clean_speech_directory> <absolute_path_to_noise_directory>
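A typical invocation might look like this (the directory paths are placeholders; substitute your own absolute paths):

```shell
# Hypothetical example paths; download.sh expects two absolute directories
CLEAN_DIR="$PWD/data/clean_speech"
NOISE_DIR="$PWD/data/noise"
mkdir -p "$CLEAN_DIR" "$NOISE_DIR"
# Kick off the downloads; continue even if some transfers fail,
# since partially downloaded archives can still be unpacked
./download.sh "$CLEAN_DIR" "$NOISE_DIR" || echo "some downloads failed"
# Unpack whatever archives arrived
for f in "$NOISE_DIR"/*.tar.gz; do
  [ -e "$f" ] || continue
  tar -xvf "$f" -C "$NOISE_DIR"
done
```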

Note:

Use tar -xvf <downloaded_archive_name.tar.gz> to unpack the downloaded files. Due to server errors, files may be only partially downloaded, but they will still unpack correctly.

  2. Mozilla Common Voice Version 3.0:

Create noised audio files

For training our denoiser we create a dataset that consists of four parts, 120 hours of speech in total.

  1. 40 hours: configure youtube_noisyspeech_synthesizer.cfg and run youtube_noisyspeech_synthesizer.py.

  2. 40 hours: configure cv_noisyspeech_synthesizer.cfg and run cv_noisyspeech_synthesizer.py.

  3. 20 hours: configure youtube_fake_noisyspeech_synthesizer.cfg and run youtube_fake_noisyspeech_synthesizer.py.

  4. 20 hours: configure cv_fake_noisyspeech_synthesizer.cfg and run cv_fake_noisyspeech_synthesizer.py.
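The .cfg files follow the DNS-Challenge synthesizer format; a fragment might look like the following (the key names and values here are assumptions for illustration; check the actual config files in this repo):

```ini
[noisy_speech]
sampling_rate = 16000
audio_length = 15        ; target clip length in seconds
total_hours = 40
snr_lower = 0            ; SNR range the synthesizer samples from
snr_upper = 40
noise_dir = /data/noise
speech_dir = /data/clean_speech
```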

The basic idea of the first two scripts:

  • choose up to six random audio files from your clean speech folder;

  • stack the chosen audio files, padded with zeros, to create audio of a fixed length (set in the config);

  • stack the transcripts corresponding to the chosen audio files, so you can run ASR and compute WER;

  • pick a random folder from the downloaded noise types;

  • pick a random SNR level from the range you provided in the config;

  • generate a mixture of noise and clean audio at the chosen SNR level.
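The padding and mixing steps above can be sketched as follows (a minimal sketch; the function names are ours, and the actual synthesizer scripts in this repo differ):

```python
import numpy as np

def stack_to_length(clips, target_len):
    """Concatenate clips, then zero-pad (or trim) to a fixed sample count."""
    out = np.concatenate(clips)[:target_len]
    if len(out) < target_len:
        out = np.pad(out, (0, target_len - len(out)))
    return out

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so the clean/noise power ratio equals snr_db, then mix."""
    # Tile the noise if it is shorter than the speech
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]
    clean_rms = np.sqrt(np.mean(clean ** 2))
    noise_rms = np.sqrt(np.mean(noise ** 2))
    # Target noise RMS satisfies 20*log10(clean_rms / noise_rms) == snr_db
    noise = noise * (clean_rms / (10 ** (snr_db / 20)) / noise_rms)
    return clean + noise
```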

The other two scripts also add reverberation and create transcriptions, but the "noisy" audio they produce is actually clean. We add these samples to make sure that our denoiser processes clean speech correctly.

Note: there are two hard-coded workarounds in the scripts:

  • we use six audio files because the clean-speech clips are about 2 seconds each, while the target length of our audio files is 15 seconds;

  • there is a TCAR noise type that sounds much quieter than the other noise types, so we subtract 20 from the randomly chosen SNR level when using it.
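The TCAR adjustment can be expressed as follows (a sketch; the function name and signature are ours):

```python
import random

def pick_snr(noise_type, snr_lower, snr_upper):
    """Pick a random SNR level, compensating for the quiet TCAR noise type."""
    snr = random.uniform(snr_lower, snr_upper)
    if "TCAR" in noise_type.upper():
        snr -= 20  # TCAR recordings are much quieter than other noise types
    return snr
```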