Face Landmark-based Speaker-Independent Audio-Visual Speech Enhancement in Multi-Talker Environments
Implementation of the audio-visual speech enhancement system described in the paper "Face Landmark-based Speaker-Independent Audio-Visual Speech Enhancement in Multi-Talker Environments" by the University of Modena and Reggio Emilia and the Istituto Italiano di Tecnologia.
If you are interested in this work, check out the project page.
All code is written for Python 3. Create a virtual environment (optional) and install the requirements by running:
pip install -r requirements.txt
The main program is av_speech_enhancement.py. You can list the available subcommands by typing av_speech_enhancement.py -h. Try av_speech_enhancement.py <subcommand> -h for more information about a subcommand.
The audio-visual dataset must have the following directory structure:
s1
/audio
/file1.wav
/file2.wav
...
/video
/file1.mpg
/file2.mpg
...
s2
/audio
/file1.wav
/file2.wav
...
/video
/file1.mpg
/file2.mpg
...
...
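Before running any preprocessing step, it can help to confirm that the dataset matches the layout above. The following is a minimal sketch and is not part of av_speech_enhancement.py: the function name check_dataset_layout is hypothetical, and it assumes that each audio file has a video file with the same base name, as the example tree suggests.

```python
from pathlib import Path


def check_dataset_layout(data_dir, audio_subdir="audio", video_subdir="video",
                         audio_ext=".wav", video_ext=".mpg"):
    """Return a list of human-readable problems found under data_dir.

    Hypothetical helper: it assumes audio and video files are paired
    by base name (file1.wav <-> file1.mpg).
    """
    problems = []
    speaker_dirs = sorted(p for p in Path(data_dir).iterdir() if p.is_dir())
    for speaker_dir in speaker_dirs:
        audio_dir = speaker_dir / audio_subdir
        video_dir = speaker_dir / video_subdir
        for sub in (audio_dir, video_dir):
            if not sub.is_dir():
                problems.append(f"{speaker_dir.name}: missing directory '{sub.name}'")
        if not (audio_dir.is_dir() and video_dir.is_dir()):
            continue
        # Compare base names to find unpaired recordings.
        audio_stems = {f.stem for f in audio_dir.glob(f"*{audio_ext}")}
        video_stems = {f.stem for f in video_dir.glob(f"*{video_ext}")}
        for stem in sorted(audio_stems - video_stems):
            problems.append(f"{speaker_dir.name}: {stem}{audio_ext} has no matching video")
        for stem in sorted(video_stems - audio_stems):
            problems.append(f"{speaker_dir.name}: {stem}{video_ext} has no matching audio")
    return problems
```

An empty list means every speaker directory has both subdirectories and every recording is paired.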
Generate mixed-speech samples for the training, validation and test sets separately:
av_speech_enhancement.py mixed_speech_generator
--data_dir <data_dir>
--base_speaker_ids <spk1> <spk2> <...>
[--noisy_speaker_ids <spk1> <spk2> <...>]
--audio_dir <audio_dir>
--dest_dir <dest_dir>
--num_samples <num_samples>
--num_mix <num_mix>
--num_mix_speakers <num_mix_speakers> {1,2}
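To illustrate what a mixed-speech sample is, the sketch below sums a target waveform with one or more interfering waveforms. It is an assumption-laden illustration, not the code used by mixed_speech_generator: the function name mix_waveforms is hypothetical, and truncating to the shortest signal and rescaling only when the peak exceeds 1.0 are choices made here for simplicity.

```python
import numpy as np


def mix_waveforms(target, interferers):
    """Sum a target waveform with interfering waveforms (illustrative only).

    Assumptions: all signals share the same sample rate, are float arrays
    in [-1, 1], and are truncated to the shortest one before summing.
    """
    length = min([len(target)] + [len(w) for w in interferers])
    mix = np.asarray(target[:length], dtype=np.float64).copy()
    for waveform in interferers:
        mix += np.asarray(waveform[:length], dtype=np.float64)
    # Rescale only if the sum would clip when written to a WAV file.
    peak = np.max(np.abs(mix)) if length > 0 else 0.0
    if peak > 1.0:
        mix /= peak
    return mix
```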
The generated files are organized as follows:
TRAINING_SET
/s1
/file1_with_s2_file2.wav
/file2_with_s10_file4.wav
...
/s2
/file1_with_s12_file5.wav
/file2_with_s1_file1.wav
...
...
VALIDATION_SET
...
TEST_SET
...
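The file names in the tree above encode the base file, the interfering speaker and the interfering file. A small parser such as the following can recover them. It is a sketch based only on the two-speaker example names shown here; the pattern and the function name parse_mix_name are assumptions, and names produced for mixtures with more speakers may differ.

```python
import re

# Pattern inferred from example names such as "file1_with_s2_file2.wav".
MIX_NAME = re.compile(
    r"^(?P<base>.+?)_with_(?P<noisy_speaker>s\d+)_(?P<noisy_file>.+)\.wav$"
)


def parse_mix_name(filename):
    """Split a mixed-speech file name into (base file, noisy speaker, noisy file)."""
    match = MIX_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected mixed-speech file name: {filename!r}")
    return match.group("base"), match.group("noisy_speaker"), match.group("noisy_file")


print(parse_mix_name("file2_with_s10_file4.wav"))  # ('file2', 's10', 'file4')
```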
Compute power-law compressed spectrograms of mixed-speech audio samples. Repeat this operation for training, validation and test sets. Files are saved in NPY format.
av_speech_enhancement.py audio_preprocessing
--data_dir <data_dir>
--speaker_ids <spk1> <spk2> <...>
--audio_dir <audio_dir>
--dest_dir <dest_dir>
--sample_rate <sample_rate>
--max_wav_length <max_wav_length>
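For readers unfamiliar with the term, a power-law compressed spectrogram is the magnitude of the short-time Fourier transform raised to a power p smaller than 1. The sketch below shows the idea with NumPy only. The function name, the frame length, hop size, FFT size and the exponent p=0.3 are all illustrative assumptions and may not match what audio_preprocessing actually uses.

```python
import numpy as np


def power_law_spectrogram(wav, frame_len=400, hop=160, n_fft=512, p=0.3):
    """Return |STFT|**p with shape (n_frames, n_fft // 2 + 1).

    All default values are assumptions chosen for illustration.
    """
    wav = np.asarray(wav, dtype=np.float64)
    if len(wav) < frame_len:
        # Zero-pad very short signals so that at least one frame exists.
        wav = np.pad(wav, (0, frame_len - len(wav)))
    n_frames = 1 + (len(wav) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack(
        [wav[i * hop:i * hop + frame_len] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    return magnitude ** p
```

The result can be written in NPY format with np.save(path, spectrogram).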
Extract face landmarks from video using the Dlib face detector and face landmark extractor. Files are saved in TXT format (each row has 136 values that represent the flattened x-y coordinates of 68 face landmarks).
av_speech_enhancement.py video_preprocessing
--data_dir <data_dir>
--speaker_ids <spk1> <spk2> <...>
--video_dir <video_dir>
--dest_dir <dest_dir>
--shape_predictor <shape_predictor_file>
--ext <video_file_extension>
<shape_predictor_file> contains the parameters of the face landmark extractor model. You can download a pre-trained model file here.
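The landmark TXT files produced by video_preprocessing can be loaded back into an array of per-frame points. The following sketch assumes that values are whitespace-separated and that each row is interleaved as x1 y1 x2 y2 ...; neither detail is stated above, so check a real file before relying on it. The function name load_landmarks is hypothetical.

```python
import numpy as np


def load_landmarks(txt_path, delimiter=None):
    """Load a landmark file into an array of shape (n_frames, 68, 2).

    Assumes one video frame per row and interleaved x-y ordering.
    delimiter=None means any whitespace, as in numpy.loadtxt.
    """
    flat = np.loadtxt(txt_path, delimiter=delimiter, ndmin=2)
    if flat.shape[1] != 136:
        raise ValueError(f"expected 136 values per row, got {flat.shape[1]}")
    return flat.reshape(len(flat), 68, 2)
```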
If you want to check the result of the face landmark extractor, type:
av_speech_enhancement.py show_face_landmarks
--video <video_file>
--fps <fps>
--shape_predictor <shape_predictor_file>
Compute target binary masks (TBMs) from clean audio samples. For each speaker, the Long-Term Average Speech Spectrum (LTASS) is computed, and the resulting threshold is then applied to all clean audio samples in <audio_dir>.
av_speech_enhancement.py tbm_computation
--data_dir <data_dir>
--speaker_ids <spk1> <spk2> <...>
--audio_dir <audio_dir>
--dest_dir <dest_dir>
--sample_rate <sample_rate>
--max_wav_length <max_wav_length>
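The two-stage procedure described above (a per-speaker LTASS, then thresholding) can be sketched as follows. This is a simplified illustration under stated assumptions: the LTASS is taken as the per-frequency mean of the speaker's spectrograms, and it is used directly as the threshold. The actual tbm_computation subcommand may use a different scale (for example power or dB) or an offset threshold. Both function names are hypothetical.

```python
import numpy as np


def compute_ltass(spectrograms):
    """Average spectrum per frequency bin over all frames of one speaker.

    spectrograms: list of arrays, each of shape (n_frames, n_bins).
    """
    all_frames = np.concatenate(spectrograms, axis=0)
    return all_frames.mean(axis=0)


def compute_tbm(spectrogram, ltass):
    """Binary mask: 1 where a time-frequency bin exceeds the LTASS threshold."""
    return (spectrogram > ltass[np.newaxis, :]).astype(np.float32)
```

Because the threshold depends only on the clean speech of the target speaker, the mask does not depend on the interfering signal.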
Before training, you have to generate TFRecords of the mixed-speech dataset. <data_dir>/<mix_dir> must have three subdirectories named TRAINING_SET, VALIDATION_SET and TEST_SET, created with the mixed_speech_generator subcommand. Pre-computed spectrograms (NPY format) must be located in the same directory as the corresponding audio files.
Set <tfrecords_mode> to "fixed" if all samples of the dataset have the same length (as in the GRID corpus); otherwise use "var" (as in the TCD-TIMIT corpus).
av_speech_enhancement.py tfrecords_generator
--data_dir <data_dir>
--num_speakers <number_speakers_mixed> {2,3}
--mode <tfrecords_mode> {fixed,var}
--dest_dir <dest_dir>
--base_audio_dir <base_audio_dir>
--video_dir <video_dir>
--tbm_dir <tbm_dir>
--mix_audio_dir <mix_audio_dir>
--delta <delta_video_feat> {0,1,2}
--norm_data_dir <normalization_data_dir>
Train one of the audio-visual speech enhancement models described in the paper. You can choose between the VL2M, VL2M_ref, Audio-Visual Concat and Audio-Visual Concat-ref models.
av_speech_enhancement.py training
--data_dir <data_dir>
--train_set <training_set_subdir>
--validation_set <validation_set_subdir>
--exp <experiment_id>
--mode <tfrecords_mode> {fixed,var}
--audio_dim <audio_frame_dimension>
--video_dim <video_frame_dimension>
--num_audio_samples <num_audio_samples>
--model <model_selection> {vl2m,vl2m_ref,av_concat_mask,av_concat_mask_ref}
--opt <optimizer_choice> {sgd,adam,momentum}
--learning_rate <learning_rate>
--updating_step <updating_step>
--learning_decay <learning_decay>
--batch_size <batch_size>
--epochs <num_epochs>
--hidden_units <num_hidden_lstm_units>
--layers <num_lstm_layers>
--dropout <dropout_rate>
--regularization <regularization_weight>
Test your trained model. Enhanced speech samples and estimated masks are saved in <data_dir>/<output_dir>. Estimated masks are saved in the <mask_dir> subdirectory of each speaker directory.
av_speech_enhancement.py testing
--data_dir <data_dir>
--test_set <test_set_subdir>
--exp <experiment_id>
--ckp <model_checkpoint>
--mode <tfrecords_mode> {fixed,var}
--audio_dim <audio_frame_dimension>
--video_dim <video_frame_dimension>
--num_audio_samples <num_audio_samples>
--output_dir <output_dir>
--mask_dir <mask_dir>
If this project is useful for your research, please cite:
@inproceedings{morrone2019face,
title={Face Landmark-based Speaker-Independent Audio-Visual Speech Enhancement in Multi-Talker Environments},
author={Morrone, Giovanni and Bergamaschi, Sonia and Pasa, Luca and Fadiga, Luciano and Tikhanoff, Vadim and Badino, Leonardo},
booktitle={2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={6900-6904},
year={2019},
organization={IEEE}
}