2023/01/10
This repository contains an implementation of SpeakerBeam method for target speech extraction, made public during Interspeech 2021 tutorial.
The code is based on the Asteroid toolkit for audio speech separation.
To install requirements:
pip install -r requirements.txt
The code was tested with Python 3.8.6.
The directory egs
contains a recipe for Libri2mix dataset. The recipe assumes that you already have the LibriMix dataset available. If not, please follow the instructions at the LibriMix repository to obtain it.
Before running our recipe, modify path.sh
file to contain path to the repository root.
PATH_TO_REPOSITORY="<path-to-repo>"
Then follow the steps below in the recipe directory
cd egs/libri2mix
To prepare lists of the data run
local/prepare_data.sh <path-to-libri2mix-data>
The <path-to-libri2mix-data>
should contain wav8k/min
subdirectories. The command will create data
directory containing csv
lists describing the data. The preparation of the data follows the data preparation from Asteroid. In addition, it creates a list mapping mixtures to enrollment utterances.
To train the SpeakerBeam model run
. ../../path.sh
python train.py --exp_dir exp/speakerbeam
The training script will by default use parameters in local/conf.yml
. To run with different parameters, you can either change the local/conf.yml
file or pass them directly as command-line arguments, e.g.
python train.py --exp_dir exp/speakerbeam_adaptlay15 --i_adapt_layer 15
to change the position of the adaptation layer in the network.
The training will create directory exp/speakerbeam
. The final model after the training is finished is stored in exp/speakerbeam/best_model.pth
. The training progress can be observed in Tensorboard using logs in exp/speakerbeam/lightning_logs
.
To extract target speech signals on the test set with the trained model and evaluate performance, run
python eval.py --test_dir data/wav8k/min/test --task sep_noisy --model_path exp/speakerbeam/best_model.pth --out_dir exp/speakerbeam/out_best --exp_dir exp/speakerbeam --use_gpu=1
It is also possible to evaluate with an intermediate checkpoint, e.g.
python eval.py --test_dir data/wav8k/min/test --task sep_noisy --model_path exp/speakerbeam/checkpoints/epoch\=24-step\=115824.ckpt --from_checkpoint 1 --out_dir exp/speakerbeam/out_e24_s115824 --exp_dir exp/speakerbeam --use_gpu=1
In the output directory exp/speakerbeam/out_best
, you can find the averaged results in final_metrics.json
and the extracted audio files in <out_dir>/out
.
Please cite our works when using this code:
@ARTICLE{Zmolikova_Spkbeam_STSP19,
author={Žmolíková, Kateřina and Delcroix, Marc and Kinoshita, Keisuke and Ochiai, Tsubasa and Nakatani, Tomohiro and Burget, Lukáš and Černocký, Jan},
journal={IEEE Journal of Selected Topics in Signal Processing},
title={SpeakerBeam: Speaker Aware Neural Network for Target Speaker Extraction in Speech Mixtures},
year={2019},
volume={13},
number={4},
pages={800-814},
doi={10.1109/JSTSP.2019.2922820}}
@INPROCEEDINGS{delcroix_tdSpkBeam_ICASSP20,
author={Delcroix, Marc and Ochiai, Tsubasa and Zmolikova, Katerina and Kinoshita, Keisuke and Tawara, Naohiro and Nakatani, Tomohiro and Araki, Shoko},
booktitle={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={Improving Speaker Discrimination of Target Speech Extraction With Time-Domain Speakerbeam},
year={2020},
volume={},
number={},
pages={691-695},
doi={10.1109/ICASSP40776.2020.9054683}}