Noise-Aware Speech Separation (NASS)

NOTE: This paper has been accepted by ICASSP 2024!

This repository provides example recipes for Sepformer (NASS) on Libri2Mix, built on SpeechBrain.

Install with GitHub

Once you have created your Python environment (Python 3.7+), you can simply type:

git clone https://github.com/TzuchengChang/NASS
cd NASS/speechbrain
pip install -r requirements.txt
pip install --editable .
pip install mir-eval
pip install pyloudnorm

Introduction


Fig1. The overall pipeline of NASS. $x_n$ and $\hat n$ denote the noisy input and the predicted noise. $\hat{s_1}$ and $\hat{s_2}$ are the separated speech signals, while ${s_1}$ and ${s_2}$ are the ground truth. $h_{\hat {s_1}}$, $h_{\hat {s_2}}$ and $h_{\hat n}$ in the dashed box are predicted representations, while $h_{s_1}$ and $h_{s_2}$ in the solid box are the ground-truth representations. "P" denotes that the mutual information between separated and ground-truth speech is maximized, while "N" denotes that the mutual information between separated speech and noise is minimized.


Fig2. Illustration of patch-wise contrastive learning. In the $i$-th of $K$ samplings, one query example $r^i_q$, one positive example $r^i_p$ and $M$ negative examples ${r_n^{i,j}}$ ($j \in [1,M]$) are sampled from the predicted speech representation $h_{\hat s_a}$, the ground-truth speech representation $h_{s_a}$ and the predicted noise representation $h_{\hat n}$, respectively. "CS" denotes cosine similarity.

Fig3. Spectrum results on Libri2Mix with Sepformer. Subplot (a) shows the mixture; (b) and (c) are baseline results; (d), (e) and (f) are NASS results. Note that (d) is the noise output.
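
The sampling scheme described in the Fig2 caption can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code used in this repository: the function names, the InfoNCE-style cross-entropy form, the temperature `tau`, taking the positive from the same position as the query, and drawing negatives uniformly at random are all hypothetical choices made for this sketch. Each representation is simplified to a list of patch vectors.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity ("CS" in Fig2) between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def patch_contrastive_loss(h_pred, h_true, h_noise, K=4, M=3, tau=0.07, seed=0):
    """Average an InfoNCE-style loss over K samplings (hypothetical sketch).

    h_pred, h_true: patch vectors of predicted and ground-truth speech.
    h_noise: patch vectors of the predicted noise.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(K):
        idx = rng.randrange(len(h_pred))
        query = h_pred[idx]
        # Assumption: the positive sits at the same position as the query.
        positive = h_true[idx]
        # Assumption: M negatives are drawn uniformly from the noise patches.
        negatives = [h_noise[rng.randrange(len(h_noise))] for _ in range(M)]
        logits = [cosine_similarity(query, positive) / tau]
        logits += [cosine_similarity(query, n) / tau for n in negatives]
        # Cross-entropy with the positive at index 0, via a stable log-sum-exp.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum - logits[0]
    return total / K


# Toy patches: a prediction that matches the ground truth should score a
# lower loss than a prediction that looks like the noise.
h_true = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_noise = [[-1.0, 0.5], [0.5, -1.0], [-1.0, -1.0]]
loss_matched = patch_contrastive_loss(h_true, h_true, h_noise)
loss_noisy = patch_contrastive_loss(h_noise, h_true, h_noise)
print(loss_matched < loss_noisy)
```

Minimizing a loss of this shape pulls the query toward the ground-truth speech patch and pushes it away from the noise patches, which is the "P" and "N" behaviour shown in Fig1.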

In this paper, we propose a noise-aware speech separation (NASS) method, which aims to improve the speech quality of separated signals under noisy conditions. Specifically, NASS treats background noise as an additional output and predicts it along with the speakers in a mask-based manner. To denoise effectively, we introduce patch-wise contrastive learning (PCL) between noise and speaker representations from the decoder input and encoder output. The PCL loss aims to minimize the mutual information between the predicted noise and the speakers at the multiple-patch level, suppressing noise information in the separated signals. Experimental results show that NASS improves SI-SNRi or SDRi by 1 to 2 dB over DPRNN and Sepformer on the noisy WHAM! and LibriMix datasets, with a parameter increase of less than 0.1M.
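
For readers unfamiliar with the metric, SI-SNRi is the scale-invariant signal-to-noise ratio of the separated output minus that of the unprocessed mixture, both measured against the ground-truth speech. The sketch below follows the commonly used SI-SNR definition in plain Python. The function names are illustrative, and the code is not taken from this repository's evaluation scripts.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length signals."""
    n = len(reference)
    mean_est = sum(estimate) / n
    mean_ref = sum(reference) / n
    est = [x - mean_est for x in estimate]   # remove DC offset
    ref = [x - mean_ref for x in reference]
    # Project the estimate onto the reference to get the target component.
    dot = sum(a * b for a, b in zip(est, ref))
    ref_energy = sum(b * b for b in ref) + eps
    target = [dot / ref_energy * b for b in ref]
    residual = [a - t for a, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(r * r for r in residual) + eps
    return 10.0 * math.log10(target_energy / residual_energy + eps)


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: gain of the estimate over the unprocessed mixture, in dB."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)


# Toy signals: the interference is orthogonal to the reference, and the
# estimate keeps only 10% of its amplitude.
reference = [1.0, -1.0, 1.0, -1.0] * 2
interference = [1.0, 1.0, -1.0, -1.0] * 2
mixture = [r + i for r, i in zip(reference, interference)]
estimate = [r + 0.1 * i for r, i in zip(reference, interference)]
print(round(si_snr_improvement(estimate, mixture, reference), 2))  # prints 20.0
```

Reducing the interference amplitude by a factor of 10 reduces its energy by a factor of 100, hence the 20 dB gain in this toy case.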

NASS Example

We also provide a real-world example: a recording of Ted Cruz mixed with WHAM! noise at -2dB.

The results are from Sepformer (NASS) trained on Libri2Mix.

Mixture	Speaker 1	Speaker 2	Noise
Download	Download	Download	Download
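
As a rough illustration of what mixing at a given level involves, the sketch below scales a noise signal so that the speech-to-noise power ratio reaches a target value in dB. It assumes that the -2dB figure above is a signal-to-noise ratio measured as a plain power ratio; a loudness-based measure would give a different gain. The function names are hypothetical, and this is not the script used to create the example.

```python
import math


def mean_power(signal):
    """Average power of a signal given as a list of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise to reach snr_db, then add it to the speech.

    Assumes both signals have the same length. Returns the mixture and
    the scaled noise.
    """
    # Solve speech_power / (gain^2 * noise_power) = 10^(snr_db / 10) for gain.
    ratio = 10.0 ** (snr_db / 10.0)
    gain = math.sqrt(mean_power(speech) / (mean_power(noise) * ratio))
    scaled_noise = [gain * x for x in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixture, scaled_noise = mix_at_snr(speech, noise, snr_db=-2.0)
achieved = 10.0 * math.log10(mean_power(speech) / mean_power(scaled_noise))
print(round(achieved, 2))  # prints -2.0
```

A negative value means the noise carries more power than the speech, which makes the separation task harder.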

Run NASS Method

Step1: Prepare the datasets. Please refer to the LibriMix repository.

Step2: Modify the configurations. Configuration files are stored in NASS/speechbrain/recipes/LibriMix/separation/hparams/

Step3: Run the NASS method.

cd NASS/speechbrain/recipes/LibriMix/separation/
python train.py hparams/sepformer-libri2mix.yaml --data_folder /yourpath/Libri2Mix/

We also provide a YAML file for custom data. Make sure your custom folder follows the structure the recipe expects, then run:

python train.py hparams/sepformer-libri2mix-custom.yaml --data_folder /yourpath/custom/

Pretrained Model

We provide a pretrained model on GitHub Releases.

To use it, download "results.zip" and unzip it into NASS/speechbrain/recipes/LibriMix/separation/

Then run the NASS method as described in Step3 above.

Cite Our Paper

Please cite our paper and star our repository.

@inproceedings{zhang2024noise,
  title={Noise-Aware Speech Separation with Contrastive Learning},
  author={Zhang, Zizheng and Chen, Chen and Chen, Hsin-Hung and Liu, Xiang and Hu, Yuchen and Chng, Eng Siong},
  booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1381--1385},
  year={2024},
  organization={IEEE}
}