Real-Time Target Sound Extraction

This repository provides code for the Waveformer architecture proposed in the paper. Waveformer is a low-latency target sound extraction model implementing streaming inference -- the model processes a ~10 ms input audio chunk at each time step, while only looking at past chunks and no future chunks. On a Core i5 CPU using a single thread, real-time factors (RTFs) of different model configurations range from 0.66 to 0.94, with an end-to-end latency of less than 20 ms.
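To make the streaming and RTF terminology concrete, here is a minimal sketch of chunk-by-chunk causal processing with an RTF measurement. It is not the Waveformer model: `process_chunk` is a hypothetical stand-in that passes audio through unchanged, and the 44.1 kHz sample rate is an assumption for illustration only.

```python
import time

SAMPLE_RATE = 44100  # assumed sample rate, not taken from the repository
CHUNK_MS = 10        # approximate chunk duration from the description above


def split_into_chunks(samples, chunk_size):
    """Yield consecutive, non-overlapping chunks of `chunk_size` samples."""
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


def process_chunk(chunk, state):
    """Hypothetical stand-in for the model: returns the chunk unchanged.

    `state` carries information from past chunks only, which is what
    makes the processing causal (no look-ahead at future chunks).
    """
    state["chunks_seen"] = state.get("chunks_seen", 0) + 1
    return list(chunk), state


def stream(samples, sample_rate=SAMPLE_RATE, chunk_ms=CHUNK_MS):
    """Process non-empty `samples` chunk by chunk; return (output, RTF)."""
    chunk_size = sample_rate * chunk_ms // 1000
    state, output = {}, []
    started = time.perf_counter()
    for chunk in split_into_chunks(samples, chunk_size):
        processed, state = process_chunk(chunk, state)
        output.extend(processed)
    elapsed = time.perf_counter() - started
    audio_seconds = len(samples) / sample_rate
    # RTF = processing time / audio duration; below 1 is faster than real time.
    return output, elapsed / audio_seconds
```

An RTF below 1 means each chunk is processed in less time than it takes to record, which is the condition for keeping up with a live audio stream.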


Setup

# Commands in all sections except the Dataset section are run from the repo's top-level directory
conda create --name waveformer python=3.8
conda activate waveformer
pip install -r requirements.txt

Bring Your Own Audio

You can run the model on your own audio files using the Waveformer.py script. The example commands below use the sample audio mixture provided at data/Sample.wav. When run for the first time, the script downloads the default configuration file and checkpoint to the current directory.

# Usage: python Waveformer.py [-h] [--targets TARGETS [TARGETS ...]] input output

# Single-target extraction
python Waveformer.py data/Sample.wav output_typing.wav --targets Computer_keyboard

# Multi-target extraction
python Waveformer.py data/Sample.wav output_bark_cough.wav --targets Bark Cough

The list of all possible targets can be found using:

python Waveformer.py -h
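If you want to run several extractions on the same input, the command line above can be driven from Python. The following is a sketch that relies only on the usage string shown earlier; the helper names (`build_command`, `extract`, `run_jobs`) are hypothetical and not part of the repository.

```python
import subprocess
import sys


def build_command(input_wav, output_wav, targets):
    """Build a Waveformer.py command following the usage string above."""
    command = [sys.executable, "Waveformer.py", input_wav, output_wav]
    if targets:
        # --targets is optional in the usage string, so add it only when given.
        command += ["--targets", *targets]
    return command


def extract(input_wav, output_wav, targets):
    """Run one extraction; raises CalledProcessError if the script fails."""
    subprocess.run(build_command(input_wav, output_wav, targets), check=True)


def run_jobs(input_wav, jobs):
    """Run one extraction per (output file, target list) pair in `jobs`."""
    for output_wav, targets in jobs.items():
        extract(input_wav, output_wav, targets)
```

For example, `run_jobs("data/Sample.wav", {"output_typing.wav": ["Computer_keyboard"], "output_bark_cough.wav": ["Bark", "Cough"]})` would reproduce the two example commands, assuming it is called from the repo's top-level directory.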

Training and Evaluation

Dataset

We use the Scaper toolkit to synthetically generate audio mixtures. Each audio mixture is generated on the fly, during training or evaluation, using Scaper's generate_from_jams function on a .jams specification file. We provide (in the FSDSoundScapes download step below) .jams specification files for all training, validation and evaluation samples used in our experiments. The .jams specifications are generated using the FSDKaggle2018 and TAU Urban Acoustic Scenes 2019 datasets as sources for foreground and background sounds, respectively. Steps to create the dataset:

Go to the data directory:
```
 cd data
```
Download the FSDKaggle2018 dataset and the TAU Urban Acoustic Scenes 2019 Development and Evaluation datasets using the data/download.py script:
```
 python download.py
```
Download and uncompress the FSDSoundScapes dataset:
```
 wget https://targetsound.cs.washington.edu/files/FSDSoundScapes.zip
 unzip FSDSoundScapes.zip
```
This step creates the data/FSDSoundScapes directory, which contains the .jams specifications for the training, validation and test samples used in the paper. The training and evaluation pipelines expect the source samples (samples in the FSDKaggle2018 and TAU Urban Acoustic Scenes 2019 datasets) at specific locations relative to FSDSoundScapes. The following steps move the source samples to the appropriate locations.

Uncompress the FSDKaggle2018 dataset and create the Scaper source:

```
 unzip FSDKaggle2018/\*.zip -d FSDKaggle2018
 python fsd_scaper_source_gen.py FSDKaggle2018 ./FSDSoundScapes/FSDKaggle2018 ./FSDSoundScapes/FSDKaggle2018
```

Uncompress the TAU Urban Acoustic Scenes 2019 dataset to the FSDSoundScapes directory:
```
 unzip TAU-acoustic-sounds/\*.zip -d FSDSoundScapes/TAU-acoustic-sounds/
```
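Once the steps above are done, a quick sanity check can confirm that the directories they mention exist. You can also render a single mixture with Scaper's generate_from_jams to see what the training pipeline does on the fly. The sketch below is an illustration, not the repository's data loader. The helper names are hypothetical, and the foreground and background source folders are left as arguments because their exact locations inside FSDSoundScapes are not stated here.

```python
from pathlib import Path

# Directories named in the steps above, relative to the data directory.
EXPECTED_DIRS = (
    "FSDSoundScapes",
    "FSDSoundScapes/FSDKaggle2018",
    "FSDSoundScapes/TAU-acoustic-sounds",
)


def missing_dirs(data_dir="."):
    """Return the expected directories that do not exist under `data_dir`."""
    root = Path(data_dir)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


def render_mixture(jams_file, fg_path, bg_path):
    """Render one mixture from a .jams specification with Scaper.

    `fg_path` and `bg_path` are the foreground and background source
    folders (assumed to be supplied by the caller). Passing
    audio_outfile=None assumes a Scaper version that can return audio
    without writing a file; the structure of the return value depends
    on the installed Scaper version.
    """
    import scaper  # imported lazily so missing_dirs works without Scaper

    return scaper.generate_from_jams(
        str(jams_file),
        audio_outfile=None,
        fg_path=str(fg_path),
        bg_path=str(bg_path),
    )
```

Run from the data directory, `missing_dirs()` returning an empty list suggests the layout matches the steps above; it does not verify the contents of each directory.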

Training

python -W ignore -m src.training.train experiments/<Experiment dir with config.json> --use_cuda

Evaluation

Pretrained checkpoints are available at experiments.zip. These can be downloaded and uncompressed to appropriate locations using:

wget https://targetsound.cs.washington.edu/files/experiments.zip
unzip -o experiments.zip -d experiments

Run the evaluation script:

python -W ignore -m src.training.eval experiments/<Experiment dir with config.json and checkpoints> --use_cuda
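To see which directories can be passed to the training and evaluation commands, one option is to search for config.json files under experiments. This is a small sketch with hypothetical helper names; it assumes only what the commands above state, namely that each experiment directory contains a config.json.

```python
from pathlib import Path


def find_experiments(root="experiments"):
    """Return directories under `root` that contain a config.json file."""
    return sorted(p.parent for p in Path(root).rglob("config.json"))


def eval_command(experiment_dir, use_cuda=True):
    """Build the evaluation command shown above for one experiment directory."""
    command = ["python", "-W", "ignore", "-m", "src.training.eval",
               str(experiment_dir)]
    if use_cuda:
        command.append("--use_cuda")
    return command
```

Each returned command list can be passed to `subprocess.run` from the repo's top-level directory, or joined with spaces and pasted into a shell.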

Note

During sample generation, when the amplitudes in a mixture sum to greater than 1, peak normalization is used to renormalize the mixture. This results in many Scaper warnings during training and evaluation. The -W ignore flag is used for cleaner console output.
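For readers unfamiliar with peak normalization, the idea can be sketched in a few lines. This is an illustration of the general technique on a plain list of samples, not Scaper's implementation. The target peak of 1.0 is an assumption, and in practice the samples would typically be a NumPy array.

```python
def peak_normalize(mixture):
    """Scale `mixture` so its largest absolute sample is 1.0.

    Mixtures whose peak is already at or below 1.0 are returned unchanged,
    so only mixtures that would clip are rescaled.
    """
    peak = max((abs(x) for x in mixture), default=0.0)
    if peak <= 1.0:
        return list(mixture)
    return [x / peak for x in mixture]
```

Because every sample is divided by the same factor, the relative levels of the foreground and background sounds in the mixture are preserved; only the overall level changes.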
