This repository provides code for the Waveformer architecture proposed in the paper. Waveformer is a low-latency target sound extraction model implementing streaming inference: the model processes a ~10 ms input audio chunk at each time step, looking only at past chunks and never at future chunks. On a Core i5 CPU using a single thread, real-time factors (RTFs) of different model configurations range from 0.66 to 0.94, with an end-to-end latency of less than 20 ms.
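To make these numbers concrete: a real-time factor is commonly defined as processing time divided by the duration of the audio processed, so an RTF below 1 means a chunk is finished before the next one arrives. The sketch below is illustrative arithmetic only. The function names are hypothetical, the chunk length is taken as exactly 10 ms, and the latency estimate assumes that end-to-end latency is roughly one chunk of buffering plus the per-chunk processing time, which is a simplification rather than the measurement procedure used in the paper.

```python
def processing_time_ms(rtf: float, chunk_ms: float = 10.0) -> float:
    """Per-chunk compute time implied by a real-time factor (RTF)."""
    return rtf * chunk_ms


def estimated_latency_ms(rtf: float, chunk_ms: float = 10.0) -> float:
    """Assumed latency model: buffer one chunk, then process it."""
    return chunk_ms + processing_time_ms(rtf, chunk_ms)


# RTF range quoted above for the different model configurations.
for rtf in (0.66, 0.94):
    print(
        f"RTF {rtf}: ~{processing_time_ms(rtf):.1f} ms compute per chunk, "
        f"~{estimated_latency_ms(rtf):.1f} ms estimated latency"
    )
```

Under this assumed model, any RTF below 1 with a 10 ms chunk keeps the estimate under 20 ms, which is consistent with the figure quoted above.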
# Commands in all sections except the Dataset section are run from the repo's top-level directory
conda create --name waveformer python=3.8
conda activate waveformer
pip install -r requirements.txt
You can run the model on your own audio files using the Waveformer.py script. The example commands below use the sample audio mixture provided at data/Sample.wav. When run for the first time, the script downloads the default configuration file and checkpoint to the current directory.
# Usage: python Waveformer.py [-h] [--targets TARGETS [TARGETS ...]] input output
# Single-target extraction
python Waveformer.py data/Sample.wav output_typing.wav --targets Computer_keyboard
# Multi-target extraction
python Waveformer.py data/Sample.wav output_bark_cough.wav --targets Bark Cough
The list of all possible targets can be displayed using:
python Waveformer.py -h
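For batch use, the same command line can be assembled from Python. The following is a minimal sketch, assuming it is run from the repo's top-level directory. `build_command` and `extract` are hypothetical helper names, not part of this repository, and only the arguments shown in the usage string above are used.

```python
import subprocess
import sys
from typing import List, Sequence


def build_command(input_wav: str, output_wav: str,
                  targets: Sequence[str] = ()) -> List[str]:
    """Assemble the Waveformer.py command line from the usage string."""
    cmd = [sys.executable, "Waveformer.py", input_wav, output_wav]
    if targets:
        # --targets is optional in the usage string and takes one or more labels.
        cmd += ["--targets", *targets]
    return cmd


def extract(input_wav: str, output_wav: str,
            targets: Sequence[str] = ()) -> None:
    """Run the script as a subprocess; raises CalledProcessError on failure."""
    subprocess.run(build_command(input_wav, output_wav, targets), check=True)


# Show the command for the multi-target example without running it.
print(" ".join(build_command("data/Sample.wav", "output_bark_cough.wav",
                             ["Bark", "Cough"])[1:]))
```

Calling `extract("data/Sample.wav", "output_typing.wav", ["Computer_keyboard"])` would then run the single-target example.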
We use the Scaper toolkit to synthetically generate audio mixtures. Each audio mixture is generated on-the-fly, during training or evaluation, using Scaper's generate_from_jams function on a .jams specification file. We provide (in step 3 below) .jams specification files for all training, validation and evaluation samples used in our experiments. The .jams specifications are generated using the FSDKaggle2018 and TAU Urban Acoustic Scenes 2019 datasets as sources for foreground and background sounds, respectively. Steps to create the dataset:
- Go to the data directory:
  cd data
- Download the FSDKaggle2018 dataset, the TAU Urban Acoustic Scenes 2019, Development dataset, and the TAU Urban Acoustic Scenes 2019, Evaluation dataset using the data/download.py script:
  python download.py
- Download and uncompress the FSDSoundScapes dataset:
  wget https://targetsound.cs.washington.edu/files/FSDSoundScapes.zip
  unzip FSDSoundScapes.zip
  This step creates the data/FSDSoundScapes directory. FSDSoundScapes contains the .jams specifications for the training, validation and test samples used in the paper. The training and evaluation pipelines expect source samples (samples in the FSDKaggle2018 and TAU Urban Acoustic Scenes 2019 datasets) at specific locations relative to FSDSoundScapes. The following steps move the source samples to the appropriate locations.
- Uncompress the FSDKaggle2018 dataset and create the Scaper source:
  unzip FSDKaggle2018/\*.zip -d FSDKaggle2018
  python fsd_scaper_source_gen.py FSDKaggle2018 ./FSDSoundScapes/FSDKaggle2018 ./FSDSoundScapes/FSDKaggle2018
- Uncompress the TAU Urban Acoustic Scenes 2019 dataset to the FSDSoundScapes directory:
  unzip TAU-acoustic-sounds/\*.zip -d FSDSoundScapes/TAU-acoustic-sounds/
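Once the steps above have been run, a quick sanity check can confirm that the directories they name exist before starting training. This is a minimal sketch, assuming it is run from the data directory. `missing_dirs` is a hypothetical helper, and it only checks the directory names that appear in the commands above, not their contents.

```python
from pathlib import Path

# Directory names taken from the dataset steps, relative to the data directory.
EXPECTED_DIRS = (
    "FSDSoundScapes",
    "FSDSoundScapes/FSDKaggle2018",
    "FSDSoundScapes/TAU-acoustic-sounds",
)


def missing_dirs(data_dir: str = ".") -> list:
    """Return the expected directories that do not exist under data_dir."""
    root = Path(data_dir)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


missing = missing_dirs(".")
print("missing:", ", ".join(missing) if missing else "none")
```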
Run the training script:
python -W ignore -m src.training.train experiments/<Experiment dir with config.json> --use_cuda
Pretrained checkpoints are available at experiments.zip. These can be downloaded and uncompressed to appropriate locations using:
wget https://targetsound.cs.washington.edu/files/experiments.zip
unzip -o experiments.zip -d experiments
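The training and evaluation commands take an experiment directory that contains a config.json. After unpacking, candidate directories can be listed with a short script. This is a sketch under the assumption that each experiment lives in its own directory under experiments and holds a config.json directly. The internal layout of experiments.zip is not described here, and `experiment_dirs` is a hypothetical helper.

```python
from pathlib import Path


def experiment_dirs(root: str = "experiments") -> list:
    """Return sorted directories under root that directly contain config.json."""
    root_path = Path(root)
    if not root_path.is_dir():
        return []  # nothing unpacked yet
    return sorted(p.parent for p in root_path.rglob("config.json"))


for exp_dir in experiment_dirs():
    print(exp_dir)
```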
Run the evaluation script:
python -W ignore -m src.training.eval experiments/<Experiment dir with config.json and checkpoints> --use_cuda
During sample generation, when the amplitude of a mixture sums to more than 1, peak normalization is used to renormalize the mixture. This results in many Scaper warnings during training and evaluation. The -W ignore flag is used for cleaner console output.
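As an illustration of the peak normalization mentioned above, the sketch below rescales a mixture only when its largest absolute sample exceeds 1. It is written in plain Python for clarity and is not Scaper's implementation, which may differ in detail. `peak_normalize` is a hypothetical name.

```python
def peak_normalize(samples):
    """Scale samples so the largest absolute value is at most 1.0."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak <= 1.0:
        return list(samples)  # already within range, leave unchanged
    return [s / peak for s in samples]


# Summing sources pushed one sample to -2.0, so the mixture is divided by 2.
mixture = [0.5, -2.0, 1.0]
print(peak_normalize(mixture))  # [0.25, -1.0, 0.5]
```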