Urban Sound Monitoring (USM) Dataset

A Dataset for Polyphonic Sound Event Tagging in Urban Sound Monitoring Scenarios

Reference

  • Jakob Abeßer: Classifying Sounds in Polyphonic Urban Sound Scenes, Proceedings of the 152nd AES Convention, 2022 (paper pre-print available as PDF)

Demo

You can find a 10-minute demo video with mel spectrogram visualizations of both the mixtures and the corresponding stems at https://www.youtube.com/watch?v=pVKB2xeBOJA.

Description

This dataset includes 24,000 five-second-long polyphonic stereo soundscapes composed of sounds taken from the FSD50k dataset.

FSD50k samples were selected such that commercial usage is allowed (see the licence file)!

The process of mixing the polyphonic soundscapes is detailed in the AES paper (see "Reference").

USM Dataset Download

💾 The USM dataset can be downloaded from https://zenodo.org/record/6413788.

Urban Sound Application Scenarios

A subset of the original FSD50k sound classes is mapped to 26 new sound classes that are potentially relevant in urban sound monitoring scenarios such as

  • 🚗 🚌 🚁 traffic monitoring
  • 🚧 construction site monitoring
  • 💣 🚨 😱 recognition of rare security-relevant events
  • 🐦 🐶 bioacoustic monitoring
  • ☔ ☁️ ⚡ weather monitoring

Sound Classes

The USM dataset includes the following 26 sound classes (a label-encoding sketch follows the list):

  • airplane, alarm, birds, bus, car, cheering, church bell, dogs, drilling, glass break, gunshot, hammer, helicopter, jackhammer, lawn mower, motorcycle, music, rain, sawing, scream, siren, speech, thunderstorm, train, truck, wind
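
For reference, here is a minimal sketch of how a binary multi-label target vector over these 26 classes could be encoded. The alphabetical class order used below is an assumption; verify it against metadata/class_labels.csv before relying on it:

```python
import numpy as np

# The 26 USM sound classes (assumed to match the order in
# metadata/class_labels.csv -- verify against that file).
USM_CLASSES = [
    "airplane", "alarm", "birds", "bus", "car", "cheering", "church bell",
    "dogs", "drilling", "glass break", "gunshot", "hammer", "helicopter",
    "jackhammer", "lawn mower", "motorcycle", "music", "rain", "sawing",
    "scream", "siren", "speech", "thunderstorm", "train", "truck", "wind",
]
CLASS_TO_IDX = {name: i for i, name in enumerate(USM_CLASSES)}

def encode_labels(active_classes):
    """Return a binary multi-label target vector of length 26."""
    target = np.zeros(len(USM_CLASSES), dtype=np.float32)
    for name in active_classes:
        target[CLASS_TO_IDX[name]] = 1.0
    return target

# Example: a soundscape containing car, speech, and birds.
print(encode_labels(["car", "speech", "birds"]))
```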

Dataset Split

The USM-SED dataset contains 22,000 soundscapes in its development set (composed of sounds from the FSD50k development set) and 2,000 soundscapes in its evaluation set (composed of sounds from the FSD50k evaluation set).

The development set is further divided into a training set (20,000 soundscapes) and a validation set (2,000 soundscapes).

USM dataset    FSD50k dataset (source)    Soundscapes
Training       Development                20,000
Validation     Development                2,000
Evaluation     Evaluation                 2,000
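
As a quick sanity check, the split sizes above can be verified from the metadata CSV files in this repository. This sketch assumes one metadata row per mixed FSD50k sample, so the number of distinct usm_id values equals the number of soundscapes per split:

```python
import pandas as pd

# Count soundscapes per split from the metadata files in this repository.
for split in ["train", "val", "eval"]:
    meta = pd.read_csv(f"metadata/usm_{split}.csv")
    print(f"{split}: {meta['usm_id'].nunique()} soundscapes")
```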

Dataset structure

  • The dataset folder includes three subfolders:
    • train (training set)
    • val (validation set)
    • eval (evaluation set)
  • For each soundscape, you can find the following files (a loading sketch follows this list):
    • one stereo audio file with the mixture, e.g. 3_mix.wav
    • multiple mono audio files with the isolated stems, e.g. 3_mix_stem_0.wav, 3_mix_stem_1.wav, etc.
    • one binary numpy file with the multi-label targets (26 classes), e.g. 3_mix_targets.npy
    • multiple binary numpy files with the single-label targets for the stems (26 classes), e.g. 3_stem_0_target.npy
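
A minimal loading sketch, assuming the file naming shown above; soundfile and numpy are one possible I/O choice, not prescribed by the dataset, and the local path is hypothetical:

```python
import glob
import numpy as np
import soundfile as sf  # pip install soundfile

def load_soundscape(folder, idx):
    """Load one USM soundscape: stereo mixture, mono stems, and targets.

    Assumes the naming scheme shown above, e.g. 3_mix.wav, 3_mix_stem_0.wav,
    and 3_mix_targets.npy inside one of the train/val/eval folders.
    """
    mix, sr = sf.read(f"{folder}/{idx}_mix.wav")          # (samples, 2) stereo
    stems = [sf.read(path)[0]
             for path in sorted(glob.glob(f"{folder}/{idx}_mix_stem_*.wav"))]
    targets = np.load(f"{folder}/{idx}_mix_targets.npy")  # binary, 26 classes
    return mix, sr, stems, targets

# Example (hypothetical local path to the extracted dataset):
mix, sr, stems, targets = load_soundscape("usm/train", 3)
print(mix.shape, sr, len(stems), targets.shape)
```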

Metadata

  • In the metadata folder of this repository, you can find:
    • class_labels.csv - all sound classes
    • usm_{train/val/eval}.csv - CSV files detailing the sample composition of all polyphonic soundscapes in the training (train), validation (val), and evaluation (eval) sets. The following columns are used (see the pandas example after this list):
      • ID
      • class_usm (USM sound class)
      • class_fsd (original sample sound label in FSD50k dataset)
      • file (original filename in FSD50k dataset)
      • licence (licence under which the original sample was published)
      • dur_sec (original sample duration in seconds)
      • part (FSD50k subset the sample came from: dev or eval)
      • init_silence_sec (duration of initial silence added to the processed sample, in seconds)
      • on_sec (sample onset in the original sample)
      • off_sec (sample offset in the original sample)
      • is_foreground (bool indicating whether the sound is a foreground or a background sound, see paper for details)
      • mix_coeff_db (mixing coefficient in dB)
      • stereo_coeff (stereo placement coefficient)
      • usm_id (ID of the USM soundscape into which the sample is mixed)
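
For example, the composition of a single soundscape can be inspected with pandas, using the column names listed above (a hedged sketch, not an official tool of this repository):

```python
import pandas as pd

meta = pd.read_csv("metadata/usm_train.csv")

# All FSD50k samples mixed into soundscape 3: class, source file, role, gain.
soundscape = meta[meta["usm_id"] == 3]
print(soundscape[["class_usm", "class_fsd", "file",
                  "is_foreground", "mix_coeff_db"]])

# Overall class distribution across the training set.
print(meta["class_usm"].value_counts())
```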

Acknowledgement

This work has been supported by the German Research Foundation (AB 675/2-2).