The TV Speech and Music (TVSM) dataset contains speech and music activity labels for a variety of TV shows, along with audio features extracted from the corresponding professionally produced, high-quality audio. The dataset aims to facilitate research on speech and music detection tasks.
- The dataset can be downloaded via Zenodo.org.
- The paper can be downloaded via EURASIP open access.
- This repo contains the materials and codebase to reproduce the baseline experiments in the paper.
@article{Hung2022,
title={{A Large TV Dataset for Speech and Music Activity Detection}},
author={Hung, Yun-Ning and Wu, Chih-Wei and Orife, Iroro and Hipple, Aaron and Wolcott, William and Lerch, Alexander},
journal={EURASIP Journal on Audio, Speech, and Music Processing},
volume={2022},
number={1},
pages={21},
year={2022},
publisher={Springer}
}
The TVSM dataset is licensed under the Apache License 2.0.
The downloaded dataset has the following structure:
└─── README.txt
└─── TVSM-cuesheet/
│ └─── labels/
│ └─── mel_features/
│ └─── mfcc/
│ └─── vgg_features/
│ └─── TVSM-xxxx_metadata.csv
└─── TVSM-pseudo/
└─── TVSM-test/
- README.txt: basic information about the dataset
- TVSM-cuesheet/: the smaller training subset. Its labels are derived from cue sheet information
- TVSM-pseudo/: the larger training subset. Its labels are generated by a pre-trained model that was trained on TVSM-cuesheet
- TVSM-test/: the test subset. Its labels are created by human annotators
Each subset folder has the same structure:
- labels/: speech and music activity labels for each sample. Each row of a CSV file contains a start time, an end time, and a class label ("s" for speech, "m" for music); see the loading sketch after this list
- mel_features/: the Mel spectrogram features extracted from the audio of each sample
- mfcc/: the MFCC features extracted from the audio of each sample
- vgg_features/: the VGGish features extracted from the audio of each sample
- TVSM-xxxx_metadata.csv: the metadata of each sample
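To illustrate how these files fit together, here is a minimal loading sketch. The sample file names, the assumption that label CSVs have no header row, the .npy feature format, and the 0.1 s frame resolution are all assumptions made for illustration; please verify them against the downloaded files and the paper.

import numpy as np
import pandas as pd

# Hypothetical sample ID for illustration; real IDs come from the metadata CSV.
label_file = "TVSM-cuesheet/labels/sample_0001.csv"

# Assuming each row is "start time, end time, class" with no header,
# where class is "s" (speech) or "m" (music).
segments = pd.read_csv(label_file, header=None, names=["start", "end", "label"])

# Convert the segment annotations to a frame-level activation matrix.
hop = 0.1  # assumed frame resolution in seconds
n_frames = int(np.ceil(segments["end"].max() / hop))
activations = np.zeros((n_frames, 2))  # columns: [speech, music]
for _, seg in segments.iterrows():
    lo = int(seg["start"] / hop)
    hi = int(np.ceil(seg["end"] / hop))
    activations[lo:hi, 0 if seg["label"] == "s" else 1] = 1.0

# Assuming the pre-extracted features are stored as NumPy arrays (.npy).
mel = np.load("TVSM-cuesheet/mel_features/sample_0001.npy")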
For more information, please refer to our paper.
Interested in running inference on existing samples? See predictor.py for usage:
cd training_code
python3 predictor.py --audio_path test.wav
Please install Git LFS first, then run git lfs pull to restore the checkpoints.
If you are using a newer pytorch_lightning version, please replace line 31 in SM_detector.py with self.save_hyperparameters(hparams).
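For context, newer pytorch_lightning releases no longer allow assigning hyperparameters directly to self.hparams; the exact original line is an assumption here, but the change looks like this:

# SM_detector.py, line 31 (inside the model's __init__)
# Older pytorch_lightning API (assumed to be the original code):
#   self.hparams = hparams
# Newer versions reject direct assignment to self.hparams, so call:
self.save_hyperparameters(hparams)

This repo has the following structure: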
└─── Evaluation_Output/
│ └─── AVASpeech/
│ │ └─── T2
│ │ └─── TVSM-cuesheet
│ │ └─── TVSM-pseudo
│ └─── ...
└─── Models/
└─── training_code/
- Evaluation_Output: the output generated by three models across five evaluation sets
- T2: baseline method
- TVSM-cuesheet: CRNN-P-Cue method
- TVSM-pseudo: CRNN-P-Pseu method
- Models: the pre-trained checkpoints of the CRNN-P-Cue and CRNN-P-Pseu methods
- training_code: code for training the model
If you encounter the error "batch response: This repository is over its data quota. Account responsible for LFS...", you can download the model checkpoints from Google Drive instead.
Please feel free to contact yhung33@gatech.edu or open an issue here if you have any questions about the dataset or the support code.