The TV Speech and Music (TVSM) dataset contains speech and music activity labels for a variety of TV shows, along with audio features extracted from the corresponding professionally produced, high-quality audio. The dataset aims to facilitate research on speech and music detection tasks.
- The dataset can be downloaded via Zenodo.org.
- The paper can be downloaded via EURASIP open access.
- This repo contains the materials and codebase to reproduce the baseline experiments in the paper.
@article{Hung2022,
title={{A Large TV Dataset for Speech and Music Activity Detection}},
author={Hung, Yun-Ning and Wu, Chih-Wei and Orife, Iroro and Hipple, Aaron and Wolcott, William and Lerch, Alexander},
journal={EURASIP Journal on Audio, Speech, and Music Processing},
volume={2022},
number={1},
pages={21},
year={2022},
publisher={Springer}
}
The TVSM dataset is licensed under the Apache License 2.0.
The downloaded dataset has the following structure:
└─── README.txt
└─── TVSM-cuesheet/
│ └─── labels/
│ └─── mel_features/
│ └─── mfcc/
│ └─── vgg_features/
│ └─── TVSM-xxxx_metadata.csv
└─── TVSM-pseudo/
└─── TVSM-test/
- README.txt: basic information about the dataset
- TVSM-cuesheet/: the smaller training subset. Its labels are derived from cue sheet information
- TVSM-pseudo/: the larger training subset. Its labels are generated by a pre-trained model that was trained on TVSM-cuesheet
- TVSM-test/: the test subset. Its labels are created by human annotators
Each subset folder has the same structure:
- labels/: speech and music activity labels for each sample. Each row of a CSV file contains a start time, an end time, and a class label ("s" for speech, "m" for music); see the loading sketch after this list
- mel_features/: the Mel spectrogram features extracted from the audio of each sample
- mfcc/: the MFCC features extracted from the audio of each sample
- vgg_features/: the VGGish features extracted from the audio of each sample
- TVSM-xxxx_metadata.csv: the metadata of each sample
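To illustrate how these files fit together, here is a minimal loading sketch. The sample file names, the assumption that label CSVs have no header row, the .npy feature format, and the 0.1 s frame resolution are all assumptions made for illustration; please verify them against the downloaded files and the paper.

import numpy as np
import pandas as pd

# Hypothetical sample ID for illustration; real IDs come from the metadata CSV.
label_file = "TVSM-cuesheet/labels/sample_0001.csv"

# Assuming each row is "start time, end time, class" with no header,
# where class is "s" (speech) or "m" (music).
segments = pd.read_csv(label_file, header=None, names=["start", "end", "label"])

# Convert the segment annotations to a frame-level activation matrix.
hop = 0.1  # assumed frame resolution in seconds
n_frames = int(np.ceil(segments["end"].max() / hop))
activations = np.zeros((n_frames, 2))  # columns: [speech, music]
for _, seg in segments.iterrows():
    lo = int(seg["start"] / hop)
    hi = int(np.ceil(seg["end"] / hop))
    activations[lo:hi, 0 if seg["label"] == "s" else 1] = 1.0

# Assuming the pre-extracted features are stored as NumPy arrays (.npy).
mel = np.load("TVSM-cuesheet/mel_features/sample_0001.npy")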
For more information, please refer to our paper.
Interested in running inference on existing samples? See predictor.py for usage:
cd training_code
python3 predictor.py --audio_path test.wav
Please install Git LFS first, then run git lfs pull to restore the checkpoints.
If you are using a newer pytorch_lightning version, please replace line 31 in SM_detector.py with self.save_hyperparameters(hparams).
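For context, newer pytorch_lightning releases no longer allow assigning hyperparameters directly to self.hparams; the exact original line is an assumption here, but the change looks like this:

# SM_detector.py, line 31 (inside the model's __init__)
# Older pytorch_lightning API (assumed to be the original code):
#   self.hparams = hparams
# Newer versions reject direct assignment to self.hparams, so call:
self.save_hyperparameters(hparams)

This repo has the following structure: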
└─── Evaluation_Output/
│ └─── AVASpeech/
│ │ └─── T2
│ │ └─── TVSM-cuesheet
│ │ └─── TVSM-pseudo
│ └─── ...
└─── Models/
└─── training_code/
- Evaluation_Output: the output generated by three models across five evaluation sets
- T2: baseline method
- TVSM-cuesheet: CRNN-P-Cue method
- TVSM-pseudo: CRNN-P-Pseu method
- Models: the pre-trained checkpoints of the CRNN-P-Cue and CRNN-P-Pseu methods
- training_code: code for training the model
If you encounter the error "batch response: This repository is over its data quota. Account responsible for LFS...", you can download the model checkpoints from Google Drive instead.
Please feel free to contact yhung33@gatech.edu or open an issue here if you have any questions about the dataset or the support code.