iDRAMA-rumble-2024
is a large-scale dataset of 6,735 podcast videos from Rumble, an alternative Youtube-like platform. Using state-of-the-art models, we extract information across three modalities: 1) text, 2) audio, and 3) video. We detail the methodology for extracting information from podcast videos in the paper and release a first-of-its-kind dataset including data from different modalities:
-
Metadata: Details about podcast videos, e.g., channel name, video name, video description, and more.
-
Text: Transcription (i.e., speech-to-text) of podcast videos.
-
Audio: Speaker diarization information providing speaker detection over time for each video.
-
Video: Sampled representative video frames from each video, totaling 200K images. We also detect more than 400K non-unique faces from these images and release face embeddings.
Repo-links Purpose Zenodo On Zenodo, we provide JSON formatted dataset for all modalities and representative images in a zip file. Github The main repository of this dataset, where we provide code-snippets to get started with this dataset. Huggingface On Huggingface, we provide a dataset that can be accessed through Huggingface APIs in a parquet
format. -
Rumble platform: Rumble
-
Link to paper: CySoc 2024
-
License: CC BY-NC-SA 4.0
Install Datasets
module by pip install datasets
and then use the following code:
from datasets import load_dataset
# Download & Load complete dataset
dataset = load_dataset("iDRAMALab/iDRAMA-rumble-2024")
# Load dataset with specific config
dataset = load_dataset("iDRAMALab/iDRAMA-rumble-2024", name="transcripts")
More code-snippets to load the different variant of datasets efficiently are available on Github rpository.
Dataset is organized by modalities -- transcripts, representative-images, speaker-diarization, and face-embeddings.
Config | Data-points |
---|---|
Podcast videos | 6,735 |
Representative images | 252,387 |
Face embeddings | 399,333 |
Transcripts & Speaker diarization | 6,735 |
- Maintenance Status: Active
- Version Details:
- Current Version: v1.0.0
- First Release: 06/03/2024
- Last Update: 06/03/2024
This dataset is published at "Workshop Proceedings of the 18th International AAAI Conference on Web and Social Media" hosted at Buffalo, NY, USA.
- Academic Organization: iDRAMA Lab
- Authors: Utkucan Balci, Jay Patel, Berkan Balci, Jeremy Blackburn
- Affiliation: Binghamton University
This dataset is available for free to use under terms of the non-commercial license CC BY-NC-SA 4.0.
@article{balci2024idrama,
title = {iDRAMA-rumble-2024: A Dataset of Podcasts from Rumble Spanning 2020 to 2022},
author = {Balci, Utkucan and Patel, Jay and Balci, Berkan and Blackburn, Jeremy},
year = {2024},
journal = {Workshop Proceedings of the 18th International AAAI Conference on Web and Social Media}
}