/iDRAMA-rumble-2024

A large-scale podcast dataset of Rumble social media platform.

Primary LanguageJupyter Notebook

iDRAMA-rumble-2024 Header

Dataset Summary

iDRAMA-rumble-2024 is a large-scale dataset of 6,735 podcast videos from Rumble, an alternative Youtube-like platform. Using state-of-the-art models, we extract information across three modalities: 1) text, 2) audio, and 3) video. We detail the methodology for extracting information from podcast videos in the paper and release a first-of-its-kind dataset including data from different modalities:

  • Metadata: Details about podcast videos, e.g., channel name, video name, video description, and more.

  • Text: Transcription (i.e., speech-to-text) of podcast videos.

  • Audio: Speaker diarization information providing speaker detection over time for each video.

  • Video: Sampled representative video frames from each video, totaling 200K images. We also detect more than 400K non-unique faces from these images and release face embeddings.

    Repo-links Purpose
    Zenodo On Zenodo, we provide JSON formatted dataset for all modalities and representative images in a zip file.
    Github The main repository of this dataset, where we provide code-snippets to get started with this dataset.
    Huggingface On Huggingface, we provide a dataset that can be accessed through Huggingface APIs in a parquet format.
  • Rumble platform: Rumble

  • Link to paper: CySoc 2024

  • License: CC BY-NC-SA 4.0

Quick start with Datasets

Install Datasets module by pip install datasets and then use the following code:

from datasets import load_dataset

# Download & Load complete dataset
dataset = load_dataset("iDRAMALab/iDRAMA-rumble-2024")

# Load dataset with specific config
dataset = load_dataset("iDRAMALab/iDRAMA-rumble-2024", name="transcripts")

More code-snippets to load the different variant of datasets efficiently are available on Github rpository.

Dataset Info

Dataset is organized by modalities -- transcripts, representative-images, speaker-diarization, and face-embeddings.

Config Data-points
Podcast videos 6,735
Representative images 252,387
Face embeddings 399,333
Transcripts & Speaker diarization 6,735

Version

  • Maintenance Status: Active
  • Version Details:
    • Current Version: v1.0.0
    • First Release: 06/03/2024
    • Last Update: 06/03/2024

Authorship

This dataset is published at "Workshop Proceedings of the 18th International AAAI Conference on Web and Social Media" hosted at Buffalo, NY, USA.

  • Academic Organization: iDRAMA Lab
  • Authors: Utkucan Balci, Jay Patel, Berkan Balci, Jeremy Blackburn
  • Affiliation: Binghamton University

Licensing

This dataset is available for free to use under terms of the non-commercial license CC BY-NC-SA 4.0.

Citation

@article{balci2024idrama,
  title  = {iDRAMA-rumble-2024: A Dataset of Podcasts from Rumble Spanning 2020 to 2022},
  author = {Balci, Utkucan and Patel, Jay and Balci, Berkan and Blackburn, Jeremy},
  year   = {2024},
  journal = {Workshop Proceedings of the 18th International AAAI Conference on Web and Social Media}
}