ZENODO
Dataset . 2024
License: CC BY NC SA
Data sources: Datacite

iDRAMA-rumble-2024: A Dataset of Podcasts from Rumble Spanning 2020 to 2022

Authors: Balci, Utkucan; Patel, Jay; Balci, Berkan; Blackburn, Jeremy

Abstract

Rumble has emerged as a prominent platform hosting controversial figures facing restrictions on YouTube. Despite this, the academic community's engagement with Rumble has been minimal. To help researchers address this gap, we introduce a comprehensive dataset of about 6.7K podcast videos from August 2020 to December 2022, amounting to over 5.6K hours of content. Besides covering metadata of these podcast videos, we provide speech-to-text transcriptions for future analysis. We also provide speaker diarization information, a collection of ~250K unique representative images from podcast videos, and face embeddings of ~400K extracted faces. With the rise of the influence of podcasts and populist figures, this dataset provides a rich resource for identifying challenges in cyber social threats in a relatively underexplored space.

Rumble platform: http://rumble.com/
Link to paper: https://workshop-proceedings.icwsm.org/abstract.php?id=2024_07
License: CC BY-NC-SA 4.0

Dataset Summary

iDRAMA-rumble-2024 is a large-scale dataset of 6,735 podcast videos from Rumble, an alternative YouTube-like platform. Using state-of-the-art models, we extract information across three modalities: 1) text, 2) audio, and 3) video. We detail the methodology for extracting information from podcast videos in the paper and release a first-of-its-kind dataset including data from different modalities:

- Metadata: Details about podcast videos, e.g., channel name, video name, video description, and more.
- Text: Transcription (i.e., speech-to-text) of podcast videos.
- Audio: Speaker diarization information providing speaker detection over time for each video.
- Video: Sampled representative video frames from each video, totaling ~250K images. We also detect ~400K non-unique faces in these images and release face embeddings.

Repository links

- Zenodo: On Zenodo, we provide the dataset for all modalities in JSON format, and the representative images in compressed files.
- GitHub: The main repository of this dataset, where we provide code snippets to get started with this dataset. Link here: https://github.com/idramalab/iDRAMA-rumble-2024
- Hugging Face: On Hugging Face, we provide the dataset in `parquet` format, accessible through the Hugging Face APIs. Link here: https://hf.co/datasets/iDRAMALab/iDRAMA-rumble-2024

Dataset Info

The dataset is organized by modality: transcripts, representative images, speaker diarization, and face embeddings.

| Config | Data-points |
|---|---|
| Podcast videos | 6,735 |
| Representative images | 252,387 |
| Face embeddings | 399,333 |
| Transcripts & Speaker diarization | 6,735 |

Zenodo Dataset Files

| Info | #Files | File names |
|---|---|---|
| Metadata | 1 | iDRAMA-rumble-2024-metadata.ndjson |
| Speaker diarization | 1 | iDRAMA-rumble-2024-speaker-dirization.zip |
| Face embeddings | 1 | iDRAMA-rumble-2024-face-embeddings.ndjson |
| Representative images | 5 | iDRAMA-rumble-2024-repr-images-set1.tar.gz |
| | | iDRAMA-rumble-2024-repr-images-set2.tar.gz |
| | | iDRAMA-rumble-2024-repr-images-set3.tar.gz |
| | | iDRAMA-rumble-2024-repr-images-set4.tar.gz |
| | | iDRAMA-rumble-2024-repr-images-set5.tar.gz |
| Transcription Lite (Minimal information) | 3 | iDRAMA-rumble-2024-transcription-lite_part_1.ndjson |
| | | iDRAMA-rumble-2024-transcription-lite_part_2.ndjson |
| | | iDRAMA-rumble-2024-transcription-lite_part_3.ndjson |
| Transcription | 3 | iDRAMA-rumble-2024-transcription_part_1.ndjson |
| | | iDRAMA-rumble-2024-transcription_part_2.ndjson |
| | | iDRAMA-rumble-2024-transcription_part_3.ndjson |

Authorship

This dataset is published in the "Workshop Proceedings of the 18th International AAAI Conference on Web and Social Media", hosted in Buffalo, NY, USA.

Academic Organization: iDRAMA Lab
Authors: Utkucan Balci, Jay Patel, Berkan Balci, Jeremy Blackburn
Affiliation: Binghamton University, Middle East Technical University

Licensing

This dataset is free to use under the terms of the non-commercial CC BY-NC-SA 4.0 license.
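
The metadata, face-embedding, and transcription files on Zenodo are NDJSON, i.e., one JSON object per line, and the transcriptions are split into numbered `_part_N` files. The following is a minimal sketch of how such files could be streamed in part order without loading a whole file into memory. The helper names `part_files` and `iter_ndjson` and the record fields `video_id` and `text` are hypothetical and used for illustration only; the actual field names should be checked against the downloaded files or the GitHub repository.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, Iterator, List


def part_files(directory: Path, stem: str) -> List[Path]:
    """Return files named '<stem>_part_<n>.ndjson', sorted by numeric part."""
    def part_number(path: Path) -> int:
        # '..._part_10' -> 10, so part 10 sorts after part 2.
        return int(path.stem.rsplit("_", 1)[1])

    return sorted(directory.glob(f"{stem}_part_*.ndjson"), key=part_number)


def iter_ndjson(paths: Iterable[Path]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line, file by file."""
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # tolerate blank lines between records
                    yield json.loads(line)


# Demonstration on tiny stand-in files; the fields below are made up
# and do not reflect the real schema of the dataset.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    stem = "iDRAMA-rumble-2024-transcription"
    samples = {
        1: [{"video_id": "a", "text": "hello"}],
        2: [{"video_id": "b", "text": "world"}, {"video_id": "c", "text": "!"}],
    }
    for part, rows in samples.items():
        target = folder / f"{stem}_part_{part}.ndjson"
        target.write_text(
            "\n".join(json.dumps(row) for row in rows) + "\n", encoding="utf-8"
        )

    records = list(iter_ndjson(part_files(folder, stem)))
    print(len(records), "records read")
```

Because `iter_ndjson` is a generator, the same loop can be pointed at the full-size transcription parts and consumed record by record, which keeps memory use flat regardless of file size.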
Citation

@article{balci2024idrama,
  title   = {iDRAMA-rumble-2024: A Dataset of Podcasts from Rumble Spanning 2020 to 2022},
  author  = {Balci, Utkucan and Patel, Jay and Balci, Berkan and Blackburn, Jeremy},
  year    = {2024},
  journal = {Workshop Proceedings of the 18th International AAAI Conference on Web and Social Media}
}
