ZENODO
Dataset . 2025
License: CC BY
Data sources: ZENODO
ZENODO
Dataset . 2025
License: CC BY
Data sources: Datacite

Bot Into The Fediverse Dataset

Authors: MORENO GARCIA, FRANCISCO

Abstract

This dataset contains anonymized features for bot detection on Mastodon (Fediverse). It was created for the accompanying paper and consists of accounts labeled as bot or non-bot, collected from publicly accessible content via the Mastodon Application Programming Interface (API) during January–February 2025.

To reduce privacy risks and facilitate reuse, the dataset does not include raw usernames, user IDs, or raw text. Instead, we provide (i) engineered account/profile and activity features (e.g., follower/following counts and posting statistics), and (ii) text representations derived from public content. Specifically, the account profile description ("note") was converted into fixed-length embeddings using bert-base-multilingual-cased. In addition, post-level textual information was converted into embeddings (see twets_emb), enabling downstream modeling without access to the original text.

The dataset is intended for research on bot detection, feature engineering, and multilingual representation learning on decentralized social networks, and supports reproducibility of experiments reported in the paper.

Data collection and processing

  • Source platform: Mastodon (public content only).
  • Collection period: January–February 2025.
  • Access method: Platform API.
  • Anonymization: Removal of direct identifiers (e.g., usernames and raw profile text). Only derived numeric features and embeddings are shared.
  • Text embeddings: bert-base-multilingual-cased applied to the profile description ("note"); post embeddings provided as twets_emb.

Intended use

  • Supervised bot detection and benchmarking on Mastodon-derived features.
  • Feature importance/ablation studies on profile and behavioral signals.
  • Experiments using multilingual text embeddings without releasing raw text.

Limitations and notes

  • Labels reflect the definition and labeling procedure described in the accompanying paper and may contain noise or bias.
  • The dataset contains derived representations, so it may not support tasks that require raw text (e.g., linguistic audits, toxicity annotation, qualitative analyses).
  • Some features (e.g., averages over interactions) may depend on the observation window and API availability at collection time.

Column dictionary

Below are the dataset columns included in each row (one row per account):

Username-based (derived, no raw username shared)

  • username_length: Length of the (anonymized) username string.
  • username_num_digits: Count of numeric characters in username.
  • username_num_letters: Count of alphabetic characters in username.
  • username_num_special: Count of non-alphanumeric characters in username.
  • username_starts_with_digit: Binary indicator (1 if username starts with a digit).
  • username_ends_with_digit: Binary indicator (1 if username ends with a digit).
  • fuzzy_score: Fuzzy string similarity score between username and screen name computed during preprocessing (as defined in the paper/processing scripts).

Network / account metadata

  • followers_count: Number of followers at collection time.
  • following_count: Number of accounts followed at collection time.
  • statuses_count: Total number of statuses/posts at collection time.
  • days: Account age or days since creation/first observed.

Activity and interaction aggregates (computed over last 40 collected posts in the observation window)

  • avg_reply_count: Average replies per post.
  • avg_retweet_count: Average boosts/reblogs per post.
  • avg_favorite_count: Average favorites/likes per post.
  • avg_num_tags: Average number of hashtags per post.
  • avg_num_urls: Average number of URLs per post.
  • avg_num_mentions: Average number of mentions per post.
  • avg_possibly_sensitive: Average fraction/indicator of sensitive content (if available/derived).

Language and text embeddings

  • language: Language code associated with the account/posts (when available).
  • note_emb: Embedding vector of the profile description ("note") computed with bert-base-multilingual-cased.
  • twets_emb: Embedding vector(s) derived from the account's posts (average embedding over recent posts).

Label

  • bot: Binary label (1 = bot, 0 = non-bot).

Citation

If you use this dataset, please cite:

  • The dataset DOI (10.5281/zenodo.17987595)
  • The accompanying paper (DOI 10.1007/s13278-025-01567-z)
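
To make the column dictionary more concrete, the following is a minimal Python sketch of how the username-based columns and some of the activity aggregates could be derived from raw account data. It is not the authors' processing script. The function names username_features and activity_aggregates are hypothetical, the mapping from Mastodon Status fields (replies_count, reblogs_count, favourites_count, tags, mentions, sensitive) to dataset columns is an assumption, and the exact definitions used in the paper may differ. avg_num_urls and fuzzy_score are omitted because the record does not say how they were computed.

```python
from typing import Dict, List

# The column dictionary states that aggregates cover the last 40 collected posts.
WINDOW = 40


def username_features(username: str) -> Dict[str, int]:
    """Sketch of the username_* columns (assumed definitions, not the authors' code)."""
    digits = sum(ch.isdigit() for ch in username)
    letters = sum(ch.isalpha() for ch in username)
    return {
        "username_length": len(username),
        "username_num_digits": digits,
        "username_num_letters": letters,
        # Everything that is neither a digit nor a letter counts as "special" here.
        "username_num_special": len(username) - digits - letters,
        # Slicing keeps this safe for an empty string.
        "username_starts_with_digit": int(username[:1].isdigit()),
        "username_ends_with_digit": int(username[-1:].isdigit()),
    }


def activity_aggregates(posts: List[dict]) -> Dict[str, float]:
    """Average per-post statistics over the most recent WINDOW posts.

    Assumes `posts` is ordered oldest to newest and that each item carries
    Mastodon Status-style fields; this field mapping is an assumption.
    """
    recent = posts[-WINDOW:]
    columns = {
        "avg_reply_count": lambda p: p["replies_count"],
        "avg_retweet_count": lambda p: p["reblogs_count"],
        "avg_favorite_count": lambda p: p["favourites_count"],
        "avg_num_tags": lambda p: len(p["tags"]),
        "avg_num_mentions": lambda p: len(p["mentions"]),
        "avg_possibly_sensitive": lambda p: int(p["sensitive"]),
    }
    if not recent:
        # No posts observed: report zeros rather than dividing by zero.
        return {name: 0.0 for name in columns}
    return {
        name: sum(extract(p) for p in recent) / len(recent)
        for name, extract in columns.items()
    }
```

Calling username_features("bot_42") counts three letters, two digits, and one special character; activity_aggregates ignores everything older than the last 40 posts, which is why these averages depend on the observation window noted under the limitations.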

Keywords

Fediverse, Bot Detection, Social Bots, Mastodon
