Bot Into The Fediverse Dataset

This dataset contains anonymized features for bot detection on Mastodon (Fediverse). It was created for the accompanying paper and consists of accounts labeled as bot or non-bot, collected from publicly accessible content via the Mastodon Application Programming Interface (API) during January–February 2025.

To reduce privacy risks and facilitate reuse, the dataset does not include raw usernames, user IDs, or raw text. Instead, we provide (i) engineered account/profile and activity features (e.g., follower/following counts and posting statistics), and (ii) text representations derived from public content. Specifically, the account profile description (“note”) was converted into fixed-length embeddings using bert-base-multilingual-cased. In addition, post-level textual information was converted into embeddings (see twets_emb), enabling downstream modeling without access to the original text.

The dataset is intended for research on bot detection, feature engineering, and multilingual representation learning on decentralized social networks, and supports reproducibility of experiments reported in the paper.

Data collection and processing

Source platform: Mastodon (public content only).
Collection period: January–February 2025.
Access method: Platform API.
Anonymization: Removal of direct identifiers (e.g., usernames and raw profile text). Only derived numeric features and embeddings are shared.
Text embeddings: bert-base-multilingual-cased applied to the profile description (“note”); post embeddings provided as twets_emb.

Intended use

Supervised bot detection and benchmarking on Mastodon-derived features.
Feature importance/ablation studies on profile and behavioral signals.
Experiments using multilingual text embeddings without releasing raw text.

Limitations and notes

Labels reflect the definition and labeling procedure described in the accompanying paper and may contain noise or bias.
The dataset contains derived representations, so it may not support tasks that require raw text (e.g., linguistic audits, toxicity annotation, qualitative analyses).
Some features (e.g., averages over interactions) may depend on the observation window and API availability at collection time.

Column dictionary

Below are the dataset columns included in each row (one row per account):

Username-based (derived, no raw username shared)
username_length: Length of the (anonymized) username string.
username_num_digits: Count of numeric characters in username.
username_num_letters: Count of alphabetic characters in username.
username_num_special: Count of non-alphanumeric characters in username.
username_starts_with_digit: Binary indicator (1 if username starts with a digit).
username_ends_with_digit: Binary indicator (1 if username ends with a digit).
fuzzy_score: Fuzzy string similarity score between username and screen name computed during preprocessing (as defined in the paper/processing scripts).

Network / account metadata
followers_count: Number of followers at collection time.
following_count: Number of accounts followed at collection time.
statuses_count: Total number of statuses/posts at collection time.
days: Account age or days since creation/first observed.

Activity and interaction aggregates (computed over last 40 collected posts in the observation window)
avg_reply_count: Average replies per post.
avg_retweet_count: Average boosts/reblogs per post.
avg_favorite_count: Average favorites/likes per post.
avg_num_tags: Average number of hashtags per post.
avg_num_urls: Average number of URLs per post.
avg_num_mentions: Average number of mentions per post.
avg_possibly_sensitive: Average fraction/indicator of sensitive content (if available/derived).

Language and text embeddings
language: Language code associated with the account/posts (when available).
note_emb: Embedding vector of the profile description (“note”) computed with bert-base-multilingual-cased.
twets_emb: Embedding vector(s) derived from the account’s posts (average embedding over recent posts).

Label
bot: Binary label (1 = bot, 0 = non-bot).

Citation

If you use this dataset, please cite:
The dataset DOI (10.5281/zenodo.17987595)
The accompanying paper (DOI 10.1007/s13278-025-01567-z)
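
Example: assembling a feature vector from one row

The column dictionary can be turned into a flat numeric feature vector for supervised bot detection. The following is a minimal sketch, not the authors' pipeline. It assumes each row is available as a mapping from column name to value (for instance after reading the released files with a CSV or JSON reader), and that note_emb and twets_emb arrive either as a sequence of numbers or as a JSON-style string; the actual storage format is not stated here and may differ. The helper names parse_embedding and row_to_example, the three-dimensional vectors, and the demo values are hypothetical and purely illustrative. The categorical language column is left out because it would need its own encoding.

```python
import json
from typing import Any, Mapping

# Scalar columns taken from the column dictionary.
SCALAR_COLUMNS = [
    "username_length", "username_num_digits", "username_num_letters",
    "username_num_special", "username_starts_with_digit",
    "username_ends_with_digit", "fuzzy_score",
    "followers_count", "following_count", "statuses_count", "days",
    "avg_reply_count", "avg_retweet_count", "avg_favorite_count",
    "avg_num_tags", "avg_num_urls", "avg_num_mentions",
    "avg_possibly_sensitive",
]
EMBEDDING_COLUMNS = ["note_emb", "twets_emb"]
LABEL_COLUMN = "bot"


def parse_embedding(value: Any) -> list[float]:
    """Return an embedding as a list of floats.

    Assumption: the vector is either a sequence of numbers or a
    JSON-style string such as "[0.1, 0.2]".
    """
    if isinstance(value, str):
        value = json.loads(value)
    return [float(x) for x in value]


def row_to_example(row: Mapping[str, Any]) -> tuple[list[float], int]:
    """Concatenate scalar features and both embeddings; return (features, label)."""
    features = [float(row[name]) for name in SCALAR_COLUMNS]
    for name in EMBEDDING_COLUMNS:
        features.extend(parse_embedding(row[name]))
    label = int(float(row[LABEL_COLUMN]))  # 1 = bot, 0 = non-bot
    return features, label


# Synthetic demo row (values are made up, embeddings shortened to 3 dimensions).
demo_row: dict[str, Any] = {name: 0.0 for name in SCALAR_COLUMNS}
demo_row.update({
    "followers_count": 120,
    "note_emb": "[0.1, 0.2, 0.3]",   # string-encoded vector
    "twets_emb": [0.4, 0.5, 0.6],    # already a list
    "bot": 1,
})

features, label = row_to_example(demo_row)
print(len(features), label)
```

Because the activity aggregates depend on the observation window, it may be worth scaling or log-transforming the count features before combining them with the embeddings; that step is omitted here.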

Related Organizations

Universidad Politécnica de Madrid
Spain

Keywords

Fediverse, Bot Detection, Social Bots, Mastodon

Funded by

EC | AI-CODE