ZENODO
Dataset . 2025
License: CC BY
Data sources: ZENODO

A Benchmark Dataset for Multilingual Tokenization Energy and Efficiency Across 23 Models and 325 Languages

Author: Quesada Granja, Carlos


Abstract

This repository contains a benchmark dataset and processing scripts for analyzing the energy consumption and processing efficiency of 23 Hugging Face tokenizers applied to 8,212 standardized text chunks across 325 languages. The dataset includes raw and normalized energy measurements, tokenization times, token counts, and rich structural metadata for each chunk, including script composition, entropy, compression ratios, and character-level features. All experiments were conducted in a controlled environment using PyRAPL on a Linux workstation, with baseline CPU consumption subtracted via linear interpolation. The resulting data enables fine-grained comparison of tokenizers across linguistic and script diversity, and supports downstream tasks such as energy-aware NLP, script-sensitive modeling, and benchmarking of future tokenization methods. Accompanying R scripts reproduce the data processing, regression models, and clustering analyses; visualization outputs and cluster-level summaries are also included. All contents are organized into clearly structured folders to support easy access, interpretability, and reuse by the research community:

01_processing_scripts/: Scripts to transform raw data, subtract baseline energy, and produce clean metrics.
  - multimodel_tokenization_energy.py: Python script used to tokenize all chunks with the 23 models while logging energy and time.
  - adapting_original_dataset.R: Reads raw logs and metadata, computes net energy, and outputs cleaned files.
  - energy_patterns.R: Performs clustering, regression, and t-SNE, and generates all visualizations.

02_raw_data/: Raw output from the tokenization experiment and the baseline profiler.
  - all_models_tokenization.csv: Full log of 42M+ tokenization runs (23 models × 225 reps × 8,212 chunks).
  - baseline.csv: Background CPU energy samples, one per 50 chunks, used for normalization.

03_clean_data/: Cleaned, enriched, and reshaped datasets ready for analysis.
  - net_energy.csv: Raw tokenization results after baseline energy subtraction (per run).
  - tokenization_long.csv: One row per chunk × tokenizer, with medians and token counts.
  - tokenization_wide.csv: Wide-format matrix with one row per chunk and one column per tokenizer × metric.
  - complete.csv: Fully enriched dataset joining all metrics, metadata, and script distributions.
  - metadata.csv: Structural features and script-based character statistics per chunk.

04_cluster_outputs/: Outputs from clustering and dimensionality reduction over tokenizer energy profiles.
  - tokenizer_dendrogram.pdf: Hierarchical clustering of the 23 tokenizers based on their energy profiles.
  - tokenizer_tsne.pdf: t-SNE projection of tokenizers grouped by energy usage.
  - mean_energy_per_cluster.csv: Mean energy consumption (mJ) per language × tokenizer cluster.
  - sd_energy_per_cluster.csv: Standard deviation of energy consumption (mJ) per language × cluster.
  - grid.pdf: Heatmap of script-wise energy deltas (relative to Latin) for all tokenizers.
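The baseline normalization described above (background CPU energy sampled once per 50 chunks, then subtracted from raw readings via linear interpolation) can be sketched roughly as follows. This is an illustrative sketch only; the function and variable names are hypothetical and not taken from the repository's scripts, which implement this step in R.

```python
def net_energy(raw_mj, baseline_mj, baseline_every=50):
    """Subtract a linearly interpolated baseline from raw energy readings (mJ).

    raw_mj      : per-chunk raw energy measurements, one per chunk
    baseline_mj : baseline samples taken every `baseline_every` chunks
    """
    if len(baseline_mj) < 2:
        # with a single baseline sample there is nothing to interpolate
        return [r - baseline_mj[0] for r in raw_mj]
    net = []
    for i, raw in enumerate(raw_mj):
        # index of the baseline sample at or before chunk i, clamped so the
        # final segment is extended linearly past the last sample
        j = min(i // baseline_every, len(baseline_mj) - 2)
        # fractional position of chunk i between samples j and j + 1
        t = (i - j * baseline_every) / baseline_every
        base = baseline_mj[j] * (1 - t) + baseline_mj[j + 1] * t
        net.append(raw - base)
    return net
```

With baseline.csv loaded as the list of background samples and the per-run readings from all_models_tokenization.csv as input, this kind of subtraction yields net per-run energy values analogous to those in net_energy.csv.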

Keywords

multilingual tokenization, benchmark dataset, Hugging Face, tokenizer comparison, PyRAPL, computational cost, energy efficiency
