ZENODO
Dataset . 2025
License: CC BY
Data sources: ZENODO

A Benchmark Dataset for Multilingual Tokenization Energy and Efficiency Across 23 Models and 325 Languages

Author: Quesada Granja, Carlos


Abstract

This repository contains a benchmark dataset and processing scripts for analyzing the energy consumption and processing efficiency of 23 Hugging Face tokenizers applied to 8,212 standardized text chunks across 325 languages. The dataset includes raw and normalized energy measurements, tokenization times, token counts, and rich structural metadata for each chunk, including script composition, entropy, compression ratios, and character-level features. All experiments were conducted in a controlled environment using PyRAPL on a Linux workstation, with baseline CPU consumption subtracted via linear interpolation. The resulting data enables fine-grained comparison of tokenizers across linguistic and script diversity, and supports downstream tasks such as energy-aware NLP, script-sensitive modeling, and benchmarking of future tokenization methods. Accompanying R scripts reproduce the data processing, regression models, and clustering analyses; visualization outputs and cluster-level summaries are also included. All contents are organized into clearly structured folders to support easy access, interpretability, and reuse by the research community:

01_processing_scripts/: Scripts to transform raw data, subtract baseline energy, and produce clean metrics.
  - multimodel_tokenization_energy.py: Python script used to tokenize all chunks with the 23 models while logging energy and time.
  - adapting_original_dataset.R: Reads raw logs and metadata, computes net energy, and outputs cleaned files.
  - energy_patterns.R: Performs clustering, regression, and t-SNE, and generates all visualizations.

02_raw_data/: Raw output from the tokenization experiment and the baseline profiler.
  - all_models_tokenization.csv: Full log of 42M+ tokenization runs (23 models × 225 reps × 8,212 chunks).
  - baseline.csv: Background CPU energy samples, one per 50 chunks, used for normalization.

03_clean_data/: Cleaned, enriched, and reshaped datasets ready for analysis.
  - net_energy.csv: Raw tokenization results after baseline energy subtraction (per run).
  - tokenization_long.csv: One row per chunk × tokenizer, with medians and token counts.
  - tokenization_wide.csv: Wide-format matrix with one row per chunk and one column per tokenizer × metric.
  - complete.csv: Fully enriched dataset joining all metrics, metadata, and script distributions.
  - metadata.csv: Structural features and script-based character statistics per chunk.

04_cluster_outputs/: Outputs from clustering and dimensionality reduction over tokenizer energy profiles.
  - tokenizer_dendrogram.pdf: Hierarchical clustering of the 23 tokenizers based on their energy profiles.
  - tokenizer_tsne.pdf: t-SNE projection of tokenizers grouped by energy usage.
  - mean_energy_per_cluster.csv: Mean energy consumption (mJ) per language × tokenizer cluster.
  - sd_energy_per_cluster.csv: Standard deviation of energy consumption (mJ) per language × cluster.
  - grid.pdf: Heatmap of script-wise energy deltas (relative to Latin) for all tokenizers.
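The baseline normalization described above (background CPU energy sampled once per 50 chunks, then subtracted from raw readings via linear interpolation) can be sketched roughly as follows. This is an illustrative sketch only; the function and variable names are hypothetical and not taken from the repository's scripts, which implement this step in R.

```python
def net_energy(raw_mj, baseline_mj, baseline_every=50):
    """Subtract a linearly interpolated baseline from raw energy readings (mJ).

    raw_mj      : per-chunk raw energy measurements, one per chunk
    baseline_mj : baseline samples taken every `baseline_every` chunks
    """
    if len(baseline_mj) < 2:
        # with a single baseline sample there is nothing to interpolate
        return [r - baseline_mj[0] for r in raw_mj]
    net = []
    for i, raw in enumerate(raw_mj):
        # index of the baseline sample at or before chunk i, clamped so the
        # final segment is extended linearly past the last sample
        j = min(i // baseline_every, len(baseline_mj) - 2)
        # fractional position of chunk i between samples j and j + 1
        t = (i - j * baseline_every) / baseline_every
        base = baseline_mj[j] * (1 - t) + baseline_mj[j + 1] * t
        net.append(raw - base)
    return net
```

With baseline.csv loaded as the list of background samples and the per-run readings from all_models_tokenization.csv as input, this kind of subtraction yields net per-run energy values analogous to those in net_energy.csv.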

Keywords

multilingual tokenization, benchmark dataset, Hugging Face, tokenizer comparison, PyRAPL, computational cost, energy efficiency
