Powered by OpenAIRE graph
ZENODO
Dataset · 2025
License: CC BY-SA
Data sources: ZENODO

Wikipedia Multilingual Fragments (WMF)

Authors: Quesada Granja, Carlos

Abstract

Wikipedia Multilingual Fragments (WMF) is a multilingual corpus of clean text fragments extracted from 340 Wikipedia language editions using the official MediaWiki API. The dataset contains over 373,000 plain-text fragments, each between 500 and 1,500 characters, collected by randomly sampling Wikipedia articles in encyclopedic namespaces. For each language, up to 2 million characters were collected; articles that were too short or contained excessive formatting were discarded. Fragments were cleaned by removing section headers, citations, LaTeX expressions, and redundant whitespace.

Contents

The repository includes:

- wmf.csv: a unified CSV file with one row per text fragment. Each row includes metadata such as:
  - language code
  - article ID and title
  - character count of the excerpt
  - timestamp of extraction
  - a flag indicating whether the excerpt was truncated
  - the cleaned text itself
- chunks/: 8,212 plain-text files, each containing exactly 50,000 characters. These were generated by concatenating all excerpts per language and splitting them into fixed-length, non-overlapping segments. Only complete chunks were retained; languages that could not produce at least one full chunk were excluded, leaving 325 languages at this stage.
- analysis/n_grams/: character-level n-gram frequency tables (n = 1 to 20) for each language, based on the cleaned text.
- analysis/tfidf/: word- and character-based TF, IDF, and TF-IDF scores per language, depending on the script (e.g., word-based for English, character-based for Chinese, Japanese, etc.).
- scripts/: Python scripts to reproduce the dataset and analyses:
  - generate_fragments_from_wikipedia.py: collects and filters plain-text excerpts from Wikipedia.
  - merge_language_csvs.py: merges individual per-language CSVs into the unified dataset (wmf.csv).
  - chunk_creation.py: creates fixed-length (50,000-character) text chunks from the full dataset, removing invisible Unicode characters and discarding incomplete segments.
  - n_grams_calculator.py: computes character-level n-gram frequencies (n = 1 to 20) per language.
  - tfidf_calculator.py: calculates TF, IDF, and TF-IDF scores per language using language-aware tokenization.

Applications

This dataset is designed for:

- multilingual language modeling
- benchmarking tokenization strategies
- cross-linguistic comparison
- low-resource NLP
- statistical and linguistic analysis
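The actual collection logic lives in generate_fragments_from_wikipedia.py, which is not reproduced here. As an illustration only, fetching random plain-text extracts through the MediaWiki API and applying the cleaning steps named in the abstract could look roughly like this (the API parameters are standard MediaWiki; the function names and regex patterns are illustrative assumptions, not the dataset's code):

```python
import json
import re
import urllib.parse
import urllib.request

def fetch_random_extracts(lang: str, limit: int = 10) -> list[str]:
    """Fetch plain-text extracts of random main-namespace articles
    from one Wikipedia language edition via the MediaWiki API."""
    params = urllib.parse.urlencode({
        "action": "query",
        "generator": "random",
        "grnnamespace": 0,    # encyclopedic (main) namespace
        "grnlimit": limit,
        "prop": "extracts",
        "explaintext": 1,     # plain text instead of HTML
        "format": "json",
    })
    url = f"https://{lang}.wikipedia.org/w/api.php?{params}"
    with urllib.request.urlopen(url) as resp:
        pages = json.load(resp)["query"]["pages"]
    return [p.get("extract", "") for p in pages.values()]

def clean_fragment(text: str) -> str:
    """Crude versions of the cleaning steps listed in the abstract."""
    text = re.sub(r"==+[^=\n]+==+", " ", text)           # section headers
    text = re.sub(r"\[\d+\]", "", text)                  # [1]-style citations
    text = re.sub(r"\{\\displaystyle[^}]*\}", " ", text) # leftover LaTeX
    return re.sub(r"\s+", " ", text).strip()             # redundant whitespace
```

In the real pipeline, cleaned excerpts would additionally be filtered to the 500–1,500 character range described above.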
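The chunking stage (chunk_creation.py) is described as producing fixed-length, non-overlapping 50,000-character segments, stripping invisible Unicode characters and discarding the trailing incomplete segment. A minimal sketch of that behavior, with the invisible-character rule assumed to mean Unicode format characters (category Cf, e.g. zero-width spaces), might be:

```python
import unicodedata

CHUNK_SIZE = 50_000  # fixed chunk length used for the chunks/ files

def strip_invisible(text: str) -> str:
    # Drop Unicode format characters (category Cf), e.g. U+200B zero-width space.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

def make_chunks(text: str, size: int = CHUNK_SIZE) -> list[str]:
    # Non-overlapping fixed-length windows; a trailing partial chunk is discarded.
    text = strip_invisible(text)
    return [text[i:i + size] for i in range(0, len(text) - size + 1, size)]
```

With this rule, a language whose concatenated excerpts fall short of one full chunk yields an empty list, matching the exclusion of such languages from the chunks/ stage.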
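The character-level n-gram tables (analysis/n_grams/, produced by n_grams_calculator.py) can be sketched as one frequency table per n, sliding a window over the raw characters (function name and return shape are illustrative):

```python
from collections import Counter

def char_ngrams(text: str, n_max: int = 20) -> dict[int, Counter]:
    # One Counter per n in 1..n_max, counting every length-n character window.
    tables: dict[int, Counter] = {}
    for n in range(1, n_max + 1):
        tables[n] = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return tables
```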
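For the TF-IDF tables (tfidf_calculator.py), "language-aware tokenization" is described as word-based for space-delimited scripts and character-based for scripts such as Chinese or Japanese. A minimal sketch under the standard tf × log(N/df) weighting (the actual script's weighting scheme is not specified, so this is an assumption):

```python
import math
from collections import Counter

def tokenize(text: str, word_based: bool = True) -> list[str]:
    # Words for space-delimited scripts; individual characters otherwise.
    return text.split() if word_based else [c for c in text if not c.isspace()]

def tf_idf(docs: list[str], word_based: bool = True) -> list[dict[str, float]]:
    token_lists = [tokenize(d, word_based) for d in docs]
    df = Counter()                       # document frequency per token
    for toks in token_lists:
        df.update(set(toks))
    n = len(docs)
    scores = []
    for toks in token_lists:
        tf = Counter(toks)
        total = len(toks)
        # Relative term frequency times log inverse document frequency.
        scores.append({t: (c / total) * math.log(n / df[t]) for t, c in tf.items()})
    return scores
```

A token appearing in every document gets IDF log(N/N) = 0, so only tokens that discriminate between languages or documents receive nonzero weight.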

Keywords

text chunks, text fragments, Wikipedia

Impact by BIP!
- selected citations: 0 (derived from selected sources; an alternative to the "Influence" indicator, which reflects the overall/total impact of an article based on the underlying citation network, diachronically)
- popularity: Average (the "current" impact/attention of the article in the research community at large, based on the underlying citation network)
- influence: Average (the overall/total impact of the article, based on the underlying citation network, diachronically)
- impulse: Average (the initial momentum of the article directly after its publication, based on the underlying citation network)