Powered by OpenAIRE graph
ZENODO
Dataset · 2025
License: CC BY-SA
Data sources: ZENODO

Wikipedia Multilingual Fragments (WMF)

Authors: Quesada Granja, Carlos

Abstract

Wikipedia Multilingual Fragments (WMF) is a multilingual corpus of clean text fragments extracted from 340 Wikipedia language editions using the official MediaWiki API. The dataset contains over 373,000 plain-text fragments, each between 500 and 1,500 characters, collected by randomly sampling Wikipedia articles in encyclopedic namespaces. For each language, up to 2 million characters were collected; articles that were too short or contained excessive formatting were discarded. Fragments were cleaned by removing section headers, citations, LaTeX expressions, and redundant whitespace.

Contents

The repository includes:

- wmf.csv: a unified CSV file with one row per text fragment. Each row includes metadata such as:
  - language code
  - article ID and title
  - character count of the excerpt
  - timestamp of extraction
  - a flag indicating whether the excerpt was truncated
  - the cleaned text itself
- chunks/: 8,212 plain-text files, each containing exactly 50,000 characters. These were generated by concatenating all excerpts per language and splitting them into fixed-length, non-overlapping segments. Only complete chunks were retained; languages that could not produce at least one full chunk were excluded, leaving 325 languages at this stage.
- analysis/n_grams/: character-level n-gram frequency tables (n = 1 to 20) for each language, based on the cleaned text.
- analysis/tfidf/: word- and character-based TF, IDF, and TF-IDF scores per language, depending on the script (e.g., word-based for English, character-based for Chinese, Japanese, etc.).
- scripts/: Python scripts to reproduce the dataset and analyses:
  - generate_fragments_from_wikipedia.py: collects and filters plain-text excerpts from Wikipedia.
  - merge_language_csvs.py: merges individual per-language CSVs into the unified dataset (wmf.csv).
  - chunk_creation.py: creates fixed-length (50,000-character) text chunks from the full dataset, removing invisible Unicode characters and discarding incomplete segments.
  - n_grams_calculator.py: computes character-level n-gram frequencies (n = 1 to 20) per language.
  - tfidf_calculator.py: calculates TF, IDF, and TF-IDF scores per language using language-aware tokenization.

Applications

This dataset is designed for:

- multilingual language modeling
- benchmarking tokenization strategies
- cross-linguistic comparison
- low-resource NLP
- statistical and linguistic analysis
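The actual collection logic lives in generate_fragments_from_wikipedia.py, which is not reproduced here. As an illustration only, fetching random plain-text extracts through the MediaWiki API and applying the cleaning steps named in the abstract could look roughly like this (the API parameters are standard MediaWiki; the function names and regex patterns are illustrative assumptions, not the dataset's code):

```python
import json
import re
import urllib.parse
import urllib.request

def fetch_random_extracts(lang: str, limit: int = 10) -> list[str]:
    """Fetch plain-text extracts of random main-namespace articles
    from one Wikipedia language edition via the MediaWiki API."""
    params = urllib.parse.urlencode({
        "action": "query",
        "generator": "random",
        "grnnamespace": 0,    # encyclopedic (main) namespace
        "grnlimit": limit,
        "prop": "extracts",
        "explaintext": 1,     # plain text instead of HTML
        "format": "json",
    })
    url = f"https://{lang}.wikipedia.org/w/api.php?{params}"
    with urllib.request.urlopen(url) as resp:
        pages = json.load(resp)["query"]["pages"]
    return [p.get("extract", "") for p in pages.values()]

def clean_fragment(text: str) -> str:
    """Crude versions of the cleaning steps listed in the abstract."""
    text = re.sub(r"==+[^=\n]+==+", " ", text)           # section headers
    text = re.sub(r"\[\d+\]", "", text)                  # [1]-style citations
    text = re.sub(r"\{\\displaystyle[^}]*\}", " ", text) # leftover LaTeX
    return re.sub(r"\s+", " ", text).strip()             # redundant whitespace
```

In the real pipeline, cleaned excerpts would additionally be filtered to the 500–1,500 character range described above.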
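The chunking stage (chunk_creation.py) is described as producing fixed-length, non-overlapping 50,000-character segments, stripping invisible Unicode characters and discarding the trailing incomplete segment. A minimal sketch of that behavior, with the invisible-character rule assumed to mean Unicode format characters (category Cf, e.g. zero-width spaces), might be:

```python
import unicodedata

CHUNK_SIZE = 50_000  # fixed chunk length used for the chunks/ files

def strip_invisible(text: str) -> str:
    # Drop Unicode format characters (category Cf), e.g. U+200B zero-width space.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

def make_chunks(text: str, size: int = CHUNK_SIZE) -> list[str]:
    # Non-overlapping fixed-length windows; a trailing partial chunk is discarded.
    text = strip_invisible(text)
    return [text[i:i + size] for i in range(0, len(text) - size + 1, size)]
```

With this rule, a language whose concatenated excerpts fall short of one full chunk yields an empty list, matching the exclusion of such languages from the chunks/ stage.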
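The character-level n-gram tables (analysis/n_grams/, produced by n_grams_calculator.py) can be sketched as one frequency table per n, sliding a window over the raw characters (function name and return shape are illustrative):

```python
from collections import Counter

def char_ngrams(text: str, n_max: int = 20) -> dict[int, Counter]:
    # One Counter per n in 1..n_max, counting every length-n character window.
    tables: dict[int, Counter] = {}
    for n in range(1, n_max + 1):
        tables[n] = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return tables
```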
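For the TF-IDF tables (tfidf_calculator.py), "language-aware tokenization" is described as word-based for space-delimited scripts and character-based for scripts such as Chinese or Japanese. A minimal sketch under the standard tf × log(N/df) weighting (the actual script's weighting scheme is not specified, so this is an assumption):

```python
import math
from collections import Counter

def tokenize(text: str, word_based: bool = True) -> list[str]:
    # Words for space-delimited scripts; individual characters otherwise.
    return text.split() if word_based else [c for c in text if not c.isspace()]

def tf_idf(docs: list[str], word_based: bool = True) -> list[dict[str, float]]:
    token_lists = [tokenize(d, word_based) for d in docs]
    df = Counter()                       # document frequency per token
    for toks in token_lists:
        df.update(set(toks))
    n = len(docs)
    scores = []
    for toks in token_lists:
        tf = Counter(toks)
        total = len(toks)
        # Relative term frequency times log inverse document frequency.
        scores.append({t: (c / total) * math.log(n / df[t]) for t, c in tf.items()})
    return scores
```

A token appearing in every document gets IDF log(N/N) = 0, so only tokens that discriminate between languages or documents receive nonzero weight.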

Keywords

text chunks, text fragments, Wikipedia

Impact by BIP!
- selected citations: 0 (derived from selected sources; an alternative to the "Influence" indicator, which reflects the overall/total impact of an article based on the underlying citation network, diachronically)
- popularity: Average (the "current" impact/attention of the article in the research community at large, based on the underlying citation network)
- influence: Average (the overall/total impact of the article, based on the underlying citation network, diachronically)
- impulse: Average (the initial momentum of the article directly after its publication, based on the underlying citation network)