Bosnian Corpus (v1.0): Cleaned Web and SMS Text for Entropy and NLP Research

This record provides a cleaned and genre-annotated corpus of contemporary Bosnian, designed for quantitative analysis of language entropy, "language energy" and modern NLP tasks. The corpus is built from three publicly available resources released in the CLARIN.SI repository:
(1) The Sarajevo Corpus of SMS Messages in Bosnian 1.1,
(2) Bosnian web corpus bsWaC 1.1, and
(3) Bosnian web corpus CLASSLA-web.bs 1.0.

All sources were converted to plain text, cleaned, normalised, partially deduplicated, and merged into a single consistent dataset. The final corpus contains approximately 6.18 GB of text (≈ 6,182,905,888 bytes), 46,258,935 lines and 942,515,845 tokens.

The web portion is organised into several "super-genres" (News, Opinion, Forum/Chat, Info/HowTo, Legal/Admin, Literature, Ads/Promo, Mix/Other). For each super-genre a separate text file is provided, together with one global file that concatenates all genres for entropy estimation and language-model training.

Cleaning focuses on removing technical noise that would bias frequency distributions and entropy estimates, while preserving the linguistic signal:
– Unicode normalisation (UTF-8, NFC),
– correction of common mojibake artefacts,
– removal of URLs, e-mail addresses, file names, boilerplate and CMS/navigation lines,
– filtering of lines with a high proportion of non-letter characters,
– optional digit normalisation and lowercasing,
– language filtering to keep primarily Bosnian text.

Files in this record:
– bosnian_corpus_all.txt (full corpus, all genres),
– per-genre text files (news, forum, opinion, info/howto, legal/admin, literature, ads/promo, mix/other),
– README.txt with dataset description,
– two accompanying research papers (Bosnian and English), uploaded separately as PDF files.
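
A few of the cleaning steps listed above can be illustrated with a minimal Python sketch. It is not the pipeline published with this record: the function name clean_line, the regular expressions, the 0.6 letter-ratio threshold and the choice of "0" as the digit placeholder are all assumptions made for illustration. Mojibake repair, boilerplate removal and language filtering are not covered.

```python
import re
import unicodedata

# Illustrative patterns only; the published pipeline may use different rules.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
DIGIT_RE = re.compile(r"\d")
WS_RE = re.compile(r"\s+")


def clean_line(line, min_letter_ratio=0.6, lowercase=False, normalise_digits=False):
    """Return a cleaned line, or None if the line should be dropped.

    min_letter_ratio is a hypothetical threshold, not a value from the record.
    """
    # Unicode normalisation to NFC (composed form).
    text = unicodedata.normalize("NFC", line)
    # Remove URLs and e-mail addresses, then collapse whitespace.
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = WS_RE.sub(" ", text).strip()
    if not text:
        return None
    # Drop lines with a high proportion of non-letter characters.
    non_space = [ch for ch in text if not ch.isspace()]
    letters = sum(ch.isalpha() for ch in non_space)
    if letters / len(non_space) < min_letter_ratio:
        return None
    # Optional digit normalisation and lowercasing.
    if normalise_digits:
        text = DIGIT_RE.sub("0", text)
    if lowercase:
        text = text.lower()
    return text
```

Returning None for rejected lines lets a caller filter a file in one pass, for example by writing out only the lines for which clean_line returns a string.
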
Code availability:
Preprocessing, cleaning and entropy-calculation scripts are publicly available on GitHub: https://github.com/H4sK0/bosnian-corpus-pipeline

Licence:
This corpus is released under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) licence. Users must credit this Zenodo record and the original source corpora (Sarajevo SMS 1.1, bsWaC 1.1, CLASSLA-web.bs 1.0), and must distribute derivative corpora under the same or a compatible licence.

Suggested citation:
Hasan Kahrimanović (2025). Bosnian Corpus (v1.0): Cleaned Web and SMS Text for Entropy and NLP Research. Zenodo. DOI: [assigned by Zenodo].
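
As a rough illustration of the kind of entropy estimate the corpus is intended for, the sketch below computes a plug-in (maximum-likelihood) unigram Shannon entropy in bits per token. It is not taken from the repository above: the function name unigram_entropy and the whitespace tokenisation are assumptions, and the token counts reported for this record may rely on a different tokeniser. Plug-in estimates are also known to be biased downward on finite samples.

```python
import math
from collections import Counter


def unigram_entropy(lines):
    """Plug-in Shannon entropy (bits per token) over whitespace-separated tokens.

    `lines` can be any iterable of strings, including an open text file,
    so a large corpus can be streamed without loading it into memory.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())  # assumption: tokens are split on whitespace
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = -sum_w p(w) * log2 p(w), with p(w) estimated as count(w) / total.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

To apply it to the global file, the file object can be passed directly, for example by opening bosnian_corpus_all.txt with encoding="utf-8" and calling unigram_entropy on the handle.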

Keywords

Bosnian, Bosnian language, text corpus, web corpus, SMS corpus, entropy, language modeling, NLP, CLARIN.SI
