ZENODO
Dataset . 2025
License: CC BY SA
Data sources: Datacite

Bosnian Corpus (v1.0): Cleaned Web and SMS Text for Entropy and NLP Research

Authors: Kahrimanovic, Hasan

Abstract

This record provides a cleaned and genre-annotated corpus of contemporary Bosnian, designed for quantitative analysis of language entropy, "language energy" and modern NLP tasks.

The corpus is built from three publicly available resources released in the CLARIN.SI repository:
(1) The Sarajevo Corpus of SMS Messages in Bosnian 1.1,
(2) Bosnian web corpus bsWaC 1.1, and
(3) Bosnian web corpus CLASSLA-web.bs 1.0.

All sources were converted to plain text, cleaned, normalised, partially deduplicated, and merged into a single consistent dataset. The final corpus contains approximately 6.18 GB of text (6,182,905,888 bytes), 46,258,935 lines and 942,515,845 tokens.

The web portion is organised into several “super-genres” (News, Opinion, Forum/Chat, Info/HowTo, Legal/Admin, Literature, Ads/Promo, Mix/Other). For each super-genre a separate text file is provided, together with one global file that concatenates all genres for entropy estimation and language-model training.

Cleaning focuses on removing technical noise that would bias frequency distributions and entropy estimates, while preserving the linguistic signal:
– Unicode normalisation (UTF-8, NFC),
– correction of common mojibake artefacts,
– removal of URLs, e-mail addresses, file names, boilerplate and CMS/navigation lines,
– filtering of lines with a high proportion of non-letter characters,
– optional digit normalisation and lowercasing,
– language filtering to keep primarily Bosnian text.

Files in this record:
– bosnian_corpus_all.txt (full corpus, all genres),
– per-genre text files (news, forum, opinion, info/howto, legal/admin, literature, ads/promo, mix/other),
– README.txt with dataset description,
– two accompanying research papers (Bosnian and English), uploaded separately as PDF files.
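
A few of the cleaning steps listed above (NFC normalisation, removal of URLs and e-mail addresses, filtering of lines dominated by non-letter characters, optional digit normalisation and lowercasing) can be illustrated with a minimal Python sketch. This is not the author's pipeline, which lives in the GitHub repository linked in this record; the function name `clean_line`, the regular expressions and the 0.6 letter-ratio threshold are assumptions chosen for illustration, and mojibake repair, boilerplate detection and language filtering are not covered.

```python
import re
import unicodedata

# Illustrative patterns only; the real pipeline may use different rules.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
WS_RE = re.compile(r"\s+")


def clean_line(line, min_letter_ratio=0.6, lowercase=False, normalise_digits=False):
    """Return a cleaned line, or None if the line should be dropped.

    min_letter_ratio is a hypothetical threshold: lines whose share of
    alphabetic characters (among non-space characters) falls below it
    are treated as technical noise.
    """
    # Unicode normalisation to NFC (composed form).
    text = unicodedata.normalize("NFC", line)

    # Remove URLs and e-mail addresses, then collapse whitespace.
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = WS_RE.sub(" ", text).strip()
    if not text:
        return None

    # Drop lines with a high proportion of non-letter characters.
    non_space = [ch for ch in text if not ch.isspace()]
    letters = sum(ch.isalpha() for ch in non_space)
    if letters / len(non_space) < min_letter_ratio:
        return None

    # Optional steps, mirroring "optional digit normalisation and lowercasing".
    if normalise_digits:
        text = re.sub(r"\d", "0", text)
    if lowercase:
        text = text.lower()
    return text
```

Applied line by line, a filter of this kind keeps ordinary sentences while discarding lines that consist mostly of digits, punctuation or markup residue; the threshold would need tuning against the actual data.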

Code availability:
Preprocessing, cleaning and entropy-calculation scripts are publicly available on GitHub:
https://github.com/H4sK0/bosnian-corpus-pipeline

Licence:
This corpus is released under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) licence. Users must credit this Zenodo record and the original source corpora (Sarajevo SMS 1.1, bsWaC 1.1, CLASSLA-web.bs 1.0), and must distribute derivative corpora under the same or a compatible licence.

Suggested citation:
Hasan Kahrimanović (2025). Bosnian Corpus (v1.0): Cleaned Web and SMS Text for Entropy and NLP Research. Zenodo. DOI: [assigned by Zenodo].
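
Since the corpus is intended for entropy estimation, a minimal sketch of a unigram (plug-in) Shannon entropy estimate, H = -Σ p(x) log2 p(x), over characters and whitespace-separated tokens may help orient users. The function names below are hypothetical and this is not the entropy-calculation script from the repository; the plug-in estimator is also known to underestimate entropy on small samples, so the sketch is a starting point rather than a reference method.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Plug-in Shannon entropy in bits for a mapping of symbol -> count."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def unigram_entropies(lines):
    """Return (character entropy, word entropy) in bits per symbol.

    `lines` is any iterable of strings, so an open text file can be
    streamed without loading the whole corpus into memory.
    Tokens are split on whitespace, which is an assumption; the corpus
    token count in this record may rely on a different tokeniser.
    """
    char_counts = Counter()
    word_counts = Counter()
    for line in lines:
        line = line.strip()
        char_counts.update(line)          # counts individual characters
        word_counts.update(line.split())  # counts whitespace-separated tokens
    return shannon_entropy(char_counts), shannon_entropy(word_counts)
```

For the full corpus, the function could be called on a file object opened with `open("bosnian_corpus_all.txt", encoding="utf-8")`, or on any per-genre file to compare genres.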

Keywords

Bosnian, Bosnian language, text corpus, web corpus, SMS corpus, entropy, language modeling, NLP, CLARIN.SI
