ZENODO
Report . 2026
License: CC BY
Data sources: Datacite

Bosnian CORE NLP Standard (BCS-compatible): Text Normalization & Tokenization — v1.0-LTS

Authors: Kahrimanovic, Hasan

Abstract

Bosnian CORE NLP Standard v1.0-LTS defines a deterministic, loss-controlled specification for normalization, segmentation, and tokenization of Bosnian/BCS text. It is intended for reproducible corpus statistics and comparable NLP experiments across heterogeneous sources (web, PDFs, subtitles, social media, OCR). The standard fixes rule ordering, artifact contracts, token typing, offset conventions, and export schemas, while allowing a small set of explicitly logged policy switches (case/number/emoji/newline).

The specification introduces a three-level normalization stack (text_raw → text_nfc → text_clean), a strict run directory layout with immutable outputs, and machine-readable metadata (run_metadata.json), manifests, and SHA-256 checksums. Tokenization outputs are delivered as JSONL streams (tokens_core.jsonl, segments_core.jsonl) designed for scalable processing and stable downstream ingestion (Python/R/SQL).

Key comparability contract:

  • Canonical token inclusion sets for metrics: Set A (lexical), Set B (lexical+punctuation), Set C (full stream), with explicit URL/EMAIL handling.
  • Normative definitions of denominators N and type counts V to prevent cross-study mismatch.
  • Normative frequency and n-gram export formats and deterministic sorting rules.

Included in this release:

  • Specification PDF (compile-ready, referenceable).
  • LaTeX source bundle (sections/, bib/, assets/ test files).
  • Citation metadata (CITATION.cff) and licensing guidance for spec/code/assets.
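
The normalization stack and the N/V denominators named in the abstract can be illustrated with a minimal Python sketch. This is not the reference implementation. The whitespace-collapsing rule used for text_clean, the record fields `text` and `type`, and the type labels `WORD` and `PUNCT` are assumptions made for illustration only; the normative rules and schemas are those in the specification itself.

```python
import hashlib
import re
import unicodedata


def normalize(text_raw: str) -> dict:
    """Return the three layers named in the abstract for one input string."""
    # text_nfc: Unicode NFC composition (e.g. "c" + combining caron -> "č").
    text_nfc = unicodedata.normalize("NFC", text_raw)
    # text_clean: ASSUMPTION -- collapse whitespace runs and trim. The actual
    # cleaning rules are defined by the specification, not by this sketch.
    text_clean = re.sub(r"\s+", " ", text_nfc).strip()
    return {"text_raw": text_raw, "text_nfc": text_nfc, "text_clean": text_clean}


def sha256_hex(text: str) -> str:
    """SHA-256 checksum of the UTF-8 encoding, as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def count_n_v(tokens, included_types):
    """N = number of included tokens, V = number of distinct included forms."""
    forms = [t["text"] for t in tokens if t["type"] in included_types]
    return len(forms), len(set(forms))


# Decomposed "č" (c + U+030C), a double space, and a tab in the raw input.
layers = normalize("Doc\u030Cek  u\tSarajevu")
checksum = sha256_hex(layers["text_clean"])

# Hypothetical token records; field and type names are illustrative only.
tokens = [
    {"text": "Doček", "type": "WORD"},
    {"text": "u", "type": "WORD"},
    {"text": "u", "type": "WORD"},
    {"text": ",", "type": "PUNCT"},
]
n_lexical, v_lexical = count_n_v(tokens, {"WORD"})           # lexical only
n_with_punct, v_with_punct = count_n_v(tokens, {"WORD", "PUNCT"})
```

The example shows why the inclusion set has to be stated alongside any reported metric: the same token stream yields different N and V depending on whether punctuation is counted.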

Keywords

Bosnian, BCS, NLP, text normalization, tokenization, segmentation, reproducibility, n-grams, corpus linguistics, information theory, entropy
