
Bosnian CORE NLP Standard v1.0-LTS defines a deterministic, loss-controlled specification for normalization, segmentation, and tokenization of Bosnian/BCS text. It is intended for reproducible corpus statistics and comparable NLP experiments across heterogeneous sources (web, PDFs, subtitles, social media, OCR).

The standard fixes rule ordering, artifact contracts, token typing, offset conventions, and export schemas, while allowing a small set of explicitly logged policy switches (case/number/emoji/newline). The specification introduces a three-level normalization stack (text_raw → text_nfc → text_clean), a strict run directory layout with immutable outputs, and machine-readable metadata (run_metadata.json), manifests, and SHA-256 checksums. Tokenization outputs are delivered as JSONL streams (tokens_core.jsonl, segments_core.jsonl) designed for scalable processing and stable downstream ingestion (Python/R/SQL).

Key comparability contract: (1) canonical token inclusion sets for metrics: Set A (lexical), Set B (lexical+punctuation), Set C (full stream), with explicit URL/EMAIL handling; (2) normative definitions of denominators N and type counts V to prevent cross-study mismatch; (3) normative frequency and n-gram export formats and deterministic sorting rules.

Included in this release: (1) specification PDF (compile-ready, referenceable); (2) LaTeX source bundle (sections/, bib/, assets/ test files); (3) citation metadata (CITATION.cff) and licensing guidance for spec/code/assets.
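To make the contract above more concrete, the following is a minimal Python sketch, not the specification's reference implementation. It illustrates the three-level stack, a SHA-256 checksum over a JSONL token stream, and N/V counts under the three inclusion sets. Several details are assumptions made only for illustration, because the abstract does not state them: the token type labels (WORD, PUNCT, URL), the JSONL field names (text, type), the whitespace-collapsing rule used for text_clean, and the helper names normalize, to_jsonl, sha256_hex, and count_n_v.

```python
import hashlib
import json
import unicodedata

# Hypothetical type labels: the abstract names Sets A, B, C but not the type inventory.
SET_A = {"WORD"}              # Set A: lexical tokens only
SET_B = SET_A | {"PUNCT"}     # Set B: lexical + punctuation
SET_C = None                  # Set C: full stream (no filtering)


def normalize(text_raw):
    """Return the three levels text_raw -> text_nfc -> text_clean."""
    text_nfc = unicodedata.normalize("NFC", text_raw)
    # Assumption: text_clean collapses whitespace runs; the real rule set
    # is defined in the specification PDF, not in this abstract.
    text_clean = " ".join(text_nfc.split())
    return {"text_raw": text_raw, "text_nfc": text_nfc, "text_clean": text_clean}


def to_jsonl(records):
    """Serialize records deterministically: sorted keys, one object per line."""
    lines = [json.dumps(r, ensure_ascii=False, sort_keys=True) for r in records]
    return "\n".join(lines) + "\n"


def sha256_hex(data):
    """Hex SHA-256 digest of a bytes payload, as would be listed in a manifest."""
    return hashlib.sha256(data).hexdigest()


def count_n_v(tokens, inclusion_set):
    """Return (N, V): token count and distinct type count for one inclusion set."""
    kept = [t["text"] for t in tokens
            if inclusion_set is None or t["type"] in inclusion_set]
    return len(kept), len(set(kept))


# Illustrative token stream for "Sarajevo je grad, grad je lijep. https://example.org"
TOKENS = [
    {"text": "Sarajevo", "type": "WORD"},
    {"text": "je", "type": "WORD"},
    {"text": "grad", "type": "WORD"},
    {"text": ",", "type": "PUNCT"},
    {"text": "grad", "type": "WORD"},
    {"text": "je", "type": "WORD"},
    {"text": "lijep", "type": "WORD"},
    {"text": ".", "type": "PUNCT"},
    {"text": "https://example.org", "type": "URL"},
]

if __name__ == "__main__":
    levels = normalize("c\u030covjek  je\n")   # decomposed "č" plus a double space
    print(levels["text_clean"])
    payload = to_jsonl(TOKENS).encode("utf-8")
    print(sha256_hex(payload))
    for name, inc in (("A", SET_A), ("B", SET_B), ("C", SET_C)):
        print(name, count_n_v(TOKENS, inc))
```

On this toy stream the three sets give different denominators (N = 6, 8, and 9), which is the kind of cross-study mismatch the normative definitions of N and V are meant to rule out.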
Bosnian, BCS, NLP, text normalization, tokenization, segmentation, reproducibility, n-grams, corpus linguistics, information theory, entropy
