ZENODO
Preprint . 2026
License: CC BY
Data sources: Datacite

Building a Reproducible Mining Pipeline for heimskringla.no (MediaWiki) via HTML snapshots + the MediaWiki Action API: Whether the Heimskringla corpus (including the Frostaþingslög) can be searched automatically for an óðal–aþal lexical complex?

Authors: Narimani, Arvid;

Abstract

Technical: A reproducible, script-driven pipeline is specified for mining the MediaWiki corpus at heimskringla.no for attestations belonging to the curated óðal/aþal lexical complex. The workflow enforces a three-stage separation into (i) URL enumeration, (ii) per-page acquisition, and (iii) extraction plus matching, in order to isolate coverage decisions from network volatility and to preserve auditability. Corpus-wide coverage is obtained via the MediaWiki Action API using action=query&list=allpages with full continuation handling (apcontinue) and optional redirect exclusion (apfilterredir=nonredirects); a bounded category-harvesting mode is also supported. Each enumerated page is fetched once and persisted as a raw HTML snapshot with accompanying metadata (requested/resolved URL, timestamps, HTTP status, and captured revision identifiers). Text extraction is MediaWiki-aware, preferring the main content container and excluding predictable UI/editorial scaffolding; reference/notes strata can be separated and are excluded by default. Mining is performed against the derived clean-text layer using an invariant philological core (athal_core), while Heimskringla-specific adaptations are confined to span-safe keying normalization to reduce false negatives without rewriting evidential spans. Outputs include an append-only TSV concordance with KWIC context and stable character offsets, per-page text hashes for drift detection, and JSONL run manifests enabling resumable execution and revision-stable replay via captured oldid permalinks.

Non-technical: A practical method is presented for searching the Heimskringla website (an online library built on wiki software) for a specific family of Old Norse words related to inherited land and lineage (óðal/aþal). The approach is designed to be repeatable and trustworthy: first, it makes a complete list of the pages to examine; second, it saves an exact copy of each page as it was retrieved; third, it strips away menus, categories, and other website “scaffolding” so that only the real text is searched. The actual word-search logic is kept stable and unchanged, so results from different runs or different corpora remain comparable. Every finding is recorded with surrounding context and with enough provenance information to trace it back to the exact page version used, even if the website later changes. The end product is a transparent concordance, essentially a searchable evidence table, that supports philological analysis without relying on manual browsing or unreliable site-wide search boxes.
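As an illustration of the enumeration stage, the following is a minimal sketch of a list=allpages loop with continuation handling, not the author's actual script. The endpoint path in API_URL, the User-Agent string, and the function names enumerate_pages and http_get_json are assumptions introduced here; only the query parameters (action=query, list=allpages, apcontinue, apfilterredir=nonredirects) come from the abstract. The fetch function is injectable so that the loop can be exercised without network access.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterator

# Assumption: the Action API endpoint path. MediaWiki installations differ
# (for example /api.php versus /w/api.php), so confirm it before use.
API_URL = "https://heimskringla.no/api.php"


def http_get_json(params: Dict[str, str]) -> dict:
    """Fetch one Action API response over HTTP and decode it as JSON."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(
        url, headers={"User-Agent": "odal-miner-sketch/0.1"}  # hypothetical agent name
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def enumerate_pages(
    fetch: Callable[[Dict[str, str]], dict] = http_get_json,
    skip_redirects: bool = True,
    namespace: int = 0,
) -> Iterator[dict]:
    """Yield one {'pageid', 'title'} record per page, following continuation."""
    params = {
        "action": "query",
        "list": "allpages",
        "aplimit": "max",
        "apnamespace": str(namespace),
        "format": "json",
    }
    if skip_redirects:
        params["apfilterredir"] = "nonredirects"

    continuation: Dict[str, str] = {}
    while True:
        data = fetch({**params, **continuation})
        for page in data.get("query", {}).get("allpages", []):
            yield {"pageid": page["pageid"], "title": page["title"]}
        # The API returns a 'continue' object (including apcontinue) while more
        # batches remain; every key in it is sent back on the next request.
        continuation = data.get("continue") or {}
        if not continuation:
            break
```

Writing each yielded record to a URL list before any page is fetched keeps coverage decisions separate from acquisition, which is the separation the abstract describes.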
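The idea of span-safe keying normalization with stable character offsets can be sketched as follows: matching runs on a folded key, while an index map carries every hit back to the untouched source text. This is a hedged illustration only. The abstract does not give the contents of athal_core, so KEY_PATTERN is a hypothetical stand-in, and the folding table FOLD, the context width, and the function names are likewise assumptions.

```python
import hashlib
import re
import unicodedata
from typing import Dict, Iterator, List, Tuple

# Hypothetical stand-in for the matching core; the real athal_core patterns
# are not given in the abstract.
KEY_PATTERN = re.compile(r"\b(?:odal|adal|athal)\w*")

# Assumed folding of letters that have no Unicode decomposition.
FOLD: Dict[str, str] = {"ð": "d", "þ": "th", "æ": "ae", "ø": "o"}


def build_key(text: str) -> Tuple[str, List[int]]:
    """Return a folded matching key and, per key character, its source index."""
    key_chars: List[str] = []
    origin: List[int] = []
    for index, char in enumerate(text):
        # Lowercase, then decompose so that accents become combining marks.
        for piece in unicodedata.normalize("NFD", char.lower()):
            if unicodedata.combining(piece):
                continue  # drop the accent from the key only
            for out in FOLD.get(piece, piece):
                key_chars.append(out)
                origin.append(index)
    return "".join(key_chars), origin


def find_hits(text: str, width: int = 30) -> Iterator[dict]:
    """Yield KWIC records whose offsets and spans refer to the original text."""
    key, origin = build_key(text)
    for match in KEY_PATTERN.finditer(key):
        start = origin[match.start()]
        end = origin[match.end() - 1] + 1
        # Keep trailing combining marks attached to the last matched letter.
        while end < len(text) and unicodedata.combining(text[end]):
            end += 1
        yield {
            "start": start,
            "end": end,
            "span": text[start:end],  # evidential span, never rewritten
            "left": text[max(0, start - width):start],
            "right": text[end:end + width],
        }


def text_hash(text: str) -> str:
    """SHA-256 of the clean text, usable as a per-page drift indicator."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Because the key is used only for lookup, a form such as aþalból is found through its folded key while the recorded span and offsets still point at the original spelling. Comparing text_hash values between runs indicates whether a page's clean text has changed.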

Keywords

computational corpus linguistics, Digital philology, Old Norse lexical semantics, Germanic legal vocabulary, medieval Scandinavian law texts
