Polyglot Concordance — Gospel of Mark Alignment Corpus

Polyglot Concordance is a word-level alignment of the Gospel of Mark across three textual witnesses (the Greek New Testament, the Syriac Peshitta, and the Latin Clementine Vulgate), with AI-generated, apparatus-style annotations on every divergence. This deposit is corpus version 2.0.0: all 678 Mark verses, comprising 8,828 alignment groups (4,800 of them non-aligned divergences), as one JSON file per verse.

⚠ Status: machine-generated draft, not a critical edition. The corpus is produced entirely by a large language model (Anthropic Claude Opus 4.8) and has not been peer-reviewed by a biblical-textual scholar. It is best treated as a machine-generated alignment draft, a starting point for scholar review, rather than as an authoritative critical edition. Self-reported confidence values are not calibrated against correctness and should not be read as a quality signal.

Provenance. Generated by claude-opus-4-8 via the Anthropic Messages Batch API (output_config.effort = medium). Each verse was schema-validated; structurally invalid responses were quarantined and regenerated until valid, so the published corpus has zero quarantined verses. v2.0.0 supersedes v1.0.0 (generated by Claude Sonnet 4.5): the model upgrade measurably improved accuracy (4 of 4 hand-verified apparatus errors avoided, including one that Sonnet repeated systematically) and run-to-run self-consistency (+11 percentage points on group membership). Full details, the v1→v2 delta, and known limitations are in CHANGELOG.md within the archive.

Contents. One JSON file per verse under mark/<chapter>/<verse>.json; each records the tokenized witnesses, the alignment groups (with variant verdict, semantic type, and an apparatus note for every non-aligned group), and generation metadata. The archive also bundles CORPUS_VERSION, CHANGELOG.md, the exact generation prompt (align_3way.md), and DATASET_README.md documenting the full schema.

Sources. Derived from the STEP Bible TAGNT tagged Greek NT (Tyndale House Cambridge, CC BY 4.0), the Syriac Peshitta and root data from the Aramaic Root Atlas (DOI 10.5281/zenodo.19358625), and the Latin Clementine Vulgate (seven1m/open-bibles, public domain). This derived corpus is released under CC BY 4.0; redistribution must credit this corpus and the upstream sources.

Live viewer: polyglotconcordance.com (multilingual UI in English, Spanish, Simplified and Traditional Chinese; public read-only JSON API at /api/v1/).

How to cite: Fresco Benaim, Jose. (2026). Polyglot Concordance — Gospel of Mark Alignment Corpus (v2.0.0) [Data set]. Zenodo.
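
The per-verse layout described under Contents (mark/<chapter>/<verse>.json) can be walked with a few lines of Python. The following is only a minimal sketch, not part of the deposit: the function names are hypothetical, the directory and file names are assumed to be integers (possibly zero-padded), and no field names are assumed, since the schema is documented in DATASET_README.md rather than here.

```python
import json
from collections import Counter
from pathlib import Path


def index_verses(corpus_root):
    """Map (chapter, verse) -> path for every mark/<chapter>/<verse>.json file."""
    index = {}
    for path in Path(corpus_root).glob("mark/*/*.json"):
        chapter, verse = path.parent.name, path.stem
        # Assumption: chapter directories and verse files are named with integers.
        # Anything else (for example a stray notes file) is skipped.
        if chapter.isdigit() and verse.isdigit():
            index[(int(chapter), int(verse))] = path
    return index


def verses_per_chapter(index):
    """Count how many verse files were found in each chapter."""
    return Counter(chapter for chapter, _ in index)


def top_level_keys(path):
    """Return the sorted top-level keys of one verse file, without assuming the schema."""
    with open(path, encoding="utf-8") as handle:
        record = json.load(handle)
    return sorted(record) if isinstance(record, dict) else []
```

Calling index_verses on the directory where the archive was extracted and comparing len(index) with the 678 verses stated above gives a quick completeness check; top_level_keys on any one file shows which fields to look up in DATASET_README.md.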
