Powered by OpenAIRE graph
ZENODO
Dataset . 2025
License: CC BY NC SA
Data sources: ZENODO, Datacite

Multilingual Segmentation Dataset for Historical Prose (13th–16th c.)

From manuscripts to models: a multilingual corpus for sentence segmentation in historical prose
Authors: Ing, Lucence; Gille Levenson, Matthias; Macedo, Carolina;

Abstract

This dataset was developed to train a multilingual sentence segmentation model, used as a pre-processing step in the automatic alignment of historical texts with Aquilign, a multilingual alignment tool developed by our team. The corpus provides training material for sentence-level segmentation in historical prose from the 13th to 16th centuries. Texts were selected for their genre diversity (narrative, didactic, legal, theological, scholarly prose) and for their ability to reflect editorial, orthographic, and linguistic variation across time, geography, and scribal practices. The current version of the corpus (v1) includes approximately 50,000 segmented excerpts across seven historical languages (Latin, French, Castilian, Catalan, Portuguese, Italian, and English). Segment boundaries are marked with the pound sign (£); the resulting segments typically correspond to sentences or syntactic units. The corpus does not include part-of-speech tagging or syntactic annotation; it provides sentence-level segmentation only.
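
As a rough illustration of the annotation scheme described in the abstract, the following Python sketch splits a £-annotated excerpt into segments and derives per-token boundary labels of the kind a segmentation model might be trained on. The sample string, the function names, the whitespace tokenization, and the convention that the first token of each segment is labeled 1 are all assumptions made for this example; none of them is taken from the dataset files or from Aquilign.

```python
BOUNDARY = "£"  # segment marker described in the abstract

# Invented sample for illustration only; not an excerpt from the corpus.
SAMPLE = "In principio erat verbum £ et verbum erat apud Deum £ et Deus erat verbum"


def split_segments(annotated: str, marker: str = BOUNDARY) -> list[str]:
    """Split an annotated excerpt on the marker, normalizing whitespace
    and dropping empty pieces (e.g. from a leading or trailing marker)."""
    return [" ".join(piece.split()) for piece in annotated.split(marker) if piece.strip()]


def to_boundary_labels(annotated: str, marker: str = BOUNDARY) -> tuple[list[str], list[int]]:
    """Return whitespace tokens and parallel 0/1 labels.

    Assumption: label 1 marks a token that opens a new segment,
    label 0 marks every other token.
    """
    tokens: list[str] = []
    labels: list[int] = []
    for segment in split_segments(annotated, marker):
        words = segment.split()
        tokens.extend(words)
        labels.extend([1] + [0] * (len(words) - 1))
    return tokens, labels


if __name__ == "__main__":
    for segment in split_segments(SAMPLE):
        print(segment)
    tokens, labels = to_boundary_labels(SAMPLE)
    print(list(zip(tokens, labels)))
```

Labeling the segment-opening token is only one possible convention; a model could equally be trained to predict the token that closes a segment, and the actual file layout of the corpus may call for a different reader.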

Keywords

digital philology, historical corpora, boundary detection, annotated dataset, multilingual segmentation, Aquilign
