TRIDIS: HTR model for Multilingual Medieval and Early Modern Documentary Manuscripts (11th-16th)

Name: TRIDIS: HTR model for Multilingual Medieval and Early Modern Documentary Manuscripts (11th-16th)
Keywords: Handwritten text recognition, Handwritten text recognition for Medieval manuscripts, Digital Paleography

Research data > Dataset | 09 Mar 2024 | Latin | Publisher: Zenodo

Authors: Torres Aguilar, Sergio; Jolivet, Vincent

doi: 10.5281/zenodo.10800223, 10.5281/zenodo.7547437

Abstract

TRIDIS (Tria Digita Scribunt) is a Handwritten Text Recognition (HTR) model trained on semi-diplomatic transcriptions of medieval and Early Modern manuscripts. It is suited to documentary manuscripts, that is, manuscripts arising from legal, administrative, and memorial practices, most commonly from the Late Middle Ages (13th century onwards). It can also perform well on documents from other domains, such as literary books, scholarly treatises, and cartularies, providing a versatile tool for historians and philologists in transforming and analyzing historical texts.

A paper presenting the first version of the model is available here: Sergio Torres Aguilar, Vincent Jolivet. Handwritten Text Recognition for Documentary Medieval Manuscripts. Journal of Data Mining and Digital Humanities. 2023. https://hal.science/hal-03892163

Transcription rules

Since the majority of the training documents come from diplomatic editions, the transcriptions were normalized to contemporary reading standards, and abbreviations were expanded to make the text easier to read. The following rules were applied:

- Abbreviations have been expanded, both those by suspension (facimꝰ --> facimus) and those by contraction (dñi --> domini). Those using conventional signs (⁊ --> et; ꝓ --> pro) have likewise been resolved.
- Named entities (names of persons, places, and institutions) have been capitalized.
- The beginning of a block of text and the original capitals used by the scribe are also capitalized.
- The consonantal i and u characters have been transcribed as j and v in both French and Latin.
- The punctuation marks used in the manuscript, such as . or / or |, have not been systematically transcribed, because the transcription has been standardized with modern punctuation.
- Corrections and words that appear cancelled in the manuscript have been transcribed with the sign $ placed before and after them.
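The abbreviation and cancellation rules above can be illustrated with a minimal Python sketch. It is not part of TRIDIS: the lookup tables contain only the four examples quoted in the rules, and the names `WORD_EXPANSIONS`, `SIGN_EXPANSIONS`, `expand_abbreviations`, and `mark_cancelled` are hypothetical, chosen for illustration. Real ground truth was produced by editors, not by a lookup table.

```python
import unicodedata

# Hypothetical tables holding only the examples quoted in the rules above.
WORD_EXPANSIONS = {
    "facimꝰ": "facimus",  # abbreviation by suspension
    "dñi": "domini",       # abbreviation by contraction
}
SIGN_EXPANSIONS = {
    "⁊": "et",   # Tironian et
    "ꝓ": "pro",  # conventional sign for "pro"
}

# Normalize keys so precomposed and decomposed spellings compare equal.
WORD_EXPANSIONS = {unicodedata.normalize("NFC", k): v for k, v in WORD_EXPANSIONS.items()}


def expand_abbreviations(text: str) -> str:
    """Expand whole-word abbreviations, then conventional signs.

    Words are split on whitespace only, so a form with attached
    punctuation (for example "dñi,") is left unexpanded in this sketch.
    """
    expanded = []
    for word in unicodedata.normalize("NFC", text).split():
        word = WORD_EXPANSIONS.get(word, word)
        for sign, replacement in SIGN_EXPANSIONS.items():
            word = word.replace(sign, replacement)
        expanded.append(word)
    return " ".join(expanded)


def mark_cancelled(word: str) -> str:
    """Surround a cancelled or corrected word with the $ sign."""
    return f"${word}$"
```

For instance, `expand_abbreviations("⁊ dñi")` yields `et domini`, and `mark_cancelled("dedit")` yields `$dedit$`. Because the model was trained on text normalized this way, its output is expected to contain expanded forms, not the abbreviation signs.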
Versions

Version 1 of the model was trained on a dataset of charters and registers from the Late Medieval period (12th-15th centuries). Training and evaluation involved 1855 pages, 120k lines of text, and almost 1M tokens, drawn from three freely available ground-truth corpora:

- The Alcar-HOME database: https://zenodo.org/record/5600884
- The e-NDP corpus: https://zenodo.org/record/7575693
- The Himanis project: https://zenodo.org/record/5535306

Version 2 of the model added new datasets from feudal books and legal proceedings (14th-16th centuries), incorporating an additional 115k lines and more than 1.2M tokens into the previous version, drawn from other corpora such as:

- Königsfelden Abbey corpus: https://zenodo.org/record/5179361
- Monumenta Luxemburgensia

Accuracy

TRIDIS was trained using a CNN+RNN+CTC architecture within the Kraken suite (https://kraken.re/). This final model operates in a multilingual environment (Latin, Old French, and Old Spanish) and can recognize several Latin script families (mostly Textualis and Cursiva) in documents produced circa the 11th-16th centuries. During evaluation, the model showed an accuracy of 93.1% on the validation set and a CER (Character Error Rate) of about 0.11 to 0.15 on four external unseen datasets. Fine-tuning the model with 10 ground-truth pages can improve these results to a CER between 0.06 and 0.10.

Other formats

The ground truth used for version 2 was also used to train a Transformer HTR model that combines TrOCR as the encoder with a RoBERTa medieval model as the decoder. This model shows a slightly better CER than the current TRIDIS version and improves WER by about 25%. The model is available on the Hugging Face Hub: magistermilitum/tridis_HTR
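The CER and WER figures quoted above are conventionally computed as the edit distance between the model output and the reference transcription, divided by the length of the reference. The following Python sketch assumes that standard definition; this record does not state which evaluation script the authors used, and the function names `edit_distance`, `cer`, and `wer` are hypothetical.

```python
from typing import Sequence


def edit_distance(reference: Sequence, hypothesis: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate as a fraction of the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    reference_words = reference.split()
    if not reference_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(reference_words, hypothesis.split()) / len(reference_words)
```

Under this definition, a CER of 0.11 corresponds to roughly 11 character edits per 100 reference characters. A single wrong character makes the whole word count as an error, which is why WER is normally higher than CER on the same output.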

Related Organizations

École Nationale des Chartes
France
University of Luxembourg
Luxembourg

Related research

HTR model for Latin and French Medieval Documentary Manuscripts (12th-15th)
2023 | IsAmongTopNSimilarDocuments
HTR model for Latin and French Medieval Documentary Manuscripts (12th-15th)
2023 | HasVersion

Impact by BIP!

citations: 0. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
