This project was funded by the DIM MAP in the context of the CREMMA project (https://www.dim-map.fr/projets-soutenus/cremma/).

The cremma-medieval repository was created to make transcription corpora available for training HTR models for medieval manuscripts from the 12th to the 14th century. The CREMMA Medieval dataset was built with eScriptorium (http://traces6.paris.inria.fr), an interface for HTR ground truth production, and Kraken, an HTR and layout segmentation engine. It is composed of ten Old French manuscripts written between the 13th and 14th centuries, mainly scanned in high definition and color, except for one manuscript (Vatican), which is a black and white document, and BnF fr. 17229, 13496 and 411, which come from microfilm scans. The dataset is mostly made from pre-existing transcribed texts, and the sample size can differ considerably from one source manuscript to another.

The basis of the dataset is composed of the following transcriptions:

Bibliothèque nationale de France, Arsenal 3516, crowdsourced transcriptions of the collaborative projects of the Stanford Library: Bestiaire de Guillaume le Clerc de Normandie (https://fromthepage.com/stanfordlibraries/guillaume-le-clerc-de-normandie-s-bestiary)
Bibliothèque nationale de France, fr. 411, Vie de saint Lambert, transcribed by A. Pinche (ENC)
Bibliothèque nationale de France, fr. 412, Li Seint Confessor de Wauchier de Denain, transcribed by A. Pinche (ENC)
Bibliothèque nationale de France, fr. 844, Manuscrit du Roi, Maritem project (https://anr.fr/Projet-ANR-18-CE27-0016), transcribed by V. Mariotti (Maritem project)
Bibliothèque nationale de France, fr. 13496, Vie de saint Jérôme, transcribed by A. Pinche (ENC)
Bibliothèque nationale de France, fr. 17229, Vie de saint Jérôme, transcribed by A. Pinche (ENC)
Bibliothèque nationale de France, fr. 25516, Beuve de Hantone, transcribed by A. Nolibois (Université d'Aix-Marseille)
Bibliothèque nationale de France, fr. 22550, Les Sept Sages de Thèbes. This project has just started in Geneva under the direction of Y. Foehr-Janssen (UNIGE); the different folios have been transcribed by Camille Carnaille (ULB/UNIGE) (fol. 157r, 163v, 174v, 178v, 186v, 200v), Prunelle Deleville (UNIGE) (fol. 157v, 178r, 186r, 200r, 204r, 343v), Sophie Lecomte (ULB) (fol. 174v), Aminoel Meylan (UNIGE) (fol. 169r) and Simone Ventura (ULB) (fol. 163r).
Cologny, Bodmer, 168 and Vatican, Reg. Lat., 1616, Chanson d'Otinel, transcribed by J.-B. Camps (ENC) from the Geste project (https://github.com/Jean-Baptiste-Camps/Geste)
University of Pennsylvania, codex 660, Pèlerinage de Mademoiselle Sapience, transcribed by Ariane Pinche (ENC)
University of Pennsylvania, codex 909, Énéide, transcribed by Lucien Dugaz (ENC)

As the data come from different projects, the transcriptions have been standardized to strengthen HTR models. We chose a graphemic transcription method, following D. Stutzmann's definitions (see bibliography), so that each sign in the image corresponds to a sign in our text: all abbreviations are kept, and u/v and i/j are not distinguished. Spaces are not represented homogeneously in the dataset: some transcriptions reproduce the manuscript spacing, while others use lexical spaces. It must be stressed that spaces are the most important source of error in medieval HTR models.

Most of the transcriptions follow the layout segmentation of the SegmOnto ontology (https://github.com/SegmOnto/examples), separating the main column, margin, numbering, drop capital, etc.

To ensure the quality of the data, a continuous integration workflow (GitHub Actions) has been put in place. It checks the segmentation vocabulary (SegmentoKraken, XML schema validator segmentoAltoValidator.xsd) as well as the homogeneity of the characters used in the dataset, through a list of authorized signs and a translation table (table.csv) with ChocoMufin.
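The character homogeneity check can be pictured with a minimal Python sketch. It is not ChocoMufin's actual interface: the two-column table layout, the sample mappings, the list of authorized signs, and the function names load_table, normalize and unknown_chars are all assumptions made for illustration, and the real table.csv may be organized differently.

```python
import csv
import io

# Hypothetical translation table: the column names and the mappings are
# illustrative only and do not reproduce the project's table.csv.
SAMPLE_TABLE = """char,replacement
ſ,s
’,'
"""

# Hypothetical list of authorized signs, for illustration only.
ALLOWED = set("abcdefghijklmnopqrstuvwxyz' ")


def load_table(text):
    """Read a two-column CSV into a {character: replacement} dictionary."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["char"]: row["replacement"] for row in reader}


def normalize(line, table):
    """Replace every character that has an entry in the translation table."""
    return "".join(table.get(ch, ch) for ch in line)


def unknown_chars(line, allowed):
    """Return the sorted characters of a line that are not authorized."""
    return sorted({ch for ch in line if ch not in allowed})


if __name__ == "__main__":
    table = load_table(SAMPLE_TABLE)
    line = normalize("li ſeint confeſſor§", table)
    print(line)
    print(unknown_chars(line, ALLOWED))
```

In a continuous integration run, a non-empty result from a check of this kind would typically fail the build, so that unexpected signs are either mapped in the table or added to the authorized list before the data are merged.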

Related Organizations

École Nationale des Chartes
France
PSL Research University
France

Keywords

kraken_pytorch

Research products

CATMuS Medieval (2023, IsContinuedBy)
CATMuS Medieval (2025, IsContinuedBy)
HTR-United/cremma-medieval: 1.0.1 Bicerin (DOI) (2021, IsCompiledBy)
