Name: Mixed Model OCR Training on Historical Latin Script for Out-of-the-Box Recognition and Finetuning
Keywords: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition

Type: Publication (Article, Preprint, Other literature type)
Date: 05 Sep 2021
Embargo end date: 01 Jan 2021
Publisher: ACM
Journal: The 6th International Workshop on Historical Document Imaging and Processing

Authors: Springmann, Uwe; Buettner, Andreas; Noeth, Maximilian; Reul, Christian; Wick, Christoph; Wehner, Maximilian

doi: 10.1145/3476887.3476910, 10.48550/arxiv.2106.07881

arXiv: http://arxiv.org/abs/2106.07881

Abstract

In order to apply Optical Character Recognition (OCR) to historical printings of Latin script fully automatically, we report on our efforts to construct a widely-applicable polyfont recognition model yielding text with a Character Error Rate (CER) around 2% when applied out-of-the-box. Moreover, we show how this model can be further finetuned to specific classes of printings with little manual and computational effort. The mixed or polyfont model is trained on a wide variety of materials, in terms of age (from the 15th to the 19th century), typography (various types of Fraktur and Antiqua), and languages (among others, German, Latin, and French). To optimize the results we combined established techniques of OCR training like pretraining, data augmentation, and voting. In addition, we used various preprocessing methods to enrich the training data and obtain more robust models. We also implemented a two-stage approach which first trains on all available, considerably unbalanced data and then refines the output by training on a selected more balanced subset. Evaluations on 29 previously unseen books resulted in a CER of 1.73%, outperforming a widely used standard model with a CER of 2.84% by almost 40%. Training a more specialized model for some unseen Early Modern Latin books starting from our mixed model led to a CER of 1.47%, an improvement of up to 50% compared to training from scratch and up to 30% compared to training from the aforementioned standard model. Our new mixed model is made openly available to the community.
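The error rates quoted in the abstract can be related to each other with a short calculation. The following is a minimal sketch, assuming the common definition of CER as the character-level edit distance divided by the length of the ground truth; the evaluation tooling actually used by the authors may normalize differently. The function names `levenshtein`, `cer`, and `relative_reduction` are illustrative and do not come from the paper or its software.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate (assumed definition): edit distance
    divided by the number of characters in the reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers the error of `baseline`."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Toy example: one wrong character in a four-character line.
    print(f"CER: {cer('abcd', 'abxd'):.2%}")
    # CER values reported in the abstract: 2.84% (standard model)
    # versus 1.73% (mixed model) on the 29 unseen books.
    print(f"Relative reduction: {relative_reduction(2.84, 1.73):.1%}")
```

With the reported figures, the relative reduction is (2.84 − 1.73) / 2.84 ≈ 39.1%, which matches the abstract's "almost 40%".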

submitted to HIP'21

Related Organizations

University Hospital Würzburg, Germany
Ludwig-Maximilians-Universität München, Germany
University of Würzburg, Germany

Related research products (7)

ocrd_olena: software on GitHub (IsRelatedTo)
calamari: software on GitHub (IsRelatedTo)
calamari_models_experimental: software on GitHub (IsRelatedTo)
archiscribe-corpus: software on GitHub (IsRelatedTo)
sbb_binarization: software on GitHub (IsRelatedTo)
ocropus-model_fraktur: software on GitHub (IsRelatedTo)
ocrodeg: software on GitHub (IsRelatedTo)

Impact by BIP!

Selected citations: 6. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
Popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
Impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.