
Synthetic dataset for multi-script text line recognition

Name: Synthetic dataset for multi-script text line recognition
Creator: NAJEM-MEYER, SVEN
Keywords: Machine Learning, Document Analysis

Research data > Dataset | 09 Feb 2025 | French | Publisher: Zenodo | Funded by: SNSF | How does a classical hero...

Authors: NAJEM-MEYER, SVEN

doi: 10.5281/zenodo.14840349, 10.5281/zenodo.14840348

Abstract

Optical Character Recognition (OCR) systems frequently encounter difficulties when processing rare or ancient scripts, especially when they occur in historical contexts involving multiple writing systems. These challenges often force researchers to fine-tune existing OCR models or train new ones tailored to their specific needs. To support these efforts, we introduce a synthetic dataset comprising 6.2 million lines, specifically geared towards mixed polytonic Greek and Latin scripts. Augmented with artificially degraded lines, the dataset yields strong results when used to train historical OCR models. This resource can be used for both training and testing, and is particularly valuable for researchers working with ancient Greek and limited annotated data. The software used to generate this dataset is linked below on our Git. This is a sample; please contact us if you would like access to the whole dataset.

Impact by BIP!

	citations: 0
	popularity: Average
	influence: Average
	impulse: Average

Funded by

SNSF | How does a classical hero die in the digital age? Using Sophocles’ Ajax to create a commentary on commentaries