research data . Dataset . 2021

Data set of the paper "Publishing an OCR ground truth data set for reuse in an unclear copyright setting"

Lassner, David; Coburger, Julius; Neudecker, Clemens; Baillot, Anne
Open Access
  • Published: 07 May 2021
  • Publisher: Zenodo
Abstract
The data set consists of one METS file for each PDF used for transcription and a directory, data/page_xml, that contains the ground truth transcriptions in PAGE-XML format. In parallel to the data set publication, a data paper containing a detailed description of the data set will be published. We will link to it as soon as it is available. The corresponding source code is available at https://github.com/millawell/ocr-data/tree/1.1
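As an illustration of how the transcriptions in data/page_xml might be read, the following is a minimal sketch, not the authors' tooling. It assumes that each PAGE-XML file stores line-level text in TextLine/TextEquiv/Unicode elements (the usual PAGE layout) and that the files carry an .xml extension; neither detail is confirmed by this record. The names local_name, extract_lines, read_directory, and SAMPLE are hypothetical. Tags are matched by local name so that the sketch does not depend on a particular PAGE schema version.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_lines(root):
    """Return the line-level transcription of every TextLine under root."""
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Only direct children are inspected, so TextEquiv elements that
        # belong to nested Word elements are not picked up here.
        for child in element:
            if local_name(child.tag) == "TextEquiv":
                unicode_el = next(
                    (s for s in child if local_name(s.tag) == "Unicode"), None
                )
                if unicode_el is not None:
                    lines.append(unicode_el.text or "")
                break  # use the first TextEquiv of the line only
    return lines


def read_directory(directory="data/page_xml"):
    """Map each file name to its list of transcribed lines.

    Assumption: the files end in '.xml'; adjust the pattern if they do not.
    """
    return {
        path.name: extract_lines(ET.parse(path).getroot())
        for path in sorted(Path(directory).glob("*.xml"))
    }


# Hand-written sample for illustration only; it is not taken from the data set.
SAMPLE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.png" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="l1">
        <TextEquiv><Unicode>Erste Zeile</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="l2">
        <TextEquiv><Unicode>Zweite Zeile</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""

sample_lines = extract_lines(ET.fromstring(SAMPLE))
```

Calling read_directory() from the root of the downloaded data set would then give one list of lines per page file, provided the assumptions above hold for the published files.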
Subjects
ACM Computing Classification System: ComputingMethodologies_DOCUMENTANDTEXTPROCESSING
free text keywords: OCR ground-truth
Download from
Open Access
Zenodo
Dataset . 2021
Provider: Datacite