ZENODO
Dataset . 2019
License: CC BY
Data sources: Datacite

This Research product is the result of merged Research products in OpenAIRE.


Dataset of ICDAR 2019 Competition on Post-OCR Text Correction

Authors: Rigaud, Christophe; Doucet, Antoine; Coustaty, Mickaël; Moreux, Jean-Philippe;

Abstract

Corpus for the ICDAR2019 Competition on Post-OCR Text Correction (October 2019)
Christophe Rigaud, Antoine Doucet, Mickael Coustaty, Jean-Philippe Moreux
http://l3i.univ-larochelle.fr/ICDAR2019PostOCR
-------------------------------------------------------------------------------

These are the supplementary materials for the ICDAR 2019 paper "ICDAR 2019 Competition on Post-OCR Text Correction".

Please use the following citation:

@inproceedings{rigaud2019pocr,
  title="ICDAR 2019 Competition on Post-OCR Text Correction",
  author={Rigaud, Christophe and Doucet, Antoine and Coustaty, Mickael and Moreux, Jean-Philippe},
  year={2019},
  booktitle={Proceedings of the 15th International Conference on Document Analysis and Recognition (2019)}
}

Description

The corpus accounts for 22M OCRed characters along with the corresponding Gold Standard (GS). The documents come from different digital collections available, among others, at the National Library of France (BnF) and the British Library (BL). The corresponding GS comes both from BnF's internal projects and external initiatives such as Europeana Newspapers, IMPACT, Project Gutenberg, Perseus and Wikisource.

Distribution of the dataset

- ICDAR2019_Post_OCR_correction_training_18M.zip: 80% of the full dataset, provided to train participants' methods.
- ICDAR2019_Post_OCR_correction_evaluation_4M: 20% of the full dataset, used for the evaluation (with the Gold Standard made publicly available after the competition).
- ICDAR2019_Post_OCR_correction_full_22M: full dataset, made publicly available after the competition.

Special case for Finnish language

Material from the National Library of Finland (Finnish dataset FI > FI1) may not be re-shared on other websites. Please follow these guidelines to get and format the data from the original website.

1. Go to https://digi.kansalliskirjasto.fi/opendata/submit?set_language=en;
2. Download "OCR Ground Truth Pages (Finnish Fraktur) [v1]" (4.8GB) from the Digitalia (2015-17) package;
3. Convert the Excel file "~/metadata/nlf_ocr_gt_tescomb5_2017.xlsx" to Comma Separated Values format (.csv) using the "save as" function of a spreadsheet application (e.g. Excel, Calc) and copy it into "FI/FI1/HOWTO_get_data/input/";
4. Go to "FI/FI1/HOWTO_get_data/" and run "script_1.py" to generate the full "FI1" dataset in "output/full/";
5. Run "script_2.py" to split the "output/full/" dataset into "output/training/" and "output/evaluation/" subsets.

At the end of the process, you should have "training", "evaluation" and "full" folders with 1579528, 380817 and 1960345 characters respectively.

Licenses: free to use for non-commercial purposes, subject to the license of each source, detailed below

- BG1: IMPACT - National Library of Bulgaria: CC BY NC ND
- CZ1: IMPACT - National Library of the Czech Republic: CC BY NC SA
- DE1: Front pages of Swiss newspaper NZZ: Creative Commons Attribution 4.0 International (https://zenodo.org/record/3333627)
- DE2: IMPACT - German National Library: CC BY NC ND
- DE3: GT4Hist-dta19 dataset: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
- DE4: GT4Hist - EarlyModernLatin: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
- DE5: GT4Hist - Kallimachos: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
- DE6: GT4Hist - RefCorpus-ENHG-Incunabula: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
- DE7: GT4Hist - RIDGES-Fraktur: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
- EN1: IMPACT - British Library: CC BY NC SA 3.0
- ES1: IMPACT - National Library of Spain: CC BY NC SA
- FI1: National Library of Finland: no re-sharing allowed, follow the above section to get the data. (https://digi.kansalliskirjasto.fi/opendata)
- FR1: HIMANIS Project: CC0 (https://www.himanis.org)
- FR2: IMPACT - National Library of France: CC BY NC SA 3.0
- FR3: RECEIPT dataset: CC0 (http://findit.univ-lr.fr)
- NL1: IMPACT - National library of the Netherlands: CC BY
- PL1: IMPACT - National Library of Poland: CC BY
- SL1: IMPACT - Slovak National Library: CC BY NC

Text post-processing such as cleaning and alignment has been applied to the resources mentioned above, so the Gold Standard and the OCRs provided are not necessarily identical to the originals.

Structure

- **Content** [./lang_type/sub_folder/#.txt]
    - "[OCR_toInput] " => Raw OCRed text to be de-noised.
    - "[OCR_aligned] " => Aligned OCRed text.
    - "[ GS_aligned] " => Aligned Gold Standard text.

The aligned OCRed/GS texts are provided for training and test purposes. The alignment was made at the character level using "@" symbols. "#" symbols correspond to the absence of GS, either related to alignment uncertainties or related to unreadable characters in the source document. For a better view of the alignment, make sure to disable the "word wrap" option in your text editor.

The error rate and the quality of the alignment vary according to the nature and the state of degradation of the source documents. Periodicals (mostly historical newspapers), for example, have been reported to be especially challenging due to their complex layout and their original fonts. In addition, the quality of the Gold Standard also varies, as the dataset aggregates resources from different projects that each have their own annotation procedure, and it inevitably contains some errors.
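As an illustration of the file structure described above, the following Python sketch reads the three labelled lines of one document and computes a simple mismatch rate over the aligned texts. It is not part of the dataset or of the competition tooling. It assumes that each labelled text sits on a single line, that "@" acts as the alignment padding symbol, and that positions marked "#" in the Gold Standard should be skipped. The function names and the sample text are hypothetical, and the rate computed here is not the official competition metric.

```python
# Line prefixes as listed in the "Structure" section (trailing space included).
PREFIXES = {
    "ocr_input": "[OCR_toInput] ",
    "ocr_aligned": "[OCR_aligned] ",
    "gs_aligned": "[ GS_aligned] ",
}


def parse_record(text):
    """Return a dict with the three labelled texts found in one file.

    Assumption: each labelled text occupies exactly one line.
    """
    record = {}
    for line in text.split("\n"):
        for key, prefix in PREFIXES.items():
            if line.startswith(prefix):
                record[key] = line[len(prefix):]
    return record


def aligned_error_rate(ocr_aligned, gs_aligned):
    """Fraction of comparable aligned positions where OCR and GS differ.

    Positions where the Gold Standard is "#" (no GS available) are skipped.
    """
    if len(ocr_aligned) != len(gs_aligned):
        raise ValueError("aligned texts must have the same length")
    compared = 0
    errors = 0
    for ocr_char, gs_char in zip(ocr_aligned, gs_aligned):
        if gs_char == "#":
            continue
        compared += 1
        if ocr_char != gs_char:
            errors += 1
    return errors / compared if compared else 0.0


def strip_padding(aligned):
    """Remove "@" padding; lossy if the text contains a literal "@"."""
    return aligned.replace("@", "")


# Hypothetical sample in the documented layout (not taken from the corpus).
sample = (
    "[OCR_toInput] Tbe quiick fx\n"
    "[OCR_aligned] Tbe quiick f@x\n"
    "[ GS_aligned] The qu@ick fox\n"
)
record = parse_record(sample)
rate = aligned_error_rate(record["ocr_aligned"], record["gs_aligned"])
print(f"mismatch rate: {rate:.3f}")
print(strip_padding(record["gs_aligned"]))
```

Because both aligned lines have the same length, a position-by-position comparison is enough here and no edit-distance computation is needed. For real use, open the files with the appropriate encoding and check the evaluation page of the competition for the official metrics.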
ICDAR2019 competition

Information on the tasks, formats and evaluation metrics is detailed at: https://sites.google.com/view/icdar2019-postcorrectionocr/evaluation

References

- IMPACT, European Commission's 7th Framework Program, grant agreement 215064
- Uwe Springmann, Christian Reul, Stefanie Dipper, Johannes Baiter (2018). Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin.
- https://digi.nationallibrary.fi, Wiipuri, 31.12.1904, Digital Collections of National Library of Finland
- EU Horizon 2020 research and innovation programme grant agreement No 770299

Contact

- christophe.rigaud(at)univ-lr.fr
- antoine.doucet(at)univ-lr.fr
- mickael.coustaty(at)univ-lr.fr
- jean-philippe.moreux(at)bnf.fr

L3i - University of La Rochelle, http://l3i.univ-larochelle.fr
BnF - French National Library, http://www.bnf.fr

Corresponding paper: https://zenodo.org/record/3459116

Keywords

OCR, Text correction
