The dataset consists of multilingual noisy corpora for named entity recognition (NER). The noisy versions are simulated from the CoNLL-02 (Spanish and Dutch) and CoNLL-03 (English) NER corpora. The original collections are re-OCRed, and four types of noise at two different levels are added in order to simulate a range of OCR output quality.

More precisely, we first extracted the raw texts and converted them into images. These images were then contaminated with noise types that commonly occur when using a scanner. We then extracted the OCRed data using the open-source Tesseract OCR engine, version 3.04.01. As a result of the noise inserted into the images, the OCRed data contains degradations. Finally, the original and noisy texts were aligned.

This archive contains three folders (one per language). Each folder contains the degraded images, the noisy texts extracted by the OCR engine, and their version aligned with the clean data.

These are the supplementary materials for the TPDL 2020 paper "Assessing and Minimizing the Impact of OCR Quality on Named Entity Recognition". If you use all or part of this resource, please cite this paper:

    @InProceedings{10.1007/978-3-030-54956-5_7,
      author="Hamdi, Ahmed and Jean-Caurant, Axel and Sid{\`e}re, Nicolas and Coustaty, Micka{\"e}l and Doucet, Antoine",
      editor="Hall, Mark and Mer{\v{c}}un, Tanja and Risse, Thomas and Duchateau, Fabien",
      title="Assessing and Minimizing the Impact of OCR Quality on Named Entity Recognition",
      booktitle="Digital Libraries for Open Knowledge",
      year="2020",
      publisher="Springer International Publishing",
      address="Cham",
      pages="87--101",
      isbn="978-3-030-54956-5"
    }

Acknowledgments

This work has been supported by the European Union's Horizon 2020 research and innovation programme under grant 770299 [NewsEye](https://www.newseye.eu/).
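To illustrate the final alignment step mentioned in the description (pairing original tokens with their OCRed counterparts), here is a minimal sketch using Python's standard `difflib`. It is not the procedure used to build this archive, and it does not assume anything about the archive's file format: the function name `align_tokens`, the gap marker `@@@`, and the noisy sentence are all hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher


def align_tokens(clean, noisy, gap="@@@"):
    """Pair clean tokens with OCRed tokens.

    Positions with no counterpart on one side are filled with `gap`
    (a hypothetical marker, not one taken from the archive).
    """
    matcher = SequenceMatcher(a=clean, b=noisy, autojunk=False)
    pairs = []
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        a, b = clean[i1:i2], noisy[j1:j2]
        # Pad the shorter side so every token gets a partner.
        width = max(len(a), len(b))
        a = a + [gap] * (width - len(a))
        b = b + [gap] * (width - len(b))
        pairs.extend(zip(a, b))
    return pairs


# A clean sentence and an invented OCR-like degradation of it.
clean = "EU rejects German call to boycott British lamb .".split()
noisy = "EU rejccts German ca11 to boycott British lamb".split()

pairs = align_tokens(clean, noisy)
for clean_tok, noisy_tok in pairs:
    print(f"{clean_tok}\t{noisy_tok}")
```

Token-level alignment like this keeps the NER label of each clean token attached to whatever the OCR engine produced in its place. When OCR merges or splits many tokens, a character-level alignment would likely be more robust.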
OCR, named entity recognition, noisy, degradation
| Indicator | Value | Description |
| --- | --- | --- |
| Selected citations | 0 | These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). |
| Popularity | Average | This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. |
| Influence | Average | This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). |
| Impulse | Average | This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. |
| Views | 32 | |
| Downloads | 3 | |

Views provided by UsageCounts
Downloads provided by UsageCounts