
Photographing fiscal receipts has become increasingly common with the rise of online storage and accounting services. However, images captured in uncontrolled environments often contain distortions that compromise Optical Character Recognition (OCR) and can render the output text unreadable. To address this problem, we propose an open-source expert filtering approach based on low-level features that discards low-quality invoice images, selects high-quality images, and flags images that require preparation before OCR processing.

The dataset used in this work is an extension of the Express Expense SRD dataset, which consists of 200 hand-photographed images of restaurant receipts. The free version of the original dataset has no OCR task labels. Since these labels are needed to calculate OCR accuracy and to analyze the effects of the proposed approach, we created a new version of the existing dataset with manual annotations for the receipts and for the four corners of each document.

More information can be found at the following link: https://github.com/MaVILab-UFV/Filtering-Preparation-for-OCR_SIBGRAPI-2024

If you use this data, please cite our paper as follows:

Auad, Manoela; Alves, Sarah; Kakizaki, Gabriel; Reis, Julio C. S.; Silva, Michel. A Filtering and Image Preparation Approach to Enhance OCR for Fiscal Receipts. In 37th Conference on Graphics, Patterns and Images (SIBGRAPI), 2024.
