Industrial Label Dataset for Structured Information Extraction

Overview

This dataset supports the evaluation of structured information extraction approaches from industrial labels. It was designed to enable systematic comparison between classical OCR-based processing pipelines and Vision-Language Model (VLM) approaches. The dataset consists of three variants that progressively increase in visual complexity while sharing a consistent semantic structure and annotation schema. This gradual transition from controlled synthetic data to realistic captures allows targeted evaluation of how different visual conditions affect extraction performance.

Dataset Variants

1. Synthetic

Synthetically generated industrial label images providing idealized, artifact-free renderings. Labels were created using an automated generation script and reflect typical industrial logistics labeling scenarios with variable textual fields and layout structures.

Key properties:

- Full control over content and ground truth
- Wide layout variability: square, portrait, and landscape formats, varying canvas sizes
- Randomized typography, font sizes, alignments, and border styles
- Structured header regions (sender/recipient) and tabular middle sections
- Machine-readable elements: barcodes (with human-readable text) and QR codes
- Semantic field variability (e.g., "Quantity", "QTY", "Count", "Count number" all refer to the same field)
- Content generated using the Faker library for realistic logistics data (addresses, IDs, weights, etc.)

Each image has a corresponding JSON annotation file.

2. Augmented

The complete synthetic dataset with document-specific augmentation techniques applied, simulating degradations encountered in practical settings. Textual content is identical to the synthetic base; only the visual appearance changes. The augmentations were generated with the Augraphy tool.
Augmentation types:

- Double Exposure: simulates double exposure artifacts
- Letterpress and Dirty Drum: simulates letterpress printing and dirty drum roller artifacts
- Lighting Gradient and Shadowcast: simulates uneven lighting, gradients, and cast shadows

Note: Augmented images share annotations with their synthetic counterparts. No separate JSON files are included in the augmented subfolders; the annotations from the synthetic subset apply directly.

3. Real

Physically captured images created by printing a subset of synthetic labels and photographing them with an iPhone 11 Pro Max camera. This variant introduces realistic effects that are difficult to replicate digitally, including variations in lighting, perspective distortion, reflections, and camera-induced artifacts. Unlike the augmented variant, the photographed labels were independently generated with new content, extending the dataset beyond visual variations of existing samples. Each image has a corresponding JSON annotation file.

Annotation Format

Every image is annotated in a unified JSON format.
Each annotation file contains:

- label_id: unique identifier for the label instance
- image_file: filename of the corresponding image
- objects: list of annotated elements, each with:
  - type: element type (e.g., "text", "barcode", "qrcode")
  - value: semantic content (literal text or decoded symbol sequence)
  - bbox: bounding box as [x_min, y_min, x_max, y_max] in pixel coordinates
- metadata: image dimensions, generation flag, and creation timestamp

License

Creative Commons Attribution 4.0 International

Use Cases

- Detect and extract text fields, barcodes, and QR codes from structured industrial documents
- Analyze and parse diverse label layouts including headers, tabular sections, and mixed content regions
- Evaluate model robustness under visual degradation such as noise, blur, and lighting variations

Acknowledgements

This dataset was created during a master's thesis at the University of Leipzig, conducted within the ScaDS.AI Center for Scalable Data Analytics and Artificial Intelligence. The work was carried out in collaboration with Deutsche Telekom MMS as part of the IPCEI-CIS (Important Project of Common European Interest on Next Generation Cloud Infrastructure and Services) project. The synthetic label generation script was originally developed by Rafael Gagarin within the IPCEI-CIS project team. For this dataset, the script was adapted to additionally save precise ground-truth annotations alongside the generated images.
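
Example: reading an annotation

As an illustration of the annotation format described above, the following Python sketch parses one annotation, groups its objects by type, and derives box sizes from the [x_min, y_min, x_max, y_max] coordinates. The sample record, its values, and the helper names (group_by_type, bbox_size, canonical_field) are hypothetical and not part of the dataset; the keys inside metadata are not specified here, so the sample leaves that object empty. The alias set uses the field names listed under the synthetic variant, and the canonical name "quantity" is an assumption.

```python
import json
from collections import defaultdict

# Hypothetical record following the documented schema; all values are invented.
SAMPLE = """
{
  "label_id": "label_0001",
  "image_file": "label_0001.png",
  "objects": [
    {"type": "text", "value": "QTY", "bbox": [40, 200, 180, 230]},
    {"type": "barcode", "value": "0123456789", "bbox": [40, 260, 360, 340]},
    {"type": "qrcode", "value": "example-payload", "bbox": [400, 240, 500, 340]}
  ],
  "metadata": {}
}
"""

# Field-name variants named in the dataset description.
# Mapping them to the canonical name "quantity" is an assumption.
QUANTITY_ALIASES = {"quantity", "qty", "count", "count number"}


def group_by_type(annotation):
    """Collect annotated objects into lists keyed by their 'type'."""
    grouped = defaultdict(list)
    for obj in annotation["objects"]:
        grouped[obj["type"]].append(obj)
    return dict(grouped)


def bbox_size(bbox):
    """Return (width, height) in pixels for [x_min, y_min, x_max, y_max]."""
    x_min, y_min, x_max, y_max = bbox
    return x_max - x_min, y_max - y_min


def canonical_field(name):
    """Map a field-name variant to one canonical key (case-insensitive)."""
    key = name.strip().rstrip(":").strip().lower()
    return "quantity" if key in QUANTITY_ALIASES else key


annotation = json.loads(SAMPLE)
grouped = group_by_type(annotation)

for obj in annotation["objects"]:
    width, height = bbox_size(obj["bbox"])
    print(f'{obj["type"]:8s} {width}x{height} {obj["value"]!r}')
```

For an augmented image, the same parsing would apply to the JSON file of its synthetic counterpart, since the augmented subfolders contain no annotation files of their own.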


Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering
