
doi: 10.7479/khac-x956
Natural history museums curate billions of insect specimens, forming a vast but underutilized resource for biodiversity research. While digitization initiatives have increased the availability of high-resolution specimen images, extracting structured metadata from specimen labels remains a significant bottleneck, often requiring manual transcription. To address this challenge, we developed ELIE (Entomological Label Information Extraction), a semi-automated pipeline that combines computer vision, convolutional neural networks (CNNs), optical character recognition (OCR), and clustering algorithms to streamline the extraction of entomological label data.

ELIE operates in three stages:
1. Label detection and classification (e.g., printed vs. handwritten)
2. OCR-based text extraction from printed labels using Tesseract and the Google Vision API
3. Text-based clustering of OCR output using the K-Medoids algorithm at a 0.9 similarity threshold, allowing for optional human validation of clustered outliers

This dataset release supports the ELIE pipeline and includes annotated JPEG images and corresponding XML files, structured into training (80%), validation (20%), and testing (10%) subsets. All annotations are based on the “label” class, enabling robust model training for multi-label image (MLI) detection and object segmentation. In addition to image and XML data, this repository includes derived OCR output files (.json) and clustering results (.csv) for selected datasets. These resources facilitate downstream tasks such as label text parsing, automated record linkage, metadata deduplication, and large-scale content analysis.
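The third stage, grouping near-identical OCR transcriptions so that only outliers need human review, can be illustrated with a minimal sketch. This is not the ELIE implementation: it replaces K-Medoids with a simple greedy grouping, uses Python's standard-library `difflib.SequenceMatcher` as an assumed similarity measure, and the function names and sample label texts are hypothetical. Only the 0.9 threshold is taken from the description above.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two strings (case-insensitive).

    Assumption: difflib's ratio stands in for whatever text similarity
    measure the real pipeline uses.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_labels(texts, threshold=0.9):
    """Greedy grouping (a stand-in for K-Medoids, not the same algorithm).

    Each text joins the first cluster whose first member is at least
    `threshold` similar; otherwise it starts a new cluster.
    """
    clusters = []
    for text in texts:
        for members in clusters:
            if similarity(text, members[0]) >= threshold:
                members.append(text)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append([text])
    return clusters


def medoid(members):
    """Pick the member with the highest total similarity to all members."""
    return max(members, key=lambda m: sum(similarity(m, o) for o in members))


def split_outliers(clusters):
    """Separate multi-member clusters from singletons flagged for review."""
    grouped = [c for c in clusters if len(c) > 1]
    outliers = [c[0] for c in clusters if len(c) == 1]
    return grouped, outliers


# Hypothetical OCR outputs: three near-duplicates and one unrelated label.
sample_texts = [
    "Philippinen, Luzon, leg. Semper",
    "Philippinen, Luzon, leg. Semper.",
    "Philippinen, Luzon, leg. Sempor",
    "Brasil, Rio de Janeiro 1923",
]
grouped, outliers = split_outliers(cluster_labels(sample_texts))
for members in grouped:
    print(medoid(members), len(members))
print("needs review:", outliers)
```

In a workflow like this, each multi-member cluster could be transcribed once through its medoid, while singleton outliers are routed to a human validator.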
The data spans seven digitization projects, totaling over 43,000 labeled images from diverse insect orders and geographic regions, including:
• AntWeb – Formicidae labels from global collections
• Bees Bytes – Apoidea labels digitized by the Museum für Naturkunde Berlin
• LEPPHIL – Lepidoptera labels from the Philippines by the Museum für Naturkunde Berlin
• MCZ_ENT_Boston – Hexapoda labels from the Museum of Comparative Zoology, Harvard
• MfN_LEP_SEASIA – Pyraloidea labels from Southeast Asia digitized by the Museum für Naturkunde Berlin
• Picturae_MfN – Hexapoda labels from the Museum für Naturkunde Berlin
• USNM_COL_CAM – Beetle labels from South and Central America digitized by the Smithsonian National Museum of Natural History

Benchmarking on this diverse dataset showed that ELIE successfully detected and clustered up to 98% of printed labels, significantly reducing manual effort in digitization workflows. By integrating AI-driven methods with structured OCR output and automated clustering, our approach enhances label metadata capture, accelerates biodiversity data accessibility, and supports scalable research in ecology, taxonomy, and biodiversity informatics.
OCR, label extraction, AI, museum collections, CNN, insect digitization
