Dataset . 2025
License: CC 0
Data sources: ZENODO
https://dx.doi.org/10.7479/kha...
Data sources: Datacite

ELIE – Entomological Label Information Extraction

Authors: Belot, Margot; Tuberosa, Joël; Preuss, Leonardo; Svezhentseva, Olha; Claessen, Magdalena; Bölling, Christian; Schuster, Franziska; +1 Authors

Abstract

Natural history museums curate billions of insect specimens, forming a vast but underutilized resource for biodiversity research. While digitization initiatives have increased the availability of high-resolution specimen images, extracting structured metadata from specimen labels remains a significant bottleneck, often requiring manual transcription. To address this challenge, we developed ELIE (Entomological Label Information Extraction), a semi-automated pipeline that combines computer vision, convolutional neural networks (CNNs), optical character recognition (OCR), and clustering algorithms to streamline the extraction of entomological label data.

ELIE operates in three stages:

1. Label detection and classification (e.g., printed vs. handwritten)
2. OCR-based text extraction from printed labels using Tesseract and the Google Vision API
3. Text-based clustering of OCR output using the K-Medoids algorithm at a 0.9 similarity threshold, allowing for optional human validation of clustered outliers

This dataset release supports the ELIE pipeline and includes annotated JPEG images and corresponding XML files, structured into training (80%), validation (20%), and testing (10%) subsets. All annotations are based on the “label” class, enabling robust model training for multi-label image (MLI) detection and object segmentation.

In addition to image and XML data, this repository includes derived OCR output files (.json) and clustering results (.csv) for selected datasets. These resources facilitate downstream tasks such as label text parsing, automated record linkage, metadata deduplication, and large-scale content analysis.
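As a rough illustration of how the JPEG/XML annotation pairs described above might be read, the following Python sketch extracts bounding boxes for the "label" class. It assumes a Pascal VOC-style XML layout (`object`, `name`, `bndbox`), which this record does not confirm; the file name and coordinates in `SAMPLE_XML` are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation, ASSUMING a Pascal VOC-style layout.
# The actual ELIE XML schema may differ; adjust the tag names as needed.
SAMPLE_XML = """<annotation>
  <filename>specimen_0001.jpg</filename>
  <object>
    <name>label</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>80</ymax></bndbox>
  </object>
  <object>
    <name>label</name>
    <bndbox><xmin>150</xmin><ymin>30</ymin><xmax>260</xmax><ymax>95</ymax></bndbox>
  </object>
</annotation>"""


def read_boxes(xml_text):
    """Return (image filename, [(class name, (xmin, ymin, xmax, ymax)), ...])."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        coords = tuple(
            int(box.findtext(key)) for key in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((obj.findtext("name"), coords))
    return root.findtext("filename"), boxes


filename, boxes = read_boxes(SAMPLE_XML)
for name, coords in boxes:
    print(filename, name, coords)
```

If the files follow a different schema, only the tag names inside `read_boxes` should need to change.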
The data spans seven digitization projects, totaling over 43,000 labeled images from diverse insect orders and geographic regions, including:

• AntWeb – Formicidae labels from global collections
• Bees Bytes – Apoidea labels digitized by the Museum für Naturkunde Berlin
• LEPPHIL – Lepidoptera labels from the Philippines by the Museum für Naturkunde Berlin
• MCZ_ENT_Boston – Hexapoda labels from the Museum of Comparative Zoology, Harvard
• MfN_LEP_SEASIA – Pyraloidea labels from Southeast Asia digitized by the Museum für Naturkunde Berlin
• Picturae_MfN – Hexapoda labels from the Museum für Naturkunde Berlin
• USNM_COL_CAM – Beetle labels from South and Central America digitized by the Smithsonian National Museum of Natural History

Benchmarking on this diverse dataset showed that ELIE successfully detected and clustered up to 98% of printed labels, significantly reducing manual effort in digitization workflows. By integrating AI-driven methods with structured OCR output and automated clustering, our approach enhances label metadata capture, accelerates biodiversity data accessibility, and supports scalable research in ecology, taxonomy, and biodiversity informatics.
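To make the text-clustering stage more concrete, here is a minimal Python sketch of grouping OCR strings by similarity at the 0.9 threshold quoted in the abstract. It is not the ELIE implementation: it swaps in a simple greedy grouping with `difflib.SequenceMatcher` in place of K-Medoids, and the sample strings and function names are hypothetical.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.9  # threshold value quoted in the abstract


def similarity(a, b):
    """Return a 0..1 similarity ratio between two lightly normalized strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def group_by_similarity(texts, threshold=SIMILARITY_THRESHOLD):
    """Greedy grouping (a stand-in for K-Medoids, NOT the same algorithm):
    each text joins the first group whose first member is at least
    `threshold` similar; otherwise it starts a new group."""
    groups = []
    for text in texts:
        for group in groups:
            if similarity(text, group[0]) >= threshold:
                group.append(text)
                break
        else:
            groups.append([text])
    return groups


def medoid(group):
    """Return the member with the highest total similarity to its group."""
    return max(group, key=lambda t: sum(similarity(t, other) for other in group))


# Hypothetical OCR output: two near-duplicate readings and one unrelated label.
ocr_texts = [
    "Museum für Naturkunde Berlin",
    "Museum fur Naturkunde Berlin",
    "Philippines, Luzon 1912",
]

groups = group_by_similarity(ocr_texts)
for group in groups:
    print(len(group), "label(s); representative:", medoid(group))
```

Groups that end up with a single member correspond loosely to the outliers that the abstract suggests routing to optional human validation.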

Keywords

OCR, label extraction, AI, museum collections, CNN, insect digitization

Related to Research communities
Italian National Biodiversity Future Center