IGNITE data toolkit: a tissue and cell-level annotated H&E and PD-L1 histopathology image dataset in non-small cell lung cancer

Authors: Spronck, Joey Matheus Antonius; van Eekelen, Leander; van Midden, Dominique; Bogaerts, Joep; Tessier, Leslie; Dechering, Valerie; Demirel-Andishmand, Muradije; +12 Authors

doi: 10.5281/zenodo.17735903, 10.5281/zenodo.15674784, 10.5281/zenodo.15674785


Abstract

Repository content

This repository contains five zip files, each with the following directory structure when unpacked:

    .
    └── {images,annotations,models,inference,figures}/
        ├── he/          # Files pertaining to the H&E tissue compartment segmentation dataset...
        └── pdl1/
            ├── nuclei/  # ... the IHC nuclei detection dataset...
            └── pdl1/    # ... and the PD-L1 positive tumor cell detection dataset

The zip files contain the following:

- 'annotations.zip' contains single-channel PNG masks for the H&E tissue compartment segmentation dataset (every pixel is labeled with a positive integer; the label map is under `annotations/he/he_label_map.json`). The zip file also contains MS COCO-formatted JSON files for the IHC nuclei and PD-L1 positive tumor cell detection datasets.
- 'figures.zip' contains the inference visualizations and evaluation metric figures from our paper.
- 'images.zip' contains PNG images of the ROIs released in the toolkit. Images were extracted at a resolution of 0.5 micrometers per pixel and cropped to the extent of the ROI.
- 'inference.zip' contains the raw inference output of the models for the respective datasets.
- 'models.zip' contains the weights of the final models used for the technical validation of the toolkit.

File ID nomenclature

All patients were assigned a unique anonymous patient ID, incrementing from 1. Images and masks are named after the patient, dataset and ROI they originate from: the patient ID, dataset and ROI ID are joined by underscores and followed by the file extension, e.g. 'patient1_he_roi1.png'. Some patients occur in multiple datasets, but they always keep the same anonymous patient ID. Their ROIs, however, always differ across datasets: 'patient1_he_roi1.png', 'patient1_nuclei_roi1.png' and 'patient1_pdl1_roi1.png' all refer to separate, non-overlapping regions.
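The naming convention described under 'File ID nomenclature' can be unpacked programmatically. The following is a minimal Python sketch, assuming that every file name follows the shape of the example 'patient1_he_roi1.png'. The regular expression, the `FileId` type and the `parse_file_id` function are hypothetical helpers written for illustration and are not part of the toolkit.

```python
import re
from typing import NamedTuple

# Assumed pattern, inferred only from the example 'patient1_he_roi1.png'
# and the three dataset folders (he, nuclei, pdl1) named in the description.
FILE_ID_PATTERN = re.compile(r"^patient(\d+)_(he|nuclei|pdl1)_roi(\d+)\.(\w+)$")


class FileId(NamedTuple):
    """Hypothetical container for the parts of a file name."""
    patient_id: int
    dataset: str
    roi_id: int
    extension: str


def parse_file_id(filename: str) -> FileId:
    """Split a file name into patient ID, dataset, ROI ID and extension."""
    match = FILE_ID_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unrecognized file name: {filename!r}")
    patient, dataset, roi, extension = match.groups()
    return FileId(int(patient), dataset, int(roi), extension)


if __name__ == "__main__":
    print(parse_file_id("patient1_he_roi1.png"))
```

Because ROI numbers restart per dataset, a ROI is only identified by the combination of patient ID, dataset and ROI ID, which is why the sketch keeps all three fields together.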
Reader study answers for IHC nuclei and PD-L1 positive tumor cell detection datasets

For the IHC nuclei and PD-L1 positive tumor cell detection test sets, we provide the annotations of all readers to allow comparative studies between AI and experts (see `annotations/pdl1/nuclei/nuclei_test_set_all_readers.json` and `annotations/pdl1/pdl1/pdl1_test_set_all_readers.json`). Moreover, we propose to use a single reader per set as the canonical annotator. This canonical annotator serves as a proposed reference standard for future benchmarks and lets users of the data report their own benchmarking results concisely. For this purpose, we chose the three readers with the best combined ranking on two outcomes: i) the mean F1 score against the other readers and ii) the F1 score against the respective baseline algorithm. The canonical annotators are:

- R4 for the IHC nuclei detection test set (highest mean reader-reader F1 score: 0.87; tied highest F1 score versus baseline model: 0.87)
- P2 for the RUMC cases of the PD-L1 detection test set (highest mean reader-reader F1 score: 0.67; highest F1 score versus baseline model: 0.7)
- P5 for the SCDC cases of the PD-L1 detection test set (highest mean reader-reader F1 score: 0.765; second highest F1 score versus baseline model: 0.59)

We release the training/validation/test set annotations with canonical readers for the IHC nuclei and PD-L1 detection datasets in `annotations/nuclei/nuclei_annotations.json` and `annotations/pdl1/pdl1/pdl1_annotations.json`.

Miscellaneous points

For the H&E tissue segmentation dataset, classes were intentionally split into granular categories. Users may group classes according to their specific interests and task requirements, with the TIL biomarker use case as an example.

Dataset overview

Lastly, we include a 'data_overview.csv' file that documents metadata per ROI. The table below lists the contents of each column.

| Column | Contents |
|---|---|
| 'patient_id' | Unique anonymous patient ID. See 'File ID nomenclature'. |
| 'roi_id' | ROI ID. See 'File ID nomenclature'. |
| 'name' | Full name of the ROI, e.g. 'patient1_he_roi1' |
| 'task' | Dataset label: 'he_tissue_segmentation', 'nuclei_detection' or 'pdl1_detection' |
| 'source' | (Hospital) data source: 'rumc', 'scdc' or 'tcga' |
| 'specimen_type' | Specimen type: 'biopsy', 'resection' or 'tissue_micro_array' |
| 'organ' | Organ the tissue originated from |
| 'histological_subtype' | NSCLC subtype of the parent slide (not necessarily of the ROI, as it may not contain tumor cells) |
| 'stain' | 'H&E' or 'PDL1_{monoclone}' |
| 'scanner' | Scanner used to digitize the image |
| 'image_path' | Image path relative to 'data/' |
| 'annotation_path' | Annotation path relative to 'data/' |
| 'shape' | (width,height) shape of the ROI. Important caveat: for ROIs released with non-annotated context borders, this shape refers only to the annotated part of the image. |
| 'area_mm2' | Annotated ROI area in mm^2 |
| 'split' | Dataset split: train/validation/test |
| 'validation_fold' | For the H&E tissue compartment segmentation dataset, the validation fold in 5-fold cross-validation |
| 'original_tcga_id' | For cases originating from the TCGA dataset, their original TCGA ID |

Changelog of Zenodo repository versions

| Version | Date | Changes |
|---|---|---|
| v1 | 2025-06-20 | Initial version |
| v2 | 2025-11-27 | Added missing validation fold information for 5 H&E ROIs. Made small textual changes/clarifications in repository descriptions. Added link to preprint. |
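The columns of 'data_overview.csv' described under 'Dataset overview' lend themselves to selecting ROIs by task and split. Below is a minimal Python sketch using only the standard library. The sample rows are made up for illustration and show only a subset of the documented columns, and `select_rois` is a hypothetical helper rather than part of the toolkit.

```python
import csv
import io

# Hypothetical sample rows: values are invented for illustration and only
# a subset of the documented columns is included.
SAMPLE_CSV = """patient_id,roi_id,name,task,source,split
1,1,patient1_he_roi1,he_tissue_segmentation,rumc,train
1,1,patient1_pdl1_roi1,pdl1_detection,rumc,test
2,1,patient2_nuclei_roi1,nuclei_detection,scdc,test
"""


def select_rois(csv_file, task, split):
    """Return the 'name' of every ROI matching the given task and split."""
    reader = csv.DictReader(csv_file)
    return [
        row["name"]
        for row in reader
        if row["task"] == task and row["split"] == split
    ]


if __name__ == "__main__":
    names = select_rois(io.StringIO(SAMPLE_CSV), task="pdl1_detection", split="test")
    print(names)
```

With the released file, the in-memory sample would be replaced by an opened file handle, for example `open("data_overview.csv", newline="")`, assuming the file uses a comma delimiter and a header row with the column names listed in the table.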

We introduce the IGNITE data toolkit, a multi-stain, multi-centric, and multi-scanner dataset of annotated non-small cell lung cancer (NSCLC) digital pathology images. We publicly release 887 fully annotated regions of interest (ROIs) from 155 unique patients across three complementary tasks:

- Multi-class semantic segmentation of tissue compartments in H&E-stained slides, with 16 classes spanning primary and metastatic NSCLC
- Nuclei detection in PD-L1 stained immunohistochemistry (IHC)
- PD-L1 positive tumor cell detection in PD-L1 IHC

Related Organizations

University of Verona
Italy
Biopticka Laborator (Czechia)
Czech Republic
Karolinska Institute
Sweden
Ospedale Sacro Cuore Don Calabria
Italy
Radboud University Nijmegen Medical Centre
Netherlands


Keywords

Machine Learning, Carcinoma, Non-Small-Cell Lung/pathology


Related to Research communities

Netherlands Research Portal

Cancer Research