
Overview

This dataset is released alongside our paper, "MIPHEI-ViT: Multiplex Immunofluorescence Prediction from H&E Images using ViT Foundation Models", in which we introduce a model that predicts mIF marker expression directly from H&E morphology using vision transformer (ViT) foundation models.

We provide a carefully preprocessed version of two existing public datasets, tailored to the task of H&E-to-mIF image translation. The dataset contains aligned and restained Hematoxylin & Eosin (H&E) and multiplex immunofluorescence (mIF) image tiles. These preprocessed tiles, along with the associated metadata, enable full reproducibility of our experiments (model training, evaluation, and cell-level analysis) and can serve as a ready-to-use resource for future work in H&E-to-mIF translation and multimodal learning.

Source Datasets

The dataset is derived from the following open-source datasets, which contain aligned and restained H&E and mIF images:

ORION-CRC
- Source: labsyspharm/ORION-CRC – Zenodo
- Citation:
  - Lin J. labsyspharm/ORION-CRC [dataset]. Zenodo. 2023. doi:10.5281/zenodo.7637988
  - Lin, J., et al. "High-plex immunofluorescence imaging and traditional histology of the same tissue section for discovering image-based biomarkers," Nature Cancer, vol. 4, no. 7, pp. 1036–1052, 2023.
- License: MIT License

HEMIT
- Source: Mendeley Data
- Citation: Bian C, Philips B, Cootes T, et al. HEMIT: H&E to Multiplex-immunohistochemistry Image Translation with Dual-Branch Pix2pix Generator. arXiv preprint arXiv:2403.18501, 2024.
- License: CC BY 4.0

We also used data from the IMMUcan dataset for validation. It is not redistributed here because that dataset remains private.
Preprocessing Pipeline

Our preprocessing steps include:

On ORION-CRC and HEMIT
- Nucleus segmentation with Cellpose on the DAPI channel
- Single-cell pseudo-labeling using Gaussian Mixture Models (GMM) based on mIF marker expression

On ORION-CRC only
- WSI-to-tile extraction at 20× magnification
- Artifact filtering using:
  - Channel-based noise detection on mIF
  - Foundation model (H-optimus-0) feature clustering to remove H&E artifact tiles
- Autofluorescence subtraction using a custom Napari tool with marker-specific correction formulas
- Channel normalization via percentile clipping and log-transformation

These steps aim to produce high-confidence marker-positive cell annotations from noisy mIF data, enabling robust learning and evaluation on paired H&E images.

File Structure

The dataset is organized into the following archives and directories:

HEMIT_nuclei_analysis.zip
Preprocessed HEMIT data containing:
- Nuclei segmentation masks generated using Cellpose (40× resolution)
- Corresponding single-cell data in .csv format (per-cell intensities and cell types)

⚠️ This archive does not include the raw images from the original HEMIT dataset, which is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). You must download the original dataset separately from Mendeley Data. The internal structure of this archive is designed to match the original, so our nuclei and single-cell annotations can be integrated directly.

ORION_dataset_20x.zip
ORION tile dataset at 20× magnification.
Folder structure:
- he/ – JPEG tiles of H&E images
- if/ – 8-bit TIFF cleaned mIF images
- nuclei/ – Label TIFF nucleus masks
- csv_nuclei_pos/ – Per-WSI CSV files containing single-cell data (nucleus positions and cell types)

slide_dataframe.csv
Dataframe that maps each slide (identified by slide_name) to its corresponding H&E, mIF, and nuclei WSI and CSV names. Columns include:
- slide_name: Unique slide ID
- he_path: Path to the H&E WSI
- if_path: Path to the mIF WSI
- nuclei_path: Path to the nucleus label WSI
- nuclei_csv_path: Path to the CSV file containing single-cell data (nucleus positions and cell types)

train_dataframe.csv, val_dataframe.csv, test_dataframe.csv
Dataframes containing tile-level metadata for each dataset split. Each row corresponds to a tile used during model training or evaluation. Columns include:
- slide_name: ID of the associated slide
- image_path: Path to the H&E tile
- target_path: Path to the corresponding mIF tile
- nuclei_path: Path to the nucleus label tile

ORION_dataset_20x_he_norm.zip
CycleGAN-normalized H&E images from ORION, transformed to match the staining style of IMMUcan data (20× resolution). These images can be used as augmentation during the training of MIPHEI-ViT.

To extract, use the following command: 7z x

Code & Tools

The full preprocessing pipeline used to produce this dataset (tile extraction, autofluorescence correction, artifact removal, and single-cell analysis) is available at:

👉 GitHub Repository

This code allows you to reproduce our results and adapt the workflow to new datasets.

Citation

Please cite the associated paper and Zenodo DOI when using this dataset:

G. Balezo, et al., "MIPHEI-ViT: Multiplex Immunofluorescence Prediction from H&E Images using ViT Foundation Models," 2025 [DOI TO UPDATE]
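Example: reading a split dataframe

The tile-level dataframes described under File Structure can be read with the Python standard library alone. The following is a minimal sketch, not part of the released code. The column names come from the description above; the helper names `read_split` and `tiles_per_slide`, and the assumption that tile paths in the CSV are relative to the extraction directory, are ours and should be checked against the actual files.

```python
import csv
from pathlib import Path

# Columns listed in the dataset description for the tile-level split files.
TILE_COLUMNS = ("slide_name", "image_path", "target_path", "nuclei_path")


def read_split(csv_path, root="."):
    """Return one dict per tile, with tile paths joined onto `root`.

    Assumption (not confirmed by the dataset description): the paths stored
    in the CSV are relative to the directory the archive was extracted into.
    """
    root = Path(root)
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [name for name in TILE_COLUMNS if name not in header]
        if missing:
            raise ValueError(f"{csv_path} is missing columns: {missing}")
        tiles = []
        for row in reader:
            tiles.append({
                "slide_name": row["slide_name"],
                "he": root / row["image_path"],        # H&E tile
                "mif": root / row["target_path"],      # mIF tile
                "nuclei": root / row["nuclei_path"],   # nucleus label tile
            })
    return tiles


def tiles_per_slide(tiles):
    """Count how many tiles each slide contributes to a split."""
    counts = {}
    for tile in tiles:
        counts[tile["slide_name"]] = counts.get(tile["slide_name"], 0) + 1
    return counts
```

For instance, `read_split("train_dataframe.csv", root="ORION_dataset_20x")` would return the training tiles, assuming the archive was extracted into a directory with that (hypothetical) name. Grouping by `slide_name` with `tiles_per_slide` is a quick way to check that no slide appears in more than one split.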
Histology, Staining and Labeling, Image Processing, Computer-Assisted
