Histo-Miner: NucSeg and TumSeg datasets

I. General

Training datasets used for the Histo-Miner paper.

2 datasets were used to train SCC-Hovernet:
- UncuratedSCC
- NucSeg

1 dataset was used to train SCC-Segmenter:
- TumSeg

II. NucSeg Datasets

The dataset is available here: NucSeg.zip.

The dataset consists of annotated H&E patches in which the cell nuclei are segmented and classified. 47,392 nuclei were labeled in total (3,135 granulocytes, 12,263 lymphocytes, 3,271 plasma cells, 11,526 stromal cells, 17,197 tumor cells).

The dataset is composed of 6,816 patches of 560x560 pixels with 70% overlap, stored as 5D numpy arrays according to the Hovernet data format requirements. The patches come from 24 WSIs of 20 cSCC patients. The image resolutions are a mix of 40x and 20x (see IV. Patient IDs for more information).

The channels of the arrays are [RGB, inst, type], where:
- 'RGB' is the 3-channel raw image
- 'inst' is the instance segmentation ground truth: pixel values range from 0 to N, where 0 is background and N is the number of nuclear instances
- 'type' is the nuclear type ground truth: pixel values range from 0 to K, where 0 is background and K is the number of classes

This format fits the training of Hovernet-like architectures but is not convenient for visualization or for training other models. For this reason, a more conventional format is also available: NucSeg_OriginalFormat.zip. In this version, the 'RGB', 'inst' and 'type' data are saved in numpy format in separate folders (RawImages, InstanceMaps, ClassMaps). For instance, the functions save2dnpy_2png and save3dnpy_2png from histo_miner.utils.filemanagement can be used to generate PNG files from these arrays. This version contains 1,707 non-overlapping H&E patches of 256x256 pixels.

As described in the paper, the SCC Hovernet model was first pretrained with a Not-Curated dataset, meaning that its segmentation and cell classification contain several errors that are not quantified.
It is not recommended to use this dataset for training, only for pre-training as a first step preceding another training step with another dataset. This Not-Curated dataset is available here: UncuratedSCC.zip. Its file organization follows that of NucSeg.

III. TumSeg Dataset

The dataset is available here: TumSeg.zip.

The dataset consists of pairs of raw WSI images and binary segmentation images in which the tumor region was annotated. 144 WSIs of 125 cSCC patients were collected for this dataset. The WSIs are downsampled to 1.25x resolution.

IV. Patient IDs

For both datasets, a csv file is available to associate each file with its corresponding (anonymised) patient. For the NucSeg dataset, the resolutions of the WSIs from which the patches were extracted are also given.

In version 2 of the dataset, we changed the patient IDs of TumSeg to remove misleading names. The image-to-patient correspondence is unchanged; only the names are updated.

V. Funding Notes

Lucas Sancéré and Kasia Bozek were supported by the North Rhine-Westphalia return program (311-8.03.03.02-147635) and hosted by the Center for Molecular Medicine Cologne. Johannes Brägelmann and Carina Lorenz received funding from a Mildred Scheel Nachwuchszentrum Grant 70113307 by the German Cancer Aid (Deutsche Krebshilfe).
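Example: reading the NucSeg channel layout

As an illustration of the [RGB, inst, type] layout described in section II, the following minimal Python sketch splits one patch into its raw image, instance map and type map, then counts nuclei per class. It assumes that each patch is a numpy array of shape (height, width, 5) with the channels in that order. This shape is not stated in the description above and should be checked against the actual files. The patch is synthetic, and the helper names split_patch and count_nuclei_per_type are hypothetical; they are not part of histo_miner. The class IDs used are arbitrary placeholders, since the mapping from IDs to cell types is not given here.

```python
import numpy as np

# Synthetic stand-in for one NucSeg patch (assumed shape: height x width x 5).
# Assumed channel order, following the description: R, G, B, inst, type.
rng = np.random.default_rng(0)
patch = np.zeros((560, 560, 5), dtype=np.int32)
patch[..., :3] = rng.integers(0, 256, size=(560, 560, 3))  # fake RGB values

# Two fake nuclei: instance 1 has class 2, instance 2 has class 5.
# These class IDs are placeholders, not the dataset's real ID-to-name mapping.
patch[100:120, 100:120, 3] = 1
patch[100:120, 100:120, 4] = 2
patch[300:330, 200:230, 3] = 2
patch[300:330, 200:230, 4] = 5


def split_patch(arr):
    """Split a (H, W, 5) array into RGB image, instance map and type map."""
    rgb = arr[..., :3].astype(np.uint8)
    inst_map = arr[..., 3]
    type_map = arr[..., 4]
    return rgb, inst_map, type_map


def count_nuclei_per_type(inst_map, type_map):
    """Count instances per class ID, using the majority type within each instance."""
    counts = {}
    for inst_id in np.unique(inst_map):
        if inst_id == 0:  # 0 is background in both maps
            continue
        types_in_inst = type_map[inst_map == inst_id]
        class_id = int(np.bincount(types_in_inst).argmax())
        counts[class_id] = counts.get(class_id, 0) + 1
    return counts


rgb, inst_map, type_map = split_patch(patch)
nuclei_counts = count_nuclei_per_type(inst_map, type_map)
print(rgb.shape, int(inst_map.max()), nuclei_counts)
```

With real data, the synthetic patch would be replaced by an array loaded with np.load from the extracted archive. Taking the majority type per instance is a design choice made for this sketch so that a few mislabeled pixels do not create spurious classes.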

Related Organizations

University of Cologne
Germany
University Hospital Cologne
Germany

Keywords

Segmentation, Histology, Annotations, Skin cancer, Object Classification

Related to Research communities

Cancer Research