
Deep learning (DL) has become the dominant approach for medical image segmentation, yet making these models reliable and clinically applicable requires addressing key challenges such as annotation variability, calibration, and uncertainty estimation. To this end, we created the Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation (CURVAS) challenge. It highlights the critical role of multiple annotators in establishing a more comprehensive ground truth, emphasizing that segmentation is inherently subjective and that leveraging inter-annotator variability is essential for robust model evaluation. Seven teams participated in the challenge, submitting a variety of DL models that were evaluated using metrics such as the Dice Similarity Coefficient (DSC), Expected Calibration Error (ECE), and Continuous Ranked Probability Score (CRPS). By incorporating consensus and dissensus ground truth, we assess how DL models handle uncertainty and whether their confidence estimates align with true segmentation performance. Our findings reinforce the importance of well-calibrated models, as better calibration is strongly correlated with the quality of the results. Furthermore, we demonstrate that segmentation models trained on diverse datasets and enriched with pre-trained knowledge exhibit greater robustness, particularly in cases that deviate from standard anatomical structures. Notably, the best-performing models achieved both high DSC and well-calibrated uncertainty estimates. This work underscores the need for multi-annotator ground truth, thorough calibration assessments, and uncertainty-aware evaluations to develop trustworthy and clinically reliable DL-based medical image segmentation models.
This challenge was hosted at MICCAI 2024.
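As an illustration of two of the metrics named in the abstract, the following is a minimal sketch of DSC and a binned ECE for a single binary (one-organ) mask. It is not the challenge's official evaluation code: the function names `dice_coefficient` and `expected_calibration_error`, the choice of 10 equal-width bins, and the use of the foreground probability as the binned quantity are all assumptions made for this example.

```python
import numpy as np


def dice_coefficient(pred_mask, true_mask):
    """Dice Similarity Coefficient between two binary masks (hypothetical helper)."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    denom = pred.sum() + true.sum()
    if denom == 0:
        # Both masks empty: treated here as perfect agreement (an assumption).
        return 1.0
    return 2.0 * np.logical_and(pred, true).sum() / denom


def expected_calibration_error(fg_probs, true_mask, n_bins=10):
    """Binned ECE on foreground probabilities (assumed variant, equal-width bins).

    For each bin, compares the mean predicted foreground probability with the
    observed fraction of foreground voxels, weighted by the bin's voxel share.
    """
    probs = np.asarray(fg_probs, dtype=float).ravel()
    labels = np.asarray(true_mask, dtype=float).ravel()
    # Map each probability to a bin index in [0, n_bins - 1]; 1.0 goes to the last bin.
    bin_ids = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if not in_bin.any():
            continue
        gap = abs(probs[in_bin].mean() - labels[in_bin].mean())
        ece += in_bin.mean() * gap
    return float(ece)
```

With several annotators, one plausible use of this sketch is to compute the scores against each annotator's mask, or against a consensus mask, and compare them; how the challenge actually combines annotations is not specified in this abstract.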
Multi-class image segmentation, [INFO.INFO-AI] Computer Science [cs]/Artificial Intelligence [cs.AI], FOS: Computer and information sciences, [SDV.IB.IMA] Life Sciences [q-bio]/Bioengineering/Imaging, Multiple expert annotations, Computer Vision and Pattern Recognition (cs.CV), Calibration, Uncertainty, [INFO.INFO-IM] Computer Science [cs]/Medical Imaging, Abdominal CT, [SPI.SIGNAL] Engineering Sciences [physics]/Signal and Image processing
