ZENODO · Article · 2025 · License: CC BY · Data sources: ZENODO

Beyond Balanced Accuracy: Calibrated F1-Score for Reliable AI Evaluation in Imbalanced Domains

Authors: Revista, Zen; IA, 10

Abstract

Evaluating the performance of Artificial Intelligence (AI) models in imbalanced domains poses significant challenges. Traditional metrics like accuracy can be misleading, favoring the majority class and masking poor performance on the minority class, which is often of greater interest. While balanced accuracy addresses this issue by averaging the accuracy across each class, it does not fully capture the nuances of precision and recall, crucial aspects in imbalanced scenarios. This paper introduces a novel evaluation metric, the Calibrated F1-Score, designed to provide a more reliable and informative assessment of AI models in imbalanced domains. The Calibrated F1-Score incorporates calibration techniques to ensure that the predicted probabilities reflect the true likelihood of belonging to a class. Furthermore, it leverages the F1-score, which harmonizes precision and recall, offering a balanced view of model performance. We demonstrate the effectiveness of the Calibrated F1-Score through comprehensive experiments on various imbalanced datasets, comparing its performance against traditional metrics such as accuracy, balanced accuracy, and standard F1-score. Our results show that the Calibrated F1-Score provides a more robust and insightful evaluation, enabling better model selection and optimization in imbalanced domains. This research contributes to the development of more reliable and trustworthy AI systems, particularly in critical applications where accurate prediction of minority classes is paramount.
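The abstract describes a two-step recipe: first calibrate the model's predicted probabilities so they reflect true class likelihoods, then score predictions with the F1 measure, which harmonizes precision and recall. The paper does not publish code, so the sketch below is only an illustration of that pipeline under assumed choices (scikit-learn's `CalibratedClassifierCV` with isotonic calibration, a 0.5 decision threshold, and a synthetic 95/5 imbalanced dataset); the authors' actual Calibrated F1-Score may combine the two steps differently.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import f1_score, balanced_accuracy_score

# Synthetic imbalanced dataset: ~95% majority class, ~5% minority class.
X, y = make_classification(
    n_samples=2000, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, random_state=0
)

# Step 1: calibrate the base classifier's probabilities
# (isotonic regression via cross-validation is one common choice).
base = LogisticRegression(max_iter=1000)
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5)
calibrated.fit(X_tr, y_tr)

# Step 2: threshold the calibrated probabilities and compute F1.
proba = calibrated.predict_proba(X_te)[:, 1]
y_pred = (proba >= 0.5).astype(int)

cal_f1 = f1_score(y_te, y_pred)
bal_acc = balanced_accuracy_score(y_te, y_pred)
print(f"calibrated F1: {cal_f1:.3f}, balanced accuracy: {bal_acc:.3f}")
```

Comparing `cal_f1` against the F1 of the uncalibrated `base` model (and against balanced accuracy) on the same held-out split mirrors the comparison the abstract reports across its experimental datasets.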

Impact indicators (provided by BIP!):
  • Selected citations: 0 — citations derived from selected sources; an alternative to the "Influence" indicator, which reflects the overall/total impact of an article in the research community, based on the underlying citation network (diachronically).
  • Popularity: Average — reflects the "current" impact/attention (the "hype") of the article in the research community, based on the underlying citation network.
  • Influence: Average — reflects the overall/total impact of the article in the research community, based on the underlying citation network (diachronically).
  • Impulse: Average — reflects the initial momentum of the article directly after its publication, based on the underlying citation network.