ZENODO · Article · 2025 · License: CC BY · Data sources: ZENODO

Beyond Balanced Accuracy: Calibrated F1-Score for Reliable AI Evaluation in Imbalanced Domains

Authors: Revista, Zen; IA, 10

Abstract

Evaluating the performance of Artificial Intelligence (AI) models in imbalanced domains poses significant challenges. Traditional metrics like accuracy can be misleading, favoring the majority class and masking poor performance on the minority class, which is often of greater interest. While balanced accuracy addresses this issue by averaging the accuracy across each class, it does not fully capture the nuances of precision and recall, crucial aspects in imbalanced scenarios. This paper introduces a novel evaluation metric, the Calibrated F1-Score, designed to provide a more reliable and informative assessment of AI models in imbalanced domains. The Calibrated F1-Score incorporates calibration techniques to ensure that the predicted probabilities reflect the true likelihood of belonging to a class. Furthermore, it leverages the F1-score, which harmonizes precision and recall, offering a balanced view of model performance. We demonstrate the effectiveness of the Calibrated F1-Score through comprehensive experiments on various imbalanced datasets, comparing its performance against traditional metrics such as accuracy, balanced accuracy, and standard F1-score. Our results show that the Calibrated F1-Score provides a more robust and insightful evaluation, enabling better model selection and optimization in imbalanced domains. This research contributes to the development of more reliable and trustworthy AI systems, particularly in critical applications where accurate prediction of minority classes is paramount.
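The abstract describes a two-step recipe: first calibrate the model's predicted probabilities so they reflect true class likelihoods, then score predictions with the F1 measure, which harmonizes precision and recall. The paper does not publish code, so the sketch below is only an illustration of that pipeline under assumed choices (scikit-learn's `CalibratedClassifierCV` with isotonic calibration, a 0.5 decision threshold, and a synthetic 95/5 imbalanced dataset); the authors' actual Calibrated F1-Score may combine the two steps differently.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import f1_score, balanced_accuracy_score

# Synthetic imbalanced dataset: ~95% majority class, ~5% minority class.
X, y = make_classification(
    n_samples=2000, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, random_state=0
)

# Step 1: calibrate the base classifier's probabilities
# (isotonic regression via cross-validation is one common choice).
base = LogisticRegression(max_iter=1000)
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5)
calibrated.fit(X_tr, y_tr)

# Step 2: threshold the calibrated probabilities and compute F1.
proba = calibrated.predict_proba(X_te)[:, 1]
y_pred = (proba >= 0.5).astype(int)

cal_f1 = f1_score(y_te, y_pred)
bal_acc = balanced_accuracy_score(y_te, y_pred)
print(f"calibrated F1: {cal_f1:.3f}, balanced accuracy: {bal_acc:.3f}")
```

Comparing `cal_f1` against the F1 of the uncalibrated `base` model (and against balanced accuracy) on the same held-out split mirrors the comparison the abstract reports across its experimental datasets.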

Impact indicators (provided by BIP!):
  • Selected citations: 0 — citations derived from selected sources; an alternative to the "Influence" indicator, which reflects the overall/total impact of an article in the research community, based on the underlying citation network (diachronically).
  • Popularity: Average — reflects the "current" impact/attention (the "hype") of the article in the research community, based on the underlying citation network.
  • Influence: Average — reflects the overall/total impact of the article in the research community, based on the underlying citation network (diachronically).
  • Impulse: Average — reflects the initial momentum of the article directly after its publication, based on the underlying citation network.