Performance Metrics for Multilabel Emotion Classification: Comparing Micro, Macro, and Weighted F1-Scores


Maria Cristina Hinojosa Lee; Johan Braet; Johan Springael


Applied Sciences

Article . 2024 . Peer-reviewed

License: CC BY

Data sources: Crossref

Applied Sciences

Article . 2024

Data sources: DOAJ

Institutional Repository Universiteit Antwerpen

Article . 2024

Data sources: Institutional Repository Universiteit Antwerpen


Publication . Article . 28 Oct 2024 . Belgium . English

Publisher: MDPI AG

Journal: Applied Sciences, volume 14, page 9863 (eISSN: 2076-3417)


doi: 10.3390/app14219863

handle: 10067/2093150151162165141


Abstract

This study compares three F1-score variants (micro, macro, and weighted) to assess how well each evaluates text-based emotion classification. Lexicon distillation is applied to the multilabel emotion-annotated datasets XED and GoEmotions. The aim is to understand when each F1-score variant is better suited for evaluating text-based multilabel emotion classification. Unigram lexicons were derived from the annotated GoEmotions and XED datasets through a binary classification approach. The distilled lexicons were then applied to the same annotated datasets to calculate their emotional content, and the results were compared. The findings highlight the behavior of each F1-score variant under different class distributions, emphasizing the importance of appropriate metric selection for reliable model performance evaluation in imbalanced multilabel datasets. The study also investigates how aggregating negative emotions into broader categories affects these F1 metrics. Its contribution is to provide insights into how different F1-score variants could improve the reliability of multilabel emotion classifier evaluation, particularly under the class imbalance present in the case of phishing emails.
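To make the distinction between the three variants concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It uses the standard definitions: micro-F1 pools true positives, false positives, and false negatives across all labels; macro-F1 is the unweighted mean of per-label F1; weighted F1 averages per-label F1 by label support. The toy labels (joy, anger, fear), the six example rows, the helper names `f1`, `label_counts`, `f1_variants`, and `merge_labels`, and the convention of scoring an empty label as 0.0 are all assumptions made for illustration. The last helper mimics the aggregation of negative emotions by merging label columns with a logical OR.

```python
from typing import Dict, List, Sequence

Matrix = List[List[int]]  # rows = texts, columns = emotion labels (0 or 1)


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; 0.0 when there are no true or predicted positives
    (an assumed convention for the undefined case)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def label_counts(y_true: Matrix, y_pred: Matrix) -> List[Dict[str, int]]:
    """Count TP, FP, and FN separately for every label column."""
    counts = []
    for j in range(len(y_true[0])):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        counts.append({
            "tp": sum(1 for t, p in pairs if t and p),
            "fp": sum(1 for t, p in pairs if not t and p),
            "fn": sum(1 for t, p in pairs if t and not p),
        })
    return counts


def f1_variants(y_true: Matrix, y_pred: Matrix) -> Dict[str, float]:
    """Return micro, macro, and support-weighted F1 for a multilabel task."""
    counts = label_counts(y_true, y_pred)
    per_label = [f1(c["tp"], c["fp"], c["fn"]) for c in counts]
    support = [c["tp"] + c["fn"] for c in counts]  # true occurrences per label

    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1(sum(c["tp"] for c in counts),
               sum(c["fp"] for c in counts),
               sum(c["fn"] for c in counts))
    # Macro: every label counts equally, however rare it is.
    macro = sum(per_label) / len(per_label)
    # Weighted: per-label F1 averaged by how often each label truly occurs.
    total = sum(support)
    weighted = (sum(s * f for s, f in zip(support, per_label)) / total
                if total else 0.0)
    return {"micro": micro, "macro": macro, "weighted": weighted}


def merge_labels(matrix: Matrix, groups: Sequence[Sequence[int]]) -> Matrix:
    """Collapse each group of columns into one column (1 if any member is 1)."""
    return [[int(any(row[j] for j in group)) for group in groups]
            for row in matrix]


# Hypothetical toy data, columns = [joy, anger, fear]; joy dominates.
Y_TRUE = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1]]
Y_PRED = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 1, 0], [1, 0, 0]]

fine = f1_variants(Y_TRUE, Y_PRED)

# Aggregate anger and fear into one broader "negative" column.
GROUPS = [[0], [1, 2]]
coarse = f1_variants(merge_labels(Y_TRUE, GROUPS), merge_labels(Y_PRED, GROUPS))

print({k: round(v, 3) for k, v in fine.items()})
print({k: round(v, 3) for k, v in coarse.items()})
```

On this toy data the fine-grained scores are roughly 0.769 (micro), 0.519 (macro), and 0.698 (weighted): the missed rare label pulls macro-F1 down far more than the other two. After merging the two negative labels, macro-F1 rises to about 0.694 and weighted F1 to about 0.722, while micro-F1 is unchanged. These numbers only illustrate the mechanics and are not results from the study.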

Country

Belgium

Related Organizations

University of Antwerp
Belgium

Keywords

emotion analysis, Technology, QH301-705.5, T, Physics, QC1-999, performance metrics, Engineering (General). Civil engineering (General), annotated datasets, Chemistry, multilabel dataset, Psychology, F1-score, TA1-2040, Biology (General), Engineering sciences. Technology, QD1-999, Mathematics

Impact by BIP!

selected citations (50): These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity (Top 1%): This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence (Top 10%): This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse (Top 1%): This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.


Access routes: Green, gold