
doi: 10.3390/app14219863
handle: 10067/2093150151162165141
This study compares the micro, macro, and weighted variants of the F1-score to assess how each performs when evaluating text-based emotion classification, with the aim of understanding when each variant is better suited to the multilabel setting. Unigram lexicons were distilled from the multilabel emotion-annotated datasets XED and GoEmotions through a binary classification approach. The distilled lexicons were then applied to the annotated GoEmotions and XED datasets to calculate their emotional content, and the results were compared. The findings highlight the behavior of each F1-score variant under different class distributions, emphasizing the importance of appropriate metric selection for reliable model performance evaluation on imbalanced multilabel datasets. The study also investigates how aggregating negative emotions into broader categories affects these F1 metrics. Its contribution is to provide insight into how the choice of F1-score variant could improve the reliability of multilabel emotion classifier evaluation, particularly under the class imbalance present in phishing emails.
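To make the difference between the three variants concrete, the following is a minimal sketch of how micro, macro, and weighted F1 can be computed for multilabel predictions using their standard definitions. It is not the paper's implementation: the function name `f1_variants`, the three-label layout, and the toy label matrices are all hypothetical and chosen only to show how one frequent label and two rare labels pull the scores apart.

```python
from typing import Dict, List


def _f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_variants(y_true: List[List[int]], y_pred: List[List[int]]) -> Dict[str, float]:
    """Micro, macro, and weighted F1 for binary multilabel indicator matrices."""
    n_labels = len(y_true[0])
    tp = [0] * n_labels
    fp = [0] * n_labels
    fn = [0] * n_labels

    # Count true positives, false positives, and false negatives per label.
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                tp[j] += 1
            elif p:
                fp[j] += 1
            elif t:
                fn[j] += 1

    per_label = [_f1(tp[j], fp[j], fn[j]) for j in range(n_labels)]
    support = [tp[j] + fn[j] for j in range(n_labels)]  # true instances per label
    total_support = sum(support)

    return {
        # Micro: pool the counts over all labels, then compute one F1.
        "micro": _f1(sum(tp), sum(fp), sum(fn)),
        # Macro: unweighted mean of per-label F1, so rare labels count equally.
        "macro": sum(per_label) / n_labels,
        # Weighted: per-label F1 averaged with weights proportional to support.
        "weighted": (
            sum(f * s for f, s in zip(per_label, support)) / total_support
            if total_support
            else 0.0
        ),
    }


# Hypothetical toy data: label 0 is frequent, labels 1 and 2 are rare,
# and the classifier only ever predicts label 0.
y_true = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0]]

scores = f1_variants(y_true, y_pred)
print(scores)
```

In this toy case the classifier ignores both rare labels, so macro F1 falls well below micro F1, with weighted F1 in between. This is the kind of divergence under class imbalance that the abstract describes.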
emotion analysis, Technology, QH301-705.5, T, Physics, QC1-999, performance metrics, Engineering (General). Civil engineering (General), annotated datasets, Chemistry, multilabel dataset, Psychology, F1-score, TA1-2040, Biology (General), Engineering sciences. Technology, QD1-999, Mathematics
| selected citations | 50 |
| popularity | Top 1% |
| influence | Top 10% |
| impulse | Top 1% |
