Exploring the Feasibility of LLMs for Automated Music Emotion Annotation

descriptionPublicationkeyboard_double_arrow_right Article , Conference object , Preprint 01 Jan 2025Embargo end date: 01 Jan 2025Publisher:ISMIRJournal:CoRR, volume abs/2508.12626

Authors: Meng Yang 0036; Jon McCormack; Maria Teresa Llano; Wanchao Su;

doi: 10.5281/zenodo.17706354 , 10.48550/arxiv.2508.12626 , 10.5281/zenodo.17811341 , 10.5281/zenodo.17706355

arXiv: 2508.12626

Exploring the Feasibility of LLMs for Automated Music Emotion Annotation

- Summary
- Subjects
- Metrics

Abstract

Current approaches to music emotion annotation remain heavily reliant on manual labelling, a process that imposes significant resource and labour burdens, severely limiting the scale of available annotated data. This study examines the feasibility and reliability of employing a large language model (GPT-4o) for music emotion annotation. In this study, we annotated GiantMIDI-Piano, a classical MIDI piano music dataset, in a four-quadrant valence-arousal framework using GPT-4o, and compared against annotations provided by three human experts. We conducted extensive evaluations to assess the performance and reliability of GPT-generated music emotion annotations, including standard accuracy, weighted accuracy that accounts for inter-expert agreement, inter-annotator agreement metrics, and distributional similarity of the generated labels. While GPT's annotation performance fell short of human experts in overall accuracy and exhibited less nuance in categorizing specific emotional states, inter-rater reliability metrics indicate that GPT's variability remains within the range of natural disagreement among experts. These findings underscore both the limitations and potential of GPT-based annotation: despite its current shortcomings relative to human performance, its cost-effectiveness and efficiency render it a promising scalable alternative for music emotion annotation.

Accepted to be published at ISMIR 2025

Keywords

FOS: Computer and information sciences, Sound (cs.SD), Sound

Impact byBIP!

	selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	0
	popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.	Average
	influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	Average
	impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.	Average

Found an issue? Give us feedback

0

Average

Green