Aligning Text-to-Music Evaluation with Human Preferences

Publication: Article, Conference object, Preprint | 01 Jan 2025 | Embargo end date: 01 Jan 2025 | Publisher: ISMIR | Journal: CoRR, volume abs/2503.16669

Authors: Yichen Huang; Zachary Novack; Koichi Saito; Jiatong Shi; Shinji Watanabe; Yuki Mitsufuji; John Thickstun; +1 Authors

DOI: 10.5281/zenodo.17811347, 10.5281/zenodo.17706363, 10.48550/arxiv.2503.16669, 10.5281/zenodo.17706362

arXiv: 2503.16669

Abstract

Despite significant recent advances in generative acoustic text-to-music (TTM) modeling, robust evaluation of these models lags behind, relying in particular on the popular Fréchet Audio Distance (FAD). In this work, we rigorously study the design space of reference-based divergence metrics for evaluating TTM models through (1) designing four synthetic meta-evaluations to measure sensitivity to particular musical desiderata, and (2) collecting and evaluating on MusicPrefs, an open-source dataset of pairwise human preferences for TTM systems. We find that not only is the standard FAD setup inconsistent on both synthetic and human preference data, but that nearly all existing metrics fail to effectively capture desiderata, and are only weakly correlated with human perception. We propose a new metric, the MAUVE Audio Divergence (MAD), computed on representations from a self-supervised audio embedding model. We find that this metric effectively captures diverse musical desiderata (average rank correlation 0.84 for MAD vs. 0.49 for FAD) and also correlates more strongly with MusicPrefs (0.62 vs. 0.14).
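The rank correlations quoted in the abstract compare a metric's scores against an expected ordering. The sketch below is a minimal illustration of that idea, not the authors' evaluation code. It assumes Spearman's rank correlation (the abstract does not say which rank correlation is used), and the degradation levels and metric scores are made-up toy numbers, not results from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical setup: five increasing levels of a synthetic degradation,
# and the divergence two imaginary metrics report at each level.
degradation_levels = [0, 1, 2, 3, 4]
monotone_metric = [0.10, 0.25, 0.40, 0.80, 1.30]  # tracks the degradation
noisy_metric = [0.30, 0.10, 0.50, 0.20, 0.60]     # only loosely tracks it

rho_monotone = spearman(degradation_levels, monotone_metric)
rho_noisy = spearman(degradation_levels, noisy_metric)
print(rho_monotone, rho_noisy)
```

A metric that rises in step with the degradation gets a correlation near 1, while one that reorders the levels scores lower. Averaging such correlations over several synthetic tests would give a single summary number of the kind the abstract reports.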

Keywords

FOS: Computer and information sciences, Sound (cs.SD), Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing

Impact by BIP!

Selected citations: 1
Popularity: Average
Influence: Average
Impulse: Average

