Universal Music Representations? Evaluating Foundation Models on World Music Corpora

Publication: Article, Conference object, Preprint
Date: 01 Jan 2025
Embargo end date: 01 Jan 2025
Publisher: ISMIR
Journal: CoRR, volume abs/2506.17055

Authors: Charilaos Papaioannou; Emmanouil Benetos; Alexandros Potamianos

doi: 10.5281/zenodo.17706397, 10.5281/zenodo.17811370, 10.48550/arxiv.2506.17055, 10.5281/zenodo.17706396

arXiv: 2506.17055



Abstract

Foundation models have revolutionized music information retrieval, but questions remain about their ability to generalize across diverse musical traditions. This paper presents a comprehensive evaluation of five state-of-the-art audio foundation models across six musical corpora spanning Western popular, Greek, Turkish, and Indian classical traditions. We employ three complementary methodologies to investigate these models' cross-cultural capabilities: probing to assess inherent representations, targeted supervised fine-tuning of 1-2 layers, and multi-label few-shot learning for low-resource scenarios. Our analysis shows varying cross-cultural generalization, with larger models typically outperforming on non-Western music, though results decline for culturally distant traditions. Notably, our approaches achieve state-of-the-art performance on five out of six evaluated datasets, demonstrating the effectiveness of foundation models for world music understanding. We also find that our targeted fine-tuning approach does not consistently outperform probing across all settings, suggesting foundation models already encode substantial musical knowledge. Our evaluation framework and benchmarking results contribute to understanding how far current models are from achieving universal music representations while establishing metrics for future progress.

Accepted at ISMIR 2025

Keywords

Machine Learning, FOS: Computer and information sciences, Sound (cs.SD), Sound, Audio and Speech Processing (eess.AS), Information Retrieval, FOS: Electrical engineering, electronic engineering, information engineering, Audio and Speech Processing, Information Retrieval (cs.IR), Machine Learning (cs.LG)

Impact by BIP!

- selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

