In this work, we provide a broad comparative analysis of strategies for pre-training audio understanding models for several tasks in the music domain, including labelling of genre, era, origin, mood, instrumentation, key, pitch, vocal characteristics, tempo and sonority. Specifically, we explore how the domain of the pre-training dataset (music or generic audio) and the pre-training methodology (supervised or unsupervised) affect the adequacy of the resulting audio embeddings for downstream tasks. We show that models trained via supervised learning on large-scale expert-annotated music datasets achieve state-of-the-art performance in a wide range of music labelling tasks, each with novel content and vocabularies. This can be done efficiently with models containing fewer than 100 million parameters that require no fine-tuning or reparameterization for downstream tasks, making this approach practical for industry-scale audio catalogs. Within the class of unsupervised learning strategies, we show that the domain of the training dataset can significantly impact the performance of the representations learned by the model. We find that restricting the pre-training dataset to the music domain allows for training with smaller batch sizes while achieving state-of-the-art performance in unsupervised learning---and in some cases, supervised learning---for music understanding. We also corroborate that, while achieving state-of-the-art performance on many tasks, supervised learning can cause models to specialize to the supervised information provided, somewhat compromising a model's generality.
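The "no fine-tuning or reparameterization" setup described in the abstract can be sketched as a shallow linear probe trained on frozen, pre-computed audio embeddings. The sketch below uses random placeholder embeddings and a hypothetical 10-class genre task; the embedding dimension, class count, and data are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder stand-ins for pre-computed, frozen audio embeddings
# (one vector per track) and downstream genre labels; a real pipeline
# would load these from the pre-trained model's output.
embeddings = rng.normal(size=(500, 128))   # 500 tracks, 128-dim (hypothetical)
labels = rng.integers(0, 10, size=500)     # 10 hypothetical genre classes

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=0)

# A shallow linear probe: only this classifier is trained;
# the embedding model itself is never updated.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = probe.score(X_test, y_test)
print(f"probe accuracy: {accuracy:.2f}")
```

Because only the linear layer is fit, the same frozen embeddings can be reused across many labelling tasks, which is what makes the approach cheap at catalog scale.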
FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Computer Science - Artificial Intelligence, ismir, Computer Science - Sound, Computer Science - Information Retrieval, Machine Learning (cs.LG), Multimedia (cs.MM), Artificial Intelligence (cs.AI), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Multimedia, Information Retrieval (cs.IR), Electrical Engineering and Systems Science - Audio and Speech Processing, ismir2022
| indicator | description | value |
|---|---|---|
| selected citations | Citations derived from selected sources; an alternative to the "influence" indicator, which reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 1 |
| popularity | The "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Average |
| influence | The overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average |
| impulse | The initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
| views | Provided by UsageCounts. | 54 |
| downloads | Provided by UsageCounts. | 80 |
