
Audio-text contrastive models have become a powerful approach in music representation learning. Despite their empirical success, however, little is known about the influence of key design choices on the quality of music-text representations learnt through this framework. In this work, we expose these design choices within the constraints of limited data and computation budgets, and establish a more solid understanding of their impact grounded in empirical observations along three axes: the choice of base encoders, the level of curation in training data, and the use of text augmentation. We find that data curation is the single most important factor for music-text contrastive training in resource-constrained scenarios. Motivated by this insight, we introduce two novel techniques, Augmented View Dropout and TextSwap, which increase the diversity and descriptiveness of text inputs seen in training. Through our experiments we demonstrate that these are effective at boosting performance across different pre-training regimes, model architectures, and downstream data distributions, without incurring higher computational costs or requiring additional training data.
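The abstract does not spell out how the two proposed augmentations work, but their names suggest the general shape: randomly dropping some of a track's available text views each step, and swapping parts of a caption for alternatives to diversify the text side. The sketch below is a purely illustrative guess at that idea; the function names, the tag-level swapping strategy, and all parameters are assumptions, not the paper's actual method.

```python
import random

def augmented_view_dropout(views, keep_prob=0.7, rng=random):
    """Illustrative sketch: randomly drop some of a track's text views
    so the model sees a different subset of captions each epoch.
    Not the paper's implementation; details are assumed."""
    kept = [v for v in views if rng.random() < keep_prob]
    # Always keep at least one view so every track retains a caption.
    return kept if kept else [rng.choice(views)]

def text_swap(caption, tag_vocab, swap_prob=0.3, rng=random):
    """Illustrative sketch: replace known tags in a caption with other
    tags from a vocabulary, increasing text diversity. The word-level
    swapping shown here is a guess at the general idea."""
    out = []
    for word in caption.split():
        if word in tag_vocab and rng.random() < swap_prob:
            alternatives = [t for t in tag_vocab if t != word]
            out.append(rng.choice(alternatives))
        else:
            out.append(word)
    return " ".join(out)
```

Both operations act only on the text inputs, which is consistent with the abstract's claim that the techniques add no computational cost and require no additional training data.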
To appear in the Proceedings of the 25th International Society for Music Information Retrieval Conference (ISMIR 2024)
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); FOS: Computer and information sciences; FOS: Electrical engineering, electronic engineering, information engineering
