Combining Audio Control and Style Transfer Using Latent Diffusion

Publication type: Article, Conference object, Preprint | Date: 01 Jan 2024 | Embargo end date: 01 Jan 2024 | Country: France | Publisher: ISMIR | Journal: CoRR, volume abs/2408.00196

Authors: Demerlé, Nils; Esling, Philippe; Doras, Guillaume; Genova, David

doi: 10.5281/zenodo.14877436, 10.5281/zenodo.14877141, 10.5281/zenodo.14877437, 10.48550/arxiv.2408.00196, 10.5281/zenodo.14877142

arXiv: 2408.00196

Abstract

Deep generative models are now able to synthesize high-quality audio signals, shifting the critical aspect in their development from audio quality to control capabilities. Although text-to-music generation is being widely adopted by the general public, explicit control and example-based style transfer are better suited to capturing the intents of artists and musicians. In this paper, we aim to unify explicit control and style transfer within a single model by separating local and global information to capture musical structure and timbre, respectively. To do so, we leverage the capabilities of diffusion autoencoders to extract semantic features, in order to build two representation spaces. We enforce disentanglement between those spaces using an adversarial criterion and a two-stage training strategy. Our resulting model can generate audio matching a timbre target, while specifying structure either with explicit controls or through another audio example. We evaluate our model on one-shot timbre transfer and MIDI-to-audio tasks on instrumental recordings and show that we outperform existing baselines in terms of audio quality and target fidelity. Furthermore, we show that our method can generate cover versions of complete musical pieces by transferring rhythmic and melodic content to the style of a target audio in a different genre.
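
The split between local (structure) and global (timbre) information described in the abstract can be pictured with a toy sketch. This is not the authors' model: the paper uses learned diffusion autoencoders, whereas the functions below (`global_code`, `local_code`, `condition`, all hypothetical names) simply average feature frames, on the assumption that a global code is one vector per clip and a local code is one vector per time block.

```python
from typing import List

Frame = List[float]  # one feature vector for one time frame


def global_code(frames: List[Frame]) -> Frame:
    """Average over time: one vector for the whole clip (toy stand-in for timbre)."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def local_code(frames: List[Frame], hop: int) -> List[Frame]:
    """Average over blocks of `hop` frames: a time-varying sequence (toy stand-in for structure)."""
    return [global_code(frames[i:i + hop]) for i in range(0, len(frames), hop)]


def condition(local: List[Frame], glob: Frame) -> List[Frame]:
    """Attach the same global vector to every local step by concatenation."""
    return [step + glob for step in local]


# Structure comes from one clip, timbre from another (made-up numbers).
structure_clip = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
timbre_clip = [[10.0, 20.0], [30.0, 40.0]]

cond = condition(local_code(structure_clip, hop=2), global_code(timbre_clip))
print(cond)
```

In this toy setting, swapping `timbre_clip` changes only the trailing entries of each conditioning vector, while the time-varying part still follows `structure_clip`. In the paper that separation is learned and enforced by the adversarial criterion rather than by simple averaging.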

ISMIR 2024

Country

France

Related Organizations

Sorbonne University (Sorbonne Universite), France
French National Centre for Scientific Research, France
Sorbonne Paris Cité, France


Keywords

FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Statistics - Machine Learning, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Machine Learning (stat.ML), [STAT.ML] Statistics [stat]/Machine Learning [stat.ML], Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Machine Learning (cs.LG)

Impact by BIP!

selected citations: 0 (citations derived from selected sources; an alternative to the "Influence" indicator)
popularity: Average (the "current" impact or attention of the article in the research community at large, based on the underlying citation network)
influence: Average (the overall impact of the article in the research community at large, based on the underlying citation network, diachronically)
impulse: Average (the initial momentum of the article directly after its publication, based on the underlying citation network)
