
Music generation has become a key area in artificial intelligence, achieving significant progress in recent years. However, current research focuses primarily on general music tasks, with limited support for ethnic music. Moreover, the lack of multimodal guidance, such as text and image inputs, restricts generative models in understanding complex semantics and producing high-quality music. To address these limitations, we propose MusDiff, a multimodal music generation framework that combines text and image inputs to enhance music quality and cross-modal consistency. MusDiff is based on a diffusion model architecture, integrating IP-Adapter and KAN (Kolmogorov–Arnold Network) optimizations to improve feature fusion and modality alignment. Additionally, we introduce a new multimodal dataset, MusiTextImg, which includes diverse music categories, such as ethnic and modern styles, with annotations for text, image, and music modalities. We also extend the MusicCaps dataset by adding matched image pairs to text descriptions, further supporting multimodal research. Experimental results demonstrate that MusDiff outperforms existing methods on benchmark datasets (MusiTextImg and MusicCaps), excelling in realism, detail fidelity, and multimodal alignment. MusDiff not only sets a new performance standard for multimodal music generation but also opens new research directions in the field of multimodal generation.
Multimodal music generation, Kolmogorov-Arnold network, Ethnic music, IP-Adapter, TA1-2040, Engineering (General). Civil engineering (General), Diffusion models
