Promptvc: Flexible Stylistic Voice Conversion in Latent Space Driven by Natural Language Prompts

Keywords: FOS: Computer and information sciences, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing

Jixun Yao; Yuguang Yang 0005; Yi Lei; Ziqian Ning; Yanni Hu; Yu Pan 0008; Jingjing Yin; Hongbin Zhou; Heng Lu 0004; Lei Xie 0001

arXiv.org e-Print Archive

Preprint . 2023

Data sources: arXiv.org e-Print Archive

https://doi.org/10.1109/icassp...

Article . 2024 . Peer-reviewed

License: STM Policy #29

Data sources: Crossref

https://dx.doi.org/10.48550/ar...

Article . 2023

License: arXiv Non-Exclusive Distribution

Data sources: Datacite

DBLP

Article

Data sources: DBLP

DBLP

Conference object

Data sources: DBLP

Publication: Article, Preprint, Conference object
Date: 14 Apr 2024
Embargo end date: 01 Jan 2023
Publisher: IEEE
Journal: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

doi: 10.1109/icassp48485.2024.10445804, 10.48550/arxiv.2309.09262

arXiv: 2309.09262

Abstract

Style voice conversion aims to transform the style of source speech to a desired style according to real-world application demands. However, current style voice conversion approaches rely on pre-defined labels or reference speech to control the conversion process, which limits style diversity or falls short in the intuitiveness and interpretability of the style representation. In this study, we propose PromptVC, a novel style voice conversion approach that employs a latent diffusion model to generate a style vector driven by natural language prompts. Specifically, the style vector is extracted by a style encoder during training, and the latent diffusion model is then trained independently to sample the style vector from noise, conditioned on natural language prompts. To improve style expressiveness, we leverage HuBERT to extract discrete tokens and replace them with the K-Means center embeddings to serve as the linguistic content, which minimizes residual style information. Additionally, we deduplicate repeated discrete tokens and employ a differentiable duration predictor to re-predict the duration of each token, which adapts the duration of the same linguistic content to different styles. Subjective and objective evaluation results demonstrate the effectiveness of our proposed system.
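
The content-side steps named in the abstract (replacing frame features with K-Means center embeddings, then deduplicating tokens before durations are re-predicted) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `quantize`, `deduplicate`, and `expand` are hypothetical, the features are assumed to be already extracted (no HuBERT model is loaded), deduplication is assumed to apply to consecutive repeats, and `expand` simply reuses the original run lengths as a stand-in for the output of a learned duration predictor.

```python
import numpy as np

def quantize(features, centers):
    """Assign each frame to its nearest K-Means center.

    features: (T, D) frame-level features (assumed already extracted).
    centers:  (K, D) cluster centers (assumed already fitted).
    Returns the token ids (T,) and the center embeddings (T, D) that
    replace the original frame features.
    """
    # Squared Euclidean distance between every frame and every center: (T, K)
    dist = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    tokens = dist.argmin(axis=1)
    return tokens, centers[tokens]

def deduplicate(tokens):
    """Collapse runs of consecutive identical tokens.

    Returns the collapsed token sequence and the length of each run.
    """
    tokens = np.asarray(tokens)
    if tokens.size == 0:
        return tokens, np.array([], dtype=int)
    # Indices where a new run starts (always includes position 0)
    change = np.flatnonzero(np.diff(tokens) != 0) + 1
    starts = np.concatenate(([0], change))
    durations = np.diff(np.concatenate((starts, [tokens.size])))
    return tokens[starts], durations

def expand(units, durations):
    """Repeat each unit by its duration (in frames).

    In a full system the durations would come from a duration predictor;
    here they are supplied by the caller.
    """
    return np.repeat(units, durations)
```

Passing different `durations` to `expand` for the same `units` is what lets identical linguistic content occupy a different number of frames, which is the property the abstract attributes to duration re-prediction.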

Accepted by ICASSP 2024

Related Organizations

Northwestern Polytechnical University
China (People's Republic of)

Related research products (2)

ChatGLM2-6B software on GitHub (IsRelatedTo)
wenet software on GitHub (IsRelatedTo)

Impact by BIP!

selected citations: 1. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Open access route: Green
