Zero-Shot Voice Conversion Performance of NaturalSpeech 2 Versus Tacotron 2 and VALL-E on Low-Resource Accents

SOVEREIGN Research Kernel


Data sources: ZENODO


Publication type: Report; Status: Under curation; Language: English; Publisher: Zenodo

Authors: SOVEREIGN Research Kernel

doi: 10.5281/zenodo.20637633


Abstract

This work focuses on modelling a speaker's accent that does not have a dedicated text-to-speech (TTS) frontend, including a grapheme-to-phoneme (G2P) module. Prior work on modelling accents assumes a phonetic transcription is available for the target accent, which might not be the case for low-resource, regional accents. In our work, we propose an approach whereby we first augment the target accent data to sound like the donor voice via voice conversion, then train a multi-speaker multi-accent TTS model on the combination of recordings and synthetic data, to generate the donor's voice speaking...

Research goal: How does the zero-shot voice conversion performance of NaturalSpeech 2 compare to other TTS models like Tacotron 2 or VALL-E on low-resource accents when evaluated using WER and speaker similarity metrics?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.5/10.
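The research goal names two evaluation metrics, word error rate (WER) and speaker similarity, without stating how they are computed. The following is a minimal sketch of one common way to compute them, not the report's own procedure. It assumes WER is the word-level edit distance between a reference transcript and an ASR transcript of the converted speech, divided by the reference length, and that speaker similarity is the cosine similarity between two speaker embedding vectors. The function names `word_error_rate` and `cosine_similarity` are hypothetical, and the embeddings are assumed to come from some external speaker encoder that is not shown here.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: both transcripts are whitespace-tokenised and lowercased;
    real evaluations usually apply a fuller text normalisation first.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two speaker embeddings of equal length.

    Assumption: the embeddings were produced by an external speaker
    encoder; this function only compares the resulting vectors.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

Under these assumptions, a lower WER suggests more intelligible converted speech, and a cosine similarity closer to 1 suggests the converted voice is closer to the target speaker. Note that WER can exceed 1.0 when the hypothesis contains many inserted words.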
