Name: Zero-Shot vs. Few-Shot Multi-speaker TTS Using Pre-trained Czech SpeechT5 Model
Keywords: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing

descriptionPublicationkeyboard_double_arrow_right Part of book or chapter of book , Article , Preprint 01 Jan 2024Embargo end date: 01 Jan 2024 English Publisher:Springer Nature Switzerland

Authors: Jan Lehečka; Zdeněk Hanzlíček; Jindřich Matoušek; Daniel Tihelka;

doi: 10.1007/978-3-031-70566-3_5 , 10.48550/arxiv.2407.17167

arXiv: 2407.17167

Zero-Shot vs. Few-Shot Multi-speaker TTS Using Pre-trained Czech SpeechT5 Model

- Summary
- Subjects
- Related research
  (3)
- Metrics

Abstract

In this paper, we experimented with the SpeechT5 model pre-trained on large-scale datasets. We pre-trained the foundation model from scratch and fine-tuned it on a large-scale robust multi-speaker text-to-speech (TTS) task. We tested the model capabilities in a zero- and few-shot scenario. Based on two listening tests, we evaluated the synthetic audio quality and the similarity of how synthetic voices resemble real voices. Our results showed that the SpeechT5 model can generate a synthetic voice for any speaker using only one minute of the target speaker's data. We successfully demonstrated the high quality and similarity of our synthetic voices on publicly known Czech politicians and celebrities.

Accepted to TSD2024

Keywords

FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing

3 Research products, page 1 of 1

tortoise-tts software on GitHub
IsRelatedTo
List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words software on GitHub
IsRelatedTo
SpeechT5 software on GitHub
IsRelatedTo

Impact byBIP!

	selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	0
	popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.	Average
	influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	Average
	impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.	Average

Found an issue? Give us feedback

Average

Green

Zero-Shot vs. Few-Shot Multi-speaker TTS Using Pre-trained Czech SpeechT5 Model

Zero-Shot vs. Few-Shot Multi-speaker TTS Using Pre-trained Czech SpeechT5 Model

3 Research products, page 1 of 1

tortoise-tts software on GitHub

List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words software on GitHub

SpeechT5 software on GitHub