Publication · Conference object · Preprint · 2019

CSS10: A Collection of Single Speaker Speech Datasets for 10 Languages

Park, Kyubyong; Mulc, Thomas
Open Access
  • Published: 15 Sep 2019
  • Publisher: ISCA
Abstract
We describe our development of CSS10, a collection of single-speaker speech datasets for ten languages. It is composed of short audio clips from LibriVox audiobooks and their aligned texts. To validate its quality, we train two neural text-to-speech models on each dataset. Subsequently, we conduct Mean Opinion Score tests on the synthesized speech samples. We make our datasets, pre-trained models, and test resources publicly available. We hope they will be used for future speech tasks.
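
For readers who want to work with the corpus directly, below is a minimal sketch of iterating over one CSS10 language pack. It assumes the pipe-delimited transcript layout of the authors' GitHub release (https://github.com/Kyubyong/css10), i.e. one transcript.txt per language with fields clip path, original script, normalized script, and duration in seconds; the field order, the "./css10/de" path, and the helper names are illustrative assumptions, not an official API.

    import wave
    from pathlib import Path

    def load_transcript(root):
        # Yield (clip_path, script, normalized_script, seconds) from one
        # CSS10 language pack. The pipe-delimited layout assumed here
        # follows the authors' GitHub release; verify the field order
        # against the transcript.txt you actually downloaded.
        root = Path(root)
        with open(root / "transcript.txt", encoding="utf-8") as f:
            for line in f:
                clip, script, normalized, seconds = line.rstrip("\n").split("|")
                yield root / clip, script, normalized, float(seconds)

    def wav_seconds(path):
        # Read the WAV header with the standard library to cross-check
        # the transcript's duration column against the actual audio.
        with wave.open(str(path), "rb") as w:
            return w.getnframes() / w.getframerate()

    # Hypothetical local path to one downloaded language pack.
    for clip, script, normalized, seconds in load_transcript("./css10/de"):
        print(f"{clip.name}  {seconds:.2f}s  {normalized[:60]}")
        break  # show just the first entry

Keeping the parser to the standard library means the same loop works for any of the ten language packs; only the root path changes.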
Subjects
Free-text keywords: Speech recognition, Computer science, Computer Science - Computation and Language
References (34 total; page 1 of 3 shown)

[1] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.

[2] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., “Tacotron: A fully end-to-end text-to-speech synthesis model,” arXiv preprint arXiv:1703.10135, 2017.

[3] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” arXiv preprint arXiv:1712.05884, 2017.

[4] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, and Y. Bengio, “Char2Wav: End-to-end speech synthesis,” 2017.

[5] S. O. Arık, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, J. Raiman, S. Sengupta et al., “Deep Voice: Real-time neural text-to-speech,” arXiv preprint arXiv:1702.07825, 2017.

[6] S. O. Arık, G. Diamos, A. Gibiansky, J. Miller, K. Peng, W. Ping, J. Raiman, and Y. Zhou, “Deep Voice 2: Multi-speaker neural text-to-speech,” arXiv preprint arXiv:1705.08947, 2017.

[7] W. Ping, K. Peng, A. Gibiansky, S. O. Arık, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep Voice 3: 2000-speaker neural text-to-speech,” arXiv preprint arXiv:1710.07654, 2017.

[8] H. Tachibana, K. Uenoyama, and S. Aihara, “Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention,” arXiv preprint arXiv:1710.08969, 2017.

[9] Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, “Voice synthesis for in-the-wild speakers via a phonological loop,” arXiv preprint arXiv:1707.06588, 2017.

[10] K. Ito, “The LJ Speech Dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017.

[11] “Blizzard Challenge 2018,” https://www.synsig.org/index.php/Blizzard_Challenge_2018, 2018.

[12] J. Yamagishi, T. Nose, H. Zen, Z. H. Ling, T. Toda, K. Tokuda, S. King, and S. Renals, “Robust speaker-adaptive HMM-based text-to-speech synthesis,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 6, pp. 1208-1230, Aug 2009.

[13] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: An ASR corpus based on public domain audio books,” http://www.openslr.org/12/, 2015.

[14] A. Rousseau, P. Deléglise, and Y. Estève, “TED-LIUM: An automatic speech recognition dedicated corpus,” http://www-lium.univ-lemans.fr/en/content/ted-lium-corpus, 2014.

[15] “Voxforge,” http://www.voxforge.org/, 2006.
