A Comparison of Modeling Units in Sequence-to-Sequence Speech Recognition with the Transformer on Mandarin Chinese
Zhou, Shiyu; Dong, Linhao; Xu, Shuang; Xu, Bo
Subject: Computer Science - Computation and Language | Electrical Engineering and Systems Science - Audio and Speech Processing | Computer Science - Sound
The choice of modeling units is critical to automatic speech recognition (ASR) tasks. Conventional ASR systems typically choose context-dependent states (CD-states) or context-dependent phonemes (CD-phonemes) as their modeling units. However, this choice has been challenged by s...