Publication · Preprint · 2019

End-To-End Speech Recognition Using A High Rank LSTM-CTC Based Model

Shi, Yangyang; Hwang, Mei-Yuh; Lei, Xin
Open Access · English · Published: 12 Mar 2019
Abstract
Long Short Term Memory Connectionist Temporal Classification (LSTM-CTC) based end-to-end models are widely used in speech recognition due to their simplicity in training and efficiency in decoding. In conventional LSTM-CTC based models, a bottleneck projection matrix maps the hidden feature vectors obtained from the LSTM to the softmax output layer. In this paper, we propose a high rank projection layer to replace the projection matrix. The output of the high rank projection layer is a weighted combination of vectors that are projected from the hidden feature vectors via different projection matrices and a non-linear activation function. The high rank projection layer ...
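The abstract gives only the high-level shape of the layer, so the following is a minimal PyTorch sketch of one plausible reading, not the authors' implementation: K separate projection matrices with a tanh non-linearity, combined per frame with weights from a learned gate. The gating mechanism, the choice of tanh, and all names (HighRankProjection, num_heads, etc.) are assumptions; the abstract does not specify how the combination weights are obtained.

```python
# Sketch of a high rank projection layer per the abstract: the softmax
# input is a weighted combination of K vectors, each obtained by
# projecting the LSTM hidden state through its own matrix and a
# non-linearity. The per-frame gate is a hypothetical choice; the
# abstract does not say how the combination weights are computed.
import torch
import torch.nn as nn


class HighRankProjection(nn.Module):
    def __init__(self, hidden_dim: int, proj_dim: int, num_heads: int = 4):
        super().__init__()
        # One projection matrix per head (K different projections).
        self.projections = nn.ModuleList(
            [nn.Linear(hidden_dim, proj_dim) for _ in range(num_heads)]
        )
        # Assumed gate producing the per-frame combination weights.
        self.gate = nn.Linear(hidden_dim, num_heads)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, hidden_dim) LSTM outputs.
        weights = torch.softmax(self.gate(h), dim=-1)          # (B, T, K)
        projected = torch.stack(
            [torch.tanh(p(h)) for p in self.projections], dim=-1
        )                                                      # (B, T, P, K)
        # Weighted combination over the K projected vectors.
        return torch.einsum("btpk,btk->btp", projected, weights)


class LSTMCTCModel(nn.Module):
    """Toy LSTM-CTC acoustic model with the high rank projection in
    place of the usual single bottleneck matrix."""

    def __init__(self, feat_dim, hidden_dim, proj_dim, vocab_size, heads=4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.high_rank = HighRankProjection(2 * hidden_dim, proj_dim, heads)
        self.output = nn.Linear(proj_dim, vocab_size)  # CTC softmax layer

    def forward(self, feats):
        h, _ = self.lstm(feats)  # feats: (batch, time, feat_dim)
        return self.output(self.high_rank(h)).log_softmax(dim=-1)
```

Note that with num_heads = 1 the sketch reduces to the conventional single bottleneck projection, which is what makes the standard layer low-rank by comparison.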
Subjects
Free-text keywords: Computer Science - Computation and Language