publication . Preprint . Conference object . 2017

Transfer Learning for Speech Recognition on a Budget

Julius Kunze; Louis Kirsch; Ilia Kurenkov; Andreas Krug; Jens Johannsmeier; Sebastian Stober;
Open Access English
  • Published: 01 Jun 2017
Abstract
End-to-end training of automated speech recognition (ASR) systems requires massive data and compute resources. We explore transfer learning based on model adaptation as an approach for training ASR models under constrained GPU memory, throughput and training data. We conduct several systematic experiments adapting a Wav2Letter convolutional neural network originally trained for English ASR to the German language. We show that this technique allows faster training on consumer-grade resources while requiring less training data in order to achieve the same accuracy, thereby lowering the cost of training ASR models in other languages. Model introspection revealed th...
Subjects
free text keywords: Computer Science - Learning, Computer Science - Computation and Language, Computer Science - Neural and Evolutionary Computing, Statistics - Machine Learning, Speech recognition, Transfer of learning, Computer science, Natural language processing, Artificial intelligence
Communities
Digital Humanities and Cultural Heritage
