Preprint · 2016

Full-Capacity Unitary Recurrent Neural Networks

Wisdom, Scott; Powers, Thomas; Hershey, John R.; Le Roux, Jonathan; Atlas, Les
Open Access · English
Published: 31 Oct 2016
Abstract
Recurrent neural networks are powerful models for processing sequential data, but they are generally plagued by vanishing and exploding gradient problems. Unitary recurrent neural networks (uRNNs), which use unitary recurrence matrices, have recently been proposed as a means to avoid these issues. However, in previous experiments, the recurrence matrices were restricted to be a product of parameterized unitary matrices, and an open question remains: when does such a parameterization fail to represent all unitary matrices, and how does this restricted representational capacity limit what can be learned? To address this question, we propose full-capacity uRNNs that optimize their recurrence matrix over all unitary matrices, leading to a novel training method.
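The abstract describes optimizing the recurrence matrix directly over the full set of unitary matrices rather than over a restricted parameterized product. As a rough illustration only, the sketch below shows a Cayley-transform-style multiplicative gradient step of the kind used for optimization on the unitary/Stiefel manifold (see reference [14]); the function name, learning rate, matrix size, and random gradient are placeholder assumptions for this sketch, not the authors' actual training setup.

```python
import numpy as np

def unitary_cayley_update(W, G, lr=0.05):
    """One multiplicative gradient step that keeps W unitary (sketch).

    W  : current n x n unitary recurrence matrix (complex).
    G  : Euclidean gradient of the loss with respect to W (placeholder here).
    lr : step size (hypothetical value).

    Builds the skew-Hermitian direction A = G W^H - W G^H and applies the
    Cayley step W <- (I + lr/2 A)^{-1} (I - lr/2 A) W, which stays on the
    unitary group up to numerical precision.
    """
    n = W.shape[0]
    A = G @ W.conj().T - W @ G.conj().T          # skew-Hermitian by construction
    I = np.eye(n, dtype=W.dtype)
    return np.linalg.solve(I + (lr / 2) * A, (I - (lr / 2) * A) @ W)

# Tiny check: start from a random unitary matrix and take one step.
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
W, _ = np.linalg.qr(X)                            # random unitary starting point
G = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
W_new = unitary_cayley_update(W, G)
print(np.allclose(W_new.conj().T @ W_new, np.eye(8)))  # True: still unitary
```

Because the update multiplies the current matrix by a unitary (Cayley) factor, the recurrence matrix remains unitary after every step without a separate re-projection, which is the property the full-capacity approach relies on.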
Subjects
Free-text keywords: Statistics - Machine Learning, Computer Science - Learning, Computer Science - Neural and Evolutionary Computing
References (15 of 23 shown)

[1] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166, 1994.

[2] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. C. Kremer and J. F. Kolen, editors, A field guide to dynamical recurrent neural networks. IEEE Press, 2001.

[3] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training Recurrent Neural Networks. arXiv:1211.5063, Nov. 2012.

[4] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120, Dec. 2013.

[5] Q. V. Le, N. Jaitly, and G. E. Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv:1504.00941, Apr. 2015.

[6] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

[7] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv:1409.1259, 2014.

[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv:1512.03385, Dec. 2015.

[9] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In Advances in Neural Information Processing Systems (NIPS), pp. 2204-2212, 2014.

[10] M. Arjovsky, A. Shah, and Y. Bengio. Unitary Evolution Recurrent Neural Networks. In International Conference on Machine Learning (ICML), Jun. 2016.

[11] A. S. Householder. Unitary triangularization of a nonsymmetric matrix. Journal of the ACM, 5(4):339-342, 1958.

[12] R. Gilmore. Lie groups, physics, and geometry: an introduction for physicists, engineers and chemists. Cambridge University Press, 2008.

[13] A. Sard. The measure of the critical values of differentiable maps. Bulletin of the American Mathematical Society, 48(12):883-890, 1942.

[14] H. D. Tagare. Notes on optimization on Stiefel manifolds. Technical report, Yale University, 2011.

[15] T. Tieleman and G. Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
