
Norm-Preservation: Why Residual Networks Can Become Extremely Deep?

Zaeemzadeh, Alireza; Rahnavard, Nazanin; Shah, Mubarak
Open Access
  • Published: 18 May 2018
  • Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence (ISSN: 0162-8828, eISSN: 1939-3539)
  • Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Abstract
Augmenting neural networks with skip connections, as introduced in the so-called ResNet architecture, surprised the community by enabling the training of networks of more than 1,000 layers with significant performance gains. This paper deciphers ResNet by analyzing the effect of skip connections, and puts forward new theoretical results on the advantages of identity skip connections in neural networks. We prove that the skip connections in the residual blocks facilitate preserving the norm of the gradient, and lead to stable back-propagation, which is desirable from an optimization perspective. We also show that, perhaps surprisingly, as more residual blocks are stacked …
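As a rough, self-contained illustration of the abstract's central claim (a sketch, not code from the paper), the NumPy snippet below back-propagates a gradient through a stack of linear blocks with and without identity skip connections. A plain block y = Wx multiplies the gradient by W^T on the backward pass, while a residual block y = x + Wx multiplies it by (I + W)^T; when the norm of W is small, I + W stays close to the identity, so the gradient norm is approximately preserved. The dimension, depth, and 0.1 scale factor are arbitrary choices made for the demonstration.

import numpy as np

# Hedged illustration (not the paper's code): compare gradient norms after
# back-propagating through 50 stacked linear blocks.
#   Plain block:    y = W x      -> backward pass multiplies the gradient by W^T
#   Residual block: y = x + W x  -> backward pass multiplies by (I + W)^T
rng = np.random.default_rng(0)
dim, depth = 64, 50

grad_plain = rng.standard_normal(dim)
grad_res = grad_plain.copy()
print(f"initial gradient norm: {np.linalg.norm(grad_plain):.3f}")

for _ in range(depth):
    # Random residual-branch weights with small spectral norm (roughly 0.2 here).
    W = 0.1 * rng.standard_normal((dim, dim)) / np.sqrt(dim)
    grad_plain = W.T @ grad_plain           # Jacobian W: the norm shrinks rapidly
    grad_res = grad_res + W.T @ grad_res    # Jacobian I + W: the norm barely moves

print(f"plain stack after {depth} blocks:    {np.linalg.norm(grad_plain):.3e}")
print(f"residual stack after {depth} blocks: {np.linalg.norm(grad_res):.3e}")

Running this, the plain-stack gradient norm collapses toward zero while the residual-stack norm stays near its initial value, consistent with the norm-preservation argument.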
Subjects
Free-text keywords: Computational Theory and Mathematics, Software, Applied Mathematics, Artificial Intelligence, Computer Vision and Pattern Recognition, Computer Science - Computer Vision and Pattern Recognition
Funded by
NSF | BIGDATA: IA: Distributed Semi-Supervised Training of Deep Models and Its Applications in Video Understanding
  • Funder: National Science Foundation (NSF)
  • Project Code: 1741431
  • Funding stream: Directorate for Computer & Information Science & Engineering | Division of Information and Intelligent Systems
NSF | CIF: Small: A Tensor-based Framework for Reliable Radio Cartography
  • Funder: National Science Foundation (NSF)
  • Project Code: 1718195
  • Funding stream: Directorate for Computer & Information Science & Engineering | Division of Computing and Communication Foundations