Preprint · 2018

MPNA: A Massively-Parallel Neural Array Accelerator with Dataflow Optimization for Convolutional Neural Networks

Hanif, Muhammad Abdullah; Putra, Rachmad Vidya Wicaksana; Tanvir, Muhammad; Hafiz, Rehan; Rehman, Semeen; Shafique, Muhammad
Open Access · English
Published: 30 Oct 2018
Abstract
State-of-the-art accelerators for Convolutional Neural Networks (CNNs) typically focus on accelerating only the convolutional layers and give little attention to the fully-connected layers. Hence, they lack a synergistic optimization of the hardware architecture and diverse dataflows for the complete CNN design, which would offer a higher potential for performance/energy efficiency. Towards this, we propose a novel Massively-Parallel Neural Array (MPNA) accelerator that integrates two heterogeneous systolic arrays with their respective highly-optimized dataflow patterns to jointly accelerate both the convolutional (CONV) and the fully-connected (FC) layers. Besides ...
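For readers unfamiliar with the underlying computation, the sketch below illustrates in Python the kind of tiled matrix multiplication that systolic arrays such as those in MPNA accelerate: both CONV layers (after an im2col-style lowering) and FC layers reduce to a GEMM mapped onto a grid of MAC units. This is a minimal, illustrative model only; the array size, the output-stationary dataflow, and all names are assumptions of this sketch, not details taken from the MPNA paper.

# Illustrative sketch (not the MPNA design): a simple model of an
# output-stationary systolic array computing C = A x B, the core GEMM that
# both CONV (after im2col lowering) and FC layers reduce to. Array size,
# dataflow choice, and naming are assumptions for illustration only.
import numpy as np

def systolic_gemm(A, B, rows=4, cols=4):
    """Multiply A (M x K) by B (K x N) on a rows x cols grid of MACs,
    tiling M over PE rows and N over PE columns (output-stationary)."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    # Process one (rows x cols) output tile at a time.
    for m0 in range(0, M, rows):
        for n0 in range(0, N, cols):
            acc = np.zeros((rows, cols), dtype=A.dtype)  # per-PE accumulators
            # One step per k: in hardware the operands are skewed into the
            # array; here we only model the MAC each PE performs per step.
            for k in range(K):
                for i in range(min(rows, M - m0)):      # PE row
                    for j in range(min(cols, N - n0)):  # PE column
                        acc[i, j] += A[m0 + i, k] * B[k, n0 + j]
            C[m0:m0 + rows, n0:n0 + cols] = acc[:min(rows, M - m0), :min(cols, N - n0)]
    return C

# Quick check against NumPy's reference GEMM.
A = np.random.randint(-8, 8, (6, 5))
B = np.random.randint(-8, 8, (5, 7))
assert np.array_equal(systolic_gemm(A, B), A @ B)

The output-stationary choice above simply keeps partial sums fixed in each PE while operands stream through; it is one common dataflow, shown here only to make the GEMM tiling concrete, and does not represent the dataflow patterns proposed in the paper.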
Subjects
Free-text keywords: Computer Science - Distributed, Parallel, and Cluster Computing; Computer Science - Hardware Architecture; Computer Science - Machine Learning