Towards Ultra-High Performance and Energy Efficiency of Deep Learning Systems: An Algorithm-Hardware Co-Optimization Framework

Publication type: Article, Preprint
Date: 29 Apr 2018
Embargo end date: 01 Jan 2018
Publisher: Association for the Advancement of Artificial Intelligence (AAAI)
Journal: Proceedings of the AAAI Conference on Artificial Intelligence, volume 32 (ISSN: 2159-5399, eISSN: 2374-3468)
Funded by: NSF | NeTS: Medium: Collaborati..., NSF | AitF: Collaborative Resea..., NSF | CPS: Medium: Enabling Mul... +1 projects

Authors: Wang, Yanzhi; Ding, Caiwen; Li, Zhe; Yuan, Geng; Liao, Siyu; Ma, Xiaolong; Yuan, Bo; +4 Authors

doi: 10.1609/aaai.v32i1.11653 , 10.48550/arxiv.1802.06402

arXiv: 1802.06402

Abstract

Hardware accelerations of deep learning systems have been extensively investigated in industry and academia. The aim of this paper is to achieve ultra-high energy efficiency and performance for hardware implementations of deep neural networks (DNNs). An algorithm-hardware co-optimization framework is developed, which is applicable to different DNN types, sizes, and application scenarios. The algorithm part adopts the general block-circulant matrices to achieve a fine-grained tradeoff of accuracy and compression ratio. It applies to both fully-connected and convolutional layers and contains a mathematically rigorous proof of the effectiveness of the method. The proposed algorithm reduces computational complexity per layer from O(n^2) to O(n log n) and storage complexity from O(n^2) to O(n), both for training and inference. The hardware part consists of highly efficient Field Programmable Gate Array (FPGA)-based implementations using effective reconfiguration, batch processing, deep pipelining, resource re-using, and hierarchical control. Experimental results demonstrate that the proposed framework achieves at least 152X speedup and 71X energy efficiency gain compared with IBM TrueNorth processor under the same test accuracy. It achieves at least 31X energy efficiency gain compared with the reference FPGA-based work.
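
The complexity reduction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each k-by-k block of the weight matrix is circulant and defined by its first column (the paper may use a different convention), and that k is a power of two. Under those assumptions, a block times a vector is a circular convolution, which the FFT evaluates in O(k log k) instead of O(k^2), while only k values per block need to be stored. All function names below are hypothetical.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for t in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * t / n) * odd[t]
        out[t] = even[t] + twiddle
        out[t + n // 2] = even[t] - twiddle
    return out


def circulant_matvec(c, x):
    """Multiply the circulant matrix with first column c by x via the FFT."""
    k = len(c)
    spectrum = [a * b for a, b in zip(fft(c), fft(x))]
    # The inverse transform above is unnormalized, so divide by k here.
    return [v.real / k for v in fft(spectrum, inverse=True)]


def block_circulant_matvec(blocks, x, k):
    """blocks[i][j] is the length-k defining vector of block (i, j).

    Assumption for this sketch: len(x) is a multiple of k.
    """
    q = len(x) // k
    # Transform each input sub-vector once and reuse it for every block row.
    x_freq = [fft(x[j * k:(j + 1) * k]) for j in range(q)]
    out = []
    for row in blocks:
        acc = [0j] * k
        for w, xf in zip(row, x_freq):
            wf = fft(w)
            acc = [a + u * v for a, u, v in zip(acc, wf, xf)]
        # One inverse FFT per block row, after accumulating in frequency domain.
        out.extend(v.real / k for v in fft(acc, inverse=True))
    return out


def dense_from_blocks(blocks, k):
    """Expand to a dense matrix, for checking only (O(n^2) storage)."""
    rows = []
    for row in blocks:
        for r in range(k):
            dense_row = []
            for w in row:
                dense_row.extend(w[(r - s) % k] for s in range(k))
            rows.append(dense_row)
    return rows


def dense_matvec(matrix, x):
    """Reference O(n^2) matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


if __name__ == "__main__":
    k = 4
    blocks = [[[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]],
              [[0.0, 1.0, 0.0, -2.0], [3.0, 0.0, 1.0, 1.0]]]
    x = [1.0, -2.0, 0.5, 3.0, 2.0, 0.0, -1.0, 4.0]
    fast = block_circulant_matvec(blocks, x, k)
    slow = dense_matvec(dense_from_blocks(blocks, k), x)
    print(max(abs(a - b) for a, b in zip(fast, slow)))
```

In this sketch the block size k is the knob for the accuracy and compression tradeoff mentioned in the abstract: an m-by-n layer stores m*n/k values instead of m*n, so larger blocks compress more while constraining the weights more strongly.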

Related Organizations

University of California System
United States
King’s University
United States
City University of New York
United States
Northeastern University
United States
Department of Electrical Engineering and Computer Science Stanford University
United States

Keywords

FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)

Related research products (9)

- An Algorithm-Hardware Co-design Framework to Overcome Imperfections of Mixed-signal DNN Accelerators (2022)
- Algorithm-Hardware Co-Design of Adaptive Floating-Point Encodings for Resilient Deep Learning Inference (2020)
- Algorithm-hardware co-design of a discontinuous Galerkin shallow-water model for a dataflow architecture on FPGA (2021)
- Adaptive Precision CNN Accelerator Using Radix-X Parallel Connected Memristor Crossbars (2019)
- Algorithm-Hardware Codesign of a Fast Parallel Routing Architecture for Clos Networks (2010)
- Synetgy (2019)
- An Algorithm–Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers (2022)
- FPGA Implementation of Real-time Star Centroid Extraction Algorithm (2019)
- Enabling Energy-Efficient and Robust Machine Intelligence with Algorithm-Hardware Co-Design (2020)

Impact by BIP!

- Selected citations: 11. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.