Chipmunk: A Systolically Scalable 0.9 mm${}^2$, 3.08 Gop/s/mW @ 1.2 mW Accelerator for Near-Sensor Recurrent Neural Network Inference

Preprint English OPEN
Conti, Francesco ; Cavigelli, Lukas ; Paulin, Gianna ; Susmelj, Igor ; Benini, Luca (2017)
  • Subject: Computer Science - Distributed, Parallel, and Cluster Computing | Computer Science - Sound | Computer Science - Neural and Evolutionary Computing | Computer Science - Learning

Recurrent neural networks (RNNs) are state-of-the-art in voice awareness/understanding and speech recognition. On-device computation of RNNs on low-power mobile and wearable devices would be key to applications such as zero-latency voice-based human-machine interfaces. Here we present Chipmunk, a small (<1 mm${}^2$) hardware accelerator for Long Short-Term Memory RNNs in UMC 65 nm technology, capable of operating at a measured peak efficiency of up to 3.08 Gop/s/mW at 1.24 mW peak power. To implement large RNN models without incurring a huge memory transfer overhead, multiple Chipmunk engines can cooperate to form a single systolic array. In this way, the Chipmunk architecture in a 75-tile configuration can achieve real-time phoneme extraction on a demanding RNN topology proposed by Graves et al., consuming less than 13 mW of average power.
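The headline figures in the abstract can be combined into two rough derived quantities. The sketch below is only back-of-envelope arithmetic, not a result reported by the authors: it assumes the peak efficiency and peak power refer to the same operating point (as the title's "3.08 Gop/s/mW @ 1.2 mW" suggests), and it assumes the array's average power is spread evenly over the 75 tiles with no shared overhead. The function and constant names are illustrative.

```python
# Back-of-envelope figures derived from the abstract.
# Assumptions (not stated in the source): efficiency and power describe the
# same operating point, and array power is divided evenly across tiles.

PEAK_EFFICIENCY_GOPS_PER_MW = 3.08  # measured peak efficiency (abstract)
PEAK_POWER_MW = 1.24                # peak power (abstract)
NUM_TILES = 75                      # systolic configuration (abstract)
ARRAY_POWER_BOUND_MW = 13.0         # "less than 13 mW" average power (abstract)


def implied_throughput_gops(efficiency_gops_per_mw: float, power_mw: float) -> float:
    """Throughput in Gop/s implied by an efficiency (Gop/s/mW) and a power (mW)."""
    return efficiency_gops_per_mw * power_mw


def per_tile_power_bound_mw(total_power_mw: float, tiles: int) -> float:
    """Upper bound on average power per tile if power is shared evenly."""
    return total_power_mw / tiles


if __name__ == "__main__":
    throughput = implied_throughput_gops(PEAK_EFFICIENCY_GOPS_PER_MW, PEAK_POWER_MW)
    per_tile = per_tile_power_bound_mw(ARRAY_POWER_BOUND_MW, NUM_TILES)
    print(f"implied single-engine throughput: {throughput:.2f} Gop/s")
    print(f"average power per tile: below {per_tile:.3f} mW")
```

Under these assumptions a single engine delivers roughly 3.8 Gop/s at its peak-efficiency point, and each tile in the 75-tile array averages under about 0.17 mW, well below the single-engine peak power.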
  • References (17)

    A. Graves, A.-R. Mohamed, and G. Hinton, “Speech Recognition With Deep Recurrent Neural Networks,” in Proc. IEEE ICASSP, 2013.

    W. Xiong, J. Droppo et al., “The Microsoft 2016 Conversational Speech Recognition System,” in Proc. IEEE ICASSP, 2017, pp. 5255-5259.

    K. Cho, B. van Merrienboer et al., “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation,” in Proc. ACL EMNLP, 2014, pp. 1724-1734.

    L. Cavigelli and L. Benini, “A 803 GOp/s/W Convolutional Network Accelerator,” IEEE TCSVT, 2016.

    F. Conti, R. Schilling et al., “An IoT Endpoint System-on-Chip for Secure and Energy-Efficient Near-Sensor Analytics,” IEEE TCAS, vol. 64, no. 9, pp. 2481-2494, Sep. 2017.

    R. Andri, L. Cavigelli et al., “YodaNN: An Architecture for Ultra-Low Power Binary-Weight CNN Acceleration,” IEEE TCAD, 2017.

    IEEE ISSCC, 2016, pp. 262-263.

    Z. Du, R. Fasthuber et al., “ShiDianNao: Shifting Vision Processing Closer to the Sensor,” in Proc. ACM/IEEE ISCA, 2015, pp. 92-104.

    V. Sze, Y.-H. Chen et al., “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” arXiv:1703.09039, 2017.

    N. P. Jouppi, A. Borchers et al., “In-Datacenter Performance Analysis of a Tensor Processing Unit,” in Proc. ACM ISCA, 2017.

  • Funded by
  • Related to
    FET H2020 -> FET HPC : HPC Core Technologies, Programming Environments and Algorithms for Extreme Parallelism and Extreme Data Applications
    FET H2020 -> FET HPC : European Exascale Processor Memory Node Design