[JSSC'20] Minkyu Kim et al.

An Energy-Efficient Deep Convolutional Neural Network Accelerator Featuring Conditional Computing and Low External Memory Access

Minkyu Kim et al., October 19, 2020

Abstract

With their algorithmic success in many machine learning tasks and applications, deep convolutional neural networks (DCNNs) have been implemented with custom hardware in a number of prior works. However, such works have not fully exploited conditional/approximate computing to eliminate redundant computations in CNNs. This article presents a DCNN accelerator featuring a novel conditional computing scheme that synergistically combines precision cascading (PC) with zero skipping (ZS). To reduce the many redundant convolutions that are followed by max-pooling operations, we propose precision cascading: the input features are divided into a number of low-precision groups, and approximate convolutions using only the most significant bits (MSBs) are performed first. Based on this approximate computation, the full-precision convolution is performed only on the max-pooling output that is found. This way, the total number of bit-wise convolutions can be reduced by $\sim 2\times$ with <0.8% degradation in ImageNet accuracy. PC provides the added benefit of increased sparsity per low-precision group, which we exploit with ZS to eliminate the corresponding clock cycles and external memory accesses. The proposed conditional computing scheme has been implemented with a custom architecture in a 40-nm prototype chip, which achieves a peak energy efficiency of 24.97 TOPS/W at a 0.6-V supply and a low external memory access of 0.0018 access/MAC with the VGG-16 CNN for ImageNet classification, and a peak energy efficiency of 28.51 TOPS/W at a 0.9-V supply with FlowNet on the Flying Chair data set.
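To make the PC + ZS idea concrete, here is a minimal NumPy sketch of the scheme the abstract describes, for a single max-pooling window. It is an illustrative model, not the paper's hardware dataflow: the bit-group split (4 MSBs of 8-bit activations), the window size, and the helper names (`pc_maxpool_dot`, `w_dot_skip`) are my own assumptions. The approximate MSB-only dot products rank the pooling candidates, only the predicted winner gets a full-precision dot product, and zero-valued activations are skipped entirely (ZS).

```python
import numpy as np

def w_dot_skip(a, w):
    """Dot product with zero skipping: zero activations cost nothing."""
    nz = a != 0
    return int(np.dot(a[nz].astype(np.int64), w[nz].astype(np.int64)))

def pc_maxpool_dot(acts, weights, msb_bits=4, total_bits=8):
    """Precision-cascading sketch for one max-pooling window.

    acts:    (4, K) unsigned activations, one row per pooling candidate
    weights: (K,) weights shared by all candidates
    Returns the full-precision dot product of the predicted winner.
    """
    shift = total_bits - msb_bits
    msb = acts >> shift  # keep only the MSB group (sparser than full precision)
    # Approximate convolution per candidate using MSBs only, with ZS.
    approx = np.array([w_dot_skip(row << shift, weights) for row in msb])
    winner = int(np.argmax(approx))
    # Full-precision convolution only for the predicted max-pool winner.
    return w_dot_skip(acts[winner], weights)
```

Note that the winner prediction is approximate: if two candidates differ only in their low-order bits, the MSB pass can pick the wrong one, which is the source of the small (<0.8%) accuracy degradation reported in the abstract.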

Figures

Fig. 1. - PC multiplication of the input feature by weight.

Fig. 2. - Conceptual operation of the PC scheme.

Fig. 3. - Overall latency of the PC scheme compared to the non-PC scheme.

Figs. 4–21.

Tables

Table I- Statistics About the Percentage of the Max-Pooling Results Found in Each Precision Group

Tables II–VI.