[HPCA'21] S2TA

S2TA: Exploiting Structured Sparsity for Energy-Efficient Mobile CNN Acceleration

Zhi-gang Liu et al., July 16, 2021
doi.org

Abstract

Exploiting sparsity is a key technique in accelerating quantized convolutional neural network (CNN) inference on mobile devices. Prior sparse CNN accelerators largely exploit unstructured sparsity and achieve significant speedups. Due to the unbounded, largely unpredictable sparsity patterns, however, exploiting unstructured sparsity requires complicated hardware design with significant energy and area overhead, which is particularly detrimental to mobile/IoT inference scenarios where energy and area efficiency are crucial.

We propose to exploit structured sparsity, more specifically, Density Bound Block (DBB) sparsity for both weights and activations. DBB block tensors bound the maximum number of non-zeros per block. DBB thus exposes statically predictable sparsity patterns that enable lean sparsity-exploiting hardware and efficient memory access. We propose new hardware primitives to implement DBB sparsity for (static) weights and (dynamic) activations, respectively, with very low overheads.

Building on top of these primitives, we describe S2TA, a systolic array-based CNN accelerator that exploits joint weight and activation DBB sparsity and new dimensions of data reuse unavailable on the traditional systolic array. S2TA in 16nm achieves more than 2× speedup and energy reduction compared to a strong baseline of a systolic array with zero-value clock gating, over five popular CNN benchmarks. Compared to two recent non-systolic sparse accelerators, Eyeriss v2 (65nm) and SparTen (45nm), S2TA in 65nm uses about 2.2× and 3.1× less energy per inference, respectively.
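To make the DBB constraint concrete, here is a minimal NumPy sketch of the pruning rule itself: within every block, keep at most a bounded number of non-zeros (the largest magnitudes) and zero the rest. This is not the paper's code; the name `enforce_dbb` and the 8-wide block with a bound of 4 non-zeros are illustrative assumptions. In S2TA the bound is applied statically to weights (offline) and dynamically to activations (in hardware); the sketch shows only the block-level rule.

```python
import numpy as np

def enforce_dbb(tensor: np.ndarray, block_size: int = 8, max_nonzeros: int = 4) -> np.ndarray:
    """Apply a Density Bound Block (DBB) constraint along the last axis:
    within each block of `block_size` values, keep at most `max_nonzeros`
    entries (the largest magnitudes) and zero out the rest."""
    assert tensor.size % block_size == 0, "tensor must tile into whole blocks"
    blocks = tensor.reshape(-1, block_size).copy()
    # Per block, indices of the (block_size - max_nonzeros) smallest magnitudes.
    drop = np.argsort(np.abs(blocks), axis=1)[:, : block_size - max_nonzeros]
    np.put_along_axis(blocks, drop, 0, axis=1)
    return blocks.reshape(tensor.shape)

# One 8-element block pruned to at most 4 non-zeros:
w = np.array([3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, 6.0])
print(enforce_dbb(w))  # -> [ 0.  0.  4.  0. -5.  9.  0.  6.]
```

Because every block then carries at most `max_nonzeros` values, each block can be stored as a fixed-size payload of values plus positions. That fixed bound is what makes the sparsity pattern statically predictable, enabling the lean hardware and efficient memory access the abstract describes.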

Figures

Fig. 1: Energy breakdown for a conventional dense INT8 systolic array accelerator, with typical 50% CNN sparsity. The INT8 MAC datapath itself is very compact (consuming only 20% of the energy), while the operand/result buffers dominate.

Fig. 2: Hardware structures for sparse GEMM require explicit data re-ordering, which introduces overheads (blocks in red), in the form of either (a) an operand gather stage before the MAC compute (SMT-SA [38]), or (b) a result scatter with a distributed accumulator (SCNN [30]).

Fig. 3: Effective energy/area and speedup, with breakdown by component, for INT8 systolic array variants processing a typical convolution with 50% weight and activation sparsity. The SMT variant, our INT8 re-implementation of [38], achieves a speedup, but its buffering overheads actually result in worse energy efficiency than even the dense case.

