[TC'20] Xi Zeng, et al.

Addressing Irregularity in Sparse Neural Networks Through a Cooperative Software/Hardware Approach

Xi Zeng, et al. on July 1, 2020

Abstract

Neural networks have rapidly become the dominant algorithms, achieving state-of-the-art performance in a broad range of applications such as image recognition, speech recognition, and natural language processing. However, neural networks keep moving toward deeper and larger architectures, posing a great challenge to hardware systems due to the huge amount of data and computation. Although *sparsity* has emerged as an effective way to directly reduce the intensity of computation and memory accesses, the irregularity caused by sparsity (in both sparse synapses and neurons) prevents accelerators from fully leveraging its benefits, and it also introduces a costly indexing module into accelerators. In this article, we propose a cooperative software/hardware approach to address the irregularity of sparse neural networks efficiently. We first observe *local convergence*: larger weights tend to gather into small clusters during training. Based on this key observation, we propose a software-based coarse-grained pruning technique that drastically reduces the irregularity of sparse synapses. Coarse-grained pruning, together with local quantization, significantly reduces the size of indexes and improves the network compression ratio. We further design a multi-core hardware accelerator, Cambricon-SE, to efficiently handle the remaining irregularity of sparse synapses and neurons.
The accelerator has three key features: 1) *selector modules* that filter out unnecessary synapses and neurons, 2) *compress/decompress modules* that exploit sparsity in data transmission (an aspect rarely studied in previous work), and 3) a *multi-core architecture* with high throughput to meet real-time processing requirements. Compared against a state-of-the-art sparse neural network accelerator, our accelerator is $1.20\times$ better in performance and $2.72\times$ better in energy efficiency. Moreover, for real-time video analysis tasks, Cambricon-SE processes 1080p video at 76.59 fps.
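The local-convergence observation above motivates pruning at block granularity rather than per weight: if large weights cluster together, whole blocks can be kept or dropped, and one index per surviving block replaces one index per surviving weight. The following is a minimal NumPy sketch of coarse-grained (block) pruning; the block size, L1 scoring, and keep ratio are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

def block_prune(weights, block=4, keep_ratio=0.5):
    """Coarse-grained pruning sketch: score contiguous blocks of weights
    by their L1 norm and zero out the lowest-scoring blocks entirely."""
    flat = weights.reshape(-1).astype(float)     # copy; original untouched
    pad = (-flat.size) % block                   # pad so length divides evenly
    padded = np.concatenate([flat, np.zeros(pad)])
    blocks = padded.reshape(-1, block)           # view into `padded`
    scores = np.abs(blocks).sum(axis=1)          # one L1 score per block
    n_keep = max(1, int(round(scores.size * keep_ratio)))
    keep = np.sort(np.argsort(scores)[-n_keep:]) # indices of surviving blocks
    mask = np.zeros(scores.size, dtype=bool)
    mask[keep] = True
    blocks[~mask] = 0.0                          # prune whole blocks at once
    return padded[:flat.size].reshape(weights.shape), keep
```

With block size $B$, the sparse representation stores one index per kept block instead of one per kept weight, cutting index storage by roughly a factor of $B$ relative to fine-grained pruning at the same sparsity.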

Figure

figures 1–25

Table

tables 1–6, 8, and 9