Paper Archive

[MICRO'19] Sparse Tensor Core

Sparse Tensor Core: Algorithm and Hardware Co-Design for Vector-wise Sparse Neural Networks on Modern GPUs

Abstract Deep neural networks have become a compelling solution for applications such as image classification, object detection, speech recognition, and machine translation. However, the gre...
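
The vector-wise sparsity this paper targets prunes weights in fixed-length vectors so the surviving pattern stays regular enough to map onto GPU tensor cores. A minimal numpy sketch of that pruning step; the vector length and survivor count below are illustrative, not the paper's exact parameters:

```python
import numpy as np

def vector_wise_prune(W, vec_len=8, keep=2):
    """Prune each length-`vec_len` vector of W down to its `keep`
    largest-magnitude entries, producing a regular, vector-wise
    sparsity pattern. Sketch only; parameters are illustrative."""
    Wp = W.copy()
    rows, cols = W.shape
    assert cols % vec_len == 0
    for r in range(rows):
        for c in range(0, cols, vec_len):
            vec = Wp[r, c:c + vec_len]
            # zero everything except the `keep` largest magnitudes
            drop = np.argsort(np.abs(vec))[:-keep]
            vec[drop] = 0.0
    return Wp

W = np.random.randn(4, 16)
print((vector_wise_prune(W) != 0).sum(axis=1))  # 2 survivors per 8-wide vector -> 4 per row
```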

[MICRO'19] ExTensor

ExTensor: An Accelerator for Sparse Tensor Algebra

Abstract Generalized tensor algebra is a prime candidate for acceleration via customized ASICs. Modern tensors feature a wide range of data sparsity, with the density of non-zero elements ranging ...
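
ExTensor's central idea is to intersect the nonzero coordinates of its operands (hierarchically, from tiles down to elements) so that arithmetic is only issued where both sides are nonzero. A software analogue of that intersection for a sparse dot product, assuming sorted coordinate lists as in compressed formats:

```python
def sparse_dot(coords_a, vals_a, coords_b, vals_b):
    """Two-pointer intersection of sorted nonzero coordinate lists;
    work is skipped wherever either operand is zero."""
    i = j = 0
    acc = 0.0
    while i < len(coords_a) and j < len(coords_b):
        if coords_a[i] == coords_b[j]:      # coordinates match: multiply-accumulate
            acc += vals_a[i] * vals_b[j]
            i += 1
            j += 1
        elif coords_a[i] < coords_b[j]:     # unmatched coordinate: skip, no arithmetic
            i += 1
        else:
            j += 1
    return acc

print(sparse_dot([0, 3, 7], [1.0, 2.0, 3.0],
                 [3, 7, 9], [4.0, 5.0, 6.0]))  # 2.0*4.0 + 3.0*5.0 = 23.0
```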

[TACO'19] Nicolas Vasilache, et al.

The Next 700 Accelerated Layers

Abstract A domain-specific language with a tensor notation close to the mathematics of deep learning; a Just-In-Time optimizing compiler based on the polyhedral framework; carefully coordinated li...
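
The Tensor Comprehensions notation at the heart of this paper is essentially Einstein summation plus pointwise operations, JIT-compiled through a polyhedral optimizer. The math (though none of the code generation) can be mimicked with numpy.einsum; the fused fully-connected + ReLU layer below mirrors the kind of layer the paper expresses in a few lines:

```python
import numpy as np

# TC-style comprehension: O(n, m) = relu(sum_k I(n, k) * W(m, k) + b(m))
def fcrelu(I, W, b):
    return np.maximum(np.einsum('nk,mk->nm', I, W) + b, 0.0)

I = np.random.rand(8, 16)   # batch x in_features
W = np.random.rand(4, 16)   # out_features x in_features
b = np.random.rand(4)
print(fcrelu(I, W, b).shape)  # (8, 4)
```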

[ICCV'19] Xingchen Ma, et al.

A Bayesian Optimization Framework for Neural Network Compression

Abstract A general Bayesian optimization framework for optimizing functions that are computed based on U-statistics is developed, and a method that gives a probabilistic approximation certificate o...
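
In the spirit of the paper (though without its U-statistics machinery), compression can be framed as black-box optimization over per-layer hyperparameters. A toy sketch using scikit-optimize's gp_minimize; the objective here is a synthetic stand-in that a real run would replace with the measured loss of the compressed network:

```python
import numpy as np
from skopt import gp_minimize

def objective(rates):
    """Synthetic stand-in: penalize accuracy loss (grows with pruning)
    against remaining model size (shrinks with pruning)."""
    rates = np.asarray(rates)
    acc_loss = np.sum(rates ** 3)   # illustrative accuracy penalty
    size = np.sum(1.0 - rates)      # illustrative size term
    return acc_loss + 0.1 * size

n_layers = 4
res = gp_minimize(objective,
                  dimensions=[(0.0, 0.95)] * n_layers,  # per-layer pruning rates
                  n_calls=30, random_state=0)
print("best per-layer rates:", res.x, "objective:", res.fun)
```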

[TC'19] NNPIM

NNPIM: A Processing In-Memory Architecture for Neural Network Acceleration

Abstract This paper proposes a novel processing in-memory architecture, called NNPIM, that significantly accelerates the inference phase of neural networks inside the memory and introduces simple optimi...

[VLSI'19] SIMBA

A 0.11 pJ/Op, 0.32-128 TOPS, Scalable Multi-Chip-Module-based Deep Neural Network Accelerator with Ground-Reference Signaling in 16nm

Abstract This work presents a scalable deep neural network (DNN) accelerator consisting of 36 chips connected in a mesh network on a multi-chip-module (MCM) using ground-referenced signaling (GRS)...

[arXiv'19] M. Naumov, et al.

Deep Learning Recommendation Model for Personalization and Recommendation Systems

Abstract A state-of-the-art deep learning recommendation model (DLRM) is developed, its implementation in both the PyTorch and Caffe2 frameworks is provided, and a specialized parallelization scheme...
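
DLRM's structure is simple to sketch: dense features pass through a bottom MLP, each sparse feature through an embedding table, pairwise dot products between the resulting vectors form the interaction features, and a top MLP produces the prediction. A minimal PyTorch version with illustrative sizes:

```python
import torch
import torch.nn as nn

class TinyDLRM(nn.Module):
    """Structural sketch of DLRM; all sizes are illustrative."""
    def __init__(self, num_dense=4, table_sizes=(100, 100), dim=8):
        super().__init__()
        self.bottom = nn.Sequential(nn.Linear(num_dense, dim), nn.ReLU())
        self.tables = nn.ModuleList(nn.Embedding(n, dim) for n in table_sizes)
        n_vecs = 1 + len(table_sizes)
        n_pairs = n_vecs * (n_vecs - 1) // 2
        self.top = nn.Sequential(nn.Linear(n_pairs + dim, 16), nn.ReLU(),
                                 nn.Linear(16, 1))

    def forward(self, dense, sparse):              # sparse: (batch, num_tables) ids
        vecs = [self.bottom(dense)]                # bottom MLP on dense features
        vecs += [emb(sparse[:, i]) for i, emb in enumerate(self.tables)]
        Z = torch.stack(vecs, dim=1)               # (batch, n_vecs, dim)
        inter = torch.bmm(Z, Z.transpose(1, 2))    # pairwise dot-product interactions
        iu = torch.triu_indices(Z.size(1), Z.size(1), offset=1)
        feats = torch.cat([inter[:, iu[0], iu[1]], vecs[0]], dim=1)
        return torch.sigmoid(self.top(feats))      # click probability

m = TinyDLRM()
p = m(torch.randn(2, 4), torch.randint(0, 100, (2, 2)))
print(p.shape)  # torch.Size([2, 1])
```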

[ASPLOS'19] Buffets

Buffets: An Efficient and Composable Storage Idiom for Explicit Decoupled Data Orchestration

Abstract Accelerators spend significant area and effort on custom on-chip buffering. Unfortunately, these solutions are strongly tied to particular designs, hampering reusability across other acc...
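
The buffet idiom the paper proposes exposes explicit fill/read/shrink operations: a producer fills, a consumer reads by offset relative to the head (re-reading freely) and shrinks the window when a tile is done. A single-threaded Python sketch of those semantics; the real idiom also has an update operation and synchronizes producer and consumer implicitly in hardware, which this model reduces to availability checks:

```python
from collections import deque

class Buffet:
    """Toy sequential model of a buffet's fill/read/shrink interface."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = deque()

    def fill(self, value):
        # producer pushes data, bounded by capacity (hardware would stall)
        assert len(self.data) < self.capacity, "would stall: buffet full"
        self.data.append(value)

    def read(self, offset):
        # consumer reads relative to the head; values may be re-read
        assert offset < len(self.data), "would stall: not yet filled"
        return self.data[offset]

    def shrink(self, n):
        # consumer retires n values from the head, freeing producer space
        for _ in range(n):
            self.data.popleft()

buf = Buffet(capacity=4)
for v in (10, 20, 30):
    buf.fill(v)
print(buf.read(0), buf.read(1))  # repeated reads before retiring
buf.shrink(2)                    # now the producer can refill
```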

[TPAMI'19] Res2Net

Res2Net: A New Multi-Scale Backbone Architecture

Abstract This paper proposes a novel building block for CNNs, namely Res2Net, by constructing hierarchical residual-like connections within one single residual block that represents multi-scale fe...
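
The block's idea: split the channels of a residual block into groups, pass each group (after adding the previous group's output) through its own 3x3 conv, and concatenate, so later groups see increasingly large receptive fields. A stripped-down PyTorch sketch that omits the surrounding 1x1 convs, batch norm, and shortcut:

```python
import torch
import torch.nn as nn

class Res2NetSplit(nn.Module):
    """Hierarchical split-and-conv core of a Res2Net block (simplified)."""
    def __init__(self, channels, scale=4):
        super().__init__()
        assert channels % scale == 0
        self.scale = scale
        w = channels // scale
        self.convs = nn.ModuleList(
            nn.Conv2d(w, w, kernel_size=3, padding=1) for _ in range(scale - 1))

    def forward(self, x):
        splits = torch.chunk(x, self.scale, dim=1)
        out, y = [splits[0]], None             # first split passes through untouched
        for i, conv in enumerate(self.convs):
            inp = splits[i + 1] if y is None else splits[i + 1] + y
            y = conv(inp)                      # hierarchical residual-like connection
            out.append(y)
        return torch.cat(out, dim=1)

x = torch.randn(1, 64, 32, 32)
print(Res2NetSplit(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```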

[FCCM'19] T2S-Tensor

T2S-Tensor: Productively Generating High-Performance Spatial Hardware for Dense Tensor Computations

Abstract We present a language and compilation framework for productively generating high-performance systolic arrays for dense tensor kernels on spatial architectures, including FPGAs and CGRAs. ...
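
A systolic array, the hardware target here, computes matrix multiplication by streaming skewed operands through a grid of multiply-accumulate PEs. A cycle-level toy model in numpy of one such dataflow (output-stationary; a generator like T2S-Tensor can emit several):

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-level sketch of an output-stationary systolic array.
    PE (i, j) accumulates C[i, j]; operands arrive skewed so that
    A[i, k] and B[k, j] meet at PE (i, j) on cycle t = i + j + k."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for t in range(M + N + K - 2):       # total cycles until the last PE drains
        for i in range(M):
            for j in range(N):
                k = t - i - j            # which operand pair reaches PE (i, j) now
                if 0 <= k < K:
                    C[i, j] += A[i, k] * B[k, j]
    return C

A = np.random.rand(3, 4)
B = np.random.rand(4, 2)
assert np.allclose(systolic_matmul(A, B), A @ B)
```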