Archive - parkdongho.github.io/paper-review


2024

[TVLSI'24] Pengbo Yu, et.al. ¹

An Energy Efficient Soft SIMD Microarchitecture and Its Application on Quantized CNNs

#simd

[TCAS-I'24] HDSuper ⁰

HDSuper: High-Quality and High Computational Utilization Edge Super-Resolution Accelerator With Hardware-Algorithm Co-Design Techniques
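Summary : This work explores hardware-algorithm co-design techniques to deliver an end-to-end platform comprising a lightweight super-resolution network (LSR) and an efficient, high-quality SR accelerator, HDSuper, and designs a unified computing core (UCC) combined with an efficient flatten-and-allocate (F-A) mapping strategy to support various operators with high computational utilization.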

[DATE'24] WideSA ⁰ 📄

WideSA: A High Array Utilization Mapping Scheme for Uniform Recurrences on ACAP
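Summary : Proposes WideSA, a mapping scheme that aims to accelerate uniform recurrences on the Versal ACAP architecture by exploiting the capabilities of both the hardware and the computation, and introduces an automatic mapping framework.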

#mapper #modeling #polyhedral #conference

[HPCA'24] POM ⁰ 📄

An Optimizing Framework on MLIR for Efficient FPGA-based Accelerator Generation
Summary : Proposes POM, an end-to-end optimizing framework built on multi-level intermediate representation (MLIR) that enables advanced dependency analysis and extensive FPGA-oriented loop transformations.

#modeling #polyhedral #mlir #conference

[CAL'24] DeMM ⁰

DeMM: A Decoupled Matrix Multiplication Engine Supporting Relaxed Structured Sparsity
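Summary : Designs an accelerator that operates on wide blocks with relaxed structured sparsity, targeting a broader range of sparsity in DL models, and demonstrates significant latency improvements over current state-of-the-art systolic array engines built for fine-grained and relaxed structured sparsity.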

#sparse #architecture #letter

2023

[TCAD'23] Rubick ⁰ 📄

Rubick: A Unified Infrastructure for Analyzing, Exploring, and Implementing Spatial Architectures via Dataflow Decomposition
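Summary : Decomposes the dataflow into two low-level intermediate representations, access entries and data layouts, which enables structural analysis and systematic exploration and allows hardware implementation details such as PE interconnections and memory structures to be inferred.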

#dse #polyhedral #chisel #modeling #journal

[TCAS-I'23] ACNPU ⁰

ACNPU: A 4.75TOPS/W 1080P@30FPS Super Resolution Accelerator With Decoupled Asymmetric Convolution

#journal

[DAC'23] Rubick ³ 📄

Rubick: A Synthesis Framework for Spatial Architectures via Dataflow Decomposition
Summary : Rubick decomposes the dataflow into two low-level intermediate representations, access entries and data layouts, to provide efficient DSE and generate optimized hardware.

#dse #chisel #modeling #modeling/dataflow #polyhedral #conference

[DAC'23] AutoMM ⁸

High Performance, Low Power Matrix Multiply Design on ACAP: from Architecture, Design Challenges and DSE Perspectives
Summary : Presents AutoMM, an automatic white-box framework that systematically generates MM accelerator designs on Versal, achieving 3.7 TFLOPs, 7.5 TOPs, and 28.2 TOPs for the FP32, INT16, and INT8 data types, respectively.

#modeling #conference

[CS'23] Rui Xu, et.al. ⁸

A Survey of Design and Optimization for Systolic Array-based DNN Accelerators

#survey

[TCAD'23] TensorLib ² 📄

TensorLib: Automatic Generation of Spatial Accelerator for Tensor Algebra
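Summary : TensorLib automatically generates hardware designs with different dataflows for a variety of tensor algebra programs, and demonstrates its performance on a Xilinx VU9P FPGA, reaching a 318 MHz frequency and 786 GFLOP/s throughput and outperforming state-of-the-art generators.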

#dse #polyhedral #chisel #modeling #generator #journal

[arXiv'23] ChipGPT ²⁰

ChipGPT: How far are we from natural language hardware design
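Summary : Demonstrates an automated design environment that uses LLMs to generate hardware logic designs from natural-language specifications, and shows a broader design optimization space than prior work and than existing LLMs used alone.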

#dse

[JSEN'23] F. Spagnolo, et.al. ⁸

Design of a Low-Power Super-Resolution Architecture for Virtual Reality Wearable Devices

[FPGA'23] CHARM ¹⁸

CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture
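Summary : Identifies the biggest system throughput bottleneck as the mismatch between the many small MM layers of varying sizes within an application and the massive computational resources of one monolithic accelerator, and proposes the CHARM framework, which composes multiple diverse MM accelerator architectures that work concurrently on different layers within an application.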

#dse #modeling #conference

2022

[ICCAD'22] HECTOR ⁷ 📄

HECTOR: A Multi-level Intermediate Representation for Hardware Synthesis Methodologies
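Summary : Hector, a two-level IR that provides a unified intermediate representation for hardware synthesis methodologies, yields hardware comparable to that generated by state-of-the-art HLS tools and can outperform HLS implementations in performance and productivity.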

#generator #conference

[TRTS'22] TAPA ¹⁰

TAPA: A Scalable Task-parallel Dataflow Programming Framework for Modern FPGAs with Co-optimization of HLS and Physical Design
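Summary : Proposes TAPA, an end-to-end framework that compiles a C++ task-parallel dataflow program into a high-frequency FPGA accelerator, adopting a coarse-grained floorplanning step during HLS compilation for accurate pipelining of potential critical paths.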

[DAC'22] EMS ¹⁰ 📄

EMS: efficient memory subsystem synthesis for spatial accelerators
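Summary : A framework that optimizes the memory subsystem of spatial accelerators. Using space-time transformation, it (1) automatically generates data layouts, data mappings, and memory access controllers through reuse analysis, and (2) generates various PE-memory interconnection topologies, including direct, multicast, and rotated connections.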

#polyhedral #chisel #modeling #generator #memory #noc #conference

[TRTS'22] FPGA HLS Today ⁴⁸

FPGA HLS Today: Successes, Challenges, and Opportunities

[FPGA'22] HeteroFlow ¹⁸ 📄

HeteroFlow: An Accelerator Programming Model with Decoupled Data Placement for Software-Defined FPGAs
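Summary : Proposes HeteroFlow, an FPGA accelerator programming model that decouples the algorithm specification from optimizations related to orchestrating data placement across a customized memory hierarchy, built on top of the open-source HeteroCL DSL and compilation framework.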

#modeling #conference

2021

[TRTS'21] Yi-Hsiang Lai, et.al. ²¹

Programming and Synthesis for Software-defined FPGA Acceleration: Status and Future Prospects

[arXiv.org'21] Jie Wang, et.al. ³

Search for Optimal Systolic Arrays: A Comprehensive Automated Exploration Framework and Lessons Learned

#dse

[FPGA'21] Sextans ³⁹

Sextans: A Streaming Accelerator for General-Purpose Sparse-Matrix Dense-Matrix Multiplication
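Summary : The Sextans accelerator features fast random access using on-chip memory, streaming access to large off-chip matrices, PE-aware non-zero scheduling for balanced workloads with an II=1 pipeline, and hardware flexibility that lets the hardware be prototyped once to support SpMMs of different sizes as a general-purpose accelerator.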

#architecture #sparse #conference

[PACT'21] Union ¹¹ 📄

Union: A Unified HW-SW Co-Design Ecosystem in MLIR for Evaluating Tensor Operations on Spatial Accelerators
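Summary : This work introduces Union, a HW-SW co-design ecosystem for spatial accelerators, and demonstrates its value to the community through several case studies that offload different tensor operations onto diverse accelerator architectures using different mapping schemes.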

#modeling #polyhedral #mlir #conference

[HPCA'21] S2TA ⁴⁴ 📄

S2TA: Exploiting Structured Sparsity for Energy-Efficient Mobile CNN Acceleration
Summary : S2TA is described, a systolic array-based CNN accelerator that exploits joint weight and activation DBB sparsity and new dimensions of data reuse unavailable on the traditional systolic array.

#sparse #architecture #conference

[CAL'21] STONNE ³⁶ 📄

STONNE: Enabling Cycle-Level Microarchitectural Simulation for DNN Inference Accelerators
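Summary : STONNE, a cycle-level microarchitectural simulator for state-of-the-art rigid and flexible DNN inference accelerators, can be plugged into any high-level DNN framework as an accelerator device and perform full-model evaluation of real, unmodified dense and sparse DNN models.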

#modeling #sparse #letter

[ISCA'21] RaPiD ⁵²

RaPiD: AI Accelerator for Ultra-low Precision Training and Inference
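Summary : This work designs RaPiD, a 4-core AI accelerator chip supporting a range of precisions, namely 16- and 8-bit floating point and 4- and 2-bit fixed point, and evaluates DNN inference and DNN training on a 768 TFLOPs AI system comprising four 32-core RaPiD chips.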

#architecture

[ASPLOS'21] FAST ³⁵

A full-stack search technique for domain optimized deep learning accelerators
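Summary : Uses FAST to analyze bottlenecks in state-of-the-art vision and natural language processing (NLP) models, including EfficientNet and BERT, and to design accelerators that address those bottlenecks, showing that FAST-generated accelerators can be practical for moderate-sized datacenter deployments.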

#conference

[ISCA'21] CoSA ⁷⁰ 📄

CoSA: Scheduling by Constrained Optimization for Spatial Accelerators
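Summary : CoSA leverages the regularities in DNN operators and hardware to formulate DNN scheduling as a mixed-integer programming (MIP) problem with algorithmic and architectural constraints, which can be solved to automatically generate a highly efficient schedule in one shot.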

#modeling #objective-function #polyhedral

[ISCA'21] TENET ⁴¹ 📄

TENET: A Framework for Modeling Tensor Dataflow Based on Relation-centric Notation
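Summary : Introduces a relation-centric notation that formally describes the hardware dataflow for tensor computation; by using more sophisticated affine transformations, it is more expressive than compute-centric and data-centric notations and inherently supports accurate metric estimation, including data reuse, bandwidth, latency, and energy.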

#polyhedral #modeling

[ISCA'21] HASCO ⁴⁸ 📄

HASCO: Towards Agile HArdware and Software CO-design for Tensor Computation
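Summary : Proposes HASCO, an agile co-design approach that provides an efficient HW/SW solution for dense tensor computation, and develops a multi-objective Bayesian optimization algorithm to explore hardware optimization.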

#modeling #hw-sw-codesign #chisel

[DAC'21] TensorLib ¹⁹ 📄

TensorLib: A Spatial Accelerator Generation Framework for Tensor Algebra
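Summary : Substantially improves productivity in developing and optimizing spatial hardware architectures, and provides a rich design space with trade-offs among performance, area, and power.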

#polyhedral #dse #chisel #modeling

[CGO'21] MLIR ²⁵⁵ 📄

MLIR: Scaling Compiler Infrastructure for Domain Specific Computation

#compiler #mlir

[FPGA'21] AutoSA ⁹⁵ 📄

AutoSA: A Polyhedral Compiler for High-Performance Systolic Arrays on FPGA
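Summary : Introduces AutoSA, an end-to-end compilation framework for generating systolic arrays on FPGA based on the polyhedral framework, which further incorporates a set of optimizations across different dimensions to boost performance.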

#polyhedral #dse #modeling #generator

[J'21] T. Hoefler, et.al. ⁴⁷⁰

Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks

#sparse

[ACCESS'21] L. Sekanina ²³

Neural Architecture Search and Hardware Accelerator Co-Search: A Survey

#survey

[OJCAS'21] W. Gross, et.al. ⁵

Hardware-Aware Design for Edge Intelligence

2020

[HPCA'20] HDA ⁹³

Heterogeneous Dataflow Accelerators for Multi-DNN Workloads
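Summary : Heterogeneous dataflow accelerators (HDAs) provide finer-grained dataflow flexibility with higher energy efficiency and lower area cost than RDAs; the work also introduces Herald, a framework that co-optimizes hardware partitioning and layer scheduling.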

#mapper #modeling #dataflow/flexible

[ICCAD'20] SuSy ³⁵ 📄

SuSy: A Programming Model for Productive Construction of High-Performance Systolic Arrays on FPGAs
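Summary : SuSy is a programming framework composed of a domain-specific language (DSL) and a compilation flow that lets programmers productively build high-performance systolic arrays on FPGAs, approaching the performance of expert-optimized manual designs while using 30x fewer lines of code.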

#modeling #dsl #hls

[ICCAD'20] GAMMA ⁸⁹ 📄

GAMMA: Automating the HW Mapping of DNN Models on Accelerators via Genetic Algorithm
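Summary : Constructs an extremely flexible map-space and shows that GAMMA can explore the space and determine an optimized mapping with high sample efficiency; a quantitative comparison against many popular optimization methods finds that GAMMA consistently finds better solutions.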

#modeling

[JSSC'20] Minkyu Kim, et.al. ⁸ 📄

An Energy-Efficient Deep Convolutional Neural Network Accelerator Featuring Conditional Computing and Low External Memory Access

[FPGA'20] AutoDSE ⁹

AutoDSE: Enabling Software Programmers Design Efficient FPGA Accelerators
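Summary : Presents AutoDSE, an automated DSE framework that leverages a bottleneck-guided gradient optimizer to systematically find better design points; at each step it identifies the design's bottleneck and focuses on high-impact parameters to overcome it, proceeding the way an expert would.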

#modeling

[JETCAS'20] SRNPU ³¹

SRNPU: An Energy-Efficient CNN-Based Super-Resolution Processor With Tile-Based Selective Super-Resolution in Mobile Devices

#architecture

[ISPASS'20] SCALE-Sim ¹¹⁴ 📄

A Systematic Methodology for Characterizing Scalability of DNN Accelerators using SCALE-Sim
Summary : This work demonstrates and analyzes the trade-off space for performance, DRAM bandwidth, and energy, identifies sweet spots for various workloads and hardware configurations, and observes that a judicious choice of scaling can lead to performance improvements as high as 50x per layer, within the available DRAM bandwidth.

#modeling

[ISPASS'20] ASTRA-SIM ³¹ 📄

ASTRA-SIM: Enabling SW/HW Co-Design Exploration for Distributed DL Training Platforms
Summary : The SW/HW design-space for Distributed Training over a hierarchical scale-up fabric is established, the promise of algorithm-topology co-design for speeding up end to end training is demonstrated, and a network simulator is developed for navigating the design-space.

#modeling

[TC'20] Xi Zeng, et.al. ²²

Addressing Irregularity in Sparse Neural Networks Through a Cooperative Software/Hardware Approach

#sparse

[MICRO'20] MAESTRO ¹⁰⁵

MAESTRO: A Data-Centric Approach to Understand Reuse, Performance, and Hardware Cost of DNN Mappings
Summary : An analytical cost model, MAESTRO, that analyzes various forms of data reuse in an accelerator based on inputs quickly and generates more than 20 statistics including total latency, energy, throughput, etc., as outputs is proposed.

#modeling

[ICASSP'20] dMazeRunner ¹⁵

dMazeRunner: Optimizing Convolutions on Dataflow Accelerators
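Summary : The dMazeRunner framework optimizes execution methods for accelerating convolution and matrix multiplication on a given architecture, and enables exploration of dataflow accelerator designs for efficiently executing CNN models.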

#modeling

[ISCA'20] DSAGEN ⁷⁸ 📄

DSAGEN: Synthesizing Programmable Spatial Accelerators
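Summary : Develops DSAGEN, a framework that automates the hardware/software co-design process for reconfigurable accelerators, building on the insight that many prior accelerator architectures can be approximated by composing a small number of hardware primitives for spatial architectures.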

#dse #modeling

[ASPLOS'20] FlexTensor ¹⁴¹

FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System
Summary : FlexTensor optimizes tensor computation programs without human intervention, letting programmers work only on high-level programming abstractions without considering hardware platform details.

#architecture #modeling #conference

[HPCA'20] SIGMA ²⁹⁴ 📄

SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training
Summary : SIGMA is proposed, a flexible and scalable architecture that offers high utilization of all its processing elements (PEs) regardless of kernel shape and sparsity, and includes a novel reduction tree microarchitecture named Forwarding Adder Network (FAN).

#accelerator #sparse

[JSSC'20] SIMBA ⁶²

A 0.32–128 TOPS, Scalable Multi-Chip-Module-Based Deep Neural Network Inference Accelerator With Ground-Referenced Signaling in 16 nm
Summary : A scalable DNN accelerator consisting of 36 chips connected in a mesh network on a multi-chip-module (MCM) using ground-referenced signaling (GRS) enables flexible scaling for efficient inference on a wide range of DNNs, from mobile to data center domains.

[FPGA'20] AutoDNNchip ⁷⁶

AutoDNNchip: An Automated DNN Chip Predictor and Builder for Both FPGAs and ASICs
Summary : The proposed AutoDNNchip is a DNN chip generator that can automatically produce both FPGA- and ASIC-based DNN chip implementations from DNNs developed by machine learning frameworks without humans in the loop and can achieve better performance than that of expert-crafted state-of-the-art FPGAs and ASICs.

#generator

[TVLSI'20] Jaehyeong Sim, et.al. ³¹

An Energy-Efficient Deep Convolutional Neural Network Inference Processor With Enhanced Output Stationary Dataflow in 65-nm CMOS

[TC'20] A. Ardakani, et.al. ³²

Fast and Efficient Convolutional Accelerator for Edge Computing

#sparse

2019

[arXiv.org'19] Gemmini ⁵²

Gemmini: An Agile Systolic Array Generator Enabling Systematic Evaluations of Deep-Learning Architectures

#dse #generator #chisel

[ICCAD'19] MAGNet ⁹⁶ 📄

MAGNet: A Modular Accelerator Generator for Neural Networks
Summary : MAGNet, a modular accelerator generator for neural networks, is proposed and an inference accelerator optimized for image classification application using three different neural networks—AlexNet, ResNet, and DriveNet is designed.

#generator

[MICRO'19] ExTensor ¹⁶⁶

ExTensor: An Accelerator for Sparse Tensor Algebra
Summary : The ExTensor accelerator is proposed, which builds these novel ideas on handling sparsity into hardware to enable better bandwidth utilization and compute throughput, and is evaluated on several kernels relative to industry libraries and state-of-the-art tensor algebra compilers.

#sparse

[MICRO'19] Sparse Tensor Core ¹⁰³

Sparse Tensor Core: Algorithm and Hardware Co-Design for Vector-wise Sparse Neural Networks on Modern GPUs
Summary : A novel pruning algorithm is devised to improve the workload balance and reduce the decoding overhead of the sparse neural networks and new instructions and micro-architecture optimization are proposed in Tensor Core to adapt to the structurally sparse Neural networks.

#architecture #sparse

[MICRO'19] eCNN ⁴⁰

eCNN: A Block-Based and Highly-Parallel CNN Accelerator for Edge Inference

#architecture

[MICRO'19] Simba ²⁷²

Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture
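Summary : This work investigates and quantifies the costs and benefits of using MCMs with fine-grained chiplets for deep learning inference, an application area with large compute and on-chip storage requirements, and introduces three tiling optimizations that improve data locality.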

#architecture #dense #noc

[MICRO'19] SparTen ¹⁶⁵

SparTen: A Sparse Tensor Accelerator for Convolutional Neural Networks
Summary : SparTen is proposed which achieves efficient inner join in sparse CNNs by providing support for native two-sided sparse execution and memory storage and to tackle load imbalance, SparTen employs a software scheme, called greedy balancing, which groups filters by density via two variants.

#architecture #sparse

[MICRO'19] ASV ³¹

ASV: Accelerated Stereo Vision System
Summary : ASV is described, an accelerated stereo vision system that simultaneously improves both performance and energy-efficiency while achieving high accuracy, and a new stereo algorithm, invariant-based stereo matching (ISM), that achieves significant speedup while retaining high accuracy.

[TACOT'19] Nicolas Vasilache, et.al. ¹³

The Next 700 Accelerated Layers

[ICCV'19] Xingchen Ma, et.al. ²⁰

A Bayesian Optimization Framework for Neural Network Compression

[TC'19] NNPIM ⁴²

NNPIM: A Processing In-Memory Architecture for Neural Network Acceleration

[VLSI'19] SIMBA ⁴²

A 0.11 pJ/Op, 0.32-128 TOPS, Scalable Multi-Chip-Module-based Deep Neural Network Accelerator with Ground-Reference Signaling in 16nm
Summary : This work presents a scalable deep neural network accelerator consisting of 36 chips connected in a mesh network on a multi-chip-module (MCM) using ground-referenced signaling (GRS), which enables flexible scaling for efficient inference on a wide range of DNNs, from mobile to data center domains.

[arXiv.org'19] M. Naumov, et.al. ⁵⁶⁷

Deep Learning Recommendation Model for Personalization and Recommendation Systems

[ASPLOS'19] Buffets ⁶⁷

Buffets: An Efficient and Composable Storage Idiom for Explicit Decoupled Data Orchestration
Summary : This work presents buffets, an efficient and composable storage idiom for the needs of accelerators that is independent of any particular design, and implements buffets in RTL and shows that they only add 2% control overhead over an 8KB RAM.

#memory

[TPAMI'19] Res2Net ¹⁷⁷¹

Res2Net: A New Multi-Scale Backbone Architecture

[FCCM'19] T2S-Tensor ⁴³ 📄

T2S-Tensor: Productively Generating High-Performance Spatial Hardware for Dense Tensor Computations
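Summary : A language and compilation framework for productively generating high-performance systolic arrays for dense tensor kernels on spatial architectures, including FPGAs and CGRAs, which decouples the functional specification from the spatial mapping so that programmers can quickly explore various spatial optimizations for the same function.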

#modeling #hls

[DATE'19] Nhut-Minh Ho, et.al. ⁴

Multi-objective Precision Optimization of Deep Neural Networks for Edge Devices

[ISPASS'19] Timeloop ³⁵⁵ 📄

Timeloop: A Systematic Approach to DNN Accelerator Evaluation
Summary : Timeloop's underlying models and algorithms are described in detail and results from case studies enabled by Timeloop are shown, which reveal that dataflow and memory hierarchy co-design plays a critical role in optimizing energy efficiency.

#dse #modeling

[FPGA'19] HeteroCL ¹¹⁵

HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing
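Summary : Experimental results show that programmers can efficiently explore the design space in terms of both performance and accuracy by combining different types of hardware customization and targeting spatial architectures, while keeping the algorithm code intact.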

#modeling

[HPCA'19] Shortcut Mining ³³

Shortcut Mining: Exploiting Cross-Layer Shortcut Reuse in DCNN Accelerators
Summary : A novel approach that “mines” the unexploited opportunity of on-chip data reusing by introducing the abstraction of logical buffers to address the lack of flexibility in existing buffer architecture, and proposing a sequence of procedures which can effectively reuse both shortcut and non-shortcut feature maps.

#cross-layer

[JSSC'19] UNPU ¹⁷⁹ 📄

UNPU: An Energy-Efficient Deep Neural Network Accelerator With Fully Variable Weight Bit Precision

2018

[BigData'18] YOLO-LITE ⁴¹²

YOLO-LITE: A Real-Time Object Detection Algorithm Optimized for Non-GPU Computers

[ASID'18] NVDLA ³⁰

Research on NVIDIA Deep Learning Accelerator

#architecture #dense #conference

[ICCAD'18] PolySA ⁹² 📄

PolySA: Polyhedral-Based Systolic Array Auto-Compilation
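Summary : PolySA is the first fully automated compilation framework for generating high-performance systolic array architectures on FPGA by leveraging recent advances in high-level synthesis; it can generate optimal designs within one hour with performance comparable to state-of-the-art manual designs.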

#polyhedral #modeling

[MICRO'18] Cambricon-S ¹⁷¹ 📄

Cambricon-S: Addressing Irregularity in Sparse Neural Networks through A Cooperative Software/Hardware Approach
Summary : A software-based coarse-grained pruning technique, together with local quantization, significantly reduces the size of indexes and improves the network compression ratio and a hardware accelerator is designed to address the remaining irregularity of sparse synapses and neurons efficiently.

#architecture #sparse

[ASPLOS'18] Interstellar ¹⁸²

Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators

#modeling

[JETCAS'18] Eyeriss v2 ⁶⁶⁹ 📄

Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices

#architecture #sparse

[PLDI'18] Spatial ¹⁶⁰

Spatial: a language and compiler for application accelerators

#dsl-arch #modeling #polyhedral

[ISCA'18] GANAX ⁶⁹

GANAX: A Unified MIMD-SIMD Acceleration for Generative Adversarial Networks
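Summary : Proposes GANAX's unified MIMD-SIMD design, which exploits repeated patterns of computation to create distinct microprograms that execute concurrently in SIMD mode. It also extends the concept of access-execute architectures to the finest granularity of computation for each individual operand.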

#architecture #isa #simd

[MICRO'18] MAESTRO ²¹⁶

Understanding Reuse, Performance, and Hardware Cost of DNN Dataflow: A Data-Centric Approach
Summary : This work introduces a set of data-centric directives to concisely specify the DNN dataflow space in a compiler-friendly form and codifies this analysis into an analytical cost model, MAESTRO (Modeling Accelerator Efficiency via Spatio-Temporal Reuse and Occupancy), that estimates various cost-benefit tradeoffs of a dataflow including execution time and energy efficiency for a DNN model and hardware configuration.

#dse #modeling

[DPCC'18] MAESTRO ⁸

Understanding Reuse, Performance, and Hardware Cost of DNN Dataflows: A Data-Centric Approach Using MAESTRO.
Summary : This work introduces a set of data-centric directives to concisely specify the space of DNN dataflows in a compiler-friendly form and codifies this analysis into an analytical cost model, MAESTRO (Modeling Accelerator Efficiency via Spatio-Temporal Reuse and Occupancy), that estimates various cost-benefit tradeoffs of a dataflow including execution time and energy efficiency for a DNN model and hardware configuration.

#modeling

[CGO'18] Tiramisu ²⁰⁶ 📄

Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code
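Summary : Tiramisu introduces a scheduling language with novel commands, designed for the areas of image processing, stencils, linear algebra, and deep learning, for explicitly managing the complexities that arise when targeting such systems.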

#polyhedral #objective-function #scheduler #compiler

[HPCA'18] OuterSPACE ¹⁹⁷ 📄

OuterSPACE: An Outer Product Based Sparse Matrix Multiplication Accelerator
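Summary : OuterSPACE is a highly scalable, energy-efficient, reconfigurable design consisting of massively parallel single-program, multiple-data (SPMD)-style processing units, distributed memories, a high-speed crossbar, and High Bandwidth Memory (HBM).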

#architecture #sparse

[ASPLOS'18] MAERI ³³⁴

MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects
Summary : MAERI is a DNN accelerator built with a set of modular and configurable building blocks that can easily support myriad DNN partitions and mappings by appropriately configuring tiny switches and provides 8-459% better utilization across multiple dataflow mappings over baselines with rigid NoC fabrics.

#architecture #sparse

[HCS'18] Bit-Tactical ²⁴

Bit-Tactical: Exploiting Ineffectual Computations in Convolutional Neural Networks: Which, Why, and How

#bit-serial

[CC'18] Spatial Locality ³⁶ 📄

Modeling the conflicting demands of parallelism and Temporal/Spatial locality in affine scheduling

#objective-function #polyhedral #scheduler #compiler

[FPGA'18] Duncan J. M. Moss, et.al. ⁷¹

A Customizable Matrix Multiplication Framework for the Intel HARPv2 Xeon+FPGA Platform: A Deep Learning Case Study

[arXiv'18] Tensor Comprehensions ³⁸⁴ 📄

Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions
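Summary : Contributions include a language close to the mathematics of deep learning, called Tensor Comprehensions, offering both imperative and declarative styles; a polyhedral just-in-time compiler that converts a mathematical description of a deep learning DAG into a CUDA kernel with delegated memory management and synchronization; and a compilation cache populated by an autotuner.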

#objective-function #polyhedral #scheduler

2017

[ICCV'17] Yihui He, et.al. ²²⁴⁷

Channel Pruning for Accelerating Very Deep Neural Networks

[ISCA'17] Plasticine ²⁰⁴

Plasticine: A reconfigurable architecture for parallel patterns
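Summary : Designs Plasticine, a new spatially reconfigurable architecture built to efficiently execute applications composed of parallel patterns, providing up to 76.9x improvement in performance-per-watt over a conventional FPGA across a wide range of dense and sparse applications.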

#dense #sparse #architecture #simulator

[DAC'17] Automated Systolic Array ³⁵³ 📄

Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs
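Summary : This paper implements CNNs on an FPGA using a systolic array architecture that can achieve a high clock frequency under high resource utilization, provides an analytical model for performance and resource utilization, and develops an automatic design space exploration framework.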

#dse #modeling

[ISCA'17] SCNN ⁹⁹⁶ 📄

SCNN: An accelerator for compressed-sparse convolutional neural networks
Summary : The Sparse CNN (SCNN) accelerator architecture is introduced, which improves performance and energy efficiency by exploiting the zero-valued weights that stem from network pruning during training and zero-valued activations that arise from the common ReLU operator.

#architecture #sparse

[arXiv.org'17] MobileNets ¹⁷²⁵¹

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

[ISCA'17] TPU ⁴⁰⁸¹ 📄

In-datacenter performance analysis of a tensor processing unit
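Summary : This paper evaluates the Tensor Processing Unit (TPU), a custom ASIC deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN), and compares it to contemporary server-class Intel Haswell CPUs and Nvidia K80 GPUs deployed in the same datacenters.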

#dense #architecture #conference

[HPCA'17] FlexFlow ²⁶³ 📄

FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks
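Summary : Proposes a flexible dataflow architecture (FlexFlow) that can leverage the complementary effects among feature map, neuron, and synapse parallelism to mitigate the mismatch between the parallel types supported by the computing engine and the dominant parallel types of CNN workloads.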

#dataflow/flexible #architecture

[FPGA'17] U. Aydonat, et.al. ²³³

An OpenCL™ Deep Learning Accelerator on Arria 10

2016

[CVPR'16] Tien-Ju Yang, et.al. ⁶⁶²

Designing Energy-Efficient Convolutional Neural Networks Using Energy-Aware Pruning

[MICRO'16] Fused-Layer ⁵³¹

Fused-layer CNN accelerators
Summary : This work finds that a previously unexplored dimension exists in the design space of CNN accelerators that focuses on the dataflow across convolutional layers, and is able to fuse the processing of multiple CNN layers by modifying the order in which the input data are brought on chip, enabling caching of intermediate data between the evaluation of adjacent CNN layers.

#cross-layer #memory

[MICRO'16] Cambricon-X ⁶¹⁰ 📄

Cambricon-X: An accelerator for sparse neural networks
Summary : A novel accelerator is proposed, Cambricon-X, to exploit the sparsity and irregularity of NN models for increased efficiency and experimental results show that this accelerator achieves, on average, 7.23x speedup and 6.43x energy saving against the state-of-the-art NN accelerator.

#architecture #sparse

[ISCA'16] Cambricon ²⁷³ 📄

Cambricon: An Instruction Set Architecture for Neural Networks
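Summary : Based on a comprehensive analysis of existing NN techniques, this paper proposes Cambricon, a novel domain-specific instruction set architecture (ISA) for NN accelerators; it is a load-store architecture that integrates scalar, vector, matrix, logical, data transfer, and control instructions.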

#isa #architecture

[ISCA'16] Eyeriss ¹³³⁰ 📄

Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks
Summary : A novel dataflow, called row-stationary (RS), is presented that minimizes data movement energy consumption on a spatial architecture and can adapt to different CNN shape configurations and reduces all types of data movement through maximally utilizing the processing engine local storage, direct inter-PE communication and spatial parallelism.

#architecture #dense

[TPLS'16] Pluto+ ⁵¹

The Pluto+ Algorithm: A Practical Approach for Parallelization and Locality Optimization of Affine Loop Nests

#objective-function

[FPGA'16] Jiantao Qiu, et.al. ¹⁰⁶⁷

Going Deeper with Embedded FPGA Platform for Convolutional Neural Network

[ISCA'16] EIE ²²⁷² 📄

EIE: Efficient Inference Engine on Compressed Deep Neural Network
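Summary : An energy-efficient inference engine (EIE) that performs inference on the compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing; it is 189x and 13x faster than CPU and GPU implementations, respectively, of the same DNN without compression.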

#architecture #sparse

[JSSC'16] Eyeriss ²¹⁹¹ 📄

Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks

#architecture #dense

[MICRO'16] Stripes ³³⁵

Stripes: Bit-serial deep neural network computing

#sparse #architecture #bit-serial

2015

[ISSCC'15] K-Brain ⁹³

A 1.93TOPS/W scalable deep learning/inference processor with tetra-parallel MIMD architecture for big-data applications
Summary : Proposes a high-performance, energy-efficient DL/DI (deep inference) processor to realize user-centric pattern recognition on portable devices.

#training

[PACT'15] PENCIL ¹⁰⁷

PENCIL: A Platform-Neutral Compute Intermediate Language for Accelerator Programming

#compiler #scheduler #polyhedral

[doi'15] Deep Compression ⁷⁷⁹⁶

Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding

[ISCA'15] ShiDianNao ⁴⁷³

ShiDianNao: Shifting vision processing closer to the sensor
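Summary : In CNNs, weights are shared among many neurons, which considerably reduces the network's memory footprint. Exploiting this property, the entire CNN is mapped into SRAM, eliminating all DRAM accesses for weights, and the accelerator is placed next to the image sensor, eliminating all DRAM accesses for inputs and outputs.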

#architecture #dense

[NIPS'15] Song Han, et.al. ⁵⁷⁶³

Learning both Weights and Connections for Efficient Neural Network

[ICCV'15] FlowNet ³⁷⁵⁸

FlowNet: Learning Optical Flow with Convolutional Networks

[ASPLOS'15] PolyMage ²⁰⁴

PolyMage: Automatic Optimization for Image Processing Pipelines

#scheduler #compiler #polyhedral

[arXiv'15] PLuTo ¹²⁸

PLuTo: A Practical and Fully Automatic Polyhedral Program Optimization System

#objective-function

2014

[MICRO'14] DaDianNao ¹³⁰⁴ 📄

DaDianNao: A Machine-Learning Supercomputer
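Summary : Introduces a custom multi-chip machine-learning architecture and shows that, on a subset of the largest known neural network layers, a 64-chip system can achieve a speedup of 450.65x over a GPU and reduce energy by 150.31x on average.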

#architecture #dense #inference

[PRL'14] Xun Sun, et.al. ³⁰

Real-time local stereo via edge-aware disparity propagation

[CVPR'14] Christian Szegedy, et.al. ³⁹⁷³⁷

Going deeper with convolutions

[cviu'14] Denis Fortun, et.al. ³⁶

Aggregation of local parametric candidates with exemplar-based occlusion handling for optical flow

[BMVC'14] Max Jaderberg, et.al. ¹³⁹³

Speeding up Convolutional Neural Networks with Low Rank Expansions

[NIPS'14] Emily L. Denton, et.al. ¹⁵⁷⁸

Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation

[ASPLOS'14] DianNao ¹⁴⁷¹

DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning
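Summary : This work designs an accelerator for large-scale CNNs and DNNs with a special emphasis on the impact of memory on accelerator design, performance, and energy, and shows that a high-throughput accelerator capable of 452 GOP/s can be built in a small footprint.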

#architecture #dense #conference #inference

2013

[SC'13] Uday Bondhugula ⁷²

Compiling affine loop nests for distributed-memory parallel architectures

[ISCA'13] Triggered instructions ¹¹⁹

Triggered instructions: a control paradigm for spatially-programmed architectures
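Summary : This approach provides a unified mechanism that avoids over-serialized execution, essentially achieving the effect of techniques such as dynamic instruction reordering and multithreading, each of which requires distinct hardware mechanisms in a traditional sequential architecture.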

#isa #architecture

[PLDI'13] Halide ¹¹⁰⁸

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines

#compiler #scheduler

[DAC'13] PolyCGRA ⁵⁵ 📄

Polyhedral model based mapping optimization of loop nests for CGRAs
Summary : This paper proposes extracting an efficient heuristic loop transformation and mapping algorithm (PolyMAP) to improve mapping performance and shows that this proposed approach can improve the performance of the kernels by 21% on average, as compared to one of the best existing mapping algorithm, EPIMap.

#polyhedral

2012

[PPL'12] Polly ³⁵⁶

Polly - Performing Polyhedral Optimizations on a Low-Level Intermediate Representation

#polyhedral #compiler #scheduler

[HotNets-XI'12] Debayan Gupta, et.al. ²³⁰⁴

A new approach to interdomain routing based on secure multi-party computation

[LCPC'12] AlphaZ ⁶⁴

AlphaZ: A System for Design Space Exploration in the Polyhedral Model

[MICRO'12] DySER ²¹⁷

DySER: Unifying Functionality and Parallelism Specialization for Energy-Efficient Computing
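Summary : The DySER (Dynamically Specializing Execution Resources) architecture supports both functionality specialization and parallelism specialization, and outperforms an out-of-order CPU, SSE (Streaming SIMD Extensions) acceleration, and GPU acceleration while consuming less energy.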

[DAC'12] Chisel ⁸¹⁶

Chisel: Constructing hardware in a Scala embedded language

#chisel #dsl #generator

[ICDE'12] Jia Pan, et.al. ⁵⁵

Bi-level Locality Sensitive Hashing for k-Nearest Neighbor Computation

[IMPACT'12] Nicolas, et.al. ³⁸

Joint Scheduling and Layout Optimization to Enable Multi-Level Vectorization

#objective-function

2011

[CAN'11] Rehan Hameed, et.al. ²⁵¹

Understanding sources of inefficiency in general-purpose chips

[CVPR'11] Christoph Rhemann, et.al. ¹¹¹⁶

Fast cost-volume filtering for visual correspondence and beyond

[TPAMI'11] T. Brox, et.al. ¹⁴²¹

Large Displacement Optical Flow: Descriptor Matching in Variational Motion Estimation

2010

[ICMS'10] isl ³⁹²

isl: An Integer Set Library for the Polyhedral Model

[CVPR'10] H. Jégou, et.al. ²⁶¹⁷

Aggregating local descriptors into a compact image representation

2008

[PACT'08] Hyunchul Park, et.al. ¹⁹⁸

Edge-centric modulo scheduling for coarse-grained reconfigurable architectures

[PLDI'08] Pluto ⁹⁵³ 📄

A practical automatic polyhedral parallelizer and locality optimizer
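Summary : An automatic polyhedral source-to-source transformation framework that can optimize regular programs for parallelism and locality simultaneously, implemented as a tool that automatically generates OpenMP parallel code from C program sections.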

#objective-function #compiler #scheduler

2007

['07] J. Bennett, et.al. ²²⁰⁰

The Netflix Prize

['07] CHiLL ²¹¹

CHiLL: A Framework for Composing High-Level Loop Transformations

1992

[I'92] P. Feautrier ⁴⁰⁶

Some efficient solutions to the affine scheduling problem. Part II. Multidimensional time

1985
