Optimizing NoCs for DNN Accelerators

The document discusses the challenges of traditional Network-on-Chip (NoC) architectures in supporting the unique traffic patterns of Deep Neural Network (DNN) accelerators, particularly in convolutional layers. It introduces the Microswitch NoC, which is designed to optimize latency, throughput, area, and energy efficiency while being reconfigurable to accommodate diverse neural network dimensions. The evaluation results demonstrate that the Microswitch NoC significantly outperforms traditional NoCs in terms of latency, area, power consumption, and energy efficiency.


Rethinking NoCs for

Spatial Neural Network Accelerators


Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna

Georgia Institute of Technology


Synergy Lab ([Link])

NOCS 2017
Oct 20, 2017
Emergence of DNN Accelerators
• Emerging DNN applications

2
Emergence of DNN Accelerators
• Convolutional Neural Network (CNN)
[Figure: a CNN pipeline of convolutional layers (feature extraction) producing intermediate features, followed by a pooling layer and a fully connected layer that summarizes the features into a classification, e.g., “Palace”]

3
Emergence of DNN Accelerators
• Computation in Convolutional Layers

• Sliding window operation over input feature maps


Image source: Y. Chen et al., Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional
Neural Networks, ISCA 2016
4
Emergence of DNN Accelerators
• Massive Parallelism in Convolutional Layers

// Simplified loop nest (stride 1, no padding; output size E = H - R + 1)
for (n = 0; n < N; n++) {               // Input feature maps (IFMaps)
  for (m = 0; m < M; m++) {             // Weight filters
    for (c = 0; c < C; c++) {           // IFMap/weight channels
      for (y = 0; y < E; y++) {         // Output feature map row
        for (x = 0; x < E; x++) {       // Output feature map column
          for (j = 0; j < R; j++) {     // Weight filter row
            for (i = 0; i < R; i++) {   // Weight filter column
              O[n][m][y][x] += W[m][c][j][i] * I[n][c][y + j][x + i];
}}}}}}}

Accumulation (+=) and multiplication (*) per output element
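The loop nest above can be written as a small runnable C function. The dimensions below are toy values chosen for illustration only (they are not from the slides), and boundary handling is simplified to stride 1 with no padding:

```c
#include <assert.h>

/* Toy dimensions, made up for illustration. */
enum { N = 1,             /* input feature maps (batch)        */
       M = 2,             /* weight filters                    */
       C = 2,             /* channels                          */
       H = 4,             /* ifmap height and width            */
       R = 3,             /* filter height and width           */
       E = H - R + 1 };   /* ofmap size (no padding, stride 1) */

/* The seven-loop nest from the slide as a plain C function. */
static void conv_layer(float I[N][C][H][H],
                       float W[M][C][R][R],
                       float O[N][M][E][E])
{
    for (int n = 0; n < N; n++)               /* input feature maps */
      for (int m = 0; m < M; m++)             /* weight filters     */
        for (int c = 0; c < C; c++)           /* channels           */
          for (int y = 0; y < E; y++)         /* output row         */
            for (int x = 0; x < E; x++)       /* output column      */
              for (int j = 0; j < R; j++)     /* filter row         */
                for (int i = 0; i < R; i++)   /* filter column      */
                  O[n][m][y][x] += W[m][c][j][i] * I[n][c][y + j][x + i];
}
```

With all inputs and weights set to 1.0, every output element equals the number of MACs that feed it, C * R * R; this is the multiply-accumulate volume that the spatial PE array parallelizes.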
5
Emergence of DNN Accelerators
• Spatial DNN Accelerator ASIC Architecture

DaDianNao (MICRO 2014): 256 PEs (16 in each tile)
Eyeriss (ISSCC 2016): 168 PEs

*PE: processing element
6
Emergence of DNN Accelerators
• Spatial DNN Accelerator ASIC Architecture

[Figure: DRAM feeding a global buffer/memory (GBM), connected through the NoC to a 2D PE array]

Multi-Bus: Eyeriss
Mesh: DianNao, DaDianNao
Crossbar+Mesh: TrueNorth

Spatial processing over PEs
7
Challenges with Traditional NoCs
• Relative Area Overhead Compared to Tiny PEs

[Figure: area vs. throughput of a bus, a crossbar switch, and a mesh router, drawn next to an Eyeriss PE for scale; square size = total NoC area divided by the number of PEs (256 PEs)]
8
Challenges with Traditional NoCs
• Bandwidth

[Simulation results (RS dataflow) on AlexNet conv. layers]

• Serialized broad-/multicasting
• Bandwidth bottleneck at the top level

Bus provides low bandwidth for DNN traffic
9
Challenges with Traditional NoCs
• Dataflow Style Processing over Spatial PEs

[Figure: PE arrays of a systolic array (TPU) and Eyeriss, with data flowing from PE to PE]

• No way to hide the latency
• Traffic is different from that of CMPs and MPSoCs
10
Challenges with Traditional NoCs
• Unique Traffic Patterns

[Figure: a CMP (cores sharing a NoC), an MPSoC (cores, GPU, sensor, and comm blocks), and a DNN accelerator (GBM feeding a PE array)]

• CMPs: dynamic all-to-all traffic
• MPSoCs: static, fixed traffic
• DNN accelerators: ?
11
Traffic Patterns in DNN Accelerators
• Scatter

[Figure: GBM sending to every PE (one-to-all) and to a subset of PEs (one-to-many)]

One-to-all and one-to-many
E.g., filter weight and/or input feature map distribution
12
Traffic Patterns in DNN Accelerators
• Gather

[Figure: every PE (all-to-one) or a subset of PEs (many-to-one) sending to the GBM]

All-to-one and many-to-one
E.g., partial sum gathering
13
Traffic Patterns in DNN Accelerators
• Local

[Figure: PE-to-PE transfers within the array, bypassing the GBM]

Many one-to-one transfers
E.g., partial sum (psum) accumulation

Key optimization: removes traffic between the GBM and the PE array and maximizes data reuse within the PE array
14
Why Not Traditional NoCs
• Unique Traffic Patterns

[Figure: same comparison as before, now with the DNN accelerator column filled in]

• CMPs: dynamic all-to-all traffic
• MPSoCs: static, fixed traffic
• DNN accelerators: scatter, gather, and local traffic
15
Requirements for NoCs in DNN Accelerators

• Requirements
• High throughput: Many PEs
• Area/power efficiency: Tiny PEs
• Low latency: No latency hiding
• Reconfigurability: Diverse neural network dimensions

• Optimization Opportunity
• Three traffic patterns: specialization for each traffic pattern

16
Outline
• Motivations
• Microswitch Network
• Topology
• Routing
• Microswitch
• Network Reconfiguration
• Flow control
• Evaluations
• Latency (throughput)
• Area
• Power
• Energy
• Conclusion
17
Topology: Microswitch Network

[Figure: microswitch network with top, middle, and bottom switches arranged across levels Lv 0–Lv 3]

• Distribute communication to tiny switches


18
Routing: Scatter Traffic

• Tree-based broad/multicasting
19
Routing: Gather Traffic

• Multiple pipelined linear networks


• Bandwidth bound to GBM write bandwidth
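As a back-of-envelope model (my own sketch, not from the slides): if each PE injects one partial sum and the GBM absorbs at most one result per cycle, a pipelined linear gather drains in roughly num_pes + depth - 1 cycles, so steady-state throughput is pinned at the GBM write bandwidth:

```c
/* Toy latency model for a pipelined linear gather network.
   One psum per PE, one result absorbed per cycle by the GBM (its write
   bandwidth): after the pipeline fills (pipeline_depth cycles to first
   result), results stream out back-to-back, one per cycle. */
unsigned gather_cycles(unsigned num_pes, unsigned pipeline_depth)
{
    if (num_pes == 0)
        return 0;
    return num_pes + pipeline_depth - 1;
}
```

For example, 16 PEs behind a 4-stage gather pipeline would need about 19 cycles regardless of how wide the PE array is, which is why the slide calls the GBM write bandwidth the bound.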
20
Routing: Local Traffic

• Linear single-cycle multi-hop (SMART*) network


H. Kwon et al., OpenSMART: Single-Cycle Multi-Hop NoC Generator in BSV and Chisel, ISPASS 2017
T. Krishna et al., Breaking the On-Chip Latency Barrier Using SMART, HPCA 2013
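A rough way to see the benefit of the SMART-style local network (a sketch of mine, not a formula from the papers): if a flit can bypass up to hpc_max intermediate microswitches per cycle, an h-hop traversal costs about ceil(h / hpc_max) cycles instead of h:

```c
/* Toy latency model for a single-cycle multi-hop (SMART-style) link.
   hpc_max: maximum hops the repeated wires can cover in one cycle
   (an assumed design parameter). */
unsigned smart_cycles(unsigned hops, unsigned hpc_max)
{
    if (hops == 0)
        return 0;
    return (hops + hpc_max - 1) / hpc_max;  /* ceil(hops / hpc_max) */
}
```

So an 8-hop local transfer that would take 8 cycles hop-by-hop takes only 2 cycles when hpc_max = 4.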
21
Microarchitecture: Microswitches

22
Microswitches: Top Switch

[Figure: top microswitch datapath for scatter, gather, and local traffic; the scatter path is only required for switches connected to the global buffer, and the gather arbitration only for switches with multiple gather inputs]
23
Microswitches: Middle Switch

[Figure: middle microswitch datapath for scatter, gather, and local traffic; the scatter path is only required for switches in the scatter tree]
24
Microswitches: Bottom Switch

[Figure: bottom microswitch datapath for scatter, gather, and local traffic]
25
Topology: Microswitch Network

[Figure: microswitch network topology (repeated): top, middle, and bottom switches across levels Lv 0–Lv 3]

26
Outline
• Motivations
• Microswitch Network
• Topology
• Routing
• Microswitch
• Network Reconfiguration
• Flow control
• Evaluations
• Latency (throughput)
• Area
• Power
• Energy
• Conclusion
27
Scatter Network Reconfiguration
• Control Registers

[Figure: En_Up and En_Down control registers on each microswitch]
28
Scatter Network Reconfiguration
• Reconfiguration logic
[Figure: En_Up/En_Down settings propagated through the scatter tree]

Recursively check for destination PEs in the upper/lower subtrees
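The recursive check can be sketched in C as follows. This is a hypothetical software model: the destination bitmask encoding, the heap-style switch indexing, and the 8-PE tree in the example are all my own assumptions, not the paper's implementation:

```c
#include <stdint.h>

/* Per-switch enable registers, as on the slide. */
typedef struct { int en_up; int en_down; } sw_cfg_t;

/* Bitmask with bits lo..hi set (assumes hi < 32). */
static uint32_t range_mask(int lo, int hi)
{
    uint32_t width = (uint32_t)(hi - lo + 1);
    uint32_t m = (width >= 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return m << lo;
}

/* Recursively configure the scatter tree: the switch at heap index
   'node' covers PEs [lo, hi]; enable its upper (lower) output iff some
   destination PE falls in the upper (lower) half, then recurse only
   into enabled subtrees. */
static void configure_scatter(uint32_t dests, int lo, int hi,
                              sw_cfg_t *cfg, int node)
{
    if (lo >= hi)                 /* leaf: a single PE, nothing to gate */
        return;
    int mid = (lo + hi) / 2;
    cfg[node].en_up   = (dests & range_mask(lo, mid))     != 0;
    cfg[node].en_down = (dests & range_mask(mid + 1, hi)) != 0;
    if (cfg[node].en_up)
        configure_scatter(dests, lo, mid, cfg, 2 * node + 1);
    if (cfg[node].en_down)
        configure_scatter(dests, mid + 1, hi, cfg, 2 * node + 2);
}
```

Disabled subtrees are never visited, which mirrors how the scatter tree only toggles switches that actually lie on a path to a destination PE.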
29
Scatter Network Reconfiguration
• Reconfiguration logic

30
Local Network: Linear SMART
• Dynamic traffic control

• Static traffic control

H. Kwon et al., OpenSMART: Single-Cycle Multi-Hop NoC Generator in BSV and Chisel, ISPASS 2017
T. Krishna et al., Breaking the On-Chip Latency Barrier Using SMART, HPCA 2013
31
Reconfiguration Policy
• Scatter Tree
– Coarse-grained: Epoch-by-epoch
– Fine-grained: cycle-by-cycle for each data transfer
• Gather
– No reconfiguration (flow control-based)
• Local
– Static: Compiler-based
– Dynamic: Traffic-based

* Accelerator Dependent

32
Flow Control
• Scatter Network
• On/Off flow control

• Gather Network
• On/Off flow control between microswitches

• Local Network
• Dynamic flow control: Global arbiter-based control
• Static flow control: SMART* flow control

* SMART flow control:
H. Kwon et al., OpenSMART: Single-Cycle Multi-Hop NoC Generator in BSV and Chisel, ISPASS 2017
T. Krishna et al., Breaking the On-Chip Latency Barrier Using SMART, HPCA 2013

33
Flow Control
• Why not credit-based flow control?
• Tiny microswitches: low latency for on/off signal delivery
• Removes the overhead of credit registers

[Figure: short wires between adjacent microswitches; no credit registers needed]
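A minimal sketch of the on/off handshake is below. The buffer depth and threshold are illustrative values I chose, not parameters from the paper:

```c
#include <stdbool.h>

/* On/off flow control between two adjacent microswitches.  Because the
   wire between tiny switches is short, the 'on' signal arrives almost
   immediately, so a small threshold suffices and no credit counters
   are needed on the sender side. */
enum { BUF_DEPTH = 4,        /* receiver buffer slots (illustrative)   */
       OFF_THRESHOLD = 2 };  /* deassert 'on' at this occupancy; the
                                slack below BUF_DEPTH absorbs flits
                                already in flight */

typedef struct { int occupancy; } rx_buf_t;

/* Receiver asserts 'on' while occupancy is below the threshold. */
bool rx_on(const rx_buf_t *rx)
{
    return rx->occupancy < OFF_THRESHOLD;
}

/* Sender transmits only while the receiver asserts 'on'. */
bool try_send(rx_buf_t *rx)
{
    if (!rx_on(rx))
        return false;
    rx->occupancy++;          /* flit lands in the receiver buffer */
    return true;
}

/* Receiver drains one flit downstream, freeing a slot. */
void rx_drain(rx_buf_t *rx)
{
    if (rx->occupancy > 0)
        rx->occupancy--;
}
```

Credit-based flow control would instead track free slots with a counter per link; with switches this small and wires this short, the single on/off wire is cheaper and reacts quickly enough.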

34
Outline
• Motivations
• Microswitch Network
• Topology
• Microswitch
• Routing
• Network Reconfiguration
• Flow control
• Evaluations
• Latency/throughput
• Area
• Power
• Energy
• Conclusion
35
Evaluation Environment
Target Neural Network: AlexNet
Implementation: RTL written in Bluespec System Verilog (BSV)
Accelerator Dataflow: weight-stationary (no local traffic) and row-stationary (exploits local traffic)*
Latency Measurement: RTL simulation of the BSV implementation using Bluesim
Synthesis Tool: Synopsys Design Compiler
Standard Cell Library: NanGate 15nm PDK
Baseline NoCs: Bus, Tree, Crossbar, Mesh, and H-Mesh
PE Delay: 1 cycle

* Y. Chen et al., "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," ISCA 2016
36
Evaluations
• Latency (all AlexNet convolutional layers)

Microswitch NoC reduces latency by 61% compared to a mesh


37
Evaluations
• Area

Microswitch NoC requires only 16% of the mesh's area


38
Evaluations
• Power

Microswitch NoC consumes only 12% of the mesh's power


39
Evaluations
• Energy

Buses always need to broadcast, even for unicast traffic


Microswitch NoC enables only the necessary links
40
Conclusion
• Traditional NoCs are not optimal for the traffic in spatial accelerators because they are tailored to the random cache-coherence traffic of CMPs

• Microswitch NoC is a scalable solution for all four goals (latency, throughput, area, and energy), while traditional NoCs achieve only one or two of them

• Microswitch NoC also provides reconfigurability so that it can support the dynamism across neural network layers

41
Conclusion
• Microswitch NoC is applicable to any spatial accelerator (e.g., for cryptography or graph processing)

• Microswitch NoC will be available as open source.


• Please sign up via this link
• [Link]
• For general purpose NoC, openSMART is available
[Link]

Thank you!
42
