Rethinking NoCs for
Spatial Neural Network Accelerators
Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna
Georgia Institute of Technology
Synergy Lab ([Link])
NOCS 2017
Oct 20, 2017
Emergence of DNN Accelerators
• Emerging DNN applications
Emergence of DNN Accelerators
• Convolutional Neural Network (CNN)
[Figure: a CNN pipeline of convolutional layers (feature extraction) producing intermediate features, followed by pooling and fully connected layers that summarize the features into a classification (e.g., "Palace").]
Emergence of DNN Accelerators
• Computation in Convolutional Layers
• Sliding window operation over input featuremaps
Image source: Y. Chen et al., Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional
Neural Networks, ISCA 2016
Emergence of DNN Accelerators
• Massive Parallelism in Convolutional Layers
for(n=0; n<N; n++) {              // Input feature maps (IFMaps)
 for(m=0; m<M; m++) {             // Weight filters
  for(c=0; c<C; c++) {            // IFMap/weight channels
   for(y=0; y<E; y++) {           // Output feature map row (E = H-R+1)
    for(x=0; x<E; x++) {          // Output feature map column
     for(j=0; j<R; j++) {         // Weight filter row
      for(i=0; i<R; i++) {        // Weight filter column
       O[n][m][y][x] += W[m][c][j][i] * I[n][c][y+j][x+i]; }}}}}}}
Each innermost iteration performs one multiplication and one accumulation
Emergence of DNN Accelerators
• Spatial DNN Accelerator ASIC Architecture
DaDianNao (MICRO 2014): 256 PEs (16 in each tile)
Eyeriss (ISSCC 2016): 168 PEs
*PE: processing element
Emergence of DNN Accelerators
• Spatial DNN Accelerator ASIC Architecture
[Figure: DRAM and a Global Memory (GBM) connected through an NoC to a 2D PE array.]
Multi-bus: Eyeriss
Mesh: DianNao, DaDianNao
Crossbar + mesh: TrueNorth
Spatial processing over PEs
Challenges with Traditional NoCs
• Relative Area Overhead Compared to Tiny PEs
[Figure: relative area vs. throughput of NoC options (bus, crossbar switch, mesh) compared to a tiny Eyeriss PE; the size of each NoC square is its total area divided by the number of PEs (256 PEs).]
Challenges with Traditional NoCs
• Bandwidth
AlexNet convolutional layer simulation results (RS: row-stationary dataflow)
Serialized broadcast/multicast; bandwidth bottleneck at the top level
A bus provides low bandwidth for DNN traffic
Challenges with Traditional NoCs
• Dataflow Style Processing over Spatial PEs
[Figure: dataflow-style processing over spatial PE arrays: systolic array (TPU) and Eyeriss.]
There is no way to hide the latency
This traffic is different from that of CMPs and MPSoCs
Challenges with Traditional NoCs
• Unique Traffic Patterns
[Figure: a CMP (cores on an NoC), an MPSoC (GPU, sensor, and communication blocks), and a DNN accelerator (GBM feeding a PE array over an NoC).]
• CMPs: dynamic all-to-all traffic
• MPSoCs: static fixed traffic
• DNN accelerators: ?
Traffic Patterns in DNN Accelerators
• Scatter
[Figure: the GBM scattering data through the NoC to the PE array: one-to-all (broadcast) and one-to-many (multicast).]
E.g., filter weight and/or input feature map distribution
Traffic Patterns in DNN Accelerators
• Gather
[Figure: PEs sending data through the NoC to the GBM: all-to-one and many-to-one.]
E.g., partial sum gathering
Traffic Patterns in DNN Accelerators
• Local
[Figure: many one-to-one transfers between PEs within the array.]
Key optimization to remove traffic between the GBM and the PE array and maximize data reuse in the PE array
E.g., psum accumulation
Why Not Traditional NoCs
• Unique Traffic Patterns
[Figure: a CMP (cores), an MPSoC (GPU, sensor, and communication blocks), and a DNN accelerator (GBM feeding a PE array over an NoC).]
• CMPs: dynamic all-to-all traffic
• MPSoCs: static fixed traffic
• DNN accelerators: scatter, gather, and local traffic
Requirements for NoCs in DNN Accelerators
• Requirements
• High throughput: Many PEs
• Area/power efficiency: Tiny PEs
• Low latency: No latency hiding
• Reconfigurability: Diverse neural network dimensions
• Optimization Opportunity
• Three traffic patterns: specialization for each traffic pattern
Outline
• Motivations
• Microswitch Network
• Topology
• Routing
• Microswitch
• Network Reconfiguration
• Flow control
• Evaluations
• Latency (throughput)
• Area
• Power
• Energy
• Conclusion
Topology: Microswitch Network
Top Switch
Middle Switch
Bottom Switch
Lv 0 Lv 1 Lv 2 Lv 3
• Distribute communication to tiny switches
Routing: Scatter Traffic
• Tree-based broadcast/multicast
Routing: Gather Traffic
• Multiple pipelined linear networks
• Bandwidth is bounded by the GBM write bandwidth
Routing: Local Traffic
• Linear single-cycle multi-hop (SMART*) network
H. Kwon et al., OpenSMART: Single-cycle Multi-hop NoC Generator in BSV and Chisel, ISPASS 2017
T. Krishna et al., Breaking the On-Chip Latency Barrier Using SMART, HPCA 2013
Microarchitecture: Microswitches
Microswitches: Top Switch
Scatter traffic (only required for switches connected to the global buffer)
Gather traffic (only required if a switch has multiple gather inputs)
Local traffic
Microswitches: Middle Switch
Scatter traffic (only required for switches in the scatter tree)
Gather traffic
Local traffic
Microswitches: Bottom Switch
Scatter Traffic
Gather Traffic
Local Traffic
Scatter Network Reconfiguration
• Control registers: En_Up, En_Down
Scatter Network Reconfiguration
• Reconfiguration logic: recursively check destination PEs in the upper/lower subtrees to set En_Up/En_Down
Local Network: Linear SMART
• Dynamic traffic control
• Static traffic control
H. Kwon et al., OpenSMART: Single-cycle Multi-hop NoC Generator in BSV and Chisel, ISPASS 2017
T. Krishna et al., Breaking the On-Chip Latency Barrier Using SMART, HPCA 2013
Reconfiguration Policy
• Scatter Tree
– Coarse-grained: Epoch-by-epoch
– Fine-grained: Cycle-by-cycle for each data
• Gather
– No reconfiguration (flow control-based)
• Local
– Static: Compiler-based
– Dynamic: Traffic-based
* Accelerator-dependent
Flow Control
• Scatter Network
• On/Off flow control
• Gather Network
• On/Off flow control between microswitches
• Local Network
• Dynamic flow control: Global arbiter-based control
• Static flow control: SMART* flow control
* SMART flow control:
H. Kwon et al., OpenSMART: Single-cycle Multi-hop NoC Generator in BSV and Chisel, ISPASS 2017
T. Krishna et al., Breaking the On-Chip Latency Barrier Using SMART, HPCA 2013
Flow Control
• Why not credit-based flow control?
• Tiny microswitches: low latency for on/off signal delivery
• Removes the overhead of credit registers
[Figure: adjacent microswitches are a short distance apart, so no credit registers are needed.]
Evaluation Environment
Target Neural Network: AlexNet
Implementation: RTL written in Bluespec System Verilog (BSV)
Accelerator Dataflow: weight-stationary (no local traffic) and row-stationary (exploits local traffic)*
Latency Measurement: RTL simulation of the BSV implementation using Bluesim
Synthesis Tool: Synopsys Design Compiler
Standard Cell Library: NanGate 15nm PDK
Baseline NoCs: Bus, Tree, Crossbar, Mesh, and H-Mesh
PE Delay: 1 cycle
* Y. Chen et al., "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," ISCA 2016
Evaluations
• Latency (entire AlexNet convolutional layers)
The microswitch NoC reduces latency by 61% compared to the mesh
Evaluations
• Area
The microswitch NoC requires only 16% of the mesh's area
Evaluations
• Power
The microswitch NoC consumes only 12% of the mesh's power
Evaluations
• Energy
Buses always broadcast, even for unicast traffic
The microswitch NoC enables only the necessary links
Conclusion
• Traditional NoCs are not optimal for the traffic in spatial accelerators because they are tailored for the random cache-coherence traffic of CMPs
• The microswitch NoC is a scalable solution for four goals (latency, throughput, area, and energy), while traditional NoCs achieve only one or two of them
• The microswitch NoC also provides reconfigurability to support the dynamism across neural network layers
Conclusion
• Microswitch NoC is applicable to any spatial
accelerator (e.g., cryptography, graph)
• Microswitch NoC will be available as open source.
• Please sign up via this link
• [Link]
• For general-purpose NoCs, OpenSMART is available
[Link]
Thank you!