Communication Architectures for SoC
COMPUTER ENGINEERING
Edited by José L. Ayala
K11940
ISBN: 978-1-4398-4170-9
CRC Press, an informa business
6000 Broken Sound Parkway, NW, Suite 300, Boca Raton, FL 33487
270 Madison Avenue, New York, NY 10016
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN, UK
www.crcpress.com
COMMUNICATION ARCHITECTURES FOR SYSTEMS-ON-CHIP
Embedded Systems
Series Editor
Richard Zurawski
ISA Corporation, San Francisco, California, USA
Edited by
José L. Ayala
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information stor-
age or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copy-
right.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222
Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that pro-
vides licenses and registration for a variety of users. For organizations that have been granted a pho-
tocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are
used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Dedicated to the one who teaches me every day,
to the one who makes me think,
to the one who smiles at me and supports me,
to the one who inspires me and fills me with admiration.
Dedicated to you.
Contents
List of Figures xv
Preface xxv
Acknowledgments xxvii
Author xxix
1 Introduction 1
José L. Ayala
1.1 Today’s Market for Systems-on-Chip . . . . . . . . . . . . . 1
1.2 Basics of the System-on-Chip Design . . . . . . . . . . . . . 3
1.2.1 Main Characteristics of the Design Flow . . . . . . . . 4
1.2.1.1 Interoperability . . . . . . . . . . . . . . . . 4
1.2.2 Reliability . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.3 Power Consumption . . . . . . . . . . . . . . . . . . . 5
1.2.4 Security . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.5 Limitations of the Current Engineering Practices for
System-on-Chip Design . . . . . . . . . . . . . . . . . 7
1.3 Open Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5 Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.6 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 NoC Architectures 83
Martino Ruggiero
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.2 Advantages of the NoC Paradigm . . . . . . . . . . . . . . . 85
3.3 Challenges of the NoC Paradigm . . . . . . . . . . . . . . . . 87
3.4 Principles of NoC Architecture . . . . . . . . . . . . . . . . . 88
3.4.1 Topology . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.4.2 Routing . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.4.2.1 Oblivious Routing Algorithms . . . . . . . . 94
3.4.2.2 Deterministic Routing Algorithms . . . . . . 97
3.4.2.3 Adaptive Routing Algorithms . . . . . . . . 99
3.4.2.4 Problems on Routing . . . . . . . . . . . . . 101
3.4.3 Flow Control . . . . . . . . . . . . . . . . . . . . . . . 101
3.4.3.1 Message-Based . . . . . . . . . . . . . . . . . 102
3.4.3.2 Packet-Based . . . . . . . . . . . . . . . . . . 103
3.4.3.3 Flit-Based . . . . . . . . . . . . . . . . . . . 103
3.5 Basic Building Blocks of a NoC . . . . . . . . . . . . . . . . 104
3.5.1 Router . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
3.5.1.1 Virtual Channels . . . . . . . . . . . . . . . . 105
3.5.2 Network Interface . . . . . . . . . . . . . . . . . . . . 105
3.5.3 Link . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
3.6 Available NoC Implementations and Solutions . . . . . . . . 106
3.6.1 IBM Cell . . . . . . . . . . . . . . . . . . . . . . . . . 106
3.6.1.1 Element Interconnect Bus . . . . . . . . . . . 107
3.6.2 Intel TeraFLOPS . . . . . . . . . . . . . . . . . . . . . 108
3.6.2.1 TeraFLOPS Network . . . . . . . . . . . . . 109
3.6.3 RAW Processor . . . . . . . . . . . . . . . . . . . . . . 109
3.6.4 Tilera Architectures . . . . . . . . . . . . . . . . . . . 110
3.6.4.1 iMesh . . . . . . . . . . . . . . . . . . . . . . 112
3.6.5 Intel Single-Chip Cloud Computer . . . . . . . . . . . 112
3.6.6 ST Microelectronics STNoC . . . . . . . . . . . . . . . 114
3.6.7 Xpipes . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Index 403
List of Figures
8.3 CPA attack results of the Fastcore AES chip. Figure from [44].
Used with permission. . . . . . . . . . . . . . . . . . . . . . . 339
8.4 Precharge wave generation in WDDL. . . . . . . . . . . . . . 350
8.5 State machine for the dual spacer protocol. . . . . . . . . . . 354
8.6 Timing diagram for the TDPL inverter. . . . . . . . . . . . . 356
8.7 Basic components of MDPL. . . . . . . . . . . . . . . . . . . 359
8.8 Architecture of a MDPL-based cryptographic circuit. . . . . . 359
8.9 Random Switching Logic example. . . . . . . . . . . . . . . . 361
8.10 Interwire capacitance model. . . . . . . . . . . . . . . . . . . 364
8.11 Basic components of iMDPL. . . . . . . . . . . . . . . . . . . 366
8.12 Basic components of PMRML. . . . . . . . . . . . . . . . . . 368
8.13 Overview of a SORU2 system. . . . . . . . . . . . . . . . . . . 375
8.14 SORU2 datapath. . . . . . . . . . . . . . . . . . . . . . . . . . 375
List of Tables
Preface
Acknowledgments
Author
1
Introduction
José L. Ayala
Complutense University of Madrid, Spain
CONTENTS
1.1 Today’s Market for Systems-on-Chip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Basics of the System-on-Chip Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Main Characteristics of the Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1.1 Interoperability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 Reliability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.3 Power Consumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.4 Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.5 Limitations of the Current Engineering Practices for System-on-
Chip Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Open Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5 Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.6 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
these devices in air bags, Anti-lock Braking Systems (ABS), Electronic Stability
Programs (ESP), or engine control. However, those sectors that have benefited
the most from the development of the embedded systems are communica-
tions and multimedia. The former includes mobile phones, network routers,
modems, and software radio; while the latter includes cable and satellite TV
decoders, HDTV, DVD players, and video games. According to In-Stat/MDR,
the market for smart appliances in the digital home experienced a 70% compound
annual growth rate from 2001 to 2006 [19]. Moving forward, a Gartner Market
Report predicted that the $500 million SoC market of 2005 would grow by over 80%
by 2010 [10]. This annual growth rate is about twice that of general-purpose
microprocessors [3].
The well-organized integration of multiple disciplines, such as device
modeling, system design, and application development, has driven the
success of system-on-a-chip (SoC) design. Continuous advances in integration
technologies have also contributed, enabling the rapid development of Very
Large Scale Integration (VLSI) circuits from the device perspective through the
integration of billions of transistors on a single chip (see Figure 1.1) [12, 13, 17].
Therefore, a modern SoC can have billions of transistors, supporting a wide
range of complex functions.
FIGURE 1.1
Transistors per die [1].
FIGURE 1.2
Design constraints in modern system-on-chip design [1].
1.2.1.1 Interoperability
Today’s systems are increasingly required to be directly connected. In the
past, humans linked the different systems together, applying intelligence and
adaptation where needed. In the electronic age, the adaptation must take
place in the systems themselves.
However, connectivity alone does not justify the integration cost of
complex systems; true added value is created only if systems interoperate
efficiently.
An additional challenge is the many dimensions in which interoperability
is required: applications, languages, vendors, networks, standards, and
releases.
1.2.2 Reliability
The amount of software (and technology) in products is increasing exponen-
tially. Many other products show the same growth in the amount of integrated
software: mobile phones, portable multimedia devices, etc.
These software parts are far from error-free. Studies of the density of errors
in actual code show that 1,000 lines of code (LOC) typically contain 3 errors.
With very good software processes and mature organizations this figure drops
to 1 to 2 errors per KLOC; in poor software designs, it can be much worse.
Incremental growth of the code size therefore also increases the number of
hidden errors in a system exponentially, as shown in Figure 1.3.
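The figures above lend themselves to a quick back-of-the-envelope estimate. A minimal sketch (illustrative, not from the book), assuming latent errors scale linearly with code size at the quoted densities:

```python
# Illustrative sketch: latent-error estimate from the densities quoted in
# the text (~3 errors per KLOC typically, 1-2 per KLOC for mature
# organizations). The helper name is ours, not the book's.

def hidden_errors(kloc, errors_per_kloc=3.0):
    """Rough expected number of latent errors in a code base of `kloc`
    thousand lines, assuming a uniform error density."""
    return kloc * errors_per_kloc

# A code base that doubles every generation doubles its hidden errors too,
# which is the exponential growth Figure 1.3 refers to:
sizes_kloc = [100, 200, 400, 800]
print([hidden_errors(s) for s in sizes_kloc])  # -> [300.0, 600.0, 1200.0, 2400.0]
```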
FIGURE 1.3
Reliability as opposed to software complexity [14].
FIGURE 1.4
Power consumption and its relation with performance [14].
1.2.4 Security
Several stakeholders have significant different security interests. Figure 1.5
shows three categories with different interests and security solutions:
2. Consumers, who want to maintain privacy and, at the same time, us-
ability of services
3. The content industry, who wants to get fair payment for content creation
and distribution. Their solution is again very restrictive, even violating
the right of private copies, and characterized by a paranoia attitude:
every customer is assumed to be a criminal pirate.
FIGURE 1.5
Conflicting interests in security [14].
1.4 Conclusions
Systems-on-chip are a main component of today’s market of electronic sys-
tems. The advances in integration technologies and software techniques have
allowed their widespread use and application to the most diverse domains.
Moreover, the successful integration of the hardware and software parts is key
for the development of these systems.
The design flow of systems-on-chip is characterized by the need for models
and abstractions at different levels, and for a well-established theory that
allows their efficient implementation, validation, and test. In particular, we
need a mathematical basis for system modeling and analysis that integrates
both abstract-machine models and transfer-function models in order
1.5 Glossary
AADL: Architecture Analysis and Design Language
HDTV: High-Definition TV
LOC: Line of Code
SoC: System-on-Chip
1.6 Bibliography
[1] Y-K Chen and S. Y. Kung. Trend and challenge on system-on-a-chip
designs. J. Signal Process. Syst., 53(1-2):217–229, 2008.
[2] P.H. Feiler, B. Lewis, and S. Vestal. The SAE architecture analysis and
design language (AADL) standard: A basis for model-based architecture-
driven embedded systems engineering. In RTAS Workshop on Model-
driven Embedded Systems, 2003.
[4] R. Gupta. EIC message: The neglected community. IEEE Design and
Test of Computers, 19:3, 2002.
[10] B. Lewis. SoC market is set for years of growth in the mainstream.
Technical report, Gartner Market Report, October 2005.
[11] J. Magee and J. Kramer. Concurrency: State Models and Java Programs.
John Wiley & Sons, 2000.
[17] R. R. Schaller. Moore’s law: Past, present, and future. IEEE Spectrum,
34(6):52–59, 1997.
[19] Microwave Journal Staff. Smart appliances: Bringing the digital home
closer to reality. Microwave Journal, 45(11):45, 2002.
2
Communication Buses for SoC Architectures
José L. Ayala
Complutense University of Madrid, Spain
CONTENTS
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.1 Current Research Trends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.2 Modeling and Exploring the Design Space of On-Chip Communi-
cation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.3 Automatic Synthesis of Communication Architectures . . . . . . . . . . . 18
2.2 The AMBA Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.1 AMBA 4 AXI Interface: AXI4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.1.1 AXI4-Lite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.1.2 AXI4-Stream . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.2 AMBA 3 AHB Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.2.1 AMBA 3 AHB-Lite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.2.2 Multilayer AHB Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.3 AMBA 3 APB Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.4 AMBA 3 ATB Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2.5 AMBA Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.5.1 ARM’s CoreLink . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.5.2 ARM’s CoreSight . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.5.3 Third-Party Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3 Sonics SMART Interconnects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3.1 SNAP: Sonics Network for AMBA Protocol . . . . . . . . . . . . . . . . . . . . . . 32
2.3.2 SonicsSX Interconnect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3.3 SonicsLX Interconnect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3.4 SonicsMX Interconnect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.5 S3220 Interconnect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.6 Sonics Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3.6.1 SonicsStudio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3.6.2 SNAP Capture Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.4 CoreConnect Bus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.4.1 Processor Local Bus (PLB) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.4.1.1 PLB6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.4.2 On-Chip Peripheral Bus (OPB) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.4.3 Device Control Register Bus (DCR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.4.4 Bridges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.4.4.1 PLB-OPB Bridges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.4.4.2 AHB Bridges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.4.5 Coreconnect Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.1 Introduction
With powerful embedded devices populating the consumer electronics market,
Multiprocessor Systems-on-Chip (MPSoCs) are the most promising way to
keep on exploiting the high level of integration provided by the semiconductor
technology while, at the same time, satisfying the constraints imposed by the
embedded system market in terms of performance and power consumption.
A modern MPSoC usually integrates hundreds of processing units and stor-
age elements in a single chip to provide both processing power and flexibility.
To let these elements communicate efficiently, designers must define an
interconnect technology and architecture. This is a key decision, since it
affects both system performance and power consumption.
This chapter focuses on state-of-the-art SoC communication architectures,
providing an overview of the most relevant system interconnects. Open bus
specifications such as Open Core Protocol International Partnership Associ-
ation (OCP-IP), Advanced Microcontroller Bus Architecture (AMBA), and
CoreConnect will be described in detail, as well as the Software (SW) tools
cation. This allows simple control register interfaces, thereby reducing SoC
wiring congestion that facilitates implementation. The AXI4-Stream protocol
provides a streaming interface for non-address-based, point-to-point commu-
nication like video and audio data.
The AMBA 4 specification has been written with contributions from 35
companies, including Original Equipment Manufacturers (OEMs) and
semiconductor and Electronic Design Automation (EDA) vendors. Some of the
early adopters of these new specifications include Arteris, Cadence, Mentor,
Sonics, Synopsys, and Xilinx.
AMBA 4 Phase Two, the next version of the AMBA specification, has
already been announced and will focus on simplifying multicore SoC de-
sign. It aims to further maximize performance and power efficiency by
reducing traffic to the external memory and adding hardware support for
cache coherency and message-ordering barriers. By putting more capabilities
in hardware, AMBA 4 will greatly simplify the software programmer’s view
of multicore systems.
Through the following sections we will briefly review the interface protocols
proposed by AMBA. They can be summarized as:
• AMBA AXI4 Interface, which targets high-performance, high-frequency
implementations. This burst-based interface provides maximum inter-
connect flexibility and yields the highest performance.
AMBA AXI4-Lite Interface, a subset of the full AXI4 specifica-
tion for simple control register interfaces, reducing SoC wiring conges-
tion and simplifying implementation.
AMBA AXI4-Stream Interface, provides a streaming interface for
non-address-based, point-to-point communication, such as video and au-
dio data.
• AMBA 3 AHB Interface, which enables highly efficient interconnect be-
tween simpler peripherals in a single-frequency subsystem where the
performance of AMBA 3 AXI is not required. There are two flavors:
AMBA 3 AHB-Lite, a subset of the whole AMBA 3 AHB specifi-
cation, for designs that do not need all the AHB features.
Multilayer AHB, a more complex interconnect that solves the main
limitation of AHB by enabling parallel access paths between multiple
masters and slaves in a system.
• AMBA 3 APB Interface, intended for general-purpose low-speed low-
power peripheral devices, allowing isolation of slow data traffic from the
high-performance AMBA interfaces.
• AMBA 3 ATB Interface, provides visibility for debug purposes by adding
tracing data capabilities.
The following sections will describe these protocols in detail.
FIGURE 2.1
AXI write transaction.
Two variations of AXI4 exist, both with a clear focus on FPGA imple-
mentation: AXI4-Lite, a subset of the full AXI4 specification, and the
AXI4-Stream protocol, for streaming communications.
2.2.1.1 AXI4-Lite
For simpler components, AMBA offers the AXI4-Lite version [9], a subset
of the AXI4 protocol designed for communications between components with
control registers. It greatly reduces the number of wires required to implement
the bus. Designed with FPGAs in mind, it can also be used for custom chips.
In this protocol, all transactions have a burst length of one, all data ac-
cesses are the same size as the width of the data bus (32- or 64-bit), and
exclusive accesses are not supported. Thanks to these restrictions, commu-
nication between AXI4 and AXI4-Lite interfaces can be handled by a single
common conversion component.
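The restrictions above can be captured in a few lines. A minimal sketch (field names are our own, not from the AMBA specification) checking whether a transaction is legal under AXI4-Lite:

```python
# Illustrative sketch (not ARM's API): the AXI4-Lite restrictions quoted
# above -- burst length of one, accesses matching the 32- or 64-bit data
# bus width, and no exclusive accesses -- as a transaction check.

def is_axi4_lite_legal(burst_len, size_bits, bus_width_bits, exclusive):
    """True if the transaction satisfies the AXI4-Lite subset rules."""
    return (burst_len == 1
            and size_bits == bus_width_bits
            and bus_width_bits in (32, 64)
            and not exclusive)

print(is_axi4_lite_legal(1, 32, 32, exclusive=False))  # True
print(is_axi4_lite_legal(4, 32, 32, exclusive=False))  # False: bursts > 1
```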
2.2.1.2 AXI4-Stream
This protocol [8], unlike AXI4-Lite, is not a strict subset of AXI4. Designed
also for FPGAs as the main target, it aims at greatly reducing signal routing
for unidirectional data transfers from master to slave. The protocol lets design-
ers stream data from one interface to another without needing an address. It
supports single and multiple data streams using the same set of shared wires.
transfer occurs during the data phase of the previous transfer. This over-
lapping of address and data is fundamental to the pipelined nature of
the bus and allows for high-performance operation, while still providing
adequate time for a slave to provide the response to a transfer. This also
implies that ownership of the data bus is delayed with respect to own-
ership of the address bus. Moreover, support for burst transfers allows
for efficient use of memory interfaces by providing transfer information
in advance.
• Split transactions. They maximize the use of bus bandwidth by enabling
high-latency slaves to release the system bus during dead time while they
complete processing of their access requests.
• Wide data bus configurations. Support for high-bandwidth data-
intensive applications is provided using wide on-chip memories. System
buses support 32-, 64-, and 128-bit data-bus implementations with a
32-bit address bus, as well as smaller byte and half-word designs.
• Non-tristate implementation. AMBA AHB implements a separate read
and write data bus in order to avoid the use of tristate drivers. In par-
ticular, master and slave signals are multiplexed onto the shared com-
munication resources (read and write data buses, address bus, control
signals).
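The overlapped address and data phases described in the first point can be sketched as a simple schedule. This is an illustrative model, not ARM's definition; it assumes back-to-back transfers with zero wait states:

```python
# Illustrative sketch: the two-stage AHB pipeline, in which the address
# phase of each transfer overlaps the data phase of the previous one.

def ahb_pipeline(addresses):
    """Return (cycle, addr_phase, data_phase) tuples for back-to-back
    zero-wait-state transfers on a pipelined bus."""
    schedule = []
    for cycle in range(len(addresses) + 1):
        addr = addresses[cycle] if cycle < len(addresses) else None
        data = addresses[cycle - 1] if cycle > 0 else None
        schedule.append((cycle, addr, data))
    return schedule

# Three transfers complete in 4 cycles instead of the 6 an unpipelined
# bus would need (one address cycle plus one data cycle per transfer):
for cycle, addr, data in ahb_pipeline([0x1000, 0x1004, 0x1008]):
    print(cycle,
          hex(addr) if addr is not None else "-",
          hex(data) if data is not None else "-")
```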
Like the original AMBA AHB system [3], it contains the following compo-
nents:
AHB master: Only one bus master at a time is allowed to initiate and
complete read and write transactions. Bus masters drive the address
and control signals, and the arbiter determines which master has its
signals routed to all the slaves. A central decoder controls the read data
and response signal multiplexor, selecting the appropriate signals from
the slave that has been addressed.
AHB slave: It signals back to the active master the status of the pending
transaction. It can indicate that the transfer completed successfully, that
there was an error, that the master should retry the transfer, or indicate
the beginning of a split transaction.
AHB arbiter: The bus arbiter serializes bus access requests. The arbitra-
tion algorithm is not specified by the standard and its selection is left
as a design parameter (fixed priority, round-robin, latency-driven, etc.),
although the request-grant based arbitration protocol has to be kept
fixed.
AHB decoder: This is used for address decoding and provides the select
signal to the intended slave.
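Since the standard leaves the arbitration algorithm open, its choice is a design parameter. As one illustration (not mandated by AMBA), a round-robin arbiter over the request lines might look like:

```python
# Illustrative sketch: round-robin arbitration, one of the policies the
# text mentions as a possible design choice for the AHB arbiter.

def round_robin_grant(requests, last_granted):
    """Grant the first requesting master after `last_granted`, wrapping
    around; return None if no master is requesting the bus."""
    n = len(requests)
    for offset in range(1, n + 1):
        candidate = (last_granted + offset) % n
        if requests[candidate]:
            return candidate
    return None

# Masters 0 and 2 are requesting; master 0 was granted last, so 2 is next:
print(round_robin_grant([True, False, True], last_granted=0))  # -> 2
```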
There are two variations of the AHB protocol: AHB-Lite, for simpler
devices, and Multilayer AHB, for more complex systems.
FIGURE 2.2
AHB-Lite block diagram.
The main components of the AHB-Lite interface are the bus master, the
bus slaves, a decoder, and a slave-to-master multiplexer, as can be seen in
Figure 2.2.
The master starts a transfer by driving the address and control signals.
These signals provide information about the address, direction, width of the
transfer, and indicate if the transfer forms part of a burst. Transfers can be:
(i) single, (ii) incrementing bursts that do not wrap at address boundaries, or
(iii) wrapping bursts that wrap at particular address boundaries. The write
data bus moves data from the master to a slave, and the read data bus moves
data from a slave (selected by the decoder and the mux) to the master.
Given that AHB-Lite is a single-master bus interface, if a multimaster
system is required, an AHB-Lite multilayer structure is used to isolate the
masters from one another.
FIGURE 2.3
Schematic view of the multilayer AHB interconnect.
• Unpipelined architecture.
FIGURE 2.4
State diagram describing the operation of the AMBA APB bus.
the ACCESS state. The enable signal, PENABLE, is asserted in the ACCESS
state. The bus remains in this state until the access is completed. At this
moment, the PREADY signal is driven HIGH by the slave. Then, if other
transfers are to take place, the bus goes back to the SETUP state, otherwise
to IDLE.
As can be observed, AMBA APB should be used to interface any low-
bandwidth peripherals that do not require the high performance of a
pipelined bus interface. The simplicity of this bus results in a low gate-count
implementation.
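The IDLE/SETUP/ACCESS operation described above can be sketched as a next-state function. This is an illustrative model using the PENABLE/PREADY behavior from the text, not the specification itself:

```python
# Illustrative sketch: the APB state machine described in the text.
# PENABLE is asserted in ACCESS; PREADY (driven by the slave) controls
# when the access completes.

def apb_next_state(state, transfer_pending, pready):
    """Next APB bus state given the current state and control inputs."""
    if state == "IDLE":
        return "SETUP" if transfer_pending else "IDLE"
    if state == "SETUP":
        return "ACCESS"              # one cycle; PENABLE asserted next
    if state == "ACCESS":
        if not pready:
            return "ACCESS"          # slave extends the access
        return "SETUP" if transfer_pending else "IDLE"
    raise ValueError(f"unknown state: {state}")

# A slave holding PREADY low keeps the bus in ACCESS:
print(apb_next_state("ACCESS", transfer_pending=False, pready=False))  # ACCESS
```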
FIGURE 2.5
Schematic view of an AMBA system with trace and debug instrumentation.
• Flushing.
FIGURE 2.6
Snapshot of the tool configuring a CoreLink AMBA Network Interconnect.
Initiator: Implements the interface between the interconnect and the
master core (Central Processing Unit [CPU], Digital Signal Processor
[DSP], Direct Memory Access [DMA] engine...). The initiator receives
requests from the core, transmits them according to the Sonics stan-
dard, and finally processes the responses from the target.
FIGURE 2.7
Typical SNAP Architecture.
FIGURE 2.8
Example of system with a SonicsMX Interconnect including a crossbar and a
shared link.
that off-loads slow transfers from the main system interconnect inside more
complex SoCs.
Providing low-latency access to a large number of low-bandwidth, physi-
cally dispersed target cores, Sonics3220 is fully compatible with IP cores that
support the AMBA and OCP standards, thus providing the ability to decouple
cores and achieve high IP core reuse. Its very low die-area interconnect
structure facilitates a rapid path to simulation.
Like other Sonics SMART Interconnects, the S3220 is a nonblocking periph-
eral interconnect that guarantees end-to-end performance by managing data,
control, and test flows between all connected cores. By eliminating blocking,
Sonics3220 allows multiple transfers to be in flight at the same time without
the need for a multilayered bus architecture, allowing, for example, latency-
sensitive CPU traffic to bypass DMA-based I/O traffic.
FIGURE 2.9
SonicsStudioTM Development Flow.
FIGURE 2.10
SNAP Capture Tool graphical interface.
FIGURE 2.11
Schematic structure of the CoreConnect bus.
• Processor Local Bus (PLB). The PLB bus is a high-performance bus that
connects the processor to high-performance peripherals, such as mem-
ories, DMA controllers, and fast devices. The PLB on-chip bus is used
in highly integrated systems. It supports read and write data transfers
between master and slave devices equipped with a PLB interface and
connected through PLB signals.
2.4.1.1 PLB6
To support coherency in multiple-core designs, IBM released in October
2009 the CoreConnect PLB6 On-Chip System Bus specification, which
includes the following key features:
With each bus agent port including a 128-bit read data bus and a 128-bit
write data bus, the point-to-point maximum data bandwidth is 25.6 GB/s.
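The quoted figure can be sanity-checked with simple arithmetic. A short sketch (the 800 MHz clock is our inference from the numbers, not stated in the excerpt):

```python
# Illustrative sketch: checking the quoted 25.6 GB/s point-to-point figure.
# With a 128-bit read bus plus a 128-bit write bus, each agent port can
# move 32 bytes per cycle; the quoted bandwidth then implies an 800 MHz
# bus clock (our inference, not stated in the text).

bytes_per_cycle = (128 + 128) // 8           # 32 bytes moved per clock
implied_clock_hz = 25.6e9 / bytes_per_cycle  # 800 MHz

print(bytes_per_cycle, implied_clock_hz)     # 32 800000000.0
```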
its DCRs and places another unit’s DCRs on the CPU read path. The DCR
bus consists of a 10-bit address bus and a 32-bit data bus.
This is a synchronous bus, in which slaves may be clocked either faster or
slower than the master, although their clock signals must be synchronized
with the DCR bus clock. Finally, bursts are not supported by this bus.
2.4.4 Bridges
2.4.4.1 PLB-OPB Bridges
PLB masters gain access to the peripherals on the OPB bus through the PLB-
to-OPB bridge macro. The OPB bridge acts as a slave device on the PLB and
a master on the OPB. It supports bursts, word (32-bit), half-word (16-bit),
and byte read and write transfers on the 32-bit OPB data bus, and has the
capability to perform target word first line read accesses. The OPB bridge
performs dynamic bus sizing, allowing devices with different data widths to
efficiently communicate. When the OPB bridge master performs an operation
wider than the selected OPB slave can support, the bridge splits the operation
into two or more smaller transfers. Transactions from the OPB to the PLB
under the direction of OPB masters are also supported through the instantia-
tion of an OPB-to-PLB bridge that works like the PLB-to-OPB bridge with
the roles inverted: it is a slave on the OPB and a master on the PLB.
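The dynamic bus sizing behavior can be sketched as follows (the helper name and interface are illustrative, not IBM's):

```python
# Illustrative sketch: dynamic bus sizing as described above. When the
# bridge issues an operation wider than the selected OPB slave supports,
# it splits the access into two or more narrower transfers.

def split_transfer(address, width_bytes, slave_width_bytes):
    """Break one access into slave-sized beats at increasing addresses."""
    beats = []
    offset = 0
    while offset < width_bytes:
        beat = min(slave_width_bytes, width_bytes - offset)
        beats.append((address + offset, beat))
        offset += beat
    return beats

# A 32-bit (4-byte) word access to a 16-bit (2-byte) slave becomes 2 beats:
print(split_transfer(0x80, 4, 2))  # [(128, 2), (130, 2)]
```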
FIGURE 2.12
Typical IBM CoreConnect architecture with different buses and interconnect
bridges.
for example, for creating bus contention testcases, often necessary to verify a
macro.
2.5 STBus
STBus is an STMicroelectronics proprietary on-chip bus protocol, dedicated
to SoCs designed for high-bandwidth applications such as audio/video
processing [47]. The STBus interfaces and protocols are closely
related to the industry standard VCI (Virtual Component Interface). The
components interconnected by an STBus are either initiators (which initiate
transactions on the bus by sending requests), or targets (which respond to re-
quests). The interconnect allows for the instantiation of complex bus systems
FIGURE 2.13
PureSpec validation framework.
FIGURE 2.14
Schematic view of the STBus interconnect.
2.6.2 Xilinx
2.6.2.1 IBM’s CoreConnect
Xilinx offers the IBM CoreConnect license to all its embedded processor cus-
tomers since CoreConnect technology serves as the infrastructure for all Xilinx
embedded processor designs.
IBM optimized a PowerPC hard core for the Xilinx architecture, which is
included in some FPGA models. The MicroBlaze is a synthesizable 32-bit
Reduced Instruction Set Computing (RISC) processor designed by Xilinx that
can be instantiated as desired. Initially, the PowerPC allowed connection to
the PLB (with an optional DCR interface), while the MicroBlaze offered the
OPB interface. Intercommunication between PLB and OPB subsystems was
enabled through Opb2Plb and Plb2Opb bridges. However, in the latest version
of the Xilinx design tools (the Integrated Software Environment [ISE] suite),
the OPB bus is no longer available, although it is still supported for legacy
designs through the use of adaptation bridges. Now, both the PowerPC and
the MicroBlaze connect to the PLB.
In addition to the Processor Local Bus (version PLBv46) [53], the Embedded
Development Kit (EDK) tool from Xilinx offers the Fast Simplex Link (FSL)
[54] (point-to-point) and Local Memory buses [52], for tightly coupled
memories and modules. A library of components is available with modules that
offer PLB/FSL/Local Memory Bus (LMB) interfaces. Through the "Create
Peripheral Wizard" tool, EDK generates a template that guides designers
in the task of creating PLB/FSL peripherals.
Now, with Xilinx possibly looking to embed ARM processors in its products,
the company has collaborated tightly with ARM to elaborate the new AMBA4
standard, which offers the AXI4-Lite version, a subset of the AXI4 protocol
that greatly reduces the number of wires required to implement the bus.
Designed with FPGAs in mind, it can also be used for custom chips. The
AXI4-Stream protocol, for streaming communications, is also very FPGA-
oriented.
FIGURE 2.15
System showing Wrapped Bus and OCP Instances.
The programmable logic portion can be developed and debugged using the
standard ISE Design Suite and other third-party HDL and algorithmic design
tools.
Because the Extensible Processing Platform takes a processor-centric ap-
proach (it boots the Processing System at reset and then manages the pro-
grammable logic configuration), a more software-centric development flow is
enabled (Figure 2.16).
This flow enables the System Architect, Logic Designer, and Software De-
veloper to work in parallel, using their familiar programming environments,
then merge the final releases into the software baseline. As a result, key par-
titioning decisions on system functions/performance can be made early and
throughout the development process. This is critical for embedded systems
where application complexity is driving tremendous levels of system perfor-
mance against tightly managed cost, schedule, and power budgets. System
Architects and Software Developers typically define the system initially from
the software perspective and then determine what functions they need to
offload or accelerate in hardware. This allows them to trial fit their design
against the performance, cost, and power targets of the application. At this
proof-of-concept stage, System Architects and Software Developers are most
concerned with having flexibility over what can be performed in hardware or
run in software to meet the specific application requirements. Iteratively, they
converge on the optimal partitioning of hardware and software, and then re-
fine both to fit the system requirements. The Extensible Processing Platform
is ideal for this process as it will accelerate convergence on a more idealized
programming platform. It is important to note that the AMBA-AXI interfaces
are key in enabling the software-centric flow because they present a seamless,
common, and well-defined environment for the hardware extensions. While the
Logic Designer will need to deeply understand this technology, for the Soft-
ware Developer, the AMBA interfaces abstract the extended logic as memory-
mapped calls. This allows for a straightforward interplay of hardware and
software programming in a parallel state of development.
FIGURE 2.16
Software-centric development flow enabled by the Extensible Processing
Platform.
FIGURE 2.17
System showing Wrapped Bus and OCP Instances
United Microelectronics Corp., ARM Ltd., and MIPS Technologies Inc. are
part of the effort, too.
2.7.2 Specification
OCP-IP is dedicated to proliferating a common standard for intellectual prop-
erty (IP) core interfaces, or sockets, that facilitate “plug and play” System-
on-Chip design.
The Open Core Protocol (OCP) version 1.0 defines the basis of this high-
performance, bus-independent interface between IP cores. An IP core can be
a simple peripheral core, a high-performance microprocessor, or an on-chip
communication subsystem such as a wrapped on-chip bus.
OCP separates the computational IP core from its communication activity
by defining a point-to-point interface between two communicating entities, such
as IP cores and bus interface modules (bus wrappers). One entity acts as
the master of the OCP instance and the other as the slave. Only the master
can present commands and be the controlling entity. The slave responds to
commands presented to it. For two entities to communicate in a peer-to-peer
fashion, two instances of the OCP connecting them are needed.
Figure 2.17 shows a simple system containing a wrapped bus and three IP
core entities: a system target, a system initiator, and an entity that behaves
as both.
A transfer across this system occurs as follows. A system initiator (as the
OCP master) presents command, control, and possibly data to its connected
slave (a bus wrapper interface module). The interface module plays the request
across the on-chip bus system. The OCP does not specify the embedded bus
functionality. Instead, the interface designer converts the OCP request into
an embedded bus transfer. The receiving bus wrapper interface module (as
the OCP master) converts the embedded bus operation into a legal OCP
command. The system target (OCP slave) receives the command and takes
the requested action.
The Open Core Protocol version 2.0, released in September 2003, adds
many enhancements to the 1.0 specification, including a new burst model,
the addition of in-band signaling, endianness specification, enhanced thread-
ing features, dual reset facilities, lazy synchronization, and additional write
semantics. In November 2009, the Specification Working Group released the
OCP 3.0 Specification [38]. This latest version contains extensions to sup-
port cache coherence and more aggressive power management, as well as an
additional high-speed consensus profile and other new elements.
Since version 1.0, all signaling is synchronous with reference to a single
interface clock, and all signals except for the clock are unidirectional and
point-to-point, resulting in a very simple interface design and very simple
timing analysis. However, given the wide range of IP core functionality,
performance, and interface requirements, a fixed-definition interface protocol
cannot address the full spectrum of requirements. The need to support
verification and test requirements adds an even higher level of complexity to
the interface. To address this spectrum of interface definitions, the OCP
defines a highly configurable interface. The OCP's structured methodology
includes all of the signals required to describe an IP core's communication,
including data flow, control, verification, and test signals.
Next, we describe some of the main characteristics of the OCP protocol:
• Optimizes die area by configuring into the OCP interfaces only those
features needed by the communicating cores.
The OCP provides the option of having responses for Write commands,
or completing them immediately without an explicit response (posted write
commands). The protocol also offers some advanced features, like burst
transfers or multiple-cycle access models, where signals are held static for
several clock cycles to simplify timing analysis and reduce implementation
area.
According to the OCP specifications, there are two basic commands (types of
transfers): Read and Write. The following sequence describes the pipelined
read example of Figure 2.18:
FIGURE 2.18
Pipelined OCP Request and Response.
• A) The master starts the first read request, driving RD on MCmd and
a valid address on MAddr. The slave asserts SCmdAccept, for a request
accept latency of 0.
• B) Since SCmdAccept is asserted, the request phase ends. The slave
responds to the first request with DVA on SResp and valid data on
SData.
• C) The master launches a read request and the slave asserts SCmdAc-
cept. The master sees that SResp is DVA and captures the read data
from SData. The slave drives NULL on SResp, completing the first re-
sponse phase.
• F) The slave has the data for the third read, so it drives DVA on SResp
and the data on SData.
• G) The master captures the data for the third read from SData. The
request-to-response latency for this transfer is 2.
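The request/response phasing of the sequence above can be mimicked with a small behavioral model. The sketch below is purely illustrative, not a real OCP implementation: SCmdAccept is always asserted, the request-to-response latency is a fixed parameter, and the `OCPSlave` class and its method names are our own modeling choices.

```python
from collections import deque

class OCPSlave:
    """Toy OCP-style slave: accepts read requests and returns DVA
    responses after a fixed request-to-response latency."""

    def __init__(self, memory, latency=2):
        self.memory = memory
        self.latency = latency
        self.pending = deque()          # queue of (ready_cycle, address)

    def request(self, cycle, mcmd, maddr):
        # SCmdAccept is modeled as always asserted (accept latency 0).
        if mcmd == "RD":
            self.pending.append((cycle + self.latency, maddr))
            return "accept"

    def response(self, cycle):
        # Drive DVA on SResp once the oldest request's latency has elapsed.
        if self.pending and self.pending[0][0] <= cycle:
            _, addr = self.pending.popleft()
            return ("DVA", self.memory[addr])   # SResp = DVA, SData = value
        return ("NULL", None)                   # SResp = NULL

mem = {0x0: 11, 0x4: 22, 0x8: 33}
slave = OCPSlave(mem, latency=2)
slave.request(0, "RD", 0x0)
print(slave.response(1))   # data not ready yet: NULL
print(slave.response(2))   # DVA with the read data
```

Because requests queue up independently of responses, several reads can be outstanding at once, which is exactly the pipelining the bullet sequence illustrates.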
2.7.4 Tools
One of the advantages of the OCP protocol is the multitude of tools available
to perform design space exploration and verification of OCP-based systems. A
big effort has been made by the community in this direction, including the
OCP-IP University Program. Members of this program are entitled to receive
free software tools, technical support, and training that is packaged and ready
for incorporation into a course, or for immediate independent use by students.
OCP-IP members also receive free training, support, and software tools,
enabling them to focus on the challenges of SoC design. We next describe
some of these tools: CoreCreator, the OCP TLM Modeling Kit, and the
OCP conductor.
2.7.4.1 CoreCreator
CoreCreator is a complete OCP verification environment. The latest version,
CoreCreator II [37], provides capabilities to simulate OCP cores and OCP-
based systems. It includes verification IP to generate and respond to OCP
stimulus, an OCP checker to ensure protocol compliance, a performance ana-
lyzer to measure system performance, and a disassembler, which helps to view
the behavior of OCP traffic (see Figure 2.19). CoreCreator II can be used with
FIGURE 2.19
CoreCreator components.
• The first component in the SOLV package is the Sonics SVA OCP
Checker. It validates OCP sockets for protocol compliance during simu-
lation and generates OCP trace files for use by postprocessing tools. The
checker sits between the Master and the Slave, as depicted in Figure
2.20, and captures the values of the OCP signals at each OCP clock
cycle, comparing them to the OCP protocol requirements as defined
within the official OCP specification. This tool enables users to quickly
and efficiently identify OCP protocol violations within individual sockets,
thereby reducing debug and validation workloads.
FIGURE 2.20
An instantiation of the Sonics OCP Checker (example of the Sonics OCP Li-
brary for Verification: SOLV) monitors the OCP2 protocol compliance during
the simulation.
• The second tool within the SOLV package is the OCP Disassembler
(OCPDis2). OCPDis2 is a command line tool that allows for the dis-
play of OCP connection activity in a convenient report format. During
simulation, OCP connection activity can be logged into an OCP trace
file. These trace files are essentially tables of hexadecimal values until
OCPDis2 disassembles them into human readable data.
FIGURE 2.21
System showing Wrapped Bus and OCP Instances.
2.8 Wishbone
The WISHBONE System-on-Chip interconnect [51] is an open source hard-
ware computer bus. It is an attempt to define a standard interconnection
scheme for IP cores so that they can be integrated more quickly and easily by
the end user.
A large number of open-source designs for CPUs and auxiliary computer
peripherals have been released with Wishbone interfaces. Many can be found
at OpenCores [39], a foundation that attempts to make open-source hardware
designs available.
It uses a Master/Slave architecture. Cores with Master interfaces initiate
data transactions to participating Slave interfaces. All signals are synchronous
to a single clock but some slave responses must be generated combinatorially
for maximum performance. Some relevant Wishbone features worth mention-
ing are the multimaster capability, which enables multiprocessing; an arbi-
tration methodology defined by end users according to their needs; and scal-
able data bus widths (8-, 16-, 32-, and 64-bit) and operand sizes. Moreover,
the hardware implementation of bus interfaces is simple and compact (suitable
for FPGAs), and the hierarchical view of the Wishbone architecture supports
structured design methodologies [23]. Wishbone also permits the addition of
a "tag bus" to describe the data.
The hardware implementation supports various IP core interconnection
schemes, including point-to-point connection, shared bus, crossbar switch
implementation, data-flow interconnection, and even off-chip interconnection.
The crossbar switch interconnection is typically used when connecting two or
more masters together so that each one can access two or more slaves (in a
point-to-point fashion).
The Wishbone specification does not require the use of specific develop-
ment tools or target hardware. Furthermore, it is fully compliant with virtually
all logic synthesis tools.
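The Wishbone classic handshake (CYC and STB from the master, ACK plus read data from the slave) can be sketched with a minimal behavioral model. The class below and its signal-dictionary interface are our own illustrative choices, not part of the Wishbone specification; it models a zero-wait-state slave whose ACK is generated combinationally, as the text mentions.

```python
class WishboneSlave:
    """Minimal Wishbone-style slave with combinational ACK
    (zero wait states); an illustrative sketch only."""

    def __init__(self, memory):
        self.memory = memory

    def cycle(self, cyc, stb, we, adr, dat_w=None):
        # The slave responds only while the master asserts both CYC and STB.
        if not (cyc and stb):
            return {"ack": 0, "dat_r": None}
        if we:                               # write cycle
            self.memory[adr] = dat_w
            return {"ack": 1, "dat_r": None}
        return {"ack": 1, "dat_r": self.memory.get(adr)}   # read cycle

slave = WishboneSlave({0x10: 0xCAFE})
print(slave.cycle(cyc=1, stb=1, we=0, adr=0x10))   # single read, ACK same cycle
```

A wait-stated slave would instead hold `ack` at 0 for some cycles while the master keeps CYC/STB asserted; the handshake shape stays the same.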
FIGURE 2.22
In-car intelligent elements.
2.9.1 Automotive
The most widely used automotive bus architecture is the CAN bus. The CAN
(Controller Area Network) is a multimaster broadcast serial bus standard for
connecting electronic control units (ECUs). It was originally developed by
Robert Bosch GmbH in 1986 for in-vehicle networks in cars, and Bosch
published the CAN 2.0 specification in 1991 [13]. It currently dominates the
automotive
industry, also having considerable impact in other industries where noise im-
munity and fault tolerance are more important than raw speed, such as factory
automation, building automation, aerospace systems, and medical equipment.
Hundreds of millions of CAN controllers are sold every year and most go into
cars. Typically, the CAN controllers are sold as on-chip peripherals in mi-
crocontrollers. Bosch holds patents on the technology, and manufacturers of
CAN-compatible microprocessors pay license fees to Bosch.
The applications of CAN bus in automobiles include window and seat
operation (low speed), engine management (high speed), brake control (high
speed), and many other systems. Figure 2.22 shows some of the typical in-car
elements.
Choosing a CAN controller defines the physical and data-link portions of
your protocol stack. In a closed system, designers can implement their own
higher-level protocol. If they need to interoperate with other vehicle
components, though, the vehicle manufacturer will most likely mandate the use of
one of the standard higher-level protocols.
For the physical layer, a twisted pair cable (with a length ranging from
1,000m at 40Kbps to 40m at 1Mbps) carries the information on the bus as a
voltage difference between the two lines. The bus is therefore immune to any
ground noise and to electromagnetic interference, which in a vehicle can be
considerable.
Each node is able to send and receive messages, but not simultaneously. A
message consists primarily of an ID, usually chosen to identify the message
type or sender (interpreted differently depending on the application or higher-
level protocols used), and up to eight data bytes. It is transmitted serially
onto the bus. This signal pattern is encoded in Non-Return-to-Zero (NRZ)
and is sensed by all nodes. The devices connected by a CAN network
are typically sensors, actuators, and other control devices. These devices are
not connected directly to the bus, but through a host processor and a CAN
controller. All messages carry a cyclic redundancy code (CRC).
CAN features an automatic “arbitration free” transmission. A CAN mes-
sage that is transmitted with highest priority will “win” the arbitration, and
the node transmitting the lower priority message will sense this and back off
and wait. This is achieved by CAN transmitting data through a binary model
of "dominant" bits and "recessive" bits, where dominant is a logical 0 and
recessive is a logical 1. This implies an open-collector, or "wired-OR," physical
implementation of the bus (but since dominant is 0, it is sometimes referred to
as wired-AND). If one node transmits a dominant bit and another node trans-
mits a recessive bit, then the dominant bit "wins" (a logical AND between the
two).
If the bus is free, any node may begin to transmit. If two or more nodes
begin sending messages at the same time, the message with the more dominant
ID (which has more dominant bits, i.e., zeroes) will overwrite other nodes’ less
dominant IDs, so that eventually (after this arbitration on the ID) only the
dominant message remains and is received by all nodes.
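The wired-AND arbitration described above is easy to model: at each bit time every contender drives one ID bit, the bus carries the AND of the driven bits, and a node that drove a recessive 1 but reads back a dominant 0 withdraws. The sketch below is an illustrative model assuming 11-bit base-frame identifiers; the function name is ours.

```python
def can_arbitrate(ids, id_bits=11):
    """Bitwise wired-AND arbitration over CAN message IDs (dominant = 0).

    Returns the winning, i.e. highest-priority (numerically lowest), ID.
    """
    contenders = set(ids)
    # IDs are sent MSB first, so arbitrate from the top bit down.
    for bit in range(id_bits - 1, -1, -1):
        bus = min((i >> bit) & 1 for i in contenders)   # wired-AND of driven bits
        # Nodes that drove recessive (1) while the bus reads dominant (0) back off.
        contenders = {i for i in contenders if (i >> bit) & 1 == bus}
        if len(contenders) == 1:
            break
    return contenders.pop()

print(can_arbitrate([0x1A5, 0x0F3, 0x2B0]))   # lowest ID wins the bus
```

Running this with IDs 0x1A5, 0x0F3, and 0x2B0 leaves 0x0F3 on the bus: it has the most leading dominant (zero) bits, so the other two nodes back off without any bits being corrupted, which is why the arbitration is nondestructive.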
There are several high-level communication standards that use CAN as
the low-level protocol implementing the data-link and physical layers, for
example, the time-triggered CAN (TTCAN) protocol [12] (standardized in
ISO 11898-4). Time-triggered communication means that activities are trig-
gered by the elapsing of time segments: in a time-triggered communication
system, all points of time of message transmission are defined during the
development of the system. A time-triggered communication system is ideal
for applications in which the data traffic is of a periodic nature. Other high-
level protocols are CANopen, DeviceNet, and J1939 (http://www.can-cia.org/).
For engine management, the J1939 protocol is common, while CANopen is
preferred for body management, such as lights and locks. Both buses run
on the same hardware; different application-specific needs are met by the
higher-level protocols.
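The time-triggered idea can be illustrated by unrolling a static schedule: every message is assigned a fixed offset within a repeating basic cycle at design time, so all transmission instants are known in advance. The sketch below is a simplified illustration of this principle, not the actual TTCAN system matrix; the function name, parameters, and message names are hypothetical.

```python
def ttcan_schedule(basic_cycle_ms, slots, cycles=3):
    """Unroll a static, TTCAN-style schedule (illustrative sketch).

    basic_cycle_ms -- length of the repeating basic cycle, in ms
    slots          -- list of (offset_ms, message_name) fixed at design time
    cycles         -- how many basic cycles to unroll
    """
    timeline = []
    for cycle_start in range(0, cycles * basic_cycle_ms, basic_cycle_ms):
        for offset, msg in slots:
            # Every message recurs at the same offset in every basic cycle.
            timeline.append((cycle_start + offset, msg))
    return timeline

# A 10 ms basic cycle with two periodic messages, unrolled for three cycles.
print(ttcan_schedule(10, [(0, "engine_rpm"), (4, "brake_pressure")]))
```

Because the timeline is fully determined before deployment, worst-case latencies are known by construction, which is what makes this scheme attractive for periodic traffic.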
CAN is a relatively slow medium and cannot satisfy all automotive
needs [34]. For example, in-car entertainment requires high-speed audio and
video data transfers.
2.9.2 Avionics
With the purpose of reducing integration costs and providing distributed data
collection and processing, the most relevant features of avionics buses include
deterministic behavior, fault tolerance, and redundancy. Most avionics buses
are serial in nature: a serial bus using only a few sets of wires keeps the
point-to-point wiring and weight down to a minimum. The newest bus stan-
dards use fiber optics to provide a significantly reduced weight and power
solution for spacecraft subsystem interfacing. Another benefit of these tech-
nology advances is that small-scale embedded subsystems can now be imple-
mented as a System-on-Chip using the latest generations of FPGAs. This
creates opportunities for Commercial, Off-the-Shelf (COTS) vendors to offer
IP cores for single or multiple remote terminals, bus controller, and bus
monitor functions.
Avionics architectures typically separate the flight safety-critical elements
such as primary flight control, cockpit, landing gear, and so on from less
critical elements such as cabin environment, entertainment, and, in the case of
military aircraft, the mission systems. This separation offers less onerous initial
certification and allows incremental addition, as is often required for regulatory
reasons, without the need for complete recertification. Significant savings in
weight and power could be made with an integrated systems approach, using
centralized computing supporting individual applications running in secure
partitions with critical and noncritical data sharing the same bus.
While it appears that avionics buses are being left behind by the pace of
technological change, there are sound economic and safety reasons why avion-
ics architectures cannot change so rapidly. Avionics buses are traditionally
slow to evolve, partly because requirements change so slowly and partly be-
cause of the costs of development, certification, and sustainment. It is with
the development of new airplanes that the demand for new bus architectures
evolves. This can be seen in the adoption of Fibre Channel for JSF and
ARINC 664, also known as AFDX (Avionics Full-Duplex Switched Ethernet), for
new Boeing and Airbus airplane types. Some buses, although technically well
suited, such as the Time-Triggered Protocol (TTP), have been slow to gain
adoption and might only find use in niche applications. However, although the
rate of change might be slow, the nature of the market still leaves room for
much innovation in packaging, soft cores, and test equipment by embedded
computing vendors.
Next, we will review some of the most widespread standards, divided into
military and civil solutions, and then we dedicate a small section to the task
of debugging these systems.
2.9.2.1 Military
MIL-STD-1553 (rev B) [16] is the best known military example, developed
in the 1970s by the United States Department of Defense. It was originally
designed for use with military avionics, but has also become commonly used
in spacecraft on-board data handling (OBDH) subsystems, both military and
civil. It features a dual redundant balanced line physical layer, a (differential)
network interface, time division multiplexing, half-duplex command/response
protocol, and up to 31 remote terminals (devices). A version of MIL-STD-1553
using optical cabling in place of electrical is known as MIL-STD-1773. It is
now widely used by all branches of the U.S. military and has been adopted
by NATO as STANAG 3838 AVS. MIL-STD-1553 is being replaced on some
newer U.S. designs by FireWire.
Another specification, the MIL-STD-1760 (rev D) [17] Aircraft/Store Elec-
trical Interconnection System, is a very particular case. It defines an electrical
interface between a military aircraft and its carriage stores. Carriage stores
range from weapons, such as GBU-31 JDAM, to pods, such as AN/AAQ-14
LANTIRN, to external fuel tanks. Prior to adoption and widespread use of
MIL-STD-1760, new store types were added to aircraft using dissimilar, pro-
prietary interfaces. This greatly complicated the aircraft equipment used to
control and monitor the store while it was attached to the aircraft: the stores
management system, or SMS. The specification document defines the electri-
cal characteristics of the signals at the interface, as well as the connector and
pin assignments of all of the signals used in the interface. The connectors are
designed for quick and reliable release of the store from the aircraft. Weapon
stores are typically released only when the aircraft is attacking a target, under
command of signals generated by the SMS. All types of stores may be released
during jettison, which is a nonoffensive release that can be used, for example,
to lighten the weight of the aircraft during an emergency.
2.9.2.2 Civil
SpaceWire [1] is a spacecraft communication network based in part on the
IEEE 1355 standard of communications. SpaceWire is defined in the Euro-
pean Cooperation for Space Standardization ECSS-E50-12A standard, and
is used worldwide by the European Space Agency (ESA) and other interna-
tional space agencies including NASA, JAXA, and RKA. Within a SpaceWire
network, the nodes are connected through low-cost, low-latency, full-duplex,
point-to-point serial links and packet-switching routers that use wormhole
routing. SpaceWire covers two (the physical and data-link layers) of the seven
layers of the OSI model for communications.
AFDX [11] is defined as the next-generation aircraft data network (ADN).
It is based upon IEEE 802.3 Ethernet technology, and utilizes COTS com-
ponents. AFDX is described specifically by Part 7 of the ARINC 664 Spec-
ification, as a special case of a profiled version of an IEEE 802.3 network.
This standard was developed by Airbus Industries for the A380. It has been
accepted by Boeing and is used on the Boeing 787 Dreamliner. AFDX extends
the original ARINC 664 standard with guaranteed bandwidth and improved
reliability. It utilizes a cascaded star topology network, where each switch
can be bridged to other switches on the network. By utilizing this form of
network structure, AFDX is able to significantly reduce wire runs, thus reduc-
ing overall aircraft weight. Additionally, AFDX provides dual link redundancy
and Quality of Service (QoS). The six primary aspects of AFDX are full-
duplex operation, redundancy, determinism, high-speed performance, and a
switched, profiled network.
Prior to AFDX, ADNs primarily utilized the ARINC 429 standard. The
ARINC 429 Specification [22] establishes how avionics equipment and sys-
tems communicate on commercial aircraft. This standard, developed over
thirty years ago, has proven to be highly reliable in safety-critical applications,
and is still widely used today on a variety of aircraft from both Boeing and
Airbus, including the B737, B747, B757, B767, and the Airbus A330, A340,
A380, and A350. The specification defines the electrical characteristics, word
structures, and protocols necessary to establish the bus communication. The
hardware consists of a single transmitter, or source, connected to 1 to 20
receivers, or sinks, on one twisted wire pair. A data word consists of 32 bits
communicated over the twisted-pair cable using bipolar return-to-zero modu-
lation. There are two speeds of transmission: high speed operates at 100 kbit/s
and low speed operates at 12.5 kbit/s. Data can be transmitted in one direc-
tion only (simplex communication), with bidirectional transmission requiring
two channels or buses. ARINC 429 operates in such a way that its single
transmitter communicates in a point-to-point connection, thus requiring a
significant amount of wiring, which amounts to added weight.
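The 32-bit word just described can be illustrated with a packing sketch. The model below assumes the commonly documented field layout (label in bits 1-8, SDI in bits 9-10, data in bits 11-29, SSM in bits 30-31, odd parity in bit 32) and deliberately omits real-world details such as the label's reversed transmission order and the BNR/BCD data encodings; the function name is ours.

```python
def arinc429_word(label_octal, sdi, data, ssm):
    """Pack a 32-bit ARINC 429-style word (illustrative sketch).

    label_octal -- equipment label, written in octal as is conventional
    sdi         -- 2-bit Source/Destination Identifier
    data        -- 19-bit data field (encoding is application-specific)
    ssm         -- 2-bit Sign/Status Matrix
    """
    word = int(str(label_octal), 8) & 0xFF    # bits 1-8: label
    word |= (sdi & 0x3) << 8                  # bits 9-10: SDI
    word |= (data & 0x7FFFF) << 10            # bits 11-29: data
    word |= (ssm & 0x3) << 29                 # bits 30-31: SSM
    if bin(word).count("1") % 2 == 0:         # bit 32: odd parity
        word |= 1 << 31
    return word

w = arinc429_word(205, 0, 0x1234, 0b11)
print(f"0x{w:08X}")
```

The parity bit is chosen so that the whole 32-bit word always carries an odd number of ones, letting every sink detect single-bit errors.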
Another standard, ARINC 629 [36], introduced by Boeing for the 777,
provides increased data speeds of up to 2 Mbit/s and allows a maximum of
120 data terminals. This ADN operates without the use of a bus controller,
thereby increasing the reliability of the network architecture. The drawback of
this system is that it requires custom hardware that can add significant cost
to the aircraft. Because of this, other manufacturers did not openly accept the
ARINC 629 standard.
FIGURE 2.23
In-car intelligent elements.
FIGURE 2.24
Automated House.
the middle of the night. Automating a house means interconnecting many
sensors, actuators, and controls to provide a centralized service. For this
reason, multiple standards have been developed aiming at efficiently inter-
connecting the elements of an automated environment. Currently, the two
most successful home automation technologies are X10 and C-Bus.
X10 [20] is an international and open industry standard for communication
among electronic devices used for home automation. It primarily uses power-
line wiring for signaling and control (a signal imposed upon the standard AC
power line). A wireless radio-based protocol transport is also defined. X10
was developed in 1975 by Pico Electronics of Glenrothes, Scotland, in order
to allow remote control of home devices and appliances. It was the first general-
purpose domotic network technology and remains the most widely available
due to its simplicity (it can be installed without re-cabling the house) and low
price. An X-10 command usually includes two actions: activate a particular
device (a message code indicating the device), and then send the function to be
executed (a message with the function code). Table 2.1 describes all commands
supported by the X-10 standard.
TABLE 2.1
X10 commands
Code  Function          Description
0000  All Units Off     Switches off all devices with the house code indi-
                        cated in the message
0001  All Lights On     Switches on all lighting devices (with the ability
                        to control brightness)
0010  On                Switches on a device
0011  Off               Switches off a device
0100  Dim               Reduces the light intensity
0101  Bright            Increases the light intensity
0111  Extended Code     Extension code
1000  Hail Request      Requests a response from the device(s) with the
                        house code indicated in the message
1001  Hail Acknowledge  Response to the previous command
101x  Pre-Set Dim       Allows the selection of two predefined levels of
                        light intensity
1100  Extended Data     Additional data
1101  Status is On      Response to the Status Request indicating that
                        the device is switched on
1110  Status is Off     Response indicating that the device is switched off
1111  Status Request    Request asking for the status of a device
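The two-message structure of an X-10 command can be sketched as follows. The function codes come from Table 2.1, but the message representation and the `FUNCTIONS` name-to-code mapping are our own illustrative choices; the real line-level X10 encoding additionally involves a start pattern and non-sequential house/unit code tables, which are omitted here.

```python
# 4-bit function codes from Table 2.1 (a subset, for illustration).
FUNCTIONS = {
    "all_units_off": 0b0000, "all_lights_on": 0b0001,
    "on": 0b0010, "off": 0b0011, "dim": 0b0100, "bright": 0b0101,
    "status_request": 0b1111,
}

def x10_command(house, unit_code, function):
    """Build the two messages of an X-10 command: first address the
    device, then send the function code (illustrative sketch)."""
    address_msg = ("address", house, unit_code)
    function_msg = ("function", house, FUNCTIONS[function])
    return [address_msg, function_msg]

# Switch on unit 3 of house code "A": one address message, one function message.
print(x10_command("A", 3, "on"))
```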
2.10 Security
It is clear that programmability and high performance will increasingly take
center stage as desirable elements in future SoC designs. However, there is
another element, security, that is likely to become crucial in most systems.
In recent years, SoC designers have become increasingly aware of the need
for security functionality in consumer devices. Security has emerged as a key
consideration, protecting both the device and its content from tampering and
copying. A system is only as secure as its weakest link, and security becomes
ever more important as more equipment moves to a system-on-chip approach.
With more and more e-commerce applications running on phone handsets
today, mobile systems are now looking to adopt security measures. While mo-
bile processors have previously relied on Subscriber Identity Module (SIM)
cards as the secure element, the processor architecture and integration archi-
tecture are now critical to the security of the whole system as more and more
of the peripherals are being integrated into a single chip.
2.11 Conclusions
MPSoC designs are increasingly being used in today's high-performance sys-
tems. The choice of the communication architecture in such systems is very
important because it supports the entire interconnect data traffic and has a
significant impact on overall system performance. Throughout this chapter,
we have provided an overview of the most widely used on-chip communication
architectures.
Starting from the simple solution of a system bus that interconnects the
different elements, we have seen how, as technology scales down transistor
size and more elements fit inside a chip, the scalability limitations of bus-based
solutions are making on-chip interconnect manufacturers rapidly evolve toward
multilayer hierarchical buses or even packet-switched interconnect networks.
Selecting and configuring communication architectures to meet applica-
tion specific performance requirements is a very time-consuming process that
cannot be solved without advanced design tools. Such tools should be able
to automatically generate a topology, guided by the designer-introduced con-
straints, and report estimated power consumption and system performance,
as well as generate simulation and verification models.
New design methodologies allow simple system creation, based on the
instantiation of different IP components included in a preexisting library.
Thanks to the standardization of system interconnects, the designer can select
the components and interconnect them in a plug-and-play fashion. If needed,
new components can be easily created, thanks to the existence of tools for
validation of the different protocols.
In this heterogeneous context, Open Standards, such as the OCP-IP, are
key towards unification. Bridges, adapters, and protocol converters are pro-
vided by manufacturers to ease interoperability among the different protocols,
so that designers are not restricted to one proprietary solution. They can trade
off performance and complexity to meet design requirements.
To conclude, we have visited some specific buses present in the home-automation, avionics, and automotive areas, showing their different inherent characteristics.
The next big issue for upcoming generations of chips will be security, and interconnect support is vital to provide system-wide protection.
2.12 Glossary
ADK: AMBA Design Kit
ADN: Aircraft Data Network
AFDX: Avionics Full-Duplex Switched Ethernet
AHB: AMBA High-performance Bus
AMBA: Advanced Microcontroller Bus Architecture
APB: Advanced Peripheral Bus
ASB: Advanced System Bus
ASIC: Application-Specific Integrated Circuit
ATB: Advanced Trace Bus
AVCI: Advanced VCI
AXI: Advanced eXtensible Interface
BCA: Bus-Cycle Accurate
BFL: Bus Functional Language
BFM: Bus Functional Model
BVCI: Basic VCI
CAN: Controller Area Network
COTS: Commercial Off-the-Shelf
CRC: Cyclic Redundancy Check
DCR: Device Control Register Bus
DMA: Direct Memory Access
Communication Buses for SoC Architectures 77
RMW: Read-Modify-Write
SoC: System-on-a-Chip
2.13 Bibliography
[1] http://spacewire.esa.int.
[54] Xilinx. Fast Simplex Link (FSL) Bus (v2.11c) Data Sheet, 2010.
3
NoC Architectures
Martino Ruggiero
École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
CONTENTS
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.2 Advantages of the NoC Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.3 Challenges of the NoC Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.4 Principles of NoC Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.4.1 Topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.4.2 Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.4.2.1 Oblivious Routing Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.4.2.2 Deterministic Routing Algorithms . . . . . . . . . . . . . . . . . . . . . . 97
3.4.2.3 Adaptive Routing Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
3.4.2.4 Problems on Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
3.4.3 Flow Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
3.4.3.1 Message-Based . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
3.4.3.2 Packet-Based . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.4.3.3 Flit-Based . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.5 Basic Building Blocks of a NoC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
3.5.1 Router . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
3.5.1.1 Virtual Channels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
3.5.2 Network Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
3.5.3 Link . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
3.6 Available NoC Implementations and Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
3.6.1 IBM Cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
3.6.1.1 Element Interconnect Bus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
3.6.2 Intel TeraFLOPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
3.6.2.1 TeraFLOPS Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
3.6.3 RAW Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
3.6.4 Tilera Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.6.4.1 iMesh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
3.6.5 Intel Single-Chip Cloud Computer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
3.6.6 ST Microelectronics STNoC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
3.6.7 Xpipes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
3.6.8 Aethereal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
3.6.9 SPIN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
3.6.10 MANGO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
3.6.11 Proteo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
3.6.12 XGFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
3.6.13 Other NoCs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
3.6.13.1 Nostrum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
3.6.13.2 QNoC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
3.6.13.3 Chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
3.1 Introduction
Constant technology scaling enables the integration and implementation of
increasingly complex functionalities onto a single chip. Nowadays advance-
ments in chip manufacturing technology allow the integration of several hard-
ware components into a single integrated circuit reducing both manufacturing
costs and system dimensions. A System-on-a-Chip (SoC) can integrate several Intellectual Property (IP) blocks, and chip vendors are releasing multicore products with increasing core counts [18]. This multicore trend may lead to hundreds or even thousands of cores integrated on a single chip. As core counts increase, there is a corresponding increase in the bandwidth demand needed to keep core utilization high. The raw performance of the processing elements is of no use unless data can be fed to them at the appropriate rates. SoC architectures are indeed becoming communication-bound, and a scalable, high-bandwidth communication fabric becomes critically important [29]: there is a clear need for scalable interconnection fabrics.
Considering the level of integration enabled by recent silicon technology advances, reliable communication among the system components is also becoming a major concern. On-chip communication in a billion-transistor SoC must face new challenges: not only scalability and performance, but also reliability and energy reduction. Deep-submicron effects caused by shrinking transistor feature sizes, such as crosstalk, capacitive coupling, and wire inductance, will be an ever-greater issue. Process, thermal, and voltage variations introduce unpredictable performance and errors at every architectural level, mainly due to uncertainties in fabrication and run-time operation [20]. Implementation constraints are also increasingly complicating
the design process. While silicon area still requires optimization, power con-
sumption becomes more critical along with other physical considerations like
thermal hotspots. Despite these technological and architectural difficulties,
design times are still expected to shorten due to time-to-market pressure. Reusing simple bus architectures for communication does not satisfy the aforementioned requirements [6].
Traditionally, SoC designs have utilized topologies based on shared buses. However, crossbar and point-to-point interconnections usually scale efficiently only up to about twenty cores [34]; for larger numbers of IPs integrated in a single chip, a more scalable and flexible solution is needed. When shared buses and custom point-to-point communication are no longer sufficient, more elaborate networks are the obvious choice. By turning from the current path of buses and custom communication designs for the higher levels of on-chip interconnection, it is possible to reach high performance with lower design and verification costs. The solution is an on-chip data-routing network built from communication links and routing nodes, generally known as a Network-on-Chip (NoC) architecture [5].
A generic NoC design divides a chip into a set of tiles, with each tile con-
taining a processing element and an on-chip router. Each processing element
is connected to the local router through a network interface controller that packetizes/depacketizes the data into/from the underlying interconnection
network. Each router is connected to other routers forming a packet-based
on-chip network.
A NoC design is a very complex optimization problem, and the challenges lie in its huge design space, which has several dimensions [28]. The first is the choice of the communication infrastructure, such as network topology, channel bandwidth, and buffer size. This
dimension defines how routers are interconnected to each other and reflects
the fundamental properties of the underlying network. Another dimension de-
pends on the communication paradigm, which dictates the dynamics of trans-
ferring and routing messages on the network. The application task mapping is
the last dimension: it decides how the different application tasks are mapped
to the network nodes.
NoCs have been widely studied and reported in several special issues in
journals, numerous special sessions in conferences, and books [26][7][17]. How-
ever, the aforementioned sources do not present many implementation exam-
ples of current NoC proposals. This chapter presents an in-depth review of the state of the art of existing NoC implementations: a vast set of studies gathered from the literature and analyzed. No optimal NoC exists in the general case; the main goal of this review is to highlight the major architectural and technological trends.
The chapter is structured as follows. Sections 3.2 and 3.3 discuss main ad-
vantages and challenges introduced by the NoC paradigm. Section 3.4 presents
the design principles of NoC, while Section 3.5 describes its fundamental build-
ing blocks. Section 3.6 presents and discusses the available NoC implementa-
tions in literature and on the market.
magnitude higher than off-chip I/Os while obviating the inherent delay over-
heads associated with off-chip I/O transmission.
NoCs have the potential to bring a large number of advantages to on-chip
communication. One of them is virtually unlimited architectural scalability: the higher bandwidth requirements of larger numbers of cores can be met simply by deploying more switches and links.
NoCs also have a much better electrical performance since all connections
are point-to-point. The length of interswitch links is a design parameter that
can be adjusted. The wire parallelism in links can be controlled at will, since
packet transmission can be serialized. All these factors imply faster propaga-
tion times and total control over crosstalk issues.
In on-chip networks, routing concerns are greatly alleviated due to the
possibility of having narrower links than in buses. Wiring overhead is also
dramatically reduced leading to higher wire utilization and efficiency. More-
over, physical design improvements make NoCs more predictable than buses,
enabling faster and easier design closure achievement.
NoCs exhibit better performance under load than buses or custom com-
munication architectures since their operating frequency can be higher, the
data width is parameterizable, and communication flows can be handled in
parallel with suitable NoC topology design. Virtually any bandwidth load can
be tackled.
When dealing with system design embedding a NoC, IP cores are attached
in point-to-point fashion to on-chip networks via dedicated network inter-
faces. Network interfaces can be specialized for any interface that may be
needed and potentially any core may be seamlessly attached to a NoC given
the proper network interface. Computation and communication concerns are
clearly decoupled at the network interface level enabling a more modular and
plug&play-oriented approach to system assembly.
Hierarchical buses are very often assembled by hand, so manual intervention must be taken into account to tune and validate the overall design. By contrast, NoCs can be designed, optimized, and verified by automated means, leading to large savings in design time and to solutions closer to optimality. Moreover, NoCs can be tuned in a variety of parameters (topology, buffering, data widths, arbitrations, routing choices, etc.),
leading to higher chances of optimally matching design requirements. Being
distributed, modular structures, NoCs can also accommodate differently tuned
regions. For example, some portions of a NoC could be tuned statically for
lower resource usage and lower performance, or could dynamically adjust their
mode of operation.
3.4.1 Topology
The on-chip network topology determines the physical layout and connections
between nodes and channels in the network. Many different topologies exist,
from the simplest crossbar to very complex hierarchical cubes and beyond
[14].
The effect of a topology on overall network cost-performance is profound.
The implementation complexity cost of a topology depends on two factors: the
number of links at each node and the ease of laying out a topology on a chip.
A topology determines the number of hops a message must traverse as well as
the interconnect lengths between hops, thus influencing network latency sig-
nificantly. As traversing routers and links incurs energy, a topology’s effect on
hop count also directly affects network energy consumption. Furthermore, the
topology dictates the total number of alternate paths between nodes, affect-
ing how well the network can spread out traffic and hence support bandwidth
requirements.
A network topology can be classified as either direct or indirect. With a
direct topology, each terminal node (e.g., a processor core or cache in a chip
multiprocessor) is associated with a router, so all routers are sources and des-
tinations of traffic. In a direct topology, nodes can source and sink traffic, as
well as switch through traffic from other nodes. In an indirect topology, routers
are distinct from terminal nodes; only terminal nodes are sources and destina-
tions of traffic, while intermediate nodes simply switch traffic to and from terminal
nodes. In a direct network, packets are forwarded directly between terminal
nodes. With an indirect network, packets are switched indirectly through a
series of intermediate switch nodes between the source and the destination. To
date, most designs of on-chip networks have used direct networks. Co-locating
switches with terminal nodes is often most suitable in area-constrained envi-
ronments such as on-chip networks.
The basic regular network topologies are discussed below.
A mesh-shaped network consists of m columns and n rows (see Figure 3.1). The routers are situated at the intersections of the rows and columns, and the computational resources sit next to the routers. Addresses of routers and resources can easily be defined as x-y coordinates in the mesh.
FIGURE 3.1
A Mesh Topology.
FIGURE 3.2
A Torus Topology.
In a tree topology, the nodes are routers and the leaves are computational resources. The routers above a leaf are called the leaf's ancestors, and correspondingly the leaves below an ancestor are its children. In a fat tree topology
each node has replicated ancestors, which means that there are many alter-
native routes between nodes (see Figure 3.3).
FIGURE 3.3
A Fat Tree Topology.
FIGURE 3.4
A Butterfly Topology.
FIGURE 3.5
A Spidergon Topology.
FIGURE 3.6
A Star Topology.
FIGURE 3.7
A Ring Topology.
the torus, which will appear totally homogeneous. The average logical distance
between routers will become shorter in the torus, but at the cost of possibly
longer physical distances.
Early works on NoC topology design assumed that using regular topolo-
gies, such as meshes, would lead to regular and predictable layouts [22][7].
While this may be true for designs with homogeneous processing cores and
memories, it is not true for most MPSoCs as they are typically composed
of heterogeneous cores in terms of area and communication requirements. A
regular, tile-based floorplan, as in standard topologies, would result in poor
performance, with large power and area overheads. Moreover, for most state-
of-the-art MPSoCs the system is designed with static (or semistatic) mapping
of tasks to processors and hardware cores, and hence the communication traffic
characteristics of the MPSoC can be obtained statically. The clear advantage of allowing custom topologies is the possibility of adapting the network to the application domain at hand. This can give significant savings in parts of the network that handle little traffic, while still supporting significantly more traffic in other parts.
Since the first decision designers have to make when building an on-chip
network is the choice of the topology, it is useful to have a means for quick
comparisons of different topologies before the other aspects of a network are
even determined. Bisection bandwidth is a metric that is often used in the dis-
cussion of the cost of off-chip networks. Bisection bandwidth is the bandwidth
across a cut down the middle of the network. Bisection bandwidth can be used
as a proxy for cost since it represents the amount of global wiring that will be
necessary to implement the network. The degree of a topology instead refers
to the number of links at each node. Degree is useful as an abstract metric of
cost of the network, as a higher degree requires more ports at routers, which
increases implementation complexity. The number of hops a message takes
from source to destination, or the number of links it traverses, defines the hop
count. This is a very simple and useful means for measuring network latency,
since every node and link incurs some propagation delay, even when there is
no contention.
The first aspect to take into account when selecting a topology for a network is the pattern of traffic that will go through it. To determine the most appropriate topology for a system, the advantages and drawbacks of a number of common topologies with respect to the application at hand must therefore be investigated during the early design stages.
The most common topologies in NoC designs are the 2-D mesh and the torus, which together constitute over 60% of cases.
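The abstract metrics introduced above (degree, hop count, bisection) can be computed directly for a concrete topology. The following sketch is a hypothetical helper, assuming a 2-D mesh with one router per tile, minimal (Manhattan-distance) routing, and a bisection cut running vertically down the middle of the network:

```python
from itertools import product

def mesh_metrics(rows, cols):
    """Abstract cost/performance metrics for a rows x cols 2-D mesh.

    Returns (max_degree, diameter, avg_hops, bisection_links), where hop
    counts are Manhattan distances (minimal routing)."""
    nodes = list(product(range(rows), range(cols)))

    def degree(r, c):
        # Number of links at a node: 4 in the interior, fewer at the edges.
        return sum(0 <= r + dr < rows and 0 <= c + dc < cols
                   for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)))

    max_degree = max(degree(r, c) for r, c in nodes)
    # Hop count under minimal routing is the Manhattan distance.
    dists = [abs(r1 - r2) + abs(c1 - c2)
             for r1, c1 in nodes for r2, c2 in nodes if (r1, c1) != (r2, c2)]
    diameter = max(dists)
    avg_hops = sum(dists) / len(dists)
    # Bisection: exactly one horizontal link per row crosses the middle cut.
    bisection_links = rows
    return max_degree, diameter, avg_hops, bisection_links
```

For a 4 x 4 mesh this yields a maximum degree of 4, a diameter of 6 hops, an average hop count of 8/3, and 4 links across the bisection; a torus of the same size would roughly double the bisection links and shorten the average distance, at the cost of the wraparound wires.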
3.4.2 Routing
NoC architectures are based on packet-switched networks. Routers can implement various functionalities, from simple switching to intelligent routing. Routing on a NoC is quite similar to routing on any other network: a routing algorithm determines how data is routed from sender to receiver.
The routing algorithm is used to decide what path a message will take
through the network to reach its destination. The goal of the routing algorithm
is to distribute traffic evenly among the paths supplied by the network topol-
ogy, so as to avoid hotspots and minimize contention, thus improving network
latency and throughput. All of these performance goals must be achieved while
adhering to tight constraints on implementation complexity: routing circuitry
can stretch critical path delay and add to a router’s area footprint. While
energy overhead of routing circuitry is typically low, the specific route chosen
affects hop count directly, and thus substantially affects energy consumption.
Since embedded systems are constrained in area and power consumption, but
still need high data rates, routers must be designed with hardware usage in
mind.
Routing algorithms can be classified in various ways. For on-chip communication, unicast routing strategies (i.e., the packets have a single destination) seem to be a practical approach, due to the presence of point-to-point communication links among the various components inside a chip. Based on where the routing decision is made, unicast routing can be further classified into four classes: centralized routing, source routing, distributed routing, and multiphase routing.
In centralized routing, a centralized controller controls the data flow in
a system. In case of source routing, the routing decisions are taken at the
point of data generation, while in distributed routing, the routing decisions
are determined as the packets/flits flow through the network. The hybrid of the source and distributed schemes is called multiphase routing.
Routing algorithms can also be classified by their implementation: lookup table or Finite State Machine (FSM). Lookup-table routing algorithms are the more popular implementation: they are realized in software, with a lookup table stored in every node. We can change the routing
routing never runs into deadlock or livelock. Traditional XY routing nevertheless has some problems: traffic does not spread evenly over the whole network, because the algorithm places the heaviest load in the middle of the network. Algorithms that equalize the traffic load over the whole network are therefore needed.
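The XY algorithm can be sketched in a few lines (the (x, y) router coordinates are illustrative, following the mesh addressing described in Section 3.4.1):

```python
def xy_route(src, dst):
    """Deterministic XY routing in a 2-D mesh: travel along the X
    (column) dimension until the column offset is zero, then along Y.
    src and dst are (x, y) router coordinates; returns the hop list."""
    path = [src]
    x, y = src
    while x != dst[0]:                 # correct the X dimension first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then correct the Y dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path
```

Because packets never turn from the Y dimension back into X, the turn restriction that makes XY routing deadlock-free is built into the structure of the two loops.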
FIGURE 3.8
XY routing from router A to router B.
FIGURE 3.9
Surround horizontal XY.
FIGURE 3.10
Surround vertical XY. There are two optional directions.
FIGURE 3.11
Allowed turns in west-first.
FIGURE 3.12
Allowed turns in north-last.
FIGURE 3.13
Allowed turns in negative-first.
Distance vector routing and link state routing are shortest-path routing algorithms. In
Distance Vector Routing, each router has a routing table that contains infor-
mation about neighbor routers and all recipients. Routers exchange routing
table information with each other and this way keep their own tables up to
date. Routers route packets by computing the shortest path on the basis of their routing tables and then send the packets forward. Distance vector routing is a simple method because no router has to know the structure of the whole network. Link state routing is a modification of distance vector routing: the basic idea is the same, but each router shares its routing table with every other router in the network. Link state routing in Network-on-Chip systems is a customized version of the traditional one. The routing tables covering the whole network are stored in a router's memory already at production time, and routers use their table-updating mechanisms only if there are significant changes in the network's structure or if faults appear.
In source routing, the sender makes all decisions about the routing path of a packet. The whole route is stored in the header of the packet before sending, and the routers along the path forward the packet just as the sender has determined. Vector routing works basically like source routing, except that the routing path is represented as a chain of unit vectors, each corresponding to one hop between two routers. Routing paths do not have to be the shortest possible. The arbitration lookahead scheme (ALOAS) is a faster version of source routing: the routing-path information is supplied to the routers along the path before the packets are even sent, moving along a special channel reserved only for this purpose.
Contention-free routing is an algorithm based on routing tables and time division multiplexing. Each router has a routing table that records the correct output port and time slot for every potential sender-receiver pair.
Destination-tag routing is somewhat like an inverse version of source routing. The sender stores the address of the receiver, known as the destination tag, in the header of the packet at the beginning of routing. Every router then makes its routing decision independently, on the basis of the receiver's address. Destination-tag routing is also known as floating vector routing.
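In a 2-ary butterfly, destination-tag routing is particularly simple: each stage inspects one bit of the destination address. A minimal sketch (the stage count, bit ordering, and port labels are assumptions for illustration):

```python
def destination_tag_ports(dst, stages):
    """Output port (0 = upper, 1 = lower) chosen at each stage of a
    2-ary butterfly: stage i simply inspects bit i of the destination
    tag, MSB first, with no routing state kept in the switches."""
    return [(dst >> (stages - 1 - i)) & 1 for i in range(stages)]
```

For destination 5 (binary 101) in a 3-stage butterfly, the successive routers take ports 1, 0, 1, which is exactly the bit pattern of the tag.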
Deterministic routing algorithms can be improved by adding some adaptive features to them. A topology-adaptive routing algorithm is slightly adaptive: it works like a basic deterministic algorithm, but one feature makes it suitable for dynamic networks, namely that a system administrator can update the routing tables of the routers if necessary. Such an algorithm is also known as online oblivious routing. The cost and latency of a topology-adaptive routing algorithm are close to those of basic deterministic algorithms, while its topology adaptiveness makes it suitable for irregular and dynamic networks.
Routing with stochastic routing algorithms is based on chance and on the assumption that every packet sooner or later reaches its destination. Stochas-
through the router in both forward and backward directions. The algorithm is
deadlock-free because packets only turn around once from a forward channel
to a backward channel.
FIGURE 3.14
Turnaround routing in a butterfly network.
uses the minimal odd-even routing, which reduces energy consumption and
also removes the possibility of livelock.
A hot-potato routing algorithm routes packets without temporarily storing them in the routers' buffer memory. Packets keep moving all the time, without stopping, until they reach their destination. When a packet arrives at a router, the router forwards it right away towards the packet's receiver; but if two packets are going in the same direction simultaneously, the router directs one of them to some other direction. That packet can then flow away from its destination, a situation called misrouting. In the worst case, packets can be misrouted far away from their destination, and misrouted packets can interfere with other packets. The risk of misrouting can be decreased by waiting a short random time before sending each packet. The manufacturing cost of hot-potato routing is quite low because the routers do not need any buffer memory to store packets during routing.
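One arbitration cycle of hot-potato (deflection) routing can be sketched as follows (the port names and the greedy first-come tie-break are illustrative assumptions; it presumes no more packets than output ports):

```python
def deflect(requests, ports):
    """Hot-potato arbitration for one cycle: every incoming packet must
    leave immediately, so losers of a port conflict are misrouted
    (deflected) to any remaining free output instead of being buffered.
    requests is a list of (packet, preferred_port) pairs."""
    assignment = {}
    free = set(ports)
    for pkt, wanted in requests:       # first pass: grant preferred ports
        if wanted in free:
            assignment[pkt] = wanted
            free.discard(wanted)
    for pkt, _ in requests:            # second pass: deflect the losers
        if pkt not in assignment:
            assignment[pkt] = free.pop()
    return assignment
```

For example, deflect([("p1", "N"), ("p2", "N")], ["N", "S", "E", "W"]) grants p1 its preferred north port and misroutes p2 to one of the free ports: no packet is ever stored.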
3.4.3.1 Message-Based
Circuit switching is a flow-control technique that operates at the message level, the coarsest granularity; the following subsections refine flow control to progressively finer granularities. Circuit switching preallocates resources (links) across multiple hops for the entire message.
A small setup message is sent into the network and reserves the links
needed to transmit the entire message from the source to the destination.
Once the setup message reaches the destination and has successfully allocated
the necessary links, an acknowledgment message will be transmitted back to
the source. When the source receives the acknowledgment message, it releases the message, which can then travel quickly through the network. Once
the message has completed its traversal, the resources are deallocated. After
the setup phase, per-hop latency to acquire resources is avoided.
With sufficiently large messages, this latency reduction can amortize the
cost of the original setup phase. In addition to possible latency benefits, circuit
switching is also bufferless. As links are pre-reserved, buffers are not needed at
each hop to hold packets that are waiting for allocation, thus saving on power.
While latency can be reduced, circuit switching suffers from poor bandwidth
utilization. The links are idle between setup and the actual message transfer
and other messages seeking to use those resources are blocked.
3.4.3.2 Packet-Based
Packet-based flow control techniques first break down messages into packets,
then interleave these packets on the links, thus improving link utilization.
Unlike message-based flow control, the remaining techniques require per-node buffering to store in-flight packets. There are two main choices for how packets are forwarded and stored: store-and-forward and cut-through.
The basic mode of packet transport is store-and-forward, where a packet is received at a router in its entirety before being forwarded to the output. The store-and-forward method thus waits for the whole packet before making routing decisions: the node stores the complete packet and forwards it based on the information in its header. The packet may stall if the router does not have sufficient buffer space. The drawback is that store-and-forward is fairly inefficient for smaller, dedicated networks, as both latency and the required buffer memory size are unnecessarily high.
Cut-through forwards the packet as soon as the header information is available. It works like wormhole switching, except that before forwarding a packet the node waits for a guarantee that the next node in the path will accept the entire packet. The main forwarding technique used in NoCs is wormhole, because of its low latency and small realization area, as only minimal buffering is required. Most often, connectionless routing is employed for best-effort traffic, while connection-oriented routing is preferable for the guaranteed throughput needed when applications have QoS requirements. Once the destination node is known, static or dynamic techniques can be used to determine to which of the switch's output ports the message should be forwarded.
Both the store-and-forward and cut-through methods need buffering capacity for at least one full packet. Wormhole switching, the most popular technique and the best suited to on-chip use, splits packets into several flits (flow control digits). Routing is done as soon as possible, similarly to cut-through, but the buffer space can be smaller (as small as one flit). As a result, a packet may be spread over many consecutive routers and links, like a worm.
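The latency difference between the forwarding modes can be made concrete with a zero-load model (the unit flit time and per-hop routing delay are illustrative assumptions; contention is ignored):

```python
def store_and_forward_latency(hops, flits, t_flit=1, t_route=1):
    """Store-and-forward: each router first receives the whole packet
    (flits * t_flit), then spends t_route making its routing decision."""
    return hops * (t_route + flits * t_flit)

def wormhole_latency(hops, flits, t_flit=1, t_route=1):
    """Wormhole/cut-through at zero load: only the header flit pays the
    per-hop delay; body flits pipeline behind it one flit time apart."""
    return hops * (t_route + t_flit) + (flits - 1) * t_flit
```

For a packet of 8 flits crossing 4 hops, store-and-forward costs 4 x (1 + 8) = 36 cycles, while wormhole pipelines the body behind the header for 4 x 2 + 7 = 15 cycles; at zero load, cut-through has the same timing as wormhole and differs only in its buffering requirement.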
3.4.3.3 Flit-Based
To reduce the buffering requirements of packet-based techniques, flit-based
flow control mechanisms exist. Low buffering requirements help routers meet
tight area or power constraints on-chip.
In wormhole flow control, the node looks at the header of the packet (stored in the first flit) to determine its next hop and immediately forwards it. The subsequent flits are forwarded to the same output as they arrive. As whole packets are never buffered, wormhole routing attains a minimal packet latency. The main drawback is that a stalling packet can occupy all the links its worm spans; this increased resource occupancy can aggravate deadlock problems in the network.
3.5.3 Link
The link represents the realization through wires of the physical connection
between two nodes in a NoC. The transportation of data packets among var-
ious nodes in a NoC can be performed by using either a serial or a parallel
link. Parallel links make use of a buffer-based architecture and can be op-
erated at a relatively lower clock rate in order to reduce power dissipation.
Unfortunately, parallel links usually incur a high silicon cost due to interwire
spacing, shielding, and repeaters. This can be minimized up to a certain limit
by employing multiple metal layers.
Serial links allow savings in wire area, reduction in signal interference and
noise, and further eliminate the need for having buffers. However, serial links
would need serializer and deserializer circuits to convert the data into the right
format to be transported over the link and back to the cores. They offer the
advantages of a simpler layout and simpler timing verification, but sometimes
suffer from ISI (Intersymbol Interference) between successive signals while
operating at high clock rates. Nevertheless, such drawbacks can be addressed
by encoding and with asynchronous communication protocols.
FIGURE 3.15
Cell Broadband Engine Hardware Architecture.
The PPE is a multithreaded core with two levels of on-chip cache; however,
the main computing power of the Cell processor is provided by the eight SPEs.
The SPE is a compute-intensive coprocessor designed to accelerate me-
dia and streaming workloads. Each SPE consists of a synergistic processor
unit (SPU) and a memory flow controller (MFC). The MFC includes a DMA
controller, a memory management unit (MMU), a bus interface unit, and an
atomic unit for synchronization with other SPUs and the PPE.
Efficient SPE software should heavily optimize memory usage, since the
SPEs operate on a limited on-chip memory (only 256 KB local store) that
stores both instructions and data required by the program. The local memory
of the SPEs is not coherent with the PPE main memory, and data transfers
to and from the SPE local memories must be explicitly managed by using
asynchronous coherent DMA commands. Both PPE and SPEs can execute
vector operations.
FIGURE 3.16
Element Interconnect Bus Architecture.
local or main memory or I/O. However, the bus access semantics and the ring
topology can lead to a worst-case throughput of 50% with adversarial traffic
patterns.
Access to the rings (i.e., resource allocation) is controlled by the ring arbiter
through a priority policy. The highest priority is given to the memory controller,
so that requestors will not be stalled on read data. The other elements on the
EIB have equal priority and are served in a round-robin manner.
Routing on the EIB consists of choosing between left and right on
each of the four unidirectional rings. The ring arbiter prevents the allocation
of transfers that would travel more than halfway around the ring, i.e., only the
shortest-path routes are permitted. As the Cell interfaces with the EIB through
DMA bus transactions, the unit of communication is large DMA transfers in
bursts, with the flow control semantics of buses rather than of packetized networks.
As the IBM Cell essentially mimics a bus using the four rings, it does not
have switches or routers at intermediate nodes. Instead, bus interface units
(BIUs) arbitrate for dedicated usage of segments of the ring and, once granted,
inject into the ring.
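The shortest-path rule can be sketched as follows (a hypothetical model of the arbiter's check, not IBM's implementation; `NODES = 12` reflects the twelve elements attached to the EIB, and the function names are illustrative):

```python
NODES = 12  # elements attached to the EIB rings

def hop_distance(src, dst, direction):
    # Hops travelled from src to dst on one unidirectional ring.
    if direction == "right":
        return (dst - src) % NODES
    return (src - dst) % NODES

def arbiter_allows(src, dst, direction):
    # The ring arbiter rejects transfers that would travel
    # more than halfway around the ring.
    return hop_distance(src, dst, direction) <= NODES // 2

print(arbiter_allows(0, 4, "right"))  # True: 4 hops
print(arbiter_allows(0, 4, "left"))   # False: 8 hops the long way around
```

With this rule, any element pair is reachable within half a ring traversal, which bounds the latency and leaves ring segments free for concurrent transfers.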
across several core tiles whose functional units are connected through a NoC.
The first prototype (realized in 1997) had 16 individual tiles, running at a
clock speed of 225 MHz.
Each tile contains a general-purpose processor, which is connected to its
neighbors by both the static router and the dynamic router. The processor is
an eight-stage single-issue MIPS-style pipeline. It has a four-stage pipelined
FPU, a 32 KByte two-way associative SRAM data cache, and 32 KBytes of
instruction SRAM. The tiles are connected through four 32-bit NoCs with
a length of wires that is no greater than the width of a tile, allowing high
clock frequencies. Two of the networks are static and managed by a single
static router (which is optimized for low latency), while the remaining two
are dynamic. The networks are integrated directly into the pipeline of the
processors, enabling an ALU-to-network latency of 4 clock cycles (for an 8-stage
pipeline tile).
The static router is a 5-stage pipeline that controls 2 routing crossbars and
thus 2 physical networks. Routing is performed by programming the static
routers on a per-clock basis. These instructions are generated by the compiler,
and since the traffic pattern is extracted from the application at compile time,
router preparation can be pipelined, allowing data words to be forwarded
towards the correct port upon arrival.
The dynamic network is based on packet switching and uses the wormhole
routing protocol. The packet header contains the destination tile, a user
field, and the length of the message. Two dynamic networks are implemented
to handle deadlocks. The memory network has a restricted usage model that
uses deadlock avoidance. The general network usage is instead unrestricted
and when deadlocks happen the memory network is used to restore the cor-
rect functionality.
The success of the RAW architecture is demonstrated by the Tilera com-
pany, founded by former MIT members, which distributes CPUs based on the
RAW design. The current top microprocessor is the Tile-Gx, which packs 100
tiles into a single chip at a clock speed of 1.5 GHz [39].
FIGURE 3.17
TILE64 Processors Family Architecture.
the Gx series chips, there is some FP hardware to catch the odd instruction
without a huge speed hit. The Tilera Gx can host up to 100 cores and can
provide 50 GigaFLOPS of FP (see Figure 3.18). The TILE64 cores implement
a proprietary 32-bit ISA, which is extended to 64 bits in the Gx.
FIGURE 3.18
TILE-Gx Processors Family Architecture.
3.6.4.1 iMesh
Tilera’s processors are based on mesh networks, collectively called iMesh. The
iMesh consists of five meshes of up to 8×8 (10×10 in the Gx). In the TILE64
generation of Tilera chips, all of these networks were 32 bits wide, but on the
Gx the widths vary to give each one more or less bandwidth depending on its
function.
Traffic is statically distributed among the five meshes: each mesh handles a
different type, namely user-level messaging traffic (UDN), I/O traffic (IDN),
memory traffic (MDN), intertile traffic (TDN), and compiler-scheduled traffic
(STN). The chip frequency is 1 GHz, and iMesh can provide a bisection
bandwidth of 320 GB/s.
The user-level messaging is supported by UDN (user dynamic network):
threads can communicate through message passing in addition to the cache
coherent shared memory. Upon message arrivals, user-level interrupts are is-
sued for fast notification. Message queues can be virtualized onto off-chip
DRAM in case of buffer overflows in the network interface. IDN is instead
I/O Dynamic Network, and passes data on and off the chip. The MDN (mem-
ory dynamic network) and TDN (tile dynamic network) connect the caches
and memory controllers, with intertile cache transfers going through the TDN
and responses going through the MDN. The usage of two separate physical
networks thus provides system-level deadlock freedom.
The four dynamic networks (UDN, IDN, MDN, and TDN) use the
dimension-ordered routing algorithm, with the destination address encoded
as X-Y coordinates in the header. The static network (STN) allows the routing
decision to be preset. This is achieved through circuit switching: a setup
packet first reserves a specific route, and the subsequent message then follows
this route to the destination.
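A minimal sketch of dimension-ordered (XY) routing as used by the four dynamic networks (the port labels and coordinate convention are assumptions for illustration):

```python
def xy_route(cur, dst):
    """One dimension-ordered routing step: exhaust the X offset
    first, then the Y offset, as in the iMesh dynamic networks."""
    cx, cy = cur
    dx, dy = dst
    if cx != dx:
        return "E" if dx > cx else "W"
    if cy != dy:
        return "N" if dy > cy else "S"
    return "LOCAL"

# Walk a packet from tile (0, 0) to tile (2, 1)
cur, path = (0, 0), []
while True:
    port = xy_route(cur, (2, 1))
    if port == "LOCAL":
        break
    path.append(port)
    cur = {"E": (cur[0] + 1, cur[1]), "W": (cur[0] - 1, cur[1]),
           "N": (cur[0], cur[1] + 1), "S": (cur[0], cur[1] - 1)}[port]
print(path)  # ['E', 'E', 'N']
```

Because every packet turns at most once (from X to Y), dimension-ordered routing forbids the turn cycles that cause routing deadlock within a single mesh.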
The iMesh’s four dynamic networks use simple wormhole flow control without
virtual channels to lower the complexity of the routers, compensating for the
lower bandwidth of wormhole flow control by spreading traffic over multiple
networks. Credit-based flow control is used. The static network uses circuit
switching to enable the software to preset arbitrary routes while enabling fast
delivery for the subsequent data transfer; the setup delay is amortized over
long messages.
The iMesh’s wormhole networks have a single-stage router pipeline during
straight portions of the route, and an additional route calculation stage when
turning. Only a single buffer queue is needed at each of the 5 router ports,
since no virtual channels are used. Only 3 flit buffers are used per port, just
sufficient to cover the buffer turnaround time. This emphasis on simple routers
results in a low-area overhead of just 5.5% of the tile footprint.
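Credit-based flow control with the 3-flit buffers described above can be sketched as follows (a simplified single-link model; the class and method names are hypothetical):

```python
class CreditLink:
    """Credit-based flow control between two router ports.
    The sender holds one credit per free downstream flit buffer."""
    def __init__(self, buffers=3):            # 3 flit buffers, as in iMesh
        self.credits = buffers
        self.downstream = []                  # flits queued at the receiver

    def send(self, flit):
        if self.credits == 0:
            return False                      # stall: no buffer guaranteed free
        self.credits -= 1
        self.downstream.append(flit)
        return True

    def receiver_drains(self):
        # When the receiver forwards a flit, it frees a buffer and
        # returns a credit to the sender.
        self.downstream.pop(0)
        self.credits += 1

link = CreditLink()
print([link.send(f) for f in "abcd"])  # [True, True, True, False]
link.receiver_drains()
print(link.send("d"))                  # True: a credit came back
```

The buffer depth must cover the round-trip time of the credit signal; three flits suffice here precisely because the text's routers keep that turnaround short.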
tiled core clusters with high-speed I/Os on the periphery. Each core has a
private 256 KB L2 cache (12 MB total on-die) and is optimized to support
a message-passing programming model whereby cores communicate through
shared memory. A 16 KB message-passing buffer (MPB) is present in every
tile, giving a total of 384 KB on-die shared memory, for increased performance.
Memory accesses are distributed over four on-die DDR3 controllers for an
aggregate peak memory bandwidth of 21 GB/s. The die area is 567 mm2 ,
implemented in 45 nm.
FIGURE 3.19
The Single-Chip Cloud Computer Architecture.
The design is organized in a 6×4 2D-array of tiles (see Figure 3.19). Each
tile is a cluster of two enhanced IA-32 cores sharing a router for intertile com-
munication. A new message-passing memory type (MPMT) is introduced as
an architectural enhancement to optimize data sharing. A single bit in a core’s
TLB designates MPMT cache lines. The MPMT retains all the performance
benefits of a conventional cache line, but distinguishes itself by addressing
noncoherent shared memory. All MPMT cache lines are invalidated before
reads/writes to the shared memory to prevent a core from working on stale
data.
The 5-port virtual cut-through router used to create the 2D-mesh net-
work employs a credit-based flow-control protocol. Router ports are packet-
switched, have 16-byte data links, and can operate at 2 GHz at 1.1 V. Each in-
put port has five 24-entry queues, a route precomputation unit, and a virtual-
channel allocator. Route precomputation for the outport of the next router is
done on queued packets.
An XY dimension-ordered routing algorithm is strictly followed. Deadlock-free
routing is maintained by allocating 8 virtual channels between 2 message
classes for all outgoing packets.
Input port and output port arbitrations are done concurrently using a
wrapped wavefront arbiter. Crossbar switch allocation is done in a single clock
cycle at packet granularity. No-load router latency is 4 clock cycles, includ-
FIGURE 3.20
The ST Microelectronics STNoC Spidergon.
3.6.7 Xpipes
Xpipes was developed by the University of Bologna and Stanford University
[8]. Xpipes consists of a library of soft macros of switches and links that can
be turned into instance-specific network components at instantiation time.
Xpipes library components are fully synthesizable and can be parameterized
in many respects, such as buffer depth, data width, arbitration policies, etc.
Components can be assembled together allowing users to explore several NoC
designs (e.g., different topologies) to better fit the specific application needs.
The Xpipes NoC library also provides a set of link design methodologies
and flow control mechanisms to tolerate any wiring parasitics, as well as net-
work interfaces that can be directly plugged to existing IP cores, thanks to the
usage of the standard OCP interface. It promotes the idea of pipelined links
with a flexible number of stages to increase throughput. Particular attention is
given to reliability, as distributed error detection techniques are implemented
at the link level.
Xpipes is fully synchronous; however, facilities to support multiple frequencies
are provided in the network interfaces, although only through integer frequency
dividers. Routing is static and determined in the network interfaces
(source routing). Xpipes adopts wormhole switching as the only method to
deliver packets to their destinations. Xpipes supports input and/or output
buffering, depending on circumstances and designer choices. In fact, since
Xpipes supports multiple flow controls, the choice of the flow control protocol
is intertwined with the selection of a buffering strategy. Xpipes does not leverage
virtual channels; however, parallel links can be deployed between any two
switches to fully resolve bandwidth issues. Deadlock avoidance is delegated
to the topology design phase.
One of the main advantages of Xpipes over other NoC libraries is the pro-
vided tool set. The XpipesCompiler is a tool to automatically instantiate an
application-specific custom communication infrastructure using Xpipes com-
ponents. It can tune flit size, degree of redundancy of the CRC error detection,
address space of cores, number of bits used for packet sequence count, max-
imum number of hops between any two network nodes, number of flit sizes,
etc. In a top-down design methodology, once the SoC floorplan is decided, the
required network architecture is fed into the XpipesCompiler. The output of
the XpipesCompiler is a SystemC description that can be fed to a back-end
RTL synthesis tool for silicon implementation.
3.6.8 Aethereal
Aethereal is a NoC developed by Philips that aims at achieving composability
and predictability in system design [19]. It also targets the elimination of
uncertainties in interconnects by providing guaranteed throughput and latency
services.
The Aethereal NoC has an instance of a 6-port router with an area of
0.175 mm2 after layout, and a network interface with four IP ports having a
synthesized area of 0.172 mm2 . All the queues are 32 bits wide and 8 words
deep. With regard to buffering, input queuing is implemented using custom-made
hardware FIFOs to keep the area costs down. Both the router and the
network interface are implemented in 0.13 μm technology and run at 500
MHz. The network interface is able to deliver a bandwidth of 16 Gbit/s
to all the routers in the respective directions.
Aethereal is a topology-independent NoC and mainly consists of two
components: the network interface and the router, with multiple links between
them. The Aethereal router provides best-effort (BE) and guaranteed-throughput
(GT) service levels. Aethereal uses wormhole routing with input
queuing to route the flits, and the router exploits source routing. The architecture
of the combined GT-BE router is depicted in Figure 3.21.
FIGURE 3.21
Aethereal Router Architecture.
Aethereal uses virtual channels and shares the channels among different
connections by using time-division multiplexing. At the beginning of routing,
the whole routing path is stored in the header of the packet's first
flit. When the flits arrive at a router, a header parsing unit extracts the next
hop from the header of the first flit, moves the flits to a GT or BE FIFO, and
notifies the controller that a packet is present. The controller schedules flits for
the next cycle. After the GT-flits have been scheduled, the remaining destination
ports can serve the BE-flits.
A time-division multiplexed circuit-switching approach with contention-free
routing has been employed for guaranteed throughput. All routers in
the network have a common sense of time, and forward traffic based on slot
allocation. Thus, a sequence of slots implements a virtual circuit. The allocation
of slots can be set up statically, during an initialization phase, or dynamically,
at runtime. Best-effort traffic makes use of non-reserved slots and of any slots
reserved but not used. Best-effort packets are used to program the
guaranteed-throughput slots of the routers.
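Contention-free slot allocation can be sketched as follows (a toy model, not the Aethereal implementation; the table size, router names, and function name are illustrative). A connection traversing routers R0, R1, ... needs slot s at R0, slot s+1 at R1, and so on, so that an injected flit always meets a reserved slot:

```python
SLOTS = 8  # length of the common slot table (an assumed value)

def reserve(tables, path, start_slot):
    """Reserve slot start_slot + i at the i-th router of the path.
    All routers share a common notion of time, so a flit injected
    in start_slot finds a reserved slot at every hop."""
    wanted = [(r, (start_slot + i) % SLOTS) for i, r in enumerate(path)]
    if any(tables[r][s] is not None for r, s in wanted):
        return False                      # collision with an existing circuit
    for r, s in wanted:
        tables[r][s] = tuple(path)        # mark the slot as owned
    return True

tables = {r: [None] * SLOTS for r in "ABCD"}
print(reserve(tables, "ABC", 0))  # True: slots A0, B1, C2 reserved
print(reserve(tables, "DBC", 0))  # False: slot 1 at router B is taken
print(reserve(tables, "DBC", 1))  # True: shifting by one slot avoids the clash
```

Because reservations never collide, guaranteed-throughput flits need no arbitration at all in the data path; contention is resolved entirely at allocation time.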
The Aethereal implements the network interface in two parts: the kernel
and the shell. The kernel communicates with the shell via ports.
3.6.9 SPIN
The Scalable Programmable Integrated Network-on-chip (SPIN) is a packet-
switching on-chip micronetwork, which is based on a fat-tree topology [1]. It
is composed of two types of components: initiators and targets. The initiator
components are traffic generators, which send requests to the target compo-
nents. The target component sends a response as soon as it receives a request.
The system can have different numbers of cores for each type, and all the
components composing the system are designed to be VCI (Virtual Socket
Interface) compliant.
SPIN uses wormhole switching, adaptive routing, and credit-based flow
control. Packet routing is realized as follows. First, a packet flows up
the tree along any one of the available paths. When the packet reaches a
router that is a common ancestor of the source and destination terminals, the
packet is turned around and routed down to its destination along the only
possible path.
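The turnaround scheme can be sketched for a regular fat tree in which terminal addresses are base-`arity` numbers (a simplified model; the arity, the level arithmetic, and the function names are assumptions, and the free choice of upward path is abstracted away):

```python
def turnaround_level(src, dst, arity=4):
    # Lowest tree level at which src and dst fall in the same subtree.
    # Going up, any parent may be chosen; only the level count matters.
    level = 1
    while src // arity ** level != dst // arity ** level:
        level += 1
    return level

def down_path(dst, level, arity=4):
    # Below the common ancestor the route is unique: at each level the
    # output port is the corresponding digit of dst's address.
    return [dst // arity ** l % arity for l in reversed(range(level))]

# Terminals 3 and 9 in a 2-level, arity-4 fat tree (16 terminals)
lvl = turnaround_level(3, 9)
print(lvl)                 # 2: they only share the top level
print(down_path(9, lvl))   # [2, 1]: enter subtree 2, then terminal offset 1
```

The freedom in the upward phase is what makes the routing adaptive: congested upward ports can be avoided without affecting correctness, since every ancestor at the turnaround level leads to the same unique downward path.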
Links are bidirectional and full-duplex, with two unidirectional channels. Each
channel is 36 bits wide, with 32 data bits and 4 tag bits used for packet
framing, parity, and error signaling. Additionally, there are two flow control
signals used to regulate the traffic on the channel.
SPIN’s packets are defined as sequences of data words of 32 bits, with the
header fitting in the first word. An 8-bit field in the header is used to identify
the destination terminal, allowing the network to scale up to 256 terminals.
The payload has unlimited length, delimited by two framing bits (Begin
Packet / End of Packet). The input buffers have a depth of 4 words, which
results in cheaper routers.
The basic building block of the SPIN network is the RSPIN router, which
includes eight ports having a pair of input and output channels compliant
with the SPIN link.
FIGURE 3.22
RSPIN Router Architecture used in SPIN Systems.
3.6.10 MANGO
MANGO (Message-Passing Asynchronous Network-on-Chip providing Guar-
anteed services through OCP interfaces) is a clockless Network-on-Chip sys-
tem [9, 11, 10]. It uses wormhole network flow control with virtual channels
and provides both guaranteed-throughput and best-effort routing. Because the
network is clockless, time-division multiplexing cannot be used to share the
virtual channels. Therefore, some virtual channels are dedicated to best-effort
traffic and others to guaranteed-throughput traffic. The benefits of a clockless
system are maximum possible speed and zero idle power.
The MANGO router architecture (depicted in Figure 3.23) consists of sep-
arated guaranteed throughput and best-effort router elements, input and out-
put ports connected to neighboring routers, and local ports connected to the
local IP core through network adapters that synchronize the clockless network
FIGURE 3.23
MANGO Router Architecture.
and clocked IP core. The output port elements include output buffers and link
arbiters.
The BE router routes packets using basic source routing, where the routing
path is stored in the header of the packet. The paths are shaped as in XY
routing. The GT connections are designed for data streams, and the routing
acts like a circuit-switched network. At the start of GT routing, the GT
connection is set up by programming it into the GT router via the BE router.
3.6.11 Proteo
The Proteo network consists of several sub-networks that are connected to
each other with bridges [35]. The main subnetwork in the middle of the system
is a ring but the topologies of the other subnetworks can be selected freely.
The layered structure of the Proteo router is depicted in Figure 3.24. Each
layer has one input and one output port, so a router with a single layer is
unidirectional and suits only subnetworks with a simple ring topology. In more
complex networks, several layers have to be connected together.
The Proteo system has two different kinds of routers: initiators and targets.
The initiator routers can generate requests to the target routers, while targets
can only respond to these requests. The only difference between initiator and
target routers is the structure of the interface. The task of the interface is to
create and extract packets.
Routing in the Proteo system is destination-tag routing, where the destination
address of the packet is stored in the packet's header. When a packet
arrives at the input port, the greeting block detects the packet's destination
address and compares it to the address of the local core. If the addresses are
equal, the greeting block writes the packet to the input FIFO through the
overflow checker; otherwise, the packet is written to the bypass FIFO. Finally,
the distributor block sends packets forward from the output and bypass FIFOs.
FIGURE 3.24
Proteo Router Architecture.
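The greeting-block decision can be sketched as follows (a behavioral sketch only; the class and field names are hypothetical, and the overflow checker is omitted):

```python
class GreetingBlock:
    """Sketch of the Proteo input-stage decision: packets addressed
    to the local core go to the input FIFO, all others bypass."""
    def __init__(self, local_addr):
        self.local_addr = local_addr
        self.input_fifo = []   # delivered to the local core
        self.bypass_fifo = []  # forwarded around the ring

    def receive(self, packet):
        # The destination address is read from the packet header and
        # compared against the local core's address.
        if packet["dst"] == self.local_addr:
            self.input_fifo.append(packet)   # overflow check omitted here
        else:
            self.bypass_fifo.append(packet)

g = GreetingBlock(local_addr=2)
g.receive({"dst": 2, "payload": "hello"})
g.receive({"dst": 5, "payload": "pass-through"})
print(len(g.input_fifo), len(g.bypass_fifo))  # 1 1
```

In the real router the distributor then interleaves the output and bypass FIFOs onto the outgoing ring link.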
3.6.12 XGFT
XGFT (eXtended Generalized Fat Tree) Network-on-Chip is a fault-tolerant
system that is able to locate the faults and reconfigure the routers so that
the packets can be routed correctly [27]. The network is a fat tree and the
wormhole network flow control is used. Besides the traditional wormhole mech-
anism, there is a variant called pipelined circuit switching. If the packet’s first
flit is blocked, it is routed one stage backwards and routed again along some
alternative path.
When there are no faults in the network, the packets are routed using
adaptive turn-around routing. However, when faults are detected, the routing
path is determined to be deterministic using source routing so that packets
are routed around faulty routers. To detect the faults, there has to be some
system that diagnoses the network.
3.6.13.1 Nostrum
The Nostrum NoC is the work of researchers at KTH in Stockholm, and the
implementation of guaranteed services has been its main focus. The
Nostrum network adopts a mesh-based approach, and guaranteed services are
provided by so-called looped containers. These are implemented by virtual
circuits, using an explicit time-division multiplexing mechanism called
Temporally Disjoint Networks (TDN).
Nostrum uses a deflective routing algorithm aimed at keeping its area
small and its power consumption low, thanks to the absence of internal buffer
queues.
More detailed information on Nostrum can be found in [30].
3.6.13.2 QNoC
The architecture of QNoC is based on a regular mesh topology. It makes use
of wormhole packet routing and packets are forwarded using the static X-Y
coordinate-based routing.
QNoC does not provide any support for error correction logic and all links
and data transfers are assumed to be reliable. Packets are forwarded based on
the number of credits remaining in the next router.
QNoC aims at providing different levels of quality of service to end
users. QNoC identifies four different service levels based on on-chip
communication requirements: Signaling, Real-Time, Read/Write (RD/WR),
and Block Transfer, with Signaling having the highest priority and Block
Transfer the lowest, in the order listed.
More detailed information on QNoC can be found in [12, 16, 15].
3.6.13.3 Chain
The CHAIN network (CHip Area INterconnect) has been developed at the
University of Manchester. CHAIN is implemented entirely using asynchronous,
or clockless, circuit techniques.
CHAIN is targeted for heterogeneous low-power systems in which the net-
work is system specific.
More detailed information on CHAIN can be found in [3, 4].
3.7 Bibliography
[1] A. Adriahantenaina, H. Charlery, A. Greiner, L. Mortiez, and C. Albenes
Zeferino. SPIN: A scalable, packet-switched, on-chip micro-network. In
DATE ’03: Proceedings of the Conference on Design, Automation and
3.8 Glossary
Flit: A flit is the smallest flow control unit handled by the network. The first
flit of a packet is the head flit and the last flit is the tail.
Federico Angiolini
iNoCs SaRL, 1007 Lausanne, Switzerland
Srinivasan Murali
iNoCs SaRL, 1007 Lausanne, Switzerland
CONTENTS
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
4.2 Architectures for QoS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.2.1 Traffic Class Segregation in Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
4.2.2 Traffic Class Segregation in Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
4.2.3 Other Methods for Traffic Class Segregation . . . . . . . . . . . . . . . . . . . . . 137
4.2.4 Fairness of Traffic Delivery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
4.2.5 Monitoring and Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
4.2.6 Memory Controllers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
4.2.7 Methods for QoS Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
4.2.7.1 Worst-Case Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
4.2.7.2 Average-Case Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
4.2.8 Synthesis Methods for Supporting QoS . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
4.2.9 Meeting Average-Case Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
4.2.9.1 Meeting Worst-Case QoS Constraints . . . . . . . . . . . . . . . . . . . 149
4.3 Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
4.4 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
4.1 Introduction
Networks-on-Chip (NoCs) are being envisioned for, and adopted in, extremely
complex SoCs featuring tens of processing elements, rich software stacks, and
many operating modes. Consequently, the on-chip traffic is highly varied in
nature and requirements. For example, consider the following:
• A processor incurring a cache miss needs to transfer a few bytes, but
as urgently as possible, or else it cannot resume execution.
• A Direct Memory Access (DMA) controller is programmed to transfer
large, contiguous chunks of data. Sometimes the transfers may be partic-
ularly urgent, sometimes they may be background activities. Whenever
the DMA starts accessing the network, it is going to flood it with traffic.
As soon as it is done, it may then remain idle for extended periods.
• An H.264 decoder block, when in operation, generates streams of data at
constant bandwidth. These streams may persist for hours, depending on
end-user demands. If the network introduces jitter or bandwidth drops,
the user experience is affected.
These examples show that the problem of achieving satisfactory system
performance is multifaceted. The NoC, with its finite set of resources, is subject
to competing demands from multiple cores trying to simultaneously access the
interconnect, and should allocate performance optimally.
In the NoC context, the Quality-of-Service (QoS) perceived by a given
traffic flow is defined as how well the NoC is fulfilling its needs for a certain
“service.” Usually, some “services” of the NoC are taken for granted, even
contrary to some assumptions for wide-area networks, such as reliable and
in-order delivery of messages. Other “services” are instead subject to resource
availability, for example the performance-related ones, such as the availability
of at least a certain average or peak bandwidth threshold, the delivery of mes-
sages within a deadline, or the amount of jitter (deviation in or displacement
of some aspect of the pulses in a high-frequency digital signal) in delivery
times. Yet other “services” are more functional in nature, for example the ca-
pability to broadcast messages or to ensure cache coherence. In this chapter,
the focus will be on performance-related QoS metrics, as they are one of the
most common NoC design issues.
To discuss the QoS offered by a Network-on-Chip, it is first of all necessary
to understand the requirements of the traffic flows traversing it. The traffic
flows can be clustered in a number of traffic classes. The number of classes
depends on the design. In the simplest classification [15], all traffic is split as
either Best-Effort (BE) or Guaranteed Service (GS). The latter repre-
sents all urgent communication, and must be prioritized; all remaining traffic
is delivered on a best-effort basis, i.e., when resources permit. The classifica-
tion of traffic can be much more detailed, with three [10], four [3] or more [24]
classes. Each class may have a completely different set of requirements, as in
the example in Table 4.1.
Quality-of-Service in NoCs 129
TABLE 4.1
Traffic classes
Traffic Type Bandwidth Latency Jitter
Control traffic Low Low Low
Cache refills Medium Low Tolerant
Cache prefetches High Tolerant Tolerant
Hard real-time video High Tolerant Low
Soft real-time video High Tolerant Tolerant
Audio and MPEG2 bitstreams Medium Tolerant Low
Graphics Tolerant Tolerant Tolerant
are handled as First In First Out queues (FIFOs). Whenever a packet does
not immediately find a way to leave a FIFO (e.g., due to congestion), all
packets enqueued behind it in the same FIFO are also unable to make any
progress. This queuing effect can propagate backwards to upstream switches
as more packets queue up, potentially stalling large parts of the NoC until
the first packet eventually frees its resources. The phenomenon is also known
as “saturation tree” since it resembles a tree, with the root in the congestion
point and branches propagating outwards. If a packet ever finds itself in such
a queue, it may be severely delayed, disrupting the QoS. Notably, best-effort,
low-priority packets are more likely to incur stalling and head-of-line blocking
due to the QoS mechanisms themselves. If a higher-priority packet ever finds
itself on the same route as lower-priority packets, it may be unable to pro-
ceed, despite its priority level, due to stalling ahead on its path, as it cannot
overtake the lower-priority packets ahead. In principle, high-priority packets
could then become unexpectedly stalled due to contention among low-priority
flows originated at the opposite side of the NoC. Figure 4.1 illustrates this
condition.
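Head-of-line blocking in a FIFO input queue can be reproduced in a few lines (a toy simulation; packets are represented only by the output port they request, and the function name is illustrative):

```python
from collections import deque

def simulate(fifo, congested_ports, cycles):
    """Single FIFO input queue: only the head packet may leave, so a
    blocked head stalls every packet behind it (head-of-line blocking)."""
    delivered = []
    q = deque(fifo)
    for _ in range(cycles):
        if q and q[0] not in congested_ports:
            delivered.append(q.popleft())
    return delivered

# Packet 1 wants the congested South port; Packet 2 wants the free East port.
print(simulate(["South", "East"], congested_ports={"South"}, cycles=10))
# []  -- the East-bound packet never moves, although its own route is free
```

Swapping the arrival order (`["East", "South"]`) delivers the East-bound packet immediately, which is exactly why per-destination queues or virtual channels, discussed below, remove this artifact.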
FIGURE 4.1
Head-of-line blocking. Switch C is congested, while Switch B is free. Packet 1
enters Switch A from West, requests to exit from South, and remains blocked
along the way. Packet 2 enters the same port of Switch A and tries to go East.
Although Switch B is completely free, Packet 2 cannot make progress as it
must wait for Packet 1 to move ahead first. Congestion can build up further
behind Packet 2, leading to a “saturation tree,” until Packet 1 finally resumes.
FIGURE 4.2
Allocation fairness. Cars attempting to leave a parking lot through a single
exit. If fair local arbitration occurs at each intersection, the cars in the leftmost
rows must wait much longer than the cars in the rightmost rows to exit.
FIGURE 4.3
Virtual channels. Virtual channels implementations in a m × n NoC router
(m = 3, n = 2). v = 2 virtual channels are shown. Variant (a): full crossbar
implementation for maximum efficiency; variant (b): hierarchical implemen-
tation for smaller area and higher frequency.
ple, among all virtual channel buffers of all input ports for all virtual channel
buffers of all output ports; as this impacts the area and latency of the switch
arbiter, hierarchical arbitration has been proposed. In particular, the switch
arbitration will take into account the QoS requirements, e.g., by prioritizing
some virtual channels (traffic classes) [3, 10]. Once the packet has been arbi-
trated towards an output port, it is enqueued into one of the virtual channel
buffers of that output port. The buffers of the output port will take turns in
sending traffic on the link, in multiplexed fashion, again paying attention to
the QoS policy.
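A minimal sketch of such QoS-aware arbitration, with a strict-priority choice between traffic classes and round-robin inside a class; the function and state names are invented here and do not come from [3, 10].

```python
# Sketch of a priority-based virtual-channel (VC) arbiter: the highest-
# priority class with a ready flit wins the output; within that class,
# a per-class round-robin pointer prevents any single VC from starving peers.

def arbitrate(vcs, rr_state):
    """vcs: list of (priority, vc_id, has_flit). Returns winning vc_id or None."""
    ready = [(prio, vc) for prio, vc, has_flit in vcs if has_flit]
    if not ready:
        return None
    top = max(p for p, _ in ready)
    candidates = sorted(vc for p, vc in ready if p == top)
    start = rr_state.get(top, 0) % len(candidates)  # rotate within the class
    rr_state[top] = start + 1
    return candidates[start]

rr = {}
vcs = [(1, "vc0", True), (2, "vc1", True), (2, "vc2", True)]
print(arbitrate(vcs, rr))  # vc1 -- priority-2 class wins, round-robin slot 0
print(arbitrate(vcs, rr))  # vc2 -- round-robin pointer advanced
```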
The choice of using VCs has advantages and disadvantages compared to
the full decoupling ensured by multiple physical links [14]. In large-area net-
works, where cabling is very expensive compared to on-router buffers, a key
benefit of virtual channels is the multiplexing of traffic streams onto a single
cable. However, this is a less compelling argument in NoCs; on-chip wires are
cheap, while a tighter constraint is the power consumption of the datapath,
which VCs do not significantly modify. Instead, virtual channels can provide
advantages over physical channels in terms of flexibility and reconfigurabil-
ity at runtime. Virtual channels however provide less total bandwidth than
a solution with the same number of physical channels, since the channels are
multiplexed. Further, if the links are to be pipelined due to timing constraints,
the pipeline stages must also be made VC-aware (i.e., have multiple buffers in
parallel), with the corresponding overhead; otherwise the architecture will no
longer be able to offer traffic class separation.
Whether using physical or virtual channels, additional pitfalls may be en-
countered. A constraint to keep in mind is that the NoC can only guarantee
the requested Quality-of-Service if the cores attached to it are designed to
guarantee a suitable lower bound on ejection rates. Buffering in the Network
134 Communication Architectures for SoC
Interfaces can alleviate this concern. Moreover, the spatial separation of traf-
fic classes requires a preliminary step to concretely decide how many channels
are needed, and how to allocate flows and classes to channels. Automatic tools
have been designed to tackle this challenge, as will be seen in Section 4.2.6.
FIGURE 4.4
Virtual circuit. Virtual circuit set up among a Master and a Slave, across
four switches. During a set-up phase, the shaded buffers are reserved for the
circuit. Hence, during operation, Flow B and Flow C cannot make progress.
When communication along the circuit is over, the circuit is torn down and
the other flows can resume.
FIGURE 4.5
Multimedia benchmark used for analysis.
create it and the time needed for the current packet to move to the input
buffer of the first switch, given by:
FIGURE 4.6
A five-switch NoC topology for the benchmark.
FIGURE 4.7
Worst-case latency values for the flows of the benchmark.
FIGURE 4.8
Minimum interval between packets for the flows of the benchmark.
FIGURE 4.9
Guaranteed bandwidth for the flows of the benchmark.
FIGURE 4.10
NoC architecture synthesis steps.
varied in a set of suitable values. The bandwidth available on each NoC link
is the product of the NoC frequency and the link width. During the topology
synthesis, the algorithm ensures that the traffic on each link is less than or
equal to its available bandwidth value.
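The capacity rule just stated can be written down directly; the frequency and link width below are illustrative numbers, not the benchmark's.

```python
# Link capacity check used during topology synthesis: the bandwidth
# available on a link is NoC frequency x link width, and the aggregate
# traffic routed over the link must not exceed it.

def link_bandwidth(freq_hz, width_bits):
    """Available link bandwidth in bits per second."""
    return freq_hz * width_bits

def fits(link_traffic_bps, freq_hz, width_bits):
    """True if the routed traffic respects the link's capacity."""
    return link_traffic_bps <= link_bandwidth(freq_hz, width_bits)

# e.g., a 500 MHz NoC with 32-bit links offers 16 Gbit/s per link
print(link_bandwidth(500e6, 32))   # 16000000000.0
print(fits(12e9, 500e6, 32))       # True
print(fits(20e9, 500e6, 32))       # False
```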
The synthesis step is performed once for each set of architectural parameters.
In this step, several topologies with different numbers of switches are
explored, starting from a topology where all the cores are connected to one
switch, up to one where each core is connected to a separate switch. The
synthesis of each topology includes finding the size of the switches,
establishing the connectivity among the switches and with the cores, and
finding deadlock-free routes for the different traffic flows.
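The exploration described above can be sketched as a loop over the design points; `synthesize` and `cost` are hypothetical stand-ins for the actual topology synthesis and evaluation engines.

```python
# Sketch of the design-space exploration: for each frequency/width point,
# sweep the switch count from 1 (all cores on one switch) to n (one core
# per switch), keep the cheapest feasible design.

def explore(cores, freqs, widths, synthesize, cost):
    """synthesize() returns a topology object or None if infeasible."""
    best = None
    for freq in freqs:
        for width in widths:
            for n_switches in range(1, len(cores) + 1):
                topo = synthesize(cores, n_switches, freq, width)
                if topo is None:          # bandwidth/latency constraints violated
                    continue
                if best is None or cost(topo) < cost(best):
                    best = topo
    return best
```

The output of the real flow is richer (a set of Pareto-optimal topologies rather than a single winner), but the nesting of the sweep follows the steps described in the text.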
In the next step, to have an accurate estimate of the design area and
wire lengths, the floorplanning of each synthesized topology is automatically
performed. The floorplanning process finds the 2D position of the cores and
network components used in the design. Based on the frequency point and
the obtained wire lengths, the timing violations on the wires are detected and
the power consumption on the links is obtained. In the last step, from the set
of all synthesized topologies and architectural parameter design points, the
topology and architectural configuration that best optimizes the user's
objectives, while satisfying all the design constraints, is chosen. Thus, the
output is a set of application-specific NoC topologies that meet the input
constraints.
The process of meeting the QoS constraints is performed during the synthe-
sis step. For a particular switch count, the cores are assigned to the different
switches, such that cores that have high bandwidth and low-latency traffic
between them are mapped onto the same switch. When computing the paths
for a particular traffic flow, all available paths from the source to the destina-
tion that support the bandwidth requirement of the flow are checked and the
least cost path is chosen. At the beginning, the cost of a path is computed
only based on the power consumption of the traffic flows on that path. If no
existing path can support the bandwidth, then new physical links are opened
between one or more switches to route the flow. Once a path is found, it is
checked to see whether the latency constraints are met. If not, then the cost
of a path is gradually changed to the length of the path, rather than just
the power consumption of the flows. This helps in achieving the zero-load la-
tency constraint of the flow. The process is repeated for all the flows of the
application.
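A minimal sketch of this path-selection policy, assuming a Dijkstra-style search whose edge cost is gradually re-weighted from power toward path length; every name and the blending parameter `alpha` are our invention, not the book's algorithm verbatim.

```python
import heapq

def least_cost_path(adj, src, dst, bw_need, alpha):
    """adj[u] = [(v, power_cost, spare_bw)]. Only links with enough spare
    bandwidth are eligible. alpha = 0 -> pure power cost; alpha = 1 -> hops."""
    pq, seen = [(0.0, src, [src])], set()
    while pq:
        cost, u, path = heapq.heappop(pq)
        if u == dst:
            return path
        if u in seen:
            continue
        seen.add(u)
        for v, power, spare in adj[u]:
            if spare >= bw_need and v not in seen:
                edge = (1 - alpha) * power + alpha * 1.0
                heapq.heappush(pq, (cost + edge, v, path + [v]))
    return None

def map_flow(adj, src, dst, bw_need, max_hops):
    """Retry with the cost shifted toward path length until the zero-load
    latency bound (modeled here as a hop budget) is met."""
    alpha = 0.0
    while alpha <= 1.0:
        path = least_cost_path(adj, src, dst, bw_need, alpha)
        if path is not None and len(path) - 1 <= max_hops:
            return path
        alpha += 0.25
    return None  # the real flow would open new physical links here
```

With `alpha = 0` the cheapest-power path wins; when it is too long to meet the latency bound, successive retries bias the cost toward shorter paths, mirroring the gradual cost change described above.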
FIGURE 4.11
Worst-case latency on each flow.
FIGURE 4.12
Minimum guaranteed bandwidth for each flow.
FIGURE 4.13
Worst-case latency when only 5 flows are constrained.
destination. In a multiswitch topology, by contrast, this may not happen: for
example, if there are 3 flows to the same destination, two of them may share a
path up to a point where they contend with the third flow. The third flow then
only has to wait for one of them (the one with the maximum delay) to go
through, whereas in a full crossbar the third flow may have to wait for both
flows in the worst case. Thus, when only a few flows require real-time
guarantees, a multiswitch topology can give better bounds, and it is very
difficult to arrive at the best topology directly from designer's intuition.
In Figure 4.12, we show the calculated minimum guaranteed bandwidth for the
communication flows for the 14-switch topology.
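A toy worked example of this bound, with service times of our own choosing:

```python
# Each contending packet holds the shared resource for its service time (cycles).
t_flow1, t_flow2 = 120, 80

# Full crossbar to one destination: flow 3 may wait behind both other flows.
crossbar_bound = t_flow1 + t_flow2          # 200 cycles

# Multiswitch: flows 1 and 2 share a path and were serialized upstream, so
# flow 3 waits only for the slower of the two at the contention point.
multiswitch_bound = max(t_flow1, t_flow2)   # 120 cycles

print(crossbar_bound, multiswitch_bound)    # 200 120
```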
So far we have shown what happens to the worst-case latency when a constraint
is set on all the flows. In Figure 4.13, we show the behavior of the RT
synthesis algorithm when only 5 flows have worst-case latency constraints.
The constrained flows are marked with bubbles in the figure. The latency
constraints were added to flows going to and from peripherals. This is a
realistic case, as many peripherals have small buffers and data has to be read
at a constant rate so that it is not overwritten. In this case, the bounds
on those 5 flows could be tightened further (two flows at 160 cycles and three
flows at 60 cycles). Setting these constraints also leads to a reduction in the
worst-case latency of the other flows. Due to the tight constraints, the RT
algorithm maps the RT flows first. Then, the unconstrained flows have to
be mapped with more care, so that they do not interfere with the previously
mapped ones.
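The resulting mapping order can be sketched as below; note that ordering the RT flows tightest-bound-first is our assumption — the text only says that RT flows are mapped before unconstrained ones.

```python
# Sketch of the flow-mapping order: real-time (RT) flows first (here,
# tightest latency bound first), then best-effort flows around them.

def mapping_order(flows):
    """flows: list of (name, latency_bound_or_None). Returns mapping order."""
    rt = sorted((f for f in flows if f[1] is not None), key=lambda f: f[1])
    best_effort = [f for f in flows if f[1] is None]
    return rt + best_effort

flows = [("cpu->dram", None), ("periph_in", 60), ("dma", None), ("periph_out", 160)]
print([name for name, _ in mapping_order(flows)])
# ['periph_in', 'periph_out', 'cpu->dram', 'dma']
```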
4.3 Glossary
CDMA: Code-Division Multiple Access, a mechanism to share the utilization
of a transmission medium, based on the use of orthogonal codes to
differentiate simultaneously transmitting channels.
FIFO: First-In First-Out, a type of buffer where the first input data must be
the first to be output.
munication requests of cores for transmission across the NoC, and of the
inverse process at the receiving core.
NoC: Network-on-Chip, an on-chip interconnect design style based on packet
switching.
4.4 Bibliography
[1] B. Akesson, K. Goossens, and M. Ringhofer. Predator: A pre-
dictable SDRAM memory controller. In International Conf. on Hard-
ware/Software Codesign and System Synthesis (CODES+ISSS), 251–256.
ACM, October 2007.
[2] T. Bjerregaard and J. Sparsø. Scheduling discipline for latency and band-
width guarantees in asynchronous network-on-chip. In Proceedings of the
11th IEEE International Symposium on Asynchronous Circuits and Sys-
tems (ASYNC), 34–43, 2005.
[3] E. Bolotin, I. Cidon, R. Ginosar, and A. Kolodny. QNoC: QoS architecture
and design process for network on chip. J. Syst. Archit., Elsevier, North
Holland, New York, 2004.
[4] A. Campbell, C. Aurrecoechea, and L. Hauw. A review of QoS architec-
tures. Mult. Syst., 6:138–151, 1996.
[5] C. Ciordas, A. Hansson, K. Goossens, and T. Basten. A monitoring-aware
network-on-chip design flow. J. Syst. Archit., 54(3-4):397–410, 2008.
[6] W. Dally and B. Towles. Principles and Practices of Interconnection
Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
2003.
[7] W. J. Dally. Virtual-channel flow control. In ISCA ’90: Proceedings of the
17th Annual International Symposium on Computer Architecture, 60–68,
1990.
[8] M. Daneshtalab, M. Ebrahimi, P. Liljeberg, J. Plosila, and H. Tenhunen.
A low-latency and memory-efficient on-chip network. In Proceedings of the
4th ACM/IEEE International Symposium on Networks-on-Chip (NOCS'10), May
2010.
[9] Denali. Databahn DDR memory controller IP, 2010. http://www.denali.com.
[10] J. Diemer and R. Ernst. Back suction: Service guarantees for latency-
sensitive on-chip networks. In Proceedings of the 4th ACM/IEEE
International Symposium on Networks-on-Chip (NOCS'10), May 2010.
[11] J. Duato, S. Yalamanchili, and N. Lionel. Interconnection Networks: An
Engineering Approach. Morgan Kaufmann Publishers Inc., San Francisco,
CA, USA, 2002.
[12] M. Kobrinsky et al. On-chip optical interconnects. Intel Tech. J.,
8(2):129–142, 2004.
[40] D. Wiklund and D. Liu. SoCBUS: Switched network on chip for hard
real time embedded systems. In Proceedings of the 17th International
Symposium on Parallel and Distributed Processing Symposium, 781–789,
2003.
5
Emerging Interconnect Technologies
Davide Sacchetto
Ecole Polytechnique Fédérale de Lausanne, Switzerland
Fengda Sun
Ecole Polytechnique Fédérale de Lausanne, Switzerland
CONTENTS
5.1 Introduction 160
5.1.1 Traditional Interconnects 161
5.1.2 General Organization of the Chapter 162
5.2 Optical Interconnects 162
5.3 Plasmonic Interconnects 165
5.4 Silicon Nanowires 167
5.4.1 Bottom-Up Techniques 168
5.4.1.1 Vapor-Liquid-Solid Growth 168
5.4.1.2 Laser-Assisted Catalytic Growth 168
5.4.1.3 Chemical Vapor Deposition 169
5.4.1.4 Opportunities and Challenges of Bottom-Up Approaches 169
5.4.2 Top-Down Techniques 170
5.4.2.1 Standard Photolithography Techniques 170
5.4.2.2 Miscellaneous Mask-Based Techniques 173
5.4.2.3 Spacer Techniques 174
5.4.2.4 Nanomold-Based Techniques 174
5.4.2.5 Opportunities and Challenges of Top-Down Approaches 175
5.4.3 Focus on the Spacer Technique 175
5.4.4 Focus on the DRIE Technique 178
5.5 Carbon Nanotubes 181
5.5.1 Physics of Carbon Nanotubes 182
5.5.2 Types of CNTs for Various Interconnects 183
5.5.3 Synthesis of CNT: A Technology Outlook 185
5.6 3D Integration Technology 186
5.6.1 Metal/Poly-Silicon TSVs 187
5.6.2 Carbon Nanotube TSVs 187
5.6.3 Optical TSVs 189
5.1 Introduction
Recent years have seen an exponential increase in the number of devices per
unit area. As transistor sizes shrank following Moore's Law, the growing delay
of wires began to outweigh the performance gains of smaller transistors.
Decreasing the wire delay, e.g., by using low-dielectric-constant (low-κ)
insulating materials, relieved the problem at 90 nm; but for 65 nm and beyond,
an ultralow-κ material is required. Although this is technologically feasible,
it may increase the fabrication costs [59]. Industry has been following the
dictates of scaling, and technology innovation has been pushed by design
requirements. Along with the miniaturization brought by new technology nodes,
profound modifications to routing have been carried out: an increasing number
of metal layers at ever smaller lithography pitch became necessary to
interconnect the growing number of devices. In the last decade the scaling of
interconnects has slowed down due to technological barriers, such as the
increase in wire resistivity and the high capacitive coupling between adjacent
wires. All this requires more effort on the technological research side toward
less resistive materials and low-κ insulators with ever lower dielectric
constants.
According to the International Technology Roadmap for Semiconductors [74],
the future of interconnects may lead to completely new concepts (see Table
5.1), either exploiting totally different physics, such as wireless or
optical/plasmonic signaling, or radically re-adapting the current
dual-damascene Cu process into one that employs innovative conductors, such as
carbon nanotubes, nanowires, or graphene. These novel approaches leverage
specific properties of the underlying technologies. For instance, optical
signaling has the advantages of high bandwidth and no cross-talk, while radio
frequency (RF) wireless banks on the possibility of interconnecting different
components without the need for routing. In the case of
nanotubes/nanowires/graphene, the intrinsic strength as an interconnect is a
conductivity higher than that of Cu when properly engineered. Moreover, since
these innovative materials are also investigated for their promising
properties as field effect transistors for future technology nodes, it is
likely that these two branches of research, one on interconnects and the other
on devices, will one day lead to a unique platform entirely based on
nanowire/nanotube technology.
Another solution is 3D integration. This technology was first proposed in
the 1970s. The main goal of 3D circuit processing is to create additional
semiconducting layers of silicon, germanium, gallium arsenide, or other
materials on top of an existing device layer on a semiconducting substrate.
There are several possible fabrication technologies to form these layers.
The most promis-
TABLE 5.1
Alternative interconnect technologies [74]
Technology Advantages Disadvantages
Optical High Bandwidth Low Density
Plasmonics High Density Low Distance
Wireless Scaling Compatible Process Integration
Nanowires/Nanotubes High Density CMOS Compatibility
TABLE 5.2
Electrical interconnects issues [74]
Problem Potential Solution
Electromigration Blech Length Via
Stress Migration Via-to-Line Geometric Design
Dielectric Breakdown N/A
Line Edge Roughness N/A
κ increase in low-κ dielectrics due to CMP N/A
ing near-term techniques are wafer bonding [61], silicon epitaxial growth [77],
and recrystallization of polysilicon [39].
FIGURE 5.1
SoC interconnect layers. Lateral cross-section of a System-on-Chip showing
the Back End of Line with the different interconnect layers [74].
FIGURE 5.2
Optical Interconnects [13]: (a) System for on-chip optical interconnects. (b)
Bandwidth/latency ratio for 1 cm long optical, nonscaled Cu, WDM optical
and scaled Cu interconnect.
TABLE 5.3
Optical vs. electrical interconnects
Optical Electrical
Speed-of-Light Speed Requirements Matched
High Bandwidth Bandwidth Requirements Matched
No Crosstalk Crosstalk Constraint
Low Density High Density
Global Interconnects All Interconnect Levels
lowest turn radii of a few μm are achievable with Si/SiO2 waveguides, which
is also representative of the integration limit for this type of optical
interconnect. As the other types of components would need to be significantly
scaled, optical interconnects will most likely be used for top-level global
signaling or for clock-tree distribution. Another constraint is the waveguide
pitch, which cannot be reduced below 300 nm–400 nm in lateral dimensions,
although the WDM technique can help pack more signals per waveguide.
A different type of photonic device that needs to be miniaturized is the
light modulator, the equivalent of the switch for optical signals. Among many
alternatives, one possible implementation is the gate-all-around construction
over a silicon nanowire waveguide [52]. This solution efficiently combines the
waveguiding of Si/SiO2 nanowire together with high-speed and high-efficiency
modulation due to the capacitive system. Applying a voltage bias to the gate
modifies the free-carrier concentration at the Si/SiO2 interface where the max-
imum of the electric field is present. This variation in carrier concentration can
be obtained in either accumulation or inversion mode, and both modes modify
the effective refractive index of the modulator. Thus, Si nanowires embedded
in SiO2 cladding can be efficiently used for implementing both waveguides and
light modulators.
FIGURE 5.3
Nanowire lasing: (a) Single nanowire lasing principle [20]. (b) Plasmonic lasing
based on a compound II–VI semiconducting nanowire on top of a MgF2 gain
layer [58].
TABLE 5.4
Top-down vs. bottom-up approach
Top-Down Bottom-Up
Advantages Alignment, Reliability Dimension Control,
Heterojunctions, Materials
Disadvantages Variability No Alignment,
Metal Catalyst Contaminations
TABLE 5.5
Survey of reported nanowire crossbars. Functionalized arrays are those
including molecular switches
Reference [24] [30] [6] [7]
NW material Si/Ti Ti/Pt Si poly-Si
NW width [nm] 16 30 20 54
NW pitch [nm] 33 60 > 1000 100
Crossbar density [cm−2] 10^11 2.7 × 10^10 N/A 10^10
Technique SNAP NIL Self-assembly MSPT
Functionalized? yes yes no no
shows that the MSPT patterning technique has the dual advantages of yielding
semiconducting NWs and a high crosspoint density of ∼10^10 cm−2, while using
conventional photolithographic processing steps.
FIGURE 5.4
Vapor-liquid-solid growth of a silicon nanowire [31]: (a) Free Si atoms from
silane dissolve in the Au seed until reaching Si:Au supersaturation; then
Si is expelled as a nanowire. (b) TEM image of a SiNW synthesized at 500°C in
hexane at 200 bar. (c) TEM of the part of the SiNW inside the square in (b),
showing highly crystalline SiNWs.
since the laser pulses locally heat the substrate generating the particle for
the nanowire growth. It is also suitable for multicomponent nanowires, in-
cluding doped nanowires, and for nanowires with a high-quality crystalline
structure [18].
FIGURE 5.5
In situ axial and nanowire doping: (a) Doping along the nanowire axis (axial
doping) [25]. (b) Doping around the nanowire axis (radial doping) [42].
over the substrate (Figure 5.6) [34], or they can stand vertically aligned with
respect to the substrate (Figure 5.7) [29, 64]. The growth substrate is in
general different from the functional substrate. Consequently, it is necessary
to disperse the as-grown nanowires in a solution, and then to transfer them
onto the functional substrate, making the process more complex. In [27], the
nanowires were dispersed in ethanol; then the diluted nanowire suspension
was used to flow-align the nanowires by using microfluidic channels. A similar
technique was used in [35] in order to assemble arrays of nanowires through
fluidic channel structures formed between a polydimethylsiloxane (PDMS)
mold and a flat substrate. This technique yields parallel nanowires over long
distances, as shown in Figure 5.8.
FIGURE 5.6
Growth of meshed nanowires [34]: (a) Scanning electron microscope (SEM)
image of gold-catalyzed growth of SiNWs on Si3 N4 /Si substrate. Image width
= 7 μm. (b) High-magnification image of branched nanowires. Image width
= 0.7 μm.
FIGURE 5.7
Growth of vertical nanowires [29]: (a) Conformal growth of nanowires to the
substrate. (b) Tilted SEM image and (c) a cross-sectional SEM image of the
structure. Scale bars are 10 μm.
FIGURE 5.8
PDMS-mold-based assembly of InP nanowires [35]: (a) Schematic represen-
tation of the technique. (b) SEM image of the aligned nanowires (scale bar
= 50 μm). (c) Higher magnification SEM image of the aligned nanowires (scale
bar = 2 μm).
spacer technique was used to define the nanomold and not the nanowires di-
rectly, this process is closer in nature to the nanomold-based techniques than
to the spacer techniques.
nanowire layer. In this part of the chapter, the efforts are concentrated on
related challenges: first, the demonstration of the ability of this technology to
yield a crossbar structure; then the assessment of the limits of this technology
in terms of nanowire dimensions and pitch; and finally, the characterization
of access devices operating as single poly-Si nanowire field effect transistors
(poly-SiNWFET). The main idea of the process is the iterative definition of
thin spacers of alternating semiconducting and insulating materials, which
results in semiconducting and insulating nanowires. The structures are defined
inside a 1 μm high wet SiO2 layer over the Si substrate (Figure 5.9(a)). This
SiO2 layer has two functions: on the one hand, it ensures the isolation between
the devices; on the other hand, it is used to define a 0.5 μm high sacrificial
layer on which the multispacer is defined. Then, a thin conformal layer of
poly-Si with a thickness ranging from 40 to 90 nm is deposited by LPCVD in
the Centrotherm tube 1-1 (Figure 5.9(b)). During the LPCVD process, silane
(SiH4 ) flows into the chamber and silicon is deposited onto the substrate.
The type of deposited silicon (amorphous or poly-crystalline) depends on
the chamber temperature and pressure [69, 2, 70]. The deposition has been
specifically optimized for the CMI facilities [60]. At the deposition
temperature of 600°C, the LPCVD process yields poly-crystalline silicon.
Thereafter,
this layer is etched with the Reactive Ion Etching (RIE) etcher STS Multiplex
ICP using a Cl2 plasma, in order to remove the horizontal layer while keeping
the sidewall as a spacer (Figure 5.9(c)). As the densification of deposited
silicon improves the crystalline structure [54], the poly-Si spacer is densified at
700°C for 1 hour under N2 flow in the Centrotherm tube 2-1. Then, a conformal
insulating layer is deposited as a 40 to 80 nm thin Low-Temperature Oxide
(LTO) layer, obtained by LPCVD in the Centrotherm tube 3-1 following the
reaction of SiH4 with O2 at 425°C (Figure 5.9(d)). The quality of the LTO can
be improved through densification [9]. Thus, the deposited LTO is densified
at 700°C for 45 minutes under N2 flow. Then it is etched in the RIE etcher
Alcatel AMS 200 DSE using C4F8 plasma, in order to remove the horizontal
layer and keep just the vertical spacer (Figure 5.9(e)). Alternatively, instead
of depositing and etching the LTO, the previously defined poly-Si spacer can
be partially oxidized in the Centrotherm tube 2-1 in order to directly form
the following insulating spacer. These two operations (poly-Si and insulating
spacer definition) are performed one to six times in order to obtain a multi-
spacer with alternating poly-Si and SiO2 nanowires (Figure 5.9(f)). Then, the
batch is split into two parts: some of the wafers are dedicated to the definition
of a second perpendicular layer of nanowires, some others are processed fur-
ther with the gate stack and the back-end steps and are dedicated to perform
electrical measurements.
In order to address the issue of characterizing a single access device (poly-
SiNWFET), a single nanowire layer is used, on top of which a poly-Si gate
stack is defined with an oxide thickness of 20 nm, obtained by dry oxidation
of the poly-SiNW, and different gate lengths (Figure 5.9(g)). The drain and
source regions of the undoped poly-SiNW are defined by the e-beam evapo-
(a) Cave definition inside oxidized Si substrate. (b) Conformal thin poly-Si
layer deposition. (g) Gate oxide and gate poly-Si deposition and patterning.
(h) Passivation and metallization.
FIGURE 5.9
MSPT process steps in (a)-(h).
(a) Optical lithography. (b) Four-step DRIE etch. (c) Wet oxidation.
(d) Cave filling with photoresist. (e) BHF oxide removal. (f) Photoresist
removal.
FIGURE 5.10
Vertically stacked Si nanowire process steps in (a)-(i).
the successive processes (Figure 5.10(f)). Nanowires are oxidized in a dry
atmosphere to form a 10–20 nm high-quality oxide, used as the gate dielectric
for Field Effect Transistor (FET) devices (Figure 5.10(g)). Then between
200 nm and 500 nm of LPCVD polysilicon is deposited (Figure 5.10(h)). The
LPCVD polySi layer allows conformal coverage of the 3D structure, enabling
the formation of gate-all-around devices, such as FETs [62] or optical mod-
ulators [52]. The polysilicon gate is patterned by means of a combination of
isotropic and anisotropic recipes (see Figure 5.10(i)). Depending on the
structure, implantation or metallization of the Si pillars can then be carried
out, so as to produce Metal Oxide Semiconductor Field Effect Transistors
(MOSFETs) or Schottky Barrier Field Effect Transistors (SBFETs), respectively.
Examples of fabricated structures demonstrating arrays having from 3 up
to 12 vertically stacked Si nanowires are shown in Figure 5.11. The obtained
nanowires can be used to build gate-all-around field effect transistors (see
Figure 5.12) interconnected through Si pillars.
FIGURE 5.11
Arrays of vertically stacked Si nanowires [63]: (a) Silicon nanowire arrays with
12 vertical levels. (b) Silicon nanowire arrays with 3 vertical levels.
FIGURE 5.12
Vertically stacked Si nanowire transistors [63]: (a) Three horizontal Si
nanowire strands with two parallel polysilicon gates. (b) Focused ion beam
(FIB) cross-section showing triangular and rhombic nanowires embedded in a
gate-all-around polysilicon gate.
TABLE 5.6
Cu vs. CNT
Properties CNT Copper
Mean Free Path [nm] >1000 [49] 40
Max Current Density [A/cm^2] >10^10 [72] 10^6
Thermal Conductivity [W/mK] 5800 [32] 385
FIGURE 5.13
Multiwall carbon nanotube discovered by Sumio Iijima in 1991 [36]: (a)
MWCNT schematic. (b) TEM image.
FIGURE 5.14
Chiral vectors of SWCNTs determining the type of CNT: zigzag (semiconducting
CNTs) and armchair (metallic CNTs) [10].
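For reference, the commonly cited tight-binding estimate for the bandgap of a semiconducting nanotube shell of diameter D is (a standard approximation from the CNT literature, added here for context):

```latex
E_g \;\approx\; \frac{0.8\ \mathrm{eV \cdot nm}}{D}
```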
where D is given in nm. Hence, the large semiconducting shells (D >5 nm) of
the MWCNTs have bandgaps comparable to the thermal energy of electrons
and act like conductors at room temperature [47]. This makes the MWCNTs
mostly conducting.
FIGURE 5.15
A CMOS-CNT hybrid interconnect technology with bundles of MWCNTs
grown in via holes: (a) Schematic cross section. (b) SEM image of a 1μm
via [55].
FIGURE 5.16
Schematic of process sequence for bottom-up fabrication of CNT bundle
vias [45].
process or dc plasma assisted hot-filament CVD. Next, the free space between
the individual CNTs is filled with SiO2 by CVD using tetraethylorthosilicate
(TEOS). This is followed by CMP to produce a CNT array embedded in SiO2
with only the ends exposed over the planarized solid surface. Bundles of CNTs
offer many advantages for on-chip interconnects, but a number of hurdles must
be overcome before CNTs can enter mainstream VLSI processing. The major
issue is the maturity of the CNT synthesis techniques, which still cannot guar-
antee a controlled growth process to achieve prescribed chirality, conductivity,
diameter, spacing, and number of walls. The second major challenge yet to be
solved is the hybrid fabrication of CNTs and CMOS components with compatible
thermal budgets and without degradation, and the realization of high-quality
contacts. An overview of CNT interconnect research has been presented in
contacts. An overview of CNT interconnect research has been presented in
this section. This provides an early look at the unique research opportunities
that CNT interconnects provide, before they see widespread adoption.
to-chip). Wafer-level bonding has much higher accuracy and allows for a larger
density of interconnects compared to chip-level bonding. By mid-2006,
Tezzaron's wafer-level process consistently achieved alignment accuracy of
less than a micrometer, while chip-level placement accuracy was about
10 μm [59]. Today, the most used method for chip-level bonding is flip-chip
using solder bumps. It requires one-by-one manipulation, which means higher
cost. Some novel ideas have been proposed to overcome the low-throughput
limitation of flip-chip technology and enhance the alignment accuracy: Tohoku
University's fluidic self-alignment method claims to have achieved a high chip
alignment accuracy of 1 μm [22].
Another crucial technology in 3D integration is thinning. Lower-aspect-ratio
TSVs are preferred because they are easier to fabricate. Silicon wafers can be
thinned to less than 50 μm. The process usually starts with mechanical
grinding, followed by chemical mechanical polishing, and finishes with
unselective etching to achieve a smooth, planar surface. Alternatively, some
groups, such as MIT Lincoln Lab, choose Silicon-on-Insulator (SOI) wafers: the
thinning is done by etching away the thick backside silicon, using the buried
oxide layer of the SOI wafer as an etch stop [11].
3D integration technology with TSVs can be categorized into in-process and
postprocess approaches. The in-process approach makes the TSVs ready before
the fabrication of metal wires on the chips, meaning that it can generally
tolerate high temperatures, above 1000°C. Another advantage of this approach
is that the interconnection length can be minimized with 3D place-and-route
design. The in-process approach is suitable for Integrated Circuit (IC) fabs.
The other approach is postprocessing: the fabricated chips are formed with
TSVs before/after dicing, and then stacked. The temperature budget is limited
to under 350°C to avoid any degradation of the preprocessed circuits.
Wafer-level and chip-level bonding can be used in both approaches.
FIGURE 5.17
3D stacking technologies [74].
Emerging Interconnect Technologies 189
to grow vertically from the Fe catalyst layer on the bottom wafer. Using the
thermal chemical vapor deposition technique, the authors demonstrated the
capability of growing aligned carbon-nanotube bundles with an average
length of 140 μm and a diameter of 30 μm from the through holes.
5.8 Glossary
BER: Bit Error Rate
BHF: Buffered Hydrofluoric Acid
CMOS: Complementary Metal Oxide Semiconductor
CMP: Chemical Mechanical Polishing
CVD: Chemical Vapor Deposition
DRIE: Deep Reactive Ion Etching
EUV-IL: Extreme Ultraviolet Interference Lithography
FET: Field Effect Transistor
FIB: Focused Ion Beam
F2B: Face to Back
F2F: Face to Face
HF-CVD: Hot Filament Chemical Vapor Deposition
IC: Integrated Circuit
IST: Iterative Spacer Technique
LPCVD: Low-Pressure Chemical Vapor Deposition
MOSFET: Metal Oxide Semiconductor FET
MSM: Metal Semiconductor Metal
MSPT: Multiple Spacer Patterning Technique
MWCNT: Multi-Walled CNT
192 Communication Architectures for SoC
NW: Nanowire
PDMS: Polydimethylsiloxane
TEOS: Tetraethyl Orthosilicate
5.9 Bibliography
[1] I. Ahmed, C. E. Png, E.-P. Li, and R. Vahldieck. Electromagnetic wave
propagation in a Ag nanoparticle-based plasmonic power divider. Optics
Express, 17:337+, January 2009.
[5] K. Banerjee, H. Li, and N. Srivastava. Current status and future per-
spectives of carbon nanotube interconnects. Proceedings of the 8th IEEE
International Conference on Nanotechnology, 432–436, 2008.
[6] R. Beckman, E. Johnston-Halperin, Y. Luo, J. E. Green, and J. R. Heath.
Bridging dimensions: demultiplexing ultrahigh density nanowire circuits.
Science, 310(5747):465–468, 2005.
[15] C.-C. Chiu, T.-Y. Tsai, and N.-H. Tai. Field emission properties of carbon
nanotube arrays through the pattern transfer process. Nanotechnology,
17:2840–2844, June 2006.
[18] Y. Cui, X. Duan, J. Hu, and C. M. Lieber. Doping and electrical transport
in silicon nanowires. The Journal of Physical Chemistry B, 104(22):5213–
5216, 2000.
[19] M. S. Dresselhaus, G. Dresselhaus, and P. Avouris. Carbon Nanotubes:
Synthesis, Structure, Properties, and Applications. Springer, 2001.
[34] J.-F. Hsu, B.-R. Huang, and C.-S. Huang. The growth of silicon
nanowires using a parallel plate structure. Volume 2:605–608, July 2005.
[43] K.-N. Lee, S.-W. Jung, W.-H. Kim, M.-H. Lee, K.-S. Shin, and W.-K.
Seong. Well controlled assembly of silicon nanowires by nanowire transfer
method. Nanotechnology, 18(44):445302 (7 pp.), 2007.
Yuan Xie
Pennsylvania State University
Suman Datta
Pennsylvania State University
Chita R. Das
Pennsylvania State University
CONTENTS
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
6.2 RF-Interconnect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
6.3 State-of-Art Topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
6.3.1 Mesh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
6.3.2 Concentrated Mesh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
6.3.3 Flattened Butterfly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
6.3.4 Hierarchical Topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
6.3.5 Impact of Locality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
6.4 RF-Topology Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
6.5 RF-Based Topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
6.6 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
6.6.1 Technology Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
6.6.2 Simulation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
6.7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
6.7.1 Simple Mesh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
6.7.1.1 16 Nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
6.7.1.2 36 Nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
6.7.1.3 64 Nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
6.7.1.4 256 Nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
6.7.2 Cmesh and Hierarchical Topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
6.7.2.1 CMESH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
6.7.2.2 Hierarchical Topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
6.8 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
6.9 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
6.10 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
6.11 Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
6.12 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
6.1 Introduction
Future microprocessors are predicted to consist of tens to hundreds of cores
running several concurrent tasks. A scalable communication fabric is required
to connect these components, giving birth to networks on silicon, also known
as Networks-on-Chip (NoCs). NoCs have become the de facto solution for
integrating multicore architectures, as opposed to point-to-point global wiring,
shared buses, or monolithic crossbars, because of their scalability and
predictable electrical properties.
The network topology is a vital aspect of on-chip network design, as it
determines several power-performance metrics. The key challenge in designing
a NoC topology is to provide both high throughput and low latency while
operating under constrained power budgets. 2D-mesh topologies are popular for
tiled Chip Multiprocessors (CMPs) [36, 29, 32] due to their simplicity and
planar layout properties. 2D meshes provide the best network throughput [11],
albeit with scalability limitations in latency and power: the network diameter
grows linearly with the mesh dimension. To address the scalability of
2D meshes, researchers have proposed concentration [3], richly connected
topologies [15, 13], and hierarchical topologies [11]. Concentration achieves
a smaller network diameter by sharing a router among multiple injecting nodes,
resulting in fewer routers. However, concentrated topologies trade off
achievable network throughput and bandwidth for lower latency. In addition,
concentrated networks consume more power because they need larger switches and
wider physical channels. Richly connected topologies achieve lower latency by
trading off throughput at moderate power consumption. Hierarchical topologies
exploit the communication locality of applications to achieve low latency and
low power. However, these topologies provide suboptimal throughput because the
global network becomes the bottleneck in the proposed clustered communication
architecture [11].
Express paths (virtual and physical) have been shown to improve both
latency and throughput [17]. For throughput-constrained hierarchical and
concentrated topologies, adding express paths could provide substantial
throughput benefits. However, as global interconnect delay worsens in future
technology nodes, express paths will be very challenging to implement with
traditional RC interconnect technologies. Alternative interconnect
technologies, such as optical networks, radio-frequency (RF) signal
transmission, and low-dimensional materials (LDM) such as nanowires and
nanotubes, are being explored [1]. Low-dimensional materials are considered
far-term solutions, while optical and RF-based technologies are predicted to
be near-term solutions due to their Complementary Metal-Oxide-Semiconductor
(CMOS) compatibility.
Hybrid Topology Exploration for RF-Based On-Chip Networks 203
These emerging technologies have one thing in common: low latency for
long-distance communication. For on-chip networks, this property translates
to cheaper express paths.
In this chapter, we will explore network topology designs that provide
high throughput and low latency under tight power constraints by using
radio-frequency interconnect technology. RF-based interconnects (RF-I) incur
lower energy and higher data-rate density than their electrical counterparts
[7]. Radio-frequency (RF) mm-wave propagation modulates data onto a carrier
electromagnetic wave that is guided along a wire. Such propagation has the
lowest latency physically possible, as the electromagnetic wave travels at
the speed of light. As a result, RF interconnects can achieve high data rates
limited only by the speed of the modulator. This RF bandwidth can be
multiplexed among multiple carriers using techniques such as frequency-
division multiple access (FDMA), leading to higher throughput as well. Thus,
for on-chip distances, RF-I can provide high-bandwidth, low-latency
super-express paths (from one end of the chip to the other). In addition, the
RF-interconnect components, namely the transmitters (modulators and mixers)
and the receivers, benefit from CMOS technology scaling. RF integration on a
mesh topology has been explored in [9, 6]. Even though RF technology requires
significant design effort to mature before becoming mainstream, assessing the
benefits it offers architecturally will be an important factor in determining
its usage model in the future.
In this chapter, we use RF interconnects in various state-of-the-art
topologies. Das et al. grouped the network nodes into logical clusters and
showed that a hierarchical network, made up of bus networks for intracluster
communication and a global mesh for intercluster communication, achieved the
best performance and power trade-offs compared with state-of-the-art
topologies [11]. This hierarchical design, however, had lower throughput,
with the global network as the bottleneck. We adopt this hierarchical
philosophy for our study, as the high bandwidth of RF could address the low
throughput of the global network. Replacing the global mesh with an
RF-enhanced mesh yields an energy-delay product reduction of up to 30% while
providing up to 40% higher throughput than the base hierarchical design. The
main insights of this chapter are:
• Hierarchical networks provide superior delay and power trade-offs but
suffer in throughput. RF-I, when applied to a hierarchical network,
enhances throughput and also lowers latency at approximately the same
power.
• The throughput improvement obtained by using RF-I increases with the
degree of concentration.
• For medium-sized networks, an RF-I-enhanced concentrated network is
attractive.
The rest of this chapter is organized as follows. Section 6.2 gives a
brief background on RF interconnects. Section 6.3 describes the
state-of-the-art topologies, and Section 6.5 describes the RF-enhanced
topologies. Section 6.6 describes the experimental setup, and the results are
presented in Section 6.7. Section 6.9 outlines some of the prior work in this
area, and Section 6.10 concludes the chapter.
6.2 RF-Interconnect
Current and future technology generations face interconnect delay problems
due to the high latency of charging and discharging repeated RC wires [1].
Differential signaling techniques can reduce both the delay and the power of
RC wires, yet they are not sufficient to mitigate the high global interconnect
delay, as differential wires are typically not buffered (they require
differential buffers). Technological alternatives to address this problem are
being thoroughly explored. A promising near-term solution is the use of
through-silicon vias (TSVs) as an enabling technology for three-dimensional
(3D) stacking of dies: 3D integration leads to smaller die area and thus
reduced wire lengths. Long-term solutions under examination include more
fundamental changes, such as different materials for interconnect wires and
novel signaling techniques. Bundles of single-walled carbon nanotubes are
being investigated as a possible replacement for copper. The ballistic
transport of electrons in Single-Walled Carbon Nanotubes (SWCNTs) makes them
highly resistant to electromigration, with lower resistivity than copper at
the same technology node. While several factors will influence the deployment
of these new materials in mainstream integrated circuits, CMOS compatibility
is viewed as an overriding one. Radio-frequency mm-wave signal propagation
(RF-I) is an attractive option due to its CMOS compatibility.
Photonics-based on-chip signaling is also being considered as a viable
option. Optical networks use components such as waveguides, ring resonators,
detectors, and a laser source. With significant research effort, many of these
components have been (re)designed for on-die placement; the laser source is
still kept off-die. The leakage power and temperature sensitivity of the
optical components need to be optimized to make photonic communication viable
on chip [33].
As CMOS device dimensions continue to scale down, the cut-off frequency
(ft) and the maximum frequency (fmax) of CMOS devices will exceed several
hundred GHz. At such high frequencies a conventional line has very high signal
attenuation, and the loss increases with the length of the interconnect [8].
Traditional RC signaling can only run at a few GHz, wasting the available
bandwidth. The RF concept is that the data to be transmitted are modulated
onto an electromagnetic (EM) wave, which is guided along the transmission
line (waveguide). Microwave transmission in guided media such as
TABLE 6.1
RF parameters for a wire at various technology nodes
Parameter                      70 nm    50 nm    35 nm
No. of carriers                8        10       12
Data rate per carrier (Gb/s)   6        7        8
Total data rate (Gb/s)         48       70       96
Energy per bit (pJ)            1        0.85     0.75
Area (all Tx+Rx) (mm2)         0.112    0.115    0.0119
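The aggregate rates in Table 6.1 follow directly from FDMA multiplexing: the total data rate of an RF link is simply the number of carriers times the per-carrier rate. A minimal sketch with the table's values:

```python
# FDMA aggregate rate: total data rate of an RF link is the number of
# carriers times the per-carrier data rate (values from Table 6.1).
rf_params = {
    "70nm": {"carriers": 8, "rate_gbps": 6},
    "50nm": {"carriers": 10, "rate_gbps": 7},
    "35nm": {"carriers": 12, "rate_gbps": 8},
}

def total_data_rate(node: str) -> int:
    p = rf_params[node]
    return p["carriers"] * p["rate_gbps"]

for node in rf_params:
    print(node, total_data_rate(node), "Gb/s")  # 48, 70, 96 Gb/s
```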
6.3.1 Mesh
2D meshes have been a popular topology for on-chip networks because of
their low complexity and compact 2D layout. They also provide the best
throughput thanks to plentiful network resources and high path diversity.
Meshes, however, have poor latency and power scalability because their
diameter grows rapidly with size, rendering them unsuitable for larger
networks.
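The diameter claim can be made concrete: in a k x k mesh, the maximum hop count is 2(k - 1), so it grows with the square root of the node count. A small sketch:

```python
import math

# Diameter (maximum hop count) of a k x k 2D mesh is 2*(k - 1): it grows
# with sqrt(node count), i.e. linearly in the mesh dimension k.
def mesh_diameter(num_nodes: int) -> int:
    k = math.isqrt(num_nodes)
    assert k * k == num_nodes, "expected a square node count"
    return 2 * (k - 1)

for n in (16, 64, 256):
    print(n, mesh_diameter(n))  # 6, 14, 30 hops
```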
FIGURE 6.1
Concentrated mesh.
translates to lower latency. There is a dedicated port for each PE and four
ports for the cardinal directions. Though CMESH provides low latency, it is
energy inefficient because of the high switch power of its high-radix routers
and wider channels: crossbar power increases with both the number of ports
and the port width. Thus, concentration trades off throughput and power for
lower latency. Concentration also results in a smaller network, so CMESH has
reduced path diversity and fewer network resources (buffers, etc.).
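The port/width sensitivity can be illustrated with a first-order proxy (this is not the chapter's power model; the roughly quadratic dependence on port count is a common rough approximation):

```python
# Rough proxy (not the chapter's model): crossbar complexity grows about
# quadratically with the port count and linearly with the port width.
def xbar_cost(ports: int, port_width_bits: int) -> int:
    return ports * ports * port_width_bits

mesh_router = xbar_cost(ports=5, port_width_bits=512)   # 4 directions + 1 PE
cmesh_router = xbar_cost(ports=8, port_width_bits=512)  # 4 directions + 4 PEs
print(cmesh_router / mesh_router)  # 2.56 -> over 2.5x the switch cost
```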
FIGURE 6.2
Flattened butterfly.
the presence of high communication locality (nearest-neighbor traffic), the
reduced hop count may not compensate for the higher serialization latency,
which can adversely affect the overall packet latency. Fbfly also suffers
from poor throughput, like other concentrated networks. In summary, the fbfly
topology gives lower latency at low power, but with reduced throughput.
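The hop-count versus serialization trade-off can be sketched with a first-order latency model (illustrative parameters; the channel widths are the 256-node values from Table 6.4):

```python
# First-order latency model: hops * per-hop router delay + serialization,
# where serialization = packet_bits / channel_width (in cycles).
def packet_latency(hops: int, router_delay: int,
                   packet_bits: int, channel_width: int) -> float:
    return hops * router_delay + packet_bits / channel_width

# Nearest-neighbour traffic: a 2-hop mesh route on wide channels vs a
# 1-hop route on the narrower fbfly channels.
print(packet_latency(2, 2, 512, 512))  # 5.0 cycles
print(packet_latency(1, 2, 512, 128))  # 6.0 cycles -- fewer hops, yet slower
```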
FIGURE 6.3
Hierarchical network.
FIGURE 6.4
High-level RF circuit (transmission over a CPW or MTL).
FIGURE 6.5
Hierarchical network packet traversal. Local transactions, and packets headed
to or arriving from the global network, pass through decoding/routing,
arbitration, grant generation with selective wakeup, and bus transfer stages;
packets to the global network are additionally packetized and injected into
the global router, and arriving packets are depacketized.
FIGURE 6.6
Average packet latency (cycles) versus offered load for the mesh, cmesh,
fbfly, and hier topologies.
FIGURE 6.7
Network power consumption versus offered load (packets/cycle/node).
followed by fbfly. Thus, even though CMESH provides the best latency, it
is power-hungry. The hier topology is desirable with respect to latency and
power.
FIGURE 6.8
Average message latency (cycles) versus injection rate (packets/node/cycle)
for the local traffic pattern on a 64-node network.
FIGURE 6.9
Energy-delay product for the local traffic pattern at each network size
(injection rate in parentheses).
higher end compared to UR traffic, by as much as 50% for large network sizes.
This reveals two interesting points. First, with high local traffic,
topologies like cmesh and the hierarchical design can offer higher
throughput. Second, for uniform traffic the saturation bottleneck is the
global network, not the bus/concentration, since the local network could
support an injection load of up to 10% with LC traffic. Theoretically, the
bus can sustain a load of up to 12.5% (1 packet every cycle shared among 8
injecting nodes).
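The 12.5% bound follows from simple arithmetic, since the bus delivers at most one packet per cycle shared among its injecting nodes:

```python
# A shared bus delivers at most one packet per cycle, so the sustainable
# per-node injection rate is 1 / (number of injecting nodes).
def bus_capacity_per_node(injecting_nodes: int) -> float:
    return 1.0 / injecting_nodes

print(bus_capacity_per_node(8))  # 0.125 -> the 12.5% theoretical bound
```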
In summary, the hierarchical topology provides low-power, fast,
high-bandwidth communication that can be used efficiently in the presence of
high locality. Yet it saturates faster than the state-of-the-art topologies.
abstracted to a point-to-point link between the end points. Thus, an RF-I is
a set of point-to-point links or express paths.
By enhancing a network with RF-I, we are building a network with a few
express paths. This leads to an irregular network; consequently, the routing
function needs to be more generic than simple deterministic logic, and it
must be smart in its use of the express paths. The ideal scenario is a
dynamic adaptive routing scheme that detects network conditions and traffic
requirements and load-balances according to whether resources are
underutilized or overutilized (congested). The routing function could also
target lower power consumption. We adopt simple table-based routing here for
ease of understanding. Table-based routing is generic; however, combined with
a deterministic router and in the absence of deadlock-prevention schemes, it
can lead to deadlocks. Consequently, we use a deadlock detection and recovery
mechanism. A deadlock is detected when there is a circular dependency among
flits waiting for each other. When a flit is blocked by a lack of buffers in
the receiving router, the waitlist of the next buffer is added to the current
router's waitlist. In this way the waitlist is propagated, and if any router
ends up waiting on itself, a circular dependency is detected and a deadlock
is flagged. More details can be found in [9]. When a deadlock is detected,
recovery uses dedicated escape Virtual Channels (VCs): all packets in the
network are marked for recovery and follow deterministic routes through the
escape VCs. Only new packets entering the network can use the RF links and
table-based routing. As deadlocks are uncommon, the performance penalty of
this mechanism is small. In addition, we experiment with different RF-usage
models: all packets injected into the network use table-based routing
(RF-100%); 50% use table-based routing and the remainder use deterministic
routing (RF-50%); or 25% use table-based routing and the remaining 75% use
deterministic routing (RF-25%). The deterministic routing in this chapter
does not use any of the RF express links; it routes purely through the mesh
links.
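The waitlist-based detection described above amounts to finding a cycle in the wait-for relation among routers. A minimal sketch (names and data layout are illustrative, not from [9]):

```python
# Sketch of waitlist-based deadlock detection: each blocked router records
# which router it waits on; a router that transitively waits on itself forms
# a circular dependency, i.e. a deadlock.
def detect_deadlock(waits_on: dict) -> bool:
    for start in waits_on:
        seen = set()
        node = start
        while node in waits_on:
            if node in seen:          # came back around: circular wait
                return True
            seen.add(node)
            node = waits_on[node]
    return False

# Router 0 waits on 1, 1 on 2, 2 on 0 -> deadlock; a simple chain is fine.
print(detect_deadlock({0: 1, 1: 2, 2: 0}))  # True
print(detect_deadlock({0: 1, 1: 2}))        # False
```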
For the same RF-link budget, a smaller network may have fewer links
inserted. For an equi-bandwidth analysis, if the link budget is underutilized,
the data rate per RF link/band could be increased; in that case, the spacing
between carrier frequencies must also be increased to avoid interference. In
the simplest case, we set the RF-link bandwidth equal to the flit width. When
the data rate is higher, multiple flits can be sent at once, which requires
extra logic to combine and split packets. The hardware overhead of such logic
is small: a little extra buffer space plus recomputation of the header flit
and the state information. In this chapter, for simplicity, we use a constant
data rate per band at a given technology node for all network sizes. Thus,
for a smaller network, the total RF bandwidth is lower.
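Under this constant-rate-per-band policy, the aggregate RF bandwidth scales only with the number of inserted links. A trivial sketch with illustrative values:

```python
# With a fixed data rate per band at a given technology node, the total RF
# bandwidth a network receives scales only with the number of RF links.
def total_rf_bandwidth(num_rf_links: int, rate_per_band_gbps: float) -> float:
    return num_rf_links * rate_per_band_gbps

print(total_rf_bandwidth(4, 7.0))   # 28.0 Gb/s (small network, few links)
print(total_rf_bandwidth(12, 7.0))  # 84.0 Gb/s (larger network)
```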
FIGURE 6.10
Base case with RF overlaid on the MESH. Matching letter pairs are connected
by an RF link.
FIGURE 6.11
RF overlaid on CMESH (RF-CMESH). RF links connect routers of the same color.
FIGURE 6.12
RF overlaid on hierarchical network.
routing in the global network. The routing on the bus is identical to the base
hierarchical network. We expect the throughput to increase in all RF-enhanced
networks.
TABLE 6.2
Baseline processor, cache, memory, and router configuration
Processor Pipeline: SPARC 2 GHz processor, two-way out-of-order,
    64-entry instruction window
L1 Caches: 64 KB per core (private), 4-way set associative, 128 B block
    size, 2-cycle latency, split I/D caches
L2 Caches: 1 MB banks, shared, 16-way set associative, 128 B block size,
    6-cycle latency, 32 MSHRs
Main Memory: 4 GB DRAM, up to 16 outstanding requests per processor,
    400-cycle access
Network Router: 2-stage wormhole-switched, virtual-channel flow control,
    maximum packet size of 1024
TABLE 6.3
Application workloads
SPLASH-2: a suite of parallel scientific workloads. Each benchmark
executes one thread per processor.
SPEComp: we use SPEComp2001 as another representative workload. The
results of applu, apsi, art, and swim are presented.
Commercial Applications: (1) TPC-C, a database benchmark for online
transaction processing (OLTP); (2) SAP, a sales and distribution benchmark;
and (3) SJBB and (4) SJAS, two Java-based server benchmarks. The traces were
collected from multiprocessor server configurations at Intel Corporation.
TABLE 6.4
Network parameters
Topology  Nodes  Channel  Conc.   Radix  VCs  Buffer  No. of   Total   RF
                 Width    Degree               Depth   Routers  Wires   BW
Mesh      16     512      1       5      4    4       16       4096    448
Mesh      64     512      1       4      4    4       64       8192    768
Mesh      256    512      1       4      4    4       256      16384   1024
CMesh     64     512      4       8      4    2       16       8192    448
CMesh     256    512      4       8      4    2       64       16384   1024
Fbfly     16     512      4       7      2    8       4        4096
Fbfly     64     256      4       10     2    8       16       8192
Fbfly     256    128      4       13     2    16      64       16384
Hyb       64     512      8       5      4    4       8        8192    256
Hyb       256    512      8       5      4    4       32       16384   1024
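Table 6.4 appears to hold the total wire budget constant per network size across topologies, consistent with an equal-bisection-bandwidth comparison; for the mesh, the total wires equal twice the mesh dimension times the channel width. A quick check:

```python
import math

# For the mesh rows of Table 6.4, total wires = 2 * sqrt(nodes) * width,
# i.e. the wiring budget of the mesh bisection; the other topologies are
# given the same budget at each network size.
def mesh_bisection_wires(nodes: int, channel_width: int) -> int:
    return 2 * math.isqrt(nodes) * channel_width

assert mesh_bisection_wires(16, 512) == 4096
assert mesh_bisection_wires(64, 512) == 8192
assert mesh_bisection_wires(256, 512) == 16384
print("matches the Total Wires column of Table 6.4")
```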
FIGURE 6.13
Physical layout.
TABLE 6.5
Energy and delay of bus and interrouter links
Parameters Bus
70 nm 50 nm 35 nm 25 nm 18 nm
Length (mm) 7 4.9 3.43 2.4 1.68
Delay (ps) 498.9 442.9 353.9 247.7 173.4
Energy (pJ) 1.4 0.67 0.28 0.20 0.14
Leakage (nW) 23.5 13.3 3.5 2.4 1.7
Link
70 nm 50 nm 35 nm 25 nm 18 nm
Length (mm) 3.5 2.45 1.7 1.2 0.84
Delay (ps) 233 208.8 167.5 117.3 82.1
Energy (pJ) 0.6 0.29 0.12 0.08 0.06
Leakage (nW) 10.2 5.49 1.4 0.98 0.69
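The bus lengths in Table 6.5 are consistent with an ideal 0.7x linear shrink per technology generation, with inter-router links at half the bus length. A quick check:

```python
# Table 6.5 lengths follow an ideal 0.7x linear shrink per generation;
# inter-router links are half the bus length.
def scaled_lengths(initial_mm: float, generations: int,
                   shrink: float = 0.7) -> list:
    lengths = [initial_mm]
    for _ in range(generations - 1):
        lengths.append(lengths[-1] * shrink)
    return lengths

bus = scaled_lengths(7.0, 5)        # 70, 50, 35, 25, 18 nm nodes
link = [length / 2 for length in bus]
print([round(length, 2) for length in bus])  # [7.0, 4.9, 3.43, 2.4, 1.68]
```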
6.7 Results
We first study the mesh topology enhanced with RF alone for various network
parameters, and then examine the effect of overlaying RF-I on the CMESH and
hierarchical topologies.
6.7.1 Simple Mesh
6.7.1.1 16 Nodes
Figures 6.14 and 6.16 depict the latency for varying injection rates under
uniform random (UR) and local (LC) traffic patterns, respectively. For this
small network size, RF-100%, where all packets are routed using table-based
routing, provides the lowest latency and the highest throughput
FIGURE 6.14
Average message latency of Mesh+RF for UR traffic at 16 nodes.
FIGURE 6.15
Network power consumption of Mesh+RF for UR traffic at 16 nodes.
advantages for the two traffic patterns. For the uniform random traffic
pattern, most of the latency and throughput benefits are obtained with only
50% of the traffic using table-based (TB) routing (which implies access to
the RF express paths).
FIGURE 6.16
Average message latency of Mesh+RF for LC traffic at 16 nodes.
FIGURE 6.17
Total network power consumption of Mesh+RF for LC traffic at 16 nodes.
TABLE 6.6
Energy-delay product and throughput benefits of overlaying RF over Mesh
at 16 nodes
Traffic pattern   EDP ratio   Throughput
UR-100%           34          1.16x
UR-50%            19.7        1.10x
LC-100%           14          1.08x
LC-50%            -10         1.05x
6.7.1.2 36 Nodes
Figures 6.18, 6.19, 6.20, and 6.21 show the load-latency plots for 36 nodes
at 50 nm technology. In addition, we also considered the conservative case in
which RF does not take advantage of technology scaling, shown in Figure 6.18
as the EquiRFBW lines. These experiments show that for the reduced
RF-bandwidth case, 50% usage leads to higher throughput. In the
equal-bandwidth scenario the demand on RF is higher; without any restriction
this leads to RF congestion, so under 100% usage the RF-enhanced topology
saturates faster. For this network size and UR traffic, 50% usage yields
enough power savings to compensate for the power overhead of RF. In fact,
even the 25% usage case for EquiRFBW has power similar to the base case. For
LC traffic, however, 50% usage still incurs a power overhead.
FIGURE 6.18
Average message latency of Mesh+RF for UR traffic at 36 nodes. EquiRFBW:
RF BW is the same as that used at 16 nodes.
6.7.1.3 64 Nodes
For this network size (Figures 6.22, 6.23, 6.24, and 6.25), we can see from
Figure 6.22 that with all traffic patterns the RF-I gets congested first, so
100% usage saturates first. The restricted usage models yield higher
throughput. The power trend continues to be similar to the previous network
sizes. Thus, at 64 nodes there is a trade-off between latency and throughput:
if latency is the primary design goal, RF should be used opportunistically.
As the network size increases, congestion management of RF-I becomes
essential for extracting the maximum throughput advantage. We show this using
FIGURE 6.19
Network power consumption of Mesh+RF for UR traffic at 36 nodes.
FIGURE 6.20
Average message latency of Mesh+RF for LC traffic at 36 nodes.
FIGURE 6.21
Network power consumption of Mesh+RF for LC traffic at 36 nodes.
FIGURE 6.22
Average message latency of Mesh+RF for UR traffic at 64 nodes.
FIGURE 6.23
Network power consumption of Mesh+RF for UR traffic at 64 nodes.
FIGURE 6.24
Average message latency of Mesh+RF for LC traffic at 64 nodes.
FIGURE 6.25
Network power consumption of Mesh+RF for LC traffic at 64 nodes.
FIGURE 6.26
Average message latency of Mesh+RF for UR traffic at 256 nodes.
FIGURE 6.27
Network power consumption of Mesh+RF for UR traffic at 256 nodes.
FIGURE 6.28
Average message latency of Mesh+RF for LC traffic at 256 nodes.
FIGURE 6.29
Network power consumption of Mesh+RF for LC traffic at 256 nodes.
lower power than the base case. Thus, as the network size increases, the
power savings due to lower hop count can compensate for the overhead of
using RF-I.
Figure 6.30 shows the energy-delay product improvement of the RF-enhanced
mesh over the base mesh for the four network sizes. For the UR and LC traffic
patterns with RF used 100% of the time, the maximum EDP benefit for the mesh
topology is obtained at 36 nodes. The reason for this trend is that,
initially, technology scaling increases the RF bandwidth while RF-I takes
only one cycle over very long distances; as the network size grows, however,
we observe diminishing returns in throughput and EDP.
This behavior has two causes: (a) congestion of the RF-I and (b) the
occurrence of deadlocks. As the network size increases and RF-I is used only
half the time, the EDP advantage also increases, though we expect that
eventually even the 50% EDP benefits will start to drop. Thus, at larger
network sizes there are diminishing returns in throughput and EDP from using
RF-I in a flat topology.
Figure 6.31 shows the power breakdown across the various router components at a 0.04 injection rate. We can now clearly see why the 50% usage does not lead to power advantages. The crossbar switch is a power-hungry component, and RF-I leads to a larger crossbar, which results in higher power. In the 100% usage case, this increase in power is compensated by the decrease in the total number of hops in the network.
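To make the EDP metric concrete, the sketch below computes an energy-delay product per packet from network power, accepted traffic, and average latency, and normalizes an RF-enhanced configuration against the base mesh. The function names and the sample numbers are illustrative assumptions, not data from this chapter.

```python
def edp(power_w, avg_latency_cycles, accepted_rate):
    """Energy-delay product per packet.

    Energy per packet is power divided by accepted traffic (packets/cycle,
    summed over all nodes, up to a constant cycle-time factor); delay is the
    average packet latency in cycles.
    """
    return (power_w / accepted_rate) * avg_latency_cycles

def edp_improvement_pct(base_edp, enhanced_edp):
    """Percentage EDP improvement of an enhanced network over the base."""
    return 100.0 * (1.0 - enhanced_edp / base_edp)

# Hypothetical sample point (illustrative numbers only):
base = edp(power_w=40.0, avg_latency_cycles=30.0, accepted_rate=2.56)
rf = edp(power_w=35.0, avg_latency_cycles=22.0, accepted_rate=2.56)
print(f"EDP improvement: {edp_improvement_pct(base, rf):.1f}%")
```

Note that the accepted-traffic term cancels when both configurations run at the same load, so the normalized comparison reduces to the power-latency product.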
FIGURE 6.30
EDP improvement for all the network sizes over base mesh.
FIGURE 6.31
[Stacked bar chart: router power breakdown into Buf, Xbar, Link, and RF Tx+Rx for the mesh, RF-100%, and RF-50% configurations at 16, 36, 64, and 256 nodes.]
[Plot: average packet latency vs. injection rate; curves for UR cmesh, UR cmeshRF-100%, UR cmeshRF-50%, LC cmesh, LC cmeshRF-100%, and LC cmeshRF-50%.]
FIGURE 6.32
Average message latency of CMESH+RF for a 64-node network.
[Plot: network power consumption (W) vs. injection rate; curves for UR cmesh, UR cmeshRF-100%, UR cmeshRF-50%, LC cmesh, LC cmeshRF-100%, and LC cmeshRF-50%.]
FIGURE 6.33
Network power consumption of CMESH+RF for a 64-node network.
FIGURE 6.34
Average message latency of CMESH+RF for a 256-node network.
6.7.2.1 CMESH
The average message latency results for the 64-node and 256-node CMESH are shown in Figures 6.32 and 6.34, respectively. For the 64-node network, RF at 50% usage by itself gives significant benefits. Unlike in the mesh, where the RF 100% usage saturated at 64 nodes for UR traffic, in the concentrated mesh the 100% usage gives throughput benefits as well. This is true for LC traffic too. At the 256-node size, however, the 100% usage saturates earlier than the base case and the 50% usage gives throughput benefits. We do not see deadlock forming before saturation in the concentrated mesh even at the large 256-node size. The power plots for UR and LC are shown in Figures 6.33 and 6.35. Cmesh+RF has the lowest latency for the uniform random traffic pattern.
FIGURE 6.35
Network power consumption of CMESH+RF for a 256-node network.
[Plot: average packet latency vs. injection rate; curves for UR hier, UR hierRF-100%, UR hierRF-50%, LC hier, LC hierRF-100%, and LC hierRF-50%.]
FIGURE 6.36
Average message latency of Hier+RF for a 64-node network.
FIGURE 6.37
Network power consumption of Hier+RF for a 64-node network.
32 routers in the global network. Due to the small size, there is minimal performance impact of RF for the LC traffic case, as seen in Figure 6.36. For the UR case, we find that 100% usage gives throughput and latency benefits. For 256 nodes, the RF-enhanced topology at 100% usage does as well as the base case under the LC traffic pattern (see Figure 6.38). This means that RF decreased the delay variability in the global network. It should be noted that, at the same technology node, the RF bandwidth allocated follows MESH > CMESH > HIER. It is interesting that, even with its small RF-bandwidth allocation, the hierarchical network has a very high throughput advantage. Thus, concentrated and hierarchical networks take better advantage of the RF-I technology, and the hierarchical network and the concentrated mesh will have lower area overhead (even in the scenario where all the topologies use the same bandwidth). We expect that with a higher bandwidth allocation there will be throughput and power benefits (see Figures 6.37 and 6.39), but the latency benefits will flatten out, since RF-I saves only the serialization latency in the global network.
Figure 6.40 shows the energy-delay product averaged over all load rates until the hier network saturates (0.04 packets/node/cycle). This value is normalized to the base mesh. The average packet latency is calculated in a similar fashion and is also shown in Figure 6.40. In this plot, lower is better. The plot shows that CMESH+RF has the lowest latency and the hierarchical network provides the best performance and power trade-offs for these injection rates. By overlaying the concentrated mesh with RF, as in Cmesh+RF, we obtain a better EDP than the flattened-butterfly network. Providing express paths using RF in the cmesh can mimic a richly connected network. Figure 6.41 shows the throughput at 64 nodes for each topology normalized to mesh.
In order to understand which topology is able to take maximum advantage of RF-I, the saturation throughput improvement, average latency, and
FIGURE 6.38
Average message latency of Hierarchical+RF for a 256-node network.
[Plot: network power consumption vs. injection rate; curves for UR hier, UR hierRF-100%, UR hierRF-50%, LC hier, LC hierRF-100%, and LC hierRF-50%.]
FIGURE 6.39
Network power consumption of Hierarchical+RF for a 256-node network.
FIGURE 6.40
Energy delay product averaged up to 0.04 load rate and normalized to mesh.
[Bar chart: throughput normalized to mesh for the Mesh, Cmesh, and Hier topologies, each with Base, RF-100%, and RF-50% configurations.]
FIGURE 6.41
Throughput at 0.04 load rate of all topologies normalized to mesh.
FIGURE 6.42
Improvements by applying RF interconnect.
6.8 Applications
A representative set of benchmarks from commercial, scientific, and SPLASH suites was chosen for study. A 32-way CMP with 32 cache nodes, leading to the 64-node network, is considered. The layout of the CMP is shown in Figure 6.13. All the experimental assumptions and network parameters are explained in Section 6.6. Figure 6.43 shows the instructions per cycle (IPC) metric normalized to the simple mesh topology. We do not show the mesh+RF results here for clarity. On average, the Cmesh+RF topology provides a 37% IPC improvement. As can be seen from Figure 6.44, the hierarchical topology enhanced with RF is comparable to Cmesh+RF. Note that the percentage EDP improvement is plotted, and higher is better.
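The normalization used in Figures 6.43 and 6.44 is a simple per-benchmark ratio followed by an average. A minimal sketch (the IPC numbers below are made-up placeholders, not the measured results):

```python
# Hypothetical IPC values per benchmark (illustrative numbers only),
# keyed by topology name.
ipc = {
    "mesh":     {"sjbb": 0.80, "tpc": 0.75, "swim": 0.90},
    "cmesh_rf": {"sjbb": 1.12, "tpc": 1.05, "swim": 1.20},
}

def normalized_ipc(topology, base="mesh"):
    """IPC of each benchmark normalized to the base topology."""
    return {b: ipc[topology][b] / ipc[base][b] for b in ipc[base]}

def average_improvement_pct(topology, base="mesh"):
    """Arithmetic-mean percentage IPC improvement over the base topology."""
    norm = normalized_ipc(topology, base)
    return 100.0 * (sum(norm.values()) / len(norm) - 1.0)

print(f"avg IPC improvement: {average_improvement_pct('cmesh_rf'):.1f}%")
```

An arithmetic mean of ratios is used here for simplicity; a geometric mean is a common alternative when averaging normalized performance numbers.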
[Bar chart: IPC normalized to the base mesh for sjbb, tpc, sap, sjas, swim, apsi, art, applu, barnes, and the average; bars for cmesh, hier, bfly, cmesh+RF, and hier+RF.]
FIGURE 6.43
IPC normalized to mesh topology.
FIGURE 6.44
Energy delay product improvement for various apps over base mesh.
proposal was evaluated for a bus-based 64-way CMP and showed up to 50% latency improvement for some applications and up to 30% power reduction over a baseline electrical bus [16]. Their key insight was that an electrical network is essential for achieving high performance from optical networks. Shacham et al. proposed a hybrid optical circuit-switched and packet-switched electrical network for decreasing the power consumption of optical networks. A folded torus topology was shown to provide the lowest latency and the highest bandwidth. They obtain 576 Gbps and a total power consumption of 6 W at 22 nm technology [31]. Vantrease et al. proposed a nanophotonic interconnect for throughput optimization of a many-core CMP [33]. Their evaluations target 16 nm and use WDM and an all-optical arbitration scheme. They also modeled off-die optical interconnect. They show that, with an optical crossbar network and 3D integration, a performance improvement of up to 6 times for memory-intensive applications was observed when compared to an all-electrical network. Petracca et al. explore topologies of photonic network design for a single application [28]. Pan et al. propose a hierarchical network consisting of electrical local communication and an all-optical crossbar for global communication, and compare it to a concentrated mesh [25]. Zheng et al. provide a low-latency multicast/broadcast subnetwork and a throughput-optimized circuit-switched optical network [20].
Topology. Kumar et al. [18] presented a comprehensive analysis of interconnection mechanisms for small-scale CMPs. They evaluate a shared bus fabric, a crossbar interconnection, and point-to-point links. Pinkston and Ainsworth [2] examined the Cell Broadband Engine's interconnection network, which utilizes two rings and one bus to connect 12 core elements. The mesh network-on-chip topology has been prototyped in Polaris [32], Tile [36], and TRIPS [30] for medium-sized (50+ nodes) on-chip networks. Wang et al. [35] performed a technology-oriented, energy-aware topology exploration of mesh and torus interconnects with different degrees of connectivity. Recently, to address the power inefficiency and scalability limitations of the mesh, concentrated mesh topologies [3], high-radix topologies [15], and topologies with express channels [13] have been proposed. This chapter focuses on applying RF-I technology to current topologies. In [11] the authors propose a hierarchical network with a global mesh connecting local buses (each serving a small set of cores). While such a hierarchical network solves the power-inefficiency and latency-scalability limitations of mesh topologies, it provides lower throughput than the mesh.
6.10 Conclusions
The number of cores on-die is predicted to grow for a few technology generations. The exacerbated global interconnect delay in future technologies calls for
• The hier+RF topology that has RF-I overlaid onto the global mesh
gives the least energy delay product.
Acknowledgments
This work is supported in part by National Science Foundation awards CCF-0903432 and CCF-0702617. We would like to thank Aditya Yanamandra for insightful comments and discussions.
6.11 Glossary
Communication locality: Percentage of traffic whose destination lies
within one hop of the source.
Saturation point: The load rate at which the network latency grows
exponentially.
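The saturation point defined above can be estimated numerically from a latency-versus-load curve with a simple heuristic; the threshold factor below is an assumption, since the chapter does not prescribe one:

```python
def saturation_point(load_rates, latencies, factor=3.0):
    """Return the first injection rate at which average latency exceeds
    `factor` times the zero-load latency -- a common heuristic for locating
    the saturation point of a latency-vs-load curve."""
    zero_load = latencies[0]
    for rate, lat in zip(load_rates, latencies):
        if lat > factor * zero_load:
            return rate
    return None  # network never saturated over the measured range

# Illustrative curve (made-up points): latency explodes near 0.20
rates = [0.01, 0.04, 0.08, 0.12, 0.16, 0.20]
lats = [20.0, 21.0, 24.0, 30.0, 45.0, 90.0]
print(saturation_point(rates, lats))  # -> 0.2
```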
6.12 Bibliography
[1] International Technology Roadmap for Semiconductors (ITRS), 2008 edi-
tion, http://www.itrs.net/.
[3] J. Balfour and W. J. Dally. Design tradeoffs for tiled CMP on-chip networks. In ICS '06: Proceedings of the 20th Annual International Conference on Supercomputing, pages 187–198, New York, NY, USA, 2006. ACM.
[4] S. Borkar. Networks for multi-core chips: A contrarian view. In Special Session at ISLPED 2007.
[10] J. Cong, M.-C. F. Chang, G. Reinman, and S.-W. Tam. Multiband RF-interconnect for reconfigurable network-on-chip communications. In SLIP '09: Proceedings of the 11th International Workshop on System Level Interconnect Prediction, pages 107–108, New York, NY, USA, 2009. ACM.
Braulio García-Cámara
Optics Group, Department of Applied Physics, University of Cantabria
CONTENTS
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
7.2 Photonic Components for On-Chip Optical Interconnects . . . . . . . . . . . . . . . 254
7.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
7.2.2 Optical Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
7.2.2.1 In-Chip Light Sources: Small Emitting Devices . . . . 255
7.2.2.2 Out-of-Chip Light Sources: Modulators . . . . . . . . . . . . . . . . . 266
7.2.3 Optical Channels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
7.2.3.1 In-Waveguide Configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
7.2.3.2 Free-Space Optical Interconnect (FSOI) . . . . . . . . . . . . . . . . 283
7.2.4 Photodetectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
7.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
7.3 Why Optical Links? Comparison of Electrical and Optical Interconnects 297
7.3.1 Power Consumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
7.3.2 Propagation Delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
7.3.3 Bandwidth Density and Crosstalk Noise . . . . . . . . . . . . . . . . . . . . . . . . . . 300
7.3.4 Fan-Out . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
7.4 Photonic Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
7.5 New Optical Nanocircuits Based on Metamaterials . . . . . . . . . . . . . . . . . . . . . . 306
7.6 Present and Future of the Intra-/Inter-Chip Optical Interconnections . . . 308
7.7 Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
7.8 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
The recent and very fast advances in computation have led to important improvements in the design and manufacture of chips. However, high-speed needs and the very large quantity of information that must be shared are constrained by physical limitations. The impedance and size of metallic wires will not suffice for future technologies, so new devices must be developed. Some alternatives have been proposed to overcome this challenge. One of them is the use of optical components for inter- and intra-chip communications, replacing electrical signals with light. New advances in photonic devices, engineered materials, photolithography, and nano-manufacturing techniques have brought this alternative closer and, in the near future, it could
7.1 Introduction
It is well known that the evolution of microelectronics, and in particular of chips, follows the famous Moore's law [76]. The exponential growth of electronic devices established by this law has so far kept pace with the amount of data that must be processed. However, data volumes continue to increase and microelectronic technologies are approaching their physical limits. The so-called "interconnection bottleneck" is reflected in the increasing difference between the gate delay (switching speed) of a transistor and the propagation delay along the wires between transistors of an integrated circuit, as shown in Figure 7.1.
In a few years, current electronic devices will not be able to transport and process the quantity of data that we will need. Until now, the physical characteristics of electrons have made them suitable for processing information via discrete logic. They can also be sent very fast through metal wires, and are therefore useful for transferring information as well. However, recent innovations need to transfer large amounts of information very quickly, and this speed is limited in metal lines by their resistance, capacitance, and reliability. In addition, the growing number of elements on the chip demands a large number of interconnections with little space left for wires. This high density of interconnections, high transmission speed, and tight power budget cannot be achieved with the existing metallic wires. For this reason, one of the best options is to use electronics for processing the information while photons are used as carriers for the bits. As an example, Figure 7.2 shows the evolution of telecommunications technologies and their relative capacity. It can be observed that the capacity rates of systems that use optical fibers as transmission channels cannot be obtained using metallic channels.
The International Technology Roadmap for Semiconductors (ITRS) of 2007 [1] stated that optical interconnections (OI) present some advantages with respect to electrical interconnections (EI), for example,
FIGURE 7.1
Comparison of technology trends in transistor gate delay versus Al/SiO2 interconnect propagation delay [101].
FIGURE 7.2
Increase in relative information capacity of communication channels versus
time [52].
FIGURE 7.3
Scheme of the main components of an OI and the different devices that can
perform those tasks.
In order to deal with each of these items, the present chapter is organized as follows. Section 7.2 is devoted to discussing the main optical components of a typical optical interconnect (OI). Although, as will be explained, an OI can be formed of several elements, this section focuses only on the three main ones (transmitter, channel, and receiver). There are many devices that can perform these tasks, as can be seen in Figure 7.3; the newest devices and their main characteristics are analyzed here. In Section 7.3, a comparison between electrical and optical interconnects is carried out. The main advantages and disadvantages of both are analyzed for several important parameters: power consumption, propagation delay, bandwidth density, crosstalk noise, and fan-out. This section explains why optical interconnects are adequate for inter- and/or intra-chip communications. Photonic networks are also an interesting case, whose number of applications in systems-on-chip increases day by day; for this reason, Section 7.4 is devoted to their study. In recent years, researchers have been trying to improve the characteristics of current photonic components and to widen their applications. Metamaterials allow this by tuning the optical properties of materials. Thus, in Section 7.5, the possible applications of metamaterials to building new optical nanocircuits are discussed. Finally, an analysis of the current development of this new type of on-chip communication is included in the final section.
FIGURE 7.4
Scheme of a discrete optical interconnection.
FIGURE 7.5
Whispering gallery mode in a droplet with a diameter of 130 μm. From the web site of the Ultrafast and Optical Amplifiers Group at IIT Madras (India).
with these characteristics. In this section, a few of them are analyzed in order to give a brief view of this kind of light emitter.
Emitting Cavities
Very small optical structures can present high-quality-factor (Q-factor) resonances. In general, these resonant modes are produced by geometrical factors and are called morphology-dependent resonances (MDRs) or whispering gallery modes (WGMs) [28] (see Figure 7.5). The physical explanation of these MDRs is based on the propagation of light rays inside the resonant structure. The rays approach the inner surface at an angle beyond the critical angle and are totally internally reflected (TIR) [4]. Light is thus confined inside the structure. After propagating around the cavity, the rays return, in phase, to the point at which they entered and follow the same path again without attenuation, due to constructive interference [10]. The frequency or wavelength at which these resonances are excited depends on the size, shape, and refractive index of the structure. MDRs have been observed in spheres [74], cylinders [67], and other complex geometries [3].
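To first order, a WGM resonance occurs when the optical path around the cavity contains an integer number of wavelengths, 2πRn ≈ mλ, which is one way to see the dependence on size and refractive index mentioned above. A rough sketch under this ray-optics assumption (it ignores polarization and higher-order mode corrections; the sphere parameters are illustrative):

```python
import math

def wgm_resonances(radius_um, n_index, lam_min_um, lam_max_um):
    """Approximate whispering-gallery resonance wavelengths from the
    first-order condition 2*pi*R*n = m*lambda, for integer mode number m.
    Returns {m: wavelength_um} for resonances inside the given band."""
    path = 2.0 * math.pi * radius_um * n_index  # optical path length (um)
    m_max = int(path / lam_min_um)
    m_min = max(1, math.ceil(path / lam_max_um))
    return {m: path / m for m in range(m_min, m_max + 1)}

# A 5-um-radius silicon sphere (n ~ 3.5) near the 1.55-um telecom band:
for m, lam in sorted(wgm_resonances(5.0, 3.5, 1.50, 1.60).items()):
    print(f"m = {m}: lambda ~ {lam:.3f} um")
```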
Some applications have been proposed in recent years using these MDRs [69]: filters [97], tunable oscillators [61], optical switching [124], sensing [89], and even high-resolution spectroscopy [95]. However, the main application in which we are interested here is the possibility of generating lasers based on them [113].
The simplest geometry that presents these resonances is a sphere. Silicon, moreover, is a well-known material with characteristics adequate for integration in systems-on-chip, and the size of these resonators is well suited for inter- or intra-chip optical interconnections. For these reasons, silicon spheres could be ideal for microphotonics. Some works have focused on their application in optical communications. As an example, in reference [96], the authors
Intra-/Inter-Chip Optical Communications 257
FIGURE 7.6
Scheme of the coupling between a silicon microsphere and a silicon optical
fiber to excite the corresponding MDRs. The coupling distance is b [96].
FIGURE 7.7
Spectra of light scattering by a silicon microsphere in the forward direction
(θ = 0◦ ) and at 90◦ when MDRs are excited [96].
FIGURE 7.8
Schematic representation of the surface-plasmon-polariton (SPP) microcavity
proposed in [72]. Light is pumped into the cavity through a tapered optical
fiber passing under its edge. A transverse cross-section of the cavity is shown
for clarity.
FIGURE 7.9
Description of the metallic nanocavity implemented experimentally in [44]. The values correspond to the doping levels in cm−3 and SI means semi-insulating.
through the injection of an electrical signal (electrons at the top of the pillar and holes through the p-InGaAsP layer). Unfortunately, this small laser active-medium volume gives rise to small quality factors and to a temperature dependence of the laser operation. The Q-factor obtained for these devices is only around 48 at room temperature. However, it could be improved to values around 180 by replacing the gold capsule with a silver one and extending the InP regions. Operating at lower temperatures also increases the quality factor, with values such as Q = 268 at T = 10 K and Q = 144 at T = 77 K. These small values mean that this kind of laser cannot yet be used for optical interconnects. Nevertheless, it can be considered a first step in the development of nanolasers, because the dimensions could be decreased even further by considering other geometries for the cross section of the pillar, as the authors remarked in their work [44].
Raman Lasers
Another physical effect with applications in the generation of ultra-small optical sources is the so-called stimulated Raman scattering (SRS). This effect was described by Raman and Krishnan in 1928 [90]. It is an inelastic scattering process that consists of the absorption of a photon by the scatterer. If this photon has enough energy, it can produce the excitation or de-excitation of vibrational, rotational, or electronic levels, producing scattered photons with an energy higher (anti-Stokes) or lower (Stokes) than that of the incident photons [5]. Recently, it was demonstrated that this kind of process can produce lasing in ring cavities formed by a silicon waveguide incorporated into an 8-m-long optical fiber [15]. In this work, the authors show that the silicon waveguide acts as a gain medium producing laser pulses at the Stokes wavelength of 1675 nm (the pump laser emits at 1540 nm) with a gain of around 3.9 dB at the threshold limit (9 W). For pump powers larger than the threshold, the gain increases almost linearly, with a slope efficiency of ∼ 8.5%.
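The Stokes wavelength quoted above follows from subtracting the material's Raman shift from the pump frequency; for first-order Raman scattering in silicon the shift is about 15.6 THz (a standard value, assumed here rather than taken from this excerpt):

```python
C = 299_792_458.0  # speed of light, m/s

def stokes_wavelength_nm(pump_nm, raman_shift_thz=15.6):
    """Stokes wavelength of a Raman laser: the scattered photon's frequency
    equals the pump frequency minus the material's Raman shift (~15.6 THz
    for first-order Raman scattering in silicon)."""
    pump_hz = C / (pump_nm * 1e-9)
    stokes_hz = pump_hz - raman_shift_thz * 1e12
    return C / stokes_hz * 1e9

# A 1540 nm pump lands close to the 1675 nm Stokes line cited above:
print(f"{stokes_wavelength_nm(1540):.1f} nm")
```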
Following this idea, H. Rong and coworkers demonstrated in [92] an experimental Raman laser integrated in a single silicon chip, one of the first steps toward the integration of a laser for optical communications. The proposed device is similar to that of [15] and follows the same physical principles, but is optimized for integration in a system-on-chip. It is formed by a single-mode rib waveguide with a reverse-biased p-i-n diode structure in order to minimize the losses. Its geometrical cross section is schematically plotted in Figure 7.10, with the following experimental dimensions: W (waveguide width) = 1.5 μm, H (rib height) = 1.55 μm, and d (etch depth) = 0.7 μm, which gives rise to an effective area of around 1.6 μm2 for the core of the waveguide. However, the overall device has larger dimensions. The laser cavity is formed by an S-shaped curve of this waveguide, as can be seen in Figure 7.11. One side of the waveguide is coated with a high-reflectivity multilayer (∼ 90%) for the pump (1536 nm) and Raman (1670 nm) wavelengths, while the other is left uncoated, presenting a reflectivity of ∼ 30% for both wavelengths.
FIGURE 7.10
Diagram of the cross section of a silicon-on-insulator (SOI) rib waveguide with
a reverse biased p-i-n diode structure as it is presented in [92].
FIGURE 7.11
Experimental set-up of a silicon Raman laser integrated on a silicon chip [92].
The spectrum of this device presents a very narrow peak, which implies a high quality factor (Figure 7.12).
The Raman scattering process presents an important source of losses, two-photon absorption (TPA), which produces a considerable amount of free carriers. To reduce this effect, the researchers included the p-i-n diode structure. Under an applied electric field, the electron-hole pairs generated by TPA are guided to the p- or n-doped regions, reducing the losses they generate in the silicon waveguide and strongly increasing the output power of the system. This scheme presents several advantages with respect to similar devices like the one described above [15]. The first and most important is that it is designed for integration on a chip, so the materials and the experimental techniques are CMOS compatible. The laser characteristics are also quite interesting, with a gain of ∼ 5.2 dB for a threshold of ∼ 0.4 mW and a slope efficiency of 9.4%. These results correspond to the output power of
FIGURE 7.12
Silicon Raman laser spectra for a pump power of 0.7 mW compared with the
spontaneous emission from a similar waveguide without cavity [92].
the uncoated facet of the waveguide. If both facets are considered, the slope
efficiency can be increased to 10%.
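The threshold and slope-efficiency figures quoted above describe an approximately linear above-threshold characteristic, which can be sketched as follows (an idealized model with illustrative pump powers, not a reproduction of the measured curves):

```python
def laser_output_mw(pump_mw, threshold_mw, slope_efficiency):
    """Idealized above-threshold laser characteristic:
    P_out = slope_efficiency * (P_pump - P_threshold), clamped to zero
    below threshold."""
    return max(0.0, slope_efficiency * (pump_mw - threshold_mw))

# Illustrative sweep in the spirit of the figures quoted above
# (threshold ~0.4 mW, slope efficiency ~9.4%):
for pump in (0.2, 0.5, 1.0, 2.0):
    out = laser_output_mw(pump, threshold_mw=0.4, slope_efficiency=0.094)
    print(f"pump {pump} mW -> output {out:.4f} mW")
```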
Semiconductor Sources
The main disadvantage of the emitting cavities is that they have to be pumped with another optical signal, so an optical fiber must be coupled to the system. Some of the Raman lasers explained before are electrically pumped, which solves the problem of having an optical source. However, the optimized optoelectronic systems for transforming an electrical signal into an optical one are the semiconductor photon sources. These devices are based on a p-n junction with a direct-gap semiconductor in the middle (see Figure 7.13) emitting photons by electron-hole recombination. The generated photons can be reabsorbed, producing a new electron-hole pair (nonradiative recombination), or radiated (radiative recombination) (see Figure 7.14). Usually, an external potential is applied to the p-n junction to produce a population inversion in the semiconductor with a large number of electron-hole pairs. This is the physical principle of a semiconductor source [87], which includes light-emitting diodes (LEDs), semiconductor optical amplifiers, and semiconductor injection lasers. All of them are very efficient transformers of electrical energy into optical energy. Moreover, their small size, ease of fabrication, and compatibility with other electronic devices make them ideal for several electronic applications.
LEDs are quite simple and cheap systems. Their common characteristics with other semiconductor devices make them good candidates for output
FIGURE 7.13
Basic scheme of a p-n junction.
FIGURE 7.14
Diagram of the recombination processes in a semiconductor material.
FIGURE 7.15
Schematic cross section of a VCSEL obtained from [19].
6. High-density 2D arrays
FIGURE 7.16
Schematic diagram of an asymmetric Mach-Zehnder interferometer including a phase shifter in the two arms [65].
that is also an optical device and is integrated on-chip. The main parameter considered in the modulation task is the variation of the effective refractive index (Δneff) in the active area (or phase shifter) of the modulator. At this point, two kinds of modulators can be distinguished: Mach-Zehnder interferometer-based (MZI) modulators and micro-ring resonator-based modulators.
FIGURE 7.17
Scheme of the cross section of the MOS capacitor proposed by A. Liu and coworkers in [65]. The gate oxide thickness is 120 Å. The polysilicon rib and gate oxide widths are both ∼ 2.5 μm. The doping concentrations of the n-type and p-type polysilicon are ∼ 1.7 × 10^16 cm−3 and ∼ 3 × 10^16 cm−3, respectively.
FIGURE 7.18
Phase shift produced by a MOS capacitor as a function of the driven voltage
(VD ) and for several lengths at λ = 1550 nm (symbols represent experimental
results while lines correspond to simulated values). Figure obtained from [65].
tively, e the electron charge, t the effective charge-layer thickness, and VFB the flat-band voltage. This change in the charge density due to the presence of free carriers produces changes in the refractive index, which become larger for higher driven voltages. Consequently, changes in the effective refractive index of the material induce a phase shift through Equation (7.2). In Figure 7.18, the phase shift is plotted as a function of the driven voltage for different capacitor lengths, showing these dependencies.
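The paragraph above ties the phase shift to the effective index change through Equation (7.2), which falls outside this excerpt; the standard MZI phase-shifter relation is Δφ = 2π·Δneff·L/λ, and the sketch below assumes that form. The index-change value used is an illustrative assumption:

```python
import math

def phase_shift_rad(delta_n_eff, length_mm, wavelength_nm=1550.0):
    """Phase shift of a waveguide phase shifter, assuming the standard
    relation delta_phi = 2*pi * delta_n_eff * L / lambda."""
    length_nm = length_mm * 1e6  # mm -> nm
    return 2.0 * math.pi * delta_n_eff * length_nm / wavelength_nm

def length_for_pi_mm(delta_n_eff, wavelength_nm=1550.0):
    """Shifter length (mm) needed for a full pi phase shift."""
    return (wavelength_nm / (2.0 * delta_n_eff)) * 1e-6  # nm -> mm

# With an assumed effective index change of 1e-4 at 1550 nm:
print(f"{length_for_pi_mm(1e-4):.2f} mm")  # 7.75 mm
```

The linear scaling with length is what Figure 7.18 shows: for a fixed driven voltage, a longer capacitor accumulates proportionally more phase.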
The device initially proposed by A. Liu et al. [65] presented a 3-dB bandwidth exceeding 1 GHz for a capacitor only 2.5 mm long, and a data rate of up to 1 Gb/s. Another interesting figure of merit of these devices is the product VπL = (VD − VFB)L, whose value is ∼ 8 V·cm for this model. In fact, these values can easily be improved by refining the capacitor parameters, as the authors recognized. In particular, the doped and undoped poly-Si areas present high losses and are the main source of losses in the chip, with values of ∼ 6.7 dB. An improved version of the capacitor was developed in [64] using crystalline Si, which is significantly less lossy, for both the n-type and p-type regions, grown with a technique called epitaxial lateral overgrowth (ELO). The doping concentrations also influence the response of the device; because of that, this enhanced version presents higher dopant concentrations as well. The overall device is 15 mm long but has a phase-shifter cross section of only 1.6 × 1.6 μm2. This reduction improves the characteristics, with a VπL parameter of only 3.3 V·cm, a 3-dB bandwidth of around 10 GHz, and a data rate of 10 Gb/s, which is an improvement.
270 Communication Architectures for SoC
FIGURE 7.19
Schematic cross section of the phase shifter based on carrier depletion pro-
posed by (a) A. Liu et al. [66] and (b) N.N. Feng et al. [33].
FIGURE 7.20
Phase shift as a function of the driven voltage for an individual shifter similar
to that explained in [66] for different device lengths and an incident wavelength
of λ = 1550 nm.
WD = [2 ε0 εr (VBi + VD )/(e NA )]^(1/2)                                  (7.4)
ε0 and εr being the vacuum and low-frequency relative permittivity, respec-
tively, e the electron charge, NA the acceptor concentration, and VBi the
built-in voltage. Since the depletion width, and hence the free
charge density, changes with the square root of the applied voltage (or driven
voltage, to be consistent with the previous configuration), the phase shift
varies nonlinearly with WD , unlike what happens in a MOS mod-
ulator. In Figure 7.20 the phase shift versus the applied voltage is plotted to
show the nonlinear dependency that follows from relation (7.4). Although this
figure corresponds to the device proposed in [66], similar curves are observed
for other devices such as those presented in [33]. Usually, WD is much smaller than
the waveguide height in which the shifter is integrated. For this reason, the
pn junction position is an important parameter that can optimize the phase
modulation.
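A small numerical sketch of Equation (7.4); εr , NA , and VBi below are assumed example values for a silicon junction, not parameters taken from [66]:

```python
import math

# Sketch of Eq. (7.4): W_D = sqrt(2*eps0*eps_r*(V_Bi + V_D) / (e*N_A)).
# eps_r, N_A and V_Bi are illustrative assumptions for a Si pn junction.
EPS0 = 8.854e-12     # vacuum permittivity (F/m)
E_CH = 1.602e-19     # electron charge (C)

def depletion_width(v_d, n_a, eps_r=11.7, v_bi=0.7):
    """Depletion width in meters; n_a is the acceptor concentration in m^-3."""
    return math.sqrt(2 * EPS0 * eps_r * (v_bi + v_d) / (E_CH * n_a))

n_a = 5e23  # 5e17 cm^-3 expressed in m^-3 (assumed doping)
for v_d in (0.0, 2.0, 4.0):
    print(f"V_D = {v_d:.0f} V -> W_D = {depletion_width(v_d, n_a)*1e9:.0f} nm")
```

The square-root growth of WD with voltage is exactly what makes the phase response nonlinear, and the resulting tens-of-nanometers widths are indeed much smaller than typical waveguide heights.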
The phase shift produced in the device depends also on other parameters.
The following expression shows the main dependencies of the change in the
phase induced in a carrier-depletion shifter [33]
Δϕ = 2π Γ Δn L / λ                                                        (7.5)
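A quick numerical sketch of Equation (7.5), assuming Γ denotes the optical confinement factor; the values of Γ and Δn below are illustrative, not measured data from [33]:

```python
import math

# Sketch of Eq. (7.5). Gamma (assumed: confinement factor) and Delta_n
# are illustrative values, not device data from [33].
def phase_shift(gamma, delta_n, length_m, wavelength_m=1550e-9):
    """Delta_phi = 2*pi*Gamma*Delta_n*L / lambda, in radians."""
    return 2 * math.pi * gamma * delta_n * length_m / wavelength_m

# Example: 80% confinement, free-carrier index change of 5e-4, 1 mm shifter
dphi = phase_shift(gamma=0.8, delta_n=5e-4, length_m=1e-3)
print(f"phase shift = {dphi:.2f} rad ({dphi/math.pi:.2f} pi)")
```

This makes the linear trade-off explicit: halving Δn requires doubling the shifter length for the same phase shift.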
FIGURE 7.21
Scheme of the structure of a micro-ring resonator-based modulator similar to
that analyzed in [2].
their characteristics are not yet competitive, except for the losses, which are quite
small (∼ 0.5–2 dB).
FIGURE 7.22
Quasi-TM transmission spectrum of a single-couple ring resonator [2].
without coupling into the ring. In Figure 7.22, the transmission of the wave-
guide adjacent to a micro-ring resonator, as reported in [2], is shown. As can be
seen, when the incident wavelength fulfills Equation (7.6), light transmission
through the adjacent waveguide drops strongly.
By modifying the resonant condition of the micro-ring, the signal at the end
of the adjacent waveguide can be intensity-modulated. From Equation (7.6),
it is easy to conclude that by tuning the effective refractive index of the ring
waveguide (nef f ), the resonant wavelength is also tuned and the transmitted
light is modulated. The resonant behavior is quite sensitive to changes in nef f ,
so a small change in n (∼ 10−3 ) can produce a modulation depth (M D =
(Imax − Imin )/Imax ) of ∼ 80%.
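A one-line sketch of the modulation-depth definition above; the transmission values are illustrative, chosen to reproduce the ∼80% figure:

```python
# Sketch of the modulation-depth definition MD = (Imax - Imin) / Imax.
# The intensity values are assumed examples, not measurements from [2].
def modulation_depth(i_max, i_min):
    """Fractional modulation depth from off- and on-resonance intensities."""
    return (i_max - i_min) / i_max

# An on-resonance dip to 20% of the off-resonance transmission
# corresponds to the ~80% depth quoted in the text.
print(f"MD = {modulation_depth(1.0, 0.2):.0%}")  # 80%
```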
The adjustment of nef f can be made by injecting free carriers. This task
can be accomplished optically using a pump beam as in [2] or electrically using
a p-i-n junction as in [120, 119]. In the first case, the pump pulses can have
very low energy (∼ 25 pJ), because low energies suffice to induce a free-
carrier concentration of ΔN = ΔP = 1.6×1017 cm−3 , which in turn produces a
Δnef f = −4.8×10−4 that shifts the resonant peak. However, the inclusion of a
pump beam involves more complex optics. In contrast, an electro-optical
modulator that uses a p-i-n junction embedded in the ring resonator can be
less complex than the previous one. This device (see Figure 7.23) has the same
geometry as that shown in Figure 7.21, with the difference that neighboring n+
and p+ regions are defined with photolithography and implanted with boron
and phosphorus, as described by Q. Xu et al. in [120]. This configuration is
quite similar to the forward-biased p-i-n junction MZI modulators discussed
above. As has been remarked, this kind of configuration produces high
Intra-/Inter-Chip Optical Communications 275
FIGURE 7.23
Scheme of the electro-optic modulator described in [120].
modulation depths but low speeds. However, the resonant behavior of these
recent devices overcomes this challenge, removing the speed restriction.
The main advantage of these devices is their size. In both discussed cases
[2, 120], the ring waveguide is a rectangular silicon waveguide,
250 nm high and 450 nm wide, forming a ring with a radius of 5–6 μm
located 200 nm from the adjacent waveguide. A SEM image of the device
in [120] is shown in Figure 7.24. Their integration in on-chip OIs is therefore
possible, in particular for WDM interconnects.
With this kind of device, high-depth modulation has been reported at operation
speeds of 1.5 Gb/s, and even 12.5 Gb/s has been achieved with an improved
device [119]. Also, losses are quite small in these devices, at 4 ± 1 dB/cm.
However, they present two important disadvantages: i) they operate over very
narrow bandwidths due to the resonant condition, and ii) they are quite tem-
perature-dependent. Both issues are challenges that should be studied in detail
before they can be solved.
FIGURE 7.24
SEM image of the device fabricated by Q. Xu et al. [120]. In the inset, a zoom
of the coupling region is shown.
TABLE 7.2
General characteristics for silicon and polymer
waveguides as shown in [54]
Waveguide Materials Si Polymer
Refractive index 3.5 1.5
Width (μm) 0.5 5
Separation (μm) 5 20
Pitch (μm) 5.5 25
Loss (dB/cm) 1.3 1
Silicon Waveguides
Silicon waveguides are a very attractive option for optical lines in on-chip
communications due to their simple fabrication by chemical or plasma etching,
the low absorption of silicon in the 1.3–1.5 μm wavelength range, and the
ease of manipulating its refractive index, and hence the propagation modes,
through free-carrier injection/generation. As for many cases in this chapter,
there are many different configurations and techniques for the fabrication of
silicon waveguides [46]. Here, for space reasons, only two of them can
be briefly discussed: silicon on silicon dioxide [100] and doped silica-based
waveguides [91]. However, other alternatives remain valid, for example,
silicon on sapphire (SOS) waveguides.
Silicon on silicon dioxide waveguides are also known as silicon-on-insulator
(SOI) waveguides and are made of crystalline silicon on an oxide layer, which
ensures a good guiding medium. Recent silicon technologies make it possible to
obtain high-quality SOI wafers that are ideal for the implementation of planar
waveguides. The very different refractive indices of silicon (n = 3.45) and silica
(n = 1.45) ensure strong light confinement. Although this large index contrast
sometimes induces nonlinear optical effects (Raman scattering, Kerr effects)
that can be useful for active applications (amplifiers, silicon lasers), it is also
an important source of losses.
One of the main problems associated with the design of SOI waveguides
is the choice of the thickness of the silica cladding. The silicon core of
the waveguide must be isolated to minimize crosstalk with
the substrate. SiO2 thicknesses larger than a few nanometers are incom-
patible with the CMOS layer [100]. For this reason, new techniques devoted
to optimizing it have to be developed. Another fabrication challenge is related
to the integration of optics with electronics and the temperatures to which
optical elements are subjected during the CMOS process. Usually, tem-
peratures higher than 1000◦C are reached in CMOS processing, and this
temperature can damage optical elements. To overcome these challenges, re-
FIGURE 7.25
Fabrication process, step by step, of a SOI waveguide as it was proposed in
[100]: (a) bulk silicon wafer, (b) Si3 N4 deposition, (c) lithographic waveguide
definition, (d) ICP etch, (e) Si3 N4 deposition, (f) ICP cap etch, (g) extended
etch for quicker oxidation, (h) wet oxidation for buffer layer growth, (i) ex-
tended wet oxidation for waveguide underlayer flattening, and (j) upper layer
oxide deposition to complete optical buffer.
Polymer Waveguides
Polymers are versatile materials whose use for optical devices is quite in-
teresting. Their good properties (thermal, mechanical, or environmental stabil-
ity), the choice among a large number of materials and, more impor-
tantly, their low material and production costs make them an excellent alterna-
tive for the fabrication of single-mode or multimode planar waveguides. They
can be made of polymers, oligomers, monomers, thermoplastics, or thermosets.
This wide range of materials, along with the number of different manufacturing
processes and ways to manipulate them, allows almost complete tuning of
the properties of polymer-based waveguides. For instance, polymer-based wave-
guides can present a variable refractive index along their length; these are the
so-called graded-index waveguides described in [58]. Also, they are compat-
ible with several substrates and, in particular, with chip substrates, whether
rigid or flexible. The main conclusion of this discussion
is that polymer materials permit mass production of photonic
circuits with low cost, high-quality properties, and high ruggedness.
Table 7.3 summarizes the main commercial polymer wave-
guides and their optical properties. Other polymers are also used to obtain
high-quality waveguides, for example SU-8, which was used to fabricate a 10
Gbps multimode optical guide in [21]. From the table, the low values of optical
losses can be noted. In general, the combination of different materials makes
polymer waveguides low-loss and almost polarization independent
(losses for TE and TM polarizations are very similar). In addition, the wide
range of manufacturing techniques (RIE, photolithography, etc.) allows better
matching between the characteristics of the device and the given requirements
for a certain application.
The most common fabrication technique is photolithography through the
use of masks in a multistep process. However, laser direct printing, though not
widespread, presents several advantages with respect to photolithography.
First, it is a rapid process that does not require prior steps such as the design
and manufacture of a mask. Second, the process does not affect areas of the
sample other than the one where the waveguide is patterned. And finally, it
allows printing new structures that are impossible to obtain through the use
of photolithography masks.
One important disadvantage of polymers for guiding applications is the
temperature dependence of their refractive index. This dependence is usually
around Δn ∼ −2 · 10−4 to −3 · 10−4 per ◦ C. This characteristic could be use-
TABLE 7.3
Key properties of optical polymers developed worldwide. Obtained from [30]
Manufacturer   Polymer Type     Patterning                 Optical Loss          Other Properties
               [Trade Name]     Techniques                 (dB/cm) [at λ (nm)]   [at λ (nm)]
AlliedSignal   Acrylate         Photoexposure/wet etch,    0.02 [840]            Birefringence: 0.0002 [1550]
                                RIE, laser ablation        0.2 [1300]            Crosslinked, Tg : 25◦ C
                                                           0.5 [1550]            Environmentally stable
               Halogenated      Photoexposure/wet etch     < 0.1 [840]           Birefringence: 10−5 [1550]
ful for active optical devices, such as filters, but for a waveguide it is considered a
handicap, because thermal variations will involve changes in the propagation
modes. Hence, this is an important parameter to control during the design and
fabrication process of a polymer-based waveguide for communications appli-
cations.
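To put this drift in perspective, a short sketch using the quoted dn/dT range (the 50 ◦C temperature swing is an assumed example):

```python
# Sketch of polymer thermo-optic drift: Delta_n = (dn/dT) * Delta_T.
# The dn/dT values are the ones quoted in the text; the 50 C swing is
# an assumed example of on-chip warm-up.
def index_drift(dn_dt, delta_t):
    """Refractive-index change for a temperature excursion delta_t (in C)."""
    return dn_dt * delta_t

for dn_dt in (-2e-4, -3e-4):
    dn = index_drift(dn_dt, 50.0)
    print(f"dn/dT = {dn_dt:+.0e} /C -> Delta_n = {dn:+.3f}")
```

A drift of ∼0.01 is an order of magnitude larger than the ∼10−3 index changes exploited for modulation earlier in the chapter, which illustrates why thermal control matters.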
FIGURE 7.26
Scheme of a multilayer system integrating a free-space optical interconnect
[121].
FIGURE 7.27
Evolution of the signal-to-noise ratio (SNR) as a function of the interconnect
distance for a lensless FSOI. Obtained from [117].
vergence of the output light from the source is usually corrected by the
micro-lens cluster. By introducing a micro-lens, the beam divergence can be
reduced from 12◦ to only 1◦ [108]. However, under certain conditions the beam
divergence is not a problem and the lens array only adds a more complex
structure, more difficult manufacturing, higher space requirements, and higher
cost. For these cases, the OI can be designed without a lens as proposed sev-
eral years ago by R. Wang and coworkers [117]. Of course, this configuration is
strongly limited by the interconnection distance, being valid only for d < 10
mm and without any complex routing. As an example, in Figure 7.27, the
evolution of the signal-to-noise ratio (SNR) as a function of the distance is
shown for a lensless FSOI. In the case shown, the source is a VCSEL with an
aperture of φlaser = 0.03 mm in diameter, the corresponding photoreceptor
has a diameter of φP D = 0.32 mm, and the channel spacing is l = 0.40 mm. As
can be seen, the SNR depends strongly on the distance. But it also depends on
the mode order of the input light, in such a way that the SNR is higher for short
distances and low mode orders. The order dependence is directly related
to the fact that for higher orders, the spatial distribution of the beam is
also wider.
Conversely, if the given optical link has very restrictive conditions
for SNR, losses, crosstalk, etc., the system can be complemented with other
optical elements that improve its characteristics. For example, by including
micro-prisms and a macro-mirror, M. McFadden and coauthors reported an in-
crease of the channel density [70]. As was remarked above, this kind of OI was
inspired by wireless communications for atmospheric or space
applications. In those cases, adaptive optical systems are necessary to
correct the atmospheric fluctuations. Similar devices can therefore be integrated
in FSOIs to increase the propagation quality, as was done by C. Henderson
et al. in [43]. They used a ferroelectric liquid crystal on silicon spatial light
modulator (LCOS-SLM), achieving communications at 1.25 Gb/s when illuminated
with an 850 nm light beam. However, these improvements involve high space
and power requirements and complex manufacturing techniques. For this
reason, they are found only in very particular implementations.
7.2.4 Photodetectors
The optical receiver of an OI is composed of a light detector (photodetector
[PD]), an amplification circuit (transimpedance amplifier [TIA], voltage am-
plifier), and clock and data recovery circuitry. In this section only the photo-
detector will be analyzed, because it is the only fully optical device.
Although there are several types of photodetectors, the most usual devices
for interconnections are designed as transducers from an optical signal to an
electrical current with a p-i-n structure using a semiconductor on top of a Si
substrate [9, 83]. The physical process on which these devices are based is
the generation of an electron-hole (e-h) pair due to the absorption of a photon
by the semiconductor material. Commonly, the semiconductor is also sub-
jected to a potential difference that accelerates the charges, increasing the
sensitivity of the detector. The semiconductor material is chosen depending
on the wavelength range at which the device will work. The typical materials
used in the implementation of photodiodes and the spectral ranges at which
they present optimum absorption are shown in Table 7.4.
TABLE 7.4
Optimum spectral range of absorption for
some semiconductors
Material Wavelength Range (nm)
Silicon 190–1100
Germanium 800–1700
InGaAs 800–2600
FIGURE 7.28
Scheme of a Ge-on-SOI photodetector with a normal incidence geometry as
shown in [57].
TABLE 7.5
Simulated characteristics for a Ge-on-SOI photodetector [57]. The labels
mean: S (Finger spacing), VB (Bias Voltage), QE (Quantum Efficiency),
R (Responsivity), −3 dB-Band (−3 dB bandwidth), DC (Dark Current),
and DA (Detector Area)
S (μm)   VB (V)   QE (%)       R (A/W)      3-dB Band   DC (nA)   DA (μm2 )
                  [λ=850 nm]   [λ=850 nm]   (GHz)
0.4      −0.5     30           0.21         27          8         10x10
0.4      −1       30           0.21         29          85        10x10
0.6      −0.5     33           0.23         25          7         10x10
0.6      −1       33           0.23         27          24        10x10
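As a consistency check, the standard photodiode relation R = η q λ/(h c) (general device physics, not a formula taken from [57]) reproduces the responsivities listed in Table 7.5:

```python
# Standard photodiode relation: responsivity R = eta * q * lambda / (h * c).
# This is a textbook formula, not one taken from [57].
Q = 1.602e-19      # electron charge (C)
H = 6.626e-34      # Planck constant (J s)
C = 2.998e8        # speed of light (m/s)

def responsivity(eta, wavelength_m):
    """Responsivity in A/W from quantum efficiency eta and wavelength."""
    return eta * Q * wavelength_m / (H * C)

# Table 7.5 rows: QE = 30% and 33% at 850 nm
for eta in (0.30, 0.33):
    print(f"eta = {eta:.0%} -> R = {responsivity(eta, 850e-9):.2f} A/W")
```

The computed values, 0.21 A/W and 0.23 A/W, match the table's responsivity column.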
FIGURE 7.29
Scheme of the germanium wavelength-integrated avalanche photodetector
(APD) described in reference [7].
FIGURE 7.30
3-dB bandwidth as a function of the bias voltage for an APD germanium pho-
todetector [7]. The upper curve corresponds to a contact spacing of 200 nm
while the lower one corresponds to 400 nm.
FIGURE 7.31
Responsivity of a Ge-waveguide photodetector as a function of the bias voltage
for an incident wavelength λ = 1.3 μm [6].
FIGURE 7.32
Low-power optimized scheme of a detector including a Ge-on-SOI photodiode
and the necessary circuitry [57].
to another. The operation speed can thus be increased by reducing the size of
the active region in such a way that the electron path is shortened. However,
a reduction of the active region (to sizes smaller than the wavelength) also
means a decrease in the responsivity of the device due to the diffraction limit.
This disadvantage can be overcome if high electric fields can be confined
in a subwavelength area. When a surface plasmon resonance is excited in
a nanoantenna [80, 34], the near field is strongly enhanced and confined in
extremely small volumes. Then, as was demonstrated in reference [45], the
photogeneration of carriers in a semiconductor can be improved by a surface-
plasmon nano-antenna. In this sense, in recent years new photodetectors
that implement plasmonic nano-antennas have appeared.
The simplest example of improving the characteristics of a pho-
todetector with plasmonic features is shown in [94]. In this work the au-
thors proposed a simple p-n junction of silicon with two aluminium contacts.
In the incidence window, the researchers deposited gold nanoparticles of different
sizes (50 nm, 80 nm, and 100 nm in diameter). With this configuration, the au-
thors reported that the response of the photodetector is globally increased for
wavelengths below the resonant wavelengths due to the excitation of lo-
calized surface plasmon resonances in each particle. This enhancement reaches
values of 50%–80% at the wavelength at which the resonance is excited, and
depends on the particle size, as is well known [80]. Thus, the sensitivity
enhancement and the small size of these plasmonic photodetectors
make them very suitable for intra- and inter-chip links.
Another interesting configuration of plasmonic structures with direct appli-
cation to photodetectors consists of a subwavelength aperture in a
metallic film surrounded by periodic corrugations [68, 11]. In this case, the
electric field is strongly enhanced and confined in a very small area (see Fig-
ure 7.33 (a)); the semiconductor active region can then be reduced to the same
size as the aperture (10–100 nm), drastically improving the operation speed
without any reduction in the responsivity of the device. A scheme of a
device based on these features is shown in Figure 7.33(b).
The different ways to produce surface plasmon polaritons give rise to several
photodetector designs based on them. For example, it is interesting to
mention the designs proposed by L. Tang et al.: in [111] they used a
single C-shaped nanoaperture in a gold film, and in [109] they implemented a
nano-antenna formed by two gold cylinders. A scheme of these two configurations
is shown in Figure 7.34. Both designs were developed for larger
wavelengths, that is, telecom wavelengths (λ ∼ 1300 nm or λ ∼ 1500 nm).
This is the reason why both use germanium as the semiconductor material,
although other materials have been analyzed [110]. The first work stated that
a C-shaped aperture such as the one shown there can enhance the responsivity
by approximately 20%–50% with respect to a conventional photodetector, and
that the electric field is twice that produced by a rectangular aperture, with a
current of ∼ 1 nA under an illumination of 1.13 μW at 1310 nm. This device is
quite fast, with a transit time of ∼ 1.5 ps and a capacitance of 0.005 f F , which means
FIGURE 7.33
(a) Distribution of light emerging from a single nano-aperture surrounded by
a groove array with a periodicity of 500 nm, obtained from [11]. Each groove
is 40 nm in width and 100 nm in depth. Red color means high intensity and blue
means low intensity. (b) Scheme of a nano-photodetector based on a plasmonic
nanoaperture as was proposed by K. Ohashi and coworkers [81].
FIGURE 7.34
Scheme of the particular configurations for a photodetector based on surface
plasmon antennas presented in works (a) [109] and (b) [111].
a wide bandwidth. On the other hand, the main characteristic of the nano-antenna
configuration [109] is its size. It has a volume of ∼ 150 nm × 60 nm × 80 nm
with an active volume of ∼ 10−4 λ3 , making it probably the smallest pho-
todetector manufactured to date. Another important feature that both devices
share is their polarization dependence. Their shapes are strongly anisotropic;
the polarization of the incident field is therefore quite important, producing pho-
tocurrents up to 20 times lower for one polarization than for the orthogonal
one.
The surface plasmon resonance is quite sensitive to the influence of the sub-
strate underneath [77]: the higher the refractive index of the substrate, the
stronger the perturbation of the SPR. For this reason, the design of these devices
usually includes an oxide layer between the metallic layer or nano-structure
and the substrate.
The last photodetector model treated in this part of the chapter is
probably the SPP-based one closest to real application. This
device was presented by J. Fujikata and coworkers in [35], and it consists
of a surface plasmon antenna made as a periodic silver nano-scale metal-
semiconductor-metal (MSM) electrode structure embedded in a Si layer. The
complete set is then inserted at the interface between the core of a SiON waveg-
uide and its SiO2 cladding. A scheme of the system and a micro-photograph of
it are shown in [35] and reproduced here in Figure 7.35. The grooves are
90 nm in width and 30 nm in height, they sit on a 240-nm-thick Si layer
that acts as an absorber, and the waveguide was chosen to be single mode
FIGURE 7.35
(a) Schematic cross section of the Si nano-PD proposed by J. Fujikata et al.
in [35] and (b) Micro-photograph of the fabricated device.
at 850 nm. Under these conditions, the authors reported an 85% coupling
of the light propagating through the waveguide into the Si absorption layer.
The main characteristic of the model is, of course, its integration in the op-
tical channel, reducing the coupling losses. In addition, it is also a high-speed,
very low-capacitance detector, with transit times around 17 ps for a
1 V bias voltage and only 4 pF of capacitance. On the other hand, it does not
have a prominent responsivity, with a 10% quantum efficiency at an incident
wavelength of 850 nm and TM polarization, so a deeper analysis is needed
to optimize it. The system is, like the previous ones, polarization-dependent,
with a quantum efficiency two times lower for TE polarization. The authors
have already implemented this device in an on-chip optical clock distribution,
showing a good response of the optical device to the clock operation, hence it
could become a commercial device in the near future.
7.2.5 Summary
In this section several types of photonic components for an OI have been
discussed. With such an amount of information, it is easy to lose sight of the
general picture. To summarize and recap the main types of the considered
components, Table 7.6 reviews them and their principal characteristics.
TABLE 7.6
Summary of the different types of photonic elements with applications in optical
communications on-chip
Type                        Main Characteristics
OPTICAL SOURCES
In-chip light sources
  Dielectric cavities       High Q; small size
  Metallic cavities         Smaller Q-factors than dielectric cavities;
                            ultra-small sizes (∼ 100 nm)
  Raman lasers              Easy on-chip integration; effective area ∼ 1 μm2
EE = V 2 (Cin + Co + CL ) (7.7)
This energy is related to the power consumption (P) through the expression
EE = 2τ P, where τ is the rise time, that is, the time needed for the receiving
gate's input voltage to rise from 10% to 90% of its final value [32].
In the same way, the total power consumption of an optical link depends
on the characteristics of its components. In particular, it depends on the
steady current of the emitter, Iss , the photodetector current, Iph , the quantum
efficiencies of the emitter, ηE , and the detector, ηD , and the efficiency of the
transmission through the optical channel, ηC .
FIGURE 7.36
Power consumption of a single-piece electric interconnection across the differ-
ent technologies as a function of the line length. The curves (from bottom to
top) correspond to 0.05 μm, 0.07 μm, 0.12 μm, 0.18 μm, 0.25 μm, and 0.7 μm
technologies, respectively. The upper grey curve corresponds to the total
power consumption of an optical interconnection in a 0.18 μm technology [26].
P (Ith ) = V (Iss + Iph /(2 ηE ηC ηD ))                                   (7.8)
In both cases, V denotes the power supply voltage.
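A minimal sketch of Equations (7.7) and (7.8) side by side; all numerical values (capacitances, currents, efficiencies, rise time) are assumed examples, not data from [26] or [42]:

```python
# Sketch of Eqs. (7.7) and (7.8); all component values below are
# assumed illustrative examples, not data from the cited works.
def electrical_energy(v, c_in, c_o, c_l):
    """E_E = V^2 * (C_in + C_o + C_L), Eq. (7.7)."""
    return v**2 * (c_in + c_o + c_l)

def optical_power(v, i_ss, i_ph, eta_e, eta_c, eta_d):
    """P = V * (I_ss + I_ph / (2*eta_E*eta_C*eta_D)), Eq. (7.8)."""
    return v * (i_ss + i_ph / (2 * eta_e * eta_c * eta_d))

e_e = electrical_energy(v=1.8, c_in=10e-15, c_o=10e-15, c_l=100e-15)
p_avg = e_e / (2 * 50e-12)  # E_E = 2*tau*P with an assumed 50 ps rise time

p_opt = optical_power(v=1.8, i_ss=0.5e-3, i_ph=0.1e-3,
                      eta_e=0.3, eta_c=0.5, eta_d=0.3)
print(f"electrical: {p_avg*1e3:.2f} mW, optical: {p_opt*1e3:.2f} mW")
```

Note how the low component efficiencies inflate the optical power through the 1/(ηE ηC ηD) factor, which is why high-efficiency emitters and detectors are emphasized below.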
In Figure 7.36, J. Collet et al. represented the power consumption of a
single-piece electrical interconnect for several technologies. For compari-
son, they also showed the value that corresponds to an OI in a 0.18 μm technology
(upper horizontal gray bar). From this figure, one important conclusion can be
deduced: the OI presents advantages, regarding the power consumption, only
for long-distance interconnects, that is, for line lengths larger than 5–10 mm.
Similar results were shown by M. Haurylau and coworkers in reference [42]
and are reproduced here in Figure 7.37. In that case, the authors showed both
the power consumption and the propagation delay (analyzed later) of an EI
as a function of the length of the communication line, together with those values
for an OI with a length equal to 17.6 mm (the edge length of the chip projected
in the International Technology Roadmap for Semiconductors (ITRS) [1]). Again,
OIs are only considered for long interconnections. The main conclusion of this
figure concerns the values of power consumption. Competitive optical communi-
FIGURE 7.37
Power consumption (right axis) and signal propagation delay (left axis) for an
EI as a function of the length. Also the data for two types of OI considering as
the propagation channel a polymer or a silicon waveguide and a interconnec-
tion length equal to 17.6 mm (chip edge length in the ITRS projected chip)
[42] are shown.
cations must consume less than 17–18 mW for 0.18 μm technologies.
Furthermore, if smaller integration technologies (0.05 μm) are selected, the
power should be reduced drastically to values around 10 μW . These
values require the design and development of optical sources with ultra-small
threshold currents and detectors with very high quantum efficiencies, such as
those on which researchers are working now [56, 106].
FIGURE 7.38
Propagation delay as a function of the link length for a silicon (0.34 μm wide
and refractive index 3.4) and a polymer (1.36 μm wide and refractive index
equal to 1.3) waveguide in comparison with an EI. Waveguides have square
cross section and are surrounded by a cladding with refractive index of 1.1
[42].
FIGURE 7.39
Comparison of the bandwidth density of electric wires and optical waveguides
made either of silicon or a polymer as a function of the year and the technology
node [42].
7.3.4 Fan-Out
It was previously shown that OIs present several advantages with respect to
EIs only when the length of the connection is large enough. In other words,
optical communications can compete with electric ones only for
global on-chip or inter-chip communications. However, this is true only for
point-to-point communications. A system including OIs can be a good alter-
native to electric communications with high fan-out, even for short distances.
The fan-out of a digital system can be defined as the connection of the
output signal of one logic gate to the input of various gates. Three different
cases can be distinguished in fan-out connections [32]
Only in the second case can optical links provide advantages in their char-
acteristics (power, delay, bandwidth). On the contrary, the proximity of the
receiving gates in the first and third cases means that the increase in the
propagation delay and in the power consumption due to the inclusion of the op-
tical transmitter and the receiver cannot be compensated by the advantages of
the optical channel. However, a hybrid system that combines an optical and an
electrical interconnection gives better values for propagation delay and power
consumption in any case. In Figure 7.40(a) a fully electrical fan-out system of
the third type is plotted. As a comparison, Figure 7.40(b) represents an alter-
native optoelectronic system that implements optical and electrical fan-out.
In this hybrid system, the first division of the signal is made optically using
a transmitter (T) that sends its signal to two optical receivers (R1 and R2).
Then, they transform light into an electric signal and this is divided again,
but now using an electric system.
The inclusion of fan-out in a fully electrical communication implies a dete-
rioration of the power and delay characteristics. Delay inconveniences can be
partially solved using larger-sized transistors or optimally spaced repeaters. In
these cases, the propagation delay (τ ) scales as the logarithm of the number
of fan-outs (N ). A hybrid system that includes optical fan-out can reduce these
values if the electro-optical conversion is efficient in terms of energy and speed.
This can be observed in Figure 7.41, where the propagation delay for a hybrid
link is plotted as a function of the load capacitance and for several numbers
of fan-outs (N ). The delay for a fully electrical link has also been included
for comparison purposes. The load capacitance is defined as the number of
inverters driven times the load capacitance of each inverter.
As was remarked above, for a point-to-point communication (N = 1) the
optical delay is worse than the electrical one. However, as the number of fan-
outs increases, the delay of the interconnection with optical fan-out tends to
FIGURE 7.40
(a) Schematic plot of a fully electrical fan-out system in which the receiving
gates (6 in this case) are along one line and (b) an equivalent optoelectronic
system with an optical fan-out consisting of one transmitter and two receivers.
FIGURE 7.41
Propagation delay for a system with optical fan-out versus the load capaci-
tance. Three fan-out numbers have been considered. The delay for a fully
electrical system has also been included for comparison.
FIGURE 7.42
Normalized Eτ 2 of an optoelectronic system with respect to the electric one,
as a function of the fan-out number. Several load capacitances are considered
[84].
decrease, acquiring values lower than that of the electrical system. Even for
small values of N (N ∼ 8), the delay advantages of an optical fan-out can be
observed. These results can be extended to very short on-chip links around
200–300 μm.
The implementation of an OI leads to a high power consumption due to
the electric-optical signal conversion. Because of that, an optoelectronic solu-
tion for a fan-out is generally worse in terms of energy than an electric one.
Also, it is important to remark that an optical link with fan-out is limited in
the number of driven devices due to the limited energy emitted by the source
(laser, cavity, etc.). Besides this, it is necessary to take into account the global
advantages and disadvantages. In order to analyze it, some authors have in-
troduced the parameter Eτ 2 [32], that includes the influence of the two main
parameters: the propagation delay and the power consumption of the inter-
connection. In Figure 7.42, the considered parameter is shown as a function of
the fan-out for a hybrid optoelectronic system with respect to the electric one
and for several load capacitances. As can be seen, as the load capacitance and
the number of fan-outs increase, the disadvantages related to the power con-
sumption are overcome by the advantages induced in the propagation delay.
A. Pappu and coworkers [84] stated that a hybrid optoelectronic link scales
3.6 times better in delay and 1.5 times worse in energy than an electrical one.
Thus, the implementation of optical devices in on-chip interconnections with fan-out could improve their overall characteristics.
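As a quick numerical check of these figures, the combined Eτ² metric can be evaluated directly. The snippet below is ours, assuming only the 3.6× delay and 1.5× energy scaling factors quoted above from [84]:

```python
# Back-of-the-envelope check of the E*tau^2 figure of merit, using the
# scaling factors quoted above for a hybrid optoelectronic link:
# delay 3.6x better and energy 1.5x worse than a fully electrical one.
delay_ratio = 1.0 / 3.6    # optical delay relative to electrical
energy_ratio = 1.5         # optical energy relative to electrical

e_tau2_ratio = energy_ratio * delay_ratio ** 2
print(f"E*tau^2 (optical/electrical) = {e_tau2_ratio:.3f}")
```

A ratio well below 1 (here about 0.12) indicates that the delay advantage dominates the energy penalty, consistent with the conclusion above.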
FIGURE 7.43
Scheme of a horizontal intrachip interconnect using a metamaterial for guiding
light from the emitter to the receptor [39].
not subject to the diffraction limit that appears in conventional optics. Hence, planar negative-refractive index or double-negative (DNG) metamaterials
potentially offer a lossless control of light propagation at sizes much smaller
than the incident wavelength.
The main purpose of optical interconnections in microelectronics is to focus and to guide light through a chip or between chips. Hence, an optical device that integrates slabs made of a DNG metamaterial can improve the delay and
energy characteristics of the link. In a recent U.S. patent, T. Gaylord and
coworkers [39] showed the possible applications of this kind of optical system
in the field of inter- and intra-chip optical communications. In Figure 7.43,
a schematic illustration of the proposed optical on-chip interconnection is
shown that uses the unconventional slabs as a perfect lens guiding light from
the emitter to the detector. In this case, the metamaterial is composed of a
multilayer structure that is called fishnet metamaterial [38]. A very interesting
behavior of these interconnections is that the propagated beam can travel in
both directions, i.e., this is a bidirectional system, provided a detector and a
source are integrated simultaneously at the beginning and at the end of the
communication.
Other futuristic evolutions of optical communications go through the implementation of optical devices with an electromagnetic response like that presented by other microelectronic elements; that is, the use of optical nanostructures whose interaction with light produces behaviors equivalent to those of, for instance, a resistance, an inductance, or a transistor [31]. If
in the future, we are able to manufacture this kind of optical element, even
the logic cells of the chip will be implemented with them, and light will be
used to encode the information instead of electric signals [71]. The optical
paths or links of these optical nano-circuits could also be implemented using
nano-structures, like simple particles, made of those new metamaterials that
will be able to control the direction of the light. The control of the optical
properties of these materials means, as noted previously, control of their interaction with light, and in particular of the direction in which the material scatters light. This topic was first studied a few years ago in reference [49] and recently generalized in references [37, 36]. It has even been
studied experimentally. For instance, A. Mirin and N. Halas [73] showed that
light scattering by a metallic nanoparticle with a complex geometry (nano-
cups) can be directed in certain directions. Researchers expect that in the near future composites of these kinds of nanoparticles will transport light, much as waveguides do, but at much smaller sizes. However, the practical implementation of these optical links is still far from reality, and intensive study and evolution of metamaterials are required.
FIGURE 7.44
Photograph of the prototype of an optical clock module developed in [35].
pulsed light source for the signaling chip, and an optical modulator and a continuous wave (CW) light source for the clocking one, both of them with silicon waveguides and a silicon photodetector. Their main conclusion is that although signaling applications offer greater challenges than clocking ones, the latter does not offer noticeable advantages with respect to its electrical counterpart. However, recent advances in silicon photonics have improved
the characteristics of a clocking optical chip and the first examples have ap-
peared. For instance, J. Fujikata et al. presented in reference [35] a prototype
of a Large-Scale Integration (LSI) on-chip fully optical clock system with a 4-
branching H-tree structure. The optical chip (see Figure 7.44) is composed of
an improved coupling and confinement architecture between a SiON waveguide
and a silicon photodetector by a surface-plasmon-polariton (SPP) structure
(see Figure 7.35). The input light is produced by a 850-nm CW light source
and introduced in the chip with a lithium niobate modulator (not discussed
in this chapter). A 5 GHz operation has been reported with this experimental
and complete clock system, which is quite interesting and promising for OI.
In addition, many optical networks have already been manufactured for
different applications. One example of these networks is called Iris [62]. This
is a CMOS-compatible high-performance low-power nanophotonic on-chip
network composed of two different subnetworks. While a linear-waveguide-
based throughput-optimized circuit switched subnetwork supports large and
throughput-sensitive messages, a planar-waveguide-based WDM broadcast-
multicast nanophotonic subnetwork optimizes the transfer of short, latency-critical, and often multicast messages. Thus, a nanophotonic network such as this one provides low-latency, high-throughput, and low-power communications for many-core systems.
The last efforts have produced important results in a way that the last pro-
7.7 Glossary
ARC: Antireflection Coating
CMOS: Complementary Metal-Oxide-Semiconductor
CW: Continuous Wave
DARPA: Defense Advanced Research Projects Agency
DBR: Distributed Bragg Reflector
DNG: Double-Negative
EI: Electric Interconnection
ELO: Epitaxial Lateral Overgrowth
ENZ: Epsilon Near Zero
EVL: Epsilon Very Large
FSOI: Free-Space Optical Interconnect
ICP: Inductively Coupled Plasma
LCOS-SLM: Liquid-Crystal on Silicon Spatial Light Modulator
LED: Light Emitting Diode
LOCOS: Local Oxidation of Silicon
LPCVD: Low-Pressure Chemical Vapor Deposition
LSI: Large-Scale Integration
LSPR: Localized Surface Plasmon Resonance
MD: Modulation Depth
Intra-/Inter-Chip Optical Communications 311
7.8 Bibliography
[1] International Technology Roadmap for Semiconductors: 2007 Edition.
International SEMATECH, 2007.
[21] Y-M. Chen, C-L. Yang, Y-L. Cheng, H-H. Chen, Y-C. Chen, Y. Chu,
and T.E. Hsieh. 10 Gbps multi-mode waveguide for optical interconnect.
In Electronic Components and Technology Conference, pages 1739–1743, 2005.
[23] I-K. Cho, J-H. Ryu, and M-Y. Jeong. Interchip link using an optical
wiring method. Opt. Lett., 33:1881–1883, 2008.
[86] J. B. Pendry. Negative refraction makes a perfect lens. Phys. Rev. Lett.,
85:3966–3969, 2000.
José M. Moya
Politecnica University of Madrid, Spain
Juan-Mariano de Goyeneche
Politecnica University of Madrid, Spain
Pedro Malagón
Politecnica University of Madrid, Spain
CONTENTS
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
8.1.1 Side-Channel Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
8.1.2 Sources of Information Leakage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
8.1.3 Overview of Countermeasures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
8.2 Power Analysis Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
8.2.1 Simple Power Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
8.2.2 Differential Power Analysis Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
8.2.3 Correlation Power Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
8.2.4 Stochastic Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
8.2.5 Higher Order DPA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
8.2.6 Attacks on Hiding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
8.2.7 Attacks on Masking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
8.2.8 ECC-Specific Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
8.2.9 Power Analysis Attacks on Faulty Devices . . . . . . . . . . . . . . . . . . . . . . . 345
8.2.10 Multichannel Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
8.3 Logic-Level DPA-Aware Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
8.3.1 Dual-Rail Precharge Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
8.3.1.1 Sense-Amplifier Based Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
8.3.1.2 Wave Dynamic Differential Logic . . . . . . . . . . . . . . . . . . . . . . . . 349
Security Concerns about WDDL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
Avoiding Leakages in WDDL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
8.3.1.3 Dual-Spacer Dual-Rail Precharge Logic . . . . . . . . . . . . . . . . . 354
8.3.1.4 Three-Phase Dual-Rail Precharge Logic . . . . . . . . . . . . . . . . . 355
8.3.2 Charge Recovery Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
8.3.3 Masked Logic Styles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
8.3.3.1 Random Switching Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
8.3.3.2 Masked Dual-Rail Precharge Logic . . . . . . . . . . . . . . . . . . . . . . 361
Security Concerns about Masked Logic Styles . . . . . . . . . . . . . . . . . 363
Avoiding Leakages in MDPL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
8.3.3.3 Precharge Masked Reed-Muller Logic . . . . . . . . . . . . . . . . . . . 367
8.3.4 Asynchronous Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
323
8.1 Introduction
The power consumption and the electromagnetic emissions of any hardware
circuit are functions of the switching activity at the wires inside it. Since the
switching activity (and hence, power consumption) is data dependent, it is
not surprising that special care should be taken when sensitive data has to be
communicated between SoC components or to the outside.
A common approach to implementing tamper-resistance involves the use of
a separate secure co-processor module [101], which is dedicated to processing
all sensitive information in the system. Any sensitive information that needs
to be sent out of the secure co-processor is encrypted.
Many embedded system architectures rely on designating and maintaining
selected areas of their memory subsystem (volatile or nonvolatile, off-chip or
on-chip) as secure storage locations. Physical isolation is often used to restrict
the access of secure memory areas to trusted system components. When this
is not possible, a memory protection mechanism adopted in many embedded
SoCs involves the use of bus monitoring hardware that can distinguish be-
tween legal and illegal accesses to these locations. For example, the CryptoCell
security solution from Discretix features BusWatcher, which performs this
function. Ensuring privacy and integrity in the memory hierarchy of a proces-
sor is the focus of [114], which employs a hardware secure context manager,
new instructions, and hash and encryption units within the processor.
Security Issues in SoC Communication 325
This makes the resistance increase hard to quantify and the design trade-offs
difficult to make.
The remainder of this chapter is organized as follows. This introductory
section presents the basics of side-channel attacks and countermeasures. Sec-
tion 8.2 provides a more in-depth description of the different attacks based
on measurements of the total power consumption. Sections 8.3, 8.4, and 8.5
detail the available tools and techniques to avoid this kind of attack at different levels (logic, architecture, and algorithm). Section 8.6 gives some notes
about efficient validation of countermeasures against power analysis attacks.
Finally, Section 8.7 provides some advice about miscellaneous design decisions,
and Section 8.8 draws some conclusions.
2. Active vs. passive: active attacks try to tamper with the device's proper functioning; for example, fault-induction attacks will try to induce errors in the computation. In contrast, passive attacks will simply observe the devices' behavior during their processing, without disturbing it.
FIGURE 8.1
Structure of a typical side-channel attack.
field, dissipate heat, and even make some noise [108]. As a matter of fact,
there are plenty of information sources leaking from actual computers that
can consequently be exploited by malicious adversaries. In this chapter, we
focus on power consumption and electromagnetic radiation that are two fre-
quently considered side-channels in practical attacks. Since a large part of
present digital circuits is based on CMOS gates, this introduction also only
focuses on this technology.
Side-Channel Analysis (SCA) is one of the most promising approaches to
reveal secret data, such as cryptographic keys, from black-box secure crypto-
graphic algorithms implemented in embedded devices. Differential Side Chan-
nel Analysis (DSCA) exploits (small) differences in a set of measurements by
means of statistics and is particularly well suited for the power analysis of
block cipher implementations.
A side-channel attack works as follows (see Figure 8.1): it compares obser-
vations of the side-channel leakage (i.e., measurement samples of the supply
current, execution time, or electromagnetic radiation) with estimations of the
side-channel leakage. The leakage estimation comes from a leakage model of
the device requiring a guess on the secret key. The correct key is found by
identifying the best match between the measurements and the leakage estima-
tions of the different key guesses. Furthermore, by limiting the leakage model
to only a small piece of the algorithm, only a small part of the key must be
guessed and the complete key can be found using a divide-and-conquer ap-
proach. For instance, an attack on the Advanced Encryption Standard (AES)
generally estimates the leakage caused by a single key byte and as a result the
128-bit key can be found with a mere 16 · 2^8 tests. Finally, as the observations
might be noisy and the model might be approximate, statistical methods are
often used to derive the secret from many measurements.
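The divide-and-conquer accounting above can be sketched as follows; the function name and the Hamming-weight model are illustrative, not taken from the text:

```python
def hamming_weight(x: int) -> int:
    """Typical leakage model: number of set bits of the handled value."""
    return bin(x).count("1")

# For AES-128 the leakage model involves a single key byte, so each of
# the 16 bytes is recovered independently with at most 2**8 guesses:
KEY_BYTES = 16
GUESSES_PER_BYTE = 2 ** 8
total_tests = KEY_BYTES * GUESSES_PER_BYTE   # 16 * 2**8 = 4096
print(total_tests)                           # instead of 2**128 for brute force
```

Each per-byte guess is scored against the measured leakage, and the best-matching guess for each byte is kept.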
Attacks that use a single or only a few observations are referred to as
FIGURE 8.2
SPICE simulation result of a 2 input NAND gate in static CMOS logic. Figure
from [44]. Used with permission.
• A digital circuit functions by evaluating the input voltage level and set-
ting the output voltage level based on the input through a set of logic
gates. In current CMOS technology, the logic value of a gate actually
depends on the amount of charge stored on the parasitic capacitor at the
output of the logic gate. A fully-charged capacitor represents logic-high
(logic-1) whereas a depleted capacitor represents logic-low (logic-0). For
each binary logic gate, there can only be four types of transitions on the
gate output wire. Only one transition, from logic-0 to logic-1, actually
draws current from the power supply to charge up the parasitic capaci-
tor. By monitoring the amount of current consumed by the digital circuit
at all times, we can get an idea on the relative amount of logic gates
that are switching at any given time. This gives us some information
about the circuit based on power consumption.
• Parasitic capacitances are not uniform for every gate. They depend on
the type of gate, fanout of the gate, and also the length of the wire or
net between the current gate and its driven gates. Taking the length of wires as an example, even if two identical gates with the same fanout are connected to the same set of successor gates, the capacitance will differ if the routing of the wires is different. If both had
a power-consuming transition, the amount of power consumed will be
different, thus leaking important power information about the circuit.
Static power leakage consumption is a characteristic of the process used
to manufacture the circuit. The exact amount of leakage for a given
gate within a circuit is not controllable by a logic designer. Assuming
the gates are manufactured exactly the same, then the static power
leakage does not pose a threat. But this is not the case in the real
world. Process variation plays an important role in the balancing of
static power leakage. As process variation increases, the variation of the
amount of charge leaked for every gate during a fixed period of time
also increases. Unfortunately, the effect of the static power leakage due
to process variations cannot be evaluated at design time, thus can only
be seen through actual measurements on the finished product, whether
it is an ASIC design or a design for FPGA.
We ignore coupling effects and create a linear model: the power consump-
tion of the chip is the sum of the power consumption of its components. Hence,
we can isolate the power consumption of the component (registers, memory
cells, or bus lines) storing the temporary variable predicted by the selection
function D from the power consumption of the rest of the device.
CMOS devices only consume power when logic states change after a clock
pulse. Static power dissipation is negligible [23, 31]. Let B = (β_m, β_{m−1}, ..., β_1) be the value of the relevant variables to store in the register, and let A = (α_m, α_{m−1}, ..., α_1) be the previous content of the register. Let c01 (resp. c10) be the amount of power needed to flip a bit of the register from 0 to 1 (resp. from 1 to 0). If after the clock edge the value B is stored in the register, the power consumption of the chip after the clock edge is:

C(t) = Σ_{i=1}^{m} [ (1 − α_i)·β_i·c01(t) + α_i·(1 − β_i)·c10(t) ] + w(t)    (8.1)
where w(t) is the power consumption of the rest of the chip, which can be modelled by Gaussian noise whose mean represents the constant part of the power consumption between two different executions of the algorithm. This
assumes that the sequence of instructions executed by the chip does not de-
pend on the data. This model can be easily simulated to verify the soundness
of a new attack.
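A minimal simulation of this model (Equation 8.1) might look as follows; the function name and parameter values are our own illustrative assumptions:

```python
import random

def register_power(prev: int, new: int, m: int = 8,
                   c01: float = 1.0, c10: float = 0.8,
                   noise_sigma: float = 0.5) -> float:
    """Linear power model of Equation 8.1 for an m-bit register.

    Sums the 0->1 and 1->0 flip costs over all bits and adds Gaussian
    noise w(t) modelling the rest of the chip.
    """
    c = 0.0
    for i in range(m):
        a = (prev >> i) & 1          # alpha_i: previous bit value
        b = (new >> i) & 1           # beta_i: new bit value
        c += (1 - a) * b * c01 + a * (1 - b) * c10
    return c + random.gauss(0.0, noise_sigma)
```

With noise_sigma = 0, storing 0xFF over a previous 0x00 costs exactly 8·c01, making the data dependence exploited by power analysis directly visible.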
The data dependencies in the dynamic power consumption cause four major sources of leakage:
Note that early propagation and memory effect have much less impact
on the security features of a circuit than imbalance and exposure time. So,
it is essential to minimize the circuit imbalance and exposure time before
optimizing the circuit for the early propagation and memory effect metrics.
differential power analysis attacks, but they are not enough for higher-order attacks.
manipulates sensitive information is executed (e.g., load part of the secret key
to the accumulator). Depending on the Hamming Weight (number of “1”) of
the key data manipulated, the amplitude of the power trace in that instant
varies. If the attacker is experienced and has a consumption reference model, he can estimate the HW of the key data from the power amplitude.
Another scenario could be an implementation where the instructions exe-
cuted depend on data (e.g., conditional branch depending on a bit value). If
the attacker has information about the implementation and localizes the execution of the algorithm in the power trace, he can derive the data processed
from the duration of the cycles. If the execution duration is different for differ-
ent instructions, it is possible to assign sections of the power trace to concrete
instructions executed.
There are thus two different sources for an attack, although authors typically use the term SPA for the amplitude-based ones.
SPA attacks require detailed knowledge about the implementation of the
algorithm in the device. The attack process starts with a thorough analysis of
the target device and its implementation of the algorithm. Useful information
includes:
check which matches the traces collected. This operation is not done to the
whole trace at once, but in a concrete instant of the trace, an instant where the
power consumption depends on an intermediate value of data (v) and part of
the key (k). Consequently, the first step is to choose an instant (concrete j) for
the attack that corresponds to a known intermediate state of the algorithm.
For each possible k, the associated v can be calculated from the known data input (PTI_i) applying the algorithm, or from the known data output (CTO_i) applying the inverse algorithm. At this point, the attacker has a vector of real power values (S_ij) and, for each possible k, a vector of intermediate values (v_ik),
i ∈ 1, ..., N, j instant chosen, and k the possible key. The next step is to find
the key with greater correlation between calculated intermediate values and
real power values. In the DPA attack the difference of means method is used.
The Sij vector is split into two vectors (S0 and S1 ) using a selection func-
tion, D. D assigns Sij to S0 when vik is 0 and to S1 when vik is 1. For every
possible part of the key we have two subsets S0 and S1 . For each subset the
average power signal is:
A0[j] = (1/|S0|) · Σ_{S_ij ∈ S0} S_ij    (8.3)

A1[j] = (1/|S1|) · Σ_{S_ij ∈ S1} S_ij    (8.4)
where |S0| + |S1| = N. The difference at instant j is DT_j. If we had this DT_j for every j in the power trace, we would have a function DT[j], denoted as the differential trace:

DT[j] = A1[j] − A0[j]    (8.5)
Although we have seen a selection function D for one instant and one data
bit, multiple bits and multiple instants can be used in order to increase the
difference between the correct key guess and the incorrect ones. For instance,
Messerges et al. [80] use d-bit data and two sets, assigning those with greater Hamming weight to S1 (H(V_ij) ≥ d/2) and the rest to S0 (H(V_ij) < d/2). In [9], Bevan improves the DPA attack using a 4-bit D function. Instead of
deciding the key value when the four selection functions agree, they sum the
four differences of means to reach a solution faster. This solution is possible
because every bit influences the power consumption at the same time. Instead
of always having a binary division, in [5] they use d-bit attacks and they
divide the set of traces into d + 1 subsets. In [64] there is a formal definition of
the differential trace that includes the mentioned attacks as particular cases,
assigning values to aj in
DT = Σ_{c=0}^{d} a_c · ( Σ_{S_ij ∈ Sc} S_ij ) / |Sc|    (8.6)
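The single-bit difference-of-means attack described above can be sketched as follows. This is a toy model with a hypothetical random S-box and a Hamming-weight leakage model, not the text's exact setup:

```python
import random

def hw(x: int) -> int:
    return bin(x).count("1")

def dpa_best_guess(traces, plaintexts, sbox, bit=0):
    """Single-bit DPA: for each key guess k, split the traces with the
    selection function D (the predicted intermediate bit) and score the
    guess by the difference of means A1 - A0."""
    scores = {}
    for k in range(256):
        s0, s1 = [], []
        for t, p in zip(traces, plaintexts):
            v = (sbox[p ^ k] >> bit) & 1
            (s1 if v else s0).append(t)
        if s0 and s1:
            scores[k] = sum(s1) / len(s1) - sum(s0) / len(s0)
    return max(scores, key=lambda k: abs(scores[k]))

# Toy demonstration: simulated traces leak the HW of the S-box output.
random.seed(1)
sbox = list(range(256)); random.shuffle(sbox)
secret = 0x3C
pts = [random.randrange(256) for _ in range(2000)]
traces = [hw(sbox[p ^ secret]) + random.gauss(0, 1.0) for p in pts]
recovered = dpa_best_guess(traces, pts, sbox)
```

With enough traces the correct guess produces the largest spike in the differential trace, and the secret byte is recovered despite the added noise.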
S = aH(D ⊕ R) + b (8.7)
where a is a scalar gain and b represents power consumption independent of the data (noise, time-dependent components, etc.). The correct key guess is
the one that maximizes the correlation factor ρSH .
The estimated value of the correlation factor, ρ̂SH , is obtained from the
traces stored and data known with the formula:
ρ̂_SH(i, R) = ( N·Σ S_ij·H_j,R − Σ S_ij · Σ H_j,R ) / √( (N·Σ S_ij² − (Σ S_ij)²) · (N·Σ H_j,R² − (Σ H_j,R)²) )    (8.8)
FIGURE 8.3
CPA attack results of the Fastcore AES chip. Figure from [44]. Used with
permission.
Figure 8.3 shows the results of a sample CPA attack on an ASIC AES
implementation. The graph on the left shows the correlation of all K = 256
subkey permutations to the measurement results as a function of the number
of measured samples S. On the right, the correlation of all K = 256 subkey
permutations is given for 10,000 measurements.
α(x_1, ..., x_{N3}; k) = Σ_{j=1}^{N3} ( f̃*_t(x_j, k°) − h̃*_t(x_j, k) )    (8.11)
DT2 = ⟨ |S_ik − S_ij| ⟩_{S_ij ∈ S1} − ⟨ |S_ik − S_ij| ⟩_{S_ij ∈ S0}    (8.12)
In case the adversary only knows the offset δ = k − j (but not j nor k),
the previous attack can be extended as a “known-offset second-order DPA
attack” [129]. The adversary evaluates the second-order differential trace:
DT2[j] = ⟨ |S_i(j+δ) − S_ij| ⟩_{S_ij ∈ S1} − ⟨ |S_i(j+δ) − S_ij| ⟩_{S_ij ∈ S0}    (8.13)
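The known-offset second-order differential trace of Equation 8.13 can be computed as follows; the function name is ours, and we assume the absolute difference as the combining function:

```python
def second_order_dt(traces, selection, j, delta):
    """Known-offset second-order differential trace: the mean absolute
    difference |S_i(j+delta) - S_ij| over the D = 1 traces minus the
    same mean over the D = 0 traces."""
    d1 = [abs(t[j + delta] - t[j])
          for t, v in zip(traces, selection) if v == 1]
    d0 = [abs(t[j + delta] - t[j])
          for t, v in zip(traces, selection) if v == 0]
    return sum(d1) / len(d1) - sum(d0) / len(d0)
```

Combining two trace points nonlinearly is what lets the attack cancel a mask that affects both points.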
w = u ⊕ v = u_m ⊕ v_m    (8.14)
Note that we can calculate the value of the combination of two masked
intermediate values without having to know the mask.
However, second-order DPA attacks are not always required to break a
masking scheme. There are some cases where the independence property is
not fulfilled, and therefore first-order attacks may suffice (see [73]):
• Multiplicative masking. Multiplicative masking, which is often used in asymmetric ciphers, does not satisfy the independence condition because if v = 0, then v × m = 0. The masked value has a certain dependence on
the unmasked value, and therefore it is vulnerable to first-order DPA
attacks. In particular, it is vulnerable to zero-value DPA attacks [4].
• Reused masks. If masks are reused, the result of an exclusive-or of two
intermediate values concealed with the same Boolean mask would be un-
masked. Also, if a device leaks the Hamming distance of the intermediate
values, and two values concealed with the same mask pass consecutively
through the bus, the Hamming distance of the unmasked values would
be leaked.
• Biased masks. Uniformly distributed masks are essential for the security
of a masking scheme. Hence, the attacker could try to force a kind of bias
into the masks, either by manipulating the device (e.g., fault attacks), or
by selecting a subset of the measured traces that correspond to a subset
of masks.
of the zero-value point attacks [4], which tries to observe that characteristic
power consumption trace of 0 in the outputs of the elliptic curve addition and
elliptic curve doubling blocks rather than in intermediate points.
can be permanently changed in this way. Using a probing needle, the content
of Electrically-Erasable Programmable Read-Only Memory (EEPROM) cells
can be manipulated [6]. It is also possible to inject permanent faults by
modifying the device itself and cutting some wires using a laser cutter or a
focused ion beam (FIB) [57].
Differential fault analysis is not limited to finding the keys of known ci-
phers. An asymmetric fault model makes it possible to find the secret key
stored in a tamper-resistant cryptographic device even when nothing is known
about the structure and operation of the cryptosystem.
Moreover, it is possible to use DFA to extract the exact structure of an
unknown DES-like cipher sealed in the tamper-resistant device, including the
identification of its round functions, S-boxes, and subkeys.
vector of length L, with the l-th entry being the sample mean difference of
the l-th channel. Furthermore, the variance of the b-bin under hypothesis H at time j is a covariance matrix of size L × L, with the (i, j)-th entry being the correlation between signals from the i-th and j-th channels.
FIGURE 8.4
Precharge wave generation in WDDL.
However, this methodology does not guarantee that each compound gate
has only one switching event per cycle. Both timing and value of the inputs
influence the number of switching events. WDDL resolves this problem by
conceiving a secure version of the AND- and OR-operator. Any other logic
function in Boolean algebra can be expressed with these two differential op-
erators. The differential inverter is not needed, as it can be implemented by
simply exchanging the outputs of the previous gate.
Contrary to simple dynamic differential logic, WDDL gates remove the
logic cells required to precharge the outputs to 0, and so they do not precharge
simultaneously. As the input signals are usually connected to the outputs
of other dynamic gates, whenever the inputs of any AND- or OR-gate are
precharged to 0, the outputs are automatically at 0. The precharged 0s ripple
through the combinatorial logic. Instead of a precharge signal that resets the
logic, there is a precharge wave: hence the name “Wave Dynamic Differential
Logic.”
In order to launch the precharge wave, a precharge operator is inserted at
the inputs of the encryption module, and Master-Slave DDL registers should
be used, as shown in Figure 8.4. The registers store the precharged 0s, sampled
at the end of the preceding precharge phase, during the evaluation phase, and
they launch the precharge wave again.
WDDL can be characterized by the following properties:
• Negative Logic: Notice that WDDL gates contain no inverters, even though they can still implement negative logic (for example, a WDDL NAND gate). Inverters would disrupt the precharge wave propagation because the wavefront would be inverted. Therefore, WDDL uses only positive logic and implements inverters by cross-coupling the wires from the direct gate with those from the complementary gate.
Note that the above three properties address the first source of informa-
tion leakage from power consumption. Rather than altering the proba-
bility of power-consuming transitions like masking-based logic, WDDL
makes such transitions happen on every clock cycle. This effectively de-
flects any attempt to correlate the number of transitions to the secret
data.
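A behavioral sketch of a WDDL AND gate (a logic-level model, not a transistor netlist; the rail names are ours) shows both the rippling precharge and the constant switching activity:

```python
def wddl_and(a_t, a_f, b_t, b_f):
    """WDDL AND gate: the true rail is the AND of the true inputs, the
    false rail is the OR of the complementary inputs."""
    return a_t & b_t, a_f | b_f

# Precharge: all-zero inputs give all-zero outputs, so the precharged
# 0s ripple through the combinatorial logic (the "precharge wave").
assert wddl_and(0, 0, 0, 0) == (0, 0)

# Evaluation: for every data value exactly one of the two output rails
# rises, i.e. one 0->1 transition per compound gate per cycle,
# independent of the inputs.
for a in (0, 1):
    for b in (0, 1):
        t, f = wddl_and(a, 1 - a, b, 1 - b)
        assert t + f == 1
```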
2. Leakage caused by the difference of delay time between the input signals
of WDDL gates.
The impact of these leakages has been studied by Suzuki and Saeki [115].
The power consumption at the CMOS gate can be generally evaluated by:
P_total = p_t · C_L · V_dd² · f_clk + p_t · I_sc · V_dd · f_clk + I_leakage · V_dd    (8.15)
where CL is the loading capacitance, fclk is the clock frequency, Vdd is the
supply voltage, pt is the transition probability of the signal, Isc is the direct-
path short-circuit current, and I_leakage is the leakage current. As can be seen from Equation 8.15, the first term of the power consumption differs between the gates if there is a difference in loading capacitance between the complementary logic gates. Since the existence of a transition at each
complementary logic gate is determined by the values of the input signals, the
total power consumption depends on the signal values even if the
total number of transitions is equal between the gates.
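Equation 8.15 can be evaluated directly to see this imbalance; the numeric values below are illustrative only, not taken from the text:

```python
def cmos_gate_power(pt, CL, Vdd, fclk, Isc, Ileak):
    """Total power of Equation 8.15: dynamic (capacitive), short-circuit,
    and static leakage terms."""
    return pt * CL * Vdd ** 2 * fclk + pt * Isc * Vdd * fclk + Ileak * Vdd

# Two complementary gates with pt = 1 but slightly different loading
# capacitance consume different power -- the first-term imbalance.
p_true = cmos_gate_power(1.0, 10e-15, 1.2, 100e6, 0.0, 0.0)
p_false = cmos_gate_power(1.0, 12e-15, 1.2, 100e6, 0.0, 0.0)
print(p_false - p_true)   # nonzero: a data-dependent difference
```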
As described previously, the transition probability during an operation
cycle at the WDDL gates is guaranteed to be pt = 1 independently of the
input signals. However, the operation timing of each complementary logic
gate is generally different mainly due to the delay time of the input signals
during an operation cycle. Therefore, since the average power traces specified
by the predictable signal values have different phases, a spike can be detected
after the DPA operation.
FIGURE 8.5
State machine for the dual spacer protocol: code words 01 (“0”) and 10 (“1”), separated by the all-zeroes (00) and all-ones (11) spacers.
proposed “fat wire,” and Guilley et al. [43] proposed “backend duplication” as
countermeasures in the placement and routing to improve the DPA-resistance.
However, we are not aware of any study of a countermeasure against the
inevitable leakage due to the differences in the delay of two different signals.
The S-box design method for low power consumption proposed by Morioka and Satoh [85] is recommended as one technique to reduce this leakage. In general,
adjusting the delay time between the input signals at each gate requires a high
effort.
to use two spacers (i.e., two spacer states, 00 for the all-zeroes spacer and
11 for the all-ones spacer), resulting in a dual spacer protocol (see Figure 8.5).
It defines the switching as follows: spacer → code word → spacer → code
word. The polarity of the spacer can be arbitrary and possibly random.
A possible refinement for this protocol is the alternating spacer protocol. The
advantage of the latter is that all bits are switched in each cycle of opera-
tion, thus opening a possibility for perfect energy balancing between cycles of
operation.
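The alternating spacer protocol can be sketched in a few lines; the encoding below (code words “0” = 01, “1” = 10, spacers 00 and 11) follows Figure 8.5, while the function names are ours:

```python
SPACER0, SPACER1 = (0, 0), (1, 1)

def encode(bit):
    """Dual-rail code words: "0" -> (0, 1), "1" -> (1, 0)."""
    return (1, 0) if bit else (0, 1)

def alternating_spacer_trace(bits):
    """Interleave code words with alternating all-zeroes/all-ones spacers."""
    trace, spacers = [], [SPACER0, SPACER1]
    for i, b in enumerate(bits):
        trace.append(spacers[i % 2])
        trace.append(encode(b))
    trace.append(spacers[len(bits) % 2])
    return trace

def transitions(trace):
    """Number of rail flips between each pair of consecutive states."""
    return [sum(a != b for a, b in zip(s, t)) for s, t in zip(trace, trace[1:])]

# Every step toggles exactly one of the two rails, independent of the data.
for data in ([0, 0, 0, 0], [1, 0, 1, 1]):
    assert all(t == 1 for t in transitions(alternating_spacer_trace(data)))
```

Every half-cycle flips exactly one rail, and every full spacer–code–spacer cycle flips both rails once regardless of the data, which is the balanced switching that motivates the protocol.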
Single-rail circuits can be converted automatically into dual-spacer dual-
rail circuits with a software tool named the “Verimap design kit,” from Sokolov
et al., which successfully interfaces to the Cadence CAD tools. It takes as
input a structural Verilog netlist file, created by Cadence Ambit (or another
logic synthesis tool), and converts it into a dual-rail netlist. The resulting
netlist can then be processed by Cadence or other EDA tools.
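As a toy illustration of the kind of gate-level transformation such a tool performs (this is not Verimap's actual algorithm, and the netlist representation is invented for the example), each positive single-rail gate can be duplicated, with the complementary rail computed by its De Morgan dual:

```python
# De Morgan dual used for the complementary (false) rail of each gate.
DUAL = {"and": "or", "or": "and"}

def to_dual_rail(netlist):
    """Expand each positive single-rail gate (op, out, in1, in2) into a
    true-rail gate driving out_t and its De Morgan dual driving out_f.
    (Inverting gates would additionally swap the output rails; not shown.)"""
    dual = []
    for op, q, a, b in netlist:
        dual.append((op, q + "_t", a + "_t", b + "_t"))        # q = a op b
        dual.append((DUAL[op], q + "_f", a + "_f", b + "_f"))  # complement
    return dual

single = [("and", "n1", "a", "b"), ("or", "q", "n1", "c")]
assert to_dual_rail(single) == [
    ("and", "n1_t", "a_t", "b_t"), ("or", "n1_f", "a_f", "b_f"),
    ("or", "q_t", "n1_t", "c_t"), ("and", "q_f", "n1_f", "c_f"),
]
```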
FIGURE 8.6
Timing diagram for the TDPL inverter, showing the charge, evaluation, and
discharge phases of the input and the complementary output signals.
its way from the supply node to the load. In contrast, in charge recovery cir-
cuits each capacitance node is charged steadily, and the voltage drops across
the resistive elements are made small in order to reduce the energy dissipation
during the charge or discharge of the capacitive loads via a power clock signal.
Basically, charge recovery logic styles have been devised for low-power pur-
poses. However, they have some other characteristics such as inherent pipelin-
ing mechanism, low data-dependent power consumption, and low electromag-
netic radiations that are usually neglected by researchers. These properties
can be useful in other application areas such as side-channel attack-resistant
cryptographic hardware. Moradi et al. [84] have recently proposed using a
charge recovery logic style, called 2N-2N2P, as a side-channel attack counter-
measure. The observed results show that this logic style leads to improved
DPA-resistance as well as energy savings.
Several charge recovery styles have been proposed, each with its own char-
acteristics and efficiency, but their fundamental structures do not differ much
from each other. Due to its simplicity, Moradi et al. chose 2N-2N2P [58] to
examine the DPA-resistance of charge recovery logics [84].
A 2N-2N2P gate consists of two main parts:
• two functional blocks whose duty is to construct the gate outputs q and
its complement q̄

• two cross-coupled inverters, which are formed by two p-MOS and two
n-MOS transistors and allow maintaining the right (high or low) voltage
levels at the outputs.
All 2N-2N2P gates operate at four different phases: input phase, evaluation
phase, hold phase, and reset phase. During the input phase, the inputs reach
their own valid values. During the evaluation phase the outputs are calculated
and reach their valid values. During the hold phase the inputs discharge to 0
and the output values remain valid. Finally, during the reset phase the outputs
are discharged to 0.
The 2N-2N2P is a full-custom logic style. The transistor cost of all 2N-
2N2P logic cells is lower than that of the corresponding SABL and TDPL
cells. From the power consumption point of view, the peak of the power con-
sumption traces in DPA-resistant logic styles does not depend on the fre-
quency, but in charge recovery logic families it does. However, the average
energy consumption per cycle of a 2N-2N2P gate is much smaller than that
of the SABL and TDPL ones.
A sudden current pulse in a CMOS circuit causes a sudden variation of
electromagnetic field surrounding the device. According to Faraday’s law of
induction, any change in the magnetic environment of a coil of wire will cause
a voltage to be induced in the coil. In this way the data-dependent electromag-
netic variation can be captured by inductive sensors. The peak of power con-
sumption traces and the slope of supply current changes are also much smaller
than SABL, especially for low power clock frequencies. Therefore, 2N-2N2P
has also less electromagnetic emanations than the other static DPA-resistance
logic styles.
The pipelining structure of 2N-2N2P (and other charge recovery logic fam-
ilies) causes the circuit to process multiple data simultaneously. Therefore,
the power consumption at each cycle depends on several data that are being
processed. Obviously, a pipeline does not provide an effective countermeasure
against DPA attacks, and it can be viewed as a noise generator that has the ad-
vantage of decreasing the correlation between predictions and measurements.
However, by evaluating an information theoretic metric, mutual information,
Moradi et al. [84] conclude that the information leakage is much smaller than
in SABL, especially at low frequencies.
Detailed experimental results about charge recovery logics as DPA-
countermeasures are still missing.
signals with a masking-bit, then later removes the mask by doing another XOR
operation. In this approach, each potentially attacked signal b is represented
by bm = b ⊕ mb , where mb is a uniformly distributed random variable (i.e.,
p(mb = 0) = p(mb = 1) = 1/2) and is independent of b. Consequently, bm is
also a uniformly distributed random variable. In the masking approach,
a circuit is replaced with a masked implementation. For example, a 2-input
XOR function g = a ⊕ b is replaced with gm = am ⊕ bm and mg = ma ⊕ mb .
We refer to the implementation gm as M-XOR gate. The mg signal is called
its correction mask.
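The masked XOR construction can be checked exhaustively in a few lines; this is a behavioral sketch of the masking scheme described above, not a gate-level implementation:

```python
import itertools

def m_xor(a_m, b_m, m_a, m_b):
    """M-XOR: computes on masked shares only; returns the masked result
    g_m = a_m ^ b_m and its correction mask m_g = m_a ^ m_b."""
    return a_m ^ b_m, m_a ^ m_b

# For every data pair and every mask pair, unmasking with the correction
# mask recovers the true XOR.
for a, b, m_a, m_b in itertools.product((0, 1), repeat=4):
    g_m, m_g = m_xor(a ^ m_a, b ^ m_b, m_a, m_b)
    assert g_m ^ m_g == a ^ b

# b_m = b ^ m_b takes the values 0 and 1 equally often over the mask,
# so the masked wire value is uniformly distributed regardless of b.
for b in (0, 1):
    assert sorted(b ^ m for m in (0, 1)) == [0, 1]
```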
Masking on the gate level was first considered in the U.S. patent 6295606 of
Messerges et al. in 2001. However, the described masked gates are extremely
large because they are built from multiplexors. A different approach has
been pursued later on by Gammel et al. in patent DE 10201449. This patent
shows how to mask complex circuits such as crypto cores, arithmetic-logic
units, and complete microcontroller systems.
The problem with those masked logic styles is that glitches occur in these
circuits. As shown in [117], glitches in masked CMOS circuits reduce the SCA
resistance significantly. Therefore, glitches must be considered when introduc-
ing an SCA countermeasure based on masking.
One widely used masking-based logic style that avoids glitches is Random
Switching Logic (RSL) [117].
Popp and Mangard [96] proposed a new logic style called masked dual-rail
precharge logic (MDPL) that applies random data masking to WDDL gates.
There are no constraints for the place and route process. All MDPL cells can
be built from standard CMOS cells that are commonly available in standard
cell libraries.
Figure 8.7 shows the basic components of MDPL. The logic AND and OR
gates in WDDL were implemented with a pair of standard two-input AND
and OR gates. In MDPL, both compound gates instead use a pair of standard
3-input majority (MAJ) gates.
The architecture of cryptographic circuits using MDPL is shown in Fig-
ure 8.8. The signals (am , bm , am , bm ), which are masked with the random
data m and m, together with these random values, are the input signals of
the combinational circuit.
When examining the security of MDPL against DPA, we assume that
an attacker can predict the architecture of the combinational circuit and
the pre-masking signals (a, b, a, b) corresponding to the masked counterparts
(am , bm , am , bm ). The random numbers m and m, however, can be predicted
only with a probability of 1/2.
MDPL circuits have the following features:
2. The precharge signal controls the precharge phase to transmit (0, 0) and
the evaluation phase to transmit (0, 1) or (1, 0).
FIGURE 8.7
Basic components of MDPL: the AND/NAND gate (built from MAJ gates
and MDPL NAND gates), the XOR gate (built from MDPL NAND gates),
and the MDPL D-flip-flop (built around a CMOS D-flip-flop).
FIGURE 8.8
Architecture of an MDPL-based cryptographic circuit: the input data passes
through MDPL D-flip-flops and combinational MDPL gates, with the random
masks supplied by an RNG.
FIGURE 8.9
Random Switching Logic example: two cascaded RSL gates sharing a random
bit, with separate enable signals enable1 and enable2.
consumption correlation with the secret key is removed. However, further anal-
ysis by using a threshold filter can remove the randomizing effect caused by
the random bits used. After the random bits are removed, the stripped RSL
circuit becomes just like the original circuit with no countermeasure and can
be attacked using power analysis to single out the secret key.
Logic styles that are secure against DPA attacks must avoid early prop-
agation. Otherwise, power consumption occurs that depends on the un-
masked data values due to data-dependent evaluation moments. In [27], the
logic style Dual-rail Random Switching Logic (DRSL) is presented. In DRSL,
a cell avoids early propagation by delaying the evaluation moment until all
input signals of a cell are in a valid differential state.
Popp et al. [95] point out that DRSL does not completely avoid an early
propagation effect in the precharge phase. The reason is that the input signals,
which arrive at different moments, can still directly precharge the DRSL cell.
The propagation delay of the evaluation-precharge detection unit (EPDU)
leads to a time frame in which this can happen. Only after that time frame,
the EPDU unconditionally precharges the DRSL cell.
E01 = E10 = E11 . This is in fact the motivation for using dual-rail precharge
(DRP) logic styles such as SABL [59] or WDDL [122]. DRP logic styles have
the property that transitions need the same amount of energy, if all pairs of
complementary wires are perfectly balanced, i.e., have the same capacitive
load. However, as already discussed, this requirement is very hard or even
impossible to guarantee. This is the motivation for Masked Dual-rail Precharge
Logic (MDPL) [96]. MDPL is based on a completely different approach to
prevent DPA attacks.
MDPL is a masked logic style that prevents glitches by using the DRP
principle. Hence, for each signal dm , the complementary signal dm is also
present in the circuit. Every signal in an MDPL circuit is masked with the
same mask m. The actual data value d of a node n in the circuit results from
the signal value dm that is physically present at the node and the mask m:
d = dm ⊕ m.
Figure 8.7 shows the basic elements of the MDPL logic style. An MDPL
AND gate takes six dual-rail inputs (am , am , bm , bm , m, m) and produces
two output values (qm , qm ). As shown in [96], qm and qm can be calculated
by the so-called majority (MAJ) function. The output of this function is 1 if
more inputs are 1 than 0; otherwise, the output is 0: qm = M AJ(am , bm , m)
and qm = M AJ(am , bm , m). A majority gate is a commonly used gate and it
is available in a typical CMOS standard cell library.
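The MAJ-based construction can be verified exhaustively; this behavioral software sketch checks both the masked AND output and its complementary rail for all eight input combinations:

```python
from itertools import product

def maj(x, y, z):
    """3-input majority: 1 iff at least two inputs are 1."""
    return (x & y) | (x & z) | (y & z)

# x_m denotes x ^ m; the complementary rails carry the inverted signals.
for a, b, m in product((0, 1), repeat=3):
    a_m, b_m = a ^ m, b ^ m
    q_m = maj(a_m, b_m, m)
    q_m_bar = maj(a_m ^ 1, b_m ^ 1, m ^ 1)   # MAJ on the complementary rails
    assert q_m == (a & b) ^ m                # masked AND result
    assert q_m_bar == q_m ^ 1                # MAJ is self-dual
```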
In an MDPL circuit, all signals are precharged to 0 before the next eval-
uation phase occurs. A so-called precharge wave is started from the MDPL
D-flip-flops, similar to WDDL [122]. First, the outputs of the MDPL D-flip-
flops are switched to 0. This causes the combinational MDPL cells directly
connected to the outputs of the D-flip-flops to precharge. Then, the combina-
tional gates in the next logic level are switched into the precharge phase and
so on. Note that the mask signals are also precharged. The output signals of
the MDPL AND gate are precharged once all inputs are precharged. All com-
binational MDPL gates are implemented in that way. Therefore, in the precharge
phase, the precharge wave can propagate through the whole combinational
MDPL circuitry and all signals are precharged correctly.
A majority gate in a precharge circuit switches its output at most once
per precharge phase and at most once per evaluation phase, i.e., no glitches
occur. In a precharge circuit, all signals perform monotonic transitions
in the evaluation phase (0 to 1 only) and in the precharge phase (1 to 0 only),
respectively. Furthermore, the majority function is a so-called monotonic in-
creasing (positive) function. Monotonic transitions at the inputs of such a gate
lead to an identically oriented transition at its output. Hence, a majority gate
performs at most one transition (0 to 1) during the evaluation phase and at
most one transition (1 to 0) during the precharge phase. Since an MDPL AND
gate is built from majority gates, it will produce no glitches.
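The monotonicity argument is easy to confirm by enumeration; the small software check below verifies that raising any single input of MAJ from 0 to 1 never lowers the output:

```python
from itertools import product

def maj(x, y, z):
    """3-input majority: 1 iff at least two inputs are 1."""
    return (x & y) | (x & z) | (y & z)

# Positive (monotonic increasing) function: a 0 -> 1 change on any input
# can only cause a 0 -> 1 change at the output, never 1 -> 0.
for x, y, z in product((0, 1), repeat=3):
    assert maj(1, y, z) >= maj(x, y, z)
    assert maj(x, 1, z) >= maj(x, y, z)
    assert maj(x, y, 1) >= maj(x, y, z)
```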
Other combinational MDPL gates are based on the AND gate, as shown
in [96].
However, the use of MDPL gates has a significant cost. It implies an
average area increase by a factor of 4.5; the maximum speed is also reduced
to 0.58 of the original, and the power consumption is increased by a factor
of 4 to 6.
FIGURE 8.10
Interwire capacitance model: two driven wires w1 and w2 with ground capac-
itances C1 and C2 and interwire capacitance C3.
Although an attack exploiting early propagation is theoretically possible,
Popp et al. have recently shown that the masked logic style MDPL is not com-
pletely broken by it. In regular designs, where the signal-delay differences are
small, MDPL still provides an acceptable level of protection against DPA at-
tacks. With a PDF attack [103], no key byte of the attacked MDPL AES
coprocessor was revealed with 3,511,000 power measurements. This is also
clear from the following perspective: with the settings chosen for the PDF
attack in theory (no electronic and other forms of noise, no perfect balancing
of dual-rail wire pairs, ...), it can easily be shown that all DPA-resistant logic
styles that try to achieve a constant power consumption are also completely
broken. However, various publications draw opposite conclusions from exper-
imental results.
FIGURE 8.11
Basic components of iMDPL, including the iMDPL D-flip-flop.
the input signals reach the inputs of the three latches before the EPDU sets
its output to 0. Fortunately, this timing constraint is usually fulfilled because
of the propagation delay of the EPDU.
Finally, if the first input signal is set back to the precharge value, the
EPDU again produces a 1 and all six outputs of the set-reset latches switch to
0. Note that the set-reset latches are only set to this state by the EPDU and
not by an input signal that switches back to the precharge value. Thus, an
early propagation effect at the onset of the precharge phase is also prevented. An
iMDPL-OR cell can be derived from an iMDPL-AND cell by simply swapping
(i.e., inverting) the mask signals m and m.
Obviously, the price that has to be paid for the improvements in terms
of early propagation is a further significant increase of the area requirements
of iMDPL cells compared to MDPL. Since the iMDPL cells are already quite
complex, exact figures for the area increase cannot be given in general because
it depends significantly on the particular standard cell library that is used to
implement an iMDPL circuit. However, one can expect an increase of the area
by a factor of up to 3 compared to the original MDPL. This makes it clear that
carefully finding out which parts of a design really need to be implemented in
DPA-resistant logic is essential to save the chip area.
A significant reduction of the cell size can be achieved by designing new
standard cells that implement the functionality of iMDPL. Of course, that has
the well-known disadvantages of a greatly increased design and verification
effort. Furthermore, a change of the process technology would then mean
spending all the effort to design an iMDPL standard cell library again.
FIGURE 8.12
Basic components of PMRML: the masked AND and XOR functions are built
from SR latches and a MUX-DS with dual-rail precharged selection signals.
in the correction masks generator. Initial masks should come from a Random
Number Generator (RNG), which is assumed to be already available to the
design.
The precharge logic is used to ensure at most one transition at an AND
(NAND) gate during a cycle. This makes the gates glitch-free. Though the
precharge method has been used in many other logic styles, there are two
main differences. First, only a subset of data is conveyed by dual-rail signals.
Specifically, only the selection signals of MUX-DSs need them. Second, a multi-
stage precharge scheme is used to reduce the performance penalty caused by
the precharging time.
The proposed masked circuit for an AND gate (g = c · d) is a 4×1 MUX-DS,
which is implemented by a 2-level NAND network, and the selection variable
is dual-rail encoded. In this work, a NAND gate is assumed to be
atomic, but the whole MUX-DS circuit is not. Lin et al. [67] prove that the
masked implementation of an AND function g = c · d shown in Figure 8.12 is
glitch-safe and DTS-safe when it is used in a PMRML design.
In a multistage PMRML structure, each stage is controlled by separate
P E (precharge enable) signals. All stages start precharge (P Ek = 1) simul-
taneously, but start evaluation (P Ek = 0) at different times. The duration
of the precharge pulse of each stage should be long enough to ensure that all
the nodes in the stage have been reset. Also, no other evaluation phase should
be started until all input signals from the stage currently evaluating have
become stable.
The AES encryption hardware design was successfully synthesized by Syn-
opsys DC with a conservative wire load model under UMC 0.18 μm technol-
ogy. Compared to the unprotected design, as shown in Table 8.1, the area is
increased by 100% and the speed is decreased by 29% in the PMRML design,
which compares favorably to the alternatives. Although a precharge scheme
is used in the PMRML design, the multistage scheme prevents the speed from
being halved.
TABLE 8.1
Relative cost of different anti-DPA logic styles

Protection     Area   Performance
Unprotected    1      1
WDDL           3      0.26
MDPL*          4.54   0.58
PMRML*         2      0.71

*Does not include RNG circuit
The second group of time-oriented hiding techniques affects the clock sig-
nal. These proposals randomly change the clock signal to make the alignment
of power traces more difficult:
• Randomly skipping clock pulses, by filtering the clock signal.
• Randomly changing the clock frequency.
• Randomly switching between multiple clock domains.
For all these countermeasures, if the attacker is able to identify the coun-
termeasure, its effect can be undone.
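The trace-misalignment effect of randomly skipping clock pulses can be sketched as follows; the filtering model and parameter values are hypothetical:

```python
import random

def filtered_clock(n_pulses, skip_prob, rng):
    """Indices of the clock pulses that survive random skipping."""
    return [i for i in range(n_pulses) if rng.random() >= skip_prob]

run1 = filtered_clock(200, 0.2, random.Random(1))
run2 = filtered_clock(200, 0.2, random.Random(2))
# The k-th operation executes at a different absolute time in each run,
# so raw power traces no longer align sample-by-sample; an attacker must
# first identify and undo the skipping before averaging traces.
assert run1 != run2
```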
Amplitude-oriented hiding countermeasures try to make the power con-
sumption equal for all operations and all data values, or, alternatively, to add
noise to the power traces to counteract power analysis attacks.
• Filtering the power supply.
• Noise generators.
It is important to note that the signal-noise ratio (SNR) not only depends
on the cryptographic device, but also on the measurement setup. If a counter-
measure reduces the SNR for one measurement setup, it does not necessarily
reduce the SNR for all setups. For example, filtering the power consump-
tion makes attacks more difficult if the power consumption is measured via
a resistor, but it does not affect measurements of electro-magnetic emissions
significantly.
Architecture-level masking countermeasures usually apply Boolean mask-
ing schemes to the different circuit elements. Other types of masking are not
worth designing unless Boolean masking is not applicable. Some of the most
commonly used techniques include:
• Masking adders and multipliers.
• Random precharging. The duplicates of the registers contain random
values and they are connected randomly to the combinational cells.
• Masking buses. Buses are particularly vulnerable to power analysis at-
tacks because of their large capacitance. Basic bus encryption prevents
eavesdropping on the bus.
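A minimal sketch of the masking idea for buses, assuming the mask reaches the receiver over a separate path; this illustrates the principle only, not any specific bus-encryption scheme (names and widths are ours):

```python
import random

def masked_bus_transfer(word, rng, width=32):
    """Sender XORs the word with a fresh random mask; only the masked
    value travels over the high-capacitance bus, so a power model of the
    bus sees data that is uncorrelated with the plaintext word."""
    mask = rng.getrandbits(width)
    on_bus = word ^ mask
    return on_bus, mask            # mask reaches the receiver separately

rng = random.Random(0)
secret = 0xDEADBEEF
on_bus, mask = masked_bus_transfer(secret, rng)
assert on_bus ^ mask == secret     # receiver unmasks and recovers the word
```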
Next, some interesting architecture-level countermeasures will be discussed
in more detail.
that provide higher security against certain fault models. The area overhead
for a 32-bit AES prototype is 36%, and the power consumption is increased
by 55%.
Note, however, that if cascaded registers are allowed, there exist several
configuration options that lead to identical sequences of combinatorial and
sequential logic. Concerning this matter, and allowing up to m cascaded reg-
isters, the number c of distinct configurations is C(n + m − 1, m), i.e., the
number of combinations of m elements out of n, where the order does not
matter and repetition is allowed. The probability to observe the same configuration twice
repetition is allowed. The probability to observe the same configuration twice
is 1/c. However, the number of possible configurations determines the size of
the memory needed to store the configuration data and is therefore bounded.
Further, an increasing number of intermediate registers increases the number
of cycles needed for one encryption. The number of registers, however, does
not affect the maximal clock frequency, because more than one register can
be cascaded. In general, the number of options for the temporal shift is de-
termined by the number m of registers and the number n of blocks, and is
bounded above by c.
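The configuration count above can be computed directly; the block and register counts below are hypothetical:

```python
from math import comb

def num_configurations(n_blocks, m_registers):
    """c = C(n + m - 1, m): multisets of size m over n blocks, i.e., ways
    to distribute up to m cascaded registers among n blocks."""
    return comb(n_blocks + m_registers - 1, m_registers)

c = num_configurations(10, 3)   # hypothetical: n = 10 blocks, m = 3 registers
assert c == 220                 # probability of repeating a configuration: 1/220
```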
It is important to remember that the number of needed power traces for
a successful DPA attack grows quadratically with the number of operations
shuffled [23]. And in this case, a preprocessing phase to properly align the
traces can be very difficult.
FIGURE 8.13
Overview of a SORU2 system.
FIGURE 8.14
SORU2 datapath.
bit data operators: 1) the result from the previous BRU, 2) a new data item
from the SORU2 register file, and 3) the last result computed by itself.
The result of a BRU operation is stored in the pipeline register, so it can
be used by the next BRU at the next clock cycle. Moreover, that result can
be sent to the register file where it will be stored.
Different configurations for one-cycle operations can be prepared off-line
and stored in an external memory. The compiler is in charge of inserting
configuration code to write the required contexts while the program is loaded.
Additionally, the run-time support system can use dynamic information to
re-optimize parts of the program by using a different set of configuration
contexts.
One of the main advantages of the SORU2 architecture is the ease of the
compilation process. We have ported Low-Level Virtual Machine (LLVM) [63]
to our prototype processor, adding a new vectorization pass to implement
loops as SORU2 SIMD operations.
SORU2 has many characteristics that can be used effectively to avoid side-
channel attacks in an embedded system. We can classify them into low-power
characteristics and nondeterministic behavior.
Unlike most reconfigurable architectures, SORU2 does not include long
lines connecting many gates all over the device. Data flow inside the SORU2
execution unit is very directional, and therefore the load capacitances are
usually very low.
Memory buses usually have a high power consumption, and therefore,
whenever a cryptographic key goes through them, it leaks significant
information that can be analyzed with DPA techniques. However, in a SORU2
implementation of the algorithm, the key would only go through the memory
bus once. Then, it would be stored in a SORU2 internal register, and oper-
ated only inside the coprocessor. As power traces of the SORU2 execution
unit are much smaller, the signal-noise ratio would be much smaller, and any
statistical analysis would require many more traces to succeed.
The multiple SORU2 configuration contexts can be preloaded before start-
ing to execute the cipher algorithm, and the active configuration can change
every clock cycle.
The most straightforward approach to take advantage of the proposed
architecture for avoiding DPA attacks is to generate multiple SORU2 im-
plementations of the program loops and randomly switch between them at
run-time. Non-deterministic changes between functionally equivalent imple-
mentations of loop bodies would increase the noise level significantly at no
cost for the embedded system. An attack based on power analysis would be-
come very difficult.
The compiler finds the loops that can be extracted to SORU2 SIMD op-
erations, and uses standard compilation techniques to generate a first im-
plementation. Additionally, a simulated annealing pass in the compiler gen-
erates many different implementations of these loops starting from the first
implementation.
• Changes in the program flow, although they are usually easy to detect
by visual inspection of the power traces.
• Key-independent memory addresses.
• Parallel activity to increase the noise.
Some of the most commonly used algorithm-level masking techniques in-
clude:
• Masking table look-ups, in order to implement masking in a simple and
efficient way. However, table initialization needs to be done for all the
masks involved in the operations, and the computational effort and mem-
ory requirements are high.
• Random precharging to perform implicit masking. If the device leaks
the Hamming distance, loading or storing a random value before the
actual intermediate value works as if the intermediate value were
masked.
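Masked table look-up can be sketched with a toy 4-bit table; the table contents and masks are hypothetical, and a real design would recompute the masked table for every fresh mask pair, which is exactly the initialization cost noted above:

```python
import random

SBOX = [(7 * x + 3) % 16 for x in range(16)]     # toy 4-bit S-box (hypothetical)

def mask_table(sbox, m_in, m_out):
    """Build T_m with T_m[x ^ m_in] = T[x] ^ m_out, so that looking up a
    masked index returns a masked output; no unmasked value is ever read."""
    masked = [0] * len(sbox)
    for x in range(len(sbox)):
        masked[x ^ m_in] = sbox[x] ^ m_out
    return masked

rng = random.Random(7)
m_in, m_out = rng.randrange(16), rng.randrange(16)
t_m = mask_table(SBOX, m_in, m_out)
for x in range(16):
    assert t_m[x ^ m_in] ^ m_out == SBOX[x]      # unmasking recovers T[x]
```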
Next, some interesting algorithm-level countermeasures will be discussed
in more detail.
BRIP was originally proposed in the ECC context to protect against RPA,
and it requires an inversion for the computation. However, BRIP’s authors
state that their algorithm can also be applied in Zn for cryptosystems based
on integer factorization or the discrete logarithm.
8.6 Validation
In principle, the design flow of secure cryptosystems can take advantage of
the concept of power analysis simulation for early assessment of the suscepti-
bility of a given system to selected PA attacks, but the ever-increasing com-
plexity of SoC architectures makes it difficult to quantitatively understand
the degree of vulnerability of a system at design time. In fact, although
circuit-level, gate-level, and even register transfer level (RTL) simulations of
a whole SoC are nearly unfeasible for average performance and power con-
sumption estimation, they are even more time consuming for SCA simulation,
where a much higher
For example, in the timing attack of Brumley and Boneh [18], compiler
optimizations can make a huge difference in the amount of leakage. More-
over, the leakage found by Brumley and Boneh in the OpenSSL code was due
to algorithm-level optimizations.
Increased Visibility/Functionality
Increased visibility, or for that matter increased functionality, can facilitate
the observation of side-channel information. In general, new functionality in-
creases the complexity and hence introduces new interactions that might make
it easier to mount the measurements.
For example, special performance counters have been added to modern
microprocessors to count a wide array of events that affect a program’s per-
formance. They are able to count the number of cache accesses, as the cache
behavior is an important factor in a program’s performance. Compared with
the time stamp counter, which is currently used in the cache attacks, the per-
formance counters increase the visibility. The counters can be programmed
to solely count the events of interest. For instance, it is possible to specify
to only measure the cache read misses and not the cache write misses. Time
measurements, on the other hand, measure all events that influence the time
delay. The performance counters paint a more accurate picture and they could
enable better and faster attacks than the timestamp counter.
an example, Tiri et al. [119] show the measurements and the attack result of
an AES ASIC protected using WDDL to flatten out the power consumption.
1.5M measurements are not sufficient to find the key byte under attack.
Algorithmic masking was conceived as a mathematically proven counter-
measure, but recent developments show that practical implementation issues
can often break perfect theoretical security schemes. Earlier work has already
shown that it is vulnerable to higher-order attacks. These attacks can combine
multiple samples to remove an unknown mask because of the Hamming weight
or distance leakage estimation model used [93]. But now, using
template attacks, in which the leakage model is built from the measurements,
the authors of [92] conclude that masking has zero improvement on the secu-
rity of an implementation.
Besides, the power dissipation of masked hardware circuits is uncorrelated
to the unmasked data values, and therefore cannot be used for DPA. However,
the power dissipation of a masked hardware circuit may still be correlated
to the mask. Because of this correlation, it is possible to bias the mask by
selecting only a small slice over the entire power probability density function
(PDF). This technique has been successfully applied using an AES S-Box with
perfect masking [26]. Using logic-level simulation, Chen and Schaumont have
demonstrated the dependency between the power dissipation and the mask
value. By slicing the power PDF before mounting a DPA, each bit of the
mask can be biased. Therefore, hardware masking remains susceptible to
direct DPA through clever use of the power probability density function.
areas and are accessible to potential intruders; and 3) all these computers are
usually interconnected, allowing attacks to be propagated step by step from
the more resource-constrained devices to the more secure servers with lots of
private data.
Current ciphers and countermeasures often imply a need for more re-
sources (more computation requirements, more power consumption, specific
integrated circuits with careful physical design, etc.), but usually this is not
affordable for this kind of application. But even if we impose strong require-
ments for any individual node to be connected to our network, it is virtu-
ally impossible to update hardware and software whenever a security flaw is
found. The need to consider security as a new dimension during the whole de-
sign process of embedded systems has already been stressed [100], and there
have been some initial efforts towards design methodologies to support secu-
rity [102, 7, 8], but to the best of our knowledge no attempt has been made
to exploit the special characteristics of wireless sensor networks.
These properties have been used to build an application framework for the
development of secure applications using sensor networks [86].
8.8 Conclusions
Very often, SoCs need to communicate sensitive information inside the chip or
to the outside. To avoid eavesdropping on the communications, it is common
to use encryption schemes at different levels. However, using side-channel in-
formation, it can be very easy to obtain the secret information from the sys-
tem. Protecting against these attacks, however, is a challenge: it is costly and
must be done with care.
Every SoC component involved in the communication of sensitive infor-
mation (emitter, receiver, and communication channels) should be carefully
designed, avoiding leakages of side-channel information at every design level.
And these security concerns should be considered globally, as one design level
may interact with others. For example, compiler optimizations may create
data-dependent asymmetries in a perfectly designed algorithm, and even the
small variance in the capacitance of signal lines on the chip may be used to
discover the cryptographic key. These considerations should also be taken into
account when reusing cores.
Therefore, to keep the design time reasonable, the communication of sensi-
tive information within a SoC should be limited to a minimal set of compo-
nents, which should be designed carefully, trying to avoid any kind of infor-
mation leakage at every design level. Reuse is not possible unless these issues
have been taken into account during the design of the reused component. And
even in that case, it is important to consider the impact on the leakages of
any change from previous designs (technology, logic style, components con-
nected to the bus, protocol usage, etc.).
There is a huge catalog of countermeasures against side-channel attacks,
ranging from algorithmic transformations to custom logic styles, but none of
them by itself completely prevents these attacks. Each countermeasure has
its weaknesses. Hence, a reasonable compromise needs to be found between
the resistance against side-channel attacks and the implementation costs of
the countermeasures (performance, area, power consumption, design time,
etc.).
8.9 Glossary
AES: Advanced Encryption Standard
DDR: Double-Data-Rate
DEMA: Differential Electromagnetic Analysis
EM: Electromagnetic
EMA: Electromagnetic Analysis
EPDU: Evaluation-Precharge Detection Unit
FFT: Fast Fourier Transform
FIB: Focused Ion Beam
FPGA: Field Programmable Gate Array
FPRM: Fixed Polarity Reed-Muller canonical form
HW/SW: Hardware/Software
iMDPL: Improved Masked Dual-rail Precharge Logic
IPA: Inferential Power Analysis
LLVM: Low-Level Virtual Machine
MDPL: Masked Dual-rail Precharge Logic
PA: Power Analysis
PDA: Personal Digital Assistant
PDF: Probability Density Function
PMRML: Precharge Masked Reed-Muller Logic
QDI: Quasi-Delay Insensitive
RFID: Radio Frequency Identification Device
RIP: Randomized Initial Point
RNG: Random Number Generator
RPA: Refined Power Analysis
RSA: Rivest-Shamir-Adleman public key encryption algorithm
RSL: Random Switching Logic
SABL: Sense Amplifier Based Logic
SCA: Side Channel Analysis
SNR: Signal-to-Noise Ratio
SORU: Stream-Oriented Reconfigurable Unit
SPA: Simple Power Analysis
8.10 Bibliography
[1] D. Agrawal, J. R. Rao, and P. Rohatgi. Multi-channel attacks. In Walter
et al. [130], 2–16.
[11] E. Biham and A. Shamir. Power analysis of the key scheduling of the
AES candidates. In Proceedings of the Second AES Candidate Confer-
ence, 115–121. Addison-Wesley, New York, 1999.
[17] E. Brier and M. Joye. Weierstraß elliptic curves and side-channel at-
tacks. In David Naccache and Pascal Paillier, editors, Public Key Cryp-
tography, volume 2274 of Lecture Notes in Computer Science, 335–345.
Springer, Berlin Heidelberg, 2002.
[18] D. Brumley and D. Boneh. Remote timing attacks are practical. Com-
puter Networks, 48(5):701–716, 2005.
[30] J.-S. Coron. Resistance against differential power analysis for elliptic
curve cryptosystems. In C. K. Koç and C. Paar [21], 292–302.
[31] J.-S. Coron, D. Naccache, and P. Kocher. Statistics and secret leakage.
ACM Trans. Embed. Comput. Syst., 3(3):492–508, 2004.
[51] B. S. Kaliski Jr., Ç. K. Koç, and C. Paar, editors. Cryptographic Hard-
ware and Embedded Systems - CHES 2002, 4th International Workshop,
Redwood Shores, CA, USA, August 13-15, 2002, Revised Papers, volume
2523 of Lecture Notes in Computer Science. Springer, Berlin Heidelberg,
2003.
[82] P. L. Montgomery. Speeding the Pollard and elliptic curve methods of
factorization. Mathematics of Computation, 48:243–264, 1987.
[98] J.-J. Quisquater and D. Samyde. Eddy current for magnetic analysis
with active sensor. In Esmart 2002, Nice, France, September 2002.
[103] P. Schaumont and K. Tiri. Masking and dual-rail logic don’t add up.
In Paillier and Verbauwhede [94], 95–106.
[107] A. Shamir. Protecting smart cards from passive power analysis with
detached power supplies. In C. K. Koç and C. Paar [22], 71–77.
[116] D. Suzuki, M. Saeki, and T. Ichikawa. DPA leakage models for CMOS
logic circuits. In Rao and Sunar [99], 366–382.
[119] K. Tiri, D. Hwang, A. Hodjat, B.-C. Lai, S. Yang, P. Schaumont, and
I. Verbauwhede. A side-channel leakage free coprocessor IC in 0.18 μm
CMOS for embedded AES-based cryptographic and biometric process-
ing. In Joyner Jr. et al. [52], 222–227.
[120] K. Tiri and P. Schaumont. Changing the odds against masked logic. In
Eli Biham and Amr M. Youssef, editors, Selected Areas in Cryptography,
volume 4356 of Lecture Notes in Computer Science, 134–146. Springer,
Berlin Heidelberg, 2006.
[121] K. Tiri and I. Verbauwhede. Securing encryption algorithms against
DPA at the logic level: Next generation smart card technology. In Walter
et al. [130], 125–136.
[123] K. Tiri and I. Verbauwhede. Place and route for secure standard
cell design. In CARDIS, 2004, 143–158. Kluwer Academic Publishers,
Toulouse, FR, 2004.
[125] K. Tiri and I. Verbauwhede. A digital design flow for secure integrated
circuits. IEEE Trans. on CAD of Integrated Circuits and Systems,
25(7):1197–1208, 2006.
[133] S.-M. Yen, W.-C. Lien, S.-J. Moon, and J. C. Ha. Power analysis by
exploiting chosen message and internal collisions — vulnerability of
checking mechanism for RSA-decryption. In E. Dawson and S. Vaude-
nay, editors, Mycrypt, volume 3715 of Lecture Notes in Computer
Science, 183–195. Springer, Berlin Heidelberg, 2005.