VLSI Design

Advanced VLSI Architecture Design for Emerging Digital Systems

Guest Editors: Yu-Cheng Fan, Qiaoyan Yu, Thomas Schumann, Ying-Ren Chien, and Chih-Cheng Lu
Copyright © 2014 Hindawi Publishing Corporation. All rights reserved.

This is a special issue published in “VLSI Design.” All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Editorial Board
Roc Berenguer, Spain
Chien-In Henry Chen, USA
Kiyoung Choi, Korea
Anh Tuan Do, Singapore
Ethan Farquhar, USA
Dimitri Galayko, France
David Hernandez-Garduno, USA
Lazhar Khriji, Oman
Israel Koren, USA
David S. Kung, USA
Chang-Ho Lee, USA
Marcelo Lubaszewski, Brazil
Mohamed Masmoudi, Tunisia
Antonio Mondragon-Torres, USA
Jose Carlos Monteiro, Portugal
Fateme Moradi, Iran
Farshad Moradi, Denmark
Maurizio Palesi, Italy
Rubin A. Parekhji, India
Zebo Peng, Sweden
Gregory Peterson, USA
A. Postula, Australia
M. Renovell, France
Peter Schwarz, Germany
Jose Silva-Martinez, USA
Antonio G. M. Strollo, Italy
Junqing Sun, USA
Rached Tourki, Tunisia
Spyros Tragoudas, USA
Sungjoo Yoo, Korea
Avi Ziv, Israel
Contents
Advanced VLSI Architecture Design for Emerging Digital Systems, Yu-Cheng Fan, Qiaoyan Yu,
Thomas Schumann, Ying-Ren Chien, and Chih-Cheng Lu
Volume 2014, Article ID 746132, 2 pages

Engineering Change Orders Design Using Multiple Variables Linear Programming for VLSI Design,
Yu-Cheng Fan, Chih-Kang Lin, Shih-Ying Chou, Chun-Hung Wang, Shu-Hsien Wu, and Hung-Kuan Liu
Volume 2014, Article ID 698041, 5 pages

Design of Smart Power-Saving Architecture for Network on Chip, Trong-Yen Lee and Chi-Han Huang
Volume 2014, Article ID 531653, 10 pages

Optimization of Fractional-N-PLL Frequency Synthesizer for Power Effective Design, Sahar Arshad,
Muhammad Ismail, Usman Ahmad, Anees ul Husnain, and Qaiser Ijaz
Volume 2014, Article ID 406416, 7 pages

Performance Analysis of Modified Drain Gating Techniques for Low Power and High Speed Arithmetic
Circuits, Shikha Panwar, Mayuresh Piske, and Aatreya Vivek Madgula
Volume 2014, Article ID 380362, 5 pages

Gate-Level Circuit Reliability Analysis: A Survey, Ran Xiao and Chunhong Chen
Volume 2014, Article ID 529392, 12 pages

Low-Area Wallace Multiplier, Shahzad Asif and Yinan Kong
Volume 2014, Article ID 343960, 6 pages

Efficient Hardware Trojan Detection with Differential Cascade Voltage Switch Logic, Wafi Danesh,
Jaya Dofe, and Qiaoyan Yu
Volume 2014, Article ID 652187, 11 pages

On-Chip Power Minimization Using Serialization-Widening with Frequent Value Encoding,
Khader Mohammad, Ahsan Kabeer, and Tarek Taha
Volume 2014, Article ID 801241, 14 pages
Hindawi Publishing Corporation
VLSI Design
Volume 2014, Article ID 746132, 2 pages
http://dx.doi.org/10.1155/2014/746132

Editorial
Advanced VLSI Architecture Design for Emerging Digital Systems

Yu-Cheng Fan,1 Qiaoyan Yu,2 Thomas Schumann,3 Ying-Ren Chien,4 and Chih-Cheng Lu5
1 Department of Electronic Engineering, National Taipei University of Technology, Taipei 10608, Taiwan
2 Department of Electrical and Computer Engineering, University of New Hampshire, Durham, NH 03824, USA
3 Department of Electrical Engineering and Information Technology, Hochschule Darmstadt-University of Applied Sciences, Birkenweg 8, 64295 Darmstadt, Germany
4 Department of Electrical Engineering, National Ilan University, Yilan 260, Taiwan
5 Division for Biomedical & Industrial IC Technology, Industrial Technology Research Institute, Hsinchu 310, Taiwan

Correspondence should be addressed to Yu-Cheng Fan; [email protected]

Received 29 September 2014; Accepted 29 September 2014; Published 22 December 2014

Copyright © 2014 Yu-Cheng Fan et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

With physical feature sizes in VLSI designs decreasing rapidly, existing efficient architecture designs need to be reexamined. Advanced VLSI architecture designs are required to further reduce power consumption, shrink chip area, and raise operating frequency for high performance integrated circuits. With time-to-market pressure and rising mask costs in the semiconductor industry, engineering change order (ECO) design methodology plays a major role in advanced chip design. Digital systems such as communication and multimedia applications demand advanced VLSI architecture design methodologies so that low power consumption, small area overhead, high speed, and low cost can be achieved.

This special issue is dedicated to aspects of VLSI architecture design and their applications, with special interest in emerging digital systems. It contains eight papers that focus on power minimization design, efficient hardware Trojan detection, a low-area Wallace multiplier, gate-level circuit reliability analysis, low power and high speed arithmetic circuits, a power effective fractional-N-PLL frequency synthesizer, a power-saving architecture for network on chip, and ECO design.

In the paper entitled “On-chip power minimization using serialization-widening with frequent value encoding,” the authors address the problem of the high power consumption of on-chip data buses by exploring a new framework for the memory data bus. In particular, serialization-widening (SW) of the data bus with frequent value encoding (FVE) is proposed to minimize the power consumption of the on-chip cache data bus.

In the paper entitled “Efficient hardware trojan detection with differential cascade voltage switch logic,” the authors exploit the inherent features of differential cascade voltage switch logic (DCVSL) to detect hardware trojans (HTs) at runtime. By examining special power characteristics of DCVSL systems upon HT insertion, the authors can detect HTs even if the HT size is small. Simulation results show that the method achieves up to a 100% HT detection rate. The evaluation on ISCAS benchmark circuits shows that the scheme obtains an HT detection rate in the range of 66% to 98%.

In the paper entitled “Low-Area Wallace multiplier,” the authors propose a reduced-area Wallace multiplier without compromising the speed of the original Wallace multiplier. The proposed designs are synthesized using Synopsys Design Compiler in 90 nm process technology and achieve the lowest area cost compared to other tree-based multipliers. The speed of the proposed and reference multipliers is almost the same.

In the paper entitled “Gate-level circuit reliability analysis: a survey,” the authors provide an overview of some typical methods for reliability analysis with special focus on gate-level circuits that are either large or small, with or without reconvergent fan-outs. It is intended to help readers gain insight into the reliability issues and their complexity as well as possible solutions. Understanding reliability analysis is also a first step towards advanced circuit designs for improved reliability in future research.
In the paper entitled “Performance analysis of modified drain gating techniques for low power and high speed arithmetic circuits,” the authors present several high performance and low power techniques for CMOS circuits. In these design methodologies, the drain gating technique and its variations are modified by adding an additional NMOS sleep transistor at the output node. This helps to achieve faster discharge and thereby provides higher speed. Intensive simulations are performed using Cadence Virtuoso in a 45 nm standard CMOS technology at room temperature with a supply voltage of 1.2 V. Comparative analysis of the presented circuits with standard CMOS circuits shows smaller propagation delay and lower power consumption.

In the paper entitled “Optimization of fractional-N-PLL frequency synthesizer for power effective design,” the authors design a low power fractional-N phase-locked loop (FNPLL) frequency synthesizer for industrial application, which is based on VLSI. The design of the FNPLL has been optimized using different VLSI techniques to achieve significant performance in terms of speed with relatively low power consumption.

In the paper entitled “Design of smart power-saving architecture for network on chip,” the authors present a novel architecture, namely, smart power-saving (SPS), for low power consumption and low area in virtual channels of NoC. Compared with related works, the proposed method reduces power consumption by 37.31%, 45.79%, and 19.26% and reduces area by 49.4%, 25.5%, and 14.4%, respectively.

Finally, in the paper entitled “Engineering change orders design using multiple variables linear programming for VLSI design,” the authors present an engineering change order (ECO) design using multiple variables linear programming for VLSI design. The authors adopt a linear programming technique to plan and balance the spare cells and target cells to meet the new specification according to logic transformation. The proposed method solves the related resource problem for ECO and provides a hardware-efficient solution.

Acknowledgments
Finally, the Guest Editors would like to thank all the authors
who sent their valuable contributions and all the reviewers for
their valuable comments.
Yu-Cheng Fan
Qiaoyan Yu
Thomas Schumann
Ying-Ren Chien
Chih-Cheng Lu
Hindawi Publishing Corporation
VLSI Design
Volume 2014, Article ID 698041, 5 pages
http://dx.doi.org/10.1155/2014/698041

Research Article
Engineering Change Orders Design Using Multiple Variables Linear Programming for VLSI Design

Yu-Cheng Fan, Chih-Kang Lin, Shih-Ying Chou, Chun-Hung Wang, Shu-Hsien Wu, and Hung-Kuan Liu
Department of Electronic Engineering, National Taipei University of Technology, Taipei 10608, Taiwan

Correspondence should be addressed to Yu-Cheng Fan; [email protected]

Received 7 June 2014; Accepted 18 July 2014; Published 24 August 2014

Academic Editor: Chih-Cheng Lu

Copyright © 2014 Yu-Cheng Fan et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

An engineering change order (ECO) design using multiple variables linear programming for VLSI design is presented in this paper. The approach addresses the main resource issue between spare cells and target cells. We adopt a linear programming technique to plan and balance the spare cells and target cells to meet the new specification according to logic transformation. The proposed method solves the related resource problem for ECO and provides a sound solution. The scheme introduces a new concept for managing spare cells to meet possible target cells in ECO research.

1. Introduction

Engineering change orders (ECO) are an important technology for changing an integrated circuit (IC) layout and compensating for design problems. Traditionally, when a chip shows errors, it often requires new photomasks for all layers. However, photomasks for deep-submicron semiconductor fabrication processes are very expensive. To save money, ECO technology modifies only a few of the metal layers (metal-mask ECO), which avoids the cost of new photomasks for all layers [1].

To perform an ECO, IC designers sprinkle many unused logic gates across the design during the IC design flow. When a chip is manufactured and shows design errors, IC designers modify the gate-level net-list using these presprinkled unused logic gates. At the same time, the designers track and verify the modification to check formal equivalence after the ECO process. The designers must guarantee that the revised design matches the revised specification.

How can ECO be achieved efficiently? Several works address this problem and provide related solutions. Tan and Jiang describe a typical metal-only ECO flow with four steps: placement and spare cell distribution, logic difference extraction, metal-only ECO synthesis, and ECO routing [2]. Kuo et al. insert spare cells with constant insertion for engineering change and describe an iterative method to determine feasible mapping solutions for an EC problem [3]. In addition, to perform ECO efficiently, the works in [4–9] derive minimal-change EC equations automatically. Brand et al. proposed an incremental synthesis method [4]. Huang et al. presented a hybrid tool for automatic logic rectification [5]. Lin et al. addressed logic synthesis techniques for engineering change problems [6]. Shinsha et al. performed incremental logic synthesis through gate logic structure identification [7]. Swamy et al. achieved minimal logic resynthesis for engineering change [8]. Watanabe and Brayton presented another kind of incremental synthesis technique for engineering changes [9]. However, few researchers discuss the resource balance between spare cells and target cells. Therefore, in this paper we adopt a linear programming technique to plan and balance the spare cells and target cells. The proposed scheme meets the new specification according to logic transformation and overcomes the related resource problems in ECO research.

This paper is organized as follows. In Section 2, we address the typical ECO design flow. In Section 3, logic transformation is discussed. In Section 4, multiple variables linear
programming for VLSI design is presented. In Section 5, we discuss the advantages and disadvantages of the related works. Finally, we conclude this paper in Section 6.

2. Typical ECO Design Flow

Before describing the proposed method, we review a typical manual ECO design flow, shown in Figure 1. IC designers first make the change at the register transfer level and verify that the fixed code matches the new specification. Then, the old net-list is scanned to find possible fix points. Once these are found, IC designers modify the net-list and check functional equivalence between the new net-list and the new register transfer level [10–14].

Figure 1: A typical ECO design flow. (Blocks: old register transfer level, logic synthesis, DFT, placement and routing, new register transfer level, ECO old net-list, ECO editing, ECO new net-list.)

Next, we describe the two-phase ECO design flow in Figure 2. To patch the logic of the modified circuit, we prepare a list of available spare cells. According to its logic function, the modified circuit is mapped to specified logic. After patching, an equivalence check and a timing check are performed to make sure that the new function meets the new specification. However, some important problems appear during patching. Are there enough spare cells, of the right types, to satisfy the consumption of the patch logic? How can the quantity and logic types needed by the ECO procedure be estimated? To solve this problem, we propose an engineering change orders design using multiple variables linear programming for VLSI design in this paper.

Figure 2: Two-phase ECO design flow. (Blocks: find out patch logic, patch logic, available spare cell list, mapping, mapped patch logic, equivalent and timing check.)

3. Logic Transformation

Before discussing the engineering change orders design using multiple variables linear programming, we address logic transformation. Figure 3 describes an ECO problem with the equation $\text{out} = (A + (BC)')'$. Figure 3(b) lists the available spare cells. According to the list, we discover that the available spare cells are not enough. To solve the problem, we adopt another mapping solution, shown in Figure 3(c), with the equation $\text{out} = A'BC$ instead of the original equation. It requires one AND and one INV gate. The mapping solution in Figure 3(c) requires fewer gates than the available spare cells and is constructed with the available spare cells.

However, most spare cells only provide basic logical functions such as AND, OR, NOT, NAND, and NOR. Half Adder (HA), Full Adder (FA), And-Or-Inverter (AOI), and Or-And-Inverter (OAI) cells can provide complex logical functions, and we can adopt these logic cells to perform the ECO function. For example, AOI22 can be implemented by two NAND cells and one AND cell, as in Figure 4. According to the existing spare cell resources, we can resynthesize the changed function lists.

4. Multiple Variables Linear Programming for VLSI Design

Although logic transformation makes ECO technology practical, a chip often does not have enough spare cells to modify the function to meet a new specification. How should the limited resource be allocated? We should estimate the quantity of spare cells and the logic transformation rules to perform optimal engineering change orders.

Figure 5 describes the engineering change orders design using multiple variables linear programming for VLSI design and the relation of logic transformation. “Logic A” is one kind of spare cell that can be transformed into “Logic a” or “Logic b.” Similarly, “Logic B” can be transformed into “Logic a,” “Logic b,” or “Logic c.” “Logic C” can perform the ECO function in place of “Logic c” or “Logic d.” In addition, “Logic D” is transformed into “Logic c” or “Logic d.” Likewise, “Logic E” is transformed into “Logic d” or “Logic e” to achieve the ECO function.

We assume $X_1$, $X_2$, $X_3$, $X_4$, and $X_5$ are the numbers of spare cells of Logic A, Logic B, Logic C, Logic D, and Logic E, respectively. Let $Y_1$, $Y_2$, $Y_3$, $Y_4$, and $Y_5$ be the desired numbers of target cells of Logic a, Logic b, Logic c, Logic d, and Logic e, respectively.

In addition, $Aa$ is the number of spare cells (Logic A) to be transformed into Logic a and $Ab$ is the number of spare cells (Logic A) to be transformed into Logic b. Similarly, $Ba$, $Bb$, and $Bc$ are the numbers of spare cells (Logic B) to be transformed into Logic a, Logic b, and Logic c. In a similar way, $Cc$ and $Cd$ are the numbers of spare cells (Logic C) to be transformed into Logic c and Logic d, $Dc$ and $Dd$ are the numbers of spare cells (Logic D) to be transformed into Logic c and Logic d, and $Ed$ and $Ee$ are the numbers of spare cells (Logic E) to be transformed into Logic d and Logic e.
Figure 3: Example of an ECO problem. (a) EC equation: $\text{output} = (A + (BC)')'$. (b) Spare cells: AND 2, OR 2, INV 1, NAND 1. (c) Mapping: $\text{output} = A'BC$.

Figure 4: AOI22 can be implemented by two NAND and one AND cells.
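
The two transformations in Figures 3 and 4 can be confirmed by exhaustive truth-table comparison. The following Python sketch is only an illustration: the function names are hypothetical, and AOI22 is taken in its usual sense as the complement of $AB + CD$, which the text does not spell out.

```python
from itertools import product

def eco_original(a, b, c):
    # EC equation from Figure 3(a): out = (A + (BC)')'
    return not (a or not (b and c))

def eco_mapped(a, b, c):
    # Mapping from Figure 3(c): out = A'BC
    return (not a) and b and c

def aoi22(a, b, c, d):
    # AOI22 (assumed definition): complement of (AB + CD)
    return not ((a and b) or (c and d))

def aoi22_from_spares(a, b, c, d):
    # Two NAND cells followed by one AND cell, as in Figure 4
    nand_ab = not (a and b)
    nand_cd = not (c and d)
    return nand_ab and nand_cd

def equivalent(f, g, n_inputs):
    """Return True if f and g agree on every combination of n_inputs bits."""
    return all(f(*bits) == g(*bits)
               for bits in product([False, True], repeat=n_inputs))

print(equivalent(eco_original, eco_mapped, 3))
print(equivalent(aoi22, aoi22_from_spares, 4))
```

Both comparisons follow from De Morgan's laws, so an exhaustive check over 8 and 16 input combinations is enough here; larger patches would normally go through a formal equivalence checker instead.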

Figure 5: ECO design using multiple variables linear programming for VLSI design and relation of logic transformation. (Spare cells Logic A to Logic E, with counts $X_1$ to $X_5$, connect to target cells Logic a to Logic e, with counts $Y_1$ to $Y_5$, through the edges $Aa$, $Ab$, $Ba$, $Bb$, $Bc$, $Cc$, $Cd$, $Dc$, $Dd$, $Ed$, and $Ee$.)

Therefore, the restriction rule of the number of spare cells and transformed target cells in Figure 5 is written as follows:

$$
\begin{aligned}
X_1 &= Aa + Ab;\\
X_2 &= Ba + Bb + Bc;\\
X_3 &= Cc + Cd;\\
X_4 &= Dc + Dd;\\
X_5 &= Ed + Ee.
\end{aligned}
\tag{1}
$$

Besides, the restriction rule of the engineering change orders design using multiple variables linear programming in Figure 5 is written as follows:

$$Aa + Ba \geq Y_1; \tag{2}$$

$$Ab + Bb \geq Y_2; \tag{3}$$

$$Bc + Cc + Dc \geq Y_3; \tag{4}$$

$$Cd + Dd + Ed \geq Y_4; \tag{5}$$

$$Ed + Ee \geq Y_5. \tag{6}$$

However, spare cells are not often enough; the designer should balance the spare cell allocation to meet all requirements of desirable cells.

We assume one case when $Bb \leq Y_2$. In order to provide enough spare cells, we should increase the number of $Ab$ to achieve $Ab + Bb \geq Y_2$.

Similarly, when $Cc + Dc \leq Y_3$, we should increase the number of $Bc$ to meet $Bc + Cc + Dc \geq Y_3$.

Therefore, we define another restriction rule of the engineering change orders design which is written as follows:

$$
\begin{aligned}
Aa &= X_1 - Ab;\\
Ba &= X_2 - Bb - Bc.
\end{aligned}
\tag{7}
$$
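
Rules (1) to (6) lend themselves to a mechanical check of a proposed allocation. The Python sketch below is an illustration under stated assumptions, not the authors' implementation: the names `TRANSFORMS` and `check_allocation` and all cell counts are hypothetical, and each demand is computed from the edges of Figure 5, so Logic e is supplied by $Ee$ alone; where that differs from rule (6) as printed, treat it as an assumption of the sketch.

```python
# Edges of Figure 5: which target cells each spare-cell type can become.
TRANSFORMS = {
    "A": ["a", "b"],
    "B": ["a", "b", "c"],
    "C": ["c", "d"],
    "D": ["c", "d"],
    "E": ["d", "e"],
}

def check_allocation(alloc, spare, target):
    """Return a list of violated rules for a proposed allocation.

    alloc maps (spare_type, target_type) to a cell count; ("A", "b") is Ab.
    spare maps spare types to X values; target maps target types to Y values.
    """
    problems = []
    for (s, t), count in alloc.items():
        if t not in TRANSFORMS.get(s, []) or count < 0:
            problems.append(f"invalid entry {s}{t}={count}")
    # Rule (1): the cells of each spare type are fully split among its targets.
    for s, x in spare.items():
        used = sum(alloc.get((s, t), 0) for t in TRANSFORMS[s])
        if used != x:
            problems.append(f"supply {s}: {used} != {x}")
    # Rules (2) onward: each target type receives at least its demand.
    for t, y in target.items():
        got = sum(alloc.get((s, t), 0)
                  for s in TRANSFORMS if t in TRANSFORMS[s])
        if got < y:
            problems.append(f"demand {t}: {got} < {y}")
    return problems

# Hypothetical numbers, chosen only to exercise the check.
spare = {"A": 3, "B": 4, "C": 2, "D": 2, "E": 3}
target = {"a": 2, "b": 4, "c": 3, "d": 3, "e": 2}
alloc = {("A", "a"): 1, ("A", "b"): 2,
         ("B", "a"): 1, ("B", "b"): 2, ("B", "c"): 1,
         ("C", "c"): 2, ("C", "d"): 0,
         ("D", "c"): 0, ("D", "d"): 2,
         ("E", "d"): 1, ("E", "e"): 2}
print(check_allocation(alloc, spare, target))  # an empty list means all rules hold
```

A checker of this kind only validates a candidate allocation; finding one is the job of a linear programming solver, which would search over the same variables subject to the same rules.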
According to formulas (2) and (7), we can balance the number of $Ab$, $Bb$, and $Bc$ to achieve the target number $Y_1$. Consider the following:

$$X_1 - Ab + X_2 - Bb - Bc \geq Y_1. \tag{8}$$

In a similar way, we define the restriction rule of the engineering change orders design which is written as follows:

$$
\begin{aligned}
Bc &= X_2 - Ba - Bb;\\
Cc &= X_3 - Cd;\\
Dc &= X_4 - Dd.
\end{aligned}
\tag{9}
$$

According to formulas (4) and (9), we can balance the number of $Bc$, $Cc$, and $Dc$ to achieve the target number $Y_3$. Consider

$$X_2 - Ba - Bb + X_3 - Cd + X_4 - Dd \geq Y_3. \tag{10}$$

We model the engineering change orders problem using multiple variables linear programming. From these functions, we can understand the relation between supply and requirement in engineering change orders. The designer can then estimate and perform ECO using spare cells efficiently.

5. Discussion

In this section, we discuss the advantages and disadvantages of the related works. Table 1 compares the ECO methods. The proposed approach designs a multiple variable linear programming ECO for VLSI design. Our method can predict cell resources accurately using multiple variable linear programming techniques, whereas traditional ECO does not predict them well. In addition, our scheme provides a highly accurate prediction of the number of patching logic cells to balance spare cells and target cells, which is hard for the traditional ECO method. Moreover, we define the restriction rules, resource optimization, and solution boundary of the ECO problem to increase the efficiency of the proposed ECO method and provide a sound solution.

Table 1: ECO methods comparison.

Method                                          Traditional ECO     Proposed method
Cell resource prediction                        Constraint based    Multiple variables linear programming
Predictive precision of patching logic number   Normal precision    High precision
Balance between spare cells and target cells    Low balance         High balance
Restriction rule                                Not defined         Defined
Resource optimization                           Not defined         Defined
Solution boundary                               Not defined         Defined

6. Conclusion

In this paper, we proposed an engineering change orders design using multiple variables linear programming for VLSI design. The paper discusses the typical ECO design flow, logic transformation, and multiple variables linear programming for VLSI design. The presented scheme estimates the resource of spare cells and provides a sound solution to ECO problems.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Science Council of Taiwan under Grant nos. NSC 101-2221-E-027-135-MY2 and 102-2622-E-027-008-CC3. The authors gratefully acknowledge the Chip Implementation Center (CIC), for supplying the technology models used in IC design.

References

[1] J. A. Roy and I. L. Markov, “ECO-system: embracing the change in placement,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 12, pp. 2173–2185, 2007.
[2] C. Tan and I. H. Jiang, “Recent research development in metal-only ECO,” in Proceedings of the 54th IEEE International Midwest Symposium on Circuits and Systems (MWSCAS ’11), pp. 1–4, August 2011.
[3] Y. M. Kuo, Y. T. Chang, S. C. Chang, and M. Marek-Sadowska, “Spare cells with constant insertion for engineering change,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 28, no. 3, pp. 456–460, 2009.
[4] D. Brand, A. Drumm, S. Kundu, and P. Narain, “Incremental synthesis,” in Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pp. 14–18, 1994.
[5] S. Huang, K. Chen, and K. Cheng, “AutoFix: A hybrid tool for automatic logic rectification,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 18, no. 9, pp. 1376–1384, 1999.
[6] C. Lin, K. Chen, and M. Marek-Sadowska, “Logic synthesis for engineering change,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 18, no. 2-3, pp. 282–292, 1999.
[7] T. Shinsha, T. Kubo, Y. Sakataya, J. Koshishita, and K. Ishihara, “Incremental logic synthesis through gate logic structure identification,” in Proceedings of the IEEE/ACM Conference on Design Automation, pp. 391–397, Jun 1986.
[8] G. Swamy, S. Rajamani, C. Lennard, and R. K. Brayton, “Minimal logic re-synthesis for engineering change,” in Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 1596–1599, 1997.
[9] Y. Watanabe and R. K. Brayton, “Incremental synthesis for engineering changes,” in Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD ’91), pp. 40–43, Cambridge, Mass, USA, October 1991.
[10] J. Wang, Finding the Minimal Logic Difference for Functional ECO, Taiwan Cadence Design Systems, 2012.
[11] Y. C. Fan and H. W. Tsao, “Watermarking for intellectual property protection,” IEE Electronics Letters, vol. 39, no. 18, pp. 1316–1318, 2003.
[12] Y. Fan and H. Tsao, “Boundary scan test scheme for IP core identification via watermarking,” IEICE Transactions on Information and Systems, vol. E88-D, no. 7, pp. 1397–1400, 2005.
[13] Y. Fan, “Testing-based watermarking techniques for intellectual-property identification in SOC design,” IEEE Transactions on Instrumentation and Measurement, vol. 57, no. 3, pp. 467–479, 2008.
[14] Y. Fan and Y. Chiang, “Discrete wavelet transform on color picture interpolation of digital still camera,” VLSI Design, vol. 2013, Article ID 738057, 9 pages, 2013.
Hindawi Publishing Corporation
VLSI Design
Volume 2014, Article ID 531653, 10 pages
http://dx.doi.org/10.1155/2014/531653

Research Article
Design of Smart Power-Saving Architecture for Network on Chip

Trong-Yen Lee and Chi-Han Huang


Department of Electronic Engineering, National Taipei University of Technology, Taipei 10608, Taiwan

Correspondence should be addressed to Trong-Yen Lee; [email protected]

Received 5 June 2014; Accepted 27 June 2014; Published 6 August 2014

Academic Editor: Yu-Cheng Fan

Copyright © 2014 T.-Y. Lee and C.-H. Huang. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

In network-on-chip (NoC), transferring data through virtual channels can avoid data loss and deadlock. Each input or output port of a router contains many virtual channels, and since a router includes five I/O ports, power consumption in the virtual channels becomes an important issue. In this paper, a novel architecture, namely, Smart Power-Saving (SPS), for low power consumption and low area in virtual channels of NoC is proposed. The SPS architecture adapts to different environmental factors to dynamically save power and optimize area in NoC. Compared with related works, the proposed method reduces power consumption by 37.31%, 45.79%, and 19.26% and reduces area by 49.4%, 25.5%, and 14.4%, respectively.

1. Introduction

In recent years, 3-dimensional IC and TSV (Through-Silicon Via) technologies have been proposed to solve area issues. The 3-dimensional IC of the Intel Ivy Bridge processor and the 16-core multicore architecture can be implemented in 22 nm [1]. Consequently, multicore and heterogeneous systems are popular research topics in SoC (system-on-chip) design. These architectures require high throughput and performance to transfer data in a multicore SoC. The NoC (network-on-chip) has been proposed to meet this requirement, but it introduces new problems such as power consumption and area [2, 3].

The NoC architecture [1] consists of processing elements (PEs), network interfaces (NIs), routers, and a topology, as shown in Figure 1. The PEs transfer information to the NI, and the NI packages the information into flits and then passes them to the routers. Routers are of three kinds: corner router (CR), edge router (ER), and router (R). The CR, ER, and R have three, four, and five I/O ports, respectively, and each port includes $n$ virtual channels. A router includes a transmission channel, routing computation (RC), a virtual channel arbiter (VA), a switch arbiter (SA), and a crossbar (XBAR). Flits comprise header, body, and tail flits; the header flit carries the PE priority, source address, destination address, and so forth. The RC uses the header flit and routing algorithms to find the transmission path. The VA uses two-stage arbitration to select the highest-priority packet for transmission and then registers the transmission channel. The SA uses two-stage arbitration to select the body flits that enter the XBAR for transmission. The VA operates when a packet arrives, and the SA operates when a flit arrives. The tail flit represents the last flit, after which the router unregisters the transmission channel. Router topologies include mesh, star, and fat tree [4, 5].

Yoon et al. [6] analyze how virtual channels (VCs) can avoid routing and protocol deadlock and improve routing performance when packet traffic is congested. VCs can solve difficult packet-switching issues, but they introduce power, area, and other issues in the NoC.

Nicopoulos et al. [2] proposed the IntelliBuffer architecture, which addresses PV (process variation) to reduce power consumption in layer 1 [7]. It differs from the conventional architecture in two fundamental ways. First, the buffer slots use clock-gating to reduce power consumption when slots are empty; to avoid losing transmitted data, the clock of one slot in each I/O port is kept running to accept data. Second, the router creates a leakage classification register (LCR) table, and the write and read pointers always access the lowest-power slots from the LCR table.

Taassori et al. [3] proposed an adaptive data compression technique to reduce the number of packet bits in layer 3 [7]. This reduces the number of transmissions and therefore improves the power consumption of the router. Palma et al. [8] use the T-Bus-Invert technique to reduce the Hamming-distance transition activity rate and thereby improve power consumption.
Figure 1: NoC architecture. (A 3 × 3 mesh with corner routers (CR) at the corners, edge routers (ER) on the edges, and a router (R) at the center; each router connects to a PE through an NI.)
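
The three router classes in Figure 1 follow from a router's position in the mesh. The Python sketch below is a hedged illustration, not part of the SPS design: the function names are hypothetical, and it assumes one port per neighbouring router plus one local port to the NI, which matches the three, four, and five I/O ports given for CR, ER, and R.

```python
def router_ports(row, col, rows, cols):
    """Count I/O ports of the router at (row, col) in a rows x cols mesh.

    One port per neighbouring router plus one local port to the NI/PE
    (the local port is an assumption of this sketch). Intended for
    meshes of at least 2 x 2.
    """
    neighbours = sum([row > 0, row < rows - 1, col > 0, col < cols - 1])
    return neighbours + 1

def router_kind(row, col, rows, cols):
    # Label by port count as in the text: CR has three ports, ER four, R five.
    return {3: "CR", 4: "ER", 5: "R"}[router_ports(row, col, rows, cols)]

# Reproduce the 3 x 3 layout of Figure 1, one mesh row per printed line.
for r in range(3):
    print(" ".join(router_kind(r, c, 3, 3) for c in range(3)))
```

Because each port carries $n$ virtual channels, the port count gives a quick estimate of how many virtual channels a router of each class holds.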

Jafarzadeh et al. [9] use end-to-end data coding to minimize the switching activity rate and the routing path, which improves NI power consumption.
Lee et al. [10] proposed a buffer clock-gating architecture that uses clock-gating to reduce the transmit power consumption when slots are empty or full. Ezz-Eldin et al. [11] proposed an adaptive virtual channel with two parts in layer 1 [7]. First, the work uses a hierarchical multiplexing tree for the Virtual Channels (VCs) to reduce area. Second, it uses clock-gating to reduce power consumption. Rosa et al. [12] proposed dynamic frequency scaling in the PE for NoC. It considers the communication and loading rate to control the router frequency and thereby reduce the power consumption.
Huaxi et al. [13] proposed a fat tree-based optical NoC; this architecture covers topology, placement, layout, and protocol. That work proposed a low-power, low-cost optical turnaround router to improve the power consumption. Gu et al. [14] proposed the Cygnus router, which optimizes the router algorithms to reduce the power consumption. Swaminathan et al. [15] create two FIFOs in the NI and use dynamically configured data access to the two FIFOs to improve throughput and power consumption.
In the next section we analyse the power consumption under different VC accesses. In Section 3 we introduce the topology and router packet architecture and add the SPS to the router to save power. In Section 4 we present the SPS with router design. Section 5 contains experimental results and Section 6 concludes this paper.

2. Power Issue with Virtual Channels

Multicore architectures and big data communication are becoming more popular in next-generation systems. Traditional communication technologies cannot handle the large amount of traffic on multicore and heterogeneous chips. The NoC can solve this issue: it uses a network transmission method so that different cores can communicate at the same time. The NoC solves the communication issue, but the large volume of data access increases the power consumption.
The router, composed of the arbitration unit and the transmission unit [16], is illustrated in Figure 2. The arbitration unit selects the highest priority packet to send to the next router. The arbitration unit includes routing computation (RC), VC arbiter (VA), and switch arbiter (SA). The RC calculates routing paths and priorities. The VA contains a number of two-stage arbitrations to select packets and sign up VCs. The first stage selects the local highest priority packet from the input VCs to the crossbar and signs up VCs. The second stage selects the global highest priority packet from the input crossbar to the output VCs and signs up VCs. The SA also contains a number of two-stage arbitrations to select flits for transmission. The first stage selects the local highest priority flits from the input VCs to the crossbar. The second stage selects the global highest priority flits from the input crossbar to the output VCs. The VA executes per packet and the SA executes per flit.
The router with transmission unit is illustrated in Figure 3. This unit includes $n$VCs to pass large packets from the input physical channel to the output physical channel. The power consumption of the VCs is calculated as shown in (1). The variable $n$ represents the number of packets or flits accessed in the VCs, $f$ represents the access frequency in the VCs, $c$ represents the capacitance, and $V$ represents the voltage in the VCs. Nicopoulos et al. [2] and Katabami et al. [17] proposed clock-gating to solve this issue.
In this paper, we propose dynamic control of each virtual channel clock in different transmission environments. Whether or not the packet transfer is complete, the SPS can effectively reduce the power consumption without affecting the transmission performance. Consider

$P_{n\mathrm{VCs}} = \sum_{n=1} n \times (f \times c \times V^{2})$. (1)

3. Router and Topology with SPS

3.1. Relation of Topology and Router. The relation of topology and router is illustrated in Figure 4. The router uses a different transmission mode for each topology. For example, the mesh uses X-Y routing to transmit. The X-Y routing flow chart for a 2 × 2 mesh is illustrated in Figure 5. When the MSB of the destination router address ($R_{dm}$) is equal to the MSB of the current router address ($R_{cm}$) and if the LSB of router
Figure 2: Router architecture with NoC.

Figure 3: Transmission unit.

addresses ($R_{dl}$ and $R_{cl}$) is also equal, the flit has arrived. Otherwise, the X-Y routing algorithm proceeds in two stages. In stage one, the flits are forwarded along the x-axis routers until $R_{dm}$ equals $R_{cm}$. In stage two, the flits are forwarded to the destination along the y-axis routers. The virtual channel is initialized when a packet is transmitted between two routers; the procedure is shown in Algorithm 1.
The control method of the arbiter architecture is designed according to the transmission mode. The VC arbiter and switch bar follow the topology and priority used to design the routing computation unit. Algorithm 2 constructs the two-stage VC arbitration per packet. Stage 1 selects the high priority packet that enters the crossbar from the local VCs (input VCs) at lines 3 to 4 and lines 8 to 10. Stage 2 selects the most important packet to transmit from the global VCs (output VCs) at lines 5 to 6 and lines 11 to 13.

Sign up Algorithm
Input: 𝑅𝑟𝑜𝑡ℎ and 𝐸𝑚𝑝 .
(1) while (flits arrival) do
(2) if (𝑅𝑟𝑜𝑡ℎ𝑓2 is header and 𝑎𝑑𝑥 is free channel)
(3) {sign up the channel and select the channel to output}
(4) else if (𝑅𝑟𝑜𝑡ℎ𝑓2 is body and 𝑎𝑑𝑥 = 𝑅𝑟𝑜𝑡ℎ𝑠2 )
(5) {select the channel to output}
(6) else if (𝑅𝑟𝑜𝑡ℎ𝑓2 is tail and 𝑎𝑑𝑥 = 𝑅𝑟𝑜𝑡ℎ𝑠2 )
(7) {clear the channel and select the channel to output;}
(8) else
(9) {read back flit to virtual channel}
(10) end while

Algorithm 1: Channel sign up algorithm.
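
The channel sign-up rule of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' hardware: the class name `SignUpTable`, the `packet_id` argument, and the boolean return value are assumptions introduced here to make the header/body/tail behaviour concrete.

```python
from enum import Enum


class FlitType(Enum):
    HEADER = "header"
    BODY = "body"
    TAIL = "tail"


class SignUpTable:
    """Hypothetical sign-up register: one owner entry per output channel."""

    def __init__(self, num_channels):
        # None means the channel is free (nobody has signed it up).
        self.owner = [None] * num_channels

    def handle(self, flit_type, channel, packet_id):
        """Return True if the flit may go to the output, False if it must wait."""
        current = self.owner[channel]
        if flit_type is FlitType.HEADER and current is None:
            self.owner[channel] = packet_id  # sign up the channel
            return True
        if flit_type is FlitType.BODY and current == packet_id:
            return True  # channel already signed up by this packet
        if flit_type is FlitType.TAIL and current == packet_id:
            self.owner[channel] = None  # tail flit clears the channel
            return True
        return False  # flit is read back into the virtual channel


table = SignUpTable(num_channels=2)
print(table.handle(FlitType.HEADER, 0, "A"))  # header of packet A claims channel 0
print(table.handle(FlitType.HEADER, 0, "B"))  # packet B has to wait
```

Keeping the owner per channel is one simple way to make body and tail flits follow the path reserved by their header; a hardware register would hold the same information in a few bits.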
Figure 4: Topology and router relation with SPS.

Figure 5: X-Y routing flow chart. The chart first tests Rdm == Rcm; if it holds, it tests Rdl == Rcl (yes: arrival; no: Rdl > Rcl selects down, otherwise up); if it does not hold, Rdm > Rcm selects between right and left.
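
The decision order of the X-Y routing flow chart can be sketched in Python for the 2 × 2 mesh. This is an illustrative sketch under stated assumptions: router addresses are taken to be 2-bit integers (MSB = x coordinate, LSB = y coordinate), and the mapping of "destination MSB greater than current MSB" to "right" is an assumption, since the extracted flow chart does not make the right/left assignment unambiguous. The function names are hypothetical.

```python
def xy_route(current, destination):
    """Next-hop direction for a 2 x 2 mesh with 2-bit router addresses."""
    rcm, rcl = (current >> 1) & 1, current & 1          # current MSB, LSB
    rdm, rdl = (destination >> 1) & 1, destination & 1  # destination MSB, LSB
    if rdm != rcm:
        # Stage one: move along the x-axis until the MSBs match.
        return "right" if rdm > rcm else "left"
    if rdl != rcl:
        # Stage two: move along the y-axis until the LSBs match.
        return "down" if rdl > rcl else "up"
    return "arrival"


def trace(source, destination):
    """Directions taken from source to destination (each hop flips one address bit)."""
    hops = []
    current = source
    while True:
        step = xy_route(current, destination)
        if step == "arrival":
            return hops
        hops.append(step)
        current ^= 0b10 if step in ("right", "left") else 0b01


print(trace(0b00, 0b11))
```

Because the x-axis is always resolved before the y-axis, every packet between a given pair of routers follows the same path, which is what makes this routing scheme simple to implement in the RC unit.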

Virtual channel arbitration


Input: header flits
/∗Control signal enable∗/
(1) while (header flits) do
(2) use lottery arbitration to select local and global highest priority flits
(3) if (local)
(4) {𝑉𝑎𝑖 = local input virtual channel address}
(5) if (global)
(6) {𝑉𝑎𝑜 = global input virtual channel address}
(7) end while
/∗Channel switch∗/
(8) Case 𝑉𝑎𝑖
(9) {𝐶𝑟𝑖1 = local packet of 𝑉𝑎𝑖 }
(10) end case
(11) Case 𝑉𝑎𝑜
(12) {𝑅𝑖𝑡 = global packet of 𝑉𝑎𝑜 }
(13) end case

Algorithm 2: VC arbitration algorithm.
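
As a rough illustration of two-stage arbitration with a lottery scheme, the following Python sketch picks one VC per input port (local stage) and then one port overall (global stage). It is a generic ticket-based lottery, not the arbiter of [18]; the ticket counts, the port and VC names, and the function names are assumptions made for this example.

```python
import random


def lottery_arbitrate(tickets, rng):
    """Pick one requester with probability proportional to its ticket count.

    tickets: dict mapping requester -> non-negative integer ticket count.
    Returns None when nobody holds a ticket.
    """
    active = {r: t for r, t in tickets.items() if t > 0}
    if not active:
        return None
    draw = rng.randrange(sum(active.values()))  # integer in [0, total)
    running = 0
    for requester, count in active.items():
        running += count
        if draw < running:
            return requester


def two_stage(ports, rng):
    """Stage 1: one winning VC per port. Stage 2: one winning port overall."""
    local_winners = {}
    for port, vcs in ports.items():
        vc = lottery_arbitrate(vcs, rng)
        if vc is not None:
            local_winners[port] = vc
    # The winning VC carries its own ticket count into the global stage.
    port_tickets = {p: ports[p][vc] for p, vc in local_winners.items()}
    port = lottery_arbitrate(port_tickets, rng)
    return None if port is None else (port, local_winners[port])


rng = random.Random(0)
requests = {"east": {0: 3, 1: 1}, "west": {0: 2}, "south": {0: 0}}
print(two_stage(requests, rng))
```

A higher ticket count plays the role of a higher priority: it raises the chance of winning without starving low-priority requesters completely.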



Switch arbitration
Input: body and tail flits
/∗Control signal enable∗/
(1) while (body or tail flits) do
(2) use channel sign up register to select local and global highest priority flits
(3) if (local)
(4) {𝑆𝑎𝑖 = local input virtual channel address}
(5) if (global)
(6) {𝑆𝑎𝑜 = global input virtual channel address}
(7) end while
/∗Channel switch∗/
(8) Case 𝑆𝑎𝑖
(9) {𝐶𝑟𝑖2 = local packet of 𝑆𝑎𝑖 }
(10) end case
(11) Case 𝑆𝑎𝑜
(12) {𝑅𝑜𝑡 = global packet of 𝑆𝑎𝑜 }
(13) end case

Algorithm 3: Switch arbitration algorithm.

Figure 6: Router connection topology architecture: (a) star, (b) mesh, (c) ring, (d) tree.

Algorithm 3 constructs the two-stage switch arbitration per flit. Stage 1 selects the high priority flit that enters the crossbar from the local VCs (input VCs) at lines 3 to 4 and lines 8 to 10. Stage 2 selects the most important flit to transmit from the global VCs (output VCs) at lines 5 to 6 and lines 11 to 13.
In the transmission channel architecture, the router has four directions that connect to other routers and one local physical channel that connects to the PE. Each physical channel except the local one has $n$VCs. The switch bar transmits the most important packet to the output channel. The SPS controls the power consumption of each VC when the channel status changes. The SPS architecture is introduced in the next section.

3.2. Topology Architecture. The topology defines the packet transmission path between routers and links. The router connection topology architectures are shown in Figure 6; they include star, mesh, ring, and tree topologies. In the arbitration unit, the RC algorithm depends on the topology architecture, while the VA and SA algorithms depend on packet priority. In this paper, the topology is the 2 × 2 mesh, the RC algorithm is X-Y routing, and the VA and SA algorithms are lottery [18].
The router connected to a PE is shown in Figure 7; the PE and the router exchange information through the network interface (NI), which handles the information between router and PE. The NI uses a two-level design [19], as shown in Figure 8. It contains three modules to meet the specifications of the different layers. The shell module needs to meet the IP specification. The kernel module needs to meet the NoC topology specification.

3.3. Flits with Router Architecture. The flit specification of the router is shown in Figure 9. The 2-bit flit type 00 represents a single-flit packet; this flit type does not sign up VCs. The 2-bit type 01 represents the header flit, which includes routing information and addresses; this flit type is always used to sign up the channel. The 2-bit type 10 represents the body flit, which carries transmission information; its payload records a segment of the packet. The 2-bit type 11 represents the tail flit, which carries the last transmission information; this flit not only records the last segment of the packet but also clears the VCs.
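
The four flit types can be decoded with a few bit operations, as the following Python sketch shows. It assumes an 18-bit flit whose two most significant bits hold the type field; the bit position of the type field and the names `decode_flit`, `signs_up_channel`, and `clears_channel` are assumptions for illustration, not part of the published flit format.

```python
FLIT_TYPES = {0b00: "single", 0b01: "header", 0b10: "body", 0b11: "tail"}


def decode_flit(flit, width=18):
    """Split a flit into (type name, remaining bits).

    Assumes the 2-bit type field occupies the top two bits of the flit.
    """
    if not 0 <= flit < (1 << width):
        raise ValueError("flit does not fit in the given width")
    type_bits = flit >> (width - 2)
    rest = flit & ((1 << (width - 2)) - 1)
    return FLIT_TYPES[type_bits], rest


def signs_up_channel(flit_type):
    """Only a header flit signs up a virtual channel."""
    return flit_type == "header"


def clears_channel(flit_type):
    """Only a tail flit clears the virtual channel."""
    return flit_type == "tail"


kind, rest = decode_flit((0b01 << 16) | 0x1234)
print(kind, hex(rest), signs_up_channel(kind))
```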
Figure 7: Router connection with PE.

Figure 10: Router with SPS architecture.

Figure 8: NI breakdown into Shell, Kernel, and interface.

Figure 9: Flits type of router.
Flit type (00): source address, destination address, payload.
Flit type (01): source address, destination address, routing information.
Flit type (10): payload.
Flit type (11): payload.

4. SPS with Router Design

A VC that contains many slots for data access leads to extra power consumption. In this paper, we propose the SPS architecture to reduce the power consumption.

4.1. Router with SPS Architecture. The proposed router with SPS architecture is illustrated in Figure 10. The physical channel (PC) connects to other routers and accesses information. The input VCs (IVC) store information from the PCs; they are usually built from FIFOs or other sequential logic. The arbiter decides the flit priority and controls the input switch logic (ISL) and output switch logic (OSL) to transmit flits. It includes RC, VA, and SA. The crossbar (CR) connects IVC to OVC according to the switch signal from the arbiter. The output VCs (OVC) store information from the CR. The proposed SPS uses the transmission channel status to dynamically control the IVC and OVC clocks so that they run only when needed. The VCs with SPS architecture are illustrated in Figure 11. The SPS controls the system clock fed into the I/O VCs to reduce power consumption. In this architecture, each VC contains slots 0 to $i - 1$ for data access.

4.2. Design of SPS Control Timing. The VC access timing diagrams of the SPS architecture are illustrated in Figure 12. Clock Block A indicates that the VCs have no information to transmit. Clock Block B indicates that the VCs are writing information. Clock Block C indicates that the data in the VCs are waiting to be transmitted. Our analysis of the architecture without clock-gating is shown in (2). The power consumption of slot accesses is denoted by $P_a$. The power consumption of full and empty slots is denoted by $P_f$ and $P_e$, respectively. $P_s$ is the power consumption other than $P_f$, $P_e$, and $P_a$. The architecture without clock-gating does not control the clock of the sequential logic in the $n$VCs; the logic therefore consumes power whenever the transmission structure is busy.
Clock-gating consumes power in Clock Block B and Clock Block C. Our analysis of the clock-gating architecture is shown in (3). $P_{g1}$ is the power consumption of the empty gating. The clock-gating architecture does not control the clock when the VCs are full; the VCs always store flits that wait for transmission.
The SPS consumes power only in Clock Block B. Our analysis of the SPS architecture is shown in (4). $P_{g2}$ is the power consumption of the SPS. It saves the power consumption of both empty and full gating for the $n$VCs. Consider

$P_{r1} = P_a + P_f + P_e + P_s$, (2)

$P_{r2} = P_a + P_f + P_s + P_{g1}$, (3)

$P_{r3} = P_a + P_s + P_{g2}$. (4)

4.3. Design of SPS. The proposed SPS uses the VC status to dynamically control the clock of each VC. The CFSM of the SPS with VCs is illustrated in Figure 13; the architecture contains two CFSMs.
The first CFSM includes the initial, empty, full, and waiting states. Initial: when the VC is reset, the structure stays in the initial state until a flit arrives. Empty: the structure enters this state when the user resets the VCs or the flits have been transported to the next storage unit. Full: the VC is full of stored flits. Waiting: the structure enters this state when the user resets the VCs or the flit has been stored completely.
The VCs with SPS algorithm is illustrated in Algorithm 4. In line 3, the VCs initialize the VC count and flags. The VCs access flits and change the VC count when a channel packet or arbiter signal arrives, at lines 4 to 9. When the VCs
Figure 11: VCs with SPS architecture.

Figure 12: VCs power with clock diagram: (a) no clock-gating, (b) clock-gating, (c) SPS.
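
Subtracting the three power expressions (2), (3), and (4) pairwise makes the source of each saving explicit. The short derivation below uses only the symbols defined for those equations; the closing condition is an observation that follows from the algebra, not a result stated in the paper.

```latex
\begin{align}
P_{r1} - P_{r2} &= (P_a + P_f + P_e + P_s) - (P_a + P_f + P_s + P_{g1}) = P_e - P_{g1}, \\
P_{r2} - P_{r3} &= (P_a + P_f + P_s + P_{g1}) - (P_a + P_s + P_{g2}) = P_f + P_{g1} - P_{g2}, \\
P_{r1} - P_{r3} &= (P_a + P_f + P_e + P_s) - (P_a + P_s + P_{g2}) = P_f + P_e - P_{g2}.
\end{align}
```

Under this model, the SPS consumes less than conventional clock-gating whenever $P_{g2} < P_f + P_{g1}$, that is, whenever the gating logic of the SPS costs less than the power of the full slots plus the empty-gating overhead.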

Figure 13: CFSM of SPS with VCs (VC states: Initial, Empty, Full, Waiting; SPS states: Initial, Clock-gating, Wake up).
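
The interplay between the VC status flags and the clock control can be sketched in Python as follows. This is a simplified behavioural model under stated assumptions, not the published CFSM: the slot counter is 0-based (empty at 0, full at the slot count), a partially filled VC is assumed to keep its clock running, and the names `VirtualChannel` and `sps_clock_enabled` are hypothetical.

```python
class VirtualChannel:
    """Behavioural model of a VC with full and empty flags."""

    def __init__(self, slots):
        self.slots = slots
        self.count = 0  # number of stored flits

    @property
    def empty(self):
        return self.count == 0

    @property
    def full(self):
        return self.count == self.slots

    def write(self):
        """Store one flit; returns False when the VC is already full."""
        if self.full:
            return False
        self.count += 1
        return True

    def read(self):
        """Remove one flit; returns False when the VC is already empty."""
        if self.empty:
            return False
        self.count -= 1
        return True


def sps_clock_enabled(vc, write_request):
    """Decide whether the VC clock should run in this cycle."""
    if vc.full:
        return False          # full VC: gate the clock
    if vc.empty:
        return write_request  # empty VC: wake up only to store a flit
    return True               # assumption: partially filled VC stays clocked


vc = VirtualChannel(slots=2)
print(sps_clock_enabled(vc, write_request=False))  # idle and empty
print(sps_clock_enabled(vc, write_request=True))   # woken up to store a flit
```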

count changes, the VC flags are updated at lines 10 to 17.
The second CFSM includes the initial, clock-gating, and wake-up states. Initial: this state follows the same principle as the initial state of the first CFSM. Clock-gating: when the VC becomes full or empty, the SPS disables the clock of this VC and enters this state. Wake up: when a VC wants to store a flit, that VC is woken up.
The SPS algorithm is illustrated in Algorithm 5. In line 3, the SPS initializes the VC clocks and reads the status from the VCs
with VC flags. The slot priorities from LCR [2] and the clock of each VC are initialized at lines 4, 5, and 7. The SPS controls the VC clocks to reduce the VC power consumption when the VCs are accessed and the flags change, at lines 8 to 17.

VCs with SPS Algorithm
Input: VCs clock, channel packet, arbiter signal and reset.
Output: channel packet, channel status
(1) VCcount is integer and range is 1 ≤ VCcount ≤ n
(2) VCflag includes full flag and empty flag
(3) initial VCcount and VCflag
(4) while (channel packet or arbiter signal be arrival) do
(5) if (channel packet be arrival and full flag != 1)
(6) {VCcount = VCcount + 1 and packet store in VCs}
(7) if (arbiter signal be arrival and empty flag != 1)
(8) {VCcount = VCcount − 1 and packet be read from VCs}
(9) end while
(10) while (VCcount be change) do
(11) if (VCcount = n)
(12) {assign full flag to 1}
(13) else if (VCcount = 1)
(14) {assign empty flag to 1}
(15) else
(16) {assign full flag and empty flag to 0}
(17) end while

Algorithm 4: VCs with SPS algorithm.

SPS Algorithm
Input: system clock, channel packet, arbiter signal and reset.
Output: VCs clock
(1) VCgroup is VCs group of 4 direction port
(2) VCflag includes full flag and empty flag
(3) Initial VCs clock and access VCs count and stage flag
(4) follow LCR to arrangement all slots priority;
(5) VCclk_i is VCs clock of each VCgroup //where 1 ≤ i ≤ n
(6) Example VCgroup = East port
(7) initial VCclk_i = 0; //where 1 ≤ i ≤ n
(8) while (virtual channel be write) do
(9) if (VCflag = empty)
(10) {VCclk_i = system clock}
(11) If (VCflag = full flag)
(12) {VCclk_i = 0 and VCclk_{i+1} = system clock}
(13) end while
(14) while (virtual channel be read) do
(15) if (empty flag = 1)
(16) {VCclk_i = 0}
(17) end while

Algorithm 5: SPS algorithm.

5. Experimental Results

In this section, we propose an autotesting architecture for the router with SPS. This architecture includes four modules.
The first module is the test-vector generator (TVG); its FSM is illustrated in Figure 14. The Idle state waits for a request to start testing; when the request arrives, the TVG changes from Idle to Generator. When the request is cancelled, the state changes from Generator back to Idle. The Generator state generates the test-vector and the compare-vector, as illustrated in Figure 15. We use the C language to generate the lottery arbitration [18] in the test-vector at control step 1. We use HDL to design the conventional router, which generates the compare-vector from the input pattern of the test-vector, at control step 2. When the compare-vector and test-vector functions are complete, the state changes from Generator to Vector Output (VO) at control step 3. The VO state transforms the test-vector and compare-vector into Xilinx memory IP files, so that the data for testing and comparison are output from memory with only one clock of delay.
The second module is the vector database (VD); its control flow graph is illustrated in Figure 16. The module writes the vectors from the VO state into memory. The database includes two vectors to test and analyze the proposed circuit. The lottery database provides the test packets for the router with SPS. The compare database provides the analysis data for the router with SPS.
The third module is the router with SPS; we use the test-vector supplied by the VD to exercise this module. The testing algorithm is illustrated in Algorithm 6. When the start signal from I/O is set to one, the module starts testing and passes this signal to the VD at lines 1 to 2. When testing has started, the input signal is read from the VD, as shown at lines 3 and 4 in Algorithm 6. The delay to read a test-vector from the VD to the router with SPS is one clock. The router with SPS uses the VD test-vector to compute at line 5. When this pattern computation is finished, the next pattern is read from the VD at line 6. When the test pattern computation is finished or the start signal is cancelled, test-start is cleared and testing stops at lines 7 and 8.

Router with SPS Algorithm
Input: system clock, start, Lottery Input.
Output: test-start, Implement-results
(1) If start testing
(2) {test-start = 1; pass VD}
(3) While (read test data from VD and start bit set-up to one) do
(4) Lottery Input = Test-vector
(5) Implement-results = Test-vector use Router with SPS to transmission;
(6) Test-vector address = Test-vector address + 1;
(7) If (test finish or start = 0)
(8) {test-start = 0}
(9) End while

Algorithm 6: Router with SPS testing algorithm.

The final module, the verification module, is illustrated in Figure 17; we verify the function in this module. The function verification compares the compare-vector from the VD with the implement-results from the router with SPS. If a pattern is wrong, the verification result returns an error signal.
The hardware experimental environment uses a Xilinx FPGA xc5vlx50t-1ff1136 to verify the SPS architecture. The software experimental environment uses Xilinx ISE 12.3 and the
analysis tools use Modelsim 6.6, Xilinx Chipscope ILA, and Xpower 12.3, which are supported by Xilinx. The test experimental environment uses a 2 × 2 mesh and X-Y routing; each PC has 4 VCs to access flits. The power consumption distribution is illustrated in Figure 18; the number of test packets ranges from 100 to 10000. The packet format is flit and the packet length is 18 bits.
Compared with the related works shown in Table 1, IntelliBuffer [2], adaptive data compression [3], and buffer clock-gating [10], the proposed method reduces power consumption by 37.31%, 45.79%, and 19.26%, respectively, and reduces area by 49.4%, 25.5%, and 14.4%, respectively.

Figure 14: Test-vector generator (TVG) module FSM.

Figure 15: Generator status control and data flow graphs.

Figure 16: Vector database (VD) control flow graph.

Figure 17: Verification module data flow graphs.

Figure 18: Power consumption distribution (power consumption in mW versus number of test flits in k flits) for IntelliBuffer [2], adaptive data compression [3], BCG [10], and the proposed design.

6. Conclusions

The Smart Power-Saving (SPS) architecture for network-on-chip was presented. A clock control circuit and the SPS algorithm were demonstrated to reduce the power consumption of the NoC architecture. From experimental results, the proposed
Table 1: Comparison of power consumption and area.

Methods                         Power consumption (mW)  Area (number of slices)  Improved power  Improved area
IntelliBuffer [2]               410.42                  1551                     37.31%          49.4%
Adaptive data compression [3]   474.53                  1054                     45.79%          25.5%
Buffer clock-gating [10]        318.63                  917                      19.26%          14.4%
Newly proposed                  257.05                  785
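
The improvement columns of Table 1 are relative reductions with respect to each baseline. The following Python sketch recomputes them from the tabulated power and area values; the dictionary layout and the function name `improvement` are choices made for this example.

```python
# (power in mW, area in slices), copied from Table 1
BASELINES = {
    "IntelliBuffer [2]": (410.42, 1551),
    "Adaptive data compression [3]": (474.53, 1054),
    "Buffer clock-gating [10]": (318.63, 917),
}
PROPOSED = (257.05, 785)


def improvement(baseline, proposed):
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


for name, (power, area) in BASELINES.items():
    power_gain = improvement(power, PROPOSED[0])
    area_gain = improvement(area, PROPOSED[1])
    print(f"{name}: power {power_gain:.2f}%, area {area_gain:.1f}%")
```

Recomputed this way, the area reductions agree with the reported 49.4%, 25.5%, and 14.4%, while the power reductions come out within about 0.1 percentage point of the reported values; the small gap may stem from rounding of the tabulated power figures.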

SPS architecture reduces the power consumption more efficiently than IntelliBuffer [2], adaptive data compression [3], and buffer clock-gating [10] in the NoC architecture.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

The authors would like to thank the Ministry of Science and Technology of the Republic of China, Taiwan, for partially supporting this research.

References

[1] D. James, “Intel Ivy Bridge unveiled—the first commercial tri-gate, high-k, metal-gate CPU,” in Proceedings of the Custom Integrated Circuits Conference (CICC '12), pp. 9–12, September 2012.
[2] C. Nicopoulos, S. Srinivasan, A. Yanamandra et al., “On the effects of process variation in network-on-chip architectures,” IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 3, pp. 240–254, 2010.
[3] M. Taassori, M. Taassori, and M. Mossavi, “Adaptive data compression in NoC architectures for power optimization,” International Review on Computers and Software, vol. 5, no. 5, pp. 540–547, 2010.
[4] D. Bertozzi and L. Benini, “Xpipes: a network-on-chip architecture for gigascale systems-on-chip,” IEEE Circuits and Systems Magazine, vol. 4, no. 2, pp. 18–31, 2004.
[5] S. J. Lee, K. Lee, and H. J. Yoo, “Analysis and implementation of practical, cost-effective networks on chips,” IEEE Design and Test of Computers, vol. 22, no. 5, pp. 422–433, 2005.
[6] Y. J. Yoon, N. Concer, M. Petracca, and L. Carloni, “Virtual channels versus multiple physical networks: a comparative analysis,” in Proceedings of the 47th ACM/IEEE Design Automation Conference (DAC '10), pp. 162–165, June 2010.
[7] L. Benini and G. de Micheli, “Networks on chips: a new SoC paradigm,” IEEE Computer, vol. 35, no. 1, pp. 70–78, 2002.
[8] J. C. S. Palma, L. S. Indrusiak, F. G. Moraes, R. Reis, and M. Glesner, “Reducing the power consumption in networks-on-chip through data coding schemes,” in Proceedings of the 14th IEEE International Conference on Electronics, Circuits and Systems (ICECS ’07), pp. 1007–1010, December 2007.
[9] N. Jafarzadeh, M. Palesi, A. Khademzadeh, and A. Afzali-Kusha, “Data Encoding Techniques for Reducing Energy Consumption in Network-on-Chip,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 3, pp. 675–685, 2014.
[10] T. Y. Lee, C. H. Huang, and X. S. Lin, “Design of buffer clock-gating architecture for network-on-chip,” in Proceedings of the 22th VLSI Design/CAD Symposium, pp. 2–5, August 2011.
[11] R. Ezz-Eldin, M. A. El-Moursy, and A. M. Refaat, “Low leakage power NoC switch using AVC,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS ’12), pp. 2549–2552, Seoul, Republic of Korea, May 2012.
[12] T. R. da Rosa, V. Larrea, N. Calazans, and F. G. Moraes, “Power consumption reduction in MPSoCs through DFS,” in Proceedings of the 25th Symposium on Integrated Circuits and Systems Design (SBCCI '12), pp. 1–6, 2012.
[13] G. Huaxi, X. Jiang, and Z. Wei, “A low-power fat tree-based optical network-on-chip for multiprocessor system-on-chip,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE ’09), pp. 3–8, April 2009.
[14] H. Gu, K. H. Mo, J. Xu, and W. Zhang, “A low-power low-cost optical router for optical networks-on-chip in multiprocessor systems-on-chip,” in Proceedings of the IEEE Computer Society Annual Symposium on VLSI (ISVLSI ’09), pp. 19–24, Tampa, Fla, USA, May 2009.
[15] K. Swaminathan, G. Lakshminarayanan, F. Lang, M. Fahmi, and S. B. Ko, “Design of a low power network interface for Network on chip,” in Proceedings of the 26th IEEE Canadian Conference on Electrical and Computer Engineering (CCECE ’13), pp. 1–4, May 2013.
[16] R. Mullins, A. West, and S. Moore, “Low-latency virtual-channel routers for on-chip networks,” in Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA '04), pp. 188–197, 2004.
[17] H. Katabami, H. Saito, and T. Yoneda, “Design of a GALS-NoC using soft-cores on FPGAs,” in Proceeding of the Embedded Multicore Socs (MCSoC '13), pp. 26–28, September 2013.
[18] J. Wang, Y. Li, Q. Peng, and T. Tan, “A dynamic priority arbiter for network-on-chip,” in Proceedings of the IEEE International Symposium on Industrial Embedded Systems (SIES '09), pp. 253–256, July 2009.
[19] S. Saponara, L. Fanucci, and M. Coppola, “Design and coverage-driven verification of a novel network-interface IP macrocell for network-on-chip interconnects,” Journal of Microprocessors and Microsystems, vol. 35, no. 6, pp. 579–592, 2011.

Hindawi Publishing Corporation
VLSI Design
Volume 2014, Article ID 406416, 7 pages
http://dx.doi.org/10.1155/2014/406416

Research Article
Optimization of Fractional-N-PLL Frequency Synthesizer for
Power Effective Design

Sahar Arshad,1 Muhammad Ismail,1 Usman Ahmad,2 Anees ul Husnain,3 and Qaiser Ijaz3

1 Department of Electronic Engineering, University College of Engineering and Technology, The Islamia University of Bahawalpur, Bahawalpur 63100, Pakistan
2 Scholar Teacher Research Alliance for Problem Solving (STRAPS), Bahawalpur 63100, Pakistan
3 Department of Computer System Engineering, University College of Engineering and Technology, The Islamia University of Bahawalpur, Bahawalpur 63100, Pakistan

Correspondence should be addressed to Qaiser Ijaz; [email protected]

Received 10 May 2014; Accepted 7 June 2014; Published 23 July 2014

Academic Editor: Yu-Cheng Fan

Copyright © 2014 Sahar Arshad et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We design and simulate a low-power fractional-N phase-locked loop (FNPLL) frequency synthesizer for industrial applications, based on VLSI. The FNPLL design has been optimized using different VLSI techniques to achieve significant performance in terms of speed with relatively low power consumption. The loop filter makes one of the major contributions to the optimization, as it limits the switching time between cycles. The sigma-delta modulator attenuates the noise generated by the loop filter. This paper presents the implementation details and simulation results of all the blocks of the optimized design.

1. Introduction

For many manufacturers and product developers, it is a good idea to reduce power consumption in electronic products. It is also important for gaining competitive advantage in an increasingly power-hungry world. Low power consumption gives many benefits to designers and to users; for example, the main advantage is that it relaxes stringent cooling requirements and results in inexpensive and more compact products [1]. The rapid rise in power requirements has prompted governments and industry to increase energy efficiency and design low-power components. The majority of frequency synthesis techniques fall into two categories: direct frequency synthesis and indirect frequency synthesis [2]. To achieve fine frequency steps, the direct frequency synthesis technique is used because it is based on digital techniques. To generate multiples (integer or noninteger) of a reference frequency, indirect frequency synthesis is used because it is based on a phase-locked loop (PLL). Here, the latter technique is used because we are going to implement a PLL. A PLL generates a signal whose phase is related to the phase of the input signal; this signal is called the output signal of the PLL. The input signal is called the “reference” signal. In a feedback loop, the oscillator is controlled by the output signal from the phase detector [3, 4]. The circuit compares the phase of the signal obtained from its output oscillator with the phase of the input signal and keeps the phases matched by adjusting the frequency of its oscillator. A phase-locked loop (PLL) architecture has two types, a fractional-N PLL (FNPLL) and an integer-N PLL [5]. For a given frequency resolution, the FNPLL can use a higher reference frequency than the integer-N PLL; hence, the loop bandwidth, which is limited to 10% of the reference frequency, can be set larger in the FNPLL than in the integer-N PLL. Therefore, the FNPLL architecture is used for faster locking. This speed advantage of the FNPLL, however, comes at the price of increased design complexity [6]. This is because fractional-N operation in steady state requires fractional spur reduction circuits whose quantization noise folds into the PLL spectrum via loop nonlinearities, demanding more significant design efforts to minimize the loop nonlinearities.
Figure 1: PLL system representation (fin, phase detector, loop filter, VCO, fout).

Figure 2: Sigma-delta FNPLL arrangement (Fref, PFD, CP, LPF, VCO, divider by Nint + y[n], sigma-delta modulator).
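
The way a dual-modulus divider reaches a fractional average ratio can be sketched in Python. Note the substitution: the text describes a modulator that produces a random integer sequence, whereas this sketch uses the simplest deterministic stand-in, a first-order accumulator that selects N + 1 on overflow and N otherwise. The function names and the example ratio are assumptions for illustration.

```python
from fractions import Fraction


def divider_sequence(n_int, frac, cycles):
    """Division ratios chosen by a first-order accumulator.

    n_int: integer part of the ratio; frac: fractional part as a Fraction in [0, 1).
    On accumulator overflow the divider uses n_int + 1, otherwise n_int.
    """
    acc = Fraction(0)
    ratios = []
    for _ in range(cycles):
        acc += frac
        if acc >= 1:
            acc -= 1
            ratios.append(n_int + 1)
        else:
            ratios.append(n_int)
    return ratios


def average_ratio(ratios):
    """Exact average of the division ratios."""
    return Fraction(sum(ratios), len(ratios))


seq = divider_sequence(4, Fraction(1, 4), 1000)
print(float(average_ratio(seq)))  # long-run ratio seen by the loop
```

With an integer part of 4 and a fractional part of 1/4, the divider uses 5 once in every four reference cycles, so the average ratio settles at 4.25. The periodicity of this simple pattern is exactly what produces fractional spurs, which is why higher-order or randomized modulators are preferred in practice.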

Figure 3: Schematic of phase detector (PD) (inputs clk1 and clk2, output out1; transistor sizes W = 2.0 𝜇 or W = 1.0 𝜇 with L = 0.12 𝜇).

Figure 4: Layout of PD.
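
The behaviour of an XOR-based phase detector followed by a low-pass filter can be illustrated with a small Python sketch. It assumes ideal square waves with 50% duty cycle and models the low-pass filter simply as the average of the XOR output over one period; the function names and the sampling resolution are assumptions for this example.

```python
def square_wave(index, period, shift=0):
    """Ideal 50% duty-cycle square wave sampled at integer positions."""
    return 1 if (index - shift) % period < period // 2 else 0


def xor_pd_average(phase_deg, period=360):
    """Average XOR output over one period for a given phase shift in degrees.

    The average stands in for the output of an ideal low-pass filter.
    """
    shift = phase_deg * period // 360
    total = sum(
        square_wave(i, period) ^ square_wave(i, period, shift)
        for i in range(period)
    )
    return total / period


for phase in (0, 45, 90, 180):
    print(phase, xor_pd_average(phase))
```

At a 90 degree shift the XOR output is a regular square wave with 50% duty cycle, so the filtered value sits at half of the supply range; between 0 and 180 degrees the filtered value grows linearly with the phase difference.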

On the contrary, in the absence of fractional spurs, integer-N PLLs involve less design complexity. Here, an FNPLL is required. The output frequency of the FNPLL is

$\mathrm{Freq}_{\mathrm{FNPLL}} = (M.n) \times \mathrm{Freq}_{\mathrm{Ref}}$. (1)

In this equation, $M.n$ denotes the division ratio, where $M$ is the integer part and $n$ is the fractional part. To obtain the desired fractional division ratio, a dual-modulus divider is used [7]. Using the sigma-delta modulation technique, we can remove the fractional spurs. This technique generates a random integer number; the average of these random numbers results in the desired ratio. A phase detector,
Figure 5: V versus T.

Figure 6: CMOS circuit of PD with LF (inputs clk1 and clk2; outputs out1, out2, out3; R = 1000, C = 0.5 pF).

Figure 7: Layout of PD with LF.

Figure 8: V versus T.

a loop filter, and a voltage controlled oscillator (VCO) are the main parts of a phase-locked loop, as shown in Figure 1. The most important part of the phase-locked loop (PLL) is the phase detector. It is also called a phase comparator and may be a logic circuit, a frequency mixer, or an analog multiplier; it generates a voltage signal that represents the phase difference. The three units are coupled as a feedback system as shown in Figure 1. The periodic output signal is generated by the oscillator. The applications of PLLs are versatile; for example, a PLL can generate different stable frequencies or recover a signal from a noisy input. A complete phase-locked loop block can be obtained from a single integrated circuit.
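
The averaging idea behind the fractional division ratio in (1) can be illustrated with a small sketch. It is not the authors' modulator: it assumes a first-order accumulator as a simple stand-in for the sigma-delta modulator, and the names `dual_modulus_sequence`, `M`, `FRACTION`, and `FREQ_REF_HZ`, as well as the numeric values, are hypothetical.

```python
from fractions import Fraction


def dual_modulus_sequence(m, frac, cycles):
    """Pick a division value (m or m + 1) for each reference cycle.

    A first-order accumulator stands in for the sigma-delta modulator:
    it adds the fractional part every cycle and selects m + 1 whenever
    the accumulated value reaches 1.
    """
    acc = Fraction(0)
    ratios = []
    for _ in range(cycles):
        acc += frac
        if acc >= 1:
            acc -= 1              # overflow: divide by m + 1 this cycle
            ratios.append(m + 1)
        else:
            ratios.append(m)      # no overflow: divide by m
    return ratios


# Hypothetical values, not taken from the paper: ratio 10.25, 10 MHz reference.
M, FRACTION, FREQ_REF_HZ = 10, Fraction(1, 4), 10e6

ratios = dual_modulus_sequence(M, FRACTION, cycles=400)
average_ratio = Fraction(sum(ratios), len(ratios))
freq_out_hz = float(average_ratio) * FREQ_REF_HZ

print(ratios[:8])
print(float(average_ratio), freq_out_hz)
```

With these assumed values the divider alternates between 10 and 11 so that the long-run average is 10.25. A first-order accumulator like this one produces a periodic pattern, which is the kind of regularity that gives rise to fractional spurs; a sigma-delta modulator is used to make the sequence less regular while keeping the same average.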
Figure 9: VCO schematic.

Figure 10: VCO layout.

This technique is used in advanced electronic products whose output frequencies range from a few Hz to many GHz [8]. To obtain low power consumption, high speed, and stability, we design a fractional-N phase-locked loop in 0.12 micrometer CMOS/VLSI technology. As the demand for PLLs in the field of communications is growing day by day, low leakage transistors are used to maintain low power, at the cost of a small compromise on frequency.

The structure of the FNPLL is depicted in Figure 2. The characteristics of the PLL, for example, transient response and bandwidth, can be controlled by the low pass filter. The basic and essential functional unit of the PLL is the VCO, which is used for clock generation [9]. To synthesize the desired frequencies, we use a PLL with an arbitrary frequency division (÷N) method. The proposed technique gives fast settling time, reduces phase noise, and also reduces the effect of spurious frequencies when compared with existing FNPLL techniques.

2. PLL Design Using 0.12 Micrometer

2.1. Phase Detector. The first block has two inputs, the reference input and the feedback. It compares the two inputs and produces an output based on their phase difference. XOR gates are used to implement this block. The gate produces a square wave when the clock inputs are shifted by one-fourth of a period (90 degrees), whereas the output is different for all other angles. We apply the output of the XOR gates to the LPF, which yields an analog voltage proportional to the phase difference.

Figure 3 depicts a CMOS circuit of the phase detector, Figure 4 shows the layout, and Figure 5 shows the output waveform.

2.2. Loop Filter. Filters are electronic circuits that are also used along with rectifiers to obtain a pure DC voltage. The second block of the PLL is the loop filter, and it has two distinct functions. First, it maintains stability, which is defined by describing the
Figure 11: V versus T.

Figure 12: CMOS circuit of comparator.

Figure 13: Layout of comparator.

Figure 14: V versus T.

Figure 15: CMOS schematic of OTA.

Figure 16: OTA layout.

Figure 17: V versus T.

loop dynamics. This describes the response of the loop to uncertainties. The second function is to limit the reference-frequency energy that appears at the phase detector output and is applied to the VCO control input. This frequency modulates the VCO and produces FM sidebands [10]. Other features of the PLL, for example, bandwidth, transient response, lock range, and capture range, can be controlled by the LPF. The LPF is used to attenuate this energy, and it can also reject unwanted frequency bands. The low pass filter can be obtained by using a capacitor of large value, which is charged and discharged through the switch resistance $R_{on}$. The resulting $R_{on}C$ delay creates a low pass filter. Figure 6 depicts a CMOS schematic of the phase detector with loop filter, Figure 7 shows the layout, and Figure 8 shows the output waveform.

2.3. Voltage Controlled Oscillator. The VCO is a source of a varying output signal whose frequency is regulated over a range by a DC voltage. The output signal can be a square wave or a triangular waveform. The oscillation frequency is controlled by the value of the input voltage [11]. Figure 9 shows a CMOS circuit of the VCO, Figure 10 shows the layout, and Figure 11 shows the output waveform.

2.4. Sigma-Delta Modulator. The sigma-delta modulation technique is used to convert high-resolution signals to low-resolution signals in the digital domain. We designed the sigma-delta modulator using a 0.12 micrometer feature size and then obtained the layout. The input is the desired fractional number ($n$) and the output is the sum of quantization noise
and a DC part [12, 13]. The quantization noise is generated by the integer divider. Figures 12 and 15 show the CMOS circuits, and Figures 13 and 16 show the layouts, of the comparator and the operational transconductance amplifier (OTA), respectively. Figures 14 and 17 show the output waveforms.

3. Conclusion
Power usage and heat dissipation are among the biggest challenges of the VLSI industry today. In order to design a low power component without a significant change in performance, an FNPLL frequency synthesizer was designed and simulated. The optimized design was implemented in 0.12 micrometer technology. The schematics were designed using CMOS logic and verified functionally, and then the prefabrication layout was drawn. The simulation curves of the layouts reflected a reduction in power consumption for the optimized design.

Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.

References
[1] R. Jacob Baker, CMOS Circuit Design, Layout and Simulation, IEEE Press, John Wiley & Sons, 3rd edition, 2010.
[2] A. Anil and R. K. Sharma, “A high efficiency charge pump for low voltage devices,” International Journal of VLSI Design & Communication Systems, vol. 3, no. 3, 2012.
[3] U. L. Rohde, Digital PLL Frequency Synthesis, Prentice-Hall, Englewood Cliffs, NJ, USA, 1983.
[4] B. K. Mishra, S. Save, and S. Patil, “Design and analysis of second and third order PLL at 450 MHz,” International Journal of VLSI Design & Communication Systems, vol. 2, no. 1, 2011.
[5] N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, Pearson Education, 3rd edition, 2005.
[6] U. A. Belorkar and S. A. Ladhake, “Design of low power phase lock loop using 45 nm VLSI technology,” International Journal of VLSI Design & Communication Systems, vol. 1, no. 2, 2010.
[7] T. A. D. Riley, M. A. Copeland, and T. A. Kwasniewski, “Delta-Sigma modulation in fractional-n frequency synthesis,” IEEE Journal of Solid-State Circuits, vol. 28, no. 5, pp. 553–559, 1993.
[8] M. H. Perrott, “Fractional-N Frequency Synthesizer Design Using The PLL Design Assistant and CppSim Programs,” July 2008.
[9] S. Franssila, Introduction to Microfabrication, John Wiley & Sons, 2004.
[10] K. Woo, Y. Liu, E. Nam, and D. Ham, “Fast-lock hybrid PLL combining fractional-N and integer-N modes of differing bandwidths,” IEEE Journal of Solid-State Circuits, vol. 43, no. 2, pp. 379–389, 2008.
[11] N. Fatahi and H. Nabovati, “Design of low noise fractional-N frequency synthesizer using sigma-delta modulation technique,” in Proceedings of the 27th International Conference on Microelectronics (MIEL ’10), pp. 369–372, IEEE, May 2010.
[12] S. Borkar, “Obeying Moore’s law beyond 0.18 micron,” in Proceedings of the 13th Annual IEEE International ASIC/SOC Conference, pp. 26–31, September 2000.
[13] R. K. Krishnamurthy, A. Alvandpour, V. De, and S. Borkar, “High-performance and low-power challenges for sub-70 nm microprocessor circuits,” in Proceedings of the IEEE Custom Integrated Circuits Conference, pp. 125–128, May 2002.
Hindawi Publishing Corporation
VLSI Design
Volume 2014, Article ID 380362, 5 pages
http://dx.doi.org/10.1155/2014/380362

Research Article
Performance Analysis of Modified Drain Gating Techniques for
Low Power and High Speed Arithmetic Circuits

Shikha Panwar, Mayuresh Piske, and Aatreya Vivek Madgula


School of Electronics Engineering (SENSE), VIT University, Vandalur-Kelambakkam Road, Chennai 600127, India

Correspondence should be addressed to Shikha Panwar; [email protected]

Received 2 May 2014; Accepted 27 June 2014; Published 15 July 2014

Academic Editor: Yu-Cheng Fan

Copyright © 2014 Shikha Panwar et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This paper presents several high performance and low power techniques for CMOS circuits. In these design methodologies, the drain gating technique and its variations are modified by adding an additional NMOS sleep transistor at the output node, which helps the output discharge faster and thereby provides higher speed. In order to achieve high performance, the proposed design techniques trade power for performance in the delay critical sections of the circuit. Intensive simulations are performed using Cadence Virtuoso in a 45 nm standard CMOS technology at room temperature with a supply voltage of 1.2 V. Comparative analysis of the proposed circuits with standard CMOS circuits shows smaller propagation delay and lower power consumption.

1. Introduction

As we move on to finer MOSFET technologies, transistor delay has decreased remarkably, which has helped in achieving higher performance in CMOS VLSI processors. With technology scaling, it is required to reduce the threshold and power supply voltages. As dynamic power dissipation is directly proportional to the square of the power supply voltage, the supply voltage has to be reduced to achieve lower power consumption. Static power and dynamic power are the two main components of total power dissipation. Static power consumption is calculated in the form of leakage current through each device. A substantial increase has been observed in subthreshold leakage current with scaling of threshold voltage [1]. The subthreshold current $I_{ST}$ is given by [1]

$$I_{ST} = \mu_0 C_{ox} \left(\frac{W}{L}\right) (m - 1) (V_T)^2 \times e^{(V_g - V_{th})/mV_T} \times \left(1 - e^{-V_{DS}/V_T}\right), \tag{1}$$

where

$$m = 1 + \frac{C_{dm}}{C_{ox}}, \tag{2}$$

where $V_T = KT/q$ is the thermal voltage, $\mu_0$ is the mobility, $V_g$ is the gate voltage, $V_{th}$ is the threshold voltage, $V_{DS}$ is the drain to source voltage, and $m$ is the body effect coefficient. $C_{dm}$ and $C_{ox}$ are the depletion layer and gate oxide capacitances, respectively.

To counteract the excessive leakage in CMOS circuits, many architectural techniques have been proposed over the years. Power gating [2] and stacking effect [3] are two well-known techniques for reducing leakage power dissipation. Power gating normally makes use of sleep transistors that are connected either between the power supply and the pull-up network (PUN) or between the pull-down network (PDN) and ground. Sleep transistors are switched on when the circuit is evaluating and they are switched off in standby mode to conserve the leakage power in the logic circuit. The multi-threshold-CMOS (MTCMOS) [4] technique is also an effective way to achieve a considerable decline in leakage power consumption. In the MTCMOS technique, high $V_{th}$ sleep transistors are added in the circuit whereas the PUN and PDN use low $V_{th}$ devices. In dual threshold circuits [5], low $V_{th}$ devices are used in the delay critical sections and high $V_{th}$ devices are used to reduce the leakage current in the circuitry. Stacking of transistors in series reduces the subthreshold leakage current when one transistor is in the off state. The stacking effect is used in the sleepy stack technique [6] and the forced stack technique [7]. The sleepy stack technique provides better results than the forced stack technique. In forced stack, an extra sleep
Figure 1: (a) Drain gating, (b) power gating, (c) drain-header and power-footer gating (DHPF), and (d) drain-footer and power-header gating (DFPH).

transistor is inserted for each input of the gate both in PUN and in PDN, resulting in higher delay and area. In sleepy stack, an additional sleep transistor is connected in parallel with the transistor stack. This reduces the leakage current but at the same time increases the delay in the circuit.

LECTOR [8] and GALEOR [9] are two further leakage tolerant techniques. LECTOR makes use of two leakage control transistors (LCTs) that are connected between the PUN and PDN. Similarly, the GALEOR technique makes use of gated leakage transistors (GLTs). Both LCTs and GLTs reduce leakage by increasing the resistance between supply voltage and ground.

Another efficient technique to counter the leakage current problem is drain gating and its variations [10], explained in detail in Section 2. The modified circuits are proposed in Section 3. Simulation results taking a NAND gate, a 1-bit full adder, and an 8-bit RCA (ripple carry adder) as test bench circuits are presented in Section 4, and Section 5 provides the final conclusion.

2. Drain Gating Technique and Its Variant Circuits

In the drain gating technique [10] shown in Figure 1(a), two sleep transistors are added between the PUN and PDN. A PMOS transistor with sleep input (S) is connected between the PUN and the output node, whereas an NMOS transistor with sleep input (S′) is inserted between the output node and the PDN. When the circuit is in evaluation mode, the NMOS and PMOS sleep transistors are turned on, resulting in a low resistance conducting path. When the circuit is in standby, both transistors are switched off to reduce the standby power. The other variant circuits of drain gating are power gating, drain-header and power-footer gating (DHPF), and drain-footer and power-header gating (DFPH). In the power gating technique, a PMOS sleep transistor with input (S) is added between the power supply and the PUN, whereas an NMOS sleep transistor with input (S′) is added between the PDN and ground as shown in Figure 1(b). The two mixed techniques DHPF and DFPH are shown in Figures 1(c) and 1(d), respectively. As the names suggest, in DHPF, a PMOS sleep switch is inserted between the PUN and the output node and an NMOS sleep switch is inserted between the PDN and the ground rail. DFPH consists of an NMOS sleep switch between the output node and the PDN and a PMOS sleep switch between the power supply and the PUN. Comparative results in Section 4 indicate that the power gating technique is the best leakage tolerant technique, whereas the drain gating technique has the least delay among the previously proposed circuits.

3. The Proposed High Speed Circuit Techniques

The proposed circuits are aimed at reducing the propagation delay incurred by the drain gating technique and its variations. Four different circuit techniques, namely, high speed drain gating (HS-drain gating), HS-power gating, HS-DHPF, and HS-DFPH, as shown in Figures 2(a), 2(b), 2(c), and 2(d), respectively, are proposed in this section. In the HS-drain gating technique, an additional sleep transistor with sleep input (S) is connected at the output node, in parallel with the NMOS sleep transistor (S′) and the PDN. During the active mode, when the logic circuit evaluates the circuit's output, the added NMOS sleep transistor (S) provides an additional discharging path in the circuit. This added transistor helps in speedy evaluation, hence providing higher speed. In a similar fashion, an additional NMOS sleep transistor with sleep input (S) is added to the power gating, DHPF, and DFPH circuits.

The proposed circuits have been verified by taking a NAND gate, a 1-bit full adder, and an 8-bit RCA as test bench circuits. Experimental results in Section 4 show that the modified HS-drain gating technique has the least delay among

Pull-up Pull-up
network S network S

S Pull-up S Pull-up
network network

Pull-down
S Pull-down S S
S󳰀
network S
network S󳰀

S󳰀 Pull-down
Pull-down S󳰀 network
network

(a) (b) (c) (d)

Figure 2: (a) HS-drain gating, (b) HS-power gating, (c) HS-DHPF, and (d) HS-DFPH.

Table 1: Power and delay values of NAND gate, FA, and 8-bit RCA
using various techniques.
a b c
NAND gate FA 8-bit RCA a b
Circuit
techniques Power Delay Power Delay Power Delay a
Carry
(nW) (ps) (nW) (ps) (uW) (ps) c a
Sum
Standard
22.32 45𝑒3 2.1𝑒3 30𝑒3 52.2 23.7𝑒3 b
CMOS
Drain b
12.63 25𝑒3 393 15𝑒3 7.53 8.85𝑒3 c
gating
Power
8.73 205𝑒3 238 150𝑒3 2.39 20.5𝑒3 a c
gating a
DHPF 11.08 80𝑒3 340 25𝑒3 3.02 12𝑒3
DFPH 8.71 175𝑒3 245 150𝑒3 4.02 15.5𝑒3 b
a b b
a b c
HS-drain
18.42 2.22 250 15.08 6.97 10.3
gating c
HS-power
16.57 11.76 246 49.9 2.13 21.4
gating
HS-DHPF 16.70 6.3 248 35.97 4.15 11.9 Figure 3: 1-bit CMOS full adder.
HS-DFPH 16.64 11.71 247 44.87 2.92 17

the existing and proposed architectural techniques, as shown in Figure 5. Also, the HS-power gating technique has the lowest power as compared to the standard CMOS circuit and the newly proposed circuits, as shown in Table 1. The ratio of PMOS to NMOS size is set equal to 2.

A two-input NAND gate using HS-drain gating operates in two modes, namely, sleep or standby mode and active mode. When the circuit is in active mode, the sleep input (S) is in the low state, and the output node gets charged to the power supply voltage. Both the NMOS and PMOS sleep transistors connected between the PUN and PDN are turned on and the output is evaluated. For example, if we provide the input to the PUN as 0(XX), where XX stands for the input vectors (00, 01, 10, 11), the output of the NAND gate will be high for the first three cases and low for the fourth case. The sleep signal should be provided in the form of alternating high and low signals. When the sleep signal is high (S = 1), both the PMOS and NMOS sleep transistors between the PUN and PDN turn off and the additional NMOS sleep transistor is turned on, discharging the output node to ground and thereby resulting in higher performance. A trade-off is achieved between power and delay so as to maintain high speed in the proposed circuits.

4. Simulations and Results

A two-input NAND gate, a 1-bit full adder, and an 8-bit RCA are implemented using the proposed high speed architectural techniques. The circuit diagrams for the 1-bit full adder and the 8-bit RCA are shown in Figures 3 and 4, respectively. Each stage in the 8-bit RCA consists of a 1-bit full adder (FA). Each FA circuit consists of 28 transistors. In the RCA, the carry is propagated from one stage to the next and the final carry is obtained as C8, shown in Figure 4.
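
The carry propagation through the 8-bit RCA described above can be sketched at the logic level. This is only a behavioral illustration, not the transistor-level circuit that was simulated; the function names `full_adder` and `ripple_carry_add` are hypothetical, and the sleep input is not modeled.

```python
def full_adder(a, b, cin):
    """1-bit full adder: return (sum, carry_out) for input bits a, b, cin."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(a, b, cin=0, width=8):
    """Add two width-bit integers by chaining full adders, LSB first.

    Returns (sum, final_carry); for width=8 the final carry plays the
    role of C8 in the 8-bit RCA.
    """
    carry = cin
    total = 0
    for i in range(width):
        # Stage i adds bit i of each operand plus the carry from stage i - 1.
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry


result, c8 = ripple_carry_add(200, 100)
print(result, c8)
```

Because each stage waits for the carry of the previous one, the worst-case delay of such an adder grows with the number of stages, which is why the RCA is a useful test bench when comparing propagation delay across techniques.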
Figure 4: 8-bit RCA.

The total power consumption and propagation delay of various existing and proposed techniques for the NAND gate, FA, and 8-bit RCA are compared in Table 1. The HS-drain gating technique has the least delay. For the 8-bit RCA, its propagation delay is about 50%, 39%, and 13% lower than that of HS-power gating, HS-DFPH, and HS-DHPF, respectively, and about 99% lower than that of the standard drain gating technique and its variants. Circuits employing the HS-power gating technique have very low power consumption. Power savings of nearly 85% are achieved in arithmetic architectures employing the HS-power gating technique. The HS-drain gating technique has the least power saving among the proposed circuits. The HS-DHPF and HS-DFPH techniques optimize the power and delay in CMOS arithmetic circuits.

The corner analysis for the drain gating design and its variants is plotted along with that of the modified high speed counterparts. Figure 5 shows the temperature versus the propagation delay graph for the 8-bit RCA using the existing techniques and the proposed techniques.

Figure 5: Temperature versus the propagation delay for the existing and the proposed techniques (propagation delay in s versus temperature in °C).
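
The percentage comparisons quoted for the 8-bit RCA can be checked against Table 1 with a short calculation. The sketch below assumes that each percentage is a relative reduction with respect to the technique being compared against; the helper name `percent_reduction` and the dictionary layout are hypothetical, while the numbers are copied from Table 1.

```python
def percent_reduction(baseline, improved):
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


# 8-bit RCA propagation delays in ps, copied from Table 1.
rca_delay_ps = {
    "Standard CMOS": 23.7e3,
    "Drain gating": 8.85e3,
    "HS-drain gating": 10.3,
    "HS-power gating": 21.4,
    "HS-DHPF": 11.9,
    "HS-DFPH": 17.0,
}

hs_drain = rca_delay_ps["HS-drain gating"]
delay_reduction = {
    name: percent_reduction(delay, hs_drain)
    for name, delay in rca_delay_ps.items()
    if name != "HS-drain gating"
}

# Power of standard CMOS versus HS-power gating, also from Table 1.
rca_power_saving = percent_reduction(52.2, 2.13)    # 8-bit RCA, uW
fa_power_saving = percent_reduction(2.1e3, 246)     # 1-bit FA, nW

for name, value in delay_reduction.items():
    print(f"HS-drain gating vs {name}: {value:.1f}% lower delay")
print(f"HS-power gating power saving: RCA {rca_power_saving:.1f}%, FA {fa_power_saving:.1f}%")
```

Under this reading, the table values give delay reductions of roughly 52%, 39%, and 13% relative to HS-power gating, HS-DFPH, and HS-DHPF, and more than 99% relative to drain gating and standard CMOS.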
Similarly, Figure 6 shows the plot of process corners versus the propagation delay of the 8-bit RCA using the existing techniques and the proposed techniques.

On observing the comparative graphs shown in Figures 5 and 6, we can infer that the designs made using the modified high speed drain gating technique and its corresponding variants have a substantial reduction in the propagation delay when compared to the designs made using CMOS, the drain gating technique, and its variants.

Figure 6: Process versus the propagation delay for 8-bit RCA using the existing and the proposed techniques (process corners FF, SNFP, TT, FNSP, and SS; propagation delay in s).

5. Conclusions

In this paper, we have tabulated the total power consumption and the propagation delay for certain circuits using the existing low power and performance enhancing techniques and the newly proposed ones. We have also made a comparative study of the propagation delay of these techniques across temperature and process corners. Simulation results show that the proposed circuits work effectively even at extreme temperatures and at different transistor configurations.

From the above mentioned experimental data, we can observe that, by implementing the high speed modified designs for the drain gating technique and its variants, we are able to enhance the performance of the design at lower power consumption. Power consumption savings as observed in the 8-bit RCA and the 1-bit full adder are 95% and 88%, respectively, whereas propagation delay has been reduced by almost 99% in both the RCA and the full adder circuit.

Conflict of Interests
The authors declare that there is no conflict of interests
regarding the publication of this paper.

References
[1] K. Roy, S. Mukhopadhyay, and H. Mahmoodi-Meimand, “Leakage current mechanisms and leakage reduction techniques in deep-submicrometer CMOS circuits,” Proceedings of the IEEE, vol. 91, no. 2, pp. 305–327, 2003.
[2] M. Powell, S. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar, “Gated-Vdd: a circuit technique to reduce leakage in deep-submicron cache memories,” in Proceedings of the IEEE Symposium on Low Power Electronics and Design (ISLPED ’00), pp. 90–95, July 2000.
[3] M. Johnson, D. Somasekhar, L. Y. Chiou, and K. Roy, “Leakage control with efficient use of transistor stacks in single threshold CMOS,” IEEE Transactions on VLSI Systems, vol. 10, no. 1, pp. 1–5, 2002.
[4] S. Mutoh, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu, and J. Yamada, “1-V power supply high-speed digital circuit technology with multithreshold-voltage CMOS,” IEEE Journal of Solid-State Circuits, vol. 30, no. 8, pp. 847–854, 1995.
[5] L. Wei, Z. Chen, M. C. Johnson, K. Roy, Y. Ye, and V. K. De, “Design and optimization of dual-threshold circuits for low-voltage low-power applications,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 7, no. 1, pp. 16–24, 1999.
[6] J. C. Park and V. J. Mooney III, “Sleepy stack leakage reduction,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 11, pp. 1250–1263, 2006.
[7] S. Narendra, V. De, D. Antoniadis, A. Chandrakasan, and S. Borkar, “Scaling of stack effect and its application for leakage reduction,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED '01), pp. 195–200, Huntington Beach, Calif, USA, August 2001.
[8] N. Hanchate and N. Ranganathan, “LECTOR: a technique for leakage reduction in CMOS circuits,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 2, pp. 196–205, 2004.
[9] S. Katrue and D. Kudithipudi, “GALEOR: leakage reduction for CMOS circuits,” in Proceedings of the 15th IEEE International Conference on Electronics, Circuits and Systems (ICECS ’08), pp. 574–577, September 2008.
[10] J. W. Chun and C. Y. R. Chen, “A novel leakage power reduction technique for CMOS circuit design,” in Proceedings of the International SoC Design Conference (ISOCC ’10), pp. 119–122, November 2010.
Hindawi Publishing Corporation
VLSI Design
Volume 2014, Article ID 529392, 12 pages
http://dx.doi.org/10.1155/2014/529392

Review Article
Gate-Level Circuit Reliability Analysis: A Survey

Ran Xiao and Chunhong Chen


Department of Electrical and Computer Engineering, University of Windsor, Windsor, ON, Canada N9B 3P4

Correspondence should be addressed to Chunhong Chen; [email protected]

Received 25 April 2014; Accepted 7 June 2014; Published 10 July 2014

Academic Editor: Yu-Cheng Fan

Copyright © 2014 R. Xiao and C. Chen. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Circuit reliability has become a growing concern in today’s nanoelectronics, which motivates strong research interest over the
years in reliability analysis and reliability-oriented circuit design. While quite a few approaches for circuit reliability analysis have
been reported, there is a lack of comparative studies on their pros and cons in terms of both accuracy and efficiency. This paper
provides an overview of some typical methods for reliability analysis with focus on gate-level circuits, large or small, with or without
reconvergent fanouts. It is intended to help the readers gain an insight into the reliability issues, and their complexity as well as
optional solutions. Understanding the reliability analysis is also a first step towards advanced circuit designs for improved reliability
in the future research.

1. Introduction

As CMOS technology keeps scaling down to its fundamental physical limits, electronic circuits have become less reliable than ever before [1]. The reasons are manifold. First of all, the higher integration density and lower voltage/current thresholds have increased the likelihood of soft errors [2, 3]. Secondly, process variations due to random dopant fluctuation or manufacturing defects have negative impacts on circuit performance and may cause circuits to malfunction [1]. These physical-level defects would statistically lead to probabilistic device characteristics. Also, some emerging nanoscale electronic components (such as single electron devices) have demonstrated their nondeterministic characteristics due to uncertainty inherent in their operation under high temperature and external random noise [4, 5]. This may further degrade the reliability of future nanoelectronic circuits. Thus, circuit reliability has been a growing concern in today’s micro- and nanoelectronics, leading to the increasing research interest in reliability analysis and reliability-oriented circuit design.

For any reliability-aware architecture design, it is indispensable to estimate the reliability of application circuits both accurately and efficiently. However, analyzing the reliability (or the error propagation) for logic circuits could be computationally expensive in general (see Section 1.3 for details). Some approaches have been reported in literature, which tackle the problem either analytically or numerically (by simulation). The contribution of this paper is to provide an extensive overview and comparative study on typical reliability estimation methods with our simulation results and/or results reported in literature.

We first review the key concepts in reliability analysis and its role in circuit design and then describe and evaluate several existing mainstream approaches for reliability analysis by looking at their accuracy, efficiency, and flexibility. Examples and simulation results are also given in order to show their advantages and disadvantages. Finally, we provide some useful suggestions on how to choose an appropriate reliability analysis method under different circumstances, along with some remarks on possible future work.

1.1. Signal Probability and Reliability. The probability of a logic signal $s$ is by default defined as the probability of the signal being logic “1” and is expressed as $P_s = \Pr\{s = \text{“1”}\}$. The reliability of the probabilistic signal $s$ is defined as the probability that its value is correct (i.e., it is equal to its error-free value) and is expressed as $r_s = \Pr\{s = \text{its error-free value}\}$. In gate-level design, the output signal of a gate may become unreliable due to its unreliable inputs and/or errors of the gate itself. If we use the classical von
Neumann model [6] for gate errors, any gate can be associated independently with an error probability $\varepsilon_i$. In other words, the gate is modeled as a binary symmetric channel that generates a bit flip (from 0 → 1 or 1 → 0) by mistake at its output (known as von Neumann error [6]) symmetrically with the same probability. Thus, each gate $i$ in the circuit has an independent gate reliability $r_i = 1 - \varepsilon_i$, which is assumed to be localized and statistically stable. Also, it is reasonable to assume that the error probability for any gate falls within [0, 0.5] (or $r_i \in [0.5, 1]$).

Figure 1: An AND gate and its equivalent circuit.

The reliability for a combinational logic circuit (denoted by $R_C$) is defined as the probability of the correct functioning at its outputs (i.e., the joint signal reliability of all primary outputs). This reliability can be generally expressed as a function of gate reliabilities in the circuit (denoted by $\mathbf{r} = \{r_1, r_2, \ldots, r_{N_g}\}$, where $N_g$ is the number of gates), as well as signal probabilities of all primary inputs (denoted by $\mathbf{P}_{in} = \{P_{in1}, P_{in2}, \ldots, P_{inN_{in}}\}$, where $N_{in}$ is the number of primary inputs), that is,

$$R_C = \Pr\{\text{all outputs are correct}\} = f(\mathbf{r}, \mathbf{P}_{in}), \tag{1}$$

where the function $f$ depends on the topology of the circuit under consideration. Note that the primary inputs are assumed to be fully reliable ($r_s = 1$ if $s$ is a primary input). Under a particular case where all primary input probabilities are a constant (say 0.5), $R_C$ turns out to be a function of $\mathbf{r}$ only.

It is worth noting that gate errors may come from either external noises (thermal noise, crosstalk, or radiation) [3] or inherent device stochastic behaviors [4]. In literature, the term “soft error” is used to emphasize the temporariness of the errors due to random external noises (e.g., glitches). In this paper, however, a more general term of von Neumann gate error model is used instead, as the probabilistic feature of gates is expected to exist widely and independently throughout the circuit. This differs from single-event upsets due to soft errors, where external noises are usually correlated temporally and spatially. In other words, our focus is the error propagation in combinational networks, where the gate-level logic masking is considered. For instance, some logic errors may not affect (or propagate to) final outputs if they occur in a nonsensitized portion of the circuit. Identifying these nonsensitized gates would be critical for reliability estimation and improvement.

1.2. Role of Reliability Analysis. In order to guide the IC design for reliable logic operations, it is required to develop tools that can accurately and efficiently evaluate circuit reliability, which is also a first step towards reliability improvement. However, reliability analysis is a nontrivial task due to the large size of IC circuits as well as the complexity of signal correlation and probability/reliability propagation within the circuit (as will become clear later in this paper). On the other hand, circuit reliability can be generally improved by increasing the gate reliabilities. This can be done by using

dissipation. One of the key issues in this context is to select the most critical (in terms of reliability and cost) components (or logic gates) in the circuit and improve the circuit reliability by increasing the robustness of only a few gates. In order to detect these critical gates, multiple cycles of reliability analysis are usually conducted for the whole circuit. In a more general term, accurate and efficient reliability analysis can provide a guideline for future reliability-oriented architecture design.

1.3. Complexity of Gate-Level Reliability Analysis. It is understood that the problem of determining whether the signal probability at a given node is nonzero is equivalent to the Boolean satisfiability (SAT) problem [8], a problem of determining whether there exists an interpretation that satisfies a given Boolean formula. A Boolean formula is called satisfiable if the variables of this given formula can be assigned in such a way as to make the formula evaluate to TRUE (3). The SAT has been proved to be an NP-complete problem (see [9]). The problem of computing all signal probabilities in a circuit can be formulated as a random satisfiability problem, which is to determine the probability that a random assignment of variables will satisfy a given Boolean formula [9]. The random satisfiability problem lies in a class of problems, called #P-complete, which is conjectured to be even harder than NP-complete. In the following, we show that the reliability evaluation problem is equivalent to the signal probability calculation problem and thus prove that it is also a #P-complete problem.

Let us consider a two-input AND gate ($Z = X_1 X_2$) which has the gate reliability $r$, as shown in Figure 1. We first add an extra XOR gate at the output, as well as an extra input $X_r$, with an assumption that both the XOR gate and original AND gate are error-free. The signal probability of this extra input is equal to the original gate error rate $\varepsilon$ (i.e., $P_{X_r} = \Pr\{X_r = \text{“1”}\} = 1 - r$). This ensures that the output $Z$ of this extra XOR
redundant components. Classic redundancy techniques such gate is equivalent to the original output of the AND gate.
as TMR [5] or NAND-multiplexing [7] achieve this by For a combinational logic circuit, we first duplicate the
systematically replicating logic gates (other than sizing up whole circuit. In the original circuit, we make each gate
the transistors) at the cost of increased area and power error-free in order to compute the correct value at primary

outputs. For the duplicated one, we extract the reliability of each gate using the aforementioned method (as a result, all gates are also error-free in the duplicated circuit and the number of gates is doubled). Then, we add 2-input XNOR gates for each pair of corresponding primary outputs in the original and duplicated circuits. Thus, the output reliability can be expressed as the signal probability at the output of the XNOR gates. By doing so (i.e., duplicating the circuit and extracting gate reliabilities), we see that the reliability estimation of the original circuit is equivalent to the problem of computing the signal probabilities of the transformed circuit.

For a combinational logic circuit with $N_{\mathrm{in}}$ primary inputs, $N_{\mathrm{out}}$ primary outputs, and $N_g$ logic gates, the problem of evaluating the signal reliability of all primary outputs and their joint reliability (i.e., the overall circuit reliability $R_C$) can be solved by exhaustively calculating all $2^{N_{\mathrm{in}}+N_g}$ scenarios. In each scenario, the expected (correct) output and actual output values need to be calculated with the complexity of $O(N_g)$. The total complexity is then $O(N_g \cdot 2^{N_{\mathrm{in}}+N_g})$. As circuits become very large, it would be difficult or even impossible to perform the exact analysis of the reliability due to the exponential complexity. Usually, some tradeoff has to be made between the accuracy and efficiency for reliability analysis.

In order to tackle this issue, a number of different approaches have been reported in literature, including probabilistic transfer matrix (PTM) method [10–12], Bayesian networks (BN) [13–15], Markov random field (MRF) [16–20], Monte Carlo (MC) simulation, testing-based method [3], stochastic computation model (SCM) [2, 21], probabilistic gate model (PGM) [22–25], observability-based analysis [26], Boolean difference-based error calculator (BDEC), and correlation coefficient method- (CCM-) based approaches [8, 26–28]. In the following, we overview some of these approaches and analyze their pros and cons in terms of accuracy, efficiency, and flexibility with simulation results.

2. Probabilistic Transfer Matrix (PTM) Method

An accurate analytical model for the reliability analysis problem is based on the probabilistic transfer matrices (PTMs), which compute the circuit output reliability for all input patterns [10, 11]. This computational framework begins with the definition of a probability matrix which is used to represent the probability of a logic gate’s output for each input pattern. For instance, the probability matrix representation for a two-input NAND logic gate is shown in Figure 2, where each column of the matrix $\mathbf{M}_g$ represents the probability of the gate output $Z$ being “0” or “1” for all different input patterns (i.e., $X_1X_2$ = “00,” “01,” “10,” and “11”). For example, the element $M_{11} = \Pr\{Z = 0 \mid X_1X_2 = 00\} = 1 - r$, where $r$ is the gate reliability. In general, the probability matrix for an $n$-input 1-output gate is a $2^n \times 2$ matrix.

For a circuit, all gate probability matrices shall be combined together to construct the PTM of the whole circuit. More specifically, the serial and parallel connections of gates correspond to a matrix product and tensor product [10], respectively. The fanout behavior is represented by explicit fanout gates, where a 1-input $m$-output fanout gate is simply mimicked by a 1-input $m$-output buffer gate. A fault-free circuit has an ideal transfer matrix (ITM), where the correct value of the output occurs with the probability of 1. This means that, in each row of the ITM, there is a single “1” for the correct output value and there are “0”s for other output combinations. The circuit reliability (i.e., the probability of outputs being correct) is evaluated by comparing its PTM and ITM.

The process of combining gate probability matrices implicitly takes into account the signal dependency between gates by considering the underlying joint and conditional probabilities within the circuit. As a result, the calculation of the circuit PTM is exact. However, the limited scalability is often a price that has to be paid for this computational framework to capture complex circuit behaviors. Consider a combinational logic circuit with $N_{\mathrm{in}}$ primary inputs, $N_{\mathrm{out}}$ primary outputs, and $N_g$ logic gates. The circuit PTM is a matrix with $2^{N_{\mathrm{in}}}$ rows and $2^{N_{\mathrm{out}}}$ columns (i.e., $2^{N_{\mathrm{in}}} \times 2^{N_{\mathrm{out}}}$), which contains the transition probability from all input combinations toward all output combinations. In other words, its space complexity is $O(2^{N_{\mathrm{in}}+N_{\mathrm{out}}})$. This exponential space requirement is the main bottleneck of the PTM approach. Particularly, for a computer with 2 GB memory, the maximum size of the circuit that can be handled is limited to 16 input/output signals. By utilizing some advanced computation methods (such as algebraic decision diagrams (ADDs) and encoding [10, 11]), the signal width may be extended up to ∼50, where the signal width is defined as the largest number of signals at any level in the circuit. Unfortunately, this limit is still computationally unacceptable in the real world for large-scale benchmark circuits (e.g., C2670 which has 157 inputs and 64 outputs). Nonetheless, for small circuits, the PTM is a very good analytical method, as it provides exact results within a reasonable runtime and shows the probabilistic behavior of unreliable logic gates.

Also, this approach can serve as the foundation of many other heuristic approaches by providing other important information such as signal probabilities and observability, with the capability of analyzing the effect of electrical masking on error mitigation as well. For instance, in [10], the observability of the $i$th gate is defined as the ratio of the error probability of the whole circuit and the error probability $\varepsilon_i$ of this gate, that is, $(1 - R_C(\varepsilon_i))/\varepsilon_i$, where $R_C(\varepsilon_i)$ is the circuit reliability when the only unreliable gate is the $i$th gate (with all other gates being error-free). Clearly, the gate with the highest observability can be regarded as the most susceptible, meaning that it will impact (or decrease) the circuit reliability the most. It should be noted that this only represents the simplest case where only single gate failure is considered. In most real cases, however, the gate observabilities may not be independent, and thus the joint observabilities usually need to be considered instead.

The detailed algorithm with the PTM is summarized as follows.

Step 1. Levelize the circuit; compute PTMs of each logic component in each level denoted by $\mathbf{M}^{j}_{\mathrm{Lv}_i}$.

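Figure 2: (a) A 2-input NAND gate with inputs $X_1$ and $X_2$, output $Z$, and gate reliability $r$ and (b) its probability matrix $\mathbf{M}_g$ (according to [10]). Rows correspond to the input patterns $X_1X_2$ = 00, 01, 10, and 11; columns correspond to the output values $Z$ = 0 and 1:

$$\mathbf{M}_g = \begin{bmatrix} 1-r & r \\ 1-r & r \\ 1-r & r \\ r & 1-r \end{bmatrix}$$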
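The probability matrix of Figure 2 can be generated mechanically from a gate's truth table and its reliability. The following is a minimal sketch under that reading; the helper names `gate_ptm` and `nand` and the value r = 0.95 are illustrative assumptions, not taken from [10].

```python
from itertools import product


def gate_ptm(truth_fn, n_inputs, r):
    """Return the 2^n x 2 probability matrix of a gate with reliability r.

    Rows follow the input patterns 00...0 to 11...1. Column 0 holds
    Pr{Z = 0 | inputs} and column 1 holds Pr{Z = 1 | inputs}.
    """
    rows = []
    for bits in product((0, 1), repeat=n_inputs):
        correct = truth_fn(*bits)      # error-free output for this pattern
        row = [1.0 - r, 1.0 - r]       # the wrong value appears with probability 1 - r
        row[correct] = r               # the correct value appears with probability r
        rows.append(row)
    return rows


def nand(x1, x2):
    """Error-free 2-input NAND."""
    return 1 - (x1 & x2)


# Hypothetical gate reliability, chosen only for illustration.
m_g = gate_ptm(nand, 2, 0.95)
for pattern, row in zip(("00", "01", "10", "11"), m_g):
    print(pattern, row)
```

Each row sums to 1, and only the position of the entry r changes with the error-free output, which is the structure shown in Figure 2(b).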

Step 2. Within one level, the PTMs of each logic component (gates, wires, and fanout nodes) are tensored together to form the PTM of the current level; that is, $\mathbf{M}_{\mathrm{Lv}_i} = \mathbf{M}^{1}_{\mathrm{Lv}_i} \otimes \mathbf{M}^{2}_{\mathrm{Lv}_i} \otimes \cdots$;

Step 3. The PTMs of all levels are then multiplied together to get the circuit PTM; that is, $\mathbf{M} = \prod_i \mathbf{M}_{\mathrm{Lv}_i}$.

Step 4. Calculate the ideal transfer matrix $\mathbf{J}$ using the truth table of the logic function (error-free signal probabilities $p(i)$ for input patterns are evaluated with the computation complexity of $O(N_g \cdot 2^{N_{\mathrm{in}}})$).

Step 5. The circuit reliability is given by [11]:

$$R_C = \sum_{i,j:\, J(i,j)=1} M(i,j)\, p(i). \tag{2}$$

Figure 3: The example circuit schematic (a portion of C17 benchmark circuit). The schematic has four levels (Lv 1 to Lv 4), primary inputs $X_1$ to $X_4$, four gates with reliabilities $r_1$ to $r_4$, a fanout node $F_1$ on signal $X_5$, internal signals $X_6$ and $X_7$, and the output $Z$.

We take a simple circuit as an example to illustrate the analysis process of the PTM approach. The circuit schematic is shown in Figure 3, where the circuit has 4 levels, and the fanout $F_1$ reconverges at gate number 4, generating the dependency between signals $X_6$ and $X_7$. Since there are four inputs ($X_1$, $X_2$, $X_3$, and $X_4$) and one single output $Z$, the circuit PTM would be a $16 \times 2$ matrix $\mathbf{M}$ which stores the probability of occurrence of all input-output vector pairs. The matrix $\mathbf{M}$ is constructed by combining PTMs of all levels (using matrix product due to serial connection in this case), while the PTM of each level is calculated by combining PTMs of each logic component within the current level (using tensor product due to their parallel connection). More specifically, we have (based on [10])

$$\begin{aligned}
\mathbf{M} &= (\mathbf{I} \otimes \mathrm{NAND}_1 \otimes \mathbf{I})_{16\times 8} \cdot (\mathbf{I} \otimes \mathbf{F} \otimes \mathbf{I})_{8\times 16} \cdot (\mathrm{NAND}_2 \otimes \mathrm{NAND}_3)_{16\times 4} \cdot (\mathrm{NAND}_4)_{4\times 2}, \\
\mathbf{I} &= \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad
\mathbf{F} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \quad
\mathrm{NAND}_i = \begin{bmatrix} 1-r_i & r_i \\ 1-r_i & r_i \\ 1-r_i & r_i \\ r_i & 1-r_i \end{bmatrix},
\end{aligned} \tag{3}$$

where the matrix $\mathbf{I}$ refers to a $2 \times 2$ identity PTM, and each parenthesized term in (3) corresponds to a specific circuit level. Assuming the gate reliabilities are $r_1 = r_2 = r_3 = r_4 = 0.95$ and the probability of all input signals is equally 0.5, the circuit PTM and ideal transfer matrix are found using the above algorithm as follows:

$$\mathbf{M} = \begin{bmatrix}
0.8622 & 0.1378 \\
0.1312 & 0.8688 \\
0.8622 & 0.1378 \\
0.1312 & 0.8688 \\
0.8622 & 0.1378 \\
0.1312 & 0.8688 \\
0.8622 & 0.1378 \\
0.8238 & 0.1762 \\
0.1312 & 0.8688 \\
0.0928 & 0.9073 \\
0.1312 & 0.8688 \\
0.0928 & 0.9073 \\
0.1312 & 0.8688 \\
0.0928 & 0.9073 \\
0.8238 & 0.1762 \\
0.8217 & 0.1783
\end{bmatrix}, \quad
\mathbf{J} = \begin{bmatrix}
1 & 0 \\
0 & 1 \\
1 & 0 \\
0 & 1 \\
1 & 0 \\
0 & 1 \\
1 & 0 \\
1 & 0 \\
0 & 1 \\
0 & 1 \\
0 & 1 \\
0 & 1 \\
0 & 1 \\
0 & 1 \\
1 & 0 \\
1 & 0
\end{bmatrix}. \tag{4}$$

It can be seen from $\mathbf{M}$ that the output reliability depends on input patterns. The lowest and highest values for the output reliability are 0.8217 and 0.9073, which occur when the input vector $(X_1X_2X_3X_4)$ = (1111) and (1001, 1011, and 1101), respectively. The circuit reliability is found to be $R_C = 0.8658$ with the runtime of 0.2798 s.

The PTM algorithm has been implemented on some small circuits. The simulation results show that its performance is fairly good for circuits with fewer than 20 gates. If the circuit

size increases to ∼40, both runtime and memory cost will grow dramatically, making the PTM method computationally expensive. In order to handle large-scale circuits, a variant PTM method was proposed in [11], where the input vector sampling is used. The simulation results show that this does improve efficiency with reduced memory cost, while the accuracy remains to be seen.

In summary, the PTM method has two major limitations. First, the signal width of the circuit that can be analyzed is very limited. This is due to the fact that its space complexity grows exponentially with the number of inputs and outputs, leading to prohibitively massive matrix storage and manipulation overhead for large-scale circuits. Secondly, the circuit structure needs to be preprocessed (such as circuit levelization and identification of the fanout nodes and wire pairs) prior to the algorithm implementation. Also, the PTM assumes all signals are correlated, which makes the method less efficient for circuits with no or a few reconvergent fanouts.

3. Monte Carlo (MC) Simulation

MC is a widely known simulation-based approach, where experimental data are collected to characterize the behavior of a circuit by randomly sampling its activity [2]. It is usually used when an analytical approach is unavailable or difficult to implement. The obvious drawbacks of this approach lie in the fact that numerous pseudorandom numbers need to be generated, and a large number of simulation runs must be executed to reach a stable result. This makes the reliability analysis for large circuits a very time-consuming process.

As a stochastic computation framework, the MC method makes the result gradually converge to its exact value as more simulation runs are performed. In the process of achieving relatively stable results, certain statistical parameters (such as standard deviation $\sigma$ and/or coefficient of variation (CV), which is defined as the ratio of the standard deviation and the mean, i.e., $\sigma/\mu$) are usually used as the stopping criteria. In [2], CV = 0.001 is used to represent an acceptable level of accuracy, and the number of simulation runs required is given by

$$N_{\mathrm{MC}} = \frac{1 - R_C}{R_C} \cdot \frac{1}{\mathrm{CV}^2} \approx 10^6 \cdot \left(\frac{1}{R_C} - 1\right), \tag{5}$$

where $R_C$ is again the circuit reliability. Since the circuit reliability usually decreases with the circuit size ($N_g$), $N_{\mathrm{MC}}$ will increase with the circuit size for a given accuracy (measured by CV). Assuming that $R_C$ ranges from 0.1 to 0.9, the number of MC runs will vary around $10^5 \sim 10^7$. It should be mentioned that (5) only gives an approximated range of $N_{\mathrm{MC}}$, and its actual value is usually determined experimentally for real circuits. Let us take the circuit of Figure 3 again as an example. From (5), the required $N_{\mathrm{MC}}$ is $\sim 1.55 \times 10^5$ if $R_C = 0.8658$. Figure 4 shows the relative error at $R_C$ against $N_{\mathrm{MC}}$. It can be seen from the figure that after $\sim 10^4$ runs, the result becomes relatively stable around its final value. However, a small random fluctuation is inevitable. Even after $\sim 10^5$ simulation runs, the relative error of the MC result is a nonzero value (5.636e−04), indicating a low convergence rate with the MC. This is a common feature for stochastic computations.

Figure 4: The relative error of circuit reliability $R_C$ of Figure 3 versus the number of MC simulation runs $N_{\mathrm{MC}}$ (horizontal axis: number of simulation runs, $\times 10^4$; vertical axis: relative error at $R_C$, from −0.01 to 0.01).

4. Stochastic Computation Model (SCM)

Unlike the MC method which uses Bernoulli sequences for simulation, the SCM approach takes non-Bernoulli sequences [2, 21]. In a non-Bernoulli sequence, for a given probability $p$ and a sequence length $N$, the number of “1”s to be generated is fixed and given by $N \cdot p$, and only the positions of the “1”s are determined by a random permutation of binary bits. Therefore, in the SCM approach, fewer pseudorandom numbers are generated for the same length of simulation, compared to MC simulation where pseudorandom numbers are independently generated for each gate or input to mimic the behavior of probabilistic circuits [2].

Consider a circuit with $N_{\mathrm{in}}$, $N_{\mathrm{out}}$, $N_g$, $\mathbf{P}_{\mathrm{in}}$, and $\varepsilon$ (refer to the previous sections for definitions of these variables). If we use a sequence length of $N$, the total required number of random numbers is given by $(N_{\mathrm{in}} + N_g) \cdot N$ in MC simulation. In contrast, for the SCM approach with the same sequence length, only $N\varepsilon$ pseudorandom numbers need to be generated (for the positions of “1”s) for a gate with error rate $\varepsilon$. Therefore, the total number of random numbers is reduced to $(N_{\mathrm{in}} \cdot p_{\mathrm{in}} + N_g \cdot \varepsilon) \cdot N$. Since the gate error rate $\varepsilon$ is usually a small value which can be viewed as a scale factor, the total required number of random numbers is significantly reduced. In other words, for a specific level of accuracy, the non-Bernoulli sequence requires a smaller sequence length than the Bernoulli sequence does. However, how to efficiently determine the required minimum sequence length for the SCM is still an open question. In [2], an empirical function (rather than an analytical expression) was used for this purpose.

Again, we took the example circuit of Figure 3 and used the same sequence length with MC (i.e., $N_{\mathrm{SCM}} = N_{\mathrm{MC}} = 10^5$) with gate error rate $\varepsilon = 0.001$. The SCM and MC

simulation results are compared in Figure 5, where both have a similar convergence rate. However, the runtimes with SCM and MC are $T_{\mathrm{SCM}} = 0.0745$ s and $T_{\mathrm{MC}} = 1.2528$ s, respectively, indicating that the SCM method is more efficient than the MC. This efficiency improvement is mainly due to the fewer random numbers that are generated in the SCM simulation.

Figure 5: The relative error of circuit reliability $R_C$ versus the number of simulation runs $N_{\mathrm{MC}}/N_{\mathrm{SCM}}$ (curves: SCM and MC; horizontal axis: number of simulation runs, $\times 10^4$; vertical axis: relative error at $R_C$, from −0.01 to 0.01).

We also implemented both SCM and MC approaches in Matlab with the same sequence length of $10^6$ (gate error rate $\varepsilon = 0.01$) and tested their performance on ISCAS’85 benchmark circuits. The results are shown in Table 1, where the runtime with the SCM is around 1/6∼1/3 of that with the MC. One of the disadvantages of SCM is the difficulty in determining its simulation sequence length $N_{\mathrm{SCM}}$. Also, its runtime is proportional to gate error rate $\varepsilon$ as well as input probabilities. If $\varepsilon$ is relatively large (say 0.2), the runtime improvement of SCM over MC would be marginal (only scaled by a constant).

Table 1: Runtime comparison of MC and SCM on benchmark circuits ($\varepsilon = 0.01$).

Circuit   Size   MC (10^6 runs)   SCM (10^6 runs)       SCM (10^6 runs)
                 runtime (s)      runtime (s), ε = 0.01  runtime (s), ε = 0.1
c432      160    183              31                     38
c499      202    203              37                     45
c880      383    373              63                     77
c1355     546    472              92                     111
c1908     880    842              183                    215
c2670     1193   1151             265                    311
c3540     1669   1616             409                    505
c5315     2406   2548             786                    961
c7552     3512   3732             1325                   1495

5. Probabilistic Gate Model (PGM)

The PGM is another reliability analysis method which is based on the probabilistic models of unreliable logic gates [22–25]. In the simple version of PGM, the input signals of each gate in the circuit are assumed to be independent. Under this assumption, the output probability of each gate can be easily calculated using the information of input signal probabilities and gate error rate. For instance, consider a 2-input NAND gate with input probabilities of $X_1$ and $X_2$ and gate error rate of $\varepsilon$. Its output signal probability can be expressed as (after [24])

$$\begin{aligned}
Z &= \Pr(\text{“1”} \mid \text{gate faulty}) \cdot \Pr(\text{gate faulty}) + \Pr(\text{“1”} \mid \text{gate not faulty}) \cdot \Pr(\text{gate not faulty}) \\
&= (1 - \varepsilon) + (2\varepsilon - 1) X_1 X_2.
\end{aligned} \tag{6}$$

This output probability $Z$ can be used recursively as the input information at the next level of gates. One of the main features with PGM is that the circuit reliability is analyzed by exhaustively evaluating each input combination and output. For any given input combination, the error-free output value $Z_{\mathrm{ef}}$ is calculated, and then the output signal probability $P_O$ is evaluated using the PGM of all gates in the circuit. Depending on the error-free output value, the output reliability $R$ for this specific input combination is given by [24]

$$R = \begin{cases} P_O, & Z_{\mathrm{ef}} = 1, \\ 1 - P_O, & Z_{\mathrm{ef}} = 0. \end{cases} \tag{7}$$

Finally, the overall output reliability is the weighted sum of all conditional output reliabilities over all possible input combinations, where the weight is the probability of a specific input combination.

Intuitively, the operation process of PGM is similar to PTM in the sense that both of them consider all input combinations in a forward topological order. An obvious disadvantage with the PGM approach is that it is almost impossible to exhaustively enumerate all input combinations when the number of inputs increases (say to 30 and above). Therefore, a certain sampling technique is often necessary for large circuits. The input pattern sampling becomes another source of errors, in addition to the inaccuracy caused by the signal independence assumption in constructing gate PGMs (it should be pointed out that while signal correlations due to fanouts originating from the primary inputs are eliminated by assigning the deterministic values (either “0” or “1”) to all primary inputs, those caused by other reconvergent fanout nodes are not).

In order to eliminate all signal correlations, an accurate PGM algorithm was proposed in [24] where deterministic values are assigned explicitly to all reconvergent fanout nodes within the circuit. More specifically, for each fanout, the original circuit is transformed to two auxiliary circuits [24], one with the fanout node being set to logic value “0” and the other to “1.” In each of these two circuits, the output probability is computed by using conditional probabilities for the given value at the fanout. This procedure is executed iteratively until

all fanouts have been processed. If all input combinations are simulated, this procedure will lead to exact results for any circuit. However, for a circuit with $N_f$ reconvergent fanouts, a total of $2^{N_f}$ auxiliary circuits are required and analyzed. Therefore, the computation complexity becomes $O(N_g \cdot 2^{N_{\mathrm{in}}+N_f})$ [24]. However, in many real circuits, the number of reconvergent fanouts $N_f$ is comparable to the number of gates ($N_g$). Thus, the complexity of the above accurate PGM algorithm is still an exponential function of the circuit size, making it infeasible in general for large circuits.

In an effort to improve the efficiency of the accurate PGM method, a modular PGM approach was also introduced in [24]. It is based on the observation that many large circuits contain a limited number of simple logic components that are used repeatedly. With this in mind, circuits can be decomposed into several modules whose reliabilities are calculated using the accurate PGM method. The circuit output reliability is then evaluated by combining these modules along the path from primary inputs. Unfortunately, the input sampling is still needed in this case for large-scale circuits.

For the example circuit of Figure 3 with 4 input signals, a total of 16 input combinations need to be considered. We plot the conditional output reliability for each input combination in Figure 6, which shows that the output reliability varies within a relatively small range (no more than ±10%) for different input combinations. In other words, the input vector sampling can be implemented effectively with small errors. The overall output reliability is given by a weighted sum over all input combinations and is found to be $R_C = 0.8701$ (with the runtime of $T_{\mathrm{PGM}} = 0.0093$ s), compared to the accurate value of 0.8658 given by PTM (i.e., the relative error is as low as ∼0.5%).

Figure 6: The conditional output reliability of Figure 3 for different input combinations ($x$-axis labels 1∼16 indicate 16 input patterns from 0000∼1111; vertical axis: conditional $R_C$, from 0 to 1).

In order to see the performance of different PGM algorithms on large circuits, we implemented the simple PGM algorithm in Matlab and tested it on ISCAS’85 benchmarks. The results are shown in Table 2. We also compare the simple PGM with both accurate and modular PGM methods in Tables 3 and 4, where the simulation results for both accurate and modular PGM methods are taken from [24].

Table 2: Simulation results for simple PGM in comparison with MC.

                 Simple PGM approach (10^3 samples)               Monte Carlo (10^6 runs)
Circuit   Size   Average error (%)   Max error (%)   Runtime (s)   Runtime (s)
C432      160    0.54                1.59            4.66          183
C499      202    0.1                 0.31            4.92          203
C880      383    0.61                2.83            9.17          373
C1355     546    1.26                1.66            12.21         472
C1908     880    0.39                0.85            18.69         842
C2670     1193   2.43                16.61           25.72         1151
C3540     1669   0.077               2.27            39.46         1616
C5315     2307   10.88               43.16           61.42         2548
C7552     3512   2.68                13.68           75.19         3732
Average   -      2.18                9.22            -             -

Table 3: Comparison of simple PGM and accurate PGM.

                 Simple PGM approach (10^3 samples)   Accurate PGM approach (10^3 samples) [24]
Circuit   Size   Average error (%)   Runtime (s)      Runtime (s)
Cu        43     1.37                0.0277           0.10
z4ml      45     0.94                0.0039           0.05
x2        38     0.52                0.0275           0.22
Mux       50     0.52                0.282            0.10

It can be seen from these tables that the simple PGM algorithm can provide highly accurate results if the circuits (such as C432 and C1355) have no or few reconvergent fanouts and/or if the fanouts originate from the primary inputs.
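The simple PGM recursion of (6) and the conditional reliability of (7) can be sketched for the Figure 3 example as follows. This is an illustrative Python sketch, not the authors' Matlab implementation. The gate connectivity is read off (3), the gate error rate 0.05 corresponds to the reliability 0.95 used in the PTM example, and all function names are assumptions. A brute-force enumeration of gate-error patterns is included as a reference.

```python
from itertools import product

EPS = 0.05  # gate error rate, i.e. gate reliability 0.95 as in the PTM example


def nand(a, b):
    """Error-free 2-input NAND on 0/1 values."""
    return 1 - (a & b)


def pgm_nand(p1, p2, eps=EPS):
    """Equation (6): probability that an unreliable NAND outputs 1."""
    return (1.0 - eps) + (2.0 * eps - 1.0) * p1 * p2


def error_free_output(x1, x2, x3, x4):
    """Error-free output Z_ef of the Figure 3 circuit."""
    x5 = nand(x2, x3)
    return nand(nand(x1, x5), nand(x5, x4))


def simple_pgm_reliability(x1, x2, x3, x4):
    """Conditional output reliability from (6) and (7) for one input pattern."""
    p5 = pgm_nand(x2, x3)      # gate 1, drives the fanout F1
    p6 = pgm_nand(x1, p5)      # gate 2
    p7 = pgm_nand(p5, x4)      # gate 3
    p_out = pgm_nand(p6, p7)   # gate 4, treats X6 and X7 as independent
    return p_out if error_free_output(x1, x2, x3, x4) == 1 else 1.0 - p_out


def exact_reliability(x1, x2, x3, x4, eps=EPS):
    """Reference value: enumerate all 2^4 gate-error patterns."""
    z_ef = error_free_output(x1, x2, x3, x4)
    total = 0.0
    for errors in product((0, 1), repeat=4):
        weight = 1.0
        for e in errors:
            weight *= eps if e else 1.0 - eps
        e1, e2, e3, e4 = errors
        x5 = nand(x2, x3) ^ e1   # an error flips the gate output
        x6 = nand(x1, x5) ^ e2
        x7 = nand(x5, x4) ^ e3
        z = nand(x6, x7) ^ e4
        if z == z_ef:
            total += weight
    return total


# All 16 input patterns are equally likely (input probabilities of 0.5).
patterns = list(product((0, 1), repeat=4))
rc_pgm = sum(simple_pgm_reliability(*p) for p in patterns) / len(patterns)
rc_exact = sum(exact_reliability(*p) for p in patterns) / len(patterns)
print(f"simple PGM: {rc_pgm:.4f}, exact: {rc_exact:.4f}")
```

Under these assumptions the two averages agree with the values 0.8701 and 0.8658 quoted above. The gap comes from gate 4, where the simple PGM ignores the correlation that the fanout of $X_5$ introduces between $X_6$ and $X_7$.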

Table 4: Simulation results for modular PGM in comparison with MC.

                 Modular PGM approach (10^3 samples) [24]   Monte Carlo (10^6 runs)
Circuit   Size   Average error (%)   Runtime (s)            Runtime (s)
C432      160    9.21                0.25                   183
C499      202    0.11                2.15                   203
C1355     546    4.2                 2.98                   472
C2670     1193   0.43                4.26                   1151
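Tables 2 and 4 use Monte Carlo simulation as the reference. A minimal sketch of such a baseline for the small Figure 3 circuit is given below. It assumes the von Neumann error model (each gate output flips independently with probability 0.05) and equiprobable inputs. The function names, the seed, and the run count of 20000 are illustrative choices, not the authors' Matlab setup.

```python
import random


def nand(a, b):
    """Error-free 2-input NAND on 0/1 values."""
    return 1 - (a & b)


def noisy_nand(a, b, eps, rng):
    """NAND whose output is flipped with probability eps."""
    out = nand(a, b)
    return out ^ 1 if rng.random() < eps else out


def mc_reliability(n_runs, eps=0.05, seed=1):
    """Estimate R_C of the Figure 3 circuit from n_runs random samples."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    correct = 0
    for _ in range(n_runs):
        x1, x2, x3, x4 = (rng.randint(0, 1) for _ in range(4))
        # Error-free reference output.
        g5 = nand(x2, x3)
        z_ef = nand(nand(x1, g5), nand(g5, x4))
        # Noisy evaluation of the same circuit.
        x5 = noisy_nand(x2, x3, eps, rng)
        x6 = noisy_nand(x1, x5, eps, rng)
        x7 = noisy_nand(x5, x4, eps, rng)
        z = noisy_nand(x6, x7, eps, rng)
        correct += int(z == z_ef)
    return correct / n_runs


estimate = mc_reliability(20000)
print(f"MC estimate of R_C: {estimate:.4f}")
```

The estimate fluctuates around the PTM value of 0.8658, and the fluctuation shrinks only slowly as the number of runs grows, which is why MC runtimes dominate the comparisons in Tables 2 and 4.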

For those circuits with significant fanouts (such as C2670 and C5315), the average (or maximum) errors for the simple PGM can increase significantly (in particular, the maximum error is up to 43% for C5315, as shown in Table 2). From Table 3, the accurate PGM needs longer runtimes than the simple PGM for small circuits. Results in Table 4 confirm that the modular PGM is very efficient while the accuracy may not always be good enough for some circuits (with an average error of 9% for C432).

In summary, for all the above three different versions of PGM, the input sampling is inevitable for improved efficiency if the number of primary inputs $N_{\mathrm{in}}$ is large (∼30). This is mainly where the analysis errors come in. Thus, it can be concluded that they represent a good model only for circuits with a small number of primary inputs, where no input sampling is required. For the circuits without reconvergent fanouts, the input sampling in the PGM approach is unnecessary, because both signal probability and output reliability in this case can be computed within $O(N_g)$ time (see [29] for details).

6. Observability-Based Reliability Analysis

Another reliability analysis method was presented in [26], which is based on the observation that an error at the output of any gate is the cumulative effect of a local error component attributed to the error probability of the gate and a propagated error component attributed to the failure of gates in its transitive fan-in cone. In [26], the observability of a gate (or its output signal) is the conditional circuit error probability given the single error at the current gate. The value of this observability can be simply defined as $o_i = (1 - R_C(\varepsilon_i = 1))$, where $R_C(\varepsilon_i = 1)$ is the circuit reliability given a single error with the current gate, and can be calculated using Boolean differences [29], symbolic techniques (such as BDDs), or simulation methods. It can be expected that the gate observabilities are highly related to the input probabilities.

For a single-fault case (i.e., only one gate in the circuit is erroneous), the circuit reliability (assuming a single primary output) can be simply calculated by considering each fault case individually. Assume that the error rate and observability of the $i$th gate are $\varepsilon_i$ and $o_i$, respectively. If gate $i$ is erroneous while the other gates are fault-free, the output error probability is, by the definition of $o_i$, simply equal to $o_i$. Thus, the overall reliability can be easily calculated from

$$1 - R_C = \sum_{\text{all } i} \Bigl(\varepsilon_i \cdot o_i \cdot \prod_{j \neq i} (1 - \varepsilon_j)\Bigr), \tag{8}$$

which is exact for the single-fault case.

If a multiple-error case is considered, the complexity of computing the reliability will grow exponentially with $N_g$. In order to improve the efficiency in this case, the following two assumptions are used in [26]: (a) the impacts of gate failures on the primary output are decoupled, which implies that the output is erroneous if an odd number of gates are simultaneously observable and (b) the observabilities of all gates are independent. As a result, the simultaneous observability of multiple gates is simply the product of their individual observabilities.

We took the example circuit of Figure 3 for illustration. First, let us assume all four gates ($G_1 \sim G_4$) in the circuit are erroneous with the probabilities of $\varepsilon_1$, $\varepsilon_2$, $\varepsilon_3$, and $\varepsilon_4$, respectively (other cases can be analyzed similarly). Based on the above assumption (a), we only need to consider the cases where an odd number (1 or 3) of gates is simultaneously observable. This means that when an even number (0, 2, or 4) of gates is observable, the output signal $Z$ will have the correct value as gate errors are logically masked by one another. Secondly, under the assumption (b), the probability of only one gate being observable is given by $\sum_i \bigl(o_i \cdot \prod_{j \neq i} (1 - o_j)\bigr)$ (the probability of three gates being simultaneously observable can be calculated similarly). Based on these assumptions, a closed-form expression for the reliability of the circuit (assuming a single primary output) can be written generally as a function of error probabilities and observabilities of all gates [26]; that is,

$$R_C = \frac{1}{2}\Bigl(1 + \prod_i (1 - 2\varepsilon_i o_i)\Bigr), \tag{9}$$

which can be computed efficiently if all gate observabilities are known (however, this analysis is only suitable for small circuits or large ones with small values of gate error probabilities, which will be clear later). The gate observability can be determined using the PTM method. For instance, the observability of gate $G_1$ in Figure 3 follows from the output reliability obtained by setting $r_1 = 0$ and $r_2 = r_3 = r_4 = 1$, that is, $o_1 = 1 - R_C(\varepsilon_1 = 1)$. The resulting output reliabilities are [0.25, 0.375, 0.375, 0], which correspond to $[o_1, o_2, o_3, o_4] = [0.75, 0.625, 0.625, 1]$. We calculate the circuit reliability $R_C$ using the above expression and plot the results against the accurate values given by the PTM in Figure 7(a) for different values of gate reliability. The relative error is shown in Figure 7(b). It can be seen clearly from these figures that the observability-based analysis is only accurate for small gate error rates, in which case the probability for single gate failure is significantly higher than that for multiple gate failures.

To reduce the computational complexity of the above observability-based reliability analysis, [26] also proposed

Figure 7: (a) Circuit reliability $R_C$ versus gate reliability $r_g$ (curves: observability-based and PTM) and (b) relative error versus gate reliability $r_g$ for the example circuit of Figure 3.

the single-pass algorithm [26] and takes the shorter runtime than MC or SCM.

7. Correlation Coefficient Method (CCM)

CCM is a widely used approach that evaluates the signal probabilities for (fault-free) combinational circuits [30]. As mentioned before, the reliability analysis can be transformed to signal probability computation. Therefore, the CCM can be used for reliability estimation [8, 26–28, 30]. The main idea of CCM is briefly described below.

In order to compute the signal probability, the correlation coefficient between two probabilistic signals (denoted by $i$ and $j$) is defined as [8]

$$C_{i,j} = C_{j,i} = \frac{P(ij)}{P(i)\,P(j)} = \frac{P(j \mid i)}{P(j)}, \tag{10}$$

which is equal to 1 if signals $i$ and $j$ are independent. It should be noted that, here, only the first order correlation coefficients are considered, and the correlation of two signals with a third one (denoted by $h$) is approximated as $C_{ij,h} = C_{i,h} \cdot C_{j,h}$. For reliability computation, four correlation coefficients for a pair of signals are needed. Each coefficient corresponds to a combination of events (i.e., 0 → 1 or 1 → 0 error) on the signal pair. In other words, the signal error (or reliability) correlation coefficient between signals $i$ and $j$ is defined as [26]

$$\begin{aligned}
C_{ij} &= \frac{P(i_{0\to 1}\, j_{0\to 1})}{P(i_{0\to 1})\, P(j_{0\to 1})}, &
C_{i\tilde{j}} &= \frac{P(i_{0\to 1}\, j_{1\to 0})}{P(i_{0\to 1})\, P(j_{1\to 0})}, \\
C_{\tilde{i}j} &= \frac{P(i_{1\to 0}\, j_{0\to 1})}{P(i_{1\to 0})\, P(j_{0\to 1})}, &
C_{\tilde{i}\tilde{j}} &= \frac{P(i_{1\to 0}\, j_{1\to 0})}{P(i_{1\to 0})\, P(j_{1\to 0})},
\end{aligned} \tag{11}$$

where $P(i_{0\to 1})$ is the probability that the value of signal
$i$ flips to 1 from its correct value 0, that is, the error
a sampling algorithm by considering the constraint that only probability of 𝑖 given that its error-free value is 0. Once
a maximum of 𝑘 gates can fail simultaneously. This algorithm the error correlations and error-free signal probabilities are
first generates a set of samples for failed gates and guarantees generated, the single-pass analysis is conducted using the
that the total number of gates with error is no more than forward topological order with the computational complexity
𝑘. Then, a single-pass reliability analysis algorithm [26] was of 𝑂(𝑁𝑔 ). Since the computation complexity of CCM is linear
used to evaluate the error probability at the primary outputs, with the number of levels (𝐿) and pseudoquadratic with the
number of gates per level (𝑁𝐿 ), the overall complexity of
leading to the computational complexity of 𝑂(𝑁𝑔 ⋅ 𝑘2 ), where
𝑁𝑔 is the number of gates with error. For a specific sample, CCM-based reliability analysis turns out to be 𝑂(𝑁𝑔1.5 ) if
the reliabilities of gates in the sampling are set to be 0 and a square circuit is assumed (i.e., 𝑁𝐿 = 𝐿 = 𝑁𝑔0.5 ). This
the rest are set to be 1. Finally, the overall circuit reliability complexity is an upper bound as not all signals are correlated
is estimated by averaging the reliabilities over all samples. in real circuits.
Therefore, this maximum-𝑘 gate failure model can be viewed In [26] which uses the CCM, an average relative error
as a hybrid method that makes a trade-off between the of up to ∼13% over all outputs was reported for circuits
accuracy of simulation-based method and the efficiency of with significant fanout (e.g., C499 and C1355) when the gate
analytical approach. It provides more accurate results than error rates range within [0, 0.5] (for other benchmark circuits,
the error was around 2∼6%). Also, the relative errors may not be mitigated significantly by using more correlation coefficients. For instance, by using 0, 4, and 16 correlation coefficients, the relative errors for C499 are only improved to 13.1%, 11.2%, and 11.11%, respectively [26], where the zero-coefficient case means that all signals are treated as independent, with a computational complexity of $O(N_g)$. It is shown in [26] that the runtime of using 4 coefficients is about two orders of magnitude longer than that of the zero-coefficient case (∼100 s versus ∼1 s, for a circuit with ∼1000 gates). Therefore, it may not be worthwhile to calculate more correlation coefficients for slightly improved accuracy. In [30], the relative error for large circuits (with hundreds of gates) was reported at ∼7% on average with a runtime of ∼10 s, which is comparable to the results from [26].

Figure 8: A general sketch of solution space for different circuit categories (accuracy level versus computation cost; regions: small circuit with a few reconvergent fanouts, small circuit with many reconvergent fanouts, large circuit with a few reconvergent fanouts, large circuit with lots of reconvergent fanouts).
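As a small illustration of the correlation coefficient in (10), the following Python sketch enumerates the truth table of a toy reconvergent-fanout case. The signals $i = a$ and $j = a \lor b$, the input probabilities, and all function names are assumptions chosen for illustration; they are not taken from [8] or [26].

```python
from itertools import product


def signal_stats(f_i, f_j, p_inputs):
    """Return P(i), P(j), P(ij) by enumerating independent primary inputs.

    f_i and f_j map the primary-input bits to the values of signals i and j;
    p_inputs[k] is the probability that primary input k equals 1.
    """
    p_i = p_j = p_ij = 0.0
    for bits in product((0, 1), repeat=len(p_inputs)):
        weight = 1.0
        for bit, p in zip(bits, p_inputs):
            weight *= p if bit else 1.0 - p
        i, j = f_i(*bits), f_j(*bits)
        p_i += weight * i
        p_j += weight * j
        p_ij += weight * (i & j)
    return p_i, p_j, p_ij


def correlation_coefficient(p_i, p_j, p_ij):
    """C_{i,j} = P(ij) / (P(i) P(j)); equals 1 for independent signals."""
    return p_ij / (p_i * p_j)


# Hypothetical example: both signals depend on primary input a.
p_i, p_j, p_ij = signal_stats(lambda a, b: a, lambda a, b: a | b, [0.5, 0.5])
c_ij = correlation_coefficient(p_i, p_j, p_ij)

# Output probability of an AND gate fed by i and j.
p_and_independent = p_i * p_j        # ignores the shared input a
p_and_corrected = p_i * p_j * c_ij   # recovers P(ij) via the coefficient
print(c_ij, p_and_independent, p_and_corrected)
```

In this toy case the two signals share the input $a$, so treating them as independent underestimates the probability at the output of an AND gate, while multiplying by $C_{i,j}$ restores the joint probability $P(ij)$.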

8. Comparison and Future Work


In summary, the ultimate goal of existing approaches for reliability analysis is to achieve more accurate results with as low a computational cost as possible. Both accuracy and efficiency depend on specific circuit structures and their size, and, in most cases, a tradeoff between them needs to be made. The main features of each approach are described as follows.

(a) If circuits have no reconvergent fanouts (e.g., a circuit with tree structure), both signal probability and reliability can be calculated exactly in linear time (i.e., $O(N_g)$). The readers are referred to [29] for further details.

(b) For those circuits with reconvergent fanouts, the PTM method and accurate PGM model can promise exact results, while their computation costs are exponentially high. The PTM approach requires a space complexity of $O(2^{N_{\mathrm{in}} + N_{\mathrm{out}}})$, and the accurate PGM has a computational complexity of $O(N_g \cdot 2^{N_{\mathrm{in}} + N_f})$. Thus, some sampling techniques are usually needed to handle large-scale circuits in these computation frameworks, leading to less accurate results.

(c) Simulation-based methods (such as MC or SCM) can provide results with a high level of accuracy, as long as enough simulation sequences are applied. To achieve a required level of accuracy, the number of simulation runs needs to be determined statistically or empirically. The time complexity can be estimated by $O(N_g \cdot N_{\mathrm{MC}})$ or $O(N_g \cdot N_{\mathrm{SCM}})$, where $N_g$ is again the circuit size and $N_{\mathrm{MC}}$ (or $N_{\mathrm{SCM}}$) represents the number of simulation runs. The SCM is more efficient than MC especially for small gate error rates, as the runtime of the former is approximately scaled by a constant factor.

(d) The observability-based approach is mainly of theoretical interest, since it gives reasonable results only for circuits with extremely low gate error rates. The maximum-$k$ method can be viewed as the combination of CCM-based and simulation-based methods. It shows better performance than the observability-based approach in terms of both accuracy and efficiency for lower gate error rates.

(e) If all reconvergent fanouts within circuits originate from primary inputs, the simple PGM method gives exact results with a computational complexity of $O(N_g \cdot 2^{N_{\mathrm{in}}})$. For circuits consisting of a few logic modules that are repetitively used, the modular PGM method is a good option that can provide good accuracy with short runtime.

From the above discussions, it can be concluded that errors in reliability analysis are mainly due to the reconvergent fanouts (or signal correlation) inherent in many circuits under consideration, in the sense that accurate results can be obtained efficiently for circuits with no or a few reconvergent fanouts. On the other hand, the circuit size (i.e., a large number of primary inputs or a large number of gates or both) is the main contributor to high computational costs for reliability analysis. Therefore, the most challenging problem is to analyze the reliability for large-scale circuits with a lot of reconvergent fanouts. Figure 8 illustrates the expected solution space in general in terms of accuracy and computational cost for different circuit categories. Any existing approach for the reliability analysis corresponds to a specific point in this space, which represents a tradeoff between accuracy and efficiency. For instance, the results from PTM, PGM, or MC fall into the right-upper corner of this figure with expensive computation and a high level of accuracy. An ideal approach should be able to provide results somewhere near the left-upper corner where both accuracy and efficiency can be ensured.

While gate-level reliability analysis methods are well documented, there are some other important issues that remain to be tackled. First of all, most existing methods only deal with the reliability of each individual output and/or the averaged reliability over all outputs. However, the joint reliability for multiple outputs (i.e., the probability that all outputs are error-free simultaneously) is what really matters. This joint reliability could be totally different from any individual output reliability or the averaged output reliability,
depending on the possible correlation among individual output reliabilities. For an extreme case where all individual output reliabilities are independent, the joint reliability will simply be the product of all these reliabilities, which leads to a minimum value. As the correlation of output reliabilities becomes strong, the joint reliability tends to rise. In general, the complexity of computing the joint reliability would be an exponential function of the number of primary outputs. It is still an open question how to estimate the joint reliability for multiple-output circuits in an efficient way. Secondly, most of the current reliability analysis frameworks assume that the reliability for an error-free output being “0” (denoted by $r_0$) is the same as that for an error-free output being “1” (denoted by $r_1$). This is the so-called symmetric reliability model. However, this assumption does not always hold true in the real world. Thus, an asymmetric reliability model (where $r_0 \neq r_1$) would make more sense for better estimation of reliability. This requires further research work that can take the asymmetric model into consideration. Finally, there is also plenty of room for gate-level reliability improvement using reliability-critical gates as well as considering other performance metrics (such as circuit area and delay and power consumption). Unfortunately, to the best of authors’ knowledge, little or limited study has been done so far in this regard.

9. Conclusion

We have reviewed the state-of-the-art methods for reliability analysis and shown their advantages and disadvantages. Some of these methods have been implemented on benchmark circuit examples to compare their performance in terms of accuracy and efficiency. While these methods seem to be effective for some specific cases/circuits, no single one of them stands out as an all-time winner due to the nature and complexity of the reliability analysis problem. Further work has also been suggested for the future research in this area.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] S. Borkar, “Designing reliable systems from unreliable components: The challenges of transistor variability and degradation,” IEEE Micro, vol. 25, no. 6, pp. 10–16, 2005.
[2] J. Han, H. Chen, J. Liang, P. Zhu, Z. Yang, and F. Lombardi, “A stochastic computational approach for accurate and efficient reliability evaluation,” IEEE Transactions on Computers, vol. 63, no. 6, pp. 1336–1350, 2014.
[3] S. Krishnaswamy, S. M. Plaza, I. L. Markov, and J. P. Hayes, “Signature-based SER analysis and design of logic circuits,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 28, no. 1, pp. 74–86, 2009.
[4] C. Chen and Y. Mao, “A statistical reliability model for single-electron threshold logic,” IEEE Transactions on Electron Devices, vol. 55, no. 6, pp. 1547–1553, 2008.
[5] C. Chen, “Reliability-driven gate replication for nanometer-scale digital logic,” IEEE Transactions on Nanotechnology, vol. 6, no. 3, pp. 303–308, 2007.
[6] J. von Neumann, “Probabilistic logics and the synthesis of reliable organisms from unreliable components,” in Automata Studies, C. E. Shannon and J. McCarthy, Eds., pp. 43–98, Princeton University Press, Princeton, NJ, USA, 1956.
[7] J. Han and P. Jonker, “A system architecture solution for unreliable nanoelectronic devices,” IEEE Transactions on Nanotechnology, vol. 1, no. 4, pp. 201–208, 2002.
[8] S. Ercolani, M. Favalli, M. Damiani, P. Olivo, and B. Ricco, “Estimate of signal probability in combinational logic networks,” in Proceedings of the 1st European Test Conference, pp. 132–138, Paris, France, April 1989.
[9] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman, San Francisco, Calif, USA, 1979.
[10] S. Krishnaswamy, G. F. Viamontes, I. L. Markov, and J. P. Hayes, “Accurate reliability evaluation and enhancement via probabilistic transfer matrices,” in Proceedings of the Design, Automation and Test in Europe, vol. 1, pp. 282–287, March 2005.
[11] S. Krishnaswamy, G. F. Viamontes, I. L. Markov, and J. P. Hayes, “Probabilistic transfer matrices in symbolic reliability analysis of logic circuits,” ACM Transactions on Design Automation of Electronic Systems, vol. 13, no. 1, article 8, 2008.
[12] W. Ibrahim, V. Beiu, and M. H. Sulieman, “On the reliability of majority gates full adders,” IEEE Transactions on Nanotechnology, vol. 7, no. 1, pp. 56–67, 2008.
[13] T. Rejimon, K. Lingasubramanian, and S. Bhanja, “Probabilistic error modeling for nano-domain logic circuits,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 17, no. 1, pp. 55–65, 2009.
[14] T. Rejimon and S. Bhanja, “Scalable probabilistic computing models using Bayesian networks,” in Proceedings of the IEEE International 48th Midwest Symposium on Circuits and Systems (MWSCAS ’05), pp. 712–715, August 2005.
[15] J. T. Flaquer, J. M. Daveau, L. Naviner, and P. Roche, “Fast reliability analysis of combinatorial logic circuits using conditional probabilities,” Microelectronics Reliability, vol. 50, no. 9–11, pp. 1215–1218, 2010.
[16] R. I. Bahar, J. Chen, and J. Mundy, “A probabilistic-based design for nanoscale computation,” in Nano, Quantum and Molecular Computing: Implications to High Level Design and Validation, S. Shukla and R. I. Bahar, Eds., chapter 5, Kluwer Academic, Norwell, Mass, USA, 2004.
[17] R. I. Bahar, J. Mundy, and J. Chen, “A probability-based design methodology for nanoscale computation,” in Proceedings of the International Conference on Computer-Aided Design, pp. 480–486, November 2003.
[18] A. R. Kermany, N. H. Hamid, and Z. A. Burhanudin, “A study of MRF-based circuit implementation,” in Proceedings of the International Conference on Electronic Design (ICED ’08), pp. 1–4, December 2008.
[19] D. Bhaduri and S. Shukla, “NANOLAB—a tool for evaluating reliability of defect-tolerant nanoarchitectures,” IEEE Transactions on Nanotechnology, vol. 4, no. 4, pp. 381–394, 2005.
[20] X. Lu, J. Li, and W. Zhang, “On the probabilistic characterization of nano-based circuits,” IEEE Transactions on Nanotechnology, vol. 8, no. 2, pp. 258–259, 2009.
[21] H. Chen and J. Han, “Stochastic computational models for accurate reliability evaluation of logic circuits,” in Proceedings of the 20th Great Lakes Symposium on VLSI (GLSVLSI ’10), pp. 61–66, May 2010.
[22] J. B. Gao, Y. Qi, and J. A. B. Fortes, “Bifurcations and fundamental error bounds for fault-tolerant computations,” IEEE Transactions on Nanotechnology, vol. 4, no. 4, pp. 395–402, 2005.
[23] J. Han, E. Taylor, J. Gao, and J. Fortes, “Faults, error bounds and reliability of nanoelectronic circuits,” in Proceedings of the IEEE 16th International Conference on Application-Specific Systems, Architectures, and Processors (ASAP ’05), pp. 247–253, July 2005.
[24] J. Han, H. Chen, E. Boykin, and J. Fortes, “Reliability evaluation of logic circuits using probabilistic gate models,” Microelectronics Reliability, vol. 51, no. 2, pp. 468–476, 2011.
[25] J. Han, E. R. Boykin, H. Chen, J. H. Liang, and J. A. B. Fortes, “On the reliability of computational structures using majority logic,” IEEE Transactions on Nanotechnology, vol. 10, no. 5, pp. 1009–1022, 2011.
[26] M. R. Choudhury and K. Mohanram, “Reliability analysis of logic circuits,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 28, no. 3, pp. 392–405, 2009.
[27] L. Chen and M. B. Tahoori, “An efficient probability framework for error propagation and correlation estimation,” in Proceedings of the IEEE 18th International On-Line Testing Symposium (IOLTS ’12), pp. 170–175, Sitges, Spain, June 2012.
[28] S. Ercolani, M. Favalli, M. Damiani, P. Olivo, and B. Ricco, “Testability measures in pseudorandom testing,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 11, no. 6, pp. 794–800, 1992.
[29] N. Mohyuddin, E. Pakbaznia, and M. Pedram, “Probabilistic error propagation in logic circuits using the boolean difference calculus,” in Proceedings of the 26th IEEE International Conference on Computer Design (ICCD ’08), pp. 7–13, October 2008.
[30] S. Sivaswamy, K. Bazargan, and M. Riedel, “Estimation and optimization of reliability of noisy digital circuits,” in Proceedings of the 10th International Symposium on Quality Electronic Design (ISQED ’09), pp. 213–219, March 2009.
Hindawi Publishing Corporation
VLSI Design
Volume 2014, Article ID 343960, 6 pages
http://dx.doi.org/10.1155/2014/343960

Research Article
Low-Area Wallace Multiplier

Shahzad Asif and Yinan Kong


Department of Engineering, Macquarie University, Sydney, NSW 2109, Australia

Correspondence should be addressed to Shahzad Asif; [email protected]

Received 18 March 2014; Accepted 23 April 2014; Published 12 May 2014

Academic Editor: Yu-Cheng Fan

Copyright © 2014 S. Asif and Y. Kong. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Multiplication is one of the most commonly used operations in arithmetic. Multipliers based on the Wallace reduction tree provide an area-efficient strategy for high-speed multiplication. A number of modifications have been proposed in the literature to optimize the area of the Wallace multiplier. This paper proposes a reduced-area Wallace multiplier without compromising on the speed of the original Wallace multiplier. Designs are synthesized using Synopsys Design Compiler in 90 nm process technology. Synthesis results show that the proposed multiplier has the lowest area as compared to other tree-based multipliers. The speed of the proposed and reference multipliers is almost the same.

1. Introduction

Multiplication is one of the most widely used arithmetic operations. As a result, a wide range of multiplier architectures are reported in the literature, providing flexible choices for various applications. Among them the simplest is the array multiplier [1], which is also the slowest. Some high performance multipliers are presented in [2–5]. The focus of this paper is the Wallace multiplier [6]. The Wallace multiplier uses full adders and half adders to reduce the partial product tree to two rows, and then a final adder is used to add these two rows of partial products. We call this design “TW (traditional Wallace) multiplier” in this text. The TW multiplier performs its operation in three steps. (1) Generate all the partial products. (2) The partial product tree is reduced using full adders and half adders until it is reduced to two terms. (3) Finally, a fast adder is used to add these two terms.

Waters and Swartzlander [7] presented a reduced complexity Wallace multiplier by reducing the number of half adders in the reduction process. We call this design “RCW (reduced complexity Wallace) multiplier” from now on. The speed of the RCW multiplier is expected to be the same as that of the TW multiplier due to the equal number of reduction stages in both multipliers. The RCW uses a larger final adder as compared to the TW multiplier. A number of strategies are reported in [8–10] to improve the speed of the RCW. However, the focus of their research is to reduce the delay by using a faster final adder while still using the same reduction tree as RCW. As a result, the final adder size for the multipliers in [8–10] is the same as that of RCW.

The focus of this paper is to optimize the reduction tree in a way that can reduce the size of the final adder. The reduced size of the final adder results in a lower area of the multiplier without incurring any extra delay. We call our design “PW (Proposed Wallace) multiplier.” We also consider the Dadda multiplier [11] for comparison due to its similarity with the Wallace multiplier.

This paper makes a contribution to the design of Wallace tree-based multipliers by proposing a strategy to reduce the area of the reduced complexity Wallace (RCW) multiplier. This method allows for an effective utilization of half adders in such a way that the size of the final adder is reduced. It also provides a more regular structure of the reduction tree and the final adder.

The rest of the paper is organized as follows. Section 2 discusses some previous approaches for partial product tree reduction. In Section 3, the proposed Wallace multiplier is presented. In Section 4, the choice of final adder is discussed. Section 5 evaluates the results for all the designs synthesized in Synopsys. The work is concluded in Section 6.
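The three steps of a tree-based multiplier can be sketched in a few lines of Python. This is a simplified, bit-level illustration that compresses each column greedily with full and half adders; it does not reproduce the exact row grouping of the TW, RCW, Dadda, or PW reduction trees, and the function name `tree_multiply` is hypothetical.

```python
def tree_multiply(a, b, n=8):
    """Multiply two n-bit unsigned integers with a column-compression tree."""
    # Step 1: generate all partial-product bits, grouped by column weight.
    cols = [[] for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            cols[i + j].append(((a >> i) & 1) & ((b >> j) & 1))

    # Step 2: compress with full adders (3 bits -> sum, carry) and
    # half adders (2 bits -> sum, carry) until at most two rows remain.
    while max(len(c) for c in cols) > 2:
        nxt = [[] for _ in range(len(cols) + 1)]
        for w, c in enumerate(cols):
            k = 0
            while len(c) - k >= 3:            # full adder
                x, y, z = c[k:k + 3]
                nxt[w].append(x ^ y ^ z)
                nxt[w + 1].append((x & y) | (x & z) | (y & z))
                k += 3
            if len(c) - k == 2:               # half adder
                x, y = c[k:]
                nxt[w].append(x ^ y)
                nxt[w + 1].append(x & y)
            elif len(c) - k == 1:             # single bit passes through
                nxt[w].append(c[k])
        cols = nxt

    # Step 3: the final adder sums the two remaining rows.
    row0 = sum(c[0] << w for w, c in enumerate(cols) if len(c) >= 1)
    row1 = sum(c[1] << w for w, c in enumerate(cols) if len(c) == 2)
    return row0 + row1


print(tree_multiply(13, 11), 13 * 11)
```

Each full or half adder preserves the weighted sum of the bits it consumes, so the two rows left after step 2 always add up to the product.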
Figure 1: Block diagram of tree-based multipliers (partial products generator, reduction tree, final adder).

Figure 2: 8-bit traditional Wallace reduction.

Figure 3: 8-bit reduced complexity Wallace reduction.

Figure 4: 8-bit Dadda reduction.

2. Previous Architectures

This section discusses some previous Wallace tree-based multiplier architectures. The general block diagram of tree-based multipliers is shown in Figure 1.

The dot notation [11] is used to represent the partial product tree in all the architectures discussed in this section, as shown in Figures 2 to 5. The full adders and half adders are represented by boxes around the dot products. The box which encloses three dot products represents a full adder, whereas the box containing only two dot products is used to represent a half adder. The stages are separated by a thick horizontal line.

2.1. Traditional Wallace (TW) Multiplier. In the TW multiplier architecture, the partial product tree is divided into groups [6]. Each stage can have one or more groups as shown in the 8-bit TW reduction process in Figure 2.

The groups in a stage are separated by a thin horizontal line. Each group consists of three rows except the last group, where the number of rows can be less than three. Equation (1) calculates the number of rows in the last group for each stage as

$$\mathrm{LastGroup}_i = r_i \bmod 3. \tag{1}$$

An $N$-bit multiplier has $N$ rows in the first stage. The number of rows in the remaining stages can be calculated by using

$$r_i = 2 \left\lfloor \frac{r_{i-1}}{3} \right\rfloor + (r_{i-1} \bmod 3). \tag{2}$$

Reduction is performed using a full adder or a half adder depending on the number of elements in that particular column of the group. If a column has only one element then that is passed on to the next stage without any reduction. If the last group of a stage contains fewer than three rows then no reduction is performed on that group, as shown in stage 1 of Figure 2.

The size of the final adder for an $N$-bit TW multiplier with $S$ stages can be calculated by

$$\mathrm{FinalAdder}_{\mathrm{TW}} = (2N - 1) - S. \tag{3}$$
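The row-count recurrence and the TW final adder size can be checked with a short Python sketch. It assumes that every complete group of three rows is replaced by two rows and that the rows of an incomplete last group pass through unchanged; the function names are illustrative only.

```python
def wallace_row_counts(n):
    """Rows in the partial product tree at each stage, starting with n rows."""
    rows = [n]
    while rows[-1] > 2:
        r = rows[-1]
        # Each full group of 3 rows becomes 2 rows; leftover rows pass through.
        rows.append(2 * (r // 3) + r % 3)
    return rows


def final_adder_tw(n):
    """Final adder size of an n-bit TW multiplier: (2n - 1) - S."""
    stages = len(wallace_row_counts(n)) - 1
    return (2 * n - 1) - stages


for n in (8, 16, 24, 32, 64):
    rows = wallace_row_counts(n)
    print(n, rows, "stages:", len(rows) - 1, "final adder:", final_adder_tw(n))
```

For the 8-bit case this gives the row sequence 8, 6, 4, 3, 2, that is, four reduction stages and an 11-bit final adder.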
2.2. Reduced Complexity Wallace (RCW) Multiplier. Waters and Swartzlander [7] presented a modification in the TW multiplier to reduce the complexity of the reduction tree. An 8-bit RCW reduction process is shown in Figure 3.

The partial products are readjusted in a reverse pyramid style, which makes it easy to analyse the tree for efficient reduction. The number of stages for the RCW multiplier remains the same as that of the TW multiplier. RCW tries to reduce the partial product tree using only full adders. Half adders are used only where they are necessary to satisfy the number of rows in a stage according to (2). This approach allows the RCW multiplier to reduce the area of the reduction process. However, the RCW multiplier uses a much larger final adder as compared to the TW multiplier. The size of the final adder for an $N$-bit RCW multiplier can be computed by

$$\mathrm{FinalAdder}_{\mathrm{RCW}} = 2N - 2. \tag{4}$$

2.3. Dadda Multiplier. The Dadda multiplier [11] tries to reduce the number of full adders and half adders by performing the reduction only where it is essential to satisfy (2). An 8-bit Dadda reduction process is shown in Figure 4.

Dadda has the same number of stages as TW and RCW. We can see from Figure 4 that Dadda performs reduction only on four columns in stage 1. This is because the other columns already satisfy (2). The same approach is used in all stages to reduce the tree until we achieve a tree of only two rows. The size of the final adder is the same for the Dadda and RCW multipliers, as computed in (4).

3. Proposed Wallace (PW) Multiplier

In this section, we propose a modification in the RCW multiplier to further reduce its area by reducing the size of the final adder. The PW multiplier has the same number of stages and the same rule for the maximum number of rows in a stage as the other multipliers discussed in this paper.

An 8-bit PW reduction process is shown in Figure 5. PW uses an additional half adder in each stage in order to reduce the size of the final adder. The algorithm scans from the right side and starts the reduction by using a half adder when it finds the first column where the number of elements is greater than one. The additional half adders are shown in solid boxes at each stage of the PW multiplier in Figure 5.

Figure 5: 8-bit proposed Wallace reduction.

Compared with RCW in Figure 3, the introduction of half adders in Figure 5 makes the final adder in PW “less wide”, namely, smaller. This is because, in each stage, the half adder that we introduced computes the final product bit for that particular column of the partial product tree. Therefore, the size of the required final adder is decreased by one in each stage. The least significant bit (LSB) of the product, $P_0$, is produced by the partial product generation block by computing $A_0 \times B_0$. In the first stage of the reduction process, product bit $P_1$ is computed by using the additional half adder. In the second stage, $P_2$ is computed. Similarly, stage 3 and stage 4 compute the product bits $P_3$ and $P_4$, respectively. Thus, when the partial product tree is reduced to two rows, the five LSBs ($P_4$ down to $P_0$) of the product are already computed as shown in Figure 5. Therefore, the size of the final adder in 8-bit PW is reduced by four as compared to the final adder in the RCW multiplier. The size of the final adder for an $N$-bit PW multiplier with $S$ stages can be computed by

$$\mathrm{FinalAdder}_{\mathrm{PW}} = (2N - 2) - S. \tag{5}$$

The comparison of (4) and (5) shows a reduction of $S$ in the size of the final adder from RCW to PW. This is achieved at the expense of an increased area for the reduction process due to the insertion of additional half adders in the PW. However, the effect of the additional half adders is very small as compared to the area saved by reducing the final adder size. Therefore, the overall area of the PW is less than that of the RCW.

The sizes of the final adders and their required logic levels for different multipliers are given in Table 1. The PW has the smallest final adder as compared to all other multipliers. All the multipliers need the same number of logic levels to implement the final adder, which means that all the multipliers will have almost the same delay. The architecture of the final adder is discussed in Section 4.

Table 1: Final adder size for different multipliers.

N     Final adder size               Logic levels
      TW     RCW    Dadda  PW
8     11     14     14     10        4
16    25     30     30     24        5
24    40     46     46     39        6
32    55     62     62     54        6
64    117    126    126    116       7
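The entries of Table 1 follow directly from (3), (4), and (5) together with the Kogge-Stone depth in (6). The Python sketch below recomputes them under the assumption that the number of stages $S$ is obtained by repeatedly replacing each full group of three rows with two rows; the helper names are hypothetical.

```python
from math import ceil, log2


def reduction_stages(n):
    """Number of reduction stages S needed to go from n rows down to 2 rows."""
    stages, rows = 0, n
    while rows > 2:
        rows = 2 * (rows // 3) + rows % 3
        stages += 1
    return stages


def final_adder_sizes(n):
    """Final adder widths for an n-bit multiplier, per equations (3)-(5)."""
    s = reduction_stages(n)
    return {
        "TW": (2 * n - 1) - s,
        "RCW": 2 * n - 2,
        "Dadda": 2 * n - 2,
        "PW": (2 * n - 2) - s,
    }


def logic_levels(width):
    """Logic levels of a Kogge-Stone adder of the given width, per (6)."""
    return ceil(log2(width))


for n in (8, 16, 24, 32, 64):
    sizes = final_adder_sizes(n)
    levels = {name: logic_levels(w) for name, w in sizes.items()}
    print(n, sizes, levels)
```

For every multiplier size in Table 1, the four adder widths fall between the same two powers of two, which is why all designs need the same number of logic levels even though the PW adder is narrower.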
4. Final Adder Design

The third step of the Wallace tree-based multipliers is to add the remaining two rows using a fast adder. Some of the most widely used parallel-prefix adders for high speed operations are Kogge-Stone [12], Sklansky [13], and Brent-Kung [14]. These adders use the same tree topology but differ in terms of logic levels, fanout, and interconnect wires. We used the Kogge-Stone adder in all the multipliers discussed in this paper. The logic levels for implementation of an $N$-bit Kogge-Stone adder can be calculated by using

$$\mathrm{LogicLevels} = \lceil \log_2 (N) \rceil. \tag{6}$$

5. Results

In this section, we discuss the verification of designs for correct operation, the synthesis tool, and the results.

5.1. Functional Verification. The multipliers are implemented in VHDL with test programs to verify the designs. All the possible input combinations are applied to thoroughly test the 8-bit multipliers. Since exhaustive testing of bigger multipliers was not practical, they are tested with random inputs. Galois-type linear feedback shift registers (LFSRs) are designed to generate a pseudorandom binary sequence (PRBS) of maximum cycle for the multipliers under test [15]. All the designs are compiled and simulated using Synopsys VCS.

5.2. Synthesis Tool. All the multipliers are synthesized in Synopsys Design Compiler (DC) using 90 nm technology. The designs can be optimized for delay, power, and area by setting the appropriate options in the DC. The designer has the option of setting the various synthesis parameters such as fanout, wire load models, interconnect strategy, and PVT (process, voltage, and temperature).

The scripts are written to synthesize the TW, RCW, Dadda, and PW multipliers for optimized area. In order to have a fair comparison, the same synthesis parameters are specified for all the designs. Table 2 shows the parameters from the SAED 90 nm library used for synthesis.

Table 2: Synthesis parameters for Synopsys Design Vision.

Technology            90 nm CMOS
Supply voltage        1.2 V
Temperature           25 °C
Process model         Typical
Interconnect model    Balanced tree

5.3. Synthesis Results. The detailed synthesis reports are generated by Design Compiler for area and timing. The area report includes the number of cells, the area used by cells, and the interconnect area. The timing report shows the complete critical path along with the delay associated with each cell in the path. Table 3 shows the synthesis results for delay, area, and power for different multipliers.

Figure 6: Area of different multipliers on Synopsys 90 nm technology (normalized area versus size of multiplier for the TW, RCW, Dadda, and PW multipliers).

Figure 6 shows the normalized area for each multiplier. The area of all multipliers is normalized with respect to the TW multiplier by using

$$\mathrm{NormValue}_{\mathrm{MultXX}} = \frac{\mathrm{OriginalValue}_{\mathrm{MultXX}}}{\mathrm{OriginalValue}_{\mathrm{TWMult}}}. \tag{7}$$

It is clear from Figure 6 that the TW has the largest area, as expected. The areas of RCW and Dadda are almost the same, which also conforms to the results of [7]. PW has the lowest area for all the multiplier configurations. The reduction in area is more prominent when the size of the multiplier is small. As the size of the multiplier increases, the area of PW tends to asymptotically approach the area of the RCW and Dadda multipliers. Therefore, the PW is particularly useful when the final adder has a significant area in the multiplier, while the area advantage might decrease in larger multipliers.

Figure 7 shows the normalized delay for each multiplier. The delays are normalized according to (7). All the multipliers use the same number of reduction stages and the same logic levels in the final adder. Therefore their delays are expected to be the same. However, the Design Compiler uses different cells to optimize the area in each design due to their different architectures and different final adder sizes. This can result in larger delays for the designs where the synthesizer can optimize the area by using relatively slower standard cells. It can be seen in Figure 7 that the PW has the least delay in the 24-bit multiplier. In the rest of the multiplier sizes, PW has almost the same delay as RCW, which is less than that of Dadda and TW. One exception to this is the 16-bit multiplier, where PW has a larger delay than the RCW multiplier.

Figure 8 shows the normalized power consumption for each multiplier. The power consumption of all multipliers is normalized with respect to the TW multiplier by using (7).
VLSI Design 5

Table 3: Synthesis results from Synopsys DC on 90 nm technology.

        Delay (ns)                       Area (𝜇m2)                           Power (mW)
Size    TW      RCW     Dadda   PW       TW       RCW      Dadda    PW        TW       RCW      Dadda    PW
8       2.81    2.64    2.64    2.66     3392     3346     3262     3148      1.92     2.04     1.94     1.96
16      4.13    3.65    3.80    3.75     14847    14372    14242    13876     11.09    11.41    11.22    11.20
24      4.64    4.58    4.62    4.44     34337    33479    33323    32352     27.69    28.24    28.32    28.04
32      14.83   14.62   14.82   14.62    61526    59271    59086    58375     51.85    52.74    53.11    52.48
64      22.88   21.57   21.98   21.62    246842   238843   238597   237553    216.50   219.17   222.64   218.92
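The normalization behind Figures 7 and 8 can be reproduced from the Table 3 values. The sketch below is illustrative only: equation (7) is not reproduced in this section, so the code assumes that normalization simply divides each value by the corresponding TW value. The names `DELAY_NS`, `POWER_MW`, `normalize_to_tw`, and `smallest` are hypothetical helpers, not part of the authors' scripts.

```python
# Delay (ns) and power (mW) per multiplier size, copied from Table 3.
DELAY_NS = {
    8:  {"TW": 2.81,  "RCW": 2.64,  "Dadda": 2.64,  "PW": 2.66},
    16: {"TW": 4.13,  "RCW": 3.65,  "Dadda": 3.80,  "PW": 3.75},
    24: {"TW": 4.64,  "RCW": 4.58,  "Dadda": 4.62,  "PW": 4.44},
    32: {"TW": 14.83, "RCW": 14.62, "Dadda": 14.82, "PW": 14.62},
    64: {"TW": 22.88, "RCW": 21.57, "Dadda": 21.98, "PW": 21.62},
}
POWER_MW = {
    8:  {"TW": 1.92,   "RCW": 2.04,   "Dadda": 1.94,   "PW": 1.96},
    16: {"TW": 11.09,  "RCW": 11.41,  "Dadda": 11.22,  "PW": 11.20},
    24: {"TW": 27.69,  "RCW": 28.24,  "Dadda": 28.32,  "PW": 28.04},
    32: {"TW": 51.85,  "RCW": 52.74,  "Dadda": 53.11,  "PW": 52.48},
    64: {"TW": 216.50, "RCW": 219.17, "Dadda": 222.64, "PW": 218.92},
}


def normalize_to_tw(row):
    """Divide every entry by the TW entry (assumed form of the normalization)."""
    reference = row["TW"]
    return {name: value / reference for name, value in row.items()}


def smallest(row):
    """Return the name of the multiplier with the smallest value in a row."""
    return min(row, key=row.get)


if __name__ == "__main__":
    for size in DELAY_NS:
        delay = normalize_to_tw(DELAY_NS[size])
        power = normalize_to_tw(POWER_MW[size])
        print(size, {k: round(v, 3) for k, v in delay.items()},
              {k: round(v, 3) for k, v in power.items()})
```

Under this assumption every TW entry normalizes to 1, so values below 1 indicate an improvement over the TW multiplier and values above 1 indicate a penalty.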

Figure 7: Delay of different multipliers on Synopsys 90 nm technology.
[Plot: normalized delay versus size of multiplier (8, 16, 24, 32, 64) for the TW, RCW, Dadda, and PW multipliers.]

Figure 8: Power consumption of different multipliers on Synopsys 90 nm technology.
[Plot: normalized power versus size of multiplier (8, 16, 24, 32, 64) for the TW, RCW, Dadda, and PW multipliers.]

It is clear from Figure 8 that the TW multiplier has the lowest power consumption as compared to the other multipliers. One reason for this could be that the Design Compiler was able to find low-power cells to synthesize the TW multiplier. The regular structure of the TW multiplier could also be a reason for its low power consumption. The power consumption of PW is less than that of the RCW multiplier due to the smaller final adder used in the PW multiplier. It can be noted that the difference in power consumption of PW and RCW is very small for large multipliers, as expected, due to the small difference in their area.

6. Conclusion and Future Work

This paper presents a method to reduce the area of the Wallace multiplier. The proposed architecture, named the PW (proposed Wallace) multiplier, uses a smaller final adder to reduce the area of a multiplier. The designs are synthesized in Synopsys Design Compiler using 90 nm process technology. The synthesis results verify that the PW multiplier, as expected, has the smallest area as compared to the other Wallace based multipliers. The speed of the PW multiplier is almost the same as that of the other multipliers.
As our future work, we plan to implement the designs using Synopsys IC Compiler to analyze the postlayout results for area and delay. Synopsys PrimeTime can be used to analyze the multipliers for their power consumption.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References
[1] N. H. E. Weste and D. M. Harris, Integrated Circuit Design, Pearson, 2010.
[2] J.-Y. Kang and J.-L. Gaudiot, “A fast and well-structured multiplier,” in Proceedings of the EUROMICRO Systems on Digital System Design (DSD ’04), pp. 508–515, September 2004.
[3] C. R. Baugh and B. A. Wooley, “A twos complement parallel array multiplication algorithm,” IEEE Transactions on Computers, vol. C-22, no. 12, pp. 1045–1047, 1973.
[4] S.-R. Kuang, J.-P. Wang, and C.-Y. Guo, “Modified booth multipliers with a regular partial product array,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 56, no. 5, pp. 404–408, 2009.
[5] B. C. Paul, S. Fujita, and M. Okajima, “ROM-based logic (RBL) design: a low-power 16 bit multiplier,” IEEE Journal of Solid-State Circuits, vol. 44, no. 11, pp. 2935–2942, 2009.
[6] C. S. Wallace, “A suggestion for a fast multiplier,” IEEE Transac-
tions on Electronic Computers, vol. EC-13, no. 1, pp. 14–17, 1964.
[7] R. S. Waters and E. E. Swartzlander, “A reduced complexity
wallace multiplier reduction,” IEEE Transactions on Computers,
vol. 59, no. 8, pp. 1134–1137, 2010.
[8] S. Rajaram and K. Vanithamani, “Improvement of Wallace
multipliers using parallel prefix adders,” in Proceedings of the
International Conference on Signal Processing, Communication,
Computing and Networking Technologies (ICSCCN ’11), pp. 781–
784, July 2011.
[9] P. Jagadeesh, S. Ravi, and K. H. Mallikarjun, “Design of high
performance 64 bit mac unit,” in Proceedings of the International
Conference on Circuits, Power and Computing Technologies
(ICCPCT ’13), pp. 782–786, March 2013.
[10] M. Kumaran and M. Kamarajan, “Multicore embedded system
using parallel processing technique,” International Journal of
Emerging Trends in Electrical and Electronics, vol. 5, no. 3, 2013.
[11] L. Dadda, “Some schemes for parallel multipliers,” Alta Fre-
quenza, vol. 34, pp. 349–356, 1965.
[12] P. M. Kogge and H. S. Stone, “A parallel algorithm for the
efficient solution of a general class of recurrence equations,”
IEEE Transactions on Computers, vol. C-22, no. 8, pp. 786–793,
1973.
[13] J. Sklansky, “Conditional-sum addition logic,” IRE Transactions
on Electronic Computers, vol. EC-9, pp. 226–231, 1960.
[14] R. P. Brent and H. T. Kung, “A regular layout for parallel adders,”
IEEE Transactions on Computers, vol. C-31, no. 3, pp. 260–264,
1982.
[15] R. Ward and T. Molteno, “Table of linear feedback shift regis-
ters,” Datasheet, Department of Physics, University of Otago,
2007.
Hindawi Publishing Corporation
VLSI Design
Volume 2014, Article ID 652187, 11 pages
http://dx.doi.org/10.1155/2014/652187

Research Article
Efficient Hardware Trojan Detection with Differential Cascade
Voltage Switch Logic

Wafi Danesh, Jaya Dofe, and Qiaoyan Yu


Department of Electrical and Computer Engineering, University of New Hampshire, Durham, NH 03824, USA

Correspondence should be addressed to Qiaoyan Yu; [email protected]

Received 26 February 2014; Accepted 7 April 2014; Published 11 May 2014

Academic Editor: Chih-Cheng Lu

Copyright © 2014 Wafi Danesh et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Offshore fabrication, assembly, and packaging challenge chip security, as original chip designs may be tampered with through malicious insertions, known as hardware Trojans (HTs). HT detection is imperative to guarantee chip performance and safety. Existing HT detection methods have limited capability to detect small-scale HTs and are further challenged by increasing process variation. To increase HT detection sensitivity and reduce chip authorization time, we propose to exploit an inherent feature of differential cascade voltage switch logic (DCVSL) to detect HTs at runtime. In normal operation, a system implemented with DCVSL always produces complementary logic values in internal nets and final outputs. Noncomplementary values on inputs and internal nets in DCVSL systems potentially result in abnormal power behavior and even system failures. By examining special power characteristics of DCVSL systems upon HT insertion, we can detect HTs, even if the HT size is small. Simulation results show that the proposed method achieves up to 100% HT detection rate. The evaluation on ISCAS benchmark circuits shows that the proposed method obtains a HT detection rate in the range of 66% to 98%.

1. Introduction

The growing number of ICs manufactured offshore increases the threats to chip security [1–3]. Research has exposed an increase in the existence of hardware Trojans (HTs), which are malicious additions or modifications to the circuit design that alter the original function. Malicious inclusions of hardware have the potential to degrade system performance, surreptitiously delete data, leave a backdoor for secret key leaking, or eventually destroy the chip [4, 5]. It is imperative to detect HTs.
HTs can be detected by destructive approaches such as the chemical mechanical polishing (CMP) method. The CMP approach detects HTs by analyzing pictures of the demetalized chips under an electron microscope [6]. In addition to being expensive, this type of technique is also time consuming (it takes several months) and loses its efficiency when the transistor density increases. Nondestructive HT detection methods are broadly classified into two categories: logic testing and side-channel analysis (SCA) approaches [6]. Automatic test pattern generation (ATPG) approaches examine whether the measured outputs match the expected ones for given inputs and work well for a functional unit with a small set of inputs, as the probability of rare events is relatively high. When the circuit complexity increases, the number of test vectors for ATPG will significantly increase to an unaffordable degree. The benefits relative to the testing efforts of ATPG become worse if the input nodes for the HT’s trigger circuit are spread out throughout the system. The main challenge with logic testing approaches [7, 8] is the generation of stimulus for sequential HTs. The voltage inversion technique alternates supply voltage and ground grids in CMOS-based functional blocks to change the original logic function and thus increases the HT trigger probability [9]. Dummy flip-flops are inserted into the design to increase the transition probability of particular paths and reduce Trojan activation time [10]. Alkabani [11] introduces the concept of creating dual circuits for a given design. By testing the dual with a few random input vectors, a HT inserted in the original design can be detected.
SCA approaches examine the anomalous behavior (resulting from HTs) in system parameters such as transient current, power, and path delay [12–15]. A multiple-parameter

side-channel analysis method and a platform are developed to reliably test, analyze, and detect a wide range of HTs for both combinational and sequential designs [14]. Recently, HT detection approaches rely on multiple-parameter side-channel analysis technique, which can be integrated with statistical logic testing in order to improve the detection of HTs with very small design area [16]. SCA-based methods [17, 18] achieve a high coverage and are effective for finding HTs that span a large area of a system. However, the sensitivity of SCA-based methods is challenged by the increasing process variation [16, 19]. False detection on small HTs can happen when process variation effects exceed the signal threshold (e.g., power) for side-channel analysis.

Figure 1: Proposed HT detection system.
[Block diagram labels: Input, DUT, Out, Out bar, Current monitor, Error; inset plot of power versus time.]

To address the challenge from process variation on HT detection, the region-based approach [4] magnifies the region potentially affected by HTs and forces the remaining regions to be inactive. Postsilicon spatial thermal and power maps are simultaneously utilized in a multimodal characterization procedure to improve the HT detection sensitivity [13]. A unified framework combines different HT detection methods in a systematic analysis platform, which studies the impact of small HTs [20].
In this work, we propose a method to remove the need of a golden design for comparison and detect small HTs at runtime. The difference with other side-channel analysis approaches is that our method focuses on enhancing the side-channel signals by using a logic family’s inherent characteristics. We exploit the special characteristic of differential cascade voltage switch logic (DCVSL) to detect HTs. Trojan detection using DCVSL can be performed using the constant, abnormal power consumption peaks, or erroneous outputs. The method is inexpensive as there is no extra hardware overhead required in order to implement the HT detection platform. Simulation results show that the power consumption of a DCVSL system with a HT triggered is constantly three orders of magnitude higher than that of the system with inactive HTs. This unique, abnormal power consumption phenomenon complements existing power-based side-channel analysis methods.
The remainder of this work is organized as follows. In Section 2, we highlight the basis for abnormal power consumption in DCVSL and introduce the proposed HT detection method. In Section 3, we thoroughly evaluate the area, power, and HT detection rate of our method in full adders and ISCAS benchmark circuits. Conclusions and future work are provided in Section 4.

2. Proposed DCVSL-Based HT Detection Method

2.1. Method Overview. HT detection is typically carried out during test stages, when numerous test vectors are simultaneously applied to both the device under test (DUT) and a golden reference. As it is difficult to obtain a golden version, the behavioral model is often used as a reference. Because of the effects of process variation and imperfect device libraries for computer-aided design tools, a behavioral model based golden version is not precise enough to detect small and ultrasmall HTs. Moreover, due to the demand of short time-to-market, the verification and testing period has been reduced significantly. Although it is always desired, thorough testing is not economically feasible. It is imperative to develop a HT detection method that is not limited by the HT size and does not take a very long time to perform chip testing and authorization.
We propose a HT detection method that allows users to detect potential HTs at runtime and without a golden reference. The proposed method exploits the abnormal instantaneous power of the DUT to detect small HTs. Figure 1 shows the overview of the proposed method. Given a stable supply voltage, we examine the current through the current monitor for abnormal power behavior. A notable difference with offline power-based side-channel analysis methods is that we are not interested in a particular power value; instead, the current monitor detects the current (which we can interpret as power consumption) staying at a constant high value for a relatively long duration. As shown in Figure 1, the current monitor will trigger an alarm circuit when the power value falls in and remains in the blue shadow region for a relatively long period. This duration is comparable with the duration of an input vector, rather than with input rise and fall times. The triggered HT in the DUT causes the abnormal power period. We propose to implement DUTs with DCVSL, which always produces Out and Out bar, a pair of complementary outputs. Such complementary outputs will be used as inputs for the next stage. In DCVSL, noncomplementary inputs (invalid inputs) result in short-circuit power remaining for a long period of time until the noncomplementary inputs disappear. The proposed method exploits this inherent feature of DCVSL to detect the presence of HTs.
Besides power detection, the proposed method further examines the complementary characteristic of the output pair, Out and Out bar. The noncomplementary output pair indicates a potential hardware Trojan insertion in the DUT. These noncomplementary outputs can be utilized for HT detection when no abnormal power values appear due to the HT being triggered.
The current monitor is connected with the DUT on a separate platform at the user end. If the current monitor is integrated on the same chip with the DUT, this potentially leaves an opportunity for an attacker to tamper with or remove

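Figure 2: DCVSL logic gates. (a) General gate structure and (b) circuit schematic of NAND3-AND3. Current track highlighted in the figure is for noncomplementary inputs on A and A bar.
[Schematic labels: (a) VDD, cross-coupled PMOS pair, outputs Out and Out bar, pull-down networks PDN1 and PDN2 driven by the inputs and their complements; (b) VDD, PMOS transistors P0 and P1, outputs NAND Out and AND Out, series NMOS transistors N0, N1, and N2 driven by A, B, and C, and parallel NMOS transistors N3, N4, and N5 driven by A bar, B bar, and C bar.]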

the HT detection mechanism. A current sensor is needed to convert the transient current of the DUT and produce an analog voltage that is proportional to the measured DUT current. A programmed microcontroller can sample the analog voltage signal at specific intervals using interrupts. When the voltage value stays approximately constant for multiple interrupts, it indicates an abnormal short-circuit power due to a HT creating a short-circuit path from supply voltage VDD to ground. The microcontroller can be further configured to set off an alarm or trigger a light-emitting diode to indicate HT detection to the user.

2.2. Short-Circuit Power-Based HT Detection

2.2.1. Unique Short-Circuit Power in DCVSL. Each DCVSL gate needs complementary inputs and produces complementary outputs [21], as shown in Figure 2(a). In normal operation, the short-circuit power consumption of a DCVSL gate is close to that of a CMOS logic gate, as the time period for the direct current path from VDD to ground is extremely short compared with that in switching and steady state conditions. When the input pair is noncomplementary (both inputs being either logic 0 or logic 1), a DCVSL gate loses its complementary nature. More specifically, the output pair may be noncomplementary, resulting in the short-circuit power consumption lasting for a significantly longer time than in the case with complementary inputs.
Take a 3-input NAND-AND gate as an example. The circuit schematic is shown in Figure 2(b). In normal operation conditions, we give the input vector of A = B = C = 1 and A bar = B bar = C bar = 0. The NAND Out port is pulled down to logic low through NMOS transistors N0, N1, and N2; this in turn activates PMOS transistor P1. As P1 is turned on, the AND Out node is pulled to logic high and thus P0 is turned off. The time period when both PMOS and NMOS transistors are on is extremely short. Let us reconsider the 3-input NAND-AND gate with the same input vector, except that we make A = A bar = 1. Now, there exist two paths from VDD to the ground terminal: one is through N0, N1, and N2 and another one is through N3. The path through N0, N1, and N2 pulls the NAND Out port low as before, which turns on P1. P1 then tries to pull the AND Out port high. At the same time, the path to ground through N3 tries to pull the AND Out node low. If N3 is stronger than P1 (which is typically the case), the AND Out port is pulled low and this activates P0. Therefore, a path from VDD to ground is created through P0, N0, N1, and N2, resulting in a high and constant short-circuit power. The constant short-circuit power remains for the duration of the input vector.
Figures 3(a) and 3(b) show the power waveforms with complementary and noncomplementary inputs, respectively. As shown in Figure 3(b), in the duration of the input vector A = A bar = B = C = 1 (from 7 to 8 𝜇s on the time axis), the peak power has a constant high value. This is because the noncomplementary input pair (A = A bar) makes NAND Out and AND Out both stay at logic low. The time from 7 to 8 𝜇s represents the high time of the shortest input pulse A. As a result, the two PMOS transistors, P0 and P1, are both turned on; thus, the two current paths from VDD to ground (highlighted in Figure 2(b)) exist until the input vector is changed. The amplitude of the short-circuit power is typically three orders of magnitude higher than the leakage power. This significant power difference between the cases using complementary and noncomplementary inputs is large enough for a monitoring device to indicate the presence of a HT.
We examine the average power for complementary and noncomplementary inputs for basic DCVSL gates using a typical IBM7RF technology library. As shown in Table 1, the increase in the average power (averaging power for all possible input patterns) caused by noncomplementary inputs is over three orders of magnitude. This is the basis for choosing DCVSL to implement functional units that facilitate HT detection. If the triggered HT flips the internal node of a functional unit, it will create a noncomplementary signal in the middle of that functional unit. Consequently, the power consumption will stay high for a long time, which is different from normal switching power.
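The "over three orders of magnitude" claim can be checked directly against the Table 1 entries. The following is a small illustrative calculation, not part of the original evaluation flow; the dictionary `TABLE1_WATTS` and the helper `orders_of_magnitude` are hypothetical names, and the numbers are the average power values listed in Table 1 converted to watts.

```python
import math

# (power with complementary inputs, power with noncomplementary inputs),
# both in watts, taken from Table 1.
TABLE1_WATTS = {
    "Inverter":    (20.51e-9, 205.25e-6),
    "NAND2-AND2":  (12.76e-9, 92.84e-6),
    "NOR2-OR2":    (11.85e-9, 92.61e-6),
    "XNOR2-XOR2":  (19.97e-9, 171.2e-6),
    "NAND3-AND3":  (7.501e-9, 40.36e-6),
    "NOR3-OR3":    (6.673e-9, 39.81e-6),
    "XNOR3-XOR3":  (16.49e-9, 84.90e-6),
    "D-Flip-Flop": (17.23e-9, 181.8e-6),
}


def orders_of_magnitude(complementary_w, noncomplementary_w):
    """Base-10 logarithm of the power increase ratio."""
    return math.log10(noncomplementary_w / complementary_w)


if __name__ == "__main__":
    for gate, (p_comp, p_noncomp) in TABLE1_WATTS.items():
        print(f"{gate:12s} ratio = {p_noncomp / p_comp:8.0f}x "
              f"({orders_of_magnitude(p_comp, p_noncomp):.2f} orders)")
```

For every gate in Table 1 the ratio exceeds 1000, which is the margin a monitoring device relies on to separate the HT-induced short-circuit power from normal operation.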

Figure 3: Voltage and power waveforms for DCVSL NAND3-AND3 gate. (a) Complementary inputs and (b) noncomplementary input A = A bar.
[Waveform panels: AND Out voltage (V), NAND Out voltage (V), and total power (𝜇W) versus time (0 to 8 𝜇s); panel (b) is annotated with the constant short-circuit power.]
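The waveform in Figure 3(b) suggests a simple decision rule of the kind that Section 2.1 assigns to the microcontroller: raise an alarm when the sampled sensor voltage stays high and approximately constant for several consecutive samples. The sketch below is one possible way to express that rule, not the authors' firmware; the function name `detect_sustained_high` and the parameters `high_threshold`, `tolerance`, and `min_consecutive` are assumptions introduced for illustration, and the sample values in the usage example are made up.

```python
def detect_sustained_high(samples, high_threshold, tolerance, min_consecutive):
    """Return the index of the sample that raises the alarm, or None.

    The alarm fires once `min_consecutive` successive samples are all at or
    above `high_threshold` and within `tolerance` of the first sample of the
    run. Short switching spikes reset the run and are ignored.
    """
    run_length = 0
    run_reference = None
    for index, value in enumerate(samples):
        if value < high_threshold:
            # Back to normal (leakage-level) reading: forget the current run.
            run_length = 0
            run_reference = None
        elif run_reference is not None and abs(value - run_reference) <= tolerance:
            # Still high and approximately constant: extend the run.
            run_length += 1
        else:
            # High, but either the first high sample or a different level:
            # start a new run from this sample.
            run_reference = value
            run_length = 1
        if run_length >= min_consecutive:
            return index
    return None


if __name__ == "__main__":
    # Hypothetical sensor readings: one switching spike, then a plateau.
    readings = [0.0, 0.9, 0.0, 0.0, 0.95, 0.96, 0.95, 0.94, 0.0]
    print(detect_sustained_high(readings, 0.5, 0.05, 4))
```

Requiring several consecutive samples reflects the observation that the abnormal power lasts for a duration comparable to an input vector, whereas normal switching peaks are much shorter.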

Table 1: Power increase caused by noncomplementary inputs.

               Average power
Logic gates    Power for complementary inputs    Power for noncomplementary inputs
Inverter       20.51 nW                          205.25 𝜇W
NAND2-AND2     12.76 nW                          92.84 𝜇W
NOR2-OR2       11.85 nW                          92.61 𝜇W
XNOR2-XOR2     19.97 nW                          171.2 𝜇W
NAND3-AND3     7.501 nW                          40.36 𝜇W
NOR3-OR3       6.673 nW                          39.81 𝜇W
XNOR3-XOR3     16.49 nW                          84.90 𝜇W
D-Flip-Flop    17.23 nW                          181.8 𝜇W

Figure 4: Flowchart for analyzing the HT detection probability for a DCVSL gate.
[Flowchart steps, in order: (1) obtain the transistor level circuit diagram; (2) make each input noncomplementary and group all possible combinations of noncomplementary inputs; (3) apply all possible input patterns and find all paths from VDD to ground; (4) let number of paths from VDD to ground = number of abnormal peaks; (5) define HT detection probability = number of abnormal peaks/total number of input patterns.]

2.2.2. Probability of Abnormal Short-Circuit Power. The key reason for a DCVSL gate having abnormal short-circuit power is the noncomplementary output nodes turning on the two PMOS transistors simultaneously. We assume that the consequence of a HT insertion on DCVSL functional units is flipping one of the complementary inputs. This is similar to HT insertion in other technologies; that is, a triggered HT is used to change the logic value of a logic gate or memory element.
Because of electrical and logical masking, the noncomplementary inputs (caused by HTs) do not always yield abnormal short-circuit power. As the logic gate topology varies between gates, it is difficult to obtain a closed-form expression for the probability of abnormal power occurrence. We summarize the general procedure for how to analyze the HT detection probability in DCVSL systems through abnormal power observation. Figure 4 is the flowchart for the analysis procedure.
In order to create an erroneous output in DCVSL, a HT has to make one or more of the inputs noncomplementary. This may result in an erroneous output if the effect of the noncomplementary input is propagated and reaches the output port. An important point to note is that not all erroneous outputs are accompanied by abnormal power peaks. Only if the erroneous output creates at least one path from VDD to ground will we observe the abnormal short-circuit power.
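The Figure 4 procedure can be approximated at the logic level. The sketch below rests on a simplifying assumption that is not stated in this form in the paper: a sustained path from VDD to ground is counted whenever both pull-down networks of a DCVSL gate conduct at the same time (both outputs are pulled low, so both PMOS transistors turn on), and transistor strengths and electrical masking are ignored. The function `abnormal_power_probability` and the `GATES` dictionary are hypothetical names. The probability is taken over all input patterns in which at least one input pair is noncomplementary.

```python
from itertools import product


def abnormal_power_probability(n_inputs, pdn1, pdn2):
    """Fraction of noncomplementary input patterns in which both pull-down
    networks conduct, used here as a proxy for abnormal short-circuit power.

    pdn1 and pdn2 take (x, xb): tuples of true rails and complement rails.
    """
    abnormal = 0
    total = 0
    for rails in product((0, 1), repeat=2 * n_inputs):
        x, xb = rails[:n_inputs], rails[n_inputs:]
        if all(t != c for t, c in zip(x, xb)):
            continue  # every pair is complementary: a valid input, skip it
        total += 1
        if pdn1(x, xb) and pdn2(x, xb):
            abnormal += 1
    return abnormal / total


# Each entry: (number of inputs, pull-down network 1, pull-down network 2).
GATES = {
    "Inverter": (1, lambda x, xb: x[0], lambda x, xb: xb[0]),
    "AND2": (2, lambda x, xb: all(x), lambda x, xb: any(xb)),
    "AND3": (3, lambda x, xb: all(x), lambda x, xb: any(xb)),
    "OR2": (2, lambda x, xb: any(x), lambda x, xb: all(xb)),
    "OR3": (3, lambda x, xb: any(x), lambda x, xb: all(xb)),
    "XOR2": (
        2,
        lambda x, xb: (x[0] and xb[1]) or (xb[0] and x[1]),
        lambda x, xb: (x[0] and x[1]) or (xb[0] and xb[1]),
    ),
}

if __name__ == "__main__":
    for name, (n, pdn1, pdn2) in GATES.items():
        print(f"{name:8s} {100 * abnormal_power_probability(n, pdn1, pdn2):.2f}%")
```

Under this assumption the sketch yields 50% for the inverter, 25% for AND2 and OR2, 12.5% for AND3 and OR3, and 5/12 (about 41.67%) for XOR2. Gates with more elaborate pull-down topologies would need their own network expressions and are not modeled here.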

Table 2: Probability of abnormal power and output error rate over all possible input patterns for DCVSL logic gates.

DCVSL gates    Percentage of abnormal power over all input patterns    Percentage of output error over all input patterns
Inverter       50.00%                                                  100%
XOR2           41.66%                                                  100%
XOR3           33.92%                                                  100%
AND2           25.00%                                                  25.00%
AND3           12.50%                                                  12.50%
OR2            25.00%                                                  50.00%
OR3            12.50%                                                  25.00%
OAI21          23.21%                                                  28.57%
AOI21          26.78%                                                  50.00%
AOI22          25.89%                                                  45.08%
OA22           25.89%                                                  35.27%
MUX21          30.35%                                                  55.00%
Average        27.72%                                                  52.00%

We examine the probability of abnormal power and output error occurrence for all input patterns. Table 2 shows the ratio of the total number of abnormal power peaks over the total number of all input patterns for various basic DCVSL gates. The average probabilities for power exception and output mismatch are 27.7% and 52%, respectively. This means our HT detection method has over 50% chance to detect HTs, even if the HT trigger circuit is implemented with a single gate. This is a significant advantage over other power-based side-channel analysis methods, which have a lower bound on the size of detectable HTs.
Moreover, we observe that abnormal power occurs more often on the input pattern that produces the rare output value. For example, an AND3 gate produces high output only when all three inputs are high; the abnormal power appears at the exact input pattern if one of the inputs is not in the complementary form. To hide a HT, hackers often utilize the rare case to trigger the HT. As discussed above, our approach inherently achieves a higher detection rate for the HT triggered by rare cases. This means a system equipped with our method makes it more challenging for attackers to conceal HTs.

3. Experimental Results

3.1. Experimental Setup. We evaluated the proposed method on the 64-bit ripple carry adder, ISCAS’85 and ISCAS’89 benchmark circuits. The schematic and layout of the 64-bit adder were implemented in Cadence Virtuoso with the IBM CMOS7RF technology. We set all transistor lengths to 220 nm (minimum length in the CMOS7RF technology) and set the PMOS and NMOS transistor widths to 500 nm and 600 nm, respectively. The average power, leakage power, and peak dynamic power were obtained from schematic-level simulations by examining all possible input patterns. The area for DCVSL modules was obtained from customized layout in Virtuoso. Five metal layers were used in layout design. The fastest switching period for input is 1 𝜇s. We synthesized the Verilog codes of ISCAS benchmark circuits in Synopsys Design Compiler with IBM CMOS7RF technology. The synthesized netlist is modified with an in-house Python-based netlist generator, which converts the CMOS netlist to a DCVSL netlist. The behavior model of the CMOS library is modified according to the gate output and power performance obtained from simulation in Cadence Virtuoso.
HT detection rate is evaluated through gate-level simulation in Cadence NCVerilog. To observe the accumulated HT-induced effects through the system, we inserted the HT payload on the inputs of DUTs. We particularly did so to model the propagation of HT effect in a large-scale system. To compare the area and power consumption of DUT and HTs, we designed three HTs. HT-1 is an OR3 trigger circuit with XOR2 payload. HT-2 is an OR(XOR(AND(x,y),z),w) trigger circuit with XOR2 payload. HT-3 is an AND4 plus modulo-8 counter trigger circuit with XOR2 payload. The complexity of the DUTs and HTs in this work is listed in Table 3. As can be seen, the HTs are significantly smaller than the target design.

Table 3: Number of transistors for DUTs and HTs in this work.

Circuit              CMOS 64-bit adder     HT-1    HT-2     HT-3
Transistor number    2560                  8       28       100
Circuits             DCVSL 64-bit adder    C432    C1908    C3540
Transistor number    1644                  2070    5516     9874
Circuits             S526                  S832    S1196    S1488
Transistor number    1682                  1408    3056     2824

Table 4: Power consumption for two 64-bit full adders and HT insertions.

Unit under test                          Dynamic power (mW)    Leakage power (nW)
CMOS-based 64-bit full adder    Adder    24.6                  65.78
                                HT-1     0.444                 0.419
                                HT-2     0.942                 1.243
                                HT-3     1.028                 2.583
DCVSL-based 64-bit full adder   Adder    8.002                 47.50
                                HT-1     0.566                 0.328
                                HT-2     0.892                 0.993
                                HT-3     1.544                 2.465

3.2. Case Study on a 64-Bit Full Adder. We implemented a 64-bit full adder using CMOS and DCVSL in Cadence Virtuoso. The transistor counts of these two adders, which reflect their layout area, are listed in Table 3. Because fewer PMOS transistors are needed in DCVSL, the area of the DCVSL-based full adder is less than that of the CMOS full adder when optimization is applied on both implementations. HTs are rarely triggered and the leakage power for HTs is a few orders of magnitude less than the adder switching power, as shown in Table 4.
All possible input patterns were applied to the 64-bit ripple carry adder. We placed a HT circuit to alter one complementary input pin in the adder. The power over time waveform is shown in Figure 5. As can be seen in

Figure 5: Power consumption for a 64-bit DCVSL full adder. (a) No HT, (b) HT on the 49th 1-bit full adder carry in port, and (c) HT on the 2nd 1-bit full adder carry in port.
[Plots: power (mW) versus time (0 to 60 𝜇s) for panels (a), (b), and (c).]

Figure 5(a), when no HT is triggered, the switching power has instantaneous peaks whereas the leakage power remains flat (close to zero). Figure 5(b) shows the power for the adder with one HT inserted at the 49th 1-bit full adder. As can be seen, the power has an extra periodical increase, which is noticeably higher than the leakage power. This is the short-circuit power (discussed in Section 2.2) induced by the noncomplementary inputs from HT insertion. We placed the HT payload circuit at the 2nd 1-bit full adder and observed different power behavior. As shown in Figure 5(c), the increased short-circuit power appears in almost all input patterns. This is because the 2nd 1-bit full adder with noncomplementary inputs yields noncomplementary outputs, and those outputs are further propagated to other 1-bit full adders. Because of the propagation of HT effects, the power consumption is exceptionally higher than that in normal cases.
CMOS circuits have more PMOS transistors than the DCVSL version. Consequently, the dynamic power consumption of CMOS is higher than that of DCVSL. As shown in Figure 6, DCVSL has less average power consumption than CMOS. However, when the HT is triggered and makes the inputs of the DCVSL-based full adder noncomplementary, the increased short-circuit power results in a dramatic increase in the average power. Figure 6 also shows that the average power difference between the original and the HT affected version is over 50X. If the HT is inserted at the early stage in the functional block, the average power difference increases to over two orders of magnitude. This is favorable for power-based side-channel analysis HT detection methods.
To assess the HT detection rate, we assume that HTs are inserted to change the complementary inputs. As input vectors A and B for a 64-bit full adder are equivalent, we select 64-bit input A to receive the potential impact from HTs. Besides half of the inputs, A, the carry-in bit for the first 1-bit full adder is another potential location for HT insertion. As the proposed method is independent of the particular HT trigger circuit, we flipped one of the complementary inputs to model the effect of HT insertion. As shown in Figure 7, for the HTs on A, the HT detection rate reaches 1. Given a HT area over chip area ratio below 1%, the HT detection rate is higher

than the one reported in [13]. Such a high HT detection rate is mainly contributed by the noncomplementary inputs, which lead to internal noncomplementary outputs. Those outputs are further propagated to the remaining gates. Consequently, one HT injection possibly leads to more gate failures. Figure 7 also shows that the HT inserted on the carry-in (Cin) input can be detected with a HT detection rate of 0.5, which can be compensated by comparing outputs. Our simulation results show that, after the output comparison, the HT detection rate can be enhanced close to 1. The simulated HT detection rate was obtained from 200,000 random input patterns.

Figure 6: Impact of HT location on average power of 64-bit DCVSL adder.
[Bar chart: average power (mW, logarithmic axis from 1 to 1000) for the DCVSL and CMOS adders with no HT and with the HT on the 49th, 33rd, and 2nd 1-bit FA.]

Figure 7: Impact of HT insertion locations on HT detection rate.
[Plot: HT detection rate (0 to 1) versus HT injection on different input pins (Cin, A4, A9, A14, A19, A24, A29, A34, A39, A44, A49, A54, A59).]

HTs placed on input pins at earlier stages in the design have higher potential to be detected, because of the propagation of noncomplementary outputs. We examine the impact of HT insertion locations on the HT detection rate. As shown in Figure 8, as the HT insertion location shifts towards the final output, the HT detection rate decreases to around 0.5. The earlier the HT is inserted, the higher the probability of obtaining abnormal power behavior that can be used to determine the presence of HTs. For HT injection on the very early inputs, each HT detected case will have about 1.7 gates experiencing high short-circuit power, as shown in Figure 9(a). According to Tables 1 and 4, the short-circuit power difference is high enough for use in HT detection. As shown in Figure 9(b), the HT inserted in the early 1-bit full adder stage yields an abnormal energy that is up to three orders of magnitude higher than normal leakage energy. A HT insertion location approaching the final output yields less abnormal power, in terms of absolute energy value and the frequency of abnormal energy. As explained before, the later HT injection location has a higher probability to demonstrate errors on the final outputs.

Figure 8: Impact of HT insertion location on HT detection rate.
[Plot: HT detection rate (0 to 1) versus HT insertion on different internal gates (1 to 25).]

3.3. Evaluation on Benchmark Circuits. The proposed method is further evaluated with ISCAS benchmark circuits, which are composed of various logic gates listed in Table 2. In the experiments below, we assume that a single HT is inserted in the benchmark circuit. More HT insertions in the target circuit lead to a higher HT detection rate, as more gates experience abnormal short-circuit power. The HT detection rate is defined as the number of cases experiencing abnormal short-circuit power over the total number of test cases. Three combinational benchmark circuits, c432, c1908, and c3540, are used to assess the HT detection rate of our method. 500,000 random input patterns were applied to the evaluation of the c432 and c1908 circuits. Because of its larger scale, c3540 was evaluated with 1,000,000 random input patterns.
As shown in Figure 10, our method achieves a HT detection rate of up to 1 in the c432 circuit. The lowest HT detection rate is 0.7333. The majority of logic gates in c432 are Inverter and AND2; thus the HT detection rates are centered around two particular regions, 1 and 0.73. The scales of c1908 and c3540 are larger than that of c432; the kinds of logic gates in c1908 and c3540 are more diverse than in c432. These two factors affect the HT detection rate. Figures 11 and 12 show that the HT detection rate is distributed over the whole range, but the HT detection rate stays mostly above 0.7. We averaged the HT detection rate over all test cases in Figure 13. As can be seen, our method achieves a HT detection rate over 0.8 in c432 and c1908. The HT detection rate for c3540 is slightly lower; however, our HT detection rate is still significant, as our method is not limited by the size of HTs and can be used to detect extremely small HTs. The average HT detection rate for the examined ISCAS’85 benchmark circuits is 0.76.
circuit power for one gate is one order of magnitude higher To examine the amount of power increased by each HT
than the leakage power of a full adder. Therefore, the power insertion, we first investigated the number of gates having
[Figure 9 plots. (a) Number of gates having abnormal power versus HT injection locations (Cin, A4, A9, ..., A59). (b) Energy for one input pattern (J) versus time (μs) for HT on the 1st, 32nd, and 48th 1-bit adder, with the energy level for normal leakage marked. (c) Average power (W) for four cases: without HT, HT on 1st 1-bit FA, HT on 32nd 1-bit FA, and HT on 48th 1-bit FA; the bar labels read 3.64E−04, 1.09E−04, 1.10E−04, and 2.16E−06.]

Figure 9: Results for HT-induced abnormal power assessment. (a) Average number of gates experiencing high short-circuit power per HT
inserted case. (b) Abnormal energy caused by HT insertion over regular leakage energy. (c) Average power for three different HT injection
locations.
Figure 10: HT detection rate in c432. [Plot: hardware Trojan detection rate (0 to 1) versus noncomplementary input location.]

Figure 11: HT detection rate in c1908. [Plot: hardware Trojan detection rate (0 to 1) versus noncomplementary input location.]

abnormal power upon HT insertion in the different locations of three ISCAS’85 benchmark circuits. Figure 14 shows the number of gates that are affected by one HT insertion. As can be seen, the number of gates yielding abnormal power generally increases with the circuit size and complexity. As shown in Figure 14, c3540 has the highest number of gates experiencing abnormal power per HT insertion, compared to c1908 and c432. As the HT insertion position moves towards the final output, the number of gates with abnormal power behavior decreases because the path of HT effect propagation is reduced. Since the abnormal short-circuit power also depends on input patterns of the target gate, the results reported in Figure 14 are not always integer valued. We averaged the number of gates affected by each HT insertion in three benchmark circuits. As shown in Figure 15, the average affected gate number for c3540 exceeds three. The higher number means more significant power will be induced by HT insertions; this feature has potential to be used in power-based HT detection.

Figure 12: HT detection rate in c3540. [Plot: hardware Trojan detection rate (0 to 1) versus noncomplementary input location.]

Figure 13: Average HT detection rate. [Bar chart: c432 and c1908 at 0.81 and 0.82, c3540 at 0.66, and the average at 0.76.]

Figure 14: The number of gates experiencing abnormal power during each HT insertion. [Plot: number of abnormal gates detected through power observation versus noncomplementary input location, for c432, c1908, and c3540.]

Figure 15: Average number of gates with abnormal power per each noncomplementary input pair. [Bar chart: average number of abnormal gates per HT insertion for c432, c1908, and c3540.]

Detecting the noncomplementary final output of DUT helps to improve the HT detection rate. As shown in Figure 16, not all test cases have abnormal power behavior. We collected the number of cases that have noncomplementary outputs (i.e., output error) and observed that the cases of noncomplementary DUT final output can achieve a HT detection rate of 1. This outstanding performance depends on circuit topology and the employed logic gates. Sometimes, the output error occurs at the same moment when abnormal short-circuit power is observed.

Sequential circuits are more likely to be affected by HT effect propagation, as latches and flip-flops have a higher probability to remain high with short-circuit power than combinational logic gates. We injected a single HT on the inputs of benchmark circuits, s526, s832, s1196, and s1488, to model the impact of HT on circuits. As shown in Figure 17, on average, the HT detection rate on sequential circuits is higher than that in combinational circuits. The HT detection rate of s1488 and s1196 is close to 1. The average HT detection rate for the examined ISCAS’89 benchmark circuits is 0.85.
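The detection-rate metric defined in Section 3.3 (cases with abnormal short-circuit power divided by the total number of test cases) and the averaging behind Figure 13 can be sketched as follows. This is a minimal illustration, not the authors' simulation flow: the function name and the example counts are hypothetical, and the per-circuit values are the bar labels read from Figure 13 (the assignment of 0.81 and 0.82 to c432 and c1908 is assumed from the plotting order).

```python
from statistics import mean


def detection_rate(abnormal_cases: int, total_cases: int) -> float:
    """Fraction of test cases that show abnormal short-circuit power."""
    if total_cases <= 0:
        raise ValueError("total_cases must be positive")
    if not 0 <= abnormal_cases <= total_cases:
        raise ValueError("abnormal_cases must lie in [0, total_cases]")
    return abnormal_cases / total_cases


# Bar labels read from Figure 13; which of c432/c1908 holds 0.81 vs 0.82
# is an assumption and does not affect the average.
iscas85_rates = {"c432": 0.81, "c1908": 0.82, "c3540": 0.66}
average_rate = mean(iscas85_rates.values())

if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the ratio.
    print(f"{detection_rate(11, 15):.4f}")
    print(f"ISCAS'85 average: {average_rate:.2f}")
```

Rounded to two digits, the average agrees with the 0.76 reported for the ISCAS’85 circuits.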

Figure 16: HT detection rate improvement by comparing complementary outputs in c432 circuit. [Plot: number of detected cases (×10^3) for two series, output error and abnormal power.]

Figure 17: Average HT detection rate of different sequential benchmark circuits. [Bar chart: HT detection rate (0 to 1) for the ISCAS benchmark circuits s526, s832, s1196, and s1488, and their average.]

4. Conclusion
Hardware Trojans (HTs) challenge chip security because of the increasing number of chips being fabricated, assembled, and packaged offshore. To enforce the confidence of chip security, efficient HT detection is imperative. HT detection can be performed during the chip testing stage, although it requires large numbers of test vectors and long verification times. As argued by many researchers, testing approaches may not be practical in identifying the rare events caused by HTs in a short period of time. Chip fingerprint is examined in IC authorization stages through side-channel analysis. Existing side-channel analysis approaches are challenged by process variation, lack of a perfect golden chip for comparison, and the presence of small-scale HTs. To address this need, we propose to use the inherent characteristic of DCVSL to detect HTs at runtime, without requiring a golden chip and a large number of test vectors. Our method is low-cost, convenient for users, and complementary to existing power-based side-channel analysis methods.

In this work, we exploit DCVSL’s complementary feature on both inputs and outputs to detect hardware Trojans at runtime, rather than offline. Noncomplementary inputs in DCVSL-based systems lead to constant and abnormal short-circuit power peaks, which remain until the noncomplementary inputs disappear. A case study on a 64-bit ripple carry adder shows that the proposed method achieves from 50X to two orders of magnitude higher average power difference than CMOS-based power analysis. Such a high power difference between normal operation and HT triggered conditions is desirable for power-based side-channel analysis. Evaluation on a 64-bit adder shows that our method achieves a HT detection rate approaching 100% if HTs are inserted to flip the logic value of one of the adder inputs. As HT payload circuits are placed close to the final outputs, our abnormal power-based HT detection slightly loses its efficiency. The examination of the complementary characteristic of the outputs can improve the HT detection rate. Assessment on ISCAS’85 and ISCAS’89 benchmark circuits shows that the HT detection rate is in the range of 66% to 98%. On average, our method can detect 76% and 85% of HTs inserted in ISCAS’85 and ISCAS’89 benchmark circuits, respectively. By examining the complementary nature of the final output, we further improve the HT detection rate. Simulation on the ISCAS’85 c432 circuit shows that the HT detection rate can be compensated to reach 100%.

In future work, we will validate the proposed method in larger-scale circuits. In addition, we will integrate our method with a current monitor to demonstrate the significance of the proposed concept in real applications.

Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.

References
[1] J. Markoff, “Old Trick Threatens the Newest Weapons,” October 2009, http://www.nytimes.com/2009/10/27/science/27trojan.html?pagewanted=all&_r=1&.
[2] J. Ellis, “Trojan integrated circuits,” http://chipsecurity.org/2012/02/trojan-circuit/.
[3] S. Johnson, “Fake chips threaten military,” San Jose Mercury News, September 2010, http://www.mercurynews.com/breaking-news/ci_15990184.
[4] M. Banga and M. S. Hsiao, “A region based approach for the identification of hardware Trojans,” in Proceedings of the IEEE International Workshop on Hardware-Oriented Security and Trust (HOST ’08), pp. 40–47, June 2008.
[5] D. Mukhopadhyay and R. S. Chakraborty, “Testability of cryptographic hardware and detection of Hardware Trojans,” in Proceedings of the 20th Asian Test Symposium (ATS ’11), pp. 517–524, November 2011.
[6] M. Tehranipoor and F. Koushanfar, “A survey of hardware trojan taxonomy and detection,” IEEE Design and Test of Computers, vol. 27, no. 1, pp. 10–25, 2010.
[7] F. Wolff, C. Papachristou, S. Bhunia, and R. S. Chakraborty, “Towards Trojan-free trusted ICs: problem analysis and detection scheme,” in Proceedings of the Design, Automation and Test in Europe (DATE ’08), pp. 1362–1365, Munich, Germany, March 2008.
[8] R. S. Chakraborty, F. Wolff, S. Paul, C. Papachristou, and S. Bhunia, “MERO: a statistical approach for hardware Trojan detection,” in Proceedings of the 11th International Workshop on Cryptographic Hardware and Embedded Systems, pp. 396–410, 2009.
[9] M. Banga and M. S. Hsiao, “VITAMIN: voltage inversion technique to ascertain malicious insertions in ICs,” in Proceedings of the IEEE International Workshop on Hardware-Oriented Security and Trust (HOST ’09), pp. 104–107, July 2009.
[10] H. Salmani, M. Tehranipoor, and J. Plusquellic, “A novel technique for improving hardware trojan detection and reducing trojan activation time,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20, no. 1, pp. 112–125, 2012.
[11] Y. Alkabani, “Trojan immune circuits using duality,” in Proceedings of the 15th Euromicro Conference on Digital System Design (DSD ’12), pp. 177–184, 2012.
[12] Y. Jin and Y. Makris, “Hardware Trojan detection using path delay fingerprint,” in Proceedings of the IEEE International Workshop on Hardware-Oriented Security and Trust (HOST ’08), pp. 51–57, June 2008.
[13] K. Hu, A. N. Nowroz, S. Reda, and F. Koushanfar, “High-sensitivity hardware Trojan detection using multimodal characterization,” in Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE ’13), pp. 1271–1276, 2013.
[14] C. Bell, M. Lewandowski, and S. Katkoori, “A multi-parameter functional side-channel analysis method for hardware trust verification,” in Proceedings of the 31st IEEE VLSI Test Symposium (VTS ’13), pp. 1–4, 2013.
[15] L. Wang, H. Xie, and H. Luo, “Malicious circuitry detection using transient power analysis for IC security,” in Proceedings of the International Conference on Quality, Reliability, Risk, Maintenance, and Safety Engineering (QR2MSE ’13), pp. 1164–1167, 2013.
[16] S. Narasimhan, D. Du, R. S. Chakraborty et al., “Hardware Trojan detection by multiple-parameter side-channel analysis,” IEEE Transactions on Computers, vol. 62, no. 11, pp. 2183–2195, 2013.
[17] S. Narasimhan, X. Wang, S. Bhunia, W. Yueh, and S. Mukhopadhyay, “Improving IC security against Trojan attacks through integration of security monitors,” IEEE Design & Test of Computers, vol. 29, no. 5, pp. 37–46, 2012.
[18] T. Huffmire, J. Valamehr, T. Sherwood et al., “Trustworthy system security through 3-D integrated hardware,” in Proceedings of the IEEE International Workshop on Hardware-Oriented Security and Trust (HOST ’08), pp. 91–92, June 2008.
[19] X. Wang, H. Salmani, M. Tehranipoor, and J. Plusquellic, “Hardware Trojan detection and isolation using current integration and localized current analysis,” in Proceedings of the 23rd IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT ’08), pp. 87–95, October 2008.
[20] F. Koushanfar and A. Mirhoseini, “A unified framework for multimodal submodular integrated circuits trojan detection,” IEEE Transactions on Information Forensics and Security, vol. 6, no. 1, pp. 162–174, 2011.
[21] D. A. Rennels and H. Kim, “Concurrent error detection in self-timed VLSI,” in Proceedings of the 24th International Symposium on Fault-Tolerant Computing, pp. 96–105, June 1994.
Hindawi Publishing Corporation
VLSI Design
Volume 2014, Article ID 801241, 14 pages
http://dx.doi.org/10.1155/2014/801241

Research Article
On-Chip Power Minimization Using Serialization-Widening
with Frequent Value Encoding

Khader Mohammad,1 Ahsan Kabeer,2 and Tarek Taha3


1 Birzeit University, P.O. Box 14, Birzeit, West Bank, Palestine
2 Clemson University, Clemson, SC 29634, USA
3 University of Dayton, Dayton, OH 45469, USA

Correspondence should be addressed to Khader Mohammad; [email protected]

Received 19 January 2014; Accepted 2 April 2014; Published 6 May 2014

Academic Editor: Qiaoyan Yu

Copyright © 2014 Khader Mohammad et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

In a chip-multiprocessor (CMP) architecture, the L2 cache is shared by the L1 cache of each processor core, resulting in a high volume of diverse data transfer through the L1-L2 cache bus. High-performance CMP and SoC systems have a significant amount of data transfer between the on-chip L2 cache and the L3 cache of off-chip memory through the power expensive off-chip memory bus. This paper addresses the problem of the high power consumption of the on-chip data buses, exploring a framework for memory data bus power minimization approaches. A comprehensive analysis of the existing bus power minimization approaches is provided based on performance, power, and area overhead considerations. A novel approach for reducing the power consumption of the on-chip bus is introduced. In particular, a serialization-widening (SW) of the data bus with frequent value encoding (FVE), called the SWE approach, is proposed as the best power savings approach for the on-chip cache data bus. The experimental results show that the SWE approach with FVE can achieve approximately 54% power savings over the conventional bus for multicore applications using a 64-bit wide data bus in 45 nm technology.

1. Introduction
There is a need for high-performance, high-end products to reduce their power consumption. High-performance systems require complex design and a large power budget, with considerable temperature impact, to integrate several powerful components. Therefore, low energy consumption is a major design criterion in today’s designs. Low energy consumption improves battery longevity and reliability, and a reduction in energy consumption lowers both the packaging and overall system costs [1]. As technology scales down, power consumption also decreases, which results in more sensitivity to soft errors, so reliability is affected. There are tradeoffs between power consumption and reliability in different ways. In future work overall reliability will be discussed and it will be evaluated how it can be improved by reducing the power consumption.

The primary goal of this research is bus power minimization by reducing the switching activity while at the same time improving bus bandwidth for the compression technique and reducing the bus capacitance for the SW approach. The goal is similar to using switching activity and capacitance reduction in bus power savings; the key difference between the prior work and the work presented here is that the primary focus of this work is to explore a framework for bus power minimization approaches from an architectural point of view. As a result, this paper presents a comprehensive analysis of most of the possible bus power minimization approaches for the on-chip bus. This research explores a framework for power minimization approaches for an on-chip memory bus from an architectural point of view. It also considers the impact of coupling capacitance for estimating the on-chip bus power consumption. Finally, this paper proposes a serialized-widened bus with frequent value encoding (FVE) as the best power savings approach for the on-chip (L1-L2 cache) data bus.

The organization of the rest of the paper is as follows. Section 2 presents background. Section 3 presents framework
and proposed on-chip bus power model, a framework for bus power minimization approaches and their efficacy. Section 4 presents the experimental setup, followed by Section 5, which presents the experimental results and a thorough comparison of the proposed technique with the other approaches.

Figure 1: Load capacitance of a wire and coupling capacitance between the wires. [Diagram: wire 1 and wire 2, each with load capacitance CL, and coupling capacitance CC between them.]

2. Background
Memory bus power minimization techniques can be categorized as bus serialization [2–4], encoding [5–8], and compression techniques [9–14]. Non-cache-based encoding techniques reduce power by reordering the bus signals. Bus serialization reduces the number of wire lines, eventually reducing the area overhead. A serialized-widened bus reduces the capacitance of on-chip interconnections. Cache-based encoding techniques reduce the number of switching transitions using encoded hot-code. These techniques keep track of some of the previous transmitted data using a small cache on both sides of the data bus. Compression techniques reduce the number of wire lines, contributing to a reduction in area overhead and an increase in the bus bandwidth. These compression techniques also reduce the switching activity. Serialization changes the data ordering transmitted through the data bus. This method contributes to reducing the switching activity as well. It may also improve the chance of data matching by incorporating it with cache-based encoding techniques because partial data matching is three times more frequent than full-length data matching [7].

Jacob and Cuppu [3] explored the dynamic random access memory (DRAM) system and memory bus organization in terms of performance, presenting design tradeoffs for the bank, channel, bandwidth, and burst size. They also measured the performance with a view to optimizing the memory bandwidth and bus width. Suresh et al. [7] presented a data bus transmit protocol called the power protocol to reduce the dynamic power dissipation of off-chip data buses. Hatta et al. [2] proposed the concept of bus serialization-widening (SW) to reduce wire capacitance; their work focused on the power minimization of the on-chip cache address and data bus. Li et al. [15] proposed reordering the bus transactions to reduce the off-chip bus power.

In this chapter we present on-chip bus power model, a framework for bus power minimization approaches and their efficacy. We also discuss in detail the proposed technique and present a thorough comparison of our proposed technique to the possible approaches from power savings stand point.

The general equation of the bus power calculation is given as follows:

$$P = \alpha f W C V_{DD}^{2}, \tag{1}$$

where $\alpha$ is the switching activity, $f$ is the frequency of the bus, $W$ is the number of parallel data bus lines, $C$ is the total capacitance of the bus, and $V_{DD}$ is the swing voltage. The capacitance $C$ of (1) can be divided into two parts as load capacitance $C_L$ which is the parasitic capacitance to substrate with a constant potential and coupling capacitance $C_C$ which is the parasitic capacitance between the adjacent lines (see Figure 1). In a deep submicron technology, the total capacitance no longer only depends on load capacitance of the wire. Coupling capacitance between the wires is a large factor as coupling capacitance is some order of load capacitance of the wire line [16–20].

The total capacitance is the sum of the load capacitance and coupling capacitance and it can be expressed as $C = C_L + 2C_C$ [2, 16, 21–23]. The equation of the power consumption calculation of the conventional bus line will be

$$P_C = (\alpha_{CL} C_{CL} + \alpha_{CC} C_{CC}) f_C W_C V_{DD}^{2}, \tag{2}$$

where $\alpha_{CL}$ is the signal transition switching activity, $C_{CL}$ is the load capacitance, $\alpha_{CC}$ is the coupling transitions switching activity, and $C_{CC}$ is the coupling capacitance between the conventional bus lines. The signal transition switching activity [2, 22] is given by

$$\alpha_{CL} = \begin{cases} 0 & \text{if data transition from } 0 \to 0 \text{ or } 1 \to *, \\ 1 & \text{if data transition from } 0 \to 1. \end{cases} \tag{3}$$

The coupling switching activity [2, 22] depends on the transitions activity between two adjacent bus lines as follows:

$$\alpha_{CC} = \begin{cases} 0 & \text{if wire 1 transition } 0 \to 1 \text{ and wire 2 transition } 0 \to 1, \\ 1 & \text{if wire 1 transition } 0 \to 1 \text{ and wire 2 transition } 0 \to 0, \\ 1 & \text{if wire 1 transition } 0 \to 1 \text{ and wire 2 transition } 1 \to 1, \\ 2 & \text{if wire 1 transition } 0 \to 1 \text{ and wire 2 transition } 1 \to 0. \end{cases} \tag{4}$$

Two of the main approaches to minimize the power consumption of a bus are to reduce the bus switching activity and the bus wire capacitance. Switching activity can be reduced through encoding techniques while the wire capacitance can be reduced by changing the wire width and spacing.

2.1. Bus Serialization and Widening. Bus serialization involves reducing the number of wires on the bus. If the number of transmission lines in a conventional bus is NC and the serialization factor is S, then the number of transmission lines in the serialized version of the bus is given by NS = NC/S. The serialization factor can be any
integer multiple of 2. The throughput of a bus serialized by a factor of two is halved. To prevent a reduction in the throughput, the bus frequency can be doubled. This requires the increasing of the wire widths to support higher switching speeds. The advantage of serialization is that the bus occupies less area than a conventional bus. Serialization on its own may not necessarily reduce the switching activity and thus the energy consumption of a bus (see Table 1). Loghi et al. [5] examined the use of bus serialization combined with data encoding for power minimization. In this case, the bus area was smaller, but the throughput of the bus was halved (since the frequency remained the same).

Table 1: Serialization may increase or decrease switching activity. Parts (a) and (b) illustrate two different 16-bit data streams passing through a conventional 8-bit bus and a serialized 4-bit bus. In example (a) switching activity decreases, while in (b) it increases.

(a) 16-bit data stream → 0011 0011 0011 0011

Passing through 8-bit bus →
Data sequence                 Signal   Coupling
0000 0000                     -        -
0011 0011                     4        3
0011 0011                     0        0
Total number of transitions   4        3

Passing through 4-bit bus →
Data sequence                 Signal   Coupling
0000                          -        -
0011                          2        1
0011                          0        0
0011                          0        0
0011                          0        0
Total number of transitions   2        1

(b) 16-bit data stream → 0011 1100 0011 1100

Passing through 8-bit bus →
Data sequence                 Signal   Coupling
0000 0000                     -        -
0011 1100                     4        2
0011 1100                     0        0
Total number of transitions   4        2

Passing through 4-bit bus →
Data sequence                 Signal   Coupling
0000                          -        -
0011                          2        1
1100                          2        2
0011                          2        2
1100                          2        2
Total number of transitions   8        7

Figure 2: Basic structure and position of serializer and deserializer. [Diagram: L1 cache, serializer with driver, bus, deserializer, and L2 cache, with NC lines on each cache side and NS lines on the bus.]

Figure 3: Basic structure of conventional and serialized bus lines. [Diagram: conventional bus with wire width WC and spacing DC; serialized-wider bus (serialization = 2) with wire width WS and spacing DS.]

In a deep submicron technology, the switching energy consumed due to coupling capacitance is dominant [16, 17, 24–26]. The disadvantage of bus widening is that the bus occupies more area than a conventional bus. Hatta et al. [2] looked at combining bus serialization with bus widening in order to reduce bus power without increasing the bus area. In that study, the bus frequency was increased to keep the throughput constant. Although this required increasing the width of the wires, the extra spacing between the wires allowed this to be accommodated without a bus area overhead. Hatta et al. [2] also looked at combining a serialized-widened bus with differential data encoding and found that it helped on the address bus but not on the data bus.

In a serialized-widened bus, the operating frequency can be increased to keep the throughput the same as in a conventional bus. In this case, the serialized frequency is given by fS = S ⋅ fC, where S is the serialization factor and fC is the frequency of the conventional bus. In order to implement bus serialization at a higher frequency, a serializer and deserializer are required at the sending and receiving ends of the bus, respectively (as shown in Figure 2).

Figure 3 shows the structure of data lines of a conventional bus and those of a serialized-widened bus. The relationship of the wire width and spacing between the wires of a conventional bus and a serialized-widened bus is

$$W_S + D_S = (W_C + D_C)\,S, \tag{5}$$

where WC is the wire width of the conventional bus, DC is wire spacing between the lines in the conventional bus, WS is the wire width of a serialized-widened bus, DS is wire spacing between the lines in the serialized-widened bus, and S is the serialization factor.

Table 2: Comparison of possible approaches to reduce on-chip data bus power.

Approach Bus freq. Switching activity Line cap. Bus area


1 C (conventional) f 𝛼C CC Original bus area
2 S (serial) 2f 𝛼S CS Reduced
3 W (widened) f 𝛼C CW At least double
4 E (encoded) f 𝛼E CC Unchanged
5 SW 2f 𝛼S CSW Unchanged
6 SE 2f 𝛼SE CS Reduced
7 WE f 𝛼E CW At least double
8 SWE 2f 𝛼SE CSW Unchanged
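The rows of Table 2 appear to follow mechanically from which of the three techniques (S, W, E) is enabled, and the sketch below encodes that reading. It is an illustration only: the function name, dictionary keys, and `alpha_`/`C` string labels are hypothetical, the derivation rules are inferred from the table rather than stated by the authors, and the "2f" entry corresponds to the serialization factor of 2 used in the table.

```python
from itertools import product
from typing import Dict


def describe_bus(serialized: bool, widened: bool, encoded: bool) -> Dict[str, str]:
    """Derive one Table 2 row from the three on/off design choices."""
    s = "S" if serialized else ""
    w = "W" if widened else ""
    e = "E" if encoded else ""
    if serialized and not widened:
        area = "Reduced"           # fewer wires, same spacing
    elif widened and not serialized:
        area = "At least double"   # same wire count, more spacing
    else:
        # Neither, or both (the extra spacing reuses the area freed by
        # serialization). Table 2 words the C row as "Original bus area".
        area = "Unchanged"
    return {
        "approach": (s + w + e) or "C",
        "freq": "2f" if serialized else "f",   # doubled to keep throughput
        "alpha": "alpha_" + ((s + e) or "C"),  # switching: set by S and E
        "cap": "C" + ((s + w) or "C"),         # capacitance: set by S and W
        "area": area,
    }


if __name__ == "__main__":
    for flags in product((False, True), repeat=3):
        print(describe_bus(*flags))
```

Read this way, widening never changes the switching activity and encoding never changes the line capacitance, which is why SWE can combine both reductions.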

The width WC is different from the width WS to allow a higher frequency. Since the wire widths have to be changed to accommodate the higher operating frequency, the load capacitance of the bus wires (given in (5)) will change. In addition, the increase in wire spacing changes the cross-coupling capacitance. Thus the power consumption of the bus is given by

$$P_S = (\alpha_{SL} C_{SL} + \alpha_{SC} C_{SC}) f_S W_S V_{DD}^{2}, \tag{6}$$

where $\alpha_{SL}$ is the signal transition switching activity, $\alpha_{SC}$ is the coupling switching activity, $C_{SL}$ is the load capacitance, and $C_{SC}$ is the coupling capacitance of the serialized-widened bus. Figure 4 shows the capacitance values in a multilevel metal layer. The wire configurations values are taken from ITRS 2004 Update [27] and those values are used in the Chern et al. [23] equations to calculate the capacitance values. The frequency of the bus is given by the Kawaguchi and Sakurai [28] equation:

$$\frac{1}{f} \approx \left(1.63 \cdot \frac{C_C}{C_C + C_L} + 0.37\right) \cdot R \cdot (C_C + C_L). \tag{7}$$

Here, $R$ is the resistance of the wire given by its width $W$, thickness $T$, and rate of resistance $\alpha$ (dependent on material property):

$$R = \frac{\alpha}{W T}. \tag{8}$$

Equations (7) and (8) can be used to determine the optimum wire width for the serialized-widened bus at the higher frequency.

Figure 4: Line-to-line and crossover capacitance of a multilevel metal layer. [Diagram: metal layers 1 to 3 with the capacitances CL and CC marked.]

3. Framework and Proposed Technique
The three fundamental approaches discussed earlier in this section to reduce bus power are serialization (S), encoding (E), and widening (W) of the bus. Combinations of these approaches are also possible, and in fact yield better results. Table 2 lists the possible types of buses based on these three approaches and their combinations (the first of which is a conventional bus (C) not employing any of the approaches). These approaches reduce the power through changes in the switching activity and the line capacitance of the bus.

Table 2 lists the relation between the switching activity and the line capacitance of the different approaches. It also lists the change in bus area and frequency due to the approaches. Two other important methods to reduce bus power are variations in the swing voltage and operating frequency. These two techniques can be applied in conjunction with all of the methods listed in Table 2. The framework shown in Table 2 can be used to categorize many of the approaches used to minimize the bus switching activity and wire capacitance. The encoding techniques proposed in [12–23, 27–47] fall under the category E listed in the table. The narrow bus encoding technique presented by Loghi et al. [5] falls under the category SE, while Hatta’s serialized-widened bus [2] falls under the category SW.

There are four unique capacitance values and switching activities listed in Table 2. The relation between these capacitance values can generally be described as CW < CSW < CS < CC. If the serialized bus is running at a higher frequency to preserve the bus throughput, the wires and their spacing may have to be widened, thus possibly reducing their capacitance CS from the original bus value, CC. In the widened bus, the wire spacing is increased, giving this type of bus the lowest wire capacitance. However, there is a significant bus area overhead in this approach. The serialized-widened bus running at a higher frequency to preserve the throughput will have slightly less wire spacing than the widened bus since the wires will have to be made wider for the higher frequency. Thus the capacitance of this bus, CSW, will be more than that of the widened bus, CW, but
Table 3: Architectural configuration of the simulator used in the experiment.

System                       Parameters
Number of processor cores    2, 4, 8
Super scalar width           4, out-of-order
L1 instruction cache         16/32/64 KB, direct-mapped, 1-cycle
L1 data cache                16/32/64 KB, 4-way, 1-cycle
L1 block size                32 B
Shared L2 cache              1 MB, 4-way, unified, 12-cycle
L2 block size                64 B
RUU/LSQ                      16/8
Memory ports                 2
TLB                          128-entries, 4-way, 30-cycle
Memory latency               96-cycle
Memory bus width             1/2/4/8 B

Table 4: Benchmarks, types, and number of warm up instructions used in the experiment.

Benchmarks     Type   Warm up instructions
gzip (pro)     Int    2000 M
gzip (src)     Int    1400 M
Wupwise        FP     2000 M
Gcc            Int    2000 M
Mesa           FP     700 M
Art            FP     2000 M
Mcf            Int    1000 M
bzip2 (pro)    Int    2000 M
bzip2 (src)    Int    2000 M
Twolf          Int    900 M
mpeg2d         MB     0 M
Gsm            MB     0 M
mpeg2e         MB     0 M

still less than the serialized bus, $C_S$ (since the wires are more spaced out than a serialized bus).
The relation between the switching activities is highly dependent on the data values passed on the bus. Therefore a strict relation between the switching activities cannot be shown. However, in general it can be expected that an encoded bus will have less switching than a conventional bus (hence $\alpha_E < \alpha_C$). In addition the serialized-encoded bus (SE) will also likely have a lower switching activity than a conventional bus (hence $\alpha_{SE} < \alpha_C$). The relation between the switching activity of a serialized bus ($\alpha_S$) and a conventional bus ($\alpha_C$) is hard to predict.
This paper proposes data bus power reduction techniques for the SWE approach. This work compares these approaches with existing power reduction methods that fall under the different categories in Table 2. This work finds that the SWE approach works best since this method reduces both the wire capacitance and the switching activity significantly.

4. Experimental Setup

This section discusses the target system of the experiment and the memory structure used to collect the memory traces. The first subsection describes the architecture of sim-outorder, the superscalar simulator from the Simplescalar tool suite [48]. The following subsection discusses the benchmark suites and the input sets that are used in this paper. The last part of this section presents the switching activity computation methodology.

4.1. Simulator. This experiment uses a modified version of Simplescalar 3.0d’s sim-outorder simulator [48] to collect our cache request traces. The model architecture has a mid-range configuration. Table 3 summarizes the architectural configuration of our simulator. The baseline configuration parameters are typical of a modern chip multiprocessor and out-of-order simulator. This work keeps the L1 cache size small to generate more memory accesses, which gives a more accurate picture of the behavior of memory accesses and of the memory bus. This work also develops another simulator, written in C, to calculate the switching activity for the bus power estimation.

4.2. Benchmark Suites. This experiment uses 6 integer and 3 floating point benchmarks from the SPEC2000 suite [49] and 3 benchmarks from the MediaBench suite [47]. This selection is motivated by the need for some memory intensive programs (mcf, art, gcc, gzip, and twolf) [3] and some memory nonintensive programs. The simulation uses the reference inputs of the SPEC2000 suite because the test and training inputs have smaller data sets. For each benchmark of the SPEC2000 suite, this work divides the total run length by 5 and warms up for the first 3 portions, with a maximum of 2 billion instructions, using fast-forward mode cycle-level simulation. A 200 million instruction window is then simulated using the detailed simulator. For the MediaBench suite, this work simulates the whole program to generate the required traces without any fast forwarding. Table 4 lists the benchmarks that are chosen from the SPEC2000 and MediaBench suites and the number of instructions for which the simulator is warmed up. Among these benchmarks, groups of benchmarks are selected to run on the multicore processor units as in Table 4. This selection favors grouping the memory intensive programs, which gives a more accurate picture of memory access behavior than grouping memory nonintensive programs. Table 5 summarizes the list of benchmarks used for the 8, 4, and 2 core processing units.

4.3. Switching Activity Computation. A power simulator written in C is integrated with the modified Simplescalar sim-outorder simulator [48] to calculate the switching activity of the data transitions between the L1 and L2 caches through the L1-L2 cache bus. The simulator can calculate the switching activity for all six encoding techniques listed in Table 6.
During serialization-widening, the simulator uses two sets of value cache (VC) for LSB and MSB data matching instead of using one unified VC. Figure 5 shows the different

Table 5: Combination of benchmarks used for multiprocessing cores.

Number of cores    Set    Name of benchmarks
8                  1      mcf, art, gcc, twolf, mpeg2d, gzip (pro), mesa, bzip2 (pro)
8                  2      gzip (src), mcf, gcc, gsm, wupwise, mpeg2e, art, bzip2 (src)
4                  1      mcf, art, mpeg2e, gzip (pro)
4                  2      twolf, bzip2 (pro), mesa, art
4                  3      gcc, gzip (src), bzip2 (src), gsm
2                  1      mcf, art
2                  2      gcc, twolf

Figure 5: Structure of 2 sets of value cache combined with serialization. (Block diagram: L1 cache, serializer, LSB and MSB value caches, driver, external control line, LSB and MSB value caches, deserializer, L2 cache.)

Table 6: Listing of different encoding techniques implemented in this experiment.

Name                                          Abbreviation
Bus-invert coding                             bi
Transition signaling                          xor
Frequent value encoding with one hot-code     fv
Frequent value encoding with two hot-code     fv2
TUBE with one hot-code                        tube
TUBE with two hot-code                        tube2

Figure 6: Comparison of the % of power savings using the different data bus power reduction approaches. Results are compared to a conventional 64-bit L1-L2 cache data bus at 45 nm technology. (Vertical axis: power savings (%); horizontal axis: benchmark sets for 8, 4, and 2 cores and their average; series: S, W, E, SW, SE, WE, SWE.)

structures of two sets of VC with serialization. The data
bus size is varied to compare the effectiveness of the different possible approaches and encoding techniques while keeping the total amount of data the same. For example, if a 64-bit wide data stream requires 1 transfer on a 64-bit wide data bus, it requires 8 transfers on an 8-bit wide data bus.

5. Results and Analysis

This section presents the experimental results. It first gives a general comparison of the cache bus power minimization achieved by the seven possible approaches listed in Table 2. It then examines in detail three of the approaches that do not change the bus area and finds that the SWE approach performs the best. It also presents an in-depth analysis of the SWE approach performance under various architecture and technology configurations. The end of this section discusses the performance, power, and area overhead of the proposed technique.

5.1. Power Savings for Different Possible Approaches. The seven possible bus power savings approaches listed in Table 2 earlier are different combinations of serialization (S), bus widening (W), and encoding (E). Figure 6 shows the power savings on the L1-L2 cache data bus for the different architecture-benchmark combinations listed in Table 5 using these approaches. A 64-bit data bus implemented on 45 nm technology is assumed. The techniques reduce bus power by minimizing bus switching activity, bus wire capacitance, or both.
When the three approaches for power reduction are applied on their own, bus widening performs the best. The serialization (S) approach performs poorly for most of the architecture configurations listed in Figure 6 (the bus power is generally increased). This is primarily due to the fact that serialization generally increases switching activity. The bus capacitance is actually reduced partially since the wires are spaced out further to allow the frequency to be doubled. However, this reduction in capacitance is not enough to offset the increased switching activity. The widening (W) approach performs very well since it reduces the bus wire capacitance significantly. The disadvantage of the approach is that it almost doubles the bus area. There are six different encoding techniques (E) that are tested (see Table 6). Figure 6 shows the result from the best encoding technique for each architecture configuration. Encoding reduces switching activity without affecting the bus capacitance and thereby reduces the bus power. This approach does not change the bus area or frequency.
When using combinations of the three approaches, the serialized-widened-encoded (SWE) method performs the best. The serialized-widened (SW) approach reduces the bus capacitance by widening the wire spacing, but generally increases the switching activity through serialization. The net

Figure 7: (a) % of power savings achieved and (b) absolute power normalized to 64-bit conventional bus power using bus serialization for 64-, 32-, 16-bit wide bus for different serialization factors. The figure legend indicates the first number as bus width, S as serialization, and the last number as the serialization factor. (Horizontal axis: benchmark sets for 8, 4, and 2 cores and their average; series: 64 SW 2, 64 SW 4, 64 SW 8, 32 SW 2, 32 SW 4, 16 SW 2.)
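
The effect of serialization on switching activity can be illustrated with a minimal sketch that counts bus transfers and wire toggles when 64-bit words are sent over a narrower bus. This is not the simulator used in the experiment; the function name, the least-significant-chunk-first ordering, and the all-zero initial bus state are assumptions made for illustration.

```python
def count_transitions(words, word_bits=64, bus_bits=64):
    """Return (transfers, toggles) for sending `words` over a `bus_bits` bus.

    Each word is split into word_bits // bus_bits chunks, sent least
    significant chunk first (an assumed ordering). A toggle is one wire
    changing value between two consecutive transfers.
    """
    if word_bits % bus_bits != 0:
        raise ValueError("bus width must divide the word width")
    mask = (1 << bus_bits) - 1
    previous = 0  # assumed initial bus state: all wires low
    transfers = 0
    toggles = 0
    for word in words:
        for i in range(word_bits // bus_bits):
            chunk = (word >> (i * bus_bits)) & mask
            toggles += bin(previous ^ chunk).count("1")
            previous = chunk
            transfers += 1
    return transfers, toggles


pattern = 0x00FF00FF00FF00FF
print(count_transitions([pattern, pattern], bus_bits=64))
print(count_transitions([pattern, pattern], bus_bits=8))
```

For this particular repeating pattern, the 8-bit bus needs eight times as many transfers and toggles more wires in total than the 64-bit bus, which is one way serialization can increase switching activity. Other data patterns can behave differently.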

Figure 8: % of capacitance reduction using serialized-widened data bus for different serialization factors in 45 nm technology. (Vertical axis: capacitance reduction (%); horizontal axis: serialization factors 2, 4, 8.)

Figure 9: % of power savings for using different encoding techniques for 64-bit wide data bus for different number processing cores with several benchmark combinations. (Vertical axis: power savings (%); horizontal axis: benchmark sets for 8, 4, and 2 cores and their average; series: bi, xor, fv, fv2, tube, tube2.)

result of these two opposing effects is generally a decrease
in the power consumption (although there are cases where power is actually increased). This is the approach proposed by Hatta et al. [2] for both the address and data buses. The serialized-encoded (SE) method reduces the bus power mainly through a reduction in switching activity. There is also a slight reduction in capacitance due to the serialization. The widened-encoded (WE) approach reduces the power by minimizing both the switching activity and bus capacitance. It however has the disadvantage of increasing the bus area. Finally the serialized-widened-encoded (SWE) approach produces the best results for the architectures in Figure 6 by minimizing the bus capacitance and switching activity while keeping the bus area constant.
The rest of this section considers primarily the SW, E, and SWE approaches as these do not change the bus area. Unless explicitly stated, a 45 nm technology implementation is assumed.

5.2. Serialization-Widening (SW). Figure 7(a) shows the power savings of using a serialized-widened bus (as proposed by Hatta et al. [2]) for different bus widths and serialization factors. The results show that the SW approach performs well for narrow buses. Figure 7(b) shows the absolute power consumption of the SW approach with different architectural configurations normalized to a 64-bit wide conventional data bus. For a given bus width, the average power consumption varies little across serialization factors.
Figure 8 shows the percentage of capacitance reduction using the serialization-widening data bus approach for different serialization factors. The figure shows that a serialization factor of 4 or 8 does not provide a significant reduction of capacitance over a serialization factor of 2.

5.3. Encoding (E). Figure 9 compares the power savings from the different encoding schemes presented in Table 6 for a 64-bit L1-L2 cache data bus. Table 7 shows the power savings of the encoding techniques for various cache bus widths. For

Table 7: % of power savings for different bus widths and encoding techniques.

          64-bit      32-bit      16-bit      8-bit
bi        3.491879    13.10983    12.55175    10.22098
xor       2.292305    7.015497    4.343778    3.213289
fv        16.15159    38.09723    25.42917    5.665817
fv2       22.04569    37.30793    17.56435    2.736691
tube      10.20364    29.08284    9.316818    1.79227
tube2     22.58817    43.95828    20.94449    2.792153

Table 9: % of power savings for different bus widths and encoding techniques.

          64-bit      32-bit      16-bit
bi        25.23832    35.94936    41.25804
xor       16.42357    17.68467    22.45874
fv        55.94778    59.1091     27.96345
fv2       49.89886    51.8574     10.45392
tube      44.20373    44.15911    3.55008
tube2     53.97197    54.80531    10.87319
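
For readers unfamiliar with the "bi" rows above, the following is a minimal sketch of bus-invert coding: when more than half of the data wires would toggle, the inverted word is sent and an extra invert line is asserted. It is a simplified illustration rather than the simulator used in the experiment; the function names, the all-zero initial state, and the choice to compare only the data wires when deciding to invert are assumptions.

```python
def bus_invert_encode(words, bus_bits=8):
    """Encode `words` with bus-invert coding.

    Returns (encoded, toggles), where `encoded` is a list of
    (bus_value, invert_line) pairs and `toggles` counts wire changes,
    including changes of the invert line.
    """
    mask = (1 << bus_bits) - 1
    prev_bus, prev_inv = 0, 0  # assumed initial state: all wires low
    encoded = []
    toggles = 0
    for word in words:
        word &= mask
        distance = bin(prev_bus ^ word).count("1")
        if distance > bus_bits // 2:
            bus, inv = (~word) & mask, 1   # send the inverted word
        else:
            bus, inv = word, 0             # send the word unchanged
        toggles += bin(prev_bus ^ bus).count("1") + (prev_inv ^ inv)
        encoded.append((bus, inv))
        prev_bus, prev_inv = bus, inv
    return encoded, toggles


def bus_invert_decode(encoded, bus_bits=8):
    """Recover the original words from (bus_value, invert_line) pairs."""
    mask = (1 << bus_bits) - 1
    return [(~bus) & mask if inv else bus for bus, inv in encoded]


data = [0x00, 0xFF, 0x00, 0xFF]
encoded, toggles = bus_invert_encode(data)
print(encoded, toggles)
print(bus_invert_decode(encoded) == data)
```

The scheme costs one extra wire and bounds the number of toggling data wires per transfer to half the bus width, which is why it does not depend on a value cache hit rate.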

Table 8: Hit rate and number of hits in one or two transition cache locations using FV and FV2 techniques for 8-core dataset 1.

                               FV          FV2
Hit rate (%)                   68.08       79.87
Number of 1-transition hit     11586571    2047667
Number of 2-transition hit     0           11545453

Figure 10: % of power savings using SWE approach for different encoding techniques for 64-bit wide data bus. (Vertical axis: power savings (%); horizontal axis: benchmark sets for 8, 4, and 2 cores and their average; series: bi SW, xor SW, fv SW, fv2 SW, tube SW, tube2 SW.)

Table 10: Variation of value cache table size with encoding techniques.

Encoding     Table size                       Max. possible
technique    Data size    Number of entry     switching activity
FV           32-bit       32                  1
FV2          32-bit       528                 2
TUBE         32-bit       5                   1 or 2
             24-bit       5
             16-bit       5
             8-bit        8
             16-bit       16
TUBE2        32-bit       30                  2 or 4
             24-bit       30
             16-bit       30
             8-bit        68
             16-bit       68

the 64-bit and 32-bit wide buses, the frequent value or TUBE approaches with two hot-codes (FV2 and TUBE2) perform the best. This is mainly because the wide bus allows for a large number of entries in these encoding caches. With a 16-bit data bus width, a frequent value cache using one hot-code performs better. This is because the larger cache size of FV2 compared to FV increases the hit rate, but a large number of the hits fall in locations that require a switching activity of two instead of one. Table 8 lists the hit rate and the number of one- and two-transition cache location hits of FV2, and the number of one-transition cache location hits of FV, for the 8-core set 1 application. The data in the table show that FV2 performs poorly because a large share of the data matches hit in two-transition cache locations. One way to improve this situation is to map the most frequent data values to the cache locations that need fewer transitions. This type of encoding technique is proposed by Suresh et al. [7]. It can be easily implemented in advance, as their proposed context independent codes work for the known data sets of embedded processing systems. However, it requires a very complex hardware design to implement for a real-time data arrangement. For the 8-bit cache bus width, none of the cache-based approaches work well as their hit rates are low (since values get replaced too often). In this case bus-invert has the best performance.

5.4. Serialization-Widening with Encoding (SWE). Figure 10 compares the power savings from the different encoding schemes presented in Table 6 using the serialized-widened-encoded (SWE) scheme for a 64-bit L1-L2 cache data bus. Table 9 shows the power savings of the encoding techniques for various cache bus widths and a serialization factor of 2. For the 64-bit and 32-bit wide buses, the frequent value approach (FV) performs the best. This is mainly because the wide bus allows for a large number of entries, with a correspondingly higher switching activity (see the example in Table 8), in these encoding caches. With a 16-bit data bus width, bus-invert performs better. This is because we end up with an 8-bit bus after serialization, and the cache hit rates become too low for this configuration.

5.5. Power Savings under Different Architecture Options. Figure 11 presents the percentage of power savings for the SWE approach using frequent value encoding (FVE) and the

Figure 11: Comparison of % of power savings between different serialization factors with different cache bus width. (Panels: (a) 8-core set 1, (b) 4-core set 1, (c) 2-core set 1; vertical axis: power savings (%); horizontal axis: serialization factors S2, S4, S8 for the 64-bit bus, S2, S4 for the 32-bit bus, and S2 for the 16-bit bus; series: FV SW, best E SW.)
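
The FV and FV2 entry counts given in Table 10 for a 32-bit data size (32 and 528) are consistent with counting the codewords that assert exactly one wire, or one or two wires, out of 32. The sketch below checks this arithmetic. Reading the table this way is an assumption rather than something stated in the text, and the function name is hypothetical.

```python
from math import comb


def hot_code_entries(lines, max_hot):
    """Number of codewords on `lines` wires with 1..max_hot wires asserted."""
    return sum(comb(lines, k) for k in range(1, max_hot + 1))


one_hot_32 = hot_code_entries(32, 1)   # one hot-code on 32 lines
two_hot_32 = hot_code_entries(32, 2)   # one or two hot lines on 32 lines
print(one_hot_32, two_hot_32)
```

Under the same assumption, halving the number of bus lines shrinks the two-hot table much faster than the one-hot table, which matches the observation that a reduced number of bus lines affects the cache-based encodings differently.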

Figure 12: Absolute power consumption for 64-bit, 32-bit, and 16-bit bus, with encoding (E), serialization-widening (SW), and serialization-widening with encoding (SWE) normalized to 64-bit bus width for 32 KB L1 cache. (Horizontal axis: benchmark sets for 8, 4, and 2 cores and their average; series: 64 C, 32 C, 16 C, 64 E, 32 E, 16 E, 64 SW, 32 SW, 16 SW, 64 FV SW, 32 FV SW, 16 bi SW.)

best encoding for different bus widths and serialization factors. The amount of power savings achieved by this approach depends on several factors. These factors include cache data bus width, types of applications, number of processing cores, L1 cache size, and type of technology used. For a specified bus width, a serialization factor of 2 with encoding gives more power savings than any other combination. Although a higher serialization factor can contribute more capacitance reduction, it reduces the number of bus lines. This reduction of the number of bus lines decreases the chance of data matching for cache-based encoding. To help choose a bus width for the L1-L2 cache bus design, Figure 11 gives a comparative view of the power savings for different cache bus widths using the proposed technique and the best other encoding technique. The proposed technique works well for a wide data bus, but performs poorly for a narrow bus.
Cache bus power consumption varies with bus width, application sets, and the approach used (E, SW, or SWE). Figure 12 is a comparative view of cache bus power for a 32 KB L1 cache with 64-/32-/16-bit bus size. The graph shows that a 32-bit wide bus consumes more power than a 64-bit wide bus for most of the application sets used in this experiment. A 16-bit wide bus consumes almost the same or sometimes more power than a 32-bit wide data bus. The encoding (E) approach consumes almost the same amount of power for 64-/32-bit wide data buses. This indicates that the power consumption of the E approach is largely independent of the bus size. A 16-bit data bus requires slightly higher power than either a 64-bit or 32-bit wide data bus using the E approach. The SW approach gives a similar result for the 64-bit and 32-bit data buses. However, a 16-bit data bus requires considerably less power than a 64-bit or 32-bit bus using the SW approach. Using the SWE

Figure 13: Comparison of (a) % of absolute power savings using different L1 cache sizes for a 64-bit wide data bus using serialization-widening with frequent value encoding and (b) % of relative power savings using a 64-bit wide bus compared to a 32-bit wide bus (both of the bus used serialization-widening with frequent value encoding). (Horizontal axis: benchmark sets for 8, 4, and 2 cores and their average; series: 64 KB, 32 KB, 16 KB.)

Figure 14: % of capacitance reduction for using serialized-widened bus with respect to conventional bus for a serialization factor of 2 for different technologies. (Vertical axis: capacitance reduction (%); horizontal axis: technology node, 70, 45, 35, 25, and 20 nm.)

approach, a 64-bit wide data bus consumes approximately 22% less power than a 32-bit wide data bus for the same application sets. The best encoding that supports the SWE approach is frequent value encoding (FVE). FVE works much better with the SWE approach than other cache-based techniques because of the reduced number of bus lines. The value cache size of the cache-based encoding depends on the number of bus lines. The reduced number of bus lines reduces the number of value cache entries, which hurts the chance of a data match for TUBE. For FV2 and TUBE2, the table size grows to a large number, but the overhead increases, yielding a large amount of switching activity. Table 10 gives a comparison of value cache size among different cache-based encoding techniques for a 32-bit data bus. The same comparison for the 32-bit and 16-bit data buses indicates that the SWE approach (FVE as the best encoding) for a 32-bit wide bus consumes approximately 17% less power than that of a 16-bit wide data bus. The results also show that the SW approach and the SWE approach perform more or less the same for a narrow bus (a 16-bit wide data bus).
Reliability is another concern which points to the need for low-power design. There is a close correlation between the power dissipation of circuits and reliability problems such as electromigration and hot-carrier effects. Also, the thermal stress caused by heat dissipation on chip is a major reliability concern. Consequently, the reduction of power consumption is also crucial for reliability enhancement. As future work, we plan to evaluate the constraints on reliability and power in another paper.

5.6. Different L1 Cache Size. Figure 13 gives a comparison of the absolute power savings of using a 64-bit wide data bus with the SWE (FVE) approach having 64 KB, 32 KB, and 16 KB L1 cache size. According to the results, the cache size does not affect the power savings of the proposed technique. Although the cache size can change the order of data transitions through the cache bus, the proposed technique works well irrespective of the changes in the data transmitted through the data bus. Thus, the proposed approach gives consistent results as the L1 data cache size varies. This figure also compares the percentage of relative power savings of using a 64-bit wide bus compared to a 32-bit bus for the same L1 cache size. A different bus size may change the ordering of the same data set and can significantly affect the switching activity. So, changing the cache size alters the data requests from the lower level cache, and passing the data requests over a different bus width may change the switching activity. This effect can be seen in Figure 13(b), but the results still favor a 64-bit wide data bus over a 32-bit wide data bus from a power saving standpoint.

5.7. Different Technologies. This work extends the experiment to different technologies rather than limiting it to different cache bus widths and L1 cache sizes. As industry has already started to manufacture at process technologies below 65 nm, the experiment considers small gate sizes of 70, 45, 35, 25, and 20 nm technology. The experiment finds the capacitance

Figure 15: (a) % of power savings using different technologies for a 64-bit data bus experimenting on application set 1 in 8 processing cores, (b) absolute power consumption of the same set (8 core set 1) for different technologies (power consumption values are normalized to 70 nm technology). (Horizontal axis: technology node, 70, 45, 35, 25, and 20 nm; series: C, E, SW, FVE SW.)
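
The role of wire spacing in the relation C = ε ∗ A/d can be illustrated with an idealized parallel-plate sketch. It models only the coupling between two neighbouring wires; real wires also have capacitance to other layers, so measured reductions need not match this idealized figure. The function names and all numeric values are hypothetical.

```python
EPSILON_0 = 8.854e-12  # vacuum permittivity in F/m


def plate_capacitance(eps_r, area, distance):
    """Idealized parallel-plate capacitance C = eps_r * eps_0 * A / d."""
    return eps_r * EPSILON_0 * area / distance


def reduction_percent(c_before, c_after):
    """Capacitance reduction relative to c_before, in percent."""
    return 100.0 * (c_before - c_after) / c_before


# Illustrative values only (assumed): same facing area, doubled spacing.
c_conventional = plate_capacitance(eps_r=3.0, area=1e-12, distance=1e-7)
c_widened = plate_capacitance(eps_r=3.0, area=1e-12, distance=2e-7)
print(f"{reduction_percent(c_conventional, c_widened):.1f}% reduction")
```

In this idealized model the coupling term scales inversely with the spacing, so the extra space freed by serialization translates directly into lower wire-to-wire capacitance.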

Figure 16: % of over power savings for using split cache instead of unified cache in cache-based encoding techniques. (Vertical axis: power savings (%); horizontal axis: benchmark sets for 8, 4, and 2 cores and their average; series: fv, fv2, tube, tube2.)

reduction for different technologies as shown in Figure 14. Figure 15(a) presents a comparison of the power savings using encoding, serialization-widening, and serialization-widening with FVE. The results shown in Figure 15 use a 64-bit wide data bus for application set 1 in 8 processing cores. The amount of power savings is similar for the different technologies, but the absolute power consumption reduces as the technology shrinks, as shown in Figure 15(b). This is because the swing voltage reduces with shrinking technology [27]. Although shrinking the technology increases the capacitance (capacitance, 𝐶 = 𝜀 ∗ 𝐴/𝑑), serialization-widening gives the advantage of using extra space between the wires, which reduces the overall capacitance compared to the conventional bus and finally reduces the total power budget. Using this advantage, the proposed approach improves the power savings significantly.

5.8. Split Value Cache versus Unified Value Cache. Frequent value encoding (FVE) uses a unified value cache (VC) to implement the VC structure. The size of the VC depends on the type of pattern matching algorithm (full or partial) and the type of hot-code (one or two) used in the implementation. In the proposed technique, the simulation uses two sets of VC instead of a unified VC. Figure 16 presents a comparison of the power consumption of the two VC organizations. The two sets of VC hold the least significant bits (LSB) and most significant bits (MSB) parts of the data value when implementing the serialization-encoding approach, as serialization breaks up the whole data sequence. Using two sets of VC increases the chance of hits in the VC. It also keeps the two separate VCs consistent, as the LSB part changes more frequently than the MSB part. This removes the need for frequent replacement in these two types of VC implementation. The figure shows that using two separate VC structures gives approximately 5% more power savings for the FVE-based technique and 8% more for TUBE than using one unified VC.

5.9. Widened-Encoded Data Bus of 32 Bit Wide. A widened-encoded (WE) 32-bit wide data bus requires the same area as the SWE approach with a 64-bit wide bus. The results of Figure 6 show that the WE approach comes very close to the SWE approach in power savings. However, the 64-bit wide WE data bus requires double the area. This motivates a comparison of the power consumption of the WE approach with a 32-bit data bus against the SWE approach with a 64-bit wide data bus. Figure 17(a) gives the absolute power consumption normalized to the 64-bit wide conventional data bus. The results show that these two approaches consume almost the same amount of power. The benefit of using the WE approach is that it does not require a higher operating frequency. However, it pays a performance loss in terms of IPC for using a narrow data bus. The experimental results for the performance loss are shown in Figure 17(b).

5.10. Performance Overhead. Performance overhead is a significant issue in designing a serializer with a frequent value cache (FVC) unit. Figure 18 presents the architectural configuration of a serializer-deserializer with the FVC unit between the L1-L2 cache block.

Figure 17: (a) Comparison of absolute power consumption of SWE approach (64 bit wide data bus) and WE approach (32-bit wide data bus) and (b) % of performance loss of using 32-bit wide data bus instead of 64-bit wide data bus. (Vertical axes: (a) absolute power normalized to 64 C, (b) performance loss in terms of IPC (%); horizontal axis: benchmark sets for 8, 4, and 2 cores and their average; series in (a): 64 SWE, 32 WE.)

Figure 18: Architectural configuration of serializer-deserializer with FVC unit. (Block diagram: each L1 cache connects through a serializer, an FVC, a driver, and a control line to a deserializer and an FVC at the L2 cache.)

Hatta et al. [2] presented a novel work on bus serialization-widening and showed that the serialized-widened bus operates at a higher frequency than the conventional bus. Liu et al. [32] discussed pipelined bus arbitration with encoding to minimize the performance penalty, which might be less than 1 cycle. Although most of these works support a minimized performance penalty using serialization-widening with the frequent value technique, the penalty is 2 cycles in the worst case. This work runs the simulation using 2 cycle and 1 cycle performance penalties for L2 data cache access using different application sets. The results present the performance loss in Figure 19. This work further includes a comparative view of absolute power savings using a 64-bit wide bus with serialization-widening and frequent value encoding for 32 KB L1 cache size at 45 nm technology. According to the data of Figure 19, the average performance loss is about 2.5% in the worst case (2 cycle penalty), that is, if the approach cannot exploit the advantages of the faster serialized bus and pipelining in data transmission. This comes down to an average performance degradation of 1.35% for a 1 cycle penalty. Further investigation looks into the area required for additional circuitry. Different citations find that a minimum of approximately 0.05 mm² area is required to implement the value cache with serializer. The additional peripheral circuitry also consumes an extra 2% of the power required by the wire [2, 7, 35].

6. Conclusion

In system power optimization, the on-chip memory buses are good candidates for minimizing the overall power budget. This paper explored a framework for memory data bus power minimization techniques from an architectural standpoint. A thorough comparison of power minimization techniques used for an on-chip memory data bus was presented. For the on-chip data bus, a serialization-widening approach with frequent value encoding (SWE) was proposed as the best power savings approach of all the approaches considered. In summary, the findings of this study for the on-chip data bus power minimization include the following.

(i) The SWE approach is the best power savings approach, with frequent value encoding (FVE) providing the best results among all other cache-based encodings for the same process node.
(ii) The SWE approach (FVE as encoding) achieves approximately 54% overall power savings and 57% and 77% more power savings than individual serialization or the best encoding technique for the 64-bit wide data bus. This approach also provides approximately 22% more power savings for a 64-bit wide bus than for a 32-bit wide data bus using 32 KB L1 cache and 45 nm technology.
(iii) For a 32-bit wide data bus, the SWE approach (FVE as encoding) gives approximately 59% overall power savings and 17% more power savings than a 16-bit wide bus for the same L1 cache size and technology.
(iv) For different cache sizes (64 KB L1 cache size and 45 nm technology), a 64-bit wide data bus gives approximately 59% overall power savings and 29% more power savings than a 32-bit wide data bus using the SWE approach with FV encoding.

In conclusion, novel approaches for on-chip memory data bus power minimization were presented. The simulation studies for the same process node indicate that the proposed techniques outperform the approaches found in the literature in terms of power savings for the various applications considered. The

Figure 19: (a) % of performance degradation in term of instruction per cycle (IPC) for using 64-bit serialized bus with encoding for 2/1 cycle performance penalty instead of using conventional bus and (b) % of power savings. (Horizontal axis: benchmark sets for 8, 4, and 2 cores and their average; series in (a): 2 cycles, 1 cycle.)

work in this paper primarily involved software simulation of the proposed techniques for bus power minimization, taking performance overhead into account. In future work, we will continue to evaluate the proposed approach at smaller process nodes (14 and 10 nm), with particular attention to reliability on the new processes.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] M. R. Stan and K. Skadron, “Power-aware computing,” IEEE Computer, vol. 36, no. 12, pp. 35–38, 2003.
[2] N. Hatta, N. D. Barli, C. Iwama et al., “Bus serialization for reducing power consumption,” Proceedings of SWoPP, 2004.
[3] B. Jacob and V. Cuppu, “Organizational design trade-offs at the DRAM, memory bus and memory controller level: initial results,” Tech. Rep. UMD-SCA-TR-1999-2, University of Maryland Systems & Computer Architecture Group, 1999.
[4] Rambus Inc, Rambus Signaling Technologies: RSL, QRSL and SerDes Technology Overview, Rambus Inc, 2000.
[5] M. Loghi, M. Poncino, and L. Benini, “Cycle-accurate power analysis for multiprocessor systems-on-a-chip,” in Proceedings of the ACM Great Lakes Symposium on VLSI, pp. 401–406, April 2004.
[6] K. Mohanram and S. Rixner, “Context-independent codes for off-chip interconnects,” in Power-Aware Computer Systems, vol. 3471 of Lecture Notes in Computer Science, pp. 107–119, 2005.
[7] D. C. Suresh, B. Agrawal, W. A. Najjar, and J. Yang, “VALVE: variable Length Value Encoder for off-chip data buses,” in Proceedings of the International Conference on Computer Design (ICCD ’05), pp. 631–633, San Jose, Calif, USA, October 2005.
[8] M. R. Stan and W. P. Burleson, “Coding a terminated bus for low power,” in Proceedings of the 5th Great Lakes Symposium on VLSI, pp. 70–73, March 1995.
[9] K. Basu, A. Choudhury, J. Pisharath, and M. Kandemir, “Power protocol: reducing power dissipation on off-chip data buses,” in Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, pp. 345–355, 2002.
[10] N. R. Mahapatra, J. Liu, K. Sundaresan, S. Dangeti, and B. V. Venkatrao, “A limit study on the potential of compression for improving memory system performance, power consumption, and cost,” Journal of Instruction-Level Parallelism, vol. 7, pp. 1–37, 2005.
[11] A. Park and M. Farrens, “Address compression through base register caching,” in Proceedings of the Annual ACM/IEEE International Symposium on Microarchitecture, pp. 193–199, November 1990.
[12] M. Farrens and A. Park, “Dynamic base register caching: a technique for reducing address bus width,” in Proceedings of the 18th International Symposium on Computer Architecture, pp. 128–137, May 1991.
[13] D. Citron and L. Rudolph, “Creating a wider bus using caching techniques,” in Proceedings of the International Symposium on High Performance Computer Architecture, pp. 90–99, January 1995.
[14] K. Sunderasan and N. Mahapatra, “Code compression techniques for embedded systems and their effectiveness,” in Proceedings of the IEEE Computer Society Annual Symposium on VLSI, pp. 262–263, February 2003.
[15] L. Li, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, and I. Kadayif, “CCC: crossbar connected caches for reducing energy consumption of on-chip multiprocessors,” in Proceedings of the Euromicro Symposium on Digital Systems Design (DSD ’03), 2003.
[16] P. P. Sotiriadis and A. P. Chandrakasan, “A bus energy model for deep submicron technology,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 10, no. 3, pp. 341–349, 2002.
[17] P. P. Sotiriadis and A. Chandrakasan, “Low power bus coding techniques considering inter-wire capacitances,” in Proceedings of the IEEE 22nd Annual Custom Integrated Circuits Conference (CICC ’00), pp. 507–510, May 2000.
[18] J. Henkel and H. Lekatsas, “A2BC: adaptive address bus coding for low power deep sub-micron designs,” in Proceedings of the IEEE 38th Design Automation Conference, pp. 744–749, June 2001.
[19] T. Lindkvist, “Additional knowledge of bus invert coding schemes,” in Proceedings of the IEEE 5th International Workshop on System-on-Chip for Real-Time Applications (IWSOC ’05), pp. 301–303, Alberta, Canada, July 2005.

[20] T. Lindkvist, J. Löfvenberg, and O. Gustafsson, “Deep sub-micron bus invert coding,” in Proceedings of the 6th Nordic Signal Processing Symposium (NORSIG ’04), pp. 133–136, Espoo, Finland, June 2004.
[21] K.-W. Kim, K.-H. Baek, N. Shanbhag, C. L. Liu, and S.-M. Kang, “Coupling-driven signal encoding scheme for low-power interface design,” in Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pp. 318–321, San Jose, Calif, USA, 2000.
[22] S. Komatsu, M. Ikeda, and K. Asada, “Bus power encoding with coupling-driven adaptive code-book method for low power data transmission,” in Proceedings of the European Solid-State Circuits Conference, 2001.
[23] J.-H. Chern, J. Huang, L. Arledge, P.-C. Li, and P. Yang, “Multilevel metal capacitance models for CAD design synthesis systems,” Electron Device Letters, vol. 13, no. 1, pp. 32–34, 1992.
[24] K. Mohammad, A. Dodin, B. Liu, and S. Agaian, “Reduced voltage scaling in clock distribution networks,” VLSI Design, vol. 2009, Article ID 679853, 7 pages, 2009.
[25] K. Mohammad, B. Liu, and S. Agaian, “Energy efficient swing signal generation circuits for clock distribution networks systems,” in Proceedings of the IEEE International Conference on Man and Cybernetics, pp. 3495–3498, 2009.
[26] K. Mohammad, S. Agaian, and F. Hudson, “Efficient FPGA implementation of convolution,” in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, San Antonio, Tex, USA, October 2009, paper ID 3922.
[27] “International Technology Roadmap for Semiconductors,” http://www.itrs.net.
[28] H. Kawaguchi and T. Sakurai, “Delay and noise formulas for capacitively coupled distributed RC lines,” in Proceedings of the 3rd Conference of the Asia and South Pacific Design Automation (ASP-DAC ’98), pp. 35–43, February 1998.
[29] C.-L. Su, C.-Y. Tsui, and A. M. Despain, “Saving power in the control path of embedded processors,” IEEE Design and Test of Computers, vol. 11, no. 4, pp. 24–30, 1994.
[30] M. R. Stan and W. P. Burleson, “Bus-invert coding for low-power I/O,” IEEE Transactions on VLSI Systems, vol. 3, no. 1, pp. 49–58, 1995.
[31] L. Benini, G. de Micheli, E. Macii, D. Sciuto, and C. Silvano, “Asymptotic zero-transition activity encoding for address busses in low-power microprocessor-based systems,” in Proceedings of the 7th Great Lakes Symposium on VLSI, pp. 77–82, March 1997.
[32] C. Liu, A. Sivasubramaniam, and M. Kandemir, “Optimizing bus energy consumption of on-chip multiprocessors using frequent values,” in Proceedings of the 12th Euromicro Conference on Parallel, Distributed and Network-based Processing (PDP ’04), pp. 340–347, February 2004.
[33] J. Yang and R. Gupta, “Frequent value locality and its applications,” ACM Transactions on Embedded Computing Systems, vol. 1, no. 1, pp. 79–105, 2002.
[34] J. Yang, R. Gupta, and C. Zhang, “Frequent value encoding for low power data buses,” ACM Transactions on Design Automation of Electronic Systems, vol. 9, no. 3, pp. 354–384, 2004.
[35] D. C. Suresh, B. Agrawal, J. Yang, and W. Najjar, “A tunable bus encoder for off-chip data buses,” in Proceedings of the International Symposium on Low Power Electronics and Design, pp. 319–322, San Diego, Calif, USA, August 2005.
[36] W.-C. Cheng and M. Pedram, “Memory bus encoding for low power: a tutorial,” in Proceedings of the International Symposium on Quality Electronic Design (ISQED ’01), p. 1999, 2001.
[37] T. Lang, E. Musoll, and J. Cortadella, “Extension of the working-zone-encoding method to reduce the energy on the microprocessor data bus,” in Proceedings of the IEEE International Conference on Computer Design, pp. 414–419, October 1998.
[38] L. Benini, G. de Micheli, E. Macii, M. Poncino, and S. Quer, “System-level power optimization of special purpose applications: the beach solution,” in Proceedings of the International Symposium on Low Power Electronics and Design, pp. 24–29, Monterey, Calif, USA, August 1997.
[39] L. Benini, G. de Micheli, E. Macii, M. Poncino, and C. Silvano, “Address bus encoding techniques for system level power optimization,” in Proceedings of the Design Automation and Test in Europe, pp. 861–866, Paris, France, February 1998.
[40] N. Chang, K. Kim, and J. Cho, “Bus encoding for low-power high-performance memory systems,” in Proceedings of the 37th Design Automation Conference (DAC ’00), pp. 800–805, June 2000.
[41] W.-C. Cheng and M. Pedram, “Power-optimal encoding for DRAM address bus,” in Proceedings of the Symposium on Low Power Electronics and Design (ISLPED ’00), pp. 250–252, July 2000.
[42] S. Ramprasad, N. R. Shanbhag, and I. N. Hajj, “A coding framework for low-power address and data busses,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 7, no. 2, pp. 212–221, 1999.
[43] E. Musoll, T. Lang, and J. Cortadella, “Exploiting the locality of memory references to reduce the address bus energy,” in Proceedings of the International Symposium on Low Power Electronics and Design, pp. 202–207, Monterey, Calif, USA, August 1997.
[44] Y. Shin, S.-I. Chae, and K. Choi, “Partial bus-invert coding for power optimization of system level bus,” in Proceedings of the International Symposium on Low Power Electronics and Design, pp. 127–129, August 1998.
[45] M. R. Stan and W. P. Burleson, “Two-dimensional codes for low power,” in Proceedings of the International Symposium on Low Power Electronics and Design, pp. 335–340, August 1996.
[46] S. Yoo and K. Choi, “Interleaving partial bus-invert coding for low power reconfiguration of FPGAs,” in Proceedings of the 6th International Conference on VLSI and CAD, pp. 549–552, 1999.
[47] C. Lee, M. Potkonjak, and W. H. Mangione-Smith, “MediaBench: a tool for evaluating and synthesizing multimedia and communications systems,” in Proceedings of the 30th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 330–335, December 1997.
[48] SimpleScalar Simulator, “SimpleScalar LLC,” http://www.simplescalar.com/.
[49] SPEC, “SPEC CPU2000 Benchmark Suite Ver 1.2,” http://www.spec.org/osg/cpu2000/.
