Advanced VLSI Architecture Design For Emerging Digital Systems
This is a special issue published in “VLSI Design.” All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Editorial Board
Roc Berenguer, Spain Marcelo Lubaszewski, Brazil M. Renovell, France
Chien-In Henry Chen, USA Mohamed Masmoudi, Tunisia Peter Schwarz, Germany
Kiyoung Choi, Korea Antonio Mondragon-Torres, USA Jose Silva-Martinez, USA
Anh Tuan Do, Singapore Jose Carlos Monteiro, Portugal Antonio G. M. Strollo, Italy
Ethan Farquhar, USA Fateme Moradi, Iran Junqing Sun, USA
Dimitri Galayko, France Farshad Moradi, Denmark Rached Tourki, Tunisia
David Hernandez-Garduno, USA Maurizio Palesi, Italy Spyros Tragoudas, USA
Lazhar Khriji, Oman Rubin A. Parekhji, India Sungjoo Yoo, Korea
Israel Koren, USA Zebo Peng, Sweden Avi Ziv, Israel
David S. Kung, USA Gregory Peterson, USA
Chang-Ho Lee, USA A. Postula, Australia
Contents
Advanced VLSI Architecture Design for Emerging Digital Systems, Yu-Cheng Fan, Qiaoyan Yu,
Thomas Schumann, Ying-Ren Chien, and Chih-Cheng Lu
Volume 2014, Article ID 746132, 2 pages
Engineering Change Orders Design Using Multiple Variables Linear Programming for VLSI Design,
Yu-Cheng Fan, Chih-Kang Lin, Shih-Ying Chou, Chun-Hung Wang, Shu-Hsien Wu, and Hung-Kuan Liu
Volume 2014, Article ID 698041, 5 pages
Design of Smart Power-Saving Architecture for Network on Chip, Trong-Yen Lee and Chi-Han Huang
Volume 2014, Article ID 531653, 10 pages
Optimization of Fractional-N-PLL Frequency Synthesizer for Power Effective Design, Sahar Arshad,
Muhammad Ismail, Usman Ahmad, Anees ul Husnain, and Qaiser Ijaz
Volume 2014, Article ID 406416, 7 pages
Performance Analysis of Modified Drain Gating Techniques for Low Power and High Speed Arithmetic
Circuits, Shikha Panwar, Mayuresh Piske, and Aatreya Vivek Madgula
Volume 2014, Article ID 380362, 5 pages
Gate-Level Circuit Reliability Analysis: A Survey, Ran Xiao and Chunhong Chen
Volume 2014, Article ID 529392, 12 pages
Efficient Hardware Trojan Detection with Differential Cascade Voltage Switch Logic, Wafi Danesh,
Jaya Dofe, and Qiaoyan Yu
Volume 2014, Article ID 652187, 11 pages
Editorial
Advanced VLSI Architecture Design for Emerging
Digital Systems
Yu-Cheng Fan,1 Qiaoyan Yu,2 Thomas Schumann,3 Ying-Ren Chien,4 and Chih-Cheng Lu5
1 Department of Electronic Engineering, National Taipei University of Technology, Taipei 10608, Taiwan
2 Department of Electrical and Computer Engineering, University of New Hampshire, Durham, NH 03824, USA
3 Department of Electrical Engineering and Information Technology, Hochschule Darmstadt-University of Applied Sciences, Birkenweg 8, 64295 Darmstadt, Germany
4 Department of Electrical Engineering, National Ilan University, Yilan 260, Taiwan
5 Division for Biomedical & Industrial IC Technology, Industrial Technology Research Institute, Hsinchu 310, Taiwan
Copyright © 2014 Yu-Cheng Fan et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
With physical feature sizes in VLSI designs decreasing rapidly, existing efficient architecture designs need to be reexamined. Advanced VLSI architecture designs are required to further reduce power consumption, compress chip area, and speed up operating frequency for high performance integrated circuits. With time-to-market pressure and rising mask costs in the semiconductor industry, engineering change order (ECO) design methodology plays a main role in advanced chip design. Digital systems such as communication and multimedia applications demand advanced VLSI architecture design methodologies so that low power consumption, small area overhead, high speed, and low cost can be achieved.
This special issue is dedicated to aspects of VLSI architecture design and their applications. Special interest focuses on emerging digital systems. This special issue contains eight papers that focus on the power minimization design, efficient hardware Trojan detection, low-area Wallace multiplier, gate-level circuit reliability analysis, low power and high speed arithmetic circuits, power effective fractional-N-PLL frequency synthesizer, power-saving architecture for network on chip, and ECO design.
In the paper entitled “On-chip power minimization using serialization-widening with frequent value encoding,” the authors address the problem of the high-power consumption of the on-chip data buses by exploring a new framework for memory data bus. In particular, serialization-widening (SW) of data bus with frequent value encoding (FVE) is proposed to minimize the power consumption of the on-chip cache data bus.
In the paper entitled “Efficient hardware trojan detection with differential cascade voltage switch logic,” the authors propose to exploit the inherent feature of differential cascade voltage switch logic (DCVSL) to detect hardware trojans (HTs) at runtime. By examining special power characteristics of DCVSL systems upon HT insertion, the authors can detect HTs, even if the HT size is small. Simulation results show that the method achieves up to 100% HT detection rate. The evaluation on ISCAS benchmark circuits shows that the scheme obtains an HT detection rate in the range of 66% to 98%.
In the paper entitled “Low-Area Wallace multiplier,” the authors propose a reduced-area Wallace multiplier without compromising on the speed of the original Wallace multiplier. The proposed designs are synthesized using Synopsys Design Compiler in 90 nm process technology and achieve the lowest area cost as compared to other tree-based multipliers. The speed of the proposed and reference multipliers is almost the same.
In the paper entitled “Gate-level circuit reliability analysis: a survey,” the authors provide an overview of some typical methods for reliability analysis with special focus on gate-level circuits that are either large or small, with or without
Acknowledgments
Finally, the Guest Editors would like to thank all the authors
who sent their valuable contributions and all the reviewers for
their valuable comments.
Yu-Cheng Fan
Qiaoyan Yu
Thomas Schumann
Ying-Ren Chien
Chih-Cheng Lu
Hindawi Publishing Corporation
VLSI Design
Volume 2014, Article ID 698041, 5 pages
http://dx.doi.org/10.1155/2014/698041
Research Article
Engineering Change Orders Design Using Multiple Variables
Linear Programming for VLSI Design
Copyright © 2014 Yu-Cheng Fan et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
This paper presents an engineering change order (ECO) design method for VLSI design based on multiple variable linear programming. The approach addresses the main resource issue between spare cells and target cells. We adopt a linear programming technique to plan and balance the spare cells and target cells so that the design meets the new specification according to logic transformation. The proposed method solves the related resource problem in ECO and provides a good solution. The scheme offers a new concept for managing spare cells so that they can meet the possible target cells in ECO research.
1. Introduction
Engineering change orders (ECO) are an important technique for changing an integrated circuit (IC) layout and compensating for design problems. Traditionally, when a chip shows errors, it often requires new photomasks for all layers. However, photomasks for deep-submicron semiconductor fabrication processes are very expensive. In order to save money, ECO technology modifies only a few of the metal layers (metal-mask ECO), which avoids the cost of new photomasks for all layers [1].
To perform the ECO, IC designers sprinkle many unused logic gates into the design during the IC design flow. When the chip is manufactured and shows design errors, IC designers modify the gate-level netlist using the presprinkled unused logic gates. At the same time, the designers track and verify the modification to check formal equivalence after the ECO process. The designers must guarantee that the revised design matches the revised specification.
How can ECO be achieved efficiently? Several works address this problem and provide related solutions. Tan and Jiang describe a typical metal-only ECO flow with four steps: placement and spare cell distribution, logic difference extraction, metal-only ECO synthesis, and ECO routing [2]. Kuo et al. insert spare cells with constant insertion for engineering change and describe an iterative method to determine feasible mapping solutions for an EC problem [3]. Besides, in order to perform ECO efficiently, the works in [4–9] adopt minimal change EC equations automatically. Brand proposed an incremental synthesis method [4]. Huang presented a hybrid tool for automatic logic rectification [5]. Lin et al. addressed logic synthesis techniques for engineering change problems [6]. Shinsha et al. performed incremental logic synthesis through gate logic structure identification [7]. Swamy et al. achieved minimal logic resynthesis for engineering change [8]. Watanabe and Brayton presented another kind of incremental synthesis technique for engineering changes [9]. However, few researchers discuss the resource balance between spare cells and target cells. Therefore, in this paper we adopt a linear programming technique to plan and balance the spare cells and target cells. The proposed scheme meets the new specification according to logic transformation and overcomes the related resource problems in ECO research.
This paper is organized as follows. In Section 2, we address the typical ECO design flow. In Section 3, logic transformation is discussed. In Section 4, multiple variables linear
Figure 3: Example of an ECO problem. (a) EC equation: output = (𝐴 + (𝐵𝐶) ) . (b) Spare cells. (c) Mapping: output = (𝐴 𝐵𝐶).
Figure 4: AOI22 can be implemented by two NAND and one AND cells.
Figure 5: ECO design using multiple variables linear programming for VLSI design and relation of logic transformation.
𝐶𝑑 + 𝐷𝑑 + 𝐸𝑑 ≧ 𝑌4; (5)
𝐸𝑑 + 𝐸𝑒 ≧ 𝑌5. (6)
However, spare cells are not often enough; designer should balance the spare cell allocation to meet all requirements of desirable cells.
We assume one case when 𝐵𝑏 ≦ 𝑌2. In order to provide enough spare cells, we should increase the number of 𝐴𝑏 to achieve 𝐴𝑏 + 𝐵𝑏 ≧ 𝑌2.
Similarly, when 𝐶𝑐 + 𝐷𝑐 ≦ 𝑌3, we should increase 𝐵𝑐 number to meet 𝐵𝑐 + 𝐶𝑐 + 𝐷𝑐 ≧ 𝑌3.
Therefore, we define another restriction rule of the engineering change orders design which is written as follows:
𝐴𝑎 = 𝑋1 − 𝐴𝑏;
𝐵𝑎 = 𝑋2 − 𝐵𝑏 − 𝐵𝑐. (7)
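The balancing described above can be illustrated with a small search. The sketch below is a brute-force enumeration in Python, not a linear programming solver and not the authors' implementation. It assumes that the type-1 target is served by 𝐴𝑎 + 𝐵𝑎 ≧ 𝑌1, uses restriction rule (7) and the requirement 𝐴𝑏 + 𝐵𝑏 ≧ 𝑌2 from the text, and treats the function name `feasible_splits` and all numeric values as hypothetical.

```python
def feasible_splits(x1, x2, y1, y2, bc=0):
    """Enumerate integer splits (Ab, Bb) of the spare cells that meet both targets.

    x1, x2: available spare cells of logic A and logic B.
    y1, y2: required target cells of type 1 and type 2.
    bc:     cells of logic B already reserved for target type 3.
    """
    splits = []
    for ab in range(x1 + 1):
        for bb in range(x2 - bc + 1):
            aa = x1 - ab          # restriction rule (7): Aa = X1 - Ab
            ba = x2 - bb - bc     # restriction rule (7): Ba = X2 - Bb - Bc
            # Assumed requirement for target 1, stated requirement for target 2.
            if aa + ba >= y1 and ab + bb >= y2:
                splits.append((ab, bb))
    return splits


# Hypothetical supply and demand: 3 + 4 spare cells, 4 + 3 target cells.
print(feasible_splits(x1=3, x2=4, y1=4, y2=3))
```

With these example numbers the supply exactly equals the demand, so every feasible split moves exactly three cells to target type 2. Raising `y2` to 4 leaves no feasible split, which is the case where the designer would have to add spare cells.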
According to formulas (2) and (7), we can balance the number of 𝐴𝑏, 𝐵𝑏, and 𝐵𝑐 to achieve the target number 𝑌1. Consider the following:
𝑋1 − 𝐴𝑏 + 𝑋2 − 𝐵𝑏 − 𝐵𝑐 ≧ 𝑌1. (8)
In a similar way, we define the restriction rule of the engineering change orders design which is written as follows:
𝐵𝑐 = 𝑋2 − 𝐵𝑎 − 𝐵𝑏;
𝐶𝑐 = 𝑋3 − 𝐶𝑑;
𝐷𝑐 = 𝑋4 − 𝐷𝑑. (9)
According to formulas (4) and (9), we can balance the number of 𝐵𝑐, 𝐶𝑐, and 𝐷𝑐 to achieve the target number 𝑌3. Consider
𝑋2 − 𝐵𝑎 − 𝐵𝑏 + 𝑋3 − 𝐶𝑑 + 𝑋4 − 𝐷𝑑 ≧ 𝑌3. (10)
We model the engineering change order problems using multiple variables linear programming. From these functions, we can understand the relation between supply and requirement in an engineering change order. The designer can then estimate and perform ECO using spare cells efficiently.

5. Discussion
In this section, we discuss the advantages and disadvantages of the related works. Table 1 shows an ECO method comparison. The proposed approach designs a multiple variable linear programming ECO for VLSI design. Our method can predict cell resources accurately using multiple variable linear programming techniques, which traditional ECO does not predict well. Besides, our scheme provides a highly accurate prediction of the number of patching logic cells needed to balance spare cells and target cells, which is hard for traditional ECO methods. Moreover, we define the restriction rule, resource optimization, and solution boundary of the ECO problem to increase the efficiency of the proposed ECO method and provide a good solution.

6. Conclusion
In this paper, we proposed an engineering change orders design using multiple variables linear programming for VLSI design. The paper discusses the typical ECO design flow, logic transformation, and multiple variables linear programming for VLSI design. The presented scheme estimates the resource of spare cells and provides a good solution to ECO problems.

Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments
This work was supported by the National Science Council of Taiwan under Grant nos. NSC 101-2221-E-027-135-MY2 and 102-2622-E-027-008-CC3. The authors gratefully acknowledge the Chip Implementation Center (CIC), for supplying the technology models used in IC design.

References
[1] J. A. Roy and I. L. Markov, “ECO-system: embracing the change in placement,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 12, pp. 2173–2185, 2007.
[2] C. Tan and I. H. Jiang, “Recent research development in metal-only ECO,” in Proceedings of the 54th IEEE International Midwest Symposium on Circuits and Systems (MWSCAS ’11), pp. 1–4, August 2011.
[3] Y. M. Kuo, Y. T. Chang, S. C. Chang, and M. Marek-Sadowska, “Spare cells with constant insertion for engineering change,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 28, no. 3, pp. 456–460, 2009.
[4] D. Brand, A. Drumm, S. Kundu, and P. Narain, “Incremental synthesis,” in Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pp. 14–18, 1994.
[5] S. Huang, K. Chen, and K. Cheng, “AutoFix: A hybrid tool for automatic logic rectification,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 18, no. 9, pp. 1376–1384, 1999.
[6] C. Lin, K. Chen, and M. Marek-Sadowska, “Logic synthesis for engineering change,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 18, no. 2-3, pp. 282–292, 1999.
[7] T. Shinsha, T. Kubo, Y. Sakataya, J. Koshishita, and K. Ishihara, “Incremental logic synthesis through gate logic structure identification,” in Proceedings of the IEEE/ACM Conference on Design Automation, pp. 391–397, Jun 1986.
[8] G. Swamy, S. Rajamani, C. Lennard, and R. K. Brayton, “Minimal logic re-synthesis for engineering change,” in Proceedings of
Research Article
Design of Smart Power-Saving Architecture for Network on Chip
Copyright © 2014 T.-Y. Lee and C.-H. Huang. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
In network-on-chip (NoC), transferring data through virtual channels can avoid data loss and deadlock. Each input or output port of a router contains many virtual channels. Because a router includes five I/O ports, power is a very important issue in virtual channels. In this paper, a novel architecture, namely, Smart Power-Saving (SPS), is proposed for low power consumption and low area in the virtual channels of NoC. The SPS architecture dynamically saves power and optimizes area in NoC according to different environmental factors. Compared with related works, the proposed method reduces power consumption by 37.31%, 45.79%, and 19.26% and reduces area by 49.4%, 25.5%, and 14.4%, respectively.
1. Introduction
In recent years, 3-dimensional IC and TSV (Through-Silicon Via) technology have been proposed to solve area issues. The 3-dimensional IC of the Intel Ivy Bridge processor and the 16-core multicore architecture can be implemented in 22 nm [1]. Therefore, multicore and heterogeneous systems are popular research topics in SoC (system-on-chip). These architectures require high throughput and performance to transfer data in a multicore SoC. The NoC (network-on-chip) has been proposed to meet this requirement, but it brings new problems such as power consumption and area [2, 3].
The NoC architecture [1] consists of processing elements (PE), network interfaces (NI), routers, and a topology, as shown in Figure 1. The PEs transfer information to the NI; the NI packages the information into flits and passes them to the routers. The routers are of three kinds: corner router (CR), edge router (ER), and router (R). The CR, ER, and R have three, four, and five I/O ports, respectively, to access information, and each port includes 𝑛 virtual channels. A router includes the transmission channel, routing computation (RC), virtual channel arbiter (VA), switch arbiter (SA), and crossbar (XBAR). The flits include header, body, and tail; the header flit has the PE priority, source address, destination address, and so forth. The RC uses the header flit and routing algorithms to find the transmission path. The VA uses two-stage arbitration to select the highest priority packet for transmission and then signs up the transmission channel. The SA uses two-stage arbitration to select body flits to pass into the XBAR for transmission. The VA operates when a packet arrives. The SA operates when a flit arrives. The tail flit represents the last flit, after which the router unregisters the transmission channel. The router topology includes mesh, star, and fat tree [4, 5].
Yoon et al. [6] analyze virtual channels (VCs), which can avoid routing and protocol deadlock and improve routing performance when packet traffic is congested. VCs can solve hard packet-switching issues, but they introduce power, area, and other issues in NoC.
Nicopoulos et al. [2] proposed the IntelliBuffer architecture to address PV (process variation) and reduce the power consumption in layer 1 [7]. It differs from the conventional architecture in two fundamental ways. First, the slots use clock-gating to reduce the power consumption when slots are empty. In order to avoid data loss during transmission, the clock of one slot is kept running to access data in each I/O port. Second, the router creates a leakage classification register (LCR) table; the write and read pointers then always access the lowest power consumption slots from the LCR table.
Taassori et al. [3] proposed an adaptive data compression technology to reduce the number of packet bits in layer 3 [7]. It reduces the number of transmissions and can therefore improve the power consumption of the router. Palma et al. [8] use T-Bus-Invert technology to reduce the Hamming distance transition activity rate and improve the power consumption.
[Figure 1: NoC architecture. Recovered labels: a mesh of corner routers (CR), edge routers (ER), and a router (R), each attached to a network interface (NI) and a processing element (PE).]
Jafarzadeh et al. [9] use end-to-end data coding technology to minimize the switching activity rate and routing path and so improve NI power consumption.
Lee et al. [10] proposed a buffer clock-gating architecture that uses clock-gating to reduce the transmit power consumption when slots are empty and full. Ezz-Eldin et al. [11] proposed an adaptive virtual channel with two parts in layer 1 [7]. First, the work uses a hierarchical multiplexing tree for virtual channels (VCs) to reduce area. Second, it uses clock-gating to reduce power consumption. Rosa et al. [12] proposed dynamic frequency scaling in PE for NoC. It considers the communication and loading rate to control the router frequency and reduce the power consumption.
Huaxi et al. [13] proposed a fat tree-based optical NoC; this architecture includes topology, placement, layout, and protocol. That paper proposed a low power and low cost router, the optical turnaround router, to improve the power consumption. Gu et al. [14] proposed the Cygnus router, which optimizes the router algorithms to reduce the power consumption. Swaminathan et al. [15] create two FIFOs in the NI and use dynamically configured data access to the two FIFOs to improve throughput and power consumption.
In the next section we analyse the power consumption under different VC accesses. In Section 3 we introduce the topology and router packet architecture and add the SPS to the router to save power. In Section 4 we present the SPS with router design. Section 5 contains experimental results and Section 6 concludes this paper.

2. Power Issue with Virtual Channels
Multicore architectures and big data communication will be more popular in the next generation. Traditional communication technologies cannot handle a large amount of traffic on multicore and heterogeneous chips. The NoC can solve this issue. It uses a network transmission method to let different cores communicate at the same time. The NoC can solve the communication issue, but big data access increases the power consumption.
The router, composed of the arbitration and transmission units [16], is illustrated in Figure 2. The arbitration unit selects the highest priority packet to send to the next router. The arbitration unit includes routing computation (RC), VC arbiter (VA), and switch arbiter (SA). The RC calculates routing paths and priorities. The VA contains a number of two-stage arbitrations to select packets and sign up VCs. The first stage selects the local highest priority packet from the input VCs to the crossbar and signs up VCs. The second stage selects the global highest priority packet from the input crossbar to the output VCs and signs up VCs. The SA also contains a number of two-stage arbitrations to select flits for transmission. The first stage selects the local highest priority flits from the input VCs to the crossbar. The second stage selects the global highest priority flits from the input crossbar to the output VCs. The VA executes per packet and the SA executes per flit.
[Figure 2: router with arbitration unit (RC, VA, SA). Recovered labels: inputs 1 to n, each with VC 1 to VC n; crossbar (n × n); outputs 1 to n.]
The router with transmission unit is illustrated in Figure 3. This unit includes 𝑛VCs to access large packets from the input physical channel to the output physical channel. A power consumption calculation for VCs is shown in (1). The variable 𝑛 represents the number of accessed packets or flits in VCs. The variable 𝑓 represents the access frequency in VCs. The variable 𝑐 represents capacitance and V represents voltage in VCs. Nicopoulos et al. [2] and Katabami et al. [17] proposed clock-gating to solve this issue.
[Figure 3: router with transmission unit. Recovered labels: east, west, south, and north physical channel inputs; nVCs; arbiters; crossbar; east, west, south, and north physical channel outputs.]
In this paper, we propose a dynamic control of each virtual channel clock in different transmission environments. Whether or not the packet transfer is complete, the SPS can effectively reduce the power consumption without affecting the transmission performance. Consider
$$P_{n\mathrm{VCs}} = \sum_{n=1}^{\infty} 1_n \times (f \times c \times V)^2. \quad (1)$$

3. Router and Topology with SPS
3.1. Relation of Topology and Router. The relation of topology and router is illustrated in Figure 4. The router uses different transmission modes with different topologies. For example, the mesh uses 𝑋-𝑌 routing to transmit. The 𝑋-𝑌 routing flow chart for 2 × 2 meshes is illustrated in Figure 5. When the MSB of the destination router address (𝑅𝑑𝑚) is equal to the MSB of the current router address (𝑅𝑐𝑚) and the LSBs of the router addresses (𝑅𝑑𝑙 and 𝑅𝑐𝑙) are equal, the flits have arrived. Otherwise, the 𝑋-𝑌 routing algorithm includes a two-stage flow. In stage one, the flits are sent along the 𝑥-axis routers until 𝑅𝑑𝑚 equals 𝑅𝑐𝑚. In stage two, the flits are sent to the destination along the 𝑦-axis routers. The virtual channel is initialized when a packet is transmitted between two routers; the procedure is shown in Algorithm 1.
The control method of the arbiter architecture is designed according to the transmission mode. The VC arbiter and switch bar follow the topology and priority used to design the routing computation unit. Algorithm 2 constructs the two-stage VC arbitration per packet. Stage 1 decides the high priority packet that enters the crossbar from the local VCs (input VCs) of each packet at lines 3 to 4 and lines 8 to 10. Stage 2 decides the most important packet to transmit from the global VCs (output VCs) of each packet at lines 5 to 6 and lines 11 to 13.

Sign up Algorithm
Input: 𝑅𝑟𝑜𝑡ℎ and 𝐸𝑚𝑝.
(1) while (flits arrival) do
(2) if (𝑅𝑟𝑜𝑡ℎ𝑓2 is header and 𝑎𝑑𝑥 is free channel)
(3) {sign up the channel and select the channel to output}
(4) else if (𝑅𝑟𝑜𝑡ℎ𝑓2 is body and 𝑎𝑑𝑥 = 𝑅𝑟𝑜𝑡ℎ𝑠2)
(5) {select the channel to output}
(6) else if (𝑅𝑟𝑜𝑡ℎ𝑓2 is tail and 𝑎𝑑𝑥 = 𝑅𝑟𝑜𝑡ℎ𝑠2)
(7) {clear the channel and select the channel to output;}
(8) else
(9) {read back flit to virtual channel}
(10) end while
Algorithm 1: Channel sign up algorithm.
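The channel sign up procedure of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' hardware description: the sign up register is modeled as a dictionary from channel address to owning packet, the comparison with 𝑅𝑟𝑜𝑡ℎ𝑠2 is interpreted as "the channel is signed up by the packet that this flit belongs to", and the names `sign_up_step`, `table`, and `packet_id` are hypothetical.

```python
def sign_up_step(flit_type, channel, packet_id, table):
    """Handle one arriving flit and return the action taken.

    flit_type: "header", "body", or "tail".
    channel:   address of the virtual channel the flit wants to use.
    packet_id: identifier of the packet the flit belongs to (assumed field).
    table:     dict mapping channel address -> packet_id that signed it up.
    """
    if flit_type == "header" and channel not in table:
        table[channel] = packet_id        # sign up the free channel
        return "output"
    if flit_type == "body" and table.get(channel) == packet_id:
        return "output"                   # channel already owned by this packet
    if flit_type == "tail" and table.get(channel) == packet_id:
        del table[channel]                # clear the channel after the last flit
        return "output"
    return "read_back"                    # keep the flit in its virtual channel


table = {}
actions = [
    sign_up_step("header", 0, "p1", table),
    sign_up_step("header", 0, "p2", table),  # channel 0 is taken by p1
    sign_up_step("body", 0, "p1", table),
    sign_up_step("tail", 0, "p1", table),
]
print(actions, table)
```

In this sketch a second header flit that targets an occupied channel is read back, and the channel becomes free again only after the tail flit of the owning packet has passed.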
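[Figure 4: relation of topology and router. Recovered labels: topology, router architecture, routing computation, VCs design, switching bar, virtual channel arbiter, SPS design, crossbar, switch arbiter.]
[Figure 5: X-Y routing flow chart for 2 × 2 meshes. Recovered decisions: Rdm == Rcm, Rdl == Rcl, Rdm > Rcm, Rdl > Rcl. Recovered outcomes: Arrival, Right, Left, Up, Down.]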
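The 𝑋-𝑌 routing decision of Section 3.1 can be written as a short Python function. This is a sketch only: the comparisons follow the flow chart of Figure 5, but the mapping of the Yes and No branches to the directions Right, Left, Down, and Up is an assumption, since the branch labels could not be fully recovered. The function name `xy_route` and the tuple encoding of addresses are hypothetical.

```python
def xy_route(dest, current):
    """Return the next move for a flit in a 2 x 2 mesh.

    dest, current: (msb, lsb) bit pairs of the destination and current
    router addresses, i.e. (Rdm, Rdl) and (Rcm, Rcl).
    """
    rdm, rdl = dest
    rcm, rcl = current
    if rdm != rcm:
        # Stage one: move along the x-axis until the MSBs match.
        # Assumed branch mapping: larger destination MSB means "right".
        return "right" if rdm > rcm else "left"
    if rdl != rcl:
        # Stage two: move along the y-axis until the LSBs match.
        # Assumed branch mapping: larger destination LSB means "down".
        return "down" if rdl > rcl else "up"
    return "arrival"


print(xy_route((1, 1), (0, 0)))  # x-axis first
print(xy_route((1, 1), (1, 0)))  # then y-axis
print(xy_route((1, 1), (1, 1)))  # destination reached
```

The x-axis comparison is resolved before the y-axis comparison, which is the two-stage ordering the text describes.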
Switch arbitration
Input: body and tail flits
/∗Control signal enable∗/
(1) while (body or tail flits) do
(2) use channel sign up register to select local and global highest priority flits
(3) if (local)
(4) {𝑆𝑎𝑖 = local input virtual channel address}
(5) if (global)
(6) {𝑆𝑎𝑜 = global input virtual channel address}
(7) end while
/∗Channel switch∗/
(8) Case 𝑆𝑎𝑖
(9) {𝐶𝑟𝑖2 = local packet of 𝑆𝑎𝑖 }
(10) end case
(11) Case 𝑆𝑎𝑜
(12) {𝑅𝑜𝑡 = global packet of 𝑆𝑎𝑜 }
(13) end case
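The two-stage selection in the listing above (local winner per input port, then global winner) can be sketched in Python. Note the substitution: the sketch uses a fixed highest-priority rule, whereas the paper states that its VA and SA algorithms are lottery based [18]. It also assumes a single output and a hypothetical data layout in which each input port holds a list of `(priority, flit_id)` entries.

```python
def two_stage_arbitrate(input_vcs):
    """Select one flit by local then global highest-priority arbitration.

    input_vcs: dict mapping an input port name to a list of
               (priority, flit_id) entries, one per occupied VC.
    Returns (port, flit_id) of the winner, or None if nothing is waiting.
    """
    # Stage 1 (local): best flit among the VCs of each input port.
    local_winners = {
        port: max(vcs, key=lambda entry: entry[0])
        for port, vcs in input_vcs.items()
        if vcs
    }
    if not local_winners:
        return None
    # Stage 2 (global): best flit among the local winners.
    port = max(local_winners, key=lambda p: local_winners[p][0])
    return port, local_winners[port][1]


requests = {
    "east": [(1, "e0"), (3, "e1")],
    "west": [(2, "w0")],
    "south": [],
}
print(two_stage_arbitrate(requests))
```

Replacing the two `max` calls with a weighted random draw would turn the same structure into a lottery-style arbiter.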
Algorithm 3 constructs the two-stage VC arbitration per flit. Stage 1 decides the high priority flit that enters the crossbar from the local VCs (input VCs) of each flit at lines 3 to 4 and lines 8 to 10. Stage 2 decides the most important flit to transmit from the global VCs (output VCs) of each flit at lines 5 to 6 and lines 11 to 13.
In the transmission channel architecture, the router includes four directions to connect to other routers and one local physical channel to connect to the PE. Each physical channel except the local physical channel has 𝑛VCs. The switch bar supports transmission of the most important packet to the output channel. The SPS controls the power consumption of each VC when the channel status changes. The SPS architecture is introduced in the next section.

3.2. Topology Architecture. The topology defines the packet transmission path between routers and links. The router connection topology architecture is shown in Figure 6; it includes star, mesh, ring, and tree topologies. The RC algorithms in the arbitration unit depend on the topology architecture. The VA and SA algorithms in the arbitration unit depend on packet priority. In this paper, the topology is the 2 × 2 mesh, the RC algorithm is 𝑋-𝑌 routing, and the VA and SA algorithms are lottery [18].
The router that connects with the PE is shown in Figure 7; the PE and router use the network interface (NI) to access information. The NI handles the information between the router and the PE. The NI includes two-level designs [19], as shown in Figure 8. It contains three modules to meet the specifications of the different layers. The shell module needs to meet the IP specification. The kernel module needs to meet the NoC topology specification.

3.3. Flits with Router Architecture. The flit specification with router is shown in Figure 9. The 2-bit flit type 00 represents a one-flit packet; this flit type does not sign up VCs. The 2-bit type 01 represents the header flit, which includes routing information and addresses; this flit type always signs up a channel. The 2-bit type 10 represents the body flit, which carries transmission information; its payload records a segment of the packet. The 2-bit type 11 represents the tail flit, which carries the last transmission information; this flit not only records the last segment of the packet but also clears the VCs.
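The 2-bit flit type encoding of Section 3.3 can be decoded as in the following Python sketch. The type codes come from the text; the flit width of 32 bits and the placement of the type field in the two most significant bits are assumptions made for illustration, as are the names `flit_type`, `signs_up_channel`, and `clears_channel`.

```python
# Flit type codes as listed in Section 3.3.
FLIT_TYPES = {0b00: "single", 0b01: "header", 0b10: "body", 0b11: "tail"}


def flit_type(word, width=32):
    """Decode the flit type, assuming it sits in the top two bits of the word."""
    return FLIT_TYPES[(word >> (width - 2)) & 0b11]


def signs_up_channel(kind):
    """Only the header flit signs up a virtual channel."""
    return kind == "header"


def clears_channel(kind):
    """Only the tail flit clears the virtual channel."""
    return kind == "tail"


for word in (0x00000005, 0x40000000, 0x80000000, 0xC0000001):
    kind = flit_type(word)
    print(f"{word:#010x} {kind} sign_up={signs_up_channel(kind)} clear={clears_channel(kind)}")
```

A one-flit packet (type 00) neither signs up nor clears a channel, which matches the statement that this flit type does not sign up VCs.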
6 VLSI Design
Arbiter
RC VA SA
PE PE Transmission channel
ISL Switch logic OSL
Packet Packet
input PC IVC CR OVC PC output
Router
NI
SPS
Figure 7: Router connection with PE.
Figure 10: Router with SPS architecture.
Shell Kernel
IP NoC 4.2. Design of SPS Control Timimg. The VCs access timing
protocol Handshake Packet packets
NoC
diagrams of SPS architecture are illustrated in Figure 12. The
IP protocol assembling interface
Clock Block A indicates that the VCs have no information to
transmit. The Clock Block B indicates that the VCs are writing
encodes IP frequency information. The Clock Block C indicates that the data in VCs
protocol conversion are waiting to transmit. Our analysis for unused clock-gating
architecture is shown in (2). The slots access information of
power consumption is denoted by 𝑃𝑎 . The slot content full
Figure 8: NI breakdown into Shell, Kernel, and interface. and empty of power consumption are denoted by 𝑃𝑓 and 𝑃𝑒 ,
respectively. The 𝑃𝑠 is power consumption except for 𝑃𝑓 , 𝑃𝑒 ,
and 𝑃𝑎 . The unused clock-gating architecture does not control
Flit type (00) Source Destinations Payload clock for sequential logic in 𝑛VCs. Therefore, the logic will
address address
generate power consumption in high transmission structure.
Source Destinations The clocking gating consumes power in Clock Block B
Flit type (01) Routing information
address address
and Clock Block C. Our analysis for clock-gating architecture
Flit type (10) Payload is shown in (3). The 𝑃𝑔1 is power consumption of empty
gating. The clock-gating architecture does not control clock
Flit type (11) Payload
Figure 9: Flits type of router.

4. SPS with Router Design

A VC that contains many slots to access data leads to extra power consumption. In this paper, we propose the SPS architecture to reduce this power consumption.

4.1. Router with SPS Architecture. The proposed router with SPS architecture is illustrated in Figure 10. The physical channel (PC) is used to connect to other routers and access information. The input VCs (IVC) are used to store information from the PCs. They are always designed as FIFOs or other sequential logic. The arbiter decides the priority of the flits and controls the input switch logic (ISL) and the output switch logic (OSL) to transmit flits. It includes RC, VA, and SA. The crossbar (CR) connects the IVC to the OVC according to the switch signal from the arbiter. The output VCs (OVC) store information from the CR. The proposed SPS uses the transmission channel status to dynamically control the IVC and OVC clocks, so that they run only during essential operation.

The VCs with SPS architecture are illustrated in Figure 11. The SPS controls the system clock fed into the I/O VCs to reduce power consumption. In this architecture, the VC contains slots 0 to i − 1 to access data.

when the VCs are in the full stage. The VCs always store flits while waiting for transmission.

The SPS consumes power in Clock Block B. Our analysis for the SPS architecture is shown in (4). P_g2 is the power consumption of the SPS. It saves the power consumption of empty and full gating for n VCs. Consider

P_r1 = P_a + P_f + P_e + P_s, (2)

P_r2 = P_a + P_f + P_s + P_g1, (3)

P_r3 = P_a + P_s + P_g2. (4)

4.3. Design of SPS. The proposed SPS uses the VC status to dynamically control the clock of each VC. The CFSM of SPS with VCs is illustrated in Figure 13; this architecture contains two CFSMs.

The first CFSM includes the initial, empty, full, and waiting statuses. Initial status: when the VC is reset, the structure stays in the initial status until a flit arrives. Empty status: when the user resets the VCs or the flits are transported to the next storage unit, the structure enters this status. Full status: the flit storage in the VC is full. Waiting status: when the user resets the VCs or the flit store is complete.

The VCs with SPS algorithm is illustrated in Algorithm 4. In line 3, the VCs initialize the VC count and flags. The VCs access flits and change the VC count when a channel packet or arbiter signal arrives, at lines 4 to 9. When the VC count can be changed, the VC flags are changed at lines 10 to 17.

The second CFSM includes the initial, clock-gating, and wake-up statuses. Initial status: this follows the same principle as the initial state of the first CFSM. Clock-gating: when the VC changes to full or empty, the SPS disables this VC clock and changes to this status. Wake up: when a VC wants to store a flit, one VC wakes up.

The SPS algorithm is illustrated in Algorithm 5. In line 3, the SPS will initialize VCs clock and access status from VCs

[Figure: SPS router block diagram (packet input, ISL, I/O VC n with slots 0 to i − 1, OSL, packet output) and SPS CFSM transitions: (Old VC Ff = 1)/direction := Full; (VC1 to i−1 Ef = 1)/direction := Clock Gating; (VC 0 Ef = 1)/direction := Wake up.]
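The clock-control behavior described for the second CFSM can be sketched in software as follows. This is only an illustrative model under stated assumptions, not the authors' Algorithm 5: the names ClockState, vc_flags, and sps_step are hypothetical, and it is assumed here that the VC clock stays off in the initial status until the first flit arrives.

```python
from enum import Enum


class ClockState(Enum):
    INITIAL = "initial"
    CLOCK_GATING = "clock-gating"
    WAKE_UP = "wake-up"


def vc_flags(count, depth):
    """Return (empty_flag, full_flag) for a VC holding `count` of `depth` slots."""
    return count == 0, count == depth


def sps_step(state, count, depth, store_request):
    """One transition of a hypothetical per-VC clock controller.

    Returns (next_state, clock_enabled).
    """
    empty, full = vc_flags(count, depth)
    if store_request and not full:
        # A flit wants to enter this VC: wake its clock up.
        return ClockState.WAKE_UP, True
    if state is ClockState.INITIAL:
        # Assumption: after reset the VC waits, unclocked, for the first flit.
        return ClockState.INITIAL, False
    if empty or full:
        # The VC became empty or full: gate its clock.
        return ClockState.CLOCK_GATING, False
    # A partially filled VC keeps its clock running.
    return ClockState.WAKE_UP, True
```

For example, a full 4-slot VC is gated with sps_step(ClockState.WAKE_UP, 4, 4, False), and an empty gated VC is woken by sps_step(ClockState.CLOCK_GATING, 0, 4, True).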
[Figure: test-flow state diagram with Start, Idle, Generator, and Lottery states; test blocks Router with SPS, Conventional router, and VD.]
Figure 16: Vector database (VD) control flow graph.
[Plot axes: power consumption (mW) versus number of test flits (k flits).]
analysis tools use Modelsim 6.6, Xilinx Chipscope ILA, and Xpower 12.3, which are supported by Xilinx. The experimental test environment uses a 2 × 2 mesh and 𝑋-𝑌 routing; each PC has 4 VCs to access flits. The power consumption distribution is illustrated in Figure 18; the number of test packets ranges from 100 to 10000. The packet format is the flit, and the packet length is 18 bits.

Figure 18: Power consumption distribution (legend: IntelliBuffer [2], adaptive data compression [3], BCG [10], proposed).

Compared with the related works shown in Table 1, namely, IntelliBuffer [2], adaptive data compression [3], and buffer clock-gating [10], the proposed method reduces power consumption by 37.31%, 45.79%, and 19.26%, respectively, and reduces area by 49.4%, 25.5%, and 14.4%, respectively.

Table 1
                                 Constraints
Methods                          Power consumption (mW)   Area (number of slices)   Improved power   Improved area
IntelliBuffer [2]                410.42                   1551                      37.31%           49.4%
Adaptive data compression [3]    474.53                   1054                      45.79%           25.5%
Buffer clock-gating [10]         318.63                   917                       19.26%           14.4%
Newly proposed                   257.05                   785

6. Conclusions

The Smart Power-Saving (SPS) architecture for network-on-chip was presented. A clock control circuit and the SPS algorithm are demonstrated to reduce the power consumption of the NoC architecture. The experimental results show that the proposed SPS architecture reduces power consumption more efficiently than IntelliBuffer [2], adaptive data compression [3], and buffer clock-gating [10] in the NoC architecture.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

The authors would like to thank the Ministry of Science and Technology of the Republic of China, Taiwan, for partially supporting this research.

References

[1] D. James, “Intel Ivy Bridge unveiled—the first commercial tri-gate, high-k, metal-gate CPU,” in Proceedings of the Custom Integrated Circuits Conference (CICC '12), pp. 9–12, September 2012.
[2] C. Nicopoulos, S. Srinivasan, A. Yanamandra et al., “On the effects of process variation in network-on-chip architectures,” IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 3, pp. 240–254, 2010.
[3] M. Taassori, M. Taassori, and M. Mossavi, “Adaptive data compression in NoC architectures for power optimization,” International Review on Computers and Software, vol. 5, no. 5, pp. 540–547, 2010.
[4] D. Bertozzi and L. Benini, “Xpipes: a network-on-chip architecture for gigascale systems-on-chip,” IEEE Circuits and Systems Magazine, vol. 4, no. 2, pp. 18–31, 2004.
[5] S. J. Lee, K. Lee, and H. J. Yoo, “Analysis and implementation of practical, cost-effective networks on chips,” IEEE Design and Test of Computers, vol. 22, no. 5, pp. 422–433, 2005.
[6] Y. J. Yoon, N. Concer, M. Petracca, and L. Carloni, “Virtual channels versus multiple physical networks: a comparative analysis,” in Proceedings of the 47th ACM/IEEE Design Automation Conference (DAC '10), pp. 162–165, June 2010.
[7] L. Benini and G. de Micheli, “Networks on chips: a new SoC paradigm,” IEEE Computer, vol. 35, no. 1, pp. 70–78, 2002.
[8] J. C. S. Palma, L. S. Indrusiak, F. G. Moraes, R. Reis, and M. Glesner, “Reducing the power consumption in networks-on-chip through data coding schemes,” in Proceedings of the 14th IEEE International Conference on Electronics, Circuits and Systems (ICECS ’07), pp. 1007–1010, December 2007.
[9] N. Jafarzadeh, M. Palesi, A. Khademzadeh, and A. Afzali-Kusha, “Data Encoding Techniques for Reducing Energy Consumption in Network-on-Chip,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 3, pp. 675–685, 2014.
[10] T. Y. Lee, C. H. Huang, and X. S. Lin, “Design of buffer clock-gating architecture for network-on-chip,” in Proceedings of the 22nd VLSI Design/CAD Symposium, pp. 2–5, August 2011.
[11] R. Ezz-Eldin, M. A. El-Moursy, and A. M. Refaat, “Low leakage power NoC switch using AVC,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS ’12), pp. 2549–2552, Seoul, Republic of Korea, May 2012.
[12] T. R. da Rosa, V. Larrea, N. Calazans, and F. G. Moraes, “Power consumption reduction in MPSoCs through DFS,” in Proceedings of the 25th Symposium on Integrated Circuits and Systems Design (SBCCI '12), pp. 1–6, 2012.
[13] G. Huaxi, X. Jiang, and Z. Wei, “A low-power fat tree-based optical network-on-chip for multiprocessor system-on-chip,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE ’09), pp. 3–8, April 2009.
[14] H. Gu, K. H. Mo, J. Xu, and W. Zhang, “A low-power low-cost optical router for optical networks-on-chip in multiprocessor systems-on-chip,” in Proceedings of the IEEE Computer Society Annual Symposium on VLSI (ISVLSI ’09), pp. 19–24, Tampa, Fla, USA, May 2009.
[15] K. Swaminathan, G. Lakshminarayanan, F. Lang, M. Fahmi, and S. B. Ko, “Design of a low power network interface for Network on chip,” in Proceedings of the 26th IEEE Canadian Conference on Electrical and Computer Engineering (CCECE ’13), pp. 1–4, May 2013.
[16] R. Mullins, A. West, and S. Moore, “Low-latency virtual-channel routers for on-chip networks,” in Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA '04), pp. 188–197, 2004.
[17] H. Katabami, H. Saito, and T. Yoneda, “Design of a GALS-NoC using soft-cores on FPGAs,” in Proceeding of the Embedded Multicore Socs (MCSoC '13), pp. 26–28, September 2013.
[18] J. Wang, Y. Li, Q. Peng, and T. Tan, “A dynamic priority arbiter for network-on-chip,” in Proceedings of the IEEE International Symposium on Industrial Embedded Systems (SIES '09), pp. 253–256, July 2009.
[19] S. Saponara, L. Fanucci, and M. Coppola, “Design and coverage-driven verification of a novel network-interface IP macrocell for network-on-chip interconnects,” Journal of Microprocessors and Microsystems, vol. 35, no. 6, pp. 579–592, 2011.
Hindawi Publishing Corporation
VLSI Design
Volume 2014, Article ID 406416, 7 pages
http://dx.doi.org/10.1155/2014/406416
Research Article
Optimization of Fractional-N-PLL Frequency Synthesizer for
Power Effective Design
Copyright © 2014 Sahar Arshad et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
This paper presents the design and simulation of a low power fractional-N phase-locked loop (FNPLL) frequency synthesizer for
industrial applications, based on VLSI. The FNPLL design has been optimized using different VLSI techniques to achieve
significant performance in terms of speed with relatively low power consumption. The loop filter makes one of the major
contributions to the optimization, as it limits the switching time between cycles. The sigma-delta modulator attenuates the noise
generated by the loop filter. The implementation details and simulation results of all the blocks of the optimized design are presented.
1. Introduction

For many manufacturers and product developers, reducing power consumption in electronic products is a good idea. It is also an important way to gain competitive advantage in an increasingly power hungry world. Low power consumption gives many benefits to designers and to users; for example, the main advantage is that it relaxes stringent cooling requirements and results in inexpensive and more compact products [1]. The rapid rise in power requirements has prompted governments and industry to increase energy efficiency and design low power components. The majority of frequency synthesis techniques fall into two categories: direct frequency synthesis and indirect frequency synthesis [2]. Direct frequency synthesis is used to achieve fine frequency steps because it is based on digital techniques. Indirect frequency synthesis is used to generate multiples (integer or noninteger) of a reference frequency because it is based on a phase-locked loop (PLL). Here, the latter technique is used because we are going to implement a PLL. A PLL generates a signal whose phase is related to the phase of the input signal; this signal is called the output signal of the PLL, and the input signal is called the “reference” signal. In a feedback loop, the oscillator is controlled by the output signal from the phase detector [3, 4]. The circuit compares the phase of the signal obtained from its output oscillator with the phase of the input signal and keeps the phases matched by adjusting the frequency of its oscillator. There are two types of PLL architecture, the fractional-N PLL (FNPLL) and the integer-N PLL [5]. For a given frequency resolution, the FNPLL can use a higher reference frequency than the integer-N PLL, and, hence, the loop bandwidth, which is limited to 10% of the reference frequency, can be set larger in the FNPLL than in the integer-N PLL. Therefore, the FNPLL architecture is used for faster locking. This speed advantage of the FNPLL, however, comes at the price of increased design complexity [6]. This is because the fractional-N operation in steady state requires fractional spur reduction circuits whose quantization noise folds into the PLL spectrum via loop nonlinearities, demanding more significant design efforts to minimize the loop nonlinearities.
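The bandwidth argument above can be made concrete with a small worked example. The sketch below assumes the 10% rule quoted in the text, an integer-N PLL whose frequency step equals its reference frequency, and a fractional-N PLL whose step equals the reference frequency divided by a fractional modulus. The function names, the 200 kHz resolution, and the modulus of 100 are hypothetical values chosen only for illustration.

```python
def max_loop_bandwidth(f_ref_hz, fraction=0.10):
    """Upper bound on the loop bandwidth: 10% of the reference frequency."""
    return fraction * f_ref_hz


def integer_n_reference(resolution_hz):
    """Integer-N: the output step equals the reference frequency."""
    return resolution_hz


def fractional_n_reference(resolution_hz, modulus):
    """Fractional-N: the step is f_ref / modulus, so f_ref can be higher."""
    return resolution_hz * modulus


resolution = 200e3  # hypothetical frequency resolution, in Hz
modulus = 100       # hypothetical fractional modulus

bw_int = max_loop_bandwidth(integer_n_reference(resolution))
bw_frac = max_loop_bandwidth(fractional_n_reference(resolution, modulus))
print(f"integer-N:    bandwidth up to {bw_int / 1e3:.0f} kHz")
print(f"fractional-N: bandwidth up to {bw_frac / 1e6:.0f} MHz")
```

Under these assumptions the fractional-N loop may be about 100 times wider than the integer-N loop for the same resolution, which is the source of the faster locking.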
[Figure 1: PLL block diagram with input fin, phase detector, loop filter, VCO, and output fout.]
[Figure 2: FNPLL block diagram with Fref, PFD, CP, LPF, VCO, a divider dividing by (Nint + y[n]), and a sigma-delta modulator producing y[n].]
[Figure: transistor-level schematic with inputs clk1 and clk2 and output out1; device sizes W = 2.0 µ or 1.0 µ, L = 0.12 µ.]
Figure 5: V versus T.
Figure 8: V versus T.

On the contrary, in the absence of fractional spurs, integer-N PLLs involve less design complexity. Here, an FNPLL is required. The output frequency of the FNPLL is expressed as

FreqFNPLL = (𝑀.𝑛) × FreqRef. (1)

In this equation, 𝑀 is the integer part and 𝑛 is the fractional part of the division ratio 𝑀.𝑛. To obtain the desired fractional division ratio, a dual-modulus divider is used [7]. Using the sigma-delta modulation technique, we can remove the fractional spurs. This technique generates a random sequence of integer division values, and the average of these values gives the desired ratio. A phase detector, a loop filter, and a voltage controlled oscillator (VCO) are the main parts of a phase-locked loop, as shown in Figure 1.

The most important part of the phase-locked loop (PLL) is the phase detector. It is also called a phase comparator and may be a logic circuit, frequency mixer, or analog multiplier that generates a voltage signal representing the phase difference. The three units are coupled as a feedback system, as shown in Figure 1. The periodic output signal is generated by the oscillator. The applications of the PLL are versatile; for example, it can generate different stable frequencies or recover a signal from noisy signals. A complete phase-locked loop block can be obtained from a single integrated circuit.
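The averaging idea behind the fractional division ratio (integer part M, fractional part n) can be illustrated with a short sketch. It uses a deterministic first-order accumulator as the simplest stand-in for the modulator; this is an assumption made for illustration and not the sigma-delta modulator designed in this paper, which randomizes the sequence to suppress spurs. The function name and the example ratio of 10.25 are hypothetical.

```python
from fractions import Fraction


def dual_modulus_sequence(m, frac, cycles):
    """Division values (m or m + 1) produced by a first-order accumulator."""
    acc = Fraction(0)
    values = []
    for _ in range(cycles):
        acc += frac
        if acc >= 1:
            # Accumulator overflow: divide by m + 1 for this cycle.
            acc -= 1
            values.append(m + 1)
        else:
            values.append(m)
    return values


# Hypothetical target ratio 10.25: integer part 10, fractional part 1/4.
seq = dual_modulus_sequence(m=10, frac=Fraction(1, 4), cycles=8)
average = Fraction(sum(seq), len(seq))
print(seq, float(average))
```

Over a whole number of accumulator periods the divider spends one cycle in four at 11 and the rest at 10, so the average division ratio equals the target 10.25.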
This technique is used in advanced electronic products which have different output frequencies, from a few Hz to many GHz [8]. To get low power consumption, high speed, and stability, we decided to design a fractional-N phase-locked loop using a 0.12 micrometer CMOS/VLSI design. As the demand for PLLs in the field of communications is growing day by day, low leakage transistors will be used to maintain low power, but for this we have to make a small compromise on frequency.

The structure of the FNPLL is depicted in Figure 2. We can control characteristics of the PLL, for example, transient response and bandwidth, by using the low pass filter. The basic and essential functional unit of the PLL is the VCO, which is used for clock generation [9]. For synthesizing the desired frequencies, we use a PLL with the arbitrary frequency division (÷N) method. The proposed technique gives fast settling time, reduces phase noise, and also reduces the effect of spurious frequencies when compared with existing FNPLL techniques.

2. PLL Design Using 0.12 Micrometer

2.1. Phase Detector. The first block has two inputs, the reference input and the feedback. It compares the frequencies of the inputs and produces an output based on the phase difference of the inputs. XOR gates are used to implement this block. The gate produces a square wave when there is a phase shift of 90 degrees, that is, one-fourth of a period, at the clock input, whereas the output is different for all other angles. We apply the output of the XOR gates to the LPF, which results in an analog voltage proportional to the phase difference.

Figure 3 depicts a CMOS circuit of the phase detector, Figure 4 describes the layout, and Figure 5 represents the output waveform.

2.2. Loop Filter. Filters are used in electronic circuits, along with rectifiers, to obtain a pure DC voltage. The second block of the PLL is the loop filter, and it has two distinct functions. First, it maintains stability, which is defined by describing the loop dynamics. This explains the response of the loop to uncertainties. The second function is to limit the reference-frequency energy that appears at the phase detector output and is applied to the VCO control input. This frequency modulates the VCO and produces FM sidebands [10]. Other features of the PLL, for example, bandwidth, transient response, lock range, and capture range, can be controlled by the LPF. The LPF is used to attenuate this energy, but it can also provide band rejection. The low pass filter can be obtained by using a capacitor of large value, which is charged and discharged through the switch resistance 𝑅on. The 𝑅on·C delay creates the low pass filter. Figure 6 depicts a CMOS schematic of the phase detector with loop filter, Figure 7 shows the layout, and Figure 8 shows the output waveform.

[Figure: transistor-level schematics with inputs clk1, clk2, and clk3 and output out1; device sizes W = 2.0 µ or 1.0 µ, L = 0.12 µ; R = 100 (Res), C = 1 pF (Capa).]

2.3. Voltage Controlled Oscillator. The VCO is a source of a varying output signal whose frequency is regulated over a DC voltage range. The output signal can be a square wave or a triangular waveform. The oscillation frequency is controlled by the value of the input voltage [11]. Figure 9 shows a CMOS circuit of the VCO, Figure 10 shows the layout, and Figure 11 shows the output waveform.

2.4. Sigma-Delta Modulator. The sigma-delta modulation technique is used to convert high resolution signals to low resolution signals in the digital domain. We designed the sigma-delta modulator using a 0.12 micrometer feature size and then obtained the layout. The input is the desired fractional number (𝑛) and the output is the sum of quantization noise
and a DC part [12, 13]. The quantization noise is generated by the use of the integer divider. Figures 12 and 15 show the CMOS circuits, and Figures 13 and 16 show the layouts, of the comparator and the operational transconductance amplifier. Figures 14 and 17 show the output waveforms.

3. Conclusion

Power usage and heat dissipation are among the biggest challenges of the VLSI industry today. In order to design a low power component without a significant change in performance, the FNPLL frequency synthesizer was designed, implemented, and simulated. The optimized design was implemented in 0.12 micrometer technology. Using CMOS logic, the schematics were designed and functionally verified, and then the prefabrication layout was sketched. The simulation curves of the layouts reflected a reduction in power consumption for the optimized design.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] R. Jacob Baker, CMOS Circuit Design, Layout and Simulation, IEEE Press, John Wiley & Sons, 3rd edition, 2010.
[2] A. Anil and R. K. Sharma, “A high efficiency charge pump for low voltage devices,” International Journal of VLSI Design & Communication Systems, vol. 3, no. 3, 2012.
[3] U. L. Rohde, Digital PLL Frequency Synthesis, Prentice-Hall, Englewood Cliffs, NJ, USA, 1983.
[4] B. K. Mishra, S. Save, and S. Patil, “Design and analysis of second and third order PLL at 450 MHz,” International Journal of VLSI Design & Communication Systems, vol. 2, no. 1, 2011.
[5] N. Weste and D. Harris, CMOS VLSI Design—A Circuits and Systems Perspective, Pearson Education, 3rd edition, 2005.
[6] U. A. Belorkar and S. A. Ladhake, “Design of low power phase lock loop using 45 nm VLSI technology,” International Journal of VLSI Design & Communication Systems, vol. 1, no. 2, 2010.
[7] T. A. D. Riley, M. A. Copeland, and T. A. Kwasniewski, “Delta-Sigma modulation in fractional-n frequency synthesis,” IEEE Journal of Solid-State Circuits, vol. 28, no. 5, pp. 553–559, 1993.
[8] M. H. Perrott, “Fractional-N Frequency Synthesizer Design Using The PLL Design Assistant and CppSim Programs,” July 2008.
[9] S. Franssila, Introduction to Microfabrication, John Wiley & Sons, 2004.
[10] K. Woo, Y. Liu, E. Nam, and D. Ham, “Fast-lock hybrid PLL combining fractional-N and integer-N modes of differing bandwidths,” IEEE Journal of Solid-State Circuits, vol. 43, no. 2, pp. 379–389, 2008.
[11] N. Fatahi and H. Nabovati, “Design of low noise fractional-N frequency synthesizer using sigma-delta modulation technique,” in Proceedings of the 27th International Conference on Microelectronics (MIEL ’10), pp. 369–372, IEEE, May 2010.
[12] S. Borkar, “Obeying Moore’s law beyond 0.18 micron,” in Proceedings of the 13th Annual IEEE International ASIC/SOC Conference, pp. 26–31, September 2000.
[13] R. K. Krishnamurthy, A. Alvandpour, V. De, and S. Borkar, “High-performance and low-power challenges for sub-70 nm microprocessor circuits,” in Proceedings of the IEEE Custom Integrated Circuits Conference, pp. 125–128, May 2002.
Hindawi Publishing Corporation
VLSI Design
Volume 2014, Article ID 380362, 5 pages
http://dx.doi.org/10.1155/2014/380362
Research Article
Performance Analysis of Modified Drain Gating Techniques for
Low Power and High Speed Arithmetic Circuits
Copyright © 2014 Shikha Panwar et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
This paper presents several high performance and low power techniques for CMOS circuits. In these design methodologies, drain
gating technique and its variations are modified by adding an additional NMOS sleep transistor at the output node which helps in
faster discharge and thereby providing higher speed. In order to achieve high performance, the proposed design techniques trade
power for performance in the delay critical sections of the circuit. Intensive simulations are performed using Cadence Virtuoso in
a 45 nm standard CMOS technology at room temperature with supply voltage of 1.2 V. Comparative analysis of the present circuits
with standard CMOS circuits shows smaller propagation delay and lesser power consumption.
Figure 1: (a) Drain gating, (b) power gating, (c) drain-header and power-footer gating (DHPF), and (d) drain-footer and power-header gating (DFPH).
transistor is inserted for each input of the gate both in the PUN and in the PDN, resulting in higher delay and area. In sleepy stack, an additional sleep transistor is connected in parallel with the transistor stack. This reduces the leakage current, but at the same time the delay in the circuit is increased.

LECTOR [8] and GALEOR [9] are two further leakage tolerant techniques. LECTOR makes use of two leakage control transistors (LCTs) that are connected between the PUN and PDN. Similarly, the GALEOR technique makes use of gated leakage transistors (GLTs). Both LCTs and GLTs reduce leakage by increasing the resistance between supply voltage and ground.

Another efficient technique to counter the leakage current problem is drain gating and its variations [10], explained in detail in Section 2. The modified circuits are proposed in Section 3. Simulation results taking a NAND gate, a 1-bit full adder, and an 8-bit RCA (ripple carry adder) as test bench circuits are enumerated in Section 4, and Section 5 provides the final conclusion.

2. Drain Gating Technique and Its Variant Circuits

In the drain gating technique [10] shown in Figure 1(a), two sleep transistors are added between the PUN and PDN. A PMOS transistor with sleep input (S) is connected between the PUN and the output node, whereas an NMOS transistor with sleep input (S̄) is inserted between the output node and the PDN. When the circuit is in evaluation mode, the NMOS and PMOS sleep transistors are turned on, resulting in a low resistance conducting path. When the circuit is in standby, both transistors are switched off to reduce the standby power. The other variant circuits of drain gating are power gating, drain-header and power-footer gating (DHPF), and drain-footer and power-header gating (DFPH). In the power gating technique, a PMOS sleep transistor with input (S) is added between the power supply and the PUN, whereas an NMOS sleep transistor with input (S̄) is added between the PDN and ground, as shown in Figure 1(b). The two mixed techniques DHPF and DFPH are shown in Figures 1(c) and 1(d), respectively. As the name suggests, in DHPF, a PMOS sleep switch is inserted between the PUN and the output node and an NMOS sleep switch is inserted between the PDN and the ground rail. DFPH consists of an NMOS sleep switch between the output node and the PDN and a PMOS sleep switch between the power supply and the PUN. Comparative results in Section 4 indicate that the power gating technique is the best leakage tolerant technique, whereas the drain gating technique has the least delay among the previously proposed circuits.

3. The Proposed High Speed Circuit Techniques

The proposed circuits are aimed at reducing the propagation delay incurred by the drain gating technique and its variations. Four different circuit techniques, namely, high speed drain gating (HS-drain gating), HS-power gating, HS-DHPF, and HS-DFPH, as shown in Figures 2(a), 2(b), 2(c), and 2(d), respectively, are proposed in this section. In the HS-drain gating technique, an additional sleep transistor with sleep input (S) is connected at the output node in parallel with the NMOS sleep transistor (S̄) and the PDN. During the active mode, when the logic circuit evaluates the circuit output, the added NMOS sleep transistor (S) provides an additional discharging path in the circuit. This added transistor helps in speedy evaluation, hence providing higher speed. In a similar fashion, an additional NMOS sleep transistor with sleep input (S) is added to the power gating, DHPF, and DFPH circuits.

The proposed circuits have been verified by taking a NAND gate, a 1-bit full adder, and an 8-bit RCA as test bench circuits. Experimental results in Section 4 prove that the modified HS-drain gating technique has the least delay among
Figure 2: (a) HS-drain gating, (b) HS-power gating, (c) HS-DHPF, and (d) HS-DFPH.

Table 1: Power and delay values of NAND gate, FA, and 8-bit RCA using various techniques.
                     NAND gate                 FA                        8-bit RCA
Circuit techniques   Power (nW)   Delay (ps)   Power (nW)   Delay (ps)   Power (uW)   Delay (ps)
Standard CMOS        22.32        45e3         2.1e3        30e3         52.2         23.7e3
Drain gating         12.63        25e3         393          15e3         7.53         8.85e3
Power gating         8.73         205e3        238          150e3        2.39         20.5e3
DHPF                 11.08        80e3         340          25e3         3.02         12e3
DFPH                 8.71         175e3        245          150e3        4.02         15.5e3
HS-drain gating      18.42        2.22         250          15.08        6.97         10.3
HS-power gating      16.57        11.76        246          49.9         2.13         21.4
HS-DHPF              16.70        6.3          248          35.97        4.15         11.9
HS-DFPH              16.64        11.71        247          44.87        2.92         17

Figure 3: 1-bit CMOS full adder.
[Figure: 8-bit ripple carry adder built from full adders FA0 to FA7 with inputs A0 to A7, B0 to B7, and Cin, a Sleep signal, and carry output C8.]
Figure 5: Temperature versus the propagation delay for the existing and the proposed techniques (legend: CMOS, drain gating, power gating, DHPF, DFPH, HS-drain gating, HS-power gating, HS-DHPF, HS-DFPH).

The total power consumption and propagation delay of the various existing and proposed techniques for the NAND gate, FA, and 8-bit RCA are compared in Table 1. The HS-drain gating technique has the least delay. HS-power gating, HS-DFPH, and HS-DHPF suffer propagation delay penalties of 50%, 39%, and 13%, respectively, with respect to the HS-drain gating technique. Standard drain gating and its variant circuit techniques suffer a 99% propagation delay penalty in comparison with the HS-drain gating technique. Circuits employing the HS-power gating technique have very low power consumption. Power savings of nearly 85% are achieved in arithmetic architectures employing the HS-power gating technique. The HS-drain gating technique has the least power saving among the proposed circuits. The HS-DHPF and HS-DFPH techniques optimize the power and delay in CMOS arithmetic circuits.
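The delay percentages quoted above can be approximately reproduced from the 8-bit RCA column of Table 1. The sketch below assumes one possible reading, which the paper does not state explicitly: the penalty is the share of the slower circuit's delay that HS-drain gating removes. The names rca_delay_ps and delay_penalty are hypothetical.

```python
# 8-bit RCA propagation delays in ps, copied from Table 1.
rca_delay_ps = {
    "HS-drain gating": 10.3,
    "HS-power gating": 21.4,
    "HS-DHPF": 11.9,
    "HS-DFPH": 17.0,
    "Drain gating": 8.85e3,
}


def delay_penalty(slow, fast):
    """Share of the slower delay that the faster circuit removes, in percent."""
    return 100.0 * (slow - fast) / slow


reference = rca_delay_ps["HS-drain gating"]
penalties = {
    name: round(delay_penalty(delay, reference), 1)
    for name, delay in rca_delay_ps.items()
    if name != "HS-drain gating"
}
for name, value in penalties.items():
    print(f"{name}: {value}%")
```

Under this reading the values come out at roughly 52%, 13%, 39%, and 99.9% for HS-power gating, HS-DHPF, HS-DFPH, and standard drain gating, which is close to the 50%, 13%, 39%, and 99% reported in the text.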
The corner analysis for the drain gating design and its variants is plotted along with that of the modified high speed counterparts. Figure 5 shows the temperature versus the propagation delay graph for the 8-bit RCA using the existing and the proposed techniques. Similarly, Figure 6 shows the plot of process corners versus the propagation delay of the 8-bit RCA using the existing and the proposed techniques. From Figures 5 and 6, we can infer that the designs made using the modified high speed drain gating technique and its corresponding variants have a substantial reduction in propagation delay when compared to the designs made using CMOS, the drain gating technique, and its variants.

5. Conclusions
Conflict of Interests
The authors declare that there is no conflict of interests
regarding the publication of this paper.
References
[1] K. Roy, S. Mukhopadhyay, and H. Mahmoodi-Meimand, “Leak-
age current mechanisms and leakage reduction techniques in
deep-submicrometer CMOS circuits,” Proceedings of the IEEE,
vol. 91, no. 2, pp. 305–327, 2003.
[2] M. Powell, S. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar,
“Gated-Vdd: a circuit technique to reduce leakage in deep-
submicron cache memories,” in Proceedings of the IEEE Sym-
posium on Low Power Electronics and Design (ISLPED ’00), pp.
90–95, July 2000.
[3] M. Johnson, D. Somasekhar, L. Y. Chiou, and K. Roy, “Leakage
control with efficient use of transistor stacks in single threshold
CMOS,” IEEE Transactions on VLSI Systems, vol. 10, no. 1, pp.
1–5, 2002.
[4] S. Mutoh, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu, and
J. Yamada, “1-V power supply high-speed digital circuit technol-
ogy with multithreshold-voltage CMOS,” IEEE Journal of Solid-
State Circuits, vol. 30, no. 8, pp. 847–854, 1995.
[5] L. Wei, Z. Chen, M. C. Johnson, K. Roy, Y. Ye, and V. K. De,
“Design and optimization of dual-threshold circuits for low-
voltage low-power applications,” IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, vol. 7, no. 1, pp. 16–24,
1999.
[6] J. C. Park and V. J. Mooney III, “Sleepy stack leakage reduction,”
IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, vol. 14, no. 11, pp. 1250–1263, 2006.
[7] S. Narendra, V. De, D. Antoniadis, A. Chandrakasan, and S.
Borkar, “Scaling of stack effect and its application for leakage
reduction,” in Proceedings of the International Symposium on
Low Power Electronics and Design (ISLPED '01), pp. 195–200,
Huntington Beach, Calif, USA, August 2001.
[8] N. Hanchate and N. Ranganathan, “LECTOR: a technique for
leakage reduction in CMOS circuits,” IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, vol. 12, no. 2, pp. 196–205,
2004.
[9] S. Katrue and D. Kudithipudi, “GALEOR: leakage reduction for
CMOS circuits,” in Proceedings of the 15th IEEE International
Conference on Electronics, Circuits and Systems (ICECS ’08), pp.
574–577, September 2008.
[10] J. W. Chun and C. Y. R. Chen, “A novel leakage power reduction
technique for CMOS circuit design,” in Proceedings of the
International SoC Design Conference (ISOCC ’10), pp. 119–122,
November 2010.
Hindawi Publishing Corporation
VLSI Design
Volume 2014, Article ID 529392, 12 pages
http://dx.doi.org/10.1155/2014/529392
Review Article
Gate-Level Circuit Reliability Analysis: A Survey
Copyright © 2014 R. Xiao and C. Chen. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Circuit reliability has become a growing concern in today’s nanoelectronics, which motivates strong research interest over the
years in reliability analysis and reliability-oriented circuit design. While quite a few approaches for circuit reliability analysis have
been reported, there is a lack of comparative studies on their pros and cons in terms of both accuracy and efficiency. This paper
provides an overview of some typical methods for reliability analysis with focus on gate-level circuits, large or small, with or without
reconvergent fanouts. It is intended to help the readers gain an insight into the reliability issues, and their complexity as well as
optional solutions. Understanding the reliability analysis is also a first step towards advanced circuit designs for improved reliability
in the future research.
Neumann model [6] for gate errors, any gate can be associated independently with an error probability 𝜀𝑖. In other
words, the gate is modeled as a binary symmetric channel that generates a bit flip (from 0 → 1 or 1 → 0) by mistake
at its output (known as von Neumann error [6]) symmetrically with the same probability. Thus, each gate 𝑖 in the
circuit has an independent gate reliability 𝑟𝑖 = 1 − 𝜀𝑖, which is assumed to be localized and statistically stable.
Also, it is reasonable to assume that the error probability for any gate falls within [0, 0.5] (or 𝑟𝑖 ∈ [0.5, 1]).

The reliability for a combinational logic circuit (denoted by 𝑅𝐶) is defined as the probability of the correct
functioning at its outputs (i.e., the joint signal reliability of all primary outputs). This reliability can be
generally expressed as a function of gate reliabilities in the circuit (denoted by
$\mathbf{r} = \{r_1, r_2, \ldots, r_{N_g}\}$, where 𝑁𝑔 is the number of gates), as well as signal probabilities of
all primary inputs (denoted by $\mathbf{P}_{in} = \{P_{in_1}, P_{in_2}, \ldots, P_{in_{N_{in}}}\}$, where 𝑁in is
the number of primary inputs), that is,

$$R_C = \Pr\{\text{all outputs are correct}\} = f(\mathbf{r}, \mathbf{P}_{in}), \tag{1}$$

where the function 𝑓 depends on the topology of the circuit under consideration. Note that the primary inputs are
assumed to be fully reliable (𝑟𝑠 = 1 if 𝑠 is a primary input). Under a particular case where all primary input
probabilities are a constant (say 0.5), 𝑅𝐶 turns out to be a function of r only.

It is worth noting that gate errors may come from either external noises (thermal noise, crosstalk, or radiation) [3]
or inherent device stochastic behaviors [4]. In literature, the term “soft error” is used to emphasize the
temporariness of the errors due to random external noises (e.g., glitches). In this paper, however, a more general
term of von Neumann gate error model is used instead, as the probabilistic feature of gates is expected to exist
widely and independently throughout the circuit. This differs from single-event upsets due to soft errors, where
external noises are usually correlated temporally and spatially. In other words, our focus is the error propagation
in combinational networks, where the gate-level logic masking is considered. For instance, some logic errors may not
affect (or propagate to) final outputs if they occur in a nonsensitized portion of the circuit. Identifying these
nonsensitized gates would be critical for reliability estimation and improvement.

1.2. Role of Reliability Analysis. In order to guide the IC design for reliable logic operations, it is required to
develop tools that can accurately and efficiently evaluate circuit reliability, which is also a first step towards
reliability improvement. However, reliability analysis is a nontrivial task due to the large size of IC circuits as
well as the complexity of signal correlation and probability/reliability propagation within the circuit (as will
become clear later in this paper). On the other hand, circuit reliability can be generally improved by increasing the
gate reliabilities. This can be done by using redundant components. Classic redundancy techniques such as TMR [5] or
NAND-multiplexing [7] achieve this by systematically replicating logic gates (other than sizing up the transistors)
at the cost of increased area and power dissipation. One of the key issues in this context is to select the most
critical (in terms of reliability and cost) components (or logic gates) in the circuit and improve the circuit
reliability by increasing the robustness of only a few gates. In order to detect these critical gates, multiple
cycles of reliability analysis are usually conducted for the whole circuit. In a more general term, accurate and
efficient reliability analysis can provide a guideline for future reliability-oriented architecture design.

1.3. Complexity of Gate-Level Reliability Analysis. It is understood that the problem of determining whether the
signal probability at a given node is nonzero is equivalent to the Boolean satisfiability (SAT) problem [8], a
problem of determining whether there exists an interpretation that satisfies a given Boolean formula. A Boolean
formula is called satisfiable if the variables of this given formula can be assigned in such a way as to make the
formula evaluate to TRUE (3). The SAT has been proved to be an NP-complete problem (see [9]). The problem of
computing all signal probabilities in a circuit can be formulated as a random satisfiability problem, which is to
determine the probability that a random assignment of variables will satisfy a given Boolean formula [9]. The random
satisfiability problem lies in a class of problems, called #P-complete, which is conjectured to be even harder than
NP-complete. In the following, we show that the reliability evaluation problem is equivalent to the signal
probability calculation problem and thus prove that it is also a #P-complete problem.

Let us consider a two-input AND gate (𝑍 = 𝑋1𝑋2) which has the gate reliability 𝑟, as shown in Figure 1. We first add
an extra XOR gate at the output, as well as an extra input 𝑋𝑟, with an assumption that both the XOR gate and original
AND gate are error-free. The signal probability of this extra input is equal to the original gate error rate 𝜀 (i.e.,
𝑃𝑋𝑟 = Pr{𝑋𝑟 = “1”} = 1 − 𝑟). This ensures that the output 𝑍 of this extra XOR gate is equivalent to the original
output of the AND gate.

Figure 1: An AND gate and its equivalent circuit.

For a combinational logic circuit, we first duplicate the whole circuit. In the original circuit, we make each gate
error-free in order to compute the correct value at primary
outputs. For the duplicated one, we extract the reliability of each gate using the aforementioned method (as a
result, all gates are also error-free in the duplicated circuit and the number of gates is doubled). Then, we add
2-input XNOR gates for each pair of corresponding primary outputs in the original and duplicated circuits. Thus, the
output reliability can be expressed as the signal probability at the output of the XNOR gates. By doing so (i.e.,
duplicating the circuit and extracting gate reliabilities), we see that the reliability estimation of the original
circuit is equivalent to the problem of computing the signal probabilities of the transformed circuit.

For a combinational logic circuit with 𝑁in primary inputs, 𝑁out primary outputs, and 𝑁𝑔 logic gates, the problem of
evaluating the signal reliability of all primary outputs and their joint reliability (i.e., the overall circuit
reliability 𝑅𝐶) can be solved by exhaustively calculating all $2^{N_{in}+N_g}$ scenarios. In each scenario, the
expected (correct) output and actual output values need to be calculated with the complexity of 𝑂(𝑁𝑔). The total
complexity is then $O(N_g \cdot 2^{N_{in}+N_g})$. As circuits become very large, it would be difficult or even
impossible to perform the exact analysis of the reliability due to the exponential complexity. Usually, some tradeoff
has to be made between the accuracy and efficiency for reliability analysis.

In order to tackle this issue, a number of different approaches have been reported in literature, including
probabilistic transfer matrix (PTM) method [10–12], Bayesian networks (BN) [13–15], Markov random field (MRF) [16–20],
Monte Carlo (MC) simulation, testing-based method [3], stochastic computation model (SCM) [2, 21], probabilistic gate
model (PGM) [22–25], observability-based analysis [26], Boolean difference-based error calculator (BDEC), and
correlation coefficient method- (CCM-) based approaches [8, 26–28]. In the following, we overview some of these
approaches and analyze their pros and cons in terms of accuracy, efficiency, and flexibility with simulation results.

2. Probabilistic Transfer Matrix (PTM) Method

An accurate analytical model for the reliability analysis problem is based on the probabilistic transfer matrices
(PTMs), which compute the circuit output reliability for all input patterns [10, 11]. This computational framework
begins with the definition of a probability matrix which is used to represent the probability of a logic gate’s
output for each input pattern. For instance, the probability matrix representation for a two-input NAND logic gate is
shown in Figure 2, where each column of the matrix M𝑔 represents the probability of the gate output 𝑍 being “0” or
“1” for all different input patterns (i.e., 𝑋1𝑋2 = “00,” “01,” “10,” and “11”). For example, the element
M11 = Pr{𝑍 = 0 | 𝑋1𝑋2 = 00} = 1 − 𝑟, where 𝑟 is the gate reliability. In general, the probability matrix for an
𝑛-input 1-output gate is a $2^n \times 2$ matrix.

For a circuit, all gate probability matrices shall be combined together to construct the PTM of the whole circuit.
More specifically, the serial and parallel connections of gates correspond to a matrix product and tensor product
[10], respectively. The fanout behavior is represented by explicit fanout gates, where a 1-input 𝑚-output fanout gate
is simply mimicked by a 1-input 𝑚-output buffer gate. A fault-free circuit has an ideal transfer matrix (ITM), where
the correct value of the output occurs with the probability of 1. This means that, in each row of the ITM, there is a
single “1” for the correct output value and there are “0”s for other output combinations. The circuit reliability
(i.e., the probability of outputs being correct) is evaluated by comparing its PTM and ITM.

The process of combining gate probability matrices implicitly takes into account the signal dependency between gates
by considering the underlying joint and conditional probabilities within the circuit. As a result, the calculation of
the circuit PTM is exact. However, the limited scalability is often a price that has to be paid for this
computational framework to capture complex circuit behaviors. Consider a combinational logic circuit with 𝑁in primary
inputs, 𝑁out primary outputs, and 𝑁𝑔 logic gates. The circuit PTM is a matrix with $2^{N_{in}}$ rows and
$2^{N_{out}}$ columns (i.e., $2^{N_{in}} \times 2^{N_{out}}$), which contains the transition probability from all
input combinations toward all output combinations. In other words, its space complexity is $O(2^{N_{in}+N_{out}})$.
This exponential space requirement is the main bottleneck of the PTM approach. In particular, for a computer with 2 GB
memory, the maximum size of the circuit that can be handled is limited to 16 input/output signals. By utilizing some
advanced computation methods (such as algebraic decision diagrams (ADDs) and encoding [10, 11]), the signal width may
be extended up to ∼50, where the signal width is defined as the largest number of signals at any level in the
circuit. Unfortunately, this limit is still computationally unacceptable in the real world for large-scale benchmark
circuits (e.g., C2670 which has 157 inputs and 64 outputs). Nonetheless, for small circuits, the PTM is a very good
analytical method, as it provides exact results within a reasonable runtime and shows the probabilistic behavior of
unreliable logic gates.

Also, this approach can serve as the foundation of many other heuristic approaches by providing other important
information such as signal probabilities and observability, with the capability of analyzing the effect of electrical
masking on error mitigation as well. For instance, in [10], the observability of a gate 𝑖 is defined as the ratio of
the error probability of the whole circuit and the error probability 𝜀𝑖 of this gate, that is, (1 − 𝑅𝐶(𝜀𝑖))/𝜀𝑖, where
𝑅𝐶(𝜀𝑖) is the circuit reliability when the only unreliable gate is the 𝑖th gate (with all other gates being
error-free). Clearly, the gate with the highest observability can be regarded as the most susceptible, meaning that
it will impact (or decrease) the circuit reliability the most. It should be noted that this only represents the
simplest case where only a single gate failure is considered. In most real cases, however, the gate observabilities
may not be independent, and thus the joint observabilities usually need to be considered instead.

The detailed algorithm with the PTM is summarized as follows.

Step 1. Levelize the circuit; compute PTMs of each logic component in each level denoted by $M^{j}_{Lv_i}$.
Figure 2: (a) A 2-input NAND gate and (b) its probability matrix M𝑔 (according to [10]). Rows correspond to the input
patterns 𝑋1𝑋2 = 00, 01, 10, 11 and columns to the output values 𝑍 = 0, 1:

$$M_g = \begin{bmatrix} 1-r & r \\ 1-r & r \\ 1-r & r \\ r & 1-r \end{bmatrix}$$

size increases to ∼40, both runtime and memory cost will grow dramatically, making the PTM method computationally
expensive. In order to handle large-scale circuits, a variant PTM method was proposed in [11], where the input vector
sampling is used. The simulation results show that this does improve efficiency with reduced memory cost, while the
accuracy remains to be seen.

Figure 4: The relative error of circuit reliability 𝑅𝐶 of Figure 3 versus the number of MC simulation runs 𝑁MC
(vertical axis: relative error at 𝑅𝐶; horizontal axis: number of simulation runs, ×10^4).

In summary, the PTM method has two major limitations. First, the signal width of the circuit that can be analyzed is
very limited. This is due to the fact that its space complexity grows exponentially with the number of inputs and
outputs, leading to prohibitively massive matrix storage and manipulation overhead for large-scale circuits. Secondly,
the circuit structure needs to be preprocessed (such as circuit levelization and identification of the fanout nodes
and wire pairs) prior to the algorithm implementation. Also, the PTM assumes all signals are correlated, which makes
the method less efficient for circuits with no or a few reconvergent fanouts.
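
As a rough illustration of the PTM construction described in this section, the following Python sketch builds the
PTM of a small fanout-free circuit and compares it with the ITM. The circuit Z = NAND(NAND(a, b), NAND(c, d)), the
uniform input distribution, the identical reliability 𝑟 for all three gates, and all function names are assumptions
made for this example only; they are not taken from [10, 11]. Plain nested lists are used instead of a matrix
library so that the tensor and matrix products are spelled out explicitly.

```python
def nand_ptm(r):
    """Probability matrix of a 2-input NAND with reliability r.
    Rows: inputs 00, 01, 10, 11; columns: output Z = 0, Z = 1 (as in Figure 2)."""
    return [[1 - r, r], [1 - r, r], [1 - r, r], [r, 1 - r]]


def kron(a, b):
    """Tensor (Kronecker) product: two components placed in parallel in one level."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def matmul(a, b):
    """Matrix product: one level connected in series to the next level."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def circuit_ptm(r):
    """PTM of the assumed circuit Z = NAND(NAND(a, b), NAND(c, d))."""
    level1 = kron(nand_ptm(r), nand_ptm(r))  # 16 x 4: inputs abcd -> (g1, g2)
    return matmul(level1, nand_ptm(r))       # 16 x 2: inputs abcd -> Z


def reliability_from_ptm(r):
    """Circuit reliability for uniformly distributed inputs.
    The ITM is the PTM with error-free gates; each ITM row holds a single 1
    at the correct output, so the element-wise product picks out the
    probability of the correct output for every input pattern."""
    ptm, itm = circuit_ptm(r), circuit_ptm(1.0)
    correct_mass = sum(p * i for p_row, i_row in zip(ptm, itm)
                       for p, i in zip(p_row, i_row))
    return correct_mass / len(ptm)
```

Calling `reliability_from_ptm(0.9)` evaluates the three-gate example with 𝑟 = 0.9. Even in this toy case the
level-1 matrix already has 16 × 4 entries, which hints at the exponential growth in storage discussed above.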
3. Monte Carlo (MC) Simulation

MC is a widely known simulation-based approach, where experimental data are collected to characterize the behavior of
a circuit by randomly sampling its activity [2]. It is usually used when an analytical approach is unavailable or
difficult to implement. The obvious drawbacks of this approach lie in the fact that numerous pseudorandom numbers need
to be generated, and a large number of simulation runs must be executed to reach a stable result. This makes the
reliability analysis for large circuits a very time-consuming process.

As a stochastic computation framework, the MC method makes the result gradually converge to its exact value as more
simulation runs are performed. In the process of achieving relatively stable results, certain statistical parameters
(such as standard deviation 𝜎 and/or coefficient of variation (CV), which is defined as the ratio of the standard
deviation and the mean, i.e., 𝜎/𝜇) are usually used as the stopping criteria. In [2], CV = 0.001 is used to represent
an acceptable level of accuracy, and the number of simulation runs required is given by

$$N_{MC} = \frac{1 - R_C}{R_C} \cdot \frac{1}{\mathrm{CV}^2} \approx 10^6 \cdot \left(\frac{1}{R_C} - 1\right), \tag{5}$$

where 𝑅𝐶 is again the circuit reliability. Since the circuit reliability usually decreases with the circuit size
(𝑁𝑔), the 𝑁MC will increase with the circuit size for a given accuracy (measured by CV). Assuming that the 𝑅𝐶 ranges
from 0.1 to 0.9, the number of MC runs will vary around $10^5$ ∼ $10^7$. It should be mentioned that (5) only gives
an approximated range of 𝑁MC, and its actual value is usually determined experimentally for real circuits. Let us
take the circuit of Figure 3 again as an example. From (5), the required 𝑁MC is ∼1.55 × $10^5$ if 𝑅𝐶 = 0.8658.
Figure 4 shows the relative error at 𝑅𝐶 against 𝑁MC. It can be seen from the figure that after ∼$10^4$ runs, the
result becomes relatively stable around its final value. However, a small random fluctuation is inevitable. Even
after ∼$10^5$ simulation runs, the relative error of the MC result is a nonzero value (5.636e−04), indicating a slow
convergence rate for the MC method. This is a common feature for stochastic computations.

4. Stochastic Computation Model (SCM)

Unlike the MC method which uses Bernoulli sequences for simulation, the SCM approach takes non-Bernoulli sequences
[2, 21]. In a non-Bernoulli sequence, for a given probability 𝑝 and a sequence length 𝑁, the number of “1”s to be
generated is fixed and given by 𝑁 ⋅ 𝑝, and only the positions of the “1”s are determined by a random permutation of
binary bits. Therefore, in the SCM approach, fewer pseudorandom numbers are generated for the same length of
simulation, compared to MC simulation where pseudorandom numbers are independently generated for each gate or input
to mimic the behavior of probabilistic circuits [2].

Consider a circuit with 𝑁in, 𝑁out, 𝑁𝑔, Pin, and 𝜀 (refer to the previous sections for definitions of these
variables). If we use a sequence length of 𝑁, the total required number of random numbers is given by (𝑁in + 𝑁𝑔) ⋅ 𝑁
in MC simulation. In contrast, for the SCM approach with the same sequence length, only 𝑁𝜀 pseudorandom numbers need
to be generated (for the positions of “1”s) for a gate with error rate 𝜀. Therefore, the total number of random
numbers is reduced to (𝑁in ⋅ 𝑝in + 𝑁𝑔 ⋅ 𝜀) ⋅ 𝑁. Since the gate error rate 𝜀 is usually a small value which can be
viewed as a scale factor, the total number of required random numbers is significantly reduced. In other words, for a
specific level of accuracy, the non-Bernoulli sequence requires a smaller sequence length than the Bernoulli sequence
does. However, how to efficiently determine the required minimum sequence length for the SCM is still an open
question. In [2], an empirical function (rather than an analytical expression) was used for this purpose.

Again, we took the example circuit of Figure 3 and used the same sequence length with MC (i.e.,
𝑁SCM = 𝑁MC = $10^5$) with gate error rate 𝜀 = 0.001. The SCM and MC
Table 1: Runtime comparison of MC and SCM on benchmark circuits (𝜀 = 0.01).

                   MC (10^6 runs)    SCM (10^6 runs), runtime (s)
Circuit    Size    runtime (s)       𝜀 = 0.01      𝜀 = 0.1
c432       160     183               31            38
c499       202     203               37            45
c880       383     373               63            77
c1355      546     472               92            111
c1908      880     842               183           215
c2670      1193    1151              265           311
c3540      1669    1616              409           505
c5315      2406    2548              786           961
c7552      3512    3732              1325          1495

5. Probabilistic Gate Model (PGM)

The PGM is another reliability analysis method which is based on the probabilistic models of unreliable logic gates
[22–25]. In the simple version of PGM, the input signals of each gate in the circuit are assumed to be independent.
Under this assumption, the output probability of each gate can be easily calculated using the information of input
signal probabilities and gate error rate. For instance, consider a 2-input NAND gate with input probabilities of 𝑋1
and 𝑋2 and gate error rate of 𝜀. Its output signal probability can be expressed as (after [24])

$$\begin{aligned}
Z &= \Pr(\text{“1”} \mid \text{gate faulty}) \cdot \Pr(\text{gate faulty})
   + \Pr(\text{“1”} \mid \text{gate not faulty}) \cdot \Pr(\text{gate not faulty}) \\
  &= (1 - \varepsilon) + (2\varepsilon - 1) X_1 X_2 .
\end{aligned} \tag{6}$$

This output probability 𝑍 can be used recursively as the input information at next level of gates. One of the main
features with PGM is that the circuit reliability is analyzed by exhaustively evaluating each input combination and
output.
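
To make the recursive use of (6) concrete, here is a minimal Python sketch. It applies (6) level by level on a small
fanout-free circuit, Z = NAND(NAND(a, b), NAND(c, d)), and cross-checks the result by sampling independent gate
errors in the MC style described in Section 3. The circuit, the shared error rate 𝜀 for all gates, the function
names, and the fixed random seed are assumptions made for this illustration and do not come from [24].

```python
import random


def nand_prob(x1, x2, eps):
    """Eq. (6): probability that a NAND gate with error rate eps outputs 1,
    given independent input signal probabilities x1 and x2."""
    return (1 - eps) + (2 * eps - 1) * x1 * x2


def tree_prob(p_in, eps):
    """Output signal probability of Z = NAND(NAND(a, b), NAND(c, d)),
    obtained by feeding the level-1 results of (6) into the level-2 gate."""
    a, b, c, d = p_in
    return nand_prob(nand_prob(a, b, eps), nand_prob(c, d, eps), eps)


def tree_prob_mc(p_in, eps, runs, seed=0):
    """Monte Carlo estimate of the same probability: inputs and gate errors
    are drawn independently in every run (a Bernoulli sequence per signal)."""
    rng = random.Random(seed)

    def noisy_nand(x, y):
        ideal = 1 - (x & y)
        return ideal ^ (rng.random() < eps)  # flip the output with probability eps

    ones = 0
    for _ in range(runs):
        a, b, c, d = (int(rng.random() < p) for p in p_in)
        ones += noisy_nand(noisy_nand(a, b), noisy_nand(c, d))
    return ones / runs
```

Because the example circuit has no fanout, the inputs of every gate really are independent, so `tree_prob` should
agree with `tree_prob_mc` up to sampling noise. The independence assumption is what makes the simple PGM cheap, and
it is also what can break down once fanouts reconverge.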
                   Simple PGM approach (10^3 samples)      Accurate PGM approach (10^3 samples) [24]
Circuit    Size    Average error (%)    Runtime (s)        Runtime (s)
Cu         43      1.37                 0.0277             0.10
z4ml       45      0.94                 0.0039             0.05
x2         38      0.52                 0.0275             0.22
Mux        50      0.52                 0.282              0.10
$O(N_g \cdot 2^{N_{in}+N_f})$ [24]. However, in many real circuits, the number of reconvergent fanouts 𝑁𝑓 is
comparable to the number of gates (𝑁𝑔). Thus, the complexity of the above accurate PGM algorithm is still an
exponential function of the circuit size, making it infeasible in general for large circuits.

In an effort to improve the efficiency of the accurate PGM method, a modular PGM approach was also introduced in
[24]. It is based on the observation that many large circuits contain a limited number of simple logic components
that are used repeatedly. With this in mind, circuits can be decomposed into several modules whose reliabilities are
calculated using the accurate PGM method. The circuit output reliability is then evaluated by combining these modules
along the path from primary inputs. Unfortunately, the input sampling is still needed in this case for large-scale
circuits.

For the example circuit of Figure 3 with 4 input signals, a total of 16 input combinations need to be considered. We
plot the conditional output reliability for each input combination in Figure 6, which shows that the output
reliability varies within a relatively small range (no more than ±10%) for different input combinations. In other
words, the input vector sampling can be implemented effectively with small errors. The overall output reliability is
given by a weighted sum over all input combinations and is found to be 𝑅𝐶 = 0.8701 (with the runtime of
𝑇PGM = 0.0093 s), compared to the accurate value of 0.8658 given by PTM (i.e., the relative error is as low as
∼0.5%).

Figure 6: The conditional output reliability of Figure 3 for different input combinations (𝑥-axis labels 1∼16
indicate 16 input patterns from 0000∼1111).

In order to see the performance of different PGM algorithms on large circuits, we implemented the simple PGM
algorithm in Matlab and tested it on ISCAS’85 benchmarks. The results are shown in Table 2. We also compare the
simple PGM with both accurate and modular PGM methods in Tables 3 and 4, where the simulation results for both
accurate and modular PGM methods are taken from [24].

It can be seen from these tables that the simple PGM algorithm can provide highly accurate results if the circuits
(such as C432 and C1355) have no or few reconvergent fanouts and/or if the fanouts originate from the primary inputs.
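
The dependence of the simple PGM accuracy on reconvergent fanout can be reproduced on toy netlists. The Python sketch
below is only an illustration under stated assumptions: both netlists (`TREE` and `RECONV`), the NAND-only gate
library, the uniform input weighting, and the shared error rate 𝜀 are hypothetical choices for this example and are
unrelated to the ISCAS’85 benchmarks or to the Matlab implementation mentioned above. It compares the simple PGM
against an exact enumeration of all input and gate-error scenarios.

```python
from itertools import product

# Netlist format assumed for this sketch: (number of primary inputs, list of NAND gates).
# Signals 0..n_in-1 are primary inputs, gate k drives signal n_in + k,
# and the last gate drives the single primary output.
TREE = (3, [(0, 1), (3, 2)])    # Z = NAND(NAND(a, b), c): no fanout
RECONV = (2, [(0, 1), (2, 2)])  # Z = NAND(G1, G1), G1 = NAND(a, b): G1 fans out and reconverges


def simple_pgm(netlist, eps):
    """Simple PGM: propagate signal probabilities with eq. (6) for each input
    combination, treating the inputs of every gate as independent."""
    n_in, gates = netlist
    total = 0.0
    for bits in product((0, 1), repeat=n_in):
        prob = [float(b) for b in bits]  # Pr(signal = 1); primary inputs are error-free
        ideal = list(bits)               # error-free logic values
        for i, j in gates:
            prob.append((1 - eps) + (2 * eps - 1) * prob[i] * prob[j])
            ideal.append(1 - (ideal[i] & ideal[j]))
        total += prob[-1] if ideal[-1] else 1 - prob[-1]
    return total / 2 ** n_in


def exact(netlist, eps):
    """Exact reliability by enumerating every input and gate-error scenario."""
    n_in, gates = netlist
    total = 0.0
    for bits in product((0, 1), repeat=n_in):
        ideal = list(bits)
        for i, j in gates:
            ideal.append(1 - (ideal[i] & ideal[j]))
        for errs in product((0, 1), repeat=len(gates)):
            sig, weight = list(bits), 1.0
            for (i, j), e in zip(gates, errs):
                sig.append((1 - (sig[i] & sig[j])) ^ e)
                weight *= eps if e else 1 - eps
            if sig[-1] == ideal[-1]:
                total += weight
    return total / 2 ** n_in
```

On `TREE` the two functions agree, since the gate inputs are independent for every fixed input pattern. On `RECONV`
the simple PGM multiplies the probability of G1 by itself as if the two branches were independent, which introduces
an error even in this two-gate circuit.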
For those circuits with significant fanouts (such as C2670 and C5315), the average (or maximum) errors for the simple
PGM can increase significantly (in particular, the maximum error is up to 43% for C5315, as shown in Table 2). From
Table 3, the accurate PGM needs longer runtimes than the simple PGM for small circuits. Results in Table 4 confirm
that the modular PGM is very efficient while the accuracy may not always be good enough for some circuits (with an
average error of 9% for C432).

In summary, for all the above three different versions of PGM, the input sampling is inevitable for improved
efficiency if the number of primary inputs 𝑁in is large (∼30). This is mainly where the analysis errors come in.
Thus, it can be concluded that they represent a good model only for circuits with a small number of primary inputs,
where no input sampling is required. For the circuits without reconvergent fanouts, the input sampling in the PGM
approach is unnecessary, because both signal probability and output reliability in this case can be computed within
𝑂(𝑁𝑔) time (see [29] for details).

6. Observability-Based Reliability Analysis

Another reliability analysis method was presented in [26], which is based on the observation that an error at the
output of any gate is the cumulative effect of a local error component attributed to the error probability of the
gate, and a propagated error component attributed to the failure of gates in its transitive fan-in cone. In [26], the
observability of a gate (or its output signal) is the conditional circuit error probability given the single error at
current gate. The value of this observability can be simply defined as 𝑜𝑖 = (1 − 𝑅𝐶(𝜀𝑖 = 1)), where 𝑅𝐶(𝜀𝑖 = 1) is the
circuit reliability given a single error with the current gate, and can be calculated using Boolean differences [29],
symbolic techniques (such as BDDs), or simulation method. It can be expected that the gate observabilities are highly
related to the input probabilities.

For a single-fault case (i.e., only one gate in the circuit is erroneous), the circuit reliability (assuming a single
primary output) can be simply calculated by considering each fault case individually. Assume that the error rate and
observability of the 𝑖th gate are 𝜀𝑖 and 𝑜𝑖, respectively. If gate 𝑖 is erroneous while the other gates are
fault-free, the output reliability simply is equal to 𝑜𝑖. Thus, the overall reliability can be easily calculated by

$$R_C = \sum_{\text{all } i} \Bigl( \varepsilon_i \cdot o_i \cdot \prod_{j \neq i} (1 - \varepsilon_j) \Bigr), \tag{8}$$

which is exact for the single-fault case.

If a multiple-error case is considered, the complexity of computing the reliability will grow exponentially with 𝑁𝑔.
In order to improve the efficiency in this case, the following two assumptions are used in [26]: (a) the impacts of
gate failures on the primary output are decoupled, which implies that the output is erroneous if an odd number of
gates are simultaneously observable and (b) the observabilities of all gates are independent. As a result, the
simultaneous observability of multiple gates is simply the product of their individual observabilities.

We took the example circuit of Figure 3 for illustration. First, let us assume all four gates (G1 ∼ G4) in the
circuit are erroneous with the probabilities of 𝜀1, 𝜀2, 𝜀3, and 𝜀4, respectively (other cases can be analyzed
similarly). Based on the above assumption (a), we only need to consider the cases where an odd number (1 or 3) of
gates is simultaneously observable. This means that when an even number (0, 2, or 4) of gates is observable, the
output signal 𝑍 will have the correct value as gate errors are logically masked by one another. Secondly, under the
assumption (b), the probability of only one gate being observable is given by
$\sum_i \bigl(o_i \cdot \prod_{j \neq i} (1 - o_j)\bigr)$ (the probability of three gates being simultaneously
observable can be calculated similarly). Based on these assumptions, a closed-form expression for the circuit
reliability of the circuit (assuming a single primary output) can be written generally as a function of error
probabilities and observabilities of all gates [26]; that is,

$$R_C = \frac{1}{2} \Bigl( 1 + \prod_i (1 - 2 \varepsilon_i o_i) \Bigr), \tag{9}$$

which can be computed efficiently if all gate observabilities are known (however, this analysis is only suitable for
small circuits or large ones with small values of gate error probabilities, which will be clear later). The gate
observability can be determined using the PTM method. For instance, the observability of gate G1 in Figure 3 is
calculated as the output reliability by setting 𝑟1 = 0 and 𝑟2 = 𝑟3 = 𝑟4 = 1. The results are
[𝑜1, 𝑜2, 𝑜3, 𝑜4] = [0.25, 0.375, 0.375, 0]. We calculate the circuit reliability 𝑅𝐶 using the above expression and
plot the results against the accurate values given by the PTM in Figure 7(a) for different values of gate
reliability. The relative error is shown in Figure 7(b). It can be seen clearly from these figures that the
observability-based analysis is only accurate for small gate error rates, in which case the probability for single
gate failure is significantly higher than that for multiple gate failures.

To reduce the computational complexity of the above observability-based reliability analysis, [26] also proposed
a sampling algorithm by considering the constraint that only a maximum of 𝑘 gates can fail simultaneously. This
algorithm first generates a set of samples for failed gates and guarantees that the total number of gates with error
is no more than 𝑘. Then, a single-pass reliability analysis algorithm [26] was used to evaluate the error probability
at the primary outputs, leading to the computational complexity of $O(N_g \cdot k^2)$, where 𝑁𝑔 is the number of
gates with error. For a specific sample, the reliabilities of gates in the sampling are set to be 0 and the rest are
set to be 1. Finally, the overall circuit reliability is estimated by averaging the reliabilities over all samples.
Therefore, this maximum-𝑘 gate failure model can be viewed as a hybrid method that makes a trade-off between the
accuracy of simulation-based method and the efficiency of analytical approach. It provides more accurate results than

Figure 7: (a) Circuit reliability 𝑅𝐶 versus gate reliability and (b) relative error versus gate reliability for the
example circuit of Figure 3 (horizontal axis: gate reliability 𝑟𝑔, from 0.5 to 1).

correlation coefficient between signals 𝑖 and 𝑗 is defined as [26]

$$\begin{aligned}
C_{ij} &= \frac{P(i_{0 \to 1}\, j_{0 \to 1})}{P(i_{0 \to 1})\, P(j_{0 \to 1})}, &
C_{i\tilde{j}} &= \frac{P(i_{0 \to 1}\, j_{1 \to 0})}{P(i_{0 \to 1})\, P(j_{1 \to 0})}, \\
C_{\tilde{i}j} &= \frac{P(i_{1 \to 0}\, j_{0 \to 1})}{P(i_{1 \to 0})\, P(j_{0 \to 1})}, &
C_{\tilde{i}\tilde{j}} &= \frac{P(i_{1 \to 0}\, j_{1 \to 0})}{P(i_{1 \to 0})\, P(j_{1 \to 0})},
\end{aligned} \tag{11}$$

where the $P(i_{0 \to 1})$ is the probability that the value of signal 𝑖 flips to 1 from its correct value 0, that
is, the error probability of 𝑖 given that its error-free value is 0. Once the error correlations and error-free
signal probabilities are generated, the single-pass analysis is conducted using the forward topological order with
the computational complexity of 𝑂(𝑁𝑔). Since the computation complexity of CCM is linear with the number of levels
(𝐿) and pseudoquadratic with the number of gates per level (𝑁𝐿), the overall complexity of CCM-based reliability
analysis turns out to be $O(N_g^{1.5})$ if a square circuit is assumed (i.e., $N_L = L = N_g^{0.5}$). This complexity
is an upper bound as not all signals are correlated in real circuits.

In [26] which uses the CCM, an average relative error of up to ∼13% over all outputs was reported for circuits with
significant fanout (e.g., C499 and C1355) when the gate error rates range within [0, 0.5] (for other benchmark
circuits,
the error was around 2∼6%). Also, the relative errors may not be mitigated significantly by using more correlation
coefficients. For instance, by using 0, 4, and 16 correlation coefficients, the relative errors for C499 are only
improved to 13.1%, 11.2%, and 11.11%, respectively [26], where the zero-coefficient case means that all signals are
treated as independent with the computation complexity of 𝑂(𝑁𝑔). It is shown in [26] that the runtime of using 4
coefficients is about two orders of magnitude longer than the zero-coefficient case (∼100 s versus ∼1 s, for a
circuit with ∼1000 gates). Therefore, it may not be worthwhile to calculate more correlation coefficients for
slightly improved accuracy. In [30], the relative error for large circuits (with hundreds of gates) was reported at
∼7% on average with the runtime of ∼10 s, which is comparable to those from [26].

Figure 8: A general sketch of solution space for different circuit categories (horizontal axis: computation cost;
vertical axis: accuracy level; regions: small circuit with many reconvergent fanouts, large circuit with a few
reconvergent fanouts, large circuit with lots of reconvergent fanouts, small circuit with a few reconvergent
fanouts).
depending on the possible correlation among individual output reliabilities. For an extreme case where all individual
output reliabilities are independent, the joint reliability will simply be the product of all these reliabilities,
which leads to a minimum value. As the correlation of output reliabilities becomes strong, the joint reliability
tends to rise. In general, the complexity of computing the joint reliability would be an exponential function of the
number of primary outputs. It is still an open question how to estimate the joint reliability for multiple-output
circuits in an efficient way. Secondly, most of the current reliability analysis frameworks assume that the
reliability for an error-free output being “0” (denoted by 𝑟0) is the same as that for an error-free output being “1”
(denoted by 𝑟1). This is the so-called symmetric reliability model. However, this assumption does not always hold
true in the real world. Thus, an asymmetric reliability model (where 𝑟0 ≠ 𝑟1) would make more sense for better
estimation of reliability. This requires further research work that can take the asymmetric model into
consideration. Finally, there is also plenty of room for gate-level reliability improvement using reliability-critical
gates as well as considering other performance metrics (such as circuit area and delay and power consumption).
Unfortunately, to the best of authors’ knowledge, little or limited study has been done so far in this regard.

9. Conclusion

We have reviewed the state-of-the-art methods for reliability analysis and shown their advantages and disadvantages.
Some of these methods have been implemented on benchmark circuit examples to compare their performance in terms of
accuracy and efficiency. While these methods seem to be effective for some specific cases/circuits, no single one of
them stands out as an all-time winner due to the nature and complexity of the reliability analysis problem. Further
work has also been suggested for the future research in this area.

[5] C. Chen, “Reliability-driven gate replication for nanometer-scale digital logic,” IEEE Transactions on
Nanotechnology, vol. 6, no. 3, pp. 303–308, 2007.
[6] J. von Neumann, “Probabilistic logics and the synthesis of reliable organisms from unreliable components,” in
Automata Studies, C. E. Shannon and J. McCarthy, Eds., pp. 43–98, Princeton University Press, Princeton, NJ, USA,
1956.
[7] J. Han and P. Jonker, “A system architecture solution for unreliable nanoelectronic devices,” IEEE Transactions
on Nanotechnology, vol. 1, no. 4, pp. 201–208, 2002.
[8] S. Ercolani, M. Favalli, M. Damiani, P. Olivo, and B. Ricco, “Estimate of signal probability in combinational
logic networks,” in Proceedings of the 1st European Test Conference, pp. 132–138, Paris, France, April 1989.
[9] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H.
Freeman, San Francisco, Calif, USA, 1979.
[10] S. Krishnaswamy, G. F. Viamontes, I. L. Markov, and J. P. Hayes, “Accurate reliability evaluation and
enhancement via probabilistic transfer matrices,” in Proceedings of the Design, Automation and Test in Europe, vol.
1, pp. 282–287, March 2005.
[11] S. Krishnaswamy, G. F. Viamontes, I. L. Markov, and J. P. Hayes, “Probabilistic transfer matrices in symbolic
reliability analysis of logic circuits,” ACM Transactions on Design Automation of Electronic Systems, vol. 13, no. 1,
article 8, 2008.
[12] W. Ibrahim, V. Beiu, and M. H. Sulieman, “On the reliability of majority gates full adders,” IEEE Transactions
on Nanotechnology, vol. 7, no. 1, pp. 56–67, 2008.
[13] T. Rejimon, K. Lingasubramanian, and S. Bhanja, “Probabilistic error modeling for nano-domain logic circuits,”
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 17, no. 1, pp. 55–65, 2009.
[14] T. Rejimon and S. Bhanja, “Scalable probabilistic computing models using Bayesian networks,” in Proceedings of
the IEEE International 48th Midwest Symposium on Circuits and Systems (MWSCAS ’05), pp. 712–715, August 2005.
[15] J. T. Flaquer, J. M. Daveau, L. Naviner, and P. Roche, “Fast reliability analysis of combinatorial logic
circuits using conditional probabilities,” Microelectronics Reliability, vol. 50, no. 9–11, pp. 1215–1218, 2010.
Conflict of Interests [16] R. I. Bahar, J. Chen, and J. Mundy, “A probabilistic-based design
for nanoscale computation,” in Nano, Quantum and Molecular
The authors declare that there is no conflict of interests Computing: Implications to High Level Design and Validation,
regarding the publication of this paper. S. Shukla and R. I. Bahar, Eds., chapter 5, Kluwer Academic,
Norwell, Mass, USA, 2004.
References [17] R. I. Bahar, J. Mundy, and J. Chen, “A probability-based design
methodology for nanoscale computation,” in Proceedings of the
[1] S. Borkar, “Designing reliable systems from unreliable compo- International Conference on Computer-Aided Design, pp. 480–
nents: The challenges of transistor variability and degradation,” 486, November 2003.
IEEE Micro, vol. 25, no. 6, pp. 10–16, 2005. [18] A. R. Kermany, N. H. Hamid, and Z. A. Burhanudin, “A study
[2] J. Han, H. Chen, J. Liang, P. Zhu, Z. Yang, and F. Lombardi, of MRF-based circuit implementation,” in Proceedings of the
“A stochastic computational approach for accurate and efficient International Conference on Electronic Design (ICED ’08), pp. 1–
reliability evaluation,” IEEE Transactions on Computers, vol. 63, 4, December 2008.
no. 6, pp. 1336–1350, 2014. [19] D. Bhaduri and S. Shukla, “NANOLAB—a tool for evaluating
[3] S. Krishnaswamy, S. M. Plaza, I. L. Markov, and J. P. Hayes, reliability of defect-tolerant nanoarchitectures,” IEEE Transac-
“Signature-based SER analysis and design of logic circuits,” tions on Nanotechnology, vol. 4, no. 4, pp. 381–394, 2005.
IEEE Transactions on Computer-Aided Design of Integrated [20] X. Lu, J. Li, and W. Zhang, “On the probabilistic characterization
Circuits and Systems, vol. 28, no. 1, pp. 74–86, 2009. of nano-based circuits,” IEEE Transactions on Nanotechnology,
[4] C. Chen and Y. Mao, “A statistical reliability model for single- vol. 8, no. 2, pp. 258–259, 2009.
electron threshold logic,” IEEE Transactions on Electron Devices, [21] H. Chen and J. Han, “Stochastic computational models for
vol. 55, no. 6, pp. 1547–1553, 2008. accurate reliability evaluation of logic circuits,” in Proceedings
12 VLSI Design
Research Article
Low-Area Wallace Multiplier
Copyright © 2014 S. Asif and Y. Kong. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Multiplication is one of the most commonly used operations in arithmetic. Multipliers based on the Wallace reduction tree provide
an area-efficient strategy for high speed multiplication. A number of modifications are proposed in the literature to optimize the
area of the Wallace multiplier. This paper proposes a reduced-area Wallace multiplier without compromising on the speed of the
original Wallace multiplier. Designs are synthesized using Synopsys Design Compiler in 90 nm process technology. Synthesis results
show that the proposed multiplier has the lowest area as compared to other tree-based multipliers. The speed of the proposed and
reference multipliers is almost the same.
Figure 1: General block diagram of tree-based multipliers (reduction tree followed by final adder).

2. Previous Architectures

This section discusses some previous Wallace tree-based multiplier architectures. The general block diagram of tree-based multipliers is shown in Figure 1.
The dot notation [11] is used to represent the partial product tree in all the architectures discussed in this section as shown from Figures 2 to 5. The full adders and half adders are represented by boxes around the dot products. The box which encloses three dot products represents a full adder, whereas the box containing only two dot products is used to represent a half adder. The stages are separated by a thick horizontal line.

2.1. Traditional Wallace (TW) Multiplier. In TW multiplier architecture, the partial product tree is divided into groups [6]. Each stage can have one or more groups as shown in the 8-bit TW reduction process in Figure 2.
The groups in a stage are separated by a thin horizontal line. Each group consists of three rows except the last group where the number of rows can be less than three. Equation (1) calculates the number of rows in the last group for each stage as

\[ \text{Last Group}_i = r_i \bmod 3. \tag{1} \]

An $N$-bit multiplier has $N$ rows in the first stage. The number of rows in remaining stages can be calculated by using

\[ r_i = 2\left\lfloor \frac{r_{i-1}}{3} \right\rfloor + r_{i-1} \bmod 3. \tag{2} \]

Reduction is performed using a full adder or a half adder depending on the number of elements in that particular column of the group. If a column has only one element then that is passed on to the next stage without any reduction. If the last group of a stage contains less than three rows then no reduction is performed on that group as shown in stage 1 of Figure 2.
The size of the final adder for an $N$-bit TW multiplier with $S$ stages can be calculated by

\[ \text{FinalAdder}_{\text{TW}} = (2N - 1) - S. \tag{3} \]
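The stage-by-stage bookkeeping of equations (1) to (3) can be traced with a short script. The following is a minimal Python sketch, not the authors' tool. It assumes that every complete three-row group is reduced to two rows, that the rows of an incomplete last group pass through unchanged, and that $S$ counts the reduction stages needed to reach two rows. The function names are hypothetical.

```python
def tw_rows_per_stage(n_bits):
    """Return the row count of a TW reduction tree at each stage, ending at 2 rows."""
    rows = [n_bits]                      # an N-bit multiplier starts with N rows
    while rows[-1] > 2:
        r = rows[-1]
        last_group = r % 3               # equation (1): rows in the last group
        # Assumed reading of equation (2): each full group of three rows becomes
        # two rows, and the incomplete last group is passed on without reduction.
        rows.append(2 * (r // 3) + last_group)
    return rows


def tw_final_adder_size(n_bits):
    """Equation (3): final adder size for an N-bit TW multiplier with S stages."""
    # S is assumed to be the number of reduction stages (row counts minus the start).
    stages = len(tw_rows_per_stage(n_bits)) - 1
    return (2 * n_bits - 1) - stages


if __name__ == "__main__":
    for n in (8, 16, 32):
        print(n, tw_rows_per_stage(n), tw_final_adder_size(n))
```

For an 8-bit multiplier the sketch yields the row sequence 8, 6, 4, 3, 2, that is, four stages and a final adder size of 11 under the stated assumptions.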
Technology            90 nm CMOS
Supply voltage        1.2 V
Temperature           25 °C
Process model         Typical
Interconnect model    Balanced tree

5. Results

In this section, we will discuss the verification of designs for correct operation, synthesis tool, and the results.

Figure 6 shows the normalized area for each multiplier. The area of all multipliers is normalized with respect to TW multiplier by using

Figure 7: Delay of different multipliers on Synopsys 90 nm technology. (Normalized delay versus size of multiplier, 8 to 64 bits, for the TW, Dadda, RCW, and PW multipliers.)
Figure 8: Power consumption of different multipliers on Synopsys 90 nm technology. (Normalized power versus size of multiplier, 8 to 64 bits, for the TW, Dadda, RCW, and PW multipliers.)
It is clear from Figure 8 that the TW multiplier has the lowest power consumption as compared to the other multipliers. One reason for this could be that the Design Compiler was able to find the low-power cells to synthesize the TW multiplier. The regular structure of TW multiplier could also be a reason of its low-power consumption. The power consumption of PW is less than that of the RCW multiplier due to the smaller final adder used in PW multiplier. It can be noted that the difference in power consumption of PW and RCW is very little for large multipliers, as expected, due to the small difference in their area.

6. Conclusion and Future Work

This paper presents a method to reduce the area of the Wallace multiplier. The proposed architecture, named as PW (proposed Wallace) multiplier, uses a smaller final adder to reduce the area of a multiplier. The designs are synthesized in Synopsys Design Compiler using 90 nm process technology. The synthesis results verify that the PW multiplier, as expected, has the smallest area as compared to the other Wallace based multipliers. The speed of the PW multiplier is almost the same as of other multipliers.
As our future work, we plan to implement the designs using Synopsys IC Compiler to analyze the postlayout results for area and delay. Synopsys Prime Time can be used to analyze the multipliers for their power consumption.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] N. H. E. Weste and D. M. Harris, Integrated Circuit Design, Pearson, 2010.
[2] J.-Y. Kang and J.-L. Gaudiot, “A fast and well-structured multiplier,” in Proceedings of the EUROMICRO Systems on Digital System Design (DSD ’04), pp. 508–515, September 2004.
[3] C. R. Baugh and B. A. Wooley, “A twos complement parallel array multiplication algorithm,” IEEE Transactions on Computers, vol. C-22, no. 12, pp. 1045–1047, 1973.
[4] S.-R. Kuang, J.-P. Wang, and C.-Y. Guo, “Modified booth multipliers with a regular partial product array,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 56, no. 5, pp. 404–408, 2009.
Research Article
Efficient Hardware Trojan Detection with Differential Cascade
Voltage Switch Logic
Copyright © 2014 Wafi Danesh et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Offshore fabrication, assembly, and packaging challenge chip security, as original chip designs may be tampered with through malicious
insertions, known as hardware Trojans (HTs). HT detection is imperative to guarantee the chip performance and safety. Existing HT
detection methods have limited capability to detect small-scale HTs and are further challenged by the increased process variation.
To increase HT detection sensitivity and reduce chip authorization time, we propose to exploit the inherent feature of differential
cascade voltage switch logic (DCVSL) to detect HTs at runtime. In normal operation, a system implemented with DCVSL always
produces complementary logic values in internal nets and final outputs. Noncomplementary values on inputs and internal nets in
DCVSL systems potentially result in abnormal power behavior and even system failures. By examining special power characteristics
of DCVSL systems upon HT insertion, we can detect HTs, even if the HT size is small. Simulation results show that the proposed
method achieves up to 100% HT detection rate. The evaluation on ISCAS benchmark circuits shows that the proposed method
obtains a HT detection rate in the range of 66% to 98%.
1. Introduction

The growing number of ICs manufactured offshore increases the threats to chip security [1–3]. Research has exposed an increase in existence of hardware Trojans (HTs), which are malicious additions or modifications to the circuit design that alter the original function. Malicious inclusions of hardware have the potential to degrade system performance, surreptitiously delete data, leave a backdoor for secret key leaking, or eventually destroy the chip [4, 5]. It is imperative to detect HTs.
HTs can be detected by destructive approaches such as the chemical mechanical polishing (CMP) method. The CMP approach detects HTs by analyzing pictures of the demetalized chips under an electron microscope [6]. In addition to being expensive, this type of technique is also time consuming (takes several months) and loses its efficiency when the transistor density increases. Nondestructive HT detection methods are broadly classified into two categories: logic testing and side-channel analysis (SCA) approaches [6]. Automatic test pattern generation (ATPG) approaches examine whether the measured outputs match the expected one for given inputs and work well for a functional unit with a small set of inputs, as the probability of rare events is relatively high. When the circuit complexity increases, the number of test vectors for ATPG will significantly increase to an unaffordable degree. The benefits relative to the testing efforts of ATPG become worse if the input nodes for the HT’s trigger circuit are spread out throughout the system. The main challenge with logic testing approaches [7, 8] is the generation of stimulus for sequential HTs. Voltage inversion technique alternates supply voltage and ground grids in CMOS-based functional blocks to change the original logic function and thus increases the HT trigger probability [9]. Dummy flip-flops are inserted into the design to increase transition probability of particular paths and reduce Trojan activation time [10]. Alkabani [11] introduces the concept of creating dual circuits for a given design. By testing the dual with a few random input vectors, a HT inserted in the original design can be detected.
SCA approaches examine the anomalous behavior (resulting from HTs) in system parameters such as transient current, power, and path delay [12–15]. A multiple-parameter
Figure 2: DCVSL logic gates. (a) General gate structure and (b) circuit schematic of NAND3-AND3. Current track highlighted in the figure is for noncomplementary inputs on $A$ and $\bar{A}$.
the HT detection mechanism. A current sensor is needed to convert the transient current of the DUT and produce an analog voltage that is proportional to the measured DUT current. A programmed microcontroller can sample the analog voltage signal at specific intervals using interrupts. When the voltage value stays approximately constant for multiple interrupts, it indicates an abnormal short-circuit power due to a HT creating a short-circuit path from supply voltage $V_{DD}$ to ground. The microcontroller can be further configured to set off an alarm or trigger a light-emitting diode to indicate HT detection to the user.

2.2. Short-Circuit Power-Based HT Detection

2.2.1. Unique Short-Circuit Power in DCVSL. Each DCVSL gate needs complementary inputs and produces complementary outputs [21], as shown in Figure 2(a). In normal operation, short-circuit power consumption of DCVSL gate is close to that of CMOS logic gate, as the time period for the direct current path from $V_{DD}$ to ground is extremely short compared with that in switching and steady state conditions. When the input pair is noncomplementary (both inputs being either logic 0 or logic 1), a DCVSL gate loses its complementary nature. More specifically, the output pair may be noncomplementary, resulting in the short-circuit power consumption lasting for a significantly longer time than the case with complementary inputs.
Take a 3-input NAND-AND gate as an example. The circuit schematic is shown in Figure 2(b). In normal operation conditions, we give the input vector of $A = B = C = 1$ and $\bar{A} = \bar{B} = \bar{C} = 0$. The NAND Out port is pulled down to logic low through NMOS transistors N0, N1, and N2; this in turn activates PMOS transistor P1. As P1 is turned on, the AND Out node is pulled to logic high and thus P0 is turned off. The time period when both PMOS and NMOS transistors are on is extremely short. Let us reconsider the 3-input NAND-AND gate with the same input vector, except that we make $A = \bar{A} = 1$. Now, there exist two paths from $V_{DD}$ to the ground terminal: one is through N0, N1, and N2 and another one is through N3. The path through N0, N1, and N2 pulls the NAND Out port low as before, which turns on P1. P1 then tries to pull the AND Out port high. At the same time, the path to ground through N3 tries to pull the AND Out node low. If N3 is stronger than P1 (which is typically the case), the AND Out port is pulled low and this activates P0. Therefore, a path from $V_{DD}$ to ground is created through P0, N0, N1, and N2, resulting in a high and constant short-circuit power. The constant short-circuit power remains as long as the duration of the input vector.
Figures 3(a) and 3(b) show the power waveforms with complementary and noncomplementary inputs, respectively. As shown in Figure 3(b), in the duration of the input vector $A = \bar{A} = B = C = 1$ (from 7 to 8 μs on the time axis), the peak power has a constant high value. This is because the noncomplementary input pair ($A = \bar{A}$) makes NAND Out and AND Out both stay at logic low. The time from 7 to 8 μs represents the high time of the shortest input pulse $A$. As a result, the two PMOS transistors, P0 and P1, are both turned on; thus, the two current paths from $V_{DD}$ to ground (highlighted in Figure 2(b)) exist till the input vector is changed. The amplitude of short-circuit power is typically three orders of magnitude higher than the leakage power. This significant power difference between the cases using complementary and noncomplementary inputs is large enough for a monitoring device to indicate the presence of a HT.
We examine the average power for complementary and noncomplementary inputs for basic DCVSL gates using a typical IBM7RF technology library. As shown in Table 1, the increase on the average power (averaging power for all possible input patterns) caused by noncomplementary inputs is over three orders of magnitude. This is the basis for choosing DCVSL to implement functional units that facilitate HT detection. If the triggered HT flips the internal node of a functional unit, it will create a noncomplementary signal in the middle of that functional unit. Consequently, the power consumption will stay high for a long time, which is different from normal switching power.
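The NAND3-AND3 walk-through above can be restated as a small switch-level model. This is a simplified Python sketch under stated assumptions, not a circuit simulation: it treats each transistor as an ideal switch, assumes the NMOS pull-down networks always overpower the PMOS pull-ups, and assumes the AND Out side is pulled down by parallel NMOS devices driven by the complemented inputs (the text names only N3). The function name and the returned fields are hypothetical.

```python
def dcvsl_nand3_and3(a, a_bar, b, b_bar, c, c_bar):
    """Steady-state view of a DCVSL NAND3-AND3 gate for one input vector."""
    # Series NMOS N0, N1, N2 between NAND Out and ground conduct only if A=B=C=1.
    series_on = bool(a and b and c)
    # Assumed parallel pull-down on the AND Out side, driven by the complement rails.
    parallel_on = bool(a_bar or b_bar or c_bar)

    if not series_on and not parallel_on:
        # Neither network conducts: the outputs keep their previous values,
        # which this sketch does not track.
        return {"nand_out": None, "and_out": None, "static_path": False}

    # Assumption: a conducting NMOS network wins against the PMOS pull-up.
    nand_out = 0 if series_on else 1
    and_out = 0 if parallel_on else 1

    p0_on = and_out == 0    # P0 is gated by AND Out and pulls NAND Out high
    p1_on = nand_out == 0   # P1 is gated by NAND Out and pulls AND Out high

    # A static VDD-to-ground path exists when a PMOS and the NMOS network
    # on the same output node conduct at the same time.
    static_path = (p0_on and series_on) or (p1_on and parallel_on)
    return {"nand_out": nand_out, "and_out": and_out, "static_path": static_path}


if __name__ == "__main__":
    print(dcvsl_nand3_and3(1, 0, 1, 0, 1, 0))  # complementary inputs
    print(dcvsl_nand3_and3(1, 1, 1, 0, 1, 0))  # noncomplementary pair on A
```

With complementary inputs only one pull-down network conducts and no static path is reported; with the noncomplementary pair on $A$ both outputs are low and the model flags a static path, matching the qualitative behavior described in the text.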
Figure 3: Voltage and power waveforms for DCVSL NAND3-AND3 gate. (a) Complementary inputs and (b) noncomplementary input $A = \bar{A}$. (Each panel plots the AND Out and NAND Out voltages in V and the total power in μW against time in μs; panel (b) is annotated "Constant short-circuit power".)
2.2.2. Probability of Abnormal Short-Circuit Power. The key reason for DCVSL gate having abnormal short-circuit power is the noncomplementary output nodes turning on the two PMOS transistors simultaneously. We assume that the consequence of a HT insertion on DCVSL functional units is flipping one of the complementary inputs. This is similar to HT insertion in other technologies; that is, a triggered HT is used to change the logic value of a logic gate or memory element.
Because of electrical and logical masking, the noncomplementary inputs (caused by HTs) do not always yield abnormal short-circuit power. As the logic gate topology varies between gates, it is difficult to obtain a closed-form expression for the probability of abnormal power occurrence. We summarize the general procedure for how to analyze the HT detection probability in DCVSL systems through abnormal power observation. Figure 4 is the flowchart for the analysis procedure.

Figure 4: Flowchart for analyzing the HT detection probability for a DCVSL gate. (Final step of the flowchart: define HT detection probability = number of abnormal peaks/total number of input patterns.)

In order to create an erroneous output in DCVSL, a HT has to make one or more of the inputs noncomplementary. This may result in an erroneous output if the effect of the noncomplementary input is propagated and reaches the output port. An important point to note is that not all erroneous outputs are accompanied by abnormal power peaks. Only if the erroneous output creates at least one path from $V_{DD}$ to ground, will we observe the abnormal short-circuit power.
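The ratio defined in Figure 4 (number of abnormal peaks over total number of input patterns) can be illustrated by brute-force enumeration. The following Python sketch is an idealized switch-level approximation, not the simulation flow used in the paper. It assumes that a triggered HT forces the complement rail of one input equal to its true rail, and that abnormal short-circuit power occurs exactly when both pull-down networks of the gate conduct. The function name, the `GATES` table, and the network descriptions are hypothetical.

```python
from itertools import product


def abnormal_power_fraction(pulldown_true, pulldown_comp, n_inputs, flipped=0):
    """Fraction of input patterns with both DCVSL pull-down networks conducting.

    pulldown_true: predicate over the true rails (network on one output node).
    pulldown_comp: predicate over the complement rails (network on the other node).
    flipped: index of the input whose complement rail the HT makes noncomplementary.
    """
    abnormal = 0
    total = 0
    for bits in product((0, 1), repeat=n_inputs):
        true_rails = list(bits)
        comp_rails = [1 - bit for bit in bits]
        # Assumed HT effect: the complement rail no longer complements its input.
        comp_rails[flipped] = true_rails[flipped]
        total += 1
        if pulldown_true(true_rails) and pulldown_comp(comp_rails):
            abnormal += 1
    return abnormal / total


# Series networks are modelled with all(), parallel networks with any().
GATES = {
    "Inverter": (lambda t: t[0] == 1, lambda c: c[0] == 1, 1),
    "AND2": (all, any, 2),
    "AND3": (all, any, 3),
    "OR2": (any, all, 2),
}

if __name__ == "__main__":
    for name, (net_true, net_comp, n) in GATES.items():
        print(name, abnormal_power_fraction(net_true, net_comp, n))
```

Under these assumptions the sketch gives 0.5 for the inverter, 0.25 for AND2 and OR2, and 0.125 for AND3, which agrees with the corresponding abnormal-power entries of Table 2. It is not expected to reproduce gates whose behavior depends on electrical masking or on shared transistors.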
Table 2: Probability of abnormal power and output error rate over all possible input patterns for DCVSL logic gates.

DCVSL gates   Percentage of abnormal power      Percentage of output error
              over all input patterns           over all input patterns
Inverter      50.00%                            100%
XOR2          41.66%                            100%
XOR3          33.92%                            100%
AND2          25.00%                            25.00%
AND3          12.50%                            12.50%
OR2           25.00%                            50.00%
OR3           12.50%                            25.00%
OAI21         23.21%                            28.57%
AOI21         26.78%                            50.00%
AOI22         25.89%                            45.08%
OA22          25.89%                            35.27%
MUX21         30.35%                            55.00%
Average       27.72%                            52.00%

Table 3: Number of transistors for DUTs and HTs in this work.

Circuit             CMOS 64-bit adder    HT-1    HT-2     HT-3
Transistor number   2560                 8       28       100
Circuit             DCVSL 64-bit adder   C432    C1908    C3540
Transistor number   1644                 2070    5516     9874
Circuit             S526                 S832    S1196    S1488
Transistor number   1682                 1408    3056     2824

Table 4: Power consumption for two 64-bit full adders and HT insertions.

Unit under test                          Dynamic power (mW)   Leakage power (nW)
CMOS-based 64-bit full adder    Adder    24.6                 65.78
                                HT-1     0.444                0.419
                                HT-2     0.942                1.243
                                HT-3     1.028                2.583
DCVSL-based 64-bit full adder   Adder    8.002                47.50
                                HT-1     0.566                0.328
                                HT-2     0.892                0.993
                                HT-3     1.544                2.465
output error occurrence for all input patterns. Table 2 shows
the ratio of the total number of abnormal power peaks over
the total number of all input patterns for various basic DCVSL design. The fastest switching period for input is 1 𝜇s. We
gates. The average probability for power exception and output synthesized the Verilog codes of ISCAS benchmark circuits in
mismatch are 27.7% and 52%, respectively. This means our HT Synopsys Design Compiler with IBM CMOS7RF technology.
detection method has over 50% chance to detect HTs, even The synthesized netlist is modified with an in-house python-
if the HT trigger circuit is implemented with a single gate. based netlist generator, which converts CMOS netlist to
This is a significant advantage over other power-based side- DCVSL netlist. The behavior model of CMOS library is mod-
channel analysis methods, which have a lower bound on the ified according to the gate output and power performance
size of detectable HTs. obtained from simulation in Cadence Virtuoso.
Moreover, we observe that abnormal power occurs more HT detection rate is evaluated through gate-level simu-
often on the input pattern that produces the rare output lation in Cadence NCVerilog. To observe the accumulated
value. For example, an AND3 gate produces high output only HT-induced effects through the system, we inserted the HTs
when all three inputs are high; the abnormal power appears payload on the inputs of DUTs. We particularly did so to
at the exact input pattern if one of the inputs is not in the model the propagation of HT effect in a large-scale system. To
complementary form. To hide a HT, hackers often utilize compare the area and power consumption of DUT and HTs,
the rare case to trigger the HT. As discussed above, our we designed three HTs. HT-1 is OR3 trigger circuit with XOR2
approach inherently achieves a higher detection rate for the payload. HT-2 is OR(XOR(AND(x,y),z),w) trigger circuit
HT triggered by rare cases. This means a system equipped by with XOR2 payload. HT-3 is AND4 plus modulo-8 counter
our method will pose a greater challenge to attackers in order trigger circuit with XOR2 payload. The complexity of the
to conceal HTs. DUTs and HTs in this work is listed in Table 3. As can be seen,
the HTs are significantly smaller than the target design.
3. Experimental Results
3.2. Case Study on a 64-Bit Full Adder. We implemented
3.1. Experimental Setup. We evaluated the proposed method a 64-bit full adder using CMOS and DCVSL in Cadence
on the 64-bit ripple carry adder, ISCAS’85 and ISCAS’89 Virtuoso. The layout area for these two adders is shown
benchmark circuits. The schematic and layout of the 64- in Table 3. Because less PMOS transistors are needed in
bit adder were implemented in Cadence Virtuoso with the DCVSL, the area of DCVSL-based full adder is less than that
IBM CMOS7RF technology. We set all transistor lengths of CMOS full adder when optimization is applied on both
to 220 nm (minimum length in the CMOS7RF technology) implementations. HTs are rarely triggered and the leakage
and set the PMOS and NMOS transistor widths to 500 nm power for HTs is a few orders of magnitude less than the adder
and 600 nm, respectively. The average power, leakage power, switching power, as shown in Table 4.
and peak dynamic power were obtained from schematic- All possible input patterns were applied to the 64-bit
level simulations by examining all possible input patterns. ripple carry adder. We placed a HT circuit to alter one
The area for DCVSL modules was obtained from customized complementary input pin in the adder. The power over
layout in Virtuoso. Five metal layers were used in layout time waveform is shown in Figure 5. As can be seen in
6 VLSI Design
8.0 7.5
6.4 6.0
Power (mW)
Power (mW)
4.9 4.5
3.2 3.0
1.6 1.5
0.0 0.0
0.0 10.0 20.0 30.0 40.0 50.0 60.0 0.0 10.0 20.0 30.0 40.0 50.0 60.0
Time (𝜇s) Time (𝜇s)
(a) (b)
7.5
6.0
Power (mW)
4.5
3.0
1.5
0.0
0.0 10.0 20.0 30.0 40.0 50.0 60.0
Time (𝜇s)
(c)
Figure 5: Power consumption for a 64-bit DCVSL full adder. (a) No HT, (b) HT on the 49th 1-bit full adder carry in port, and (c) HT on the
2nd 1-bit full adder carry in port.
Figure 5(a), when no HT is triggered, the switching power has than CMOS. However, when the HT is triggered to change
instantaneous peaks whereas the leakage power remains flat the noncomplementary inputs for the DCVSL-based full
(close to zero). Figure 5(b) shows the power for the adder with adder, the increased short-circuit power results in a dramatic
one HT inserted at the 49th 1-bit full adder. As can be seen, the increase on the average power. Figure 6 also shows that the
power has an extra periodical increase, which is noticeably average power difference between original and HT affected
higher than the leakage power. This is the short-circuit power version is over 50X. If the HT is inserted at the early stage in
(discussed in Section 2.2) induced by the noncomplementary the functional block, the average power difference increases
inputs from HT insertion. We placed the HT payload circuit to over two orders of magnitude. This is favorable for power-
to the 2nd 1-bit full adder and observed different power based side-channel analysis HT detection methods.
behavior. As shown in Figure 5(c), the increased short- To assess the HT detection rate, we assume that HTs
circuit power appears in almost all input patterns. This is are inserted to change the complementary inputs. As input
because the 2nd 1-bit full adder with noncomplementary vectors 𝐴 and 𝐵 for a 64-bit full adder are equivalent, we
inputs yields noncomplementary outputs, and those outputs select 64-bit input 𝐴 to receive the potential impact from HTs.
are further propagated to other 1-bit full adders. Because of Besides half of the inputs, 𝐴, the carry-in bit for the first 1-
the propagation of HT effects, the power consumption is bit full adder is another potential location for HT insertion.
exceptionally higher than that in normal cases. As the proposed method is independent of the particular HT
CMOS circuits have more PMOS transistors than the trigger circuit, we flipped one of the complementary inputs to
DCVSL version. Consequently, the dynamic power consump- model the effect of HT insertion. As shown in Figure 7, for the
tion of CMOS is higher than that of DCVSL. As shown HTs on 𝐴, the HT detection rate reaches 1. Given a HT area
in Figure 6, DCVSL has less average power consumption over chip area ratio below 1%, the HT detection rate is higher
Figure 6: Impact of HT location on average power of 64-bit DCVSL adder. (Average power in mW on a logarithmic axis for the DCVSL and CMOS adders: no HT, HT on 49th 1-bit FA, HT on 33rd 1-bit FA, and HT on 2nd 1-bit FA.)
Figure 7 (plot): HT detection rate versus HT location (Cin, A4, A9, ..., A59).
Figure 8: Impact of HT insertion location on HT detection rate. (HT detection rate versus HT insertion on different internal gates.)

difference is high enough for use in HT detection. As shown in Figure 9(b), the HT inserted in the early 1-bit full adder stage yields an abnormal energy that is up to three orders of magnitude higher than normal leakage energy. HT insertion

3.3. Evaluation on Benchmark Circuits. The proposed method is further evaluated with ISCAS benchmark circuits,

Figure 9: Results for HT-induced abnormal power assessment. (a) Average number of gates experiencing high short-circuit power per HT inserted case. (b) Abnormal energy caused by HT insertion over regular leakage energy. (c) Average power for three different HT injection locations. (Panel (a) plots the number of gates having abnormal power against HT injection locations Cin, A4, A9, ..., A59; panel (b) is plotted against time in μs for the HT on the 1st, 32nd, and 48th 1-bit adder; panel (c) plots average power in W without HT and with the HT on the 1st, 32nd, and 48th 1-bit FA.)
Figure 10: HT detection rate in c432. (Hardware Trojan detection rate versus noncomplementary input location.)
Figure 11: HT detection rate in c1908. (Hardware Trojan detection rate versus noncomplementary input location.)
HT insertions; this feature has potential to be used in power-
based HT detection.
Detecting the noncomplementary final output of DUT helps to improve the HT detection rate. As shown in Figure 16, not all test cases have abnormal power behavior. We collected the number of cases that have noncomplementary outputs (i.e., output error) and observed that the cases of noncomplementary DUT final output can achieve a HT detection rate of 1. This outstanding performance depends on circuit topology and the employed logic gates. Sometimes, the output error occurs at the same moment when abnormal short-circuit power is observed.
Sequential circuits are more likely to be affected by HT effect propagation, as latches and flip-flops have a higher

Figure 15: Average number of gates with abnormal power per each noncomplementary input pair. (Bars for c432, c1908, and c3540.)
Figure 16: HT detection rate improvement by comparing complementary outputs in c432 circuit. (Number of detected cases, ×10^3, for output error and abnormal power.)

4. Conclusion

Hardware Trojans (HTs) challenge the chip security because of the increasing number of chips being fabricated, assembled, and packaged offshore. To enforce the confidence of chip security, efficient HT detection is imperative. HT detection can be performed during chip testing stage, although it
Research Article
On-Chip Power Minimization Using Serialization-Widening
with Frequent Value Encoding
Copyright © 2014 Khader Mohammad et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
In chip-multiprocessors (CMP) architecture, the L2 cache is shared by the L1 cache of each processor core, resulting in a high
volume of diverse data transfer through the L1-L2 cache bus. High-performance CMP and SoC systems have a significant amount
of data transfer between the on-chip L2 cache and the L3 cache of off-chip memory through the power expensive off-chip memory
bus. This paper addresses the problem of the high power consumption of the on-chip data buses, exploring a framework for
minimizing memory data bus power consumption. A comprehensive analysis of the existing bus power minimization
approaches is provided based on performance, power, and area overhead considerations. A novel approach for reducing the
power consumption for the on-chip bus is introduced. In particular, a serialization-widening (SW) of data bus with frequent value
encoding (FVE), called the SWE approach, is proposed as the best power savings approach for the on-chip cache data bus. The
experimental results show that the SWE approach with FVE can achieve approximately 54% power savings over the conventional
bus for multicore applications using a 64-bit wide data bus in 45 nm technology.
integer multiple of 2. The throughput of a bus serialized by a factor of two is halved. To prevent a reduction in the throughput, the bus frequency can be doubled. This requires the increasing of the wire widths to support higher switching speeds. The advantage of serialization is that the bus occupies less area than a conventional bus. Serialization on its own may not necessarily reduce the switching activity and thus the energy consumption of a bus (see Table 1). Loghi et al. [5] examined the use of bus serialization combined with data encoding for power minimization. In this case, the bus area was smaller, but the throughput of the bus was halved (since the frequency remained the same).
In a deep submicron technology, the switching energy consumed due to coupling capacitance is dominant [16, 17, 24–26]. The disadvantage of bus widening is that the bus occupies more area than a conventional bus. Hatta et al. [2] looked at combining bus serialization with bus widening in order to reduce bus power without increasing the bus area. In that study, the bus frequency was increased to keep the throughput constant. Although this required increasing the width of the wires, the extra spacing between the wires allowed this to be accommodated without a bus area overhead. Hatta et al. [2] also looked at combining a serialized-widened bus with differential data encoding and found that it helped on the address bus but not on the data bus.
In a serialized-widened bus, the operating frequency can be increased to keep the throughput the same as in a conventional bus. In this case, the serialized frequency is given by $f_S = S \cdot f_C$, where $S$ is the serialization factor and $f_C$ is the frequency of the conventional bus. In order to implement bus serialization at a higher frequency, a serializer and deserializer are required at the sending and receiving ends of the bus, respectively (as shown in Figure 2).
Figure 3 shows the structure of data lines of a conventional bus and those of a serialized-widened bus. The relationship of the wire width and spacing between the wires of a conventional bus and a serialized-widened bus is
$W_S + D_S = (W_C + D_C) \cdot S$, (5)
where $W_C$ is the wire width of the conventional bus, $D_C$ is wire spacing between the lines in the conventional bus, $W_S$ is the wire width of a serialized-widened bus, $D_S$ is wire spacing between the lines in the serialized-widened bus, and $S$ is the serialization factor.
Figure 2: Basic structure and position of serializer and deserializer.
Figure 3: Basic structure of conventional and serialized bus lines. (Panels: conventional bus with wire width WC and spacing DC; serialized-wider bus with wire width WS and spacing DS, serialization = 2.)
Table 1:
(a) 16-bit data stream → 0011 0011 0011 0011
Passing through 8-bit bus
Data sequence                  Signal    Coupling
0000 0000                      -         -
0011 0011                      4         3
0011 0011                      0         0
Total number of transitions    4         3
Passing through 4-bit bus
Data sequence                  Signal    Coupling
0000                           -         -
0011                           2         1
0011                           0         0
0011                           0         0
0011                           0         0
Total number of transitions    2         1
(b) 16-bit data stream → 0011 1100 0011 1100
Passing through 8-bit bus
Data sequence                  Signal    Coupling
0000 0000                      -         -
0011 1100                      4         2
0011 1100                      0         0
Total number of transitions    4         2
Passing through 4-bit bus
Data sequence                  Signal    Coupling
0000                           -         -
0011                           2         1
1100                           2         2
0011                           2         2
1100                           2         2
Total number of transitions    8         7
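The transition counts in Table 1 can be reproduced with a short script. The following is a minimal sketch, not the authors' simulator. It assumes the bus idles at all zeros, counts one signal transition for each wire that rises from 0 to 1, and counts the coupling transitions of each adjacent wire pair as the absolute difference of the two wires' changes (so two neighbours switching in opposite directions count twice). This counting convention is an assumption chosen because it matches the totals in Table 1; the function names are hypothetical.

```python
def split_words(stream: str, width: int) -> list[str]:
    """Cut a bit string (spaces ignored) into bus words of `width` bits."""
    bits = stream.replace(" ", "")
    return [bits[i:i + width] for i in range(0, len(bits), width)]


def count_transitions(stream: str, width: int) -> tuple[int, int]:
    """Return (signal, coupling) transition totals for `stream` on a bus.

    Assumed convention (not stated in the paper, chosen to match Table 1):
    signal   = number of wires rising 0 -> 1,
    coupling = sum over adjacent wire pairs of |delta_i - delta_j|,
               where delta is +1 (rise), -1 (fall), or 0 (no change).
    """
    prev = "0" * width  # assumption: the bus starts from all zeros
    signal = 0
    coupling = 0
    for word in split_words(stream, width):
        delta = [int(new) - int(old) for old, new in zip(prev, word)]
        signal += sum(1 for d in delta if d == 1)
        coupling += sum(abs(a - b) for a, b in zip(delta, delta[1:]))
        prev = word
    return signal, coupling


stream_a = "0011 0011 0011 0011"
stream_b = "0011 1100 0011 1100"
for name, stream in (("a", stream_a), ("b", stream_b)):
    for width in (8, 4):
        print(name, width, count_transitions(stream, width))
```

Under these assumptions, stream (a) gives (4, 3) on the 8-bit bus and (2, 1) on the 4-bit bus, while stream (b) gives (4, 2) and (8, 7). The same serialization step lowers the activity for one stream and raises it for the other, which is why serialization on its own is not a reliable power reduction.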
Table 3: Architectural configuration of the simulator used in the experiment.
Table 4: Benchmarks, types, and number of warm up instructions used in the experiment.
still less than the serialized bus, $C_S$ (since the wires are more spaced out than a serialized bus).
The relation between the switching activities is highly dependent on the data values passed on the bus. Therefore a strict relation between the switching activities cannot be shown. However, in general it can be expected that an encoded bus will have less switching than a conventional bus (hence $\alpha_E < \alpha_C$). In addition, the serialized-encoded bus (SE) will also likely have a lower switching activity than a conventional bus (hence $\alpha_{SE} < \alpha_C$). The relation between the switching activity of a serialized bus ($\alpha_S$) and a conventional bus ($\alpha_C$) is hard to predict.
This paper proposes data bus power reduction techniques for the SWE approach. This work compares these approaches with existing power reduction methods that fall under the different categories in Table 2. This work finds that the SWE approach works best since this method reduces both the wire capacitance and the switching activity significantly.
4. Experimental Setup
This section discusses the target system of the experiment and the memory structure used to collect the memory traces. The first subsection describes the architecture of sim-outorder, the superscalar simulator from the Simplescalar tool suite [48]. The following subsection discusses the benchmark suites and the input sets that are used in this paper. The last part of this section presents the switching activity computation methodology.
4.1. Simulator. This experiment uses a modified version of Simplescalar 3.0d’s sim-outorder simulator [48] to collect our cache request traces. The model architecture has a mid-range configuration. Table 3 summarizes the architectural configuration of our simulator. The baseline configuration parameters are typical of a modern chip multiprocessor and out-of-order simulator. This work keeps the L1 cache size small to generate more memory accesses, which gives a more accurate picture of the behavior of memory access and the memory bus. This work develops another simulator, written in C, to calculate the switching activity for the bus power estimation.
4.2. Benchmark Suites. This experiment uses 6 integer and 3 floating point benchmarks from the SPEC2000 suite [49] and 3 benchmarks from the MediaBench suite [47]. This selection is motivated by finding some memory intensive programs (mcf, art, gcc, gzip, and twolf) [3] and some memory nonintensive programs. The simulation uses the reference inputs of the SPEC2000 suite because the test and training inputs have smaller data sets. For each benchmark of the SPEC2000 suite, this work divides the total run length into 5 portions and warms up for the first 3 portions, up to a maximum of 2 billion instructions, using fast-forward mode cycle-level simulation. A 200 million instruction window is then simulated using the detailed simulator. For the MediaBench suite, this work simulates the whole program to generate the required traces without any fast forwarding. Table 4 lists the reference inputs that are chosen from the SPEC2000 benchmark and MediaBench suite and the number of instructions for which the simulator is warmed up. Among these benchmarks, a group of benchmarks are selected to run in multicore processor units as in Table 4. This selection gives priority to grouping the memory intensive programs, rather than the memory nonintensive programs, to get more accurate behavior of memory access. Table 5 summarizes the list of benchmarks used for 8-, 4-, and 2-core processing units.
4.3. Switching Activity Computation. A power simulator written in C is integrated with the modified Simplescalar sim-outorder simulator [48] to calculate the switching activity of the data transitions between the L1 and L2 caches through the L1-L2 cache bus. The simulator can calculate the switching activity for all six different kinds of encoding techniques listed in Table 6.
During serialization-widening, the simulator uses two sets of value cache (VC) for LSB and MSB data matching instead of using one unified VC. Figure 5 shows the different
Table 6 (encoding techniques and the short names used in the figure legends):
Encoding technique                             Short name
Bus-invert coding                              bi
Transition signaling                           xor
Frequent value encoding with one hot-code      fv
Frequent value encoding with two hot-code      fv2
TUBE with one hot-code                         tube
TUBE with two hot-code                         tube2
Figure 6: Comparison of the % of power savings using the different data bus power reduction approaches. Results are compared to a conventional 64-bit L1-L2 cache data bus at 45 nm technology. (Series: S, W, E, SW, SE, WE, SWE; groups: 8 core, 4 core, 2 core, and average.)
structures of two sets of VC with serialization. The data bus size is varied frequently to compare the effectiveness of different possible approaches and encoding techniques keeping the total amount of data the same. For example, if a data stream of 64-bit wide requires 1 transition using 64-bit wide data bus, it requires 8 transitions using 8-bit wide data bus.
5. Results and Analysis
This section presents the experimental results. It has a general comparison of the cache bus power minimization using the seven possible approaches listed in Table 2. It further examines in detail three of the approaches that do not change the bus area and finds that the SWE approach performs the best. It also presents an in-depth analysis of the SWE approach performance under various architecture and technology configurations. The end of this section discusses the performance, power, and area overhead for the proposed technique.
5.1. Power Savings for Different Possible Approaches. The seven possible bus power savings approaches listed in Table 2 earlier are different combinations of serialization (S), bus widening (W), and encoding (E). Figure 6 shows the power savings on the L1-L2 cache data bus for the different architecture-benchmark combinations listed in Table 5 using these approaches. A 64-bit data bus implemented on 45 nm technology is assumed. The techniques reduce bus power by minimizing bus switching activity, bus wire capacitance, or both.
When the three approaches for power reduction are applied on their own, bus widening performs the best. The serialization (S) approach performs poorly for most of the architecture configurations listed in Figure 6 (the bus power is generally increased). This is primarily due to the fact that serialization generally increases switching activity. The bus capacitance is actually reduced partially since the wires are spaced out further to allow the frequency to be doubled. However, this reduction in capacitance is not enough to offset the increased switching activity. The widening (W) approach performs very well since it reduces the bus wire capacitance significantly. The disadvantage of the approach is that it almost doubles the bus area. There are six different encoding techniques (E) that are tested (see Table 6). Figure 6 shows the result from the best encoding technique for each architecture configuration. Encoding reduces switching activity without affecting the bus capacitance and thereby reduces the bus power. This approach does not change the bus area or frequency.
When using combinations of the three approaches, the serialized-widened-encoded (SWE) method performs the best. The serialized-widened (SW) approach reduces the bus capacitance by widening the wire spacing, but generally increases the switching activity through serialization. The net
Figure 7: (a) % of power savings achieved and (b) absolute power normalized to 64-bit conventional bus power using bus serialization for 64-, 32-, 16-bit wide bus for different serialization factors. The figure legend indicates the first number as bus width, SW as serialization-widening, and the last number as the serialization factor. (Series: 64 SW 2, 64 SW 4, 64 SW 8, 32 SW 2, 32 SW 4, 16 SW 2; groups: 8 core, 4 core, 2 core, and average.)
Figure 8: % of capacitance reduction using serialized-widened data bus for different serialization factors in 45 nm technology. (Horizontal axis: serialization factors 2, 4, 8; vertical axis: capacitance reduction (%).)
Figure 9: % of power savings for using different encoding techniques for 64-bit wide data bus for different number processing cores with several benchmark combinations. (Series: bi, xor, fv, fv2, tube, tube2; vertical axis: power savings (%).)
result of these two opposing effects is generally a decrease in the power consumption (although there are cases where power is actually increased). This is the approach proposed by Hatta et al. [2] for both the address and data buses. The serialized-encoded (SE) method reduces the bus power mainly through a reduction in switching activity. There is also a slight reduction in capacitance due to the serialization. The widened-encoded (WE) approach reduces the power by minimizing both the switching activity and bus capacitance. It has the disadvantage, however, of increasing the bus area. Finally, the serialized-widened-encoded (SWE) approach produces the best results for the architectures in Figure 6 by minimizing the bus capacitance and switching activity while keeping the bus area constant.
The rest of this section considers primarily the SW, E, and SWE approaches as these do not change the bus area. Unless explicitly stated, a 45 nm technology implementation is assumed.
5.2. Serialization-Widening (SW). Figure 7(a) shows the power savings of using a serialized-widened bus (as proposed by Hatta et al. [2]) for different bus widths and serialization factors. The results show that the SW approach performs well for narrow buses. Figure 7(b) shows the absolute power consumption of the SW approach with different architectural configurations normalized with a 64-bit wide conventional data bus. For a given bus width, the average power consumption varies little across serialization factors.
Figure 8 shows the percentage of capacitance reduction using the serialization-widening data bus approach for different serialization factors. The figure shows that a serialization factor of 4 or 8 does not provide a significant reduction of capacitance over a serialization factor of 2.
5.3. Encoding (E). Figure 9 compares the power savings from the different encoding schemes presented in Table 6 for a 64-bit L1-L2 cache data bus. Table 7 shows the power savings of the encoding techniques for various cache bus widths. For
Table 7: % of power savings for different bus widths and encoding techniques.
Table 8: Hit rate and number of hit in one or two transition cache locations using FV and FV2 techniques for 8-core dataset 1.
Table 9: % of power savings for different bus widths and encoding techniques.
Figure 10: % of power savings using SWE approach for different encoding techniques for 64-bit wide data bus. (Series: bi SW, xor SW, fv SW, fv2 SW, tube SW, tube2 SW; groups: 8 core, 4 core, 2 core, and average.)
Table 10: Variation of value cache table size with encoding techniques.
Encoding technique   Table size: data size   Table size: number of entries   Max. possible switching activity
FV                   32-bit                  32                              1
FV2                  32-bit                  528                             2
TUBE                 32-bit                  5                               1 or 2
                     24-bit                  5
                     16-bit                  5
                     8-bit                   8
                     16-bit                  16
TUBE2                32-bit                  30                              2 or 4
                     24-bit                  30
                     16-bit                  30
                     8-bit                   68
                     16-bit                  68
approaches with two hot-codes (FV2 and TUBE2) perform the best. This is mainly because the wide bus allows for a large number of entries in these encoding caches. With a 16-bit data bus width, a frequent value cache using one hot-code performs better. This is because the larger cache size of FV2 compared with FV increases the hit rate, but a large number of the hits fall in locations that require a switching activity of two instead of one. Table 8 lists the hit rate and the number of hits in one- or two-transition cache locations for FV2, and the number of hits in one-transition cache locations for FV, when simulating the 8-core set 1 application. The table shows that FV2 performs poorly because a large share of the data matches hit in two-transition cache locations. One improvement is to map the most frequent data values to the cache locations with the smaller number of transitions. This type of encoding technique is proposed by Suresh et al. [7]. It can be implemented easily in advance, since their proposed context independent codes work for the known datasets of embedded processing systems, but it requires a very complex hardware design to arrange the data in real time. For the 8-bit cache bus width, none of the cache-based approaches work well as their hit rates are low (since values get replaced too often). In this case bus-invert has the best performance.
5.4. Serialization-Widening with Encoding (SWE). Figure 10 compares the power savings from the different encoding schemes presented in Table 6 using the serialized-widened-encoded (SWE) scheme for a 64-bit L1-L2 cache data bus. Table 9 shows the power savings of the encoding techniques for various cache bus widths and a serialization factor of 2. For the 64-bit and 32-bit wide buses, the frequent value approach (FV) performs the best. This is mainly because the wide bus allows for a large number of entries with a higher number of switching activity (as in the example given in Table 8) in these encoding caches. With a 16-bit data bus width, bus-invert performs better. This is because we end up with an 8-bit bus after serialization, and the cache hit rates become too low for this configuration.
5.5. Power Savings under Different Architecture Options. Figure 11 presents the percentage of power savings for the SWE approach using frequent value encoding (FVE) and the
Figure 11: Comparison of % of power savings between different serialization factors with different cache bus width. (Surviving panel: (c) 2-core set 1; vertical axis: power savings (%); groups: S2, S4, S8 for 64 bit, S2, S4 for 32 bit, S2 for 16 bit; series: FV SW, best E SW.)
Figure 12: Absolute power consumption for 64-bit, 32-bit, and 16-bit bus, with encoding (E), serialization-widening (SW), and serialization-widening with encoding (SWE) normalized to 64-bit bus width for 32 KB L1 cache. (Series: 64 C, 32 C, 16 C, 64 E, 32 E, 16 E, 64 SW, 32 SW, 16 SW, 64 FV SW, 32 FV SW, 16 bi SW; groups: 8 core, 4 core, 2 core.)
best encoding for different bus widths and serialization factors. The amount of power savings achieved by this approach depends on several factors. These factors include cache data
technique works well for a wide data bus, but performs poorly for a narrow bus.
Cache bus power consumption varies with bus width, application sets, and different approaches (E, SW or SWE). Figure 12 is a comparative view of cache bus power for a 32 KB L1 cache with 64-/32-/16-bit bus size. The graph shows that a 32-bit wide bus consumes more power than a 64-bit wide bus for most of the application sets used in this experiment. A 16-bit wide bus consumes about the same or sometimes more power than a 32-bit wide data bus. The encoding (E) approach consumes almost the same amount of power for 64-/32-bit wide data buses. This indicates that the power consumption of the E approach is independent of the bus size. A 16-bit data bus requires slightly more power than either a 64-bit or 32-bit wide data bus using the E approach. The SW approach gives a similar result for the 64-bit and 32-bit data buses, but a 16-bit data bus requires considerably less power than a 64-bit or 32-bit bus using the SW approach. Using the SWE
Figure 13: Comparison of (a) % of absolute power savings using different L1 cache sizes for a 64-bit wide data bus using serialization-widening with frequent value encoding and (b) % of relative power savings using a 64-bit wide bus compared to a 32-bit wide bus (both buses use serialization-widening with frequent value encoding). (Series: 64 KB, 32 KB, 16 KB; groups: 8 core, 4 core, 2 core, and average.)
Figure 15: (a) % of power savings using different technologies for a 64-bit data bus experimenting on application set 1 in 8 processing cores, (b) absolute power consumption of the same set (8 core set 1) for different technologies (power consumption values are normalized to 70 nm technology). (Horizontal axis: technology node of 70, 45, 35, 25, and 20 nm; vertical axis of (a): power savings (%).)
reduction for different technology as shown in Figure 14. Figure 15(a) presents a comparison of the power savings using encoding, serialization-widening, and serialization-widening with FVE. The results shown in Figure 15 use a 64-bit wide data bus for application set 1 in 8 processing cores. The power savings are similar across the different technologies, but the absolute power consumption decreases as the technology shrinks, as shown in Figure 15(b). This is because the swing voltage reduces with shrinking the technology [27]. Although shrinking the technology increases the capacitance (capacitance, $C = \varepsilon \cdot A/d$), serialization-widening gives us the advantage of using extra space between the wires, which reduces the overall capacitance compared to the conventional bus and finally reduces the total power budget. Using this advantage, the proposed approach improves the power savings significantly.
5.9. Widened-Encoded Data Bus of 32 Bit Wide. A widened-encoded (WE) 32-bit wide data bus requires the same area as the SWE approach with a 64-bit wide data bus. The results of Figure 6 show that the WE approach comes very close to the SWE approach in power savings, but the 64-bit wide WE data bus requires double the area. This motivates us to compare the power consumption of the WE approach with a 32-bit data bus against the SWE approach with a 64-bit wide data bus. Figure 17(a) gives the absolute power consumption normalized with the 64-bit wide conventional data bus. The results show that these two approaches consume almost the same amount of power. The benefit of the WE approach is that it does not require a higher operating frequency. However, it pays a performance loss in terms of IPC for using a narrower data bus. The measured performance loss is shown in Figure 17(b).
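The area argument above can be checked with simple arithmetic. The following is a minimal sketch, not the authors' model: it treats bus area per unit length as the number of wires times the wire pitch (width plus spacing), takes the serialized-widened pitch from equation (5), and assumes that widening by itself doubles the pitch (the text only says that widening "almost doubles" the area). The wire dimensions and all function names are hypothetical.

```python
def serialized_frequency(f_c: float, s: int) -> float:
    """f_S = S * f_C keeps the throughput equal to the conventional bus."""
    return s * f_c


def serialized_pitch(w_c: float, d_c: float, s: int) -> float:
    """Equation (5): W_S + D_S = (W_C + D_C) * S."""
    return (w_c + d_c) * s


def bus_area(num_wires: int, pitch: float) -> float:
    """Relative bus area per unit length, modeled as wires times pitch."""
    return num_wires * pitch


W_C, D_C = 1.0, 1.0          # hypothetical wire width and spacing, arbitrary units
S = 2                        # serialization factor
WIDENED_PITCH = 2 * (W_C + D_C)  # assumption: widening alone doubles the pitch

conventional_64 = bus_area(64, W_C + D_C)
swe_64 = bus_area(64 // S, serialized_pitch(W_C, D_C, S))
we_32 = bus_area(32, WIDENED_PITCH)
we_64 = bus_area(64, WIDENED_PITCH)

print("conventional 64-bit:", conventional_64)
print("SWE 64-bit, S=2:   ", swe_64)
print("WE 32-bit:         ", we_32)
print("WE 64-bit:         ", we_64)
print("f_S for f_C = 1.0: ", serialized_frequency(1.0, S))
```

With these assumed numbers, the 64-bit SWE bus and the 32-bit WE bus occupy the same area as the 64-bit conventional bus, while the 64-bit WE bus occupies twice as much. The SWE bus pays for this with a doubled operating frequency, and the 32-bit WE bus pays with half the data width.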
Figure 17: (a) Comparison of absolute power consumption of SWE approach (64-bit wide data bus) and WE approach (32-bit wide data bus) and (b) % of performance loss of using 32-bit wide data bus instead of 64-bit wide data bus. (Series: 64 SWE, 32 WE; groups: 8 core, 4 core, 2 core, and average.)
Figure 19: (a) % of performance degradation in terms of instructions per cycle (IPC) for using a 64-bit serialized bus with encoding with a 2-cycle or 1-cycle performance penalty instead of using a conventional bus and (b) % of power savings. (Series: 2 cycles, 1 cycle; groups: 8 core, 4 core, 2 core, and average.)
work in this paper primarily involved the software simulation of the proposed techniques for bus power minimization considering performance overhead. As for future work, we will continue to evaluate the proposed approach at smaller process nodes (14 and 10 nm), especially for reliability with new processes.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
References
[1] M. R. Stan and K. Skadron, “Power-aware computing,” IEEE Computer, vol. 36, no. 12, pp. 35–38, 2003.
[2] N. Hatta, N. D. Barli, C. Iwama et al., “Bus serialization for reducing power consumption,” Proceedings of SWoPP, 2004.
[3] B. Jacob and V. Cuppu, “Organizational design trade-offs at the DRAM, memory bus and memory controller level: initial results,” Tech. Rep. UMD-SCA-TR-1999-2, University of Maryland Systems & Computer Architecture Group, 1999.
[4] Rambus Inc, Rambus Signaling Technologies: RSL, QRSL and SerDes Technology Overview, Rambus Inc, 2000.
[5] M. Loghi, M. Poncino, and L. Benini, “Cycle-accurate power analysis for multiprocessor systems-on-a-chip,” in Proceedings of the ACM Great Lakes Symposium on VLSI, pp. 401–406, April 2004.
[6] K. Mohanram and S. Rixner, “Context-independent codes for off-chip interconnects,” in Power-Aware Computer Systems, vol. 3471 of Lecture Notes in Computer Science, pp. 107–119, 2005.
[7] D. C. Suresh, B. Agrawal, W. A. Najjar, and J. Yang, “VALVE: variable Length Value Encoder for off-chip data buses,” in Proceedings of the International Conference on Computer Design (ICCD ’05), pp. 631–633, San Jose, Calif, USA, October 2005.
[8] M. R. Stan and W. P. Burleson, “Coding a terminated bus for low power,” in Proceedings of the 5th Great Lakes Symposium on VLSI, pp. 70–73, March 1995.
[9] K. Basu, A. Choudhury, J. Pisharath, and M. Kandemir, “Power protocol: reducing power dissipation on off-chip data buses,” in Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, pp. 345–355, 2002.
[10] N. R. Mahapatra, J. Liu, K. Sundaresan, S. Dangeti, and B. V. Venkatrao, “A limit study on the potential of compression for improving memory system performance, power consumption, and cost,” Journal of Instruction-Level Parallelism, vol. 7, pp. 1–37, 2005.
[11] A. Park and M. Farrens, “Address compression through base register caching,” in Proceedings of the Annual ACM/IEEE International Symposium on Microarchitecture, pp. 193–199, November 1990.
[12] M. Farrens and A. Park, “Dynamic base register caching: a technique for reducing address bus width,” in Proceedings of the 18th International Symposium on Computer Architecture, pp. 128–137, May 1991.
[13] D. Citron and L. Rudolph, “Creating a wider bus using caching techniques,” in Proceedings of the International Symposium on High Performance Computer Architecture, pp. 90–99, January 1995.
[14] K. Sunderasan and N. Mahapatra, “Code compression techniques for embedded systems and their effectiveness,” in Proceedings of the IEEE Computer Society Annual Symposium on VLSI, pp. 262–263, February 2003.
[15] L. Li, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, and I. Kadayif, “CCC: crossbar connected caches for reducing energy consumption of on-chip multiprocessors,” in Proceedings of the Euromicro Symposium on Digital Systems Design (DSD ’03), 2003.
[16] P. P. Sotiriadis and A. P. Chandrakasan, “A bus energy model for deep submicron technology,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 10, no. 3, pp. 341–349, 2002.
[17] P. P. Sotiriadis and A. Chandrakasan, “Low power bus coding techniques considering inter-wire capacitances,” in Proceedings of the IEEE 22nd Annual Custom Integrated Circuits Conference (CICC ’00), pp. 507–510, May 2000.
[18] J. Henkel and H. Lekatsas, “A2BC: adaptive address bus coding for low power deep sub-micron designs,” in Proceedings of the IEEE 38th Design Automation Conference, pp. 744–749, June 2001.
[19] T. Lindkvist, “Additional knowledge of bus invert coding schemes,” in Proceedings of the IEEE 5th International Workshop on System-on-Chip for Real-Time Applications (IWSOC ’05), pp. 301–303, Alberta, Canada, July 2005.
[20] T. Lindkvist, J. Löfvenberg, and O. Gustafsson, “Deep sub-micron bus invert coding,” in Proceedings of the 6th Nordic Signal Processing Symposium (NORSIG ’04), pp. 133–136, Espoo, Finland, June 2004.
[21] K.-W. Kim, K.-H. Baek, N. Shanbhag, C. L. Liu, and S.-M. Kang, “Coupling-driven signal encoding scheme for low-power interface design,” in Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pp. 318–321, San Jose, Calif, USA, 2000.
[22] S. Komatsu, M. Ikeda, and K. Asada, “Bus power encoding with coupling-driven adaptive code-book method for low power data transmission,” in Proceedings of the European Solid-State Circuits Conference, 2001.
[23] J.-H. Chern, J. Huang, L. Arledge, P.-C. Li, and P. Yang, “Multilevel metal capacitance models for CAD design synthesis systems,” Electron Device Letters, vol. 13, no. 1, pp. 32–34, 1992.
[24] K. Mohammad, A. Dodin, B. Liu, and S. Agaian, “Reduced voltage scaling in clock distribution networks,” VLSI Design, vol. 2009, Article ID 679853, 7 pages, 2009.
[25] K. Mohammad, B. Liu, and S. Agaian, “Energy efficient swing signal generation circuits for clock distribution networks systems,” in Proceedings of the IEEE International Conference on Man and Cybernetics, pp. 3495–3498, 2009.
[26] K. Mohammad, S. Agaian, and F. Hudson, “Efficient FPGA implementation of convolution,” in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, San Antonio, Tex, USA, October 2009, paper ID 3922.
[27] “International Technology Roadmap for Semiconductors,” http://www.itrs.net.
[28] H. Kawaguchi and T. Sakurai, “Delay and noise formulas for capacitively coupled distributed RC lines,” in Proceedings of the 3rd Conference of the Asia and South Pacific Design Automation (ASP-DAC ’98), pp. 35–43, February 1998.
[29] C.-L. Su, C.-Y. Tsui, and A. M. Despain, “Saving power in the control path of embedded processors,” IEEE Design and Test of Computers, vol. 11, no. 4, pp. 24–30, 1994.
[30] M. R. Stan and W. P. Burleson, “Bus-invert coding for low-power I/O,” IEEE Transactions on VLSI Systems, vol. 3, no. 1, pp. 49–58, 1995.
[31] L. Benini, G. de Micheli, E. Macii, D. Sciuto, and C. Silvano, “Asymptotic zero-transition activity encoding for address busses in low-power microprocessor-based systems,” in Proceedings of the 7th Great Lakes Symposium on VLSI, pp. 77–82, March 1997.
[32] C. Liu, A. Sivasubramaniam, and M. Kandemir, “Optimizing bus energy consumption of on-chip multiprocessors using frequent values,” in Proceedings of the 12th Euromicro Conference on Parallel, Distributed and Network-based Proceedings (PDP ’04), pp. 340–347, February 2004.
[33] J. Yang and R. Gupta, “Frequent value locality and its applications,” ACM Transactions on Embedded Computing Systems, vol. 1, no. 1, pp. 79–105, 2002.
[34] J. Yang, R. Gupta, and C. Zhang, “Frequent value encoding for low power data buses,” ACM Transactions on Design Automation of Electronic Systems, vol. 9, no. 3, pp. 354–384, 2004.
[35] D. C. Suresh, B. Agrawal, J. Yang, and W. Najjar, “A tunable bus encoder for off-chip data buses,” in Proceedings of the International Symposium on Low Power Electronics and Design, pp. 319–322, San Diego, Calif, USA, August 2005.
[36] W.-C. Cheng and M. Pedram, “Memory bus encoding for low power: a tutorial,” in Proceedings of the International Symposium on Quality Electronic Design (ISQED ’01), p. 1999, 2001.
[37] T. Lang, E. Musoll, and J. Cortadella, “Extension of the working-zone-encoding method to reduce the energy on the microprocessor data bus,” in Proceedings of the IEEE International Conference on Computer Design, pp. 414–419, October 1998.
[38] L. Benini, G. de Micheli, E. Macii, M. Poncino, and S. Quer, “System-level power optimization of special purpose applications: the beach solution,” in Proceedings of the International Symposium on Low Power Electronics and Design, pp. 24–29, Monterey, Calif, USA, August 1997.
[39] L. Benini, G. DeMicheli, E. Macii, M. Poncino, and C. Silvano, “Address bus encoding techniques for system level power optimization,” in Proceeding of the Design Automation and Test in Europe, pp. 861–866, Paris, France, February 1998.
[40] N. Chang, K. Kim, and J. Cho, “Bus encoding for low-power high-performance memory systems,” in Proceedings of the 37th Design Automation Conference (DAC ’00), pp. 800–805, June 2000.
[41] W.-C. Cheng and M. Pedram, “Power-optimal encoding for DRAM address bus,” in Proceedings of the Symposium on Low Power Electronics and Design (ISLPED ’00), pp. 250–252, July 2000.
[42] S. Ramprasad, N. R. Shanbhag, and I. N. Hajj, “A coding framework for low-power address and data busses,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 7, no. 2, pp. 212–221, 1999.
[43] E. Musoll, T. Lang, and J. Cortadella, “Exploiting the locality of memory references to reduce the address bus energy,” in Proceedings of the International Symposium on Low Power Electronics and Design, pp. 202–207, Monterey, Calif, USA, August 1997.
[44] Y. Shin, S.-I. Chae, and K. Choi, “Partial bus-invert coding for power optimization of system level bus,” in Proceedings of the International Symposium on Low Power Electronics and Design, pp. 127–129, August 1998.
[45] M. R. Stan and W. P. Burleson, “Two-dimensional codes for low power,” in Proceedings of the International Symposium on Low Power Electronics and Design, pp. 335–340, August 1996.
[46] S. Yoo and K. Choi, “Interleaving partial bus-invert coding for low power reconfiguration of FPGAs,” in Proceedings of the 6th International Conference on VLSI and CAD, pp. 549–552, 1999.
[47] C. Lee, M. Potkonjak, and W. H. Mangione-Smith, “MediaBench: a tool for evaluating and synthesizing multimedia and communications systems,” in Proceedings of the 30th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 330–335, December 1997.
[48] SimpleScalar Simulator, “SimpleScalar LLC,” http://www.simplescalar.com/.
[49] SPEC, “SPEC CPU2000 Benchmark Suite Ver 1.2,” http://www.spec.org/osg/cpu2000/.