Clock Gating Optimization with Delay-Matching

Shih-Jung Hsu and Rung-Bin Lin
Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan

Abstract: Clock gating is an effective method of reducing the power dissipation of a high-performance circuit. However, deploying gated cells makes a clock tree harder to optimize. In this paper, we propose a delay-matching approach to address this problem. Delay-matching uses gated cells whose timing characteristics are similar to those of their clock buffer (inverter) counterparts. Compared with type-matching, it attains better slew and much smaller latency with comparable clock skew and less area. The skew of a delay-matching gated tree, like that of a tree generated by type-matching, is insensitive to process and operating corner variations. In addition, delay-matching ECO of a gated tree excels at preserving the original timing characteristics of the gated tree.

Keywords: clock gating; low power design; clock tree

Figure 1. A gated clock design.

I. INTRODUCTION
Clock gating is an effective method of reducing the power dissipation of a high-performance circuit [1-14]. As shown in Fig. 1, the En1 and En2 inputs to the AND gates can be used to prevent clock signals from reaching the downstream flip-flops and hence reduce the switching activity of the flip-flops. However, deploying gated cells such as AND gates makes it harder to synthesize a low-skew gated tree. Arguing that different types of logic gates in a gated clock tree are detrimental to clock balancing based on a load-matching mechanism, the work in [14] proposes using type-matching gates to grow a clock tree based on load matching. Type-matching prescribes that all the logic gates on the same level must be of the same type: all inverters, all buffers, or all NAND, AND, NOR, or OR gates. Although the type-matching approach can result in lower clock skew, it may incur excessive clock latency and poor slew at buffers and clock sinks, because a typical standard cell library offers only a limited number of gated cell types. It may also cause a design to use more cell area.

In this paper, we propose the delay-matching concept to address some problems of type-matching. Delay-matching is achieved using gated cells whose timing characteristics are similar to those of their clock buffer (inverter) counterparts. Hence, a traditional clock tree synthesizer can be employed to obtain a delay-balanced clock tree simply by treating a gated cell as a clock buffer (inverter). The delay-matching concept can thus be easily implemented and integrated into an industrial design flow. To facilitate this, we also propose an approach to designing delay-matching cells and integrating them into a commercial standard cell library. We apply delay-matching to three occasions of clock gating.

• Optimizing a homogeneous gated tree, where the clock paths from the root to all the sinks are of the same depth. Experimental results based on UMC 90nm process technology show that delay-matching attains better clock slew and much smaller clock latency with comparable clock skew at the typical, worst, and best corners when compared with the type-matching approach. Delay-matching achieves all of these using less area than type-matching does. In addition, the skew of a delay-matching gated tree, like that of a tree generated by type-matching, is insensitive to process and operating corner variations. Moreover, the slews of a delay-matching gated tree are less sensitive to process and operating corner variations than those of a type-matching gated tree.
• Optimizing a non-homogeneous gated clock tree. Our results show that delay-matching obtains smaller latency and better slew but incurs larger area and skew.
• Performing ECO (Engineering Change Order) of a gated tree. Our results show that delay-matching ECO excels at preserving the original timing characteristics of a gated tree.

Note that delay-matching is a general concept. Its applications are not limited to the occasions presented in this work. One of its disadvantages is that a cell library must be re-engineered to include delay-matching cells.

The rest of this article is organized as follows. Section II presents some basics and problems of type-matching clock gating. Section III describes our delay-matching concept and how we design delay-matching cells. Section IV presents how delay-matching is employed to optimize a gated tree and shows some experimental results. The last section draws conclusions.

978-3-9810801-7-9/DATE11/©2011 EDAA.
II. TYPE-MATCHING CLOCK GATING
The type-matching concept is proposed in [14] to reduce clock skew in a gated clock tree. Its main motivation is to address the large clock skew caused by the different timing characteristics of gated cells and buffers. Omitting the control logic, Fig. 2 presents a gated clock tree based on the type-matching concept for the clock design in Fig. 1. As one can see, the type-matching tree is augmented with four gates T1, T2, T3, and T4. The gates on each level of the clock tree must be of the same type and the same driving capability. For example, G1, T2, T3, and T4 must be of the same type and the same driving capability. A traditional clock tree synthesizer can be employed to minimize the skew of a type-matching tree. As shown in Fig. 3, a non-type-matching tree may use different types of cells on a level. It is harder to minimize the skew of such a tree because the timing characteristics of AND gates and buffers differ. Note that if a clock tree has inverters, NAND gates are employed to grow a type-matching tree. One problem pointed out in [14] is that if a gated clock design employs both AND and OR gates on the same level, NAND gates should be used to implement both. Fig. 4 shows such an example, in which T1 in Fig. 2 is replaced by an OR gate with a control input En3. In Fig. 4, T5 and T6 implement an AND gate and T7 and T8 implement an OR gate to form a type-matching gated tree. However, [14] does not mention that doing so may increase clock latency, power consumption, and chip area.

Figure 2. Clock gating with type-matching cells.

Figure 3. Clock gating without type-matching, or clock gating with delay-matching if delays of G1, B2, B3, and B4 are the same and delays of G2 and B1 are also the same.

Figure 4. Type-matching gated tree for the clock design in Fig. 2 with T1 replaced by an OR gate with control input En3.

Type-matching has yet another problem if a typical cell library is used. A typical cell library contains gated cells only up to 5X or 6X driving capability. To achieve good slew, the fanout driven by a gated cell must be kept small. This increases the size of a clock tree considerably. Moreover, the clock tree tends to have more buffering levels, so it is less likely to have a small latency.

A. Type-Matching with High-Drive Gated Cells
As discussed above, the gated cells in a typical standard cell library have only a limited number of driving classes. Hence, in our work we consider extending a typical cell library to include large-drive gated cells. These large-drive gated cells can be used to replace large-drive clock buffers and inverters. However, they are not delay-matching. They are designed only to match the driving capabilities of clock buffers and inverters. High-drive type-matching can improve clock latency and slews at buffers and sinks. However, it still causes a design to use a large total cell area.

III. DELAY-MATCHING CLOCK GATING
A. Delay-Matching Concept
The basic idea of delay-matching is to employ a set of gated cells which have the same timing characteristics as their clock buffer (inverter) counterparts. Take the clock tree in Fig. 3 for example. Let B2, B3, and B4 be instances of the same clock buffer, say CKBUF4X, and B1 be an instance of another buffer, say CKBUF8X. Let G1 be an instance of CKAND4X and G2 be an instance of CKAND8X. If CKAND4X is a delay-matching cell for CKBUF4X and CKAND8X is a delay-matching cell for CKBUF8X, Fig. 3 gives a delay-matching gated clock tree. Clearly, if we count only the gate and buffer delays on the paths to sinks, this clock tree is delay-balanced. Its slew and latency would be similar to those of the clock tree without clock gating.

To realize a delay-matching tree, a gated cell should have the following properties.

• Taking the AND gate as an example, if there is a clock buffer with kX driving capability in a standard cell library, there will be a two-input AND gate with kX driving capability. This AND gate is called CKANDkX and its clock buffer counterpart is called CKBUFkX.
• The clock input capacitance of CKANDkX is similar to that of CKBUFkX.
• The rise time, fall time, rise delay, and fall delay of CKANDkX are similar to those of CKBUFkX.

The first property provides a variety of gated ANDs, some of them with large driving capability. The second and third properties make our gated ANDs look like clock buffers, so that load balancing, and thus delay balancing, can be achieved with ease while growing a clock tree.

Similarly, CKORkX for its CKBUFkX counterpart, and CKNANDkX and CKNORkX for their CKINVkX counterpart, should also possess these properties.
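The distinction between a type-matching tree and a delay-balanced tree can be illustrated with a small sketch. Everything below is an assumption made for illustration: the `Node` class, the tree topology (G2 driving G1 and B2, B1 driving B3 and B4), and the cell delays in picoseconds are hypothetical and are not taken from the paper. As in the text, only gate and buffer delays on the paths to sinks are counted and wire delay is ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    cell: str                 # library cell, e.g. "CKBUF4X"; "DFF" marks a sink
    delay_ps: float = 0.0     # hypothetical cell delay, not a value from the paper
    children: list = field(default_factory=list)


def levels(root):
    """Group the non-sink cells of the tree by depth below the root."""
    result, frontier = [], [root]
    while frontier:
        gates = [n for n in frontier if n.cell != "DFF"]
        if gates:
            result.append(gates)
        frontier = [c for n in frontier for c in n.children]
    return result


def is_type_matching(root):
    """True if every level uses a single cell (same type and same drive)."""
    return all(len({n.cell for n in lvl}) == 1 for lvl in levels(root))


def sink_latencies(node, acc=0.0):
    """Sum cell delays along each root-to-sink path; wire delay is ignored."""
    if node.cell == "DFF":
        return [acc]
    acc += node.delay_ps
    return [t for c in node.children for t in sink_latencies(c, acc)]


def skew(root):
    """Difference between the largest and smallest sink latency."""
    lat = sink_latencies(root)
    return max(lat) - min(lat)


def fig3_like_tree(d_g2=50.0, d_g1=40.0):
    """Assumed topology loosely modeled on Fig. 3, with made-up delays."""
    def ff(i):
        return Node(f"ff{i}", "DFF")
    g1 = Node("G1", "CKAND4X", d_g1, [ff(1)])
    b2 = Node("B2", "CKBUF4X", 40.0, [ff(2)])
    b3 = Node("B3", "CKBUF4X", 40.0, [ff(3)])
    b4 = Node("B4", "CKBUF4X", 40.0, [ff(4)])
    g2 = Node("G2", "CKAND8X", d_g2, [g1, b2])
    b1 = Node("B1", "CKBUF8X", 50.0, [b3, b4])
    return Node("clk", "SRC", 0.0, [g2, b1])


matched = fig3_like_tree()                 # gated cells mimic their buffers
mismatched = fig3_like_tree(d_g2=65.0)     # G2 slower than CKBUF8X
print(is_type_matching(matched), skew(matched), skew(mismatched))
```

With matched delays the tree mixes AND gates and buffers on a level, so it is not type-matching, yet its cell-delay skew is zero. Making G2 slower than its buffer counterpart reintroduces skew, which is the situation delay-matching cells are meant to avoid.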
B. Designing of Delay-Matching Cells
To design a delay-matching cell whose timing characteristics are similar to those of its counterpart, we must make the cell satisfy the last two properties presented above. Taking CKNAND1X in Fig. 5 as an example, we would like to make CKNAND1X similar to CKINV1X. Given that CKINV1X has a PMOS with channel width 1860 and an NMOS with channel width 545, and given their timing characteristics, we would like to determine the widths of P1, P2, N1, and N2 of CKNAND1X. Since we assume that the clock control signal arrives at input En much earlier than the clock signal, we should make P2 as small as possible to reduce cell area. In our case, the width of P2 is set to 360nm. As for the widths of P1, N1, and N2, we perform HSPICE simulation to determine their values. Our approach has the following three steps.

• For each pair of an input slew at Clk (seven slews in total) and an output load at Y (seven loads in total), HSPICE simulations are performed to determine the widths of P1, N1, and N2 such that the timing characteristics of CKNAND1X are similar to those of CKINV1X. We do this for all 49 pairs of input slews and output loads. We hence obtain 49 width values for each of P1, N1, and N2. Note that HSPICE allows us to perform such a simulation easily. This requires us to guess an initial width and a possible width range for each transistor at each simulation.
• For each transistor, we find its average, maximum, and minimum widths from the 49 data sets. We use the average width as the width of the transistor and perform timing characterization of CKNAND1X with this width. We then compute the timing discrepancy TD, the sum of the absolute timing differences between CKNAND1X and CKINV1X over all 49 pairs.
• With the initial width set equal to the average width and the possible width range defined by the minimum and maximum widths obtained from the previous iteration, we repeat the above two steps until TD cannot be further reduced. The transistor widths are then taken from the iteration with the smallest TD.

In the above procedure, we consider all 49 pairs for determining a transistor's width. However, we find that with the transistor sizes so obtained, a gated cell tends to have large timing discrepancies. Based on our observation, the transistor widths for the pairs with smaller slews and smaller loads tend to be much larger than the others. Hence, we employ only the 20 pairs formed by larger input slews and output loads to determine transistor widths. The timing data presented hereafter are obtained using the transistor widths determined in this way. Note that the above task of determining the transistor widths of a delay-matching cell can be automated easily and efficiently.

Figure 5. Transistor sizing of CKNAND1X (right) for matching CKINV1X.

How similar are our gated cells to their counterparts in terms of timing behavior? TABLE I compares the clock input capacitance of gated cells with that of their buffer (inverter) counterparts. As one can see, the input capacitance of CKNANDkX and CKANDkX is quite close to that of CKINVkX and CKBUFkX, respectively. This is a direct consequence of forcing the equivalence of timing characteristics between gated cells and their counterparts. TABLE II shows the average and maximum timing discrepancies between CKINVkX and CKNANDkX. The second to fifth columns show the discrepancies when the data of all 49 pairs are counted, and the sixth to ninth columns show the discrepancies when only the data of the 20 pairs with larger slews and loads are counted. As one can see, the timing discrepancy is on average small except for some cases. In particular, the discrepancy in rise delay (RD) between CKINVkX and CKNANDkX is large. This large discrepancy is mainly due to some data points which have small denominators in calculating discrepancies, so that the average discrepancy is inflated considerably by these data points. Similar results are observed for CKBUFkX and CKANDkX.

It is worth mentioning that our delay-matching cell CKNANDkX has timing characteristics similar to those of CKNORkX, because CKNANDkX and CKNORkX are both designed to have timing characteristics similar to those of CKINVkX. For the same reason, CKANDkX and CKORkX have similar timing characteristics. For example, the timing discrepancies between our CKAND1X and CKOR1X on the 49 pairs of slew and load are on average 0.83% for rise delay, 2.68% for rise time, 0.85% for fall delay, and 2.89% for fall time. The maximum discrepancies are 9.13%, 19.81%, 7.39%, and 14.89%, respectively.

TABLE I. INPUT CAPACITANCE OF GATED CELLS AND THEIR COUNTERPARTS (PF).

Drive  CKINV    CKNAND   CKNOR    CKBUF    CKAND    CKOR
1X     0.00391  0.00395  0.00455  0.0039   0.00418  0.00478
2X     0.00704  0.00735  0.00841  0.0039   0.00442  0.00479
3X     0.01015  0.01067  0.01234  0.00703  0.00738  0.008
4X     0.01337  0.0138   0.01625  0.00703  0.0076   0.00837
6X     0.01963  0.01988  0.0237   0.01015  0.01073  0.01211
8X     0.02588  0.02694  0.03095  0.01337  0.01392  0.01567
12X    0.03849  0.03983  0.0465   0.01962  0.02052  0.02198
16X    0.05113  0.05193  0.06348  0.02588  0.02748  0.02867
20X    0.06374  0.0668   0.07749  0.03223  0.03373  0.03641
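The three-step width-determination procedure of Section III.B can be sketched as a small loop. This is a minimal illustration under strong assumptions, not the authors' flow: the HSPICE runs are replaced by two made-up analytic delay functions (`ref_delay`, `gated_delay`), only one transistor width is sized instead of P1, N1, and N2 together, the slew and load grids are invented, and a grid search over the current range stands in for a simulator optimization started from an initial guess. The choice of the five largest slews and four largest loads to form 20 pairs is likewise an assumption; the paper does not say how the 20 pairs are formed.

```python
import itertools

# Hypothetical characterization grid: 7 slews (ns) x 7 loads (pF) = 49 pairs.
SLEWS = [0.02, 0.05, 0.1, 0.2, 0.4, 0.8, 1.6]
LOADS = [0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1]
PAIRS = list(itertools.product(SLEWS, LOADS))
# Assumed way of picking 20 pairs with larger slews and loads (5 x 4).
LARGE_PAIRS = list(itertools.product(SLEWS[2:], LOADS[3:]))


def ref_delay(slew, load):
    """Stand-in for the characterized delay of the reference cell."""
    return 0.02 + 0.3 * slew + 4.0 * load


def gated_delay(width, slew, load):
    """Stand-in for simulating the gated cell at a given width."""
    return 0.025 + 0.3 * slew + 6.0 * load / width + 0.002 * width


def fit_width(slew, load, lo, hi, steps=200):
    """Step 1: width in [lo, hi] that best matches the reference for one pair."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    target = ref_delay(slew, load)
    return min(grid, key=lambda w: abs(gated_delay(w, slew, load) - target))


def timing_discrepancy(width, pairs):
    """TD: sum of absolute timing differences over all pairs."""
    return sum(abs(gated_delay(width, s, l) - ref_delay(s, l)) for s, l in pairs)


def size_transistor(pairs, lo, hi, max_iter=20):
    """Repeat steps 1 and 2, narrowing the range, until TD stops improving."""
    best_w, best_td = None, float("inf")
    for _ in range(max_iter):
        widths = [fit_width(s, l, lo, hi) for s, l in pairs]   # step 1
        avg = sum(widths) / len(widths)                        # step 2
        td = timing_discrepancy(avg, pairs)
        if td >= best_td:                                      # step 3: no gain
            break
        best_w, best_td = avg, td
        lo, hi = min(widths), max(widths)                      # narrowed range
    return best_w, best_td


w_all, td_all = size_transistor(PAIRS, 0.5, 5.0)
w_large, td_large = size_transistor(LARGE_PAIRS, 0.5, 5.0)
print(f"49 pairs: width={w_all:.3f}, TD={td_all:.4f}")
print(f"20 pairs: width={w_large:.3f}, TD={td_large:.4f}")
```

The loop keeps the width from the iteration with the smallest TD, mirroring the stopping rule in the third step. With a real simulator, `gated_delay` would be replaced by a characterization run and the four timing quantities (rise and fall delay, rise and fall time) would all enter TD.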
TABLE II. TIMING DISCREPANCIES (AVERAGE%/MAX%) BETWEEN CKINVKX AND CKNANDKX. (RD: RISE DELAY, RT: RISE TIME, FD: FALL DELAY, FT: FALL TIME).

       49 pairs                    20 pairs
Drive  RD       RT     FD    FT    RD        RT    FD    FT
1X     6/51     6/30   3/11  4/21  4/7       3/10  2/6   4/12
2X     90/2707  5/30   3/20  4/18  4/24      4/13  2/4   3/16
3X     39/661   6/40   5/17  4/23  13/185    5/40  4/10  5/10
4X     81/2741  8/104  4/18  3/17  6/55      3/8   3/8   3/15
6X     28/339   6/42   4/17  4/19  19/271    3/11  3/6   5/19
8X     33/716   6/41   3/8   3/22  43/716    5/29  2/5   5/22
12X    42/834   7/51   3/14  3/23  8/90      5/18  3/14  5/23
16X    80/2559  7/41   4/11  3/22  136/2559  4/18  2/6   4/22
20X    42/509   5/26   3/15  3/22  15/204    4/12  2/15  4/22

Figure 6. Delay-matching with different types of gated cells.
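The very large rise-delay entries in TABLE II are attributed in the text to data points with small denominators. A short sketch makes the effect concrete. It assumes that the discrepancy is a relative error, |gated - reference| / |reference|, which the paper implies but does not state explicitly, and the delay values are invented for illustration.

```python
def discrepancies(ref, gated):
    """Per-point relative discrepancy in percent (assumed definition)."""
    return [abs(g - r) / abs(r) * 100.0 for r, g in zip(ref, gated)]


def avg_and_max(values):
    """Average and maximum, the two figures reported per cell in TABLE II."""
    return sum(values) / len(values), max(values)


# Hypothetical rise delays in ps. The absolute error is 2 ps at every point,
# but the first reference delay is close to zero.
ref_rd = [0.5, 20.0, 40.0, 80.0]
gated_rd = [2.5, 22.0, 42.0, 82.0]

values = discrepancies(ref_rd, gated_rd)
avg_all, max_all = avg_and_max(values)
avg_rest, max_rest = avg_and_max(values[1:])   # drop the near-zero point

print(f"all points:         avg={avg_all:.1f}%  max={max_all:.1f}%")
print(f"without first point: avg={avg_rest:.1f}%  max={max_rest:.1f}%")
```

A single point with a near-zero reference delay dominates both the average and the maximum even though the absolute error is the same everywhere, which is consistent with the explanation given for the RD columns.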

A benefit of the similarity in timing characteristics between CKANDkX and CKORkX is that we can easily address the problem caused by having both AND and OR gates on the same level of a clock tree. Let us redraw the clock tree in Fig. 4 by replacing T5 and T6 with an AND gate G2, T7 and T8 with an OR gate G3, and the type-matching AND gates with buffers, as shown in Fig. 6. Now we obtain a gated clock tree with different gated cells, i.e., G2 and G3, on the same level. Because G2 and G3 have similar timing characteristics and G1, B1, B2, and B3 also have similar timing characteristics due to delay-matching, we obtain a delay-balanced gated clock tree.

Once the transistor sizes of a delay-matching cell are determined, layout design follows. The cell height and power/ground rail widths of a delay-matching cell are made the same as those of the other cells in an industrial standard cell library based on UMC 90nm technology. Figure 7 shows layouts of some delay-matching cells. The delay-matching cells are characterized and added into the underlying standard cell library. Such a library is called a delay-matching library.

Figure 7. Layouts of CKNAND1X (left) and CKNOR1X (right).

C. Discussion
Some problems still need to be addressed for both the delay-matching and type-matching approaches. As shown in Fig. 8, the second level of the clock tree employs an inverter, a buffer, and an OR gate. With type-matching, the OR gate and the buffer can each be replaced by two serially connected NAND gates. However, this cannot be done for the inverter. Hence, the clock tree in Fig. 8 cannot be transformed into a type-matching gated tree. Similarly, with delay-matching, the OR gate and the buffer have similar timing characteristics, but the inverter and the buffer do not. Hence, the clock tree in Fig. 8 cannot be transformed into a delay-matching gated tree either. However, if we can design an inverter whose timing characteristics are similar to those of a buffer, we can still transform the tree into a delay-matching gated tree. Based on our experiments, we can achieve on average 10%, 20%, 17%, and 23% timing discrepancy between buffers and inverters for rise delay, rise time, fall delay, and fall time, respectively. Here, we architect a three-stage inverter so that its timing characteristics can be made as close as possible to those of its buffer counterpart.

Figure 8. A gated clock tree with opposite triggering phases.

IV. SYNTHESIS OF DELAY-MATCHING GATED CLOCK TREE
Advantageously, we do not need any special algorithm for synthesizing a delay-matching gated clock tree. A gated clock tree synthesis algorithm like those presented in [8-11] should be sufficient. For simplicity, we assume that we have a clock control logic unit synthesized based on flip-flop activity patterns for generating clock gating signals. Here, we apply delay-matching to three occasions of clock gating.

A. Delay-matching on a Homogeneous Gated Tree
A homogeneous gated tree is a clock tree where each path from the root to a sink traverses the same number of gates, including gated cells, clock buffers, and clock inverters, as shown in Fig. 3. For delay-matching, we replace the gated cells like G1 and G2 in Fig. 3 with our delay-matching cells. Once we have a delay-matching gated tree, we perform clock tree placement and routing to complete the clock tree design using commercial tools.

For the purpose of comparison, we implement the following clock gating methods.

NG: Non-gating.
NML: Normal clock gating using an industrial standard cell library based on UMC 90nm process technology. It is neither delay-matching nor type-matching.
DM: Delay-matching clock gating.
SmallT: Type-matching using small-drive gated cells.
BigT: Type-matching using large-drive gated cells as well.

• Synthesis of a DM gated tree
We first obtain a design without clock gating using Synopsys Design Compiler. We then place the design and perform clock tree synthesis with skew and slew constraints using Cadence SOC Encounter. At this point, we have a buffered and placed clock tree without clock gating. We randomly select a certain percentage of clock buffers (inverters) in the clock tree and change them into the corresponding gated cells. For simplicity, we wire together all the clock control signals on the same level. At this moment, we should have a clock tree like the one shown in Fig. 3. We place gated cells at the locations where their individual counterparts were originally located. Because gated cells are larger than their buffer (inverter) counterparts, we perform cell legalization using Cadence SOC Encounter to remove cell overlaps. We then use Encounter to complete clock tree and signal routing and obtain clock skew, latency, and slew at buffers and sinks.

• Synthesis of a BigT gated tree
We use a cell library augmented with large-drive gated cells for synthesizing a BigT gated tree. For simplicity, rather than using the type-matching algorithm presented in [14], we adopt an approach similar to our delay-matching one. A type-matching gated tree using BigT is obtained in a way similar to that of delay-matching except that, after obtaining a clock tree like the one shown in Fig. 3, we replace buffers (inverters) with type-matching AND (NAND) gates to obtain a type-matching tree like the one shown in Fig. 2.

• Synthesis of a SmallT gated tree
We use a cell library that only has clock buffers and inverters of driving capabilities 1X, 2X, 3X, 4X, and 6X for clock tree synthesis. As in the delay-matching approach, we randomly replace a certain percentage of buffers and inverters with gated cells and obtain a gated clock tree like the one shown in Fig. 3. We then perform type-matching synthesis to obtain a gated clock tree like the one shown in Fig. 2.

We evaluate the above clock gating approaches using three large ISCAS89 benchmark circuits (s35932, s38417, s38584) and eight large ITC99 benchmark circuits (b14, b15, b17~b22). These benchmarks have a few thousand to 100 thousand equivalent NAND gates. Recall that we randomly replace a certain percentage of clock buffers (inverters) with gated cells. Here, we try 15%, 30%, and 50%. We consider typical, best, and worst corners. The typical/best/worst corner uses a typical/fast/worst SPICE model at 25/-40/125 degrees C and 100%/90%/110% VDD supply voltage.

TABLE III shows the average values of various performance indices obtained by the above clock-gating approaches. At the typical corner, NML is obviously not viable. DM (delay-matching) has latency, skew, and slew close to those of NG (non-gating). Note that maintaining clock latency is important when re-spinning a design to implement clock gating. Delay-matching achieves substantial power saving with respect to NG at the expense of more cell area. Power data include the switching and leakage power of the clock tree but not the switching and leakage power of flip-flops. Note that a more significant power saving should be obtained from eliminating unnecessary switching of flip-flops and the logic driven by flip-flops. To further verify the data obtained by Encounter, we also use Encounter to generate a SPICE netlist of a clock tree and use HSPICE to simulate the clock tree to obtain power, latency, skew, and slews at buffers and sinks. These data are presented in the right-most five columns of TABLE III.

Compared with the type-matching approach SmallT, DM obtains a much better latency and better slews at buffers and sinks with a slightly worse skew. DM also uses smaller cell area and incurs smaller wire length. SmallT uses more cell area and larger wire length because it creates a large gated tree (due to the smaller fanout per driver) in order to satisfy the slew constraint at buffers and sinks. This also causes type-matching to create a deep gated tree, so that its latency is significantly larger than that of a delay-matching gated tree.

We observe from TABLE III that using BigT improves latency and slews, even though they are still worse than those of DM. However, BigT incurs larger skew and uses considerably more cell area than DM. Since the source/drain capacitance of a large-drive gated cell is larger than that of its buffer (inverter) counterpart, the latency of a BigT gated tree is still larger than that of a delay-matching gated tree.

TABLE III also shows results at the best and worst corners. As one can see, the skews of both the delay-matching gated tree and the type-matching gated tree are not sensitive to corner variations. However, the slews of a delay-matching gated tree appear to be less sensitive to corner variations than those of a type-matching gated tree. The reason may be that a delay-matching gated tree provides a larger drive than a SmallT gated tree does, or that each of its buffers (inverters) drives a smaller load than in a BigT gated tree.
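The gating-insertion step of the DM synthesis flow (randomly converting a percentage of buffers or inverters into gated cells and tying the enables of a level together) can be sketched as follows. This is an illustration only: the dictionary representation of the tree, the `insert_gating` and `to_gated` helpers, and the enable net names are hypothetical, and mapping buffers to AND gates and inverters to NAND gates is one assumed choice among the counterparts named in Section III. Placement, legalization, and routing, which the paper performs with commercial tools, are not modeled.

```python
import random

# Assumed counterpart choice: buffers become ANDs, inverters become NANDs.
GATED_COUNTERPART = {"CKBUF": "CKAND", "CKINV": "CKNAND"}


def to_gated(cell):
    """Map e.g. 'CKBUF4X' to 'CKAND4X', keeping the drive strength."""
    for base, gated in GATED_COUNTERPART.items():
        if cell.startswith(base):
            return gated + cell[len(base):]
    raise ValueError(f"no gated counterpart known for {cell}")


def insert_gating(tree_cells, fraction, seed=0):
    """Replace a random fraction of cells with their gated counterparts.

    tree_cells maps instance name -> (level, cell name). The result maps
    instance name -> (level, cell name, enable net or None). All gated
    cells on one level share a single enable net, as in the paper's setup.
    """
    rng = random.Random(seed)
    names = sorted(tree_cells)
    chosen = set(rng.sample(names, round(fraction * len(names))))
    result = {}
    for name, (level, cell) in tree_cells.items():
        if name in chosen:
            result[name] = (level, to_gated(cell), f"En_L{level}")
        else:
            result[name] = (level, cell, None)
    return result


# Hypothetical buffered clock tree: one 8X level and one 4X level.
clock_tree = {
    "B1": (1, "CKBUF8X"), "B2": (1, "CKBUF8X"),
    "B3": (2, "CKBUF4X"), "B4": (2, "CKBUF4X"), "B5": (2, "CKBUF4X"),
    "B6": (2, "CKBUF4X"), "B7": (2, "CKBUF4X"), "B8": (2, "CKBUF4X"),
    "B9": (2, "CKBUF4X"), "B10": (2, "CKBUF4X"),
}
for name, entry in sorted(insert_gating(clock_tree, 0.30).items()):
    print(name, entry)
```

Because each replaced instance keeps its level and drive strength, the tree stays homogeneous, and with delay-matching cells the swap is intended to leave path delays nearly unchanged.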
TABLE III. RESULTS FOR HOMOGENEOUS GATED TREE SYNTHESES USING DIFFERENT GATING METHODS (BS: WORST SLEW AT BUFFERS (INVERTERS); SS: WORST SLEW AT CLOCK SINKS; CTW: TOTAL WIRE LENGTH OF CLOCK ROUTING; CTA: TOTAL CELL AREA OF A GATED CLOCK TREE).

Corner   Method  Latency  Skew    BS     SS     CTW    CTA     Power  HSPICE  HSPICE   HSPICE  HSPICE  HSPICE
                 (ps)     (ps)    (ps)   (ps)   (um)   (um^2)  (mW)   Power   Latency  Skew    BS      SS
Typical  NG      146.32   15.97   19.40  18.95  12251  1264    2.31   2.17    168.06   17.37   24.60   22.25
Typical  NML     206.78   80.78   31.63  32.46  12271  1267    1.70   1.67    228.02   82.42   41.89   35.71
Typical  DM      150.10   19.84   21.45  20.18  12458  1673    1.86   1.77    173.14   22.16   27.64   24.20
Typical  SmallT  280.47   14.45   38.10  23.00  13387  2558    1.90   1.96    295.35   15.88   42.23   25.83
Typical  BigT    189.48   22.01   27.02  25.09  12517  2349    2.03   2.09    209.42   24.64   37.23   28.05
Best     NG      112.12   16.42   16.33  15.83  12251  1264    2.87   3.06    140.83   18.26   24.12   20.19
Best     NML     152.95   59.76   24.52  25.59  12271  1267    2.13   2.37    185.52   66.93   36.51   30.70
Best     DM      116.63   21.03   18.85  17.75  12458  1673    2.39   2.50    146.35   23.46   27.79   22.30
Best     SmallT  202.27   13.58   27.90  18.44  13387  2558    2.58   3.05    233.46   15.37   37.39   22.26
Best     BigT    146.07   23.84   22.39  21.29  12517  2349    2.73   3.28    176.86   26.95   35.55   25.96
Worst    NG      210.94   16.57   24.82  24.11  12251  1264    1.98   1.58    231.83   17.42   29.34   27.24
Worst    NML     310.65   123.56  44.25  45.32  12271  1267    1.46   1.22    324.09   118.03  52.85   46.44
Worst    DM      214.59   19.91   26.44  25.35  12458  1673    1.56   1.27    236.43   22.06   31.25   28.98
Worst    SmallT  440.69   17.49   56.38  33.45  13387  2558    1.60   1.34    438.26   18.99   54.87   36.04
Worst    BigT    276.08   21.73   35.74  33.50  12517  2349    1.83   1.44    288.92   23.91   43.12   35.25
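The trade-offs discussed for TABLE III can be expressed relative to the non-gated tree. The sketch below uses the typical-corner Encounter values from TABLE III (latency, skew, CTA, and power); the `relative_to_ng` helper and the percentage metrics are illustrative conveniences, not quantities reported in the paper.

```python
# Typical-corner values copied from TABLE III:
# method -> (latency in ps, skew in ps, CTA in um^2, power in mW)
TYPICAL = {
    "NG":     (146.32, 15.97, 1264, 2.31),
    "NML":    (206.78, 80.78, 1267, 1.70),
    "DM":     (150.10, 19.84, 1673, 1.86),
    "SmallT": (280.47, 14.45, 2558, 1.90),
    "BigT":   (189.48, 22.01, 2349, 2.03),
}


def relative_to_ng(method, table=TYPICAL):
    """Percentage change of a gating method with respect to non-gating."""
    lat, skew, area, power = table[method]
    lat0, skew0, area0, power0 = table["NG"]
    return {
        "latency_increase_pct": (lat - lat0) / lat0 * 100.0,
        "skew_increase_ps": skew - skew0,
        "area_increase_pct": (area - area0) / area0 * 100.0,
        "power_saving_pct": (power0 - power) / power0 * 100.0,
    }


for method in ("NML", "DM", "SmallT", "BigT"):
    stats = relative_to_ng(method)
    print(method, {k: round(v, 1) for k, v in stats.items()})
```

On these numbers DM saves roughly 19% of clock tree power for under 3% extra latency and about 32% more cell area, whereas SmallT nearly doubles the latency of the non-gated tree, which matches the qualitative comparison in the text.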
B. Delay-matching on a Non-homogeneous Gated Tree
In this section, we employ a delay-matching library (DM) and a conventional cell library (UMC90), respectively, to synthesize a gated clock tree using Synopsys Design Compiler. The gated tree so obtained is usually non-homogeneous, i.e., not every path from the root to a sink traverses the same number of gates. The tree has yet to be buffered, placed, and routed to meet the skew, slew, and latency constraints. We carry out these tasks using SOC Encounter. TABLE IV shows that delay-matching achieves smaller latency and better slew but larger area and skew. These data are averages of the results for all the benchmarks mentioned in Section IV.A.

TABLE IV. RESULTS OBTAINED BY USING DIFFERENT LIBRARIES TO SYNTHESIZE NON-HOMOGENEOUS GATED TREES.

Corner   Library  Latency  Skew   BS     SS     CTW    CTA
                  (ps)     (ps)   (ps)   (ps)   (um)   (um^2)
Typical  UMC90    199.6    17.4   26.0   21.5   20993  2203
Typical  DM       176.1    25.7   23.3   19.5   20028  4790
Best     UMC90    148.7    18.6   21.5   17.0   23541  2459
Best     DM       143.4    30.2   20.2   17.5   22382  5556
Worst    UMC90    335.0    28.4   36.0   32.4   23541  2459
Worst    DM       271.7    39.3   30.7   27.4   22382  5556

C. ECO of Gated Clock Tree with Delay-Matching
We can use delay-matching cells to perform ECO for a gated clock tree so that the skew, latency, and slew at buffers and sinks of the tree are maintained. As shown in Fig. 9, we can replace a CKBUFkX B1 with a delay-matching cell CKANDkX G1, or vice versa. Since B1 and G1 have similar timing characteristics, the skew, latency, and slews of these two gated trees will not differ much. To validate this argument, we perform ECO for a non-homogeneous gated tree by replacing one to three buffers with their delay-matching cell counterparts. TABLE V shows the ECO results obtained by using our delay-matching library (DM) and an industrial cell library (UMC90) based on UMC 90nm process technology. The UMC90 library is designed for neither delay-matching nor type-matching. For each library at each corner, there are two rows of data. The first row gives the data before ECO, whereas the second row gives the data after ECO. The data in TABLE V are averages of the results for all benchmarks mentioned in Section IV.A. As one can see, performing ECO for a gated clock tree using a delay-matching cell library largely preserves the performance characteristics of the original gated tree. Note that it is important to maintain the clock latency of a design under functional ECO (non-performance related ECO).

Figure 9. ECO by replacing buffer B1 with gated cell G1 or vice versa.

TABLE V. ECO RESULTS OBTAINED BY USING A DELAY-MATCHING LIBRARY AND AN INDUSTRIAL CELL LIBRARY.

Corner   Library  ECO     Latency  Skew   BS     SS     CTW    CTA
                          (ps)     (ps)   (ps)   (ps)   (um)   (um^2)
Typical  UMC90    before  143.3    17.2   21.2   16.7   20993  2203
Typical  UMC90    after   154.9    27.3   30.3   17.0   20996  2178
Typical  DM       before  137.3    27.4   19.6   16.8   20028  4790
Typical  DM       after   138.3    28.4   19.8   16.8   20045  4816
Best     UMC90    before  143.3    17.2   21.2   16.7   20993  2203
Best     UMC90    after   154.9    27.3   30.3   17.0   20996  2178
Best     DM       before  137.3    27.4   19.6   16.8   20028  4790
Best     DM       after   138.3    28.4   19.8   16.8   20045  4816
Worst    UMC90    before  321.5    26.2   36.4   31.7   20993  2203
Worst    UMC90    after   359.4    64.1   56.8   32.6   20996  2178
Worst    DM       before  261.6    34.5   30.1   26.3   20028  4790
Worst    DM       after   262.0    35.0   30.4   26.4   20045  4816

V. CONCLUSIONS
In this paper we present a delay-matching approach to optimizing a gated clock tree for power reduction. We also develop a method for designing delay-matching cells. Compared with type-matching, delay-matching attains better slew and clock latency with comparable clock skew while using much less cell area. In addition, the skew of a delay-matching gated tree, like that of a tree generated by type-matching, is insensitive to process and operating corner variations. Meanwhile, delay-matching ECO excels at maintaining the original timing characteristics of a gated tree. The disadvantage of delay-matching is that a cell library has to be re-engineered to include delay-matching cells. Fortunately, this task can be automated easily and efficiently. Delay-matching is a general concept, so its applications should not be limited to those presented in this work.

REFERENCES
[1] L. Benini and G.D. Micheli, “Automatic Synthesis of Low-Power Gated-Clock Finite-State Machines,” IEEE Trans. on CAD, Vol. 15, No. 6, pp. 630-643, 1996.
[2] N. Raghavan, V. Akella, and S. Bakshi, “Automatic Insertion of Gated Clocks at Register Transfer Level,” International Conference on VLSI Design, pp. 48-54, 1999.
[3] D. Borkovic and K.S. McElvain, “Reducing Clock Skew in Clock Gating Circuits,” United States Patent, Patent No. 7082582, 2006.
[4] Q. Wu, M. Pedram, and X. Wu, “Clock Gating and Its Application to Low Power Design of Sequential Circuits,” IEEE Trans. on Circuits and Systems – I: Fundamental, Theory, and Applications, Vol. 47, No. 103, pp. 415-420, 2000.
[5] L. Li, K. Choi, S. Park, M. K. Chung, “Novel RT Level Methodology for Low Power by Using Wasting Toggle Rate based Clock Gating,” International SoC Design Conference, pp. 484-487, 2009.
[6] E. Arbel, C. Eisner, and O. Rokhlenko, “Resurrecting Infeasible Clock-Gating Functions,” DAC, pp. 160-165, 2009.
[7] A. Farrahi, C. Chen, A. Srivastava, G. Tellez, and M. Sarrafzadeh, “Activity Driven Clock Design,” IEEE Trans. on CAD, Vol. 20, No. 6, pp. 705-714, 2001.
[8] W. C. Chao and W. K. Mak, “Low-Power Gated and Buffered Clock Network Construction,” ACM TODAES, Vol. 13, No. 1, Article 20, January 2008.
[9] D. Garret, M. Stan, and A. Dean, “Challenges in Clock Gating for a Low Power ASIC Methodology,” ISLPED, pp. 176-181, 2002.
[10] W. Shen, Y. Cai, X. Hong, and J. Hu, “An Effective Gated Clock Tree Design Based on Activity and Register Aware Placement,” IEEE Transactions on VLSI Systems, Vol. 18, No. 12, pp. 1639-1648, 2009.
[11] D. J. Hathaway, “Method for Making Integrated Circuits Having Gated Clock Trees,” United States Patent, Patent No. 6536024, 2003.
[12] J. Oh and M. Pedram, “Gated Clock Routing for Low-Power Microprocessor Design,” IEEE Trans. on CAD, Vol. 20, No. 6, pp. 715-722, 2001.
[13] C. C. Cheung and K. D. Au, “Clock Gating Cell for Used in a Cell Library,” United States Patent, Patent No. 6552572, 2003.
[14] C. M. Chang, S. H. Huang, Y. K. Ho, J. Z. Lin, H. P. Wang, and Y. S. Lu, “Type-Matching Clock Tree for Zero Skew Clock Gating,” DAC, pp. 714-719, 2008.
