Fight the Power

Power reduction ideas for ASIC designers and tool providers

Serag GadelRab
David Bond
David Reynolds
Tundra Semiconductor Corporation
603 March Road, Ottawa, Ontario, K2K 2M5, Canada
Introduction
• POWER is a big headache!
• Average power reduction = longer battery life
– Embedded systems don’t run on batteries!
• Need to cut Maximum power!
• Increased maximum power causes:
– Increased costs:
▪ Heat-sinks
▪ Bigger power supplies
▪ More cooling equipment
– Longer Project Times
▪ Complex thermal design
▪ Complex mechanical design
▪ Reduced Reliability due to junction temperature

• Can your system afford the ASIC maximum power?


GadelRab, Bond & Reynolds Page 2
Outline
• Power Calculation
• Core Power – where does it come from?
• Conserving Core Power:
– Reducing SRAM & Clock Tree Power
– Tips for ASIC designers
– Feature requests for vendors
• Does clock gating reduce maximum power?
• Recommendations and conclusions



Power Calculation!

• Next paper in session addresses power calculations


• The hard-copy of our paper describes power estimation for:
– Core Power
– I/O Power
– Macro Power
• Read it at your leisure!

• Let's jump into how to save core power!



What consumes Core Power?
• Depends on your ASIC…
• Typical example:
Component               Max Power   Percentage
PLL/Macros              0.1           7%
Clock Trees             0.75         52%
Standard Cells          0.1           7%
Interconnect            0.09          6%
RAMs (incl. leakage)    0.24         17%
Logic Leakage           0.16         11%
Total                   1.45        100%
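
The percentage column follows from the power column. A minimal Python sketch that recomputes it from the slide's figures; the names `core_power` and `percentages` are illustrative, and the slide does not state the power units. The listed entries sum to 1.44, so the slide's total of 1.45 presumably reflects rounding of the individual values.

```python
# Max power per component, copied from the slide (units as on the slide).
core_power = {
    "PLL/Macros": 0.10,
    "Clock Trees": 0.75,
    "Standard Cells": 0.10,
    "Interconnect": 0.09,
    "RAMs (incl. leakage)": 0.24,
    "Logic Leakage": 0.16,
}


def percentages(power):
    """Return each component's share of the summed power, in whole percent."""
    total = sum(power.values())
    return {name: round(100 * value / total) for name, value in power.items()}


shares = percentages(core_power)
largest = max(shares, key=shares.get)

for name, pct in shares.items():
    print(f"{name:22s} {pct:3d}%")
print("Largest consumer:", largest)
```

The clock trees dominate, which is why the later slides concentrate on clock tree and SRAM power.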



Segment SRAMs to save power:

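[Figure: Left, one memory of M rows and N columns with inputs D, A and Enable (CE) and output Q. Right, two memories of M/2 rows and N columns each, sharing D and A[msb-1:0]; A[msb] and Enable drive the two CE inputs, and outputs Q1 and Q2 are combined into Q.]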
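
Reading the diagram labels (A[msb], A[msb-1:0], Enable, CE), the segment select can be sketched as below. This is a behavioural Python model under that reading, not the authors' RTL; the function name and argument names are hypothetical.

```python
def segmented_access(addr, enable, addr_bits):
    """Split one access into per-segment chip enables and a local address.

    addr_bits is the address width of the original single memory.
    Returns (ce_low, ce_high, local_addr).
    """
    msb = (addr >> (addr_bits - 1)) & 1               # A[msb] picks the segment
    local_addr = addr & ((1 << (addr_bits - 1)) - 1)  # A[msb-1:0] goes to both
    ce_low = bool(enable) and msb == 0                # lower half enabled
    ce_high = bool(enable) and msb == 1               # upper half enabled
    return ce_low, ce_high, local_addr


# Address 0b1010 in a 16-entry memory lands in the upper segment, entry 2.
print(segmented_access(0b1010, True, 4))
```

Only one segment sees an active CE per access, so only one half-size array is exercised at a time.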



It works…but returns diminish!
• Single Port 16KB memory
• 32 Bits @ 266MHz & 64 Bits @ 133 MHz
• Power savings 18-25%
[Figure: Power/Area Trade-off. X axis: memory instances used to form a 16KB block (1, 2, 4, 8). Left Y axis: current (mA), 0 to 20. Right Y axis: normalized area increase, 0 to 5. Series: 32-bit current, 64-bit current, 32b area, 64b area.]



Similar results for dual-port RAM
• Same trends, less power savings
[Figure: Power/Area Trade-off. X axis: memory instances used to form a 16KB memory (1, 2, 4, 8). Left Y axis: current (mA), 0 to 50. Right Y axis: normalized area increase, 0 to 5. Series: 32-bit data current, 64-bit data current, 32-bit data area, 64-bit data area.]



Some disappointments….
• Got power savings (18-25%)
– Not as much as we expected!
• We suspect that
– Segmentation reduces row-decode power
– Pre-charge/sense-amplifier power does not scale
▪ Compiler uses same analog circuits?
▪ Explains why 4X smaller memory = 18-25% power savings
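
One way to read the 18-25% figure is a two-part model: some fraction of access power scales with array size and the rest is fixed. The Python sketch below assumes the scalable part shrinks linearly with size, which is an assumption made here for illustration and not a claim from the slides.

```python
def scaling_fraction(savings, size_ratio):
    """Fraction of power that scales with size, given the observed savings.

    Model (assumed): P(size) = fixed + scalable * size_ratio, normalised so
    that fixed + scalable = 1 for the original memory. Then
    savings = scalable * (1 - size_ratio).
    """
    return savings / (1.0 - size_ratio)


# A 4X smaller memory (size_ratio = 0.25) that saves 18% or 25%:
for savings in (0.18, 0.25):
    print(f"{savings:.0%} savings -> {scaling_fraction(savings, 0.25):.0%} of power scales")
```

Under that model only about a quarter to a third of the memory's power scales with size, which is consistent with the suspicion that the pre-charge and sense-amplifier circuits stay the same.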



Cut SRAM power by increasing width

• Increased width = less power


– Same total storage
– Frequency scaled
[Figure: Choose Data Path Width Wisely. X axis: memory size (KB): 16, 8, 4, 2. Left Y axis: current (mA), 0 to 25. Right Y axis: power increase (%), 0 to 100. Series: 32-bit data path, 64-bit data path, % power difference.]



But I don’t want to re-design entire data-path!

• …and wider path = more logic leakage!


• Deterministic access patterns? Try this:

[Figure: Data Path Width in -> FIFO/Memory with Row Width = 2 x Data Path Width -> Data Path Width out]
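
The idea of a row twice as wide as the data path can be sketched as a FIFO that packs two words per memory row, so the memory is accessed every second word. This is a hypothetical Python model (class and counter names are invented here), and it assumes words are pushed and popped in pairs.

```python
class WideRowFifo:
    """FIFO whose storage rows hold two data-path words (illustrative model)."""

    def __init__(self, word_bits):
        self.word_bits = word_bits
        self.rows = []         # each entry is one double-width memory row
        self.pending = None    # first word of a pair, held in a register
        self.read_buf = []     # words unpacked from the last row read
        self.mem_writes = 0    # actual memory write accesses
        self.mem_reads = 0     # actual memory read accesses

    def push(self, word):
        if self.pending is None:
            self.pending = word                       # no memory access yet
        else:
            row = (word << self.word_bits) | self.pending
            self.rows.append(row)                     # one write per two words
            self.mem_writes += 1
            self.pending = None

    def pop(self):
        if not self.read_buf:
            row = self.rows.pop(0)                    # one read per two words
            self.mem_reads += 1
            mask = (1 << self.word_bits) - 1
            self.read_buf = [row & mask, row >> self.word_bits]
        return self.read_buf.pop(0)


fifo = WideRowFifo(word_bits=32)
for w in (1, 2, 3, 4):
    fifo.push(w)
out = [fifo.pop() for _ in range(4)]
print(out, fifo.mem_writes, fifo.mem_reads)
```

The data path on either side stays at its original width; only the memory sees the wider, slower access pattern.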



Cut SRAM power by turning it OFF
• Not just “De-select”!
• Data/address must not toggle
• Why not put an AND gate at SRAM inputs?
[Figure: Deselect is Not Enough! X axis: memory size (KB): 16, 8, 4, 2. Y axis: current (mA), 0 to 20. Series: 32-bit data path, 64-bit data path, deselect current (32b/64b), standby current (32b/64b).]
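
The slide asks about AND gates at the SRAM inputs. A small Python sketch of what such gating does to pin activity, assuming the gate forces the bus to 0 while the memory is idle; the bus values and helper names are made up for illustration.

```python
def count_toggles(values):
    """Count bit flips between consecutive values on a bus."""
    return sum(bin(a ^ b).count("1") for a, b in zip(values, values[1:]))


def isolate(values, enables):
    """AND each bus value with its enable, so the bus reads 0 while disabled."""
    return [v if en else 0 for v, en in zip(values, enables)]


bus = [0x0, 0xF, 0x0, 0xF, 0x0, 0xF]                  # upstream logic keeps toggling
enables = [True, True, False, False, False, False]    # memory idle from cycle 2 on

raw_toggles = count_toggles(bus)
gated_toggles = count_toggles(isolate(bus, enables))
print(raw_toggles, gated_toggles)
```

With the gate in place, the memory pins stop moving once the enable drops, which is the condition the slide gives for getting from deselect current down to standby current.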



Cut RAM power by Auditing
• Is memory Deselected?
• Is memory in Standby?
• Is this the “right” memory:
– Must it be dual/double ported?
– Is the memory too large?
• Poor design choices are forgotten in “fog of battle”
• And remember….

READ THE MEMORY COMPILER MANUAL!



What we want from compiler vendors

• Better power-scaling with size!


• Compilers that segment RAMs internally
– Manual segmentation of memories = rats-nests
– Much P&R pain



Clock tree eats too much power!
• Tree power = 40-60% of core power

Component                                  % of Total Capacitance
Endpoint Capacitance (sequential cells)    35%
Tree driver input capacitance              25%
Route Capacitance                          40%
Total                                      100%

• What can an ASIC designer do?
– Cut the Endpoint Capacitance!
▪ Reduce the number of flops in your design!
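
A rough Python sketch of how much tree capacitance a flop reduction buys, using the capacitance shares from the table above. It assumes, pessimistically, that only the endpoint share shrinks; in practice fewer endpoints may also mean fewer drivers and less routing. The power helper uses the standard dynamic power relation P = C · V² · f for a net that charges and discharges once per cycle.

```python
# Capacitance shares taken from the slide's table.
ENDPOINT, DRIVER_INPUT, ROUTE = 0.35, 0.25, 0.40


def tree_capacitance_after(flop_reduction):
    """Relative tree capacitance after removing a fraction of the flops.

    Assumption: driver and route capacitance are left unchanged.
    """
    return ENDPOINT * (1.0 - flop_reduction) + DRIVER_INPUT + ROUTE


def clock_power(capacitance_f, vdd_v, freq_hz):
    """Dynamic power (W) of a net with one full charge/discharge per cycle."""
    return capacitance_f * vdd_v ** 2 * freq_hz


# Removing 20% of the flops under this model:
print(f"{tree_capacitance_after(0.20):.2f} of the original tree capacitance")
```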


Cut flops to save core power!
• Use algorithms that need fewer flops
– Silicon may be cheap
… But power is expensive!
• Are these flops there to meet timing?
– Pipeline for performance only!
– Otherwise use multi-cycle paths!
• Beware of constant propagation.
– DC/PC may not remove FF inputs tied to constants:
▪ Library problem?
▪ Multi-module separation?
▪ Cascaded flops?

• Audit to ensure that you are not “over-clocking” your blocks!


Multi-cycle vendor requests
• Designers are told to avoid multi-cycle paths (MCP):
– “Hard” to design properly
– “Difficult” to describe in different tool flows
– Incompatible description between tools
– Special handling at gate-level simulation time
• We need from vendors:
– MCP “validation” tools
▪ Is there a MUX to prevent meta-stability?
– Better inter-vendor and intra-vendor tool support
– Improved “portability” of MCP description format
– “Streamlined” handling of MCP in gate-level simulations



Yeah, but what about clock gating?

• Two types:
– Tool-based clock gating at tree leaves
– Manual clock gating at tree base
• Tool-based gating at tree leaves
– “Hides” end-point capacitance from driver
– Adds tree driver power to save average end-point power
• Manual gating at tree-base (root-gating)
– Shuts the clock down to entire (sub) block
– Cuts all tree power
– Little overhead



How does leaf-gating work?
• Power Compiler searches for this in your RTL:

[Figure: A flop (D, Q) fed by a MUX that selects between Previous Value and New Value; MUX Control Logic drives the select.]

• MUX is removed and the clock-gate is added!
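
The transformation can be illustrated with a cycle-level Python model: a flop with a recirculating MUX and a clock-gated flop produce the same output, but the gated flop sees fewer clock edges. This is a behavioural sketch with invented function names, not Power Compiler's actual implementation.

```python
def run_mux_flop(new_values, enables, q=0):
    """Flop clocked every cycle; a MUX recirculates Q when enable is low."""
    trace, clock_edges = [], 0
    for new, en in zip(new_values, enables):
        d = new if en else q   # MUX: new value or previous value
        q = d                  # flop captures on every clock edge
        clock_edges += 1
        trace.append(q)
    return trace, clock_edges


def run_gated_flop(new_values, enables, q=0):
    """Same flop with the MUX removed; the clock reaches it only when enabled."""
    trace, clock_edges = [], 0
    for new, en in zip(new_values, enables):
        if en:
            q = new
            clock_edges += 1
        trace.append(q)
    return trace, clock_edges


new_values = [1, 2, 3, 4]
enables = [True, False, False, True]
print(run_mux_flop(new_values, enables))
print(run_gated_flop(new_values, enables))
```

The saving comes entirely from the suppressed clock edges, so it depends on how often the enable is actually low.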


Leaf-gating may INCREASE power!
• Power Compiler finds flops using threshold:

set_clock_gating_style -minimum_bitwidth <value>

• What if this flop-bank toggles 90% of the time?


– Do you save 10% of the power? NO!

• Delay of clock-gating cells vs. tree buffers


– CTS inserts buffers in non-gated branches to balance skew
– Increases the OVERALL clock tree power.
• Leaf-gating increases constant clock tree power = BAD!
– Hope to recover more power by hiding leaf capacitances



Magnitude of clock tree power increase

• 4650 flops running at 150MHz


• Power due to routing and drivers (not leaves):
– Prior to clock gating: 56 mW
– After clock gating: 73 mW (a 30% increase!)

• Can you recover the 30% increase in power?


– Must be saved during maximum power consumption

• Bitwidth-threshold selection of clock-gating flops does not guarantee overall power savings!
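
A back-of-envelope Python check of the break-even point. The 56 mW and 73 mW figures are from the slide. The leaf (endpoint) clock power is not given, so the sketch ASSUMES it relates to routing plus driver power in the same 35:65 ratio as the typical capacitance split shown earlier; that borrowed ratio is an illustration only.

```python
def overhead_increase(before_mw, after_mw):
    """Relative increase in routing and driver power caused by gating."""
    return (after_mw - before_mw) / before_mw


def required_leaf_saving(before_mw, after_mw, leaf_power_mw):
    """Fraction of leaf clock power gating must remove just to break even."""
    return (after_mw - before_mw) / leaf_power_mw


ROUTE_DRIVER_BEFORE_MW = 56.0   # from the slide
ROUTE_DRIVER_AFTER_MW = 73.0    # from the slide

# ASSUMPTION: leaf power is not in the source; 35% endpoint vs 65% route+driver
# is borrowed from the typical capacitance table and may not apply here.
ASSUMED_LEAF_POWER_MW = ROUTE_DRIVER_BEFORE_MW * 35.0 / 65.0

increase = overhead_increase(ROUTE_DRIVER_BEFORE_MW, ROUTE_DRIVER_AFTER_MW)
break_even = required_leaf_saving(
    ROUTE_DRIVER_BEFORE_MW, ROUTE_DRIVER_AFTER_MW, ASSUMED_LEAF_POWER_MW
)
print(f"overhead increase: {increase:.0%}")
print(f"leaf power that must be gated at peak to break even: {break_even:.0%}")
```

Under that assumption, more than half of the leaf clock power would have to be gated off at the moment of maximum power consumption before leaf-gating pays for itself.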



Wish list for clock gating/synthesis
• Need a leaf-gating selection method that uses:
– Maximum activity threshold
▪ Use actual toggle rate -- not just bitwidth
▪ I want to turn off bitwidth!
• Leaf-gating decision must:
– Include effect of gating on overall tree power consumption
– Understand effect of clock gating on clock tree skew compensation
▪ Some “empirical” global model is needed

• Better clock-tree synthesis algorithms:


– Timing aware tree synthesis need not skew-match all clock-gated nodes!
• Method to evaluate effectiveness of clock gating early in the design cycle:
– Not after P&R!
• Pragma to force some flops to be clock gated
– More portable between projects!



Manual Root Gating
• Insert one clock gate at base of tree!
– Does not increase clock tree size
– Saves all power in the clock tree

• Works well if one can shut off entire block

• What if you cannot?


– Divide your block into:
± Operational Registers
± Configuration Registers
– Turn on Configuration Registers tree on CPU accesses = “Deep-Branch” clock gating

Deep Branch Clock Gating
[Figure: Configuration Registers and Operational Registers are in one synchronous domain. Each sits on its own clock tree branch from the root of the tree, and operational logic controls the branch clock gating.]



Effect of clock gating on power grid
• Root-Gating/Deep-Branch gating:
– Turning a large tree ON/OFF = transients on power grid
– If power grid does not absorb transients = logic errors!
• Leaf-Gating:
– Generates transients on power grid if many flops are gated
– Large number of flops turn on/off synchronously
▪ Transients become a “periodic” signal = BAD!
– “Transient” signal injected into analog macros through grid
▪ Extremely hard to quantify and combat!



Need tools for leaf-gating grid effects
• Designers:
– Beware of gating/grid interplay
▪ Even for leaf-gating!
▪ Currently using “home-brewed” concoction to identify and remedy the problem
• Synopsys:
– Need tools to evaluate effect of leaf-gating on power-grid noise
– Must be based on:
▪ Synchronous digital toggle rate simulations
▪ Clock-tree skew figures



Take home advice for designers:
• Memories
– Segment large RAMs
– Use wide D/Q buses
– Ensure the RAM is OFF
– AUDIT!!!
• Clock Trees
– Be frugal with flops
– Use multi-cycle paths
– Audit to ensure constant propagation
– Audit to eliminate over-clocking
– Do manual clock gating where possible (incl. deep-branch gating)
– Beware of automatic bitwidth-based leaf-gating



Laundry list for vendors
• Memory compilers
– Create segmented memory cores
– Better scaling of power with array size
• Synopsys
– Better support for multi-cycle paths
– Use maximum toggle rate as a leaf-gating criterion
– Clock gating methodology based on MAXIMUM (not average) power
– Ability to turn-off bitwidth-threshold
– Pragma to force leaf-gating
– Estimates for leaf-gating effectiveness early in design cycle
– Tools to evaluate/eliminate periodic transients on power grid
– Clock trees with fewer driver cells
– Timing-based clock tree synthesis
– The moon ☺


