Clock Concurrent Optimization
Paul Cunningham, Marc Swinnen, Steev Wilcox
Electronic Design Processes
April 10, 2009
© Azuro, Inc. 2009 1
The Clock Timing Gap
Traditional Design Flows
RTL
Synthesis
Initial Placement
Physical Optimization
CTS
Post-CTS Optimization
Routing
Post-Route Optimization
Final Layout
Before CTS: chip speed measured using “ideal” clocks
After CTS: chip speed measured using “propagated” clocks
Traditional purpose of CTS: make propagated clocks look like ideal clocks by building “balanced” clock networks
Reality Today
RTL (e.g. Verilog)
Synthesis
Initial Placement
Physical Optimization
CTS
Post-CTS Optimization
Routing
Post-Route Optimization
Final Layout
MANY ITERATIONS
BIG DIFFERENCE!! between Ideal Timing (before CTS) and Propagated Timing (after CTS), driven by:
Clock Gating
Clock Muxing
Clock Generators
– Especially for hold
Complex Scan Chains
OCV derates and CPPR
Multi-corner
– Especially for hold
Multi-mode
Technology Trends
Opening the Clock Timing Gap
Trends Driving the Clock Timing Gap
[Figure: flip-flops A, B and C clocked from clock (T) through a clock tree labeled 2X to 5X clock period, with OCV ± 10% on the clock branches; logic delay D ≈ clock period between flip-flops; CPPR differs per pair (CPPR_AB, CPPR_BC); the Clock Timing Gap is marked with a "?"]
Traditional Optimization
D < T - skew
“Skew” does not include OCV effects
OCV
OCV affects each pair of FFs differently (CPPR)
OCV effect can be very big - e.g. 10% of 3T
CTS cannot predict OCV impact
So, “skew=0” does not mean FFs are really balanced
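The "10% of 3T" figure can be reproduced with a little arithmetic. The sketch below is illustrative only: the function name, the split of each clock path into a shared and a non-shared part, and the example numbers are assumptions, not taken from any tool. It applies a ± 10% derate to the launch and capture clock paths of a nominally balanced pair of FFs and credits back the pessimism on the shared segment (CPPR).

```python
def ocv_setup_penalty(launch_delay, capture_delay, common_delay, derate=0.10):
    """Setup time lost to OCV after CPPR, in units of the clock period T.

    launch_delay / capture_delay: nominal clock insertion delays to the two FFs.
    common_delay: nominal delay of the clock segment shared by both FFs.
    All names and the simple two-sided derate model are assumptions.
    """
    # Setup worst case: launch clock arrives late, capture clock arrives early.
    late_launch = launch_delay * (1 + derate)
    early_capture = capture_delay * (1 - derate)
    raw_penalty = (late_launch - early_capture) - (launch_delay - capture_delay)
    # CPPR: the shared segment cannot be late and early at the same time,
    # so the derate spread applied to it is credited back.
    cppr_credit = common_delay * 2 * derate
    return raw_penalty - cppr_credit


# Both FFs sit 3T down a balanced tree, so nominal skew is zero in both cases.
near_pair = ocv_setup_penalty(3.0, 3.0, common_delay=2.5)  # long shared path
far_pair = ocv_setup_penalty(3.0, 3.0, common_delay=0.5)   # short shared path
print(f"near pair loses {near_pair:.2f} T, far pair loses {far_pair:.2f} T")
```

Under these assumptions the two pairs see the same skew of zero but lose very different amounts of the clock period, which is the sense in which “skew=0” does not mean the FFs are really balanced.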
Trends Driving the Clock Timing Gap
[Figure: clock (T) driving a clock tree labeled 2T … 5T; a clock gate (CG) with its enable input sits partway down the tree, at a CG offset from the flip-flops; the Clock Timing Gap is marked with a "?"]
Traditional Optimization
D < T - skew
“Skew” does not include gate offsets
Clock Gating
Clock gates are supposed to have a very big skew
Traditional optimization tries to prevent this by ‘cloning’ the gates and pushing them down the tree
Traditional approach cannot correctly optimize or time CG enable paths
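One way to see why gate offsets matter: a clock gate sits partway down the tree, so its clock pin sees the edge earlier than the flip-flops below it. The sketch below is a minimal illustration assuming a single clock and zero setup time; the function name and the numbers are hypothetical. It shows how an enable path launched from a flip-flop and captured at the gate loses the offset from its time budget.

```python
def enable_path_budget(period, ff_insertion_delay, cg_insertion_delay):
    """Time available to an FF -> clock-gate enable path (same units as period).

    Assumes the launching FF is clocked at ff_insertion_delay and the clock
    gate's clock pin at cg_insertion_delay; setup time is ignored.
    """
    # Launch at ff_insertion_delay, capture one period later at the gate.
    return period + cg_insertion_delay - ff_insertion_delay


T = 1.0
reg_to_reg = enable_path_budget(T, 3.0, 3.0)  # balanced FFs: the full period
reg_to_cg = enable_path_budget(T, 3.0, 2.5)   # gate 0.5 T above the FFs
print(reg_to_reg, reg_to_cg)
```

With the gate half a period above the flip-flops, the enable path has only half a period available, even though a skew report over the flip-flops alone would show nothing wrong.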
Trends Driving the Clock Timing Gap
[Figure: clocks Clk-A, Clk-B, Clk-C and Clk-D, combined through logic (AOI2), driving flip-flop groups of between 500 and 10,000 FFs. Are all FFs “balanced”? The Clock Timing Gap is marked with a "?"]
Traditional Optimization
D < T - skew
“Skew” does not include interclock skew
Clock Complexity
Clock balancing becomes very difficult, or even theoretically impossible
Requires extensive manual intervention
Final clock implementation is very different from original, ideal assumptions
The Clock Timing Gap is Growing
Pre-CTS Timing Report, then CTS, then Post-CTS Timing Report
Propagated clocks timing and ideal clocks timing are diverging
The clock timing gap is growing exponentially
[Figure: distribution of paths by timing difference, one curve per process node. x-axis: Difference in Pre- to Post-CTS Timing (% of period T); y-axis: Number of Paths]
180nm, σ = 7% of T
65nm, σ = 27% of T
45nm, σ = 50% of T
Ideal vs. Propagated Clocks Timing Gap
Difference between ideal and propagated timing across 60 chips
– Top 10% worst violating paths
– Difference measured as a percentage of clock period
[Figure: bar chart. x-axis: process node (180nm, 130nm, 65nm, 45nm); y-axis: 0% to 60% of clock period; series: reg-to-reg, reg-to-cg, inter-clock, with ocv]
Key Limitation of Traditional Flows
RTL
Synthesis
Initial Placement
Physical Optimization
CTS
Post-CTS Optimization
Routing
Post-Route Optimization
Final Layout
“Ideal clocks” (before CTS): big decisions about chip speed vs. area vs. power made here using ideal clocks
At CTS: two worlds tearing apart (more than 50% at 40nm!!)
“Propagated clocks” (after CTS): downstream steps don’t have the freedom to correct all the mistakes made pre-CTS in the flow
The Key Problems
Physical timing optimization today is all based on ideal clocks timing
– Timing opt is based on wrong information (like wire load models in the past)
– Cannot see the real timing situation
Clock balancing is not achievable, not necessary, and not helpful
– Even if CTS skew=0, Propagated timing ≠ Ideal timing
– Clock balancing imposes severe restrictions on timing optimization – for no benefit
Solution: Clock Concurrent Optimization
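RTL
Synthesis
Initial Placement
Clock Concurrent Optimization
Routing
Post-Route Optimization
Final Layout
Before Clock Concurrent Optimization: pretend clocks (“ideal clocks”)
Clock Concurrent Optimization: build clocks and optimize logic at the same time
After Clock Concurrent Optimization: real clocks (“propagated clocks”)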
Clock Concurrent Optimization
Clock Concurrent Technology
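Extend physical optimization into the clocks: more degrees of freedom
[Figure: two register-to-register paths with logic delay Gmax and clock period T; on the left the clocks differ by skew, on the right the two clock paths are labeled L and C]
Traditional Physical Optimization
Gmax < T - skew (Gmax variable; T and skew fixed)
Clock Concurrent Optimization
L + Gmax < T + C (L, Gmax and C variable; T fixed)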
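The two constraints can be compared directly with a small sketch. It assumes, as an interpretation rather than something the slide states, that L is the clock delay to the launching register and C the clock delay to the capturing register; the function names and numbers are hypothetical.

```python
def meets_traditional(g_max, period, skew):
    """Traditional constraint: Gmax < T - skew, where only Gmax can be changed."""
    return g_max < period - skew


def meets_clock_concurrent(launch, g_max, period, capture):
    """Clock concurrent constraint: L + Gmax < T + C.

    launch (L) and capture (C) are assumed to be the clock delays to the
    launching and capturing registers; both may be adjusted along with Gmax.
    """
    return launch + g_max < period + capture


period = 1.0
g_max = 1.1  # logic delay slightly longer than the clock period

# With balanced clocks the path fails, and only the logic can be changed.
print(meets_traditional(g_max, period, skew=0.0))

# Delaying the capture clock by 0.2 T relative to the launch clock fixes it.
print(meets_clock_concurrent(launch=3.0, g_max=g_max, period=period, capture=3.2))
```

The time given to this path by delaying the capture clock is taken from whichever path that register launches next, so the extra freedom only helps where the neighbouring path has slack to spare.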
Time Borrowing in Clock Concurrent Opt.
[Figure: registers on a common clock, with slack shown between stages]
Using CC-Opt, slack can flow across register boundaries
Logic Chains Limit Time Borrowing
[Figure: two kinds of logic chain on a clock: a Looping Chain and an IO Chain]
Speed is Not Limited by the Critical Path
The “critical path” does NOT limit the chip speed
CC-Opt can easily move slack along a chain to where it is needed
CC-Opt will optimize “non-critical” paths to create spare slack
[Figure: a chain of registers with the critical path and the slack marked]
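Moving slack along a chain can be illustrated with a three-register chain. This is a minimal sketch assuming zero setup time and ideal cells; the function name and the delays are made up for illustration and do not describe how CC-Opt itself works.

```python
def stage_slacks(period, clock_arrivals, stage_delays):
    """Setup slack of each stage in a register chain.

    clock_arrivals[i] is the clock arrival time at register i;
    stage_delays[i] is the logic delay from register i to register i + 1.
    """
    return [
        period + clock_arrivals[i + 1] - clock_arrivals[i] - delay
        for i, delay in enumerate(stage_delays)
    ]


period = 1.0
delays = [1.2, 0.6]  # first stage is too slow, second has time to spare

balanced = stage_slacks(period, [0.0, 0.0, 0.0], delays)
# Delay the middle register's clock by 0.3 T: the first stage gains 0.3 T
# and the second stage gives up the same amount.
shifted = stage_slacks(period, [0.0, 0.3, 0.0], delays)

print("balanced:", [round(s, 2) for s in balanced])
print("shifted: ", [round(s, 2) for s in shifted])
```

In this model the total slack along the chain is the same before and after the shift. Adjusting clock arrivals redistributes slack but does not create it, which is why optimizing a “non-critical” stage elsewhere on the chain still helps.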
Speed is Limited by the Critical Chain
The “CRITICAL CHAIN” is the focus of CC-Opt
– Critical chain is the chain with the longest delay/stage
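[Figure: two register chains annotated with the logic delay of each stage]
Chain containing the traditional critical path (stage delays 11, 9, 19, 8, 13):
$\frac{\text{Delay}}{\text{Stage}} = \frac{11+9+19+8+13}{5} = 12$
Critical chain (stage delays 15, 16, 11):
$\frac{\text{Delay}}{\text{Stage}} = \frac{15+16+11}{3} = 14$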
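The delay-per-stage comparison can be reproduced in a few lines. The stage delays come from the slide; the function and dictionary names are chosen here for illustration only.

```python
def delay_per_stage(stage_delays):
    """Average logic delay per stage along a chain."""
    return sum(stage_delays) / len(stage_delays)


chains = {
    "chain with traditional critical path": [11, 9, 19, 8, 13],
    "critical chain": [15, 16, 11],
}

# Traditional view: the single slowest stage anywhere in the design.
worst_stage = max(max(delays) for delays in chains.values())

# Clock concurrent view: the chain with the largest delay per stage.
critical = max(chains, key=lambda name: delay_per_stage(chains[name]))

for name, delays in chains.items():
    print(f"{name}: {delay_per_stage(delays):.0f} per stage")
print("slowest single stage:", worst_stage)
print("limiting chain:", critical)
```

The slowest single stage (19) sits on the chain that averages 12 per stage, while the chain that averages 14 contains no stage slower than 16.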
CC-Opt Benefits
Summary of CC-Opt
RTL (e.g. Verilog)
Synthesis
Initial Placement
Clock Concurrent Optimization
Routing
Post-Route Optimization
Final Layout
Ideal Timing before Clock Concurrent Optimization; Propagated Timing after it
Build clocks directly for timing, not skew balancing
– Consider setup and hold timing
– Understand OCV timing
– Understand clock gate timing
– Understand clock mux timing
– Understand clock generator timing
– Understand multi-corner
– Understand multi-mode
Eliminate need to configure any skew groups
– Skew groups are just a work-around for a broken flow!
Key Benefits of CC-Opt.
Up to 20% increase in clock speed
– Fundamentally more degrees of freedom during optimization
All the benefits of useful skew and more!
– Directly targets propagated timing
Accelerated timing closure
– No requirement to configure any skew groups
– Automatically handles clock muxing, clock gating, clock generators,
OCV, multi-corner (setup & hold), and multi-mode
Reduced iterations to the frontend
– No need to manually “retime” logic across register boundaries
Reduced IR-drop
– Clocks are not balanced!
Reduced power
– Clock buffers are only used where it is necessary for timing
Rubix™ - An Implementation
Rubix™ Flow and Key Features
Flow
RTL
Synthesis
Placement
RUBIX™ (in: Verilog netlist, Placed DEF, SDC): Phys. Opt., CTS, Post-CTS Opt. (out: Verilog netlist, Placed DEF)
Routing
PRO
GDSII
Key Features
Full industry standard STA
– SDC constraint format
– Multi-corner and multi-mode
– OCV derates and CPPR
Global routing
– Ability to export “route guides”
Physical Optimization
– Timing-driven incremental placement
– Timing-driven high-fanout net buffering
– Cell sizing and logic transformations
– Legalization
Clocks
– Comprehensive skew group support (can mix and overlap with timing windows based CTS)
Multi-voltage
– Clock buffering and net buffering across voltage islands
Timing driven scan-chain reordering
– Setup and hold aware
Thanks!
For more information see CC-Opt White Paper at
www.azuro.com