### slides - Electrical and Computer Engineering

```A Fully Polynomial Time Approximation Scheme for
Timing Driven Minimum Cost Buffer Insertion
Professor Shiyan Hu, Ph.D.
Department of Electrical and Computer Engineering
Michigan Technological University
Moore’s law
Twice the
number of
transistors,
approximately
every two
years
2
Interconnect Delay Dominates Gate Delay
3
Technology Scaling
130nm





65nm
Global interconnect lengths does not shrink
Local interconnect lengths shrink
Delay ∝ RC
Resistance R = rL/S, where S is reduced
Capacitance C slightly changes
4
Interconnect Delay Scaling
 Scaling factor s=0.7 per generation
 Emore Delay of a wire of length l
tint = (rl)(cl)/2= rcl2/2
(first order)
 Local interconnects
tint : (r/s2)(c)(ls)2/2 = rcl2/2
– Local interconnect delay is roughly unchanged
 Global interconnects
tint : (r/s2)(c)(l)2/2= rcl2
– Global interconnect delay doubles which is unsustainable
 Interconnect delay increasingly more dominant
5
Timing Driven Buffer Insertion
6
Buffers Reduce RC Wire Delay
x
x/2
R
rx/2
cx/4 cx/4
x/2
C
R
rx/2
cx/4 cx/4
C
∆t
∆t = t_buf – t_unbuf = RC + tb –
rcx2/4
x
7
Intuitive Analysis
L
Interconnect Elmore delay = rcL2/2
l=2
l
Interconnect D elay 
l
l
1
L /2
2
1
 rc 2  2 rc
2
L
 rcL
2
Since there are L/2 buffers
(Of course, we need to consider buffer delay)
8
Detailed Analysis
 The delay of a wire of length L is T=rcL2/2
L
r,c – Resistance, cap. per unit length
Rd – On resistance of inverter
Cg – Gate input capacitance
l
 Assume N identical buffers with equal inter-buffer length l (L = Nl). To minimize
delay


T  N R d C g  cl   rl C g  cl / 2 
1


 L  rcl / 2  rC g  R d c    R d C g 
l


dT
dl
 0
 rc
Rd C g 
L
 2 0
2
l opt 

l opt 
2 Rd C g
rc
9
 Substituting lopt back into the interconnect delay expression:
T opt


1
R d C g 
 L  rcl opt  rC g  R d c  
l opt




 L  rc



2 Rd C g
rc
 rC g  R d c  






Rd C g
2 Rd C g
rc


T opt  L 2 R d C g rc  rC g  R d c 
This is why buffer insertion is highly effective and thus
widely used for reducing circuit delay.
10
25% Gates are Buffers
Saxena, et al.
11
ITRS Projections
12
Problem Formulation
1. Steiner
Tree
2. n candidate
buffer
locations
T
Minimal cost (area/power) solution
13
Solution Characterization

To model effect to
downstream, a
candidate solution
is associated with
•
v: a node
•
C: downstream
capacitance
•
Q: required arrival
time
•
W: cumulative
buffer cost
14
Candidate Buffering Solutions
15
Dynamic Programming (DP)
 Start from sinks
 Candidate solutions
are generated
 Three operations
– Insert Buffer
– Merge
Candidate solutions are
propagated toward the source
 Solution Pruning
16
(v2, c2, w2, q2)
x
(v1, c1, w1, q1)
 c2 = c1 + cx
 q2 = q1 - (rcx2/2 + rxc1)
 r: wire resistance per unit length
 c: wire capacitance per unit length
17
Solution Propagation: Insert Buffer
(v1, c1b, w1b, q1b)
(v1, c1, w1, q1)
 q1b = q1 - d(b)
 c1b = C(b)
 w1b = w1 + w(b)
 d(b): buffer delay
18
Solution Propagation: Merge
(v, cl , wl , ql)
(v, cr, wr, qr)
 cmerge = cl + cr
 wmerge = wl + wr
 qmerge = min(ql , qr)
19
Example of Solution Propagation
2
2
(v2, 3, 16, 0)
(v, C, Q, W) • r = 1, c = 1
(v1, 1, 20, 0) • Rb = 1, Cb = 1, tb = 1
• Rd = 1
(v2, 1, 12, 1)
v1
v1
Insert buffer
(v3, 5, 8, 0)
(v3, 3, 8, 1)
v1
slack = 3
v1
slack = 5
20
Solution Propagation
(1)
(2)
(3)
21
Exponential Runtime
16
solutions
8
solutions
4
solutions
2
solutions
n candidate buffer locations lead to 2n solutions
22
Too Many Solutions
 Needs solution pruning for acceleration
 Two candidate solutions
– (v, c1, q1,w1)
– (v, c2, q2,w2)
 Solution 1 is inferior to Solution 2 if
– c1  c2 : larger load
– and q1  q2 : tighter timing
– and w1 w2: larger cost
23
Car Race - Speed
END
Car Speed <=> RAT
24
25
(larger RAT, smaller
capacitance):
Good
END
(smaller RAT, larger
capacitance):
Inferior
26
Faster & Larger Load: Result 1
END
27
Faster & Larger Load: Result 2
END
Who will be the winner?
Cannot tell at this moment,
so keep both of them.
28
Pruning
(Q1,C1,W1)
inferior/dominated
if C1  C2,W1  W2
and Q1  Q2
Non-dominated solutions are
maintained: for the same Q and W,
pick min C

(Q2,C2,W2depends
)
# of solutions
on # of
distinct W and Q, but not their values

29
Generating Candidates
(1)
(2)
(3)
30
Pruning Candidates
(3)
(a)
(b)
Both (a) and (b) look the same to the source.
Remove the one with the worse slack and cost
(4)
31
Candidate Example Continued
(4)
(5)
32
Candidate Example Continued
After pruning
(5)
At driver, compute the candidate solution satisfying the timing target with
minimum cost. The result is optimal.
33
Branch Merge
Left
Candidates
Right
Candidates
34
Pruning During Branch Merge
(n1n2) solutions With
after
pruning
each branch merge.
Worst-case ((n/m)m)
solutions.
35
Selected Milestone Works on Timing Buffering
Is it possible to design a provably good algorithm running in polynomial
time with theoretical guarantee on the error to the optimal solution?
1990 1991 ……. 1996 ……. 2003
2004 ……. 2008 2009
This is a major open problem for a decade!
36
Bridging The Gap
We are
bridging
the gap!
A Fully Polynomial Time
Approximation Scheme
(FPTAS)
 Provably good
 Computes a solution
with cost at most (1+ɛ)
of the optimal cost for
any ɛ>0
 Runs in time
polynomial in n
(nodes), b (buffer
types) and 1/ɛ
 Best solution for an
NP-hard problem in
theory
 Highly practical
37
The Rough Picture
W*: the cost of optimal solution
Make guess on W*
Not
Good
Check it
Good (close to W*)
Return the solution
Key 1: Efficient checking
Key 2: Smart guess
38
Key 1: Efficient Checking
Benefit of guess
 Only maintain
the solutions
with cost no
greater than the
guessed cost
 This is the first
reason for
acceleratation
39
The Oracle

Oracle (x): the checker, able to decide whether x>W* or not
– Without knowing W*
40
Construction of Oracle(x)
Dynamic
Programming
Only interested in
whether there is a
solution with cost
up to x satisfying
timing constraint
Scale and round
each buffer cost

=
/
Perform DP to
scaled problem
with cost upper
bound n/ɛ. Time
polynomial in n/ɛ
41
Scaling and Rounding
0
xɛ/n
2xɛ/n
Buffer
cost
3xɛ/n
4xɛ/n
42
Scaling and Rounding
Rounding error at each buffer
#
distinct
buffer costs
xɛ/n,
total rounding
error isxɛ.
at
mostxɛ/n:
O(n/ε)
since
• Larger
larger
error,only
fewer distinct
costs
faster
solutions
with
W and
bounded
• Smaller
xɛ/n:
smaller error,
by
n/ɛ are
propagated.
more distinct costs and slower
• Rounding is the second
reason for acceleration
0
1
2
3
4
Buffer
cost
43
Oracle Construction
Run dynamic programming with cost  n/ɛ
Yes, there is a solution
satisfying timing constraint
No, no such solution
With cost rounded
and scaled back, the
solution has cost at
most n/ɛ • xɛ/n + xɛ=
(1+ɛ)x > W*
With cost rounded
and scaled back,
the solution has
cost at least n/ɛ •
xɛ/n = x  W*
44
Rounding on Q
 # solutions bounded by # distinct W and Q
 # W = O(n/ɛ1), ɛ1 is used for W
– Rounding before DP
 #Q
– Round up Q to nearest value in {0, ɛ2T/m , 2ɛ2T/m,
3ɛ2T/m,…,T }, in branch merge (m is # sinks)
– Rounding during DP
– # Q = O(m/ɛ2), ɛ2 is used for Q
– Rounding error bounded by ɛ2T/m per branch merge, by
ɛ2T for the whole tree
 # non-dominated solutions is O(mn/ɛ1ɛ2)
0
ɛ2T/m
2ɛ2T/m
3ɛ2T/m
4ɛ2T/m
45
Q-W Rounding Before Branch Merge
Q
T
4ɛ2T/m
3ɛ2T/m
2ɛ2T/m
ɛ2T/m
0
1
2
3
4
n/ɛ1
W
46
Buffer Insertion Runtime
O(
O(
mn
 1 2
n
1
) solutions
after a branch merge
) W - bins
A buffer insertion
At most O (
O(
mnb
 1 2
mn
 1 2
2

n b
1
introduces
O(
nb
1
) non - dominated
solutions
with b buffer typ
es
2

n b
1
) non - dominated
solutions
in a single branch
2
) time for each node. No cross W - bin pruning.
47
Branch Merge Runtime - 1
When merging Wl=2 with
Wr=1, previously we need to
combinations, now only linear
# of combinations.
Target Q=0
48
Branch Merge Runtime - 2
Target Q= ɛ2T/m
49
Branch Merge Runtime - 3
Target Q= 2ɛ2T/m
50
Branch Merge Runtime - 4
For merged W  a, try all W l  W r  a, where each takes
For a  0,1,...,
Including
O(
mn
 1 2
n
1
, total runtime
time for putting
) solutions
is O (
solutions
mn
am
2
) time
2
 2
2
1
O(
)
into bins, it is O (
mn
 1 2
2

n b
1

mn
2
 2
2
1
)
after a branch merge
51
Timing-Cost Approximate DP
 Lemma: a buffering solution with cost at most (1+ɛ1)W*
and with timing at most (1+ɛ2)T can be computed in time
2
O(
m n
 1 2
2

mn b
1
2

m n
2
1  2
2
2

mn b
 1 2
3

n b
1
2
)
52
Key 2: Geometric Sequence Based Guess
 U (L): upper (lower) bound on W*
 Naive binary search style approach
Set U and L on W*
x=(U+L)/2
Oracle (x)
W*<(1+ɛ)x
U= (1+ɛ)x
W*  x
L= x
 Runtime (# iterations) depends on the initial bounds U and L
53




Rounding factor xɛ1/n for W
Larger ɛ1: faster with rough estimation
Smaller ɛ1: slower with accurate estimation
Adapt ɛ1 according to U and L
54
U/L Related Scale and Round
Buffer
cost
U/L 0
xɛ/n
xɛ/n
55
Conceptually
 Begin with large ɛ1 and progressively reduce it
(towards ɛ) according to U/L as x approaches W*
 Fix ɛ2=ɛ in rounding T for limiting timing violation
• Set ɛ1 as a geometric sequence of …, 8, 4, 2, 1,
1/2, …, ɛ
• Suppose that one run of DP takes O(n/ɛ1) time.
Total runtime is bounded by the last run as O(…
+ n/8 + n/4 + n/2 + … + n/ɛ) = O(n/ɛ).
56
Oracle Query Till U/L<2
W u ,i  W l ,i
*
i 
'
W
*
W l ,i 1
W
 *
W
 l ,i
2
m n
2
2
O(
m n
2
m n
 1 2




*
u ,i
1

1 i  t
'
i
'
W
 *
W
 u ,i
*
l ,i
*
)  O(
2




2

1 i  t
1 / 2 ( 4 / 3 )
2

m n
m n
1  2
2
W
*
l ,i
W
*
u ,i

W
*
l ,i
*
W u ,i
2
)  O(
2
2

4/3
m n
2
W
 *
W
 u ,t




*
l ,t
W
  W *
1 i  t 
u ,i
*
l ,i
( 4 / 3)




t i
1 / 2 ( 4 / 3 )
t i
)
j
)  O(
2




*
l ,i
W u ,i
2
mn b
1
W

W
  W *
0 jt 
u ,i

*
1 i
3/4
*
l ,i
2
O(
 1, x 
*
W l ,i
*
u ,i 1
O(
*
W u ,i
mn b
 1 2
m n
2
3

n b
1
 0 . 59
0 jt
1 / 2 ( 4 / 3 )
j
)  O (
2
m n
2
)
2
)
57
Mathematically
58
When U/L<2
Scale and round each cost by Lɛ/n

W=2n/ɛ

Run DP

At least one feasible
solution, otherwise
no solution with
cost 2n/ɛ • Lɛ/n =
2L  U
Lɛ/n rounding error
per buffer and Lɛ in
a solution
A single DP runtime
Pick min cost solution satisfying
timing at driver
59
The Algorithmic Flow
Set U and L of W*
Update U or L
Set x=[UL/(1+ ɛ1)]1/2
Oracle (x)
U/L<2
Compute final solution
60
Main Theorem
 Theorem: a (1+ ɛ) approximation to the timing
constrained minimum cost buffering problem can
be computed in O(m2n2b/ɛ3+ n3b2/ɛ) time for 0<ɛ<1
and in O(m2n2b/ɛ+mn2b+n3b) time for ɛ  1
61
Experiments
 Experimental Setup
– 1000 industrial nets
– 48 industrial buffer types including non-inverting
buffers and inverting buffers
 Compared to Dynamic Programming
which is the state of the art technique and
is widely used in industry
62
Cost Ratio Compared to DP
FPTAS
FPTAS
Buffer Cost Ratio
0.14
0.12
0.1
0.08
0.06
0.04
0.02
0
0.01 0.05 0.1
0.2
0.3
0.4
0.5
Approximation
63
Speedup Compared to DP
FPTAS
FPTAS
6
Speedup
5
4
3
2
1
0
0.01 0.05 0.1
0.2
0.3
0.4
0.5
Approximation
64
Observations





FPTAS always achieves the theoretical guarantee
Larger ɛ leads to more speedup
On average about 5x faster than dynamic programming
Can run 4.6x faster with 0.57% solution degradation
<5% nets with timing violations which can be fixed by a simple
timing recovery procedure
65
Our Bridge
NP-Hardness
Complexity
Exponential
Time Algorithm
66
Conclusion
 Propose a (1+ ɛ) approximation for timing constrained
minimum cost buffering for any ɛ > 0 (DAC’09)
– Runs in O(m2n2b/ɛ3+ n3b2/ɛ) time
– Timing-cost approximate dynamic programming
– Double-ɛ geometric sequence based oracle search
– 5x speedup in experiments
– Few percent additional buffers as guaranteed
theoretically
 The first provably good approximation algorithm on this
problem which is a major open problem in the field
67
Thanks
```