### 17: Adders CMOS VLSI Design CMOS VLSI Design 4th Ed.

```Lecture 17:
Outline









Datapath
Computer Arithmetic Principles
CMOS VLSI Design 4th Ed.
2
A Generic Digital Processor
CMOS VLSI Design 4th Ed.
3
Building Blocks for Digital Architectures
 Arithmetic unit
– Bit sliced data path – adder, multiplier, shifter,
comparator, etc.
 Memory
– RAM, ROM, buffers, shift registers
 Control
– Finite state machine (PLA, random logic)
– Counters
 Interconnect
– Switches, arbiters, bus
CMOS VLSI Design 4th Ed.
4
An Intel Microprocessor
CMOS VLSI Design 4th Ed.
5
Bit-Sliced Design
CMOS VLSI Design 4th Ed.
6
Bit-Sliced Datapath
CMOS VLSI Design 4th Ed.
7
Itanium Integer Datapath
CMOS VLSI Design 4th Ed.
8
Motivation
 Arithmetic units are, among others, core of every data path and
 Data path is at the core of
– microprocessors (CPU)
– signal processors (DSP)
– data processing application specific IC’s (ASIC) and
programmable IC’s (FPGA)
 Standard arithmetic units available from libraries
 Design of arithmetic units necessary for
– non-standard operations
– high performance components
– library development
CMOS VLSI Design 4th Ed.
9
Naming Conventions
 Signal busses: A (1-D), Ai, (2-D), ai:k (sub-bus, 1-D)
 Signals: a, ai (1-D), ai,k (2-D), Ai:k (group signal)
 Circuit complexity measures: A (Area), T (cycle time,
delay), AT (area-time product), L (latency, number of
cycles).
 Arithmetic operators: +, -, •, /, log (=log2)
 Logic operators: OR, AND, XOR, NOT, …
CMOS VLSI Design 4th Ed.
10
Circuit Complexity Measures
 Unit gate model
– Inverter, buffer: A = 0, T = 0
– Simple monotonic 2-input gates (AND, OR,
NAND, NOR): A = 1, T = 1
– Simple non-monotonic 2-input gates (XOR,
XNOR): A = 2, T = 2
– Simple m-input gates: A = m – 1, T = log m
– Wiring not considered
– Only for estimation purposes

CMOS VLSI Design 4th Ed.
11
Recursive Function Evaluation
 Given: inputs ai, outputs zi, function f (graph sym. •)
 Non-recursive functions (n.)
– Output zi is a function of input ai
zi  f ai, x; i  0,...,n 1
– Parallel structure

A  On, T  O1

CMOS VLSI Design 4th Ed.
12
Recursive Function Evaluation
 Recursive functions (r.)
– Output zi is a function of all inputs ak, k ≤ i
• with a single output z = zn-1 (r.s.):
ti  f ai ,t i1; i  0,...,n 1
t1  0/1, z  t n 1
– f is non-associative (r.s.n)
» serial structure

A  On, T  On
– f is associative (r.s.a)
» serial or single-tree structure

A  On, T  Ologn
CMOS VLSI Design 4th Ed.
13
Recursive Function Evaluation
– Output zi is a function of all inputs ak, k ≤ i
• multiple outputs zi (r.m.) (=> prefix problem)
zi  f ai,zi1; i  0,...,n 1, z1  0/1
– f is non-associative (r.m.n)
» serial structure
A  On, T  On


– f is associative (r.m.a)
» Serial or multi-tree structure
A  On 2 , T  Ologn
» Shared tree structure

A  Onlogn, T  Ologn
CMOS VLSI Design 4th Ed.
14
Arithmetic Operations
 Overview
CMOS VLSI Design 4th Ed.
15
Overview of Arithmetic Operations
 Direct implementation of dedicated units
– always: 1 – 5
– in most cases: 6
– sometimes: 7, 8
 Sequential implementation using simpler units and several
clock cycles (decomposition)
– sometimes: 6
– in most cases: 7, 8, 9
 Table look-up techniques using ROMs
– universal: simple application to all operations
– efficient only for single-operand operations of high
complexity (8 - 12) and small word length.
CMOS VLSI Design 4th Ed.
16
Overview of Arithmetic Operations
 Approximation using simpler units: 7 – 12
– Taylor series expansion
– polynomial and rational approximations
– convergence of recursive equation systems
– CORDIC (COordinate Rotation DIgital Computer)
CMOS VLSI Design 4th Ed.
17
Binary Number Systems
 Radix-2, binary number system (BNS): irredundant,
weighted, positional, monotonic.
 n-bit number is an ordered sequence of bits (binary
A  an1,an2,...,a0 2, ai 0,1
digits)
 Simple and efficient implementation in digital circuits
 MSB/LSB
(most/least significant bit): an-1/a0

 Represents an integer or fixed point number, exact.
 Fixed point numbers:
am1,...,a0 . a1,...,amn 
m-bit integer

n-m bit fraction
CMOS VLSI Design 4th Ed.
18
Binary Number Systems
 Unsigned: positive or natural numbers
n 1
– Value:
A  an 12n 1  ... a1 2  a0   ai 2i
i0
– Range: 0, 2n 1
Two’s (2’s) complement: standard representation of
 or integer numbers
signed
n 2
– Value
n 1
i
A  an 1 2
– Range
2
  ai 2
i0
, 2n 1 1
n 1
CMOS VLSI Design 4th Ed.
19
Binary Number Systems
– Complement: A  2n  A  A 1, where A  an 1,an 2,...,a0 
– Sign: an-1
– Properties:
 asymmetric range, compatible with
unsigned numbers in many arithmetic operations.
(same treatment of positive and negative
numbers)
 One’s (1’s) complement: similar to 2’s complement
n 2
– Value: A  a 2n 1 1  a 2i
n 1
– Range:



 
i
i0

2 n 1 1, 2 n 1 1
CMOS VLSI Design 4th Ed.
20
Binary Number Systems
n
– Complement: A  2  A 1  A
– Sign: an-1
– Properties: double representation of zero,
symmetric
range, modulo (2n-1) number system.

 Sign-magnitude: alternative representation of signed
numbers
n 2
a
– Value: A  1   ai 2i
n1
i0


n 1
n 1

2
1
,
2
1


– Range:
– 
Complement: A  a ,a ,...,a 
n1 n2
0
CMOS VLSI Design 4th Ed.
21
Binary Number Systems
 Sign: an-1
 Properties: double representation of zero, symmetric
range, different treatment of positive and negative
numbers in arithmetic operations, no MSB toggles at
sign changes around 0 (=> low power)
CMOS VLSI Design 4th Ed.
22
Gray Numbers
 Gray numbers (code): binary, irredundant, nonweighted, non-monotonic.
– Property: unit-distance coding. Exactly one-bit
– Applications: counters with low output toggle rate
(low power busses), representation of continuous
signals for low-error sampling (no false numbers
due to switching of different bits at different
times).
– Non-monotonic numbers: difficult arithmetic
CMOS VLSI Design 4th Ed.
23
Gray Numbers
– Binary - Gray conversion
gi  bi1  bi , bn  0;
i  0,...,n 1 (n.)
– Gray – binary conversion

bi  bi1  gi , bn  0
i  n 1,...,0 (r.m.a)

CMOS VLSI Design 4th Ed.
24
Redundant Number Systems
 Non-binary, redundant, weighted number systems.
representations of the same number => redundancy.
 No carry propagation in adders => more efficient
implementation of adder-based units (multipliers, dividers, etc.)
 Redundancy => no direct implementation of relational
operators => conversion to irredundant numbers.
 Several bits used to represent one digit => higher storage
requirements.
 Expensive conversion to irredundant numbers. Not necessary if
redundant input operators are allowed.
CMOS VLSI Design 4th Ed.
25
Delayed-Carry Representation
 Delayed-carry or half adder representation
ri  0,1,2 , c i ,si ,ai ,bi  0,1 ,
ri  c i1,si   2c i1  si  ai  bi , c i1si  0
n 1
R   ri 2 i  C,S   C  S  A  B
i0
 1 digit holds the sum of 2 bits (no carry out)
Example: 01 + 01 = (0,0) (1,0) = 2
CMOS VLSI Design 4th Ed.
26
Carry-Save Representation
ri  0,1,2,3 , c i ,si ,ai ,bi ,di  0,1 ,
ri  c i1,si   2c i1  si  ai  bi  di  ai  ri
n 1
R   ri 2 i  C,S   C  S  A  R
i0
 One digit holds the sum of 3 bits or 1 digit and 1 bit.
No carry-out digit, carry is saved.

 Standard redundant number system for fast addition.
CMOS VLSI Design 4th Ed.
27
Signed-Digit Representation
 Signed-digit (SD) or redundant digit (RD) number
representation.
n 1
ri,si ,ti  1,0,1  1,0,1 , R   ri 2i
i0
 No carry propagation in S = R + T

ri  t i  c i1,ui   2c i1  ui , c i1,ui  1,0,1
c i1,ui  is redundant (e.g.,0 1  01  1 1
i c i ,ui  s.t. si  1,0,1
 One digit holds the sum of two digits. No carry-out.

CMOS VLSI Design 4th Ed.
28
Signed-Digit Representation
 Minimal SD representation: minimal number of nonzero digits.
0111
1010001 0
– Applications: sequential multiplication (less
cycles), filters with constant coefficients (less
 hardware).
– Example:
7  0111, 11 11 , 101 1 , 1001 , 11 111, 
minimal

CMOS VLSI Design 4th Ed.
29
Signed-Digit Representation
 Canonical SD representation: minimal SD. Not two
non-zero digits in sequence.
01
1 101001 0
 SD -> binary: carry propagation necessary => adder.
 Other applications: high speed multipliers.
 to carry-save, simple use for signed
 Similar
numbers.
CMOS VLSI Design 4th Ed.
30
Residue Number Systems
 Non-binary, irredundant, non-weighted number
system.
 Carry-free and fast additions and multiplications.
 Complex and slow other arithmetic operations (e.g.
comparison, sign, and overflow detection) because
digits are not weighted. Conversion to weighted
 Codes for error correction and detection.
 Possible applications (but hardly used)
– Digital filters
– Error detection and correction
CMOS VLSI Design 4th Ed.
31
Residue Number Systems
 Base is n-tuple of integers (mn-1, mn-2, …, m0),
residues (or moduli). These mi are pairwise prime.
A  an 1,an 2,,a0 m
n1 ,m ns
,,m 0
ai  0,1,,mi 1
n 1
Range : M   mi , anywhere inZ
i0
ai  Amodmi  A m i , A  mi  qi  ai
 Arithmetic operations: each digit computed
separately.

CMOS VLSI Design 4th Ed.
32
Residue Number Systems
 
z i  Z m i  f A m  f A m i
i
A  B mi  A mi  B mi
A B m i  A m i  B m i
ai m  m i  ai m
i
ai1 m  aim i 2 m
i
mi
mi
 f ai 
mi
 ai  bi m
 ai  bi m
mi
i
i
i
(Fermats' T heorem)
i
 Best moduli mi are 2k and 2k – 1.
– High storage efficiency with k bits.

CMOS VLSI Design 4th Ed.
33
Residue Number Systems
 Example:
m1,m0   3,2, M  6
5 6  A  a1,a0   5 3 , 5 2   2,1
4  5 6  1,0  2,1  1 2 3 , 0 1 2   0,1  3 6
4 5 6  1,0 2,1  1 2 3 , 0 1 2   2,0  2 6

CMOS VLSI Design 4th Ed.
34
Floating-Point Numbers
 Larger range, smaller precision than fixed-point
representation, inexact, real numbers.
 Double-number form => discontinuous precision.
 S | biased exponent E | unsigned norm mantissa M
F  1  M E  1  M 2E bias
S
S
 Basic arithmetic operations
 A B  1SA SB M A  M B   E A E B



A  B  1 A  M A  1 B  M B  E A  E B    E A

S
S
CMOS VLSI Design 4th Ed.
35
Floating-Point Numbers
 Basic arithmetic operations based in fixed point add,
multiply, and shift operations. Post-normalization
required.
 Applications:
– Processors: real floating point formats (e.g. IEEE
standard), large range due to universal use.
– ASICs: usually simplified floating-point formats
with small exponents, smaller range. Used for
range extension of normal fixed-point numbers.
 IEEE floating point format:
CMOS VLSI Design 4th Ed.
36
Logarithmic Number System
 Alternative representation to floating point (mantissa
+ integer exponent -> only fixed point exponent).
 Single number form => continuous precision =>
higher accuracy, more reliable.
S
|
biased fixed point exponentE
L  1  E  1  2E bias
S
S
 Basic arithmetic operations:
– (A
 < B) = (EA < EB) additionally consider sign
– A + B by approximation or addition in
conventional number system and double
conversion.
CMOS VLSI Design 4th Ed.
37
Logarithmic Number System
 Basic arithmetic operations
A B  1 A
S SB
A  1  
y
SA
 EA EB
yE A
,
y
A  1  
SA
EA
y
– Simpler multiplication, exponentiation. More

– Expensive conversion: (anti)logarithms probably
by table look-up.
– Applications: real-time digital filters.
CMOS VLSI Design 4th Ed.
38
Antitetrational Number System
 Tetration (t.x = 2{ 2
2...
2
and antitetration (a.t.x)
x times
 Larger range, but smaller precision than logarithmic
representation. Otherwise, analogous.
 Note that
 all these systems can be mixed in
composite arithmetic.
 Choice of number representation should be hidden
from the user. The compiler should handle it.
 Rational numbers can also be represented in
floating slash notation.
CMOS VLSI Design 4th Ed.
39
Round-Off Schemes
 Intermediate results with d additional lower bits. This
results in higher accuracy. A  an1,,a0,a1,,ad 
 Rounding: keeping error e small during final word
length reduction: R  rn1,,r0   A  e
cost.
 Truncation
 RTRUNC  an1,,a0 
– bias  1  1
= average error e
d 1
2 2
 to nearest (normal rounding)
 Round

RROUND  an 1,, a0  , A A 
1
 A  0.12
2
CMOS VLSI Design 4th Ed.
40
Round-Off Schemes
 Round to nearest
1
– The error is bias d 1 nearly symmetric
2
– + 0.12 can often be included in a previous operation.
 Round to nearest even/odd

RROUND if a1,, ad   00
 RROUND EVEN  

an 1,, a1,0 otherwise
– bias = 0 (symmetric)
– Mandatory in IEEE floating-point standard
 3guard bits for rounding after floating point operations: guard
bit G (postnormalization), round bit R (round to nearest ), sticky
bit S (round to nearest even)
CMOS VLSI Design 4th Ed.
41
CMOS VLSI Design 4th Ed.
42
A
S  A B
B
S  A B C
Cout
Cout  A B
S
A
B
Cout
Cout  MAJ ( A, B, C )
C
S
A
B
Cout
S
A
B
C
Cout S
0
0
0
0
0
0
0
0
0
0
1
0
1
0
0
1
0
1
1
0
0
1
0
1
0
0
1
1
1
1
0
0
1
1
1
0
1
0
0
0
1
1
0
1
1
0
1
1
0
1
0
1
1
1
1
1
CMOS VLSI Design 4th Ed.
43
 Add up m bits of same magnitude
 Output the sum as a k-bit number (k  log m 1)
 Or count 1’s at inputs => (m,k) counter –
combinational counter.

 A half adder is a (2,2) counter
cout ,s  2cout  s  a  b

s  a  b (sum)
c out  ab (carry- out)
A  3 , T  21


CMOS VLSI Design 4th Ed.
44
CMOS VLSI Design 4th Ed.
45
 A full-adder is a (3,2) counter.
cout ,s  2cout  s  a  b  cin
g  ab
A  7 , T  42
(generat e) c 0  ab
p  a  b (propagate) c1  a  b

s  a  b  c in  p  c in

c out  ab acin  bcin  ab a  bc in
 g  pcin  pg  pcin  pa  pcin
 c inc 0  c inc1

CMOS VLSI Design 4th Ed.
46
PGK
 For a full adder, define what happens to carries
(in terms of A and B)
– Generate: Cout = 1 independent of C
• G=A•B
– Propagate: Cout = C
• P=AB
– Kill: Cout = 0 independent of C
• K = ~A • ~B
CMOS VLSI Design 4th Ed.
47
 Brute force implementation from eqns
S  A B C
Cout  MAJ ( A, B, C )
A
A
A
S
C
A
B
B
C
C
MAJ
C
B
C
B
B
B
C
A
C
A
A
A
C
B
S
Cout
B
A
B
A
B
C
A
B
C
B
B
Cout
B
A
CMOS VLSI Design 4th Ed.
48
 Factor S in terms of Cout
S = ABC + (A + B + C)(~Cout)
 Critical path is usually C to Cout in ripple adder
MINORITY
A
B
C
Cout
S
S
Cout
CMOS VLSI Design 4th Ed.
49
 Same circuit with sized transistors
CMOS VLSI Design 4th Ed.
50
Layout
 Clever layout circumvents usual line of diffusion
– Use wide transistors on critical path
– Eliminate output inverters
CMOS VLSI Design 4th Ed.
51
 Complementary Pass Transistor Logic (CPL)
– Slightly faster, but more area
B
B
B
C
B
C
B
C
B
C
A
S
B
C
B
C
B
C
B
C
Cout
A
A
S
A
Cout
B
B
CMOS VLSI Design 4th Ed.
52
 Transmission gates
CMOS VLSI Design 4th Ed.
53
 Dual-rail domino
– Very fast, but large and power hungry
– Used in very fast multipliers

C_h
A_h
B_h
A_h
C_l
B_h
A_l
C_h
B_h
A_h
C_l
B_l
Cout _l
A_l
B_l
B_l

S_l

Cout _h
S_h
C_h
B_h
A_l
CMOS VLSI Design 4th Ed.
54
(m,k) Counters
k 1
m 1
j 0
i0
sk 1,,s0    s j 2 j   ai
  Associativity of addition allows conversion from
linear to tree structure => faster at the same number
of FAs.
log m
A  7 m2 k   7m  logm
k 1
TLIN  4m  2logm , TTREE  4 log3 m   2logm

CMOS VLSI Design 4th Ed.
55
(7,3) Counter
 Example
A  28 , T 14

A  28 , T 10

CMOS VLSI Design 4th Ed.
56
 Add two n-bit operands A and B and an optional
carry in cin by performing carry propagation.
 Sum (cout, S) is an irredundant (n+1) bit number
cout ,S  cout 2n  S  A  B  cin
2c i1  si  ai  bi  c i
i  0,1,,n 1
c 0  c in , c out
 c n (r.m.a)

CMOS VLSI Design 4th Ed.
57
– Each sum bit depends on all previous carries
– How do we compute all these carries quickly?
AN...1 BN...1
Cout
Cout
+
SN...1
Cin
Cin
00000
1111
+0000
1111
CMOS VLSI Design 4th Ed.
Cout
11111
1111
+0000
0000
Cin
carries
A4...1
B4...1
S4...1
58
 Serial arrangement of n full adders.
 Simplest, smallest, and slowest CPA structure.
A  7n , T  2n , AT  14 n 2

CMOS VLSI Design 4th Ed.
59
– Critical path goes from Cin to Cout
– Design full adder to have fast carry delay
A4
B4
Cout
B3
C3
S4
A3
A2
B2
C2
S3
A1
B1
Cin
C1
S2
CMOS VLSI Design 4th Ed.
S1
60
 Note that worst case delay is linear with number of
bits.
tadder  N 1tcarry  tsum
 Goal: Make the fastest possible carry path circuit.

CMOS VLSI Design 4th Ed.
61
CMOS VLSI Design 4th Ed.
62
Inversion Property
CMOS VLSI Design 4th Ed.
63
Inversions
 Critical path passes through majority gate
– Built from minority + inverter
– Eliminate inverter and use inverting full adder
A4
B4
Cout
B3
C3
S4
A3
A2
B2
C2
S3
A1
B1
Cin
C1
S2
S1
CMOS VLSI Design 4th Ed.
64
CMOS VLSI Design 4th Ed.
65
CMOS VLSI Design 4th Ed.
66
 The NMOS and PMOS chains are completely
symmetrical. A maximum of two series transistors
can be observed in the carry generation circuit.
 When laying out the cell, the most critical issue is
the minimization of the capacitance at node Co. The
reduction of the diffusion capacitances is particularly
important.
 The capacitance at node Co is composed of four
diffusion capacitances, two internal gate
capacitances, and six gate capacitances in the
CMOS VLSI Design 4th Ed.
67
 The transistors connected to Ci are placed closest to
the input.
 Only the transistors in the carry stage have to be
optimized for optimal speed. All transistors in the
sum stage can be minimal size.
CMOS VLSI Design 4th Ed.
68
Transmission Gate FA
CMOS VLSI Design 4th Ed.
69
Carry Propagation Speed-up
 Concatenation of partial CPA’s with fast cin -> cout.
 Fast carry look-ahead logic for entire range of bits.
CMOS VLSI Design 4th Ed.
70
Generate / Propagate
 Equations often factored into G and P
 Generate and propagate for groups spanning i:j
Gi: j  Gi:k  Pi:k
Pi: j  Pi:k
Gk 1: j
Pk 1: j
 Base case
Gi:i  Gi  Ai
Bi
Pi:i  Pi  Ai  Bi
0 GCP
0:00:0 in
G0:0  G0  Cin
P0:0  P0  0
 Sum:
Si  Pi  Gi 1:0
CMOS VLSI Design 4th Ed.
71
PG Logic
A4
B4
A3
B3
A2
B2
A1
B1
Cin
1: Bitwise PG logic
G4
P4
G3
P3
G2
P2
G1
P1
G0
P0
2: Group PG logic
G3:0
G2:0
G1:0
G0:0
C3
C2
C1
C0
3: Sum logic
C4
Cout
S4
S3
S2
S1
CMOS VLSI Design 4th Ed.
72
PG Logic
CMOS VLSI Design 4th Ed.
73
Carry-Ripple Revisited
Gi:0  Gi  Pi
Gi 1:0
A4
B4
G4
P4
A3
B3
G3
P3
A2
B2
G2
P2
A1
B1
G1
P1
Cin
G0
G3:0
G2:0
G1:0
G0:0
C3
C2
C1
C0
P0
C4
Cout
S4
S3
S2
S1
CMOS VLSI Design 4th Ed.
74
Carry-Ripple PG Diagram
Bit Position
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
tripple  t pg  ( N  1)tAO  txor
Delay
15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0
CMOS VLSI Design 4th Ed.
75
PG Diagram Notation
Black cell
i:k
Gray cell
k-1:j
i:k
i:j
Gi:k
Pi:k
Gk-1:j
Pk-1:j
Buffer
k-1:j
i:j
i:j
i:j
Gi:j
Gi:k
Pi:k
Gk-1:j
Pi:j
CMOS VLSI Design 4th Ed.
Gi:j
Gi:j
Gi:j
Pi:j
Pi:j
76
Manchester Carry Chain
CMOS VLSI Design 4th Ed.
77
Manchester Carry Chain
CMOS VLSI Design 4th Ed.
78
Manchester Carry Chain
CMOS VLSI Design 4th Ed.
79
 Carry-ripple is slow through all N stages
 Carry-skip allows carry to skip over groups of n bits
– Decision based on n-bit propagate signal
Cout
A16:13 B16:13
A12:9 B12:9
A8:5 B8:5
A4:1
P16:13
P12:9
P8:5
P4:1
1
0
C12
+
S16:13
1
0
C8
+
S12:9
1
0
C4
+
S8:5
CMOS VLSI Design 4th Ed.
B4:1
1
0
+
Cin
S4:1
80
CMOS VLSI Design 4th Ed.
81
CMOS VLSI Design 4th Ed.
82
CMOS VLSI Design 4th Ed.
83
Carry-Skip PG Diagram
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
16:0 15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0
For k n-bit groups (N = nk)
tskip  t pg   2  n  1  ( k  1)  t AO  txor
CMOS VLSI Design 4th Ed.
84
Variable Group Size
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
16:0 15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0
Delay grows as O(sqrt(N))
CMOS VLSI Design 4th Ed.
85
 Partial CPA with fast ck -> ci
c i  Pi1:kci  Pi1:kc k (bit groupai1,,ak )
Pi1:k  pi1 pi2 pk (group propagate)
 If Pi-1:k = 0 : ck does not become c’i and c’i is
selected, becoming ci.

 If Pi-1:k = 0 : ck becomes c’i, but c’i is skipped.
 Path ck -> c’i -> ci never sensitized => fast ck -> ci
 False path => inherent logic redundancy =>
problems in circuit optimization, timing analysis, and
testing.
CMOS VLSI Design 4th Ed.
86
 Variable group sizes are faster.
– Use larger groups in the middle
– Minimize delays a0 -> ck -> si-1 and ak -> ci -> sn-1
 Partial CPA type is RCA or CSKA (multilevel CSKA)
 Medium speed-up at small hardware overhead (+
AND/bit +MUX/group)
A  8n , T  4n
1
2
, AT  32n
3
2

CMOS VLSI Design 4th Ed.
87
CSKA + Manchester
CMOS VLSI Design 4th Ed.
88
 Trick for critical paths dependent on late input X
– Precompute two possible outputs for X = 0, 1
– Select proper output when X arrives
 Carry-select adder precomputes n-bit sums
– For both possible carries into n-bit group
A16:13 B16:13
0
+
Cout
1
+
B8:5
C8
1
+
CMOS VLSI Design 4th Ed.
B4:1
C4
+
Cin
0
S12:9
A4:1
0
+
1
0
1
0
1
S16:13
A8:5
0
+
C12
1
+
A12:9 B12:9
S8:5
S4:1
89
 Partial CPA with fast ck -> ci and ck -> si-1:k
0
si1:k  c k si1:k
 c k s1i1:k
c i  c kc i0  c kc1i
 Two CPA’s compute two possible results (cin = 0/1),
group carry-in ck selects correct one afterwards.
 Variable
 group sizes are faster; use larger groups at
end (MSB). Balance delays a0 -> ck and ak -> ci0
 Partial CPA type is RCA, CSLA (multilevel CSLA) or
CLA.
CMOS VLSI Design 4th Ed.
90
 High speed-up at high hardware overhead.
– + MUX/bit + (CPA + MUX)/group
A 14n , T  2.8n
1
2
, AT  39n
3
2

CMOS VLSI Design 4th Ed.
91
CMOS VLSI Design 4th Ed.
92
CMOS VLSI Design 4th Ed.
93
Linear Carry-Select
CMOS VLSI Design 4th Ed.
94
Square-Root Carry-Select
CMOS VLSI Design 4th Ed.
95
Delay Comparison
CMOS VLSI Design 4th Ed.
96
 Partial CPA with fast ck -> ci and ck -> si-1:k
si1:k  si1:k  c k , c i  ci  Pi1:kc k
Pi1:k  pi1 pi2 pk (group propagate)
 Result is incremented after addition if ck = 1
 Variable group sizes are faster, use larger groups at
end (MSB). Balance delays a0 -> ck and ak -> c’i
 Partial CPA could be RCA, CIA (multilevel CIA) or
CLA.
 High speed-up at medium hardware overhead
(+AND/bit + (incrementer + AND/OR)/group).
 Logic of CPA and incrementer could be merged.
CMOS VLSI Design 4th Ed.
97

A 10n , T  2.8n
2
1
, AT  28n
CMOS VLSI Design 4th Ed.
3
2
98
 Example: gate-level schematic of carry-increment
– Only two different logic cells (bit-slices): IHA and
IFA
CMOS VLSI Design 4th Ed.
99
 Factor initial PG and final XOR out of carry-select
15
14
13
12
11
10
13:12
8
7
6
9:8
14:12
15:12
9
4
3
2
1
0
5:4
10:8
11:8
5
6:4
7:4
15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0
tincrement  t pg   n  1  (k  1)  t AO  txor
CMOS VLSI Design 4th Ed.
100
Variable Group Size
 Also buffer
noncritical
signals
15
14
13
12
11
10
9
12:11
8
7
6
8:7
13:11
4
5:4
9:7
14:11
5
3
2
1
0
3:2
6:4
10:7
15:11
15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0
15
14
13
12
11
10
9
12:11
7
6
8:7
13:11
14:11
8
9:7
10:7
5
5:4
6:4
4
3
3:2
2
1
0
1:0
3:0
6:0
15:11
15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0
CMOS VLSI Design 4th Ed.
101
 Optimized multilevel CSLA with logn levels
0
1
 Correct sum bits si1:k or sik:1 are conditionally
selected through logn levels of multiplexers.
 Bit groups of size 2l at level l.
 Higher parallelism,
more balanced signal paths.

 Highest speed-up at highest hardware overhead
(2RCA + more than logn MUX/bit)
CMOS VLSI Design 4th Ed.
102

A  3n logn , T  2logn , AT  6n log2 n
CMOS VLSI Design 4th Ed.
103
CMOS VLSI Design 4th Ed.
104
CMOS VLSI Design 4th Ed.
105
 Carries look ahead before sum bits are computed
c 0  c0
c1  g0  p0c0
c 2  g1  p1g0  p1 p0c0
c 3  g2  p2 g1  p2 p1g0  p2 p1 p0c0
g3  g3  p3 g2  p3 p2 g1  p3 p2 p1g0
p3  p3 p2 p1 p0
1
 Hierarchical arrangement using logn levels: g3, p3 
2
passed up, c’ passed down between
levels.
0
 High 

CMOS VLSI Design 4th Ed.

106
A  14n , T  4 log n , AT  56n log n

CMOS VLSI Design 4th Ed.
107
CMOS VLSI Design 4th Ed.
108
in parallel.
 Uses higher-valency cells with more than two inputs.
A16:13 B16:13
Cout
G16:13
P16:13
+
S16:13
C12
A12:9 B12:9
G12:9
P12:9
+
S12:9
A8:5 B8:5
C8
A4:1
C4
G8:5
P8:5
B4:1
G4:1
P4:1
+
+
S8:5
S4:1
CMOS VLSI Design 4th Ed.
Cin
109
CLA PG Diagram
CMOS VLSI Design 4th Ed.
110
CMOS VLSI Design 4th Ed.
111
CMOS VLSI Design 4th Ed.
112
CMOS VLSI Design 4th Ed.
113
Higher-Valency Cells
i:k k-1:l l-1:m m-1:j
i:j
Gi:k
Pi:k
Gk-1:l
Pk-1:l
Gl-1:m
Pl-1:m
Gm-1:j
Gi:j
Pi:j
Pm-1:j
CMOS VLSI Design 4th Ed.
114
Higher Valency PG Diagram
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
16:0 15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0
CMOS VLSI Design 4th Ed.
115
– Recursive lookahead gives O(log N) delay
 Many variations on tree adders
CMOS VLSI Design 4th Ed.
116
 Universal adder architecture comprising RCA, CIA,
CLA, and more (entire range of area-delay trade-offs
from slowest RCA to fastest CLA).
step.
 Carries calculated using parallel-prefix algorithms
– High regularity: suitable for synthesis and layout
– High flexibility: special adders, other arthmetic
operations, exchangeable prefix algorithms.
– High performance: smallest and fastest adders
CMOS VLSI Design 4th Ed.
117
A  5n  3A , T  4  2T

CMOS VLSI Design 4th Ed.
118
Prefix Problem
 Inputs (xn-1,…,x0) outputs (yn-1,…,y0), associative
binary operator •
yn 1,, y0   xn 1   x0,, x1  x0, x0 
or
y 0  x 0 , y i  x i  y i1 ; i 1,,n 1 (r.m.a)
 Associativity of • => tree structures for evaluation





x 3  x 2  x1  x 0  x 3  x 2  x1  x 0 
14 2 43  14 2 43 14 2 43

1
1
1
y1 Y1:0
Y3:2
y1 Y1:0


1 4 44 2 4 4 43
1 4 4 2 4 43
2
2
y 3 Y3:0
y 2 Y2:0
1 4 44 2 4 4 43
3
y 3 Y3:0
CMOS VLSI Design 4th Ed.
119
Prefix Problem
l
 Group variables Yi:k : covers bits (xk,…,xi) at level l.
 Carry-propagation is prefix problem: Yi:kl  Gi:kl ,Pi:kl 
G
G

0
i:i
,Pi:i0   gi , pi 
l
i:k




l 1
l 1
,Pi:kl   Gi:l 1j 1,Pi:l 1
; k ji
j 1  G j:k ,P j:k
l 1
l 1 l 1
 Gi:l 1j 1  Pi:l 1
G
,P
j 1 j:k
j:k P j:k


m
c i1  Gi:0
; i  0,,n 1 , l  1,,m
 Parallel-prefix algorithms:
– Multi-tree structures T = O(n) -> O(logn)
– Sharing subtrees A = O(n2) -> O(nlogn)
– Different algorithms trading area vs delay. Also consider
wirng and fanout.

CMOS VLSI Design 4th Ed.
120
Prefix Algorithms
 Algorithms visualized by directed acyclic graphs
(DAG) with array structure (n bits x m levels).
 Graph vertex symbols
 Performance measures:
– A• : graph size (number of black nodes)
– T• : graph depth (number of black nodes on
critical path)
CMOS VLSI Design 4th Ed.
121
Prefix Algorithms
 Serial prefix algorithm (RCA)
A  n 1 , T  n 1 , FOmax  2

CMOS VLSI Design 4th Ed.
122
Prefix Algorithms
 Sklansky parallel-prefix algorithm (PPA-SK)
– Tree-like collection, parallel redistribution of
carries
1
1
A  nlogn , T  logn , FOmax  n
2
2

CMOS VLSI Design 4th Ed.
123
Sklansky
15 14 13 12 11 10
15:14
13:12
11:10
15:12 14:12
15:8
14:8
11:8 10:8
13:8
9
9:8
8
7
6
7:6
7:4
5
5:4
6:4
4
3
2
3:2
3:0
1
0
1:0
2:0
12:8
15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0
CMOS VLSI Design 4th Ed.
124
Prefix Algorithms
 Brent-Kung parallel-prefix algorithm (PPA-BK)
– Traditional CLA is PPA-BK with 4-bit groups
– Tree-like redistribution of carries (fan-out tree)
A  2n  logn  2 , T  2logn  2
FOmax  logn

CMOS VLSI Design 4th Ed.
125
Brent-Kung
15 14 13 12 11 10
15:14
13:12
15:12
11:10
9
9:8
11:8
8
7
7:6
6
5
5:4
7:4
15:8
4
3
3:2
2
1
0
1:0
3:0
7:0
11:0
13:0
9:0
5:0
15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0
CMOS VLSI Design 4th Ed.
126
Prefix Algorithms
 Kogge-Stone parallel-prefix algorithm (PPA-KS)
– very high wiring requirements
A  n log n  n 1 , T  log n , FOmax  2

CMOS VLSI Design 4th Ed.
127
Kogge-Stone
15:14 14:13 13:12 12:11 11:10 10:9
9:8
8:7
7:6
6:5
5:4
4:3
3:2
2:1
3:0
2:0
15:12 14:11 13:10
12:9
11:8 10:7
9:6
8:5
7:4
6:3
5:2
4:1
13:6
12:5
11:4 10:3
9:2
8:1
7:0
6:0
5:0
4:0
15:8
14:7
1
2
3
4
5
6
7
8
9
15 14 13 12 11 10
0
1:0
15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0
CMOS VLSI Design 4th Ed.
128
Prefix Algorithms
 Carry-increment parallel-prefix algorithm
1
1
A  2n 1.4n 2 , T 1.4n 2 , FOmax 1.4n
1
2

CMOS VLSI Design 4th Ed.
129
Prefix Algorithms
 Mixed serial/parallel-prefix algorithm (RCA+PPA)
– Linear size-depth trade-off using parameter k:
0  k  n  2log n  2

– k = 0 : serial prefix graph
– k  n  2log n 1 : Brent-Kung parallel-prefix

graph
– Fills the gap between RCA and PPA-BK (CLA) in
steps of single •-operations.
CMOS VLSI Design 4th Ed.
130

Prefix Algorithms
A  n 1 k , T  n 1  k , FOmax  var.
CMOS VLSI Design 4th Ed.
131
Prefix Algorithms
 Example: 4-bit PPA-SK
– Efficient AND-OR-prefix circuit for the generate
and AND-prefix circuit for the propagate signals
– Optimization: alternatingly AOI/OAI- resp. NAND/NOR-gates (inverting gatesare smaller and
faster).
– Can also be realized using two MUX-prefix
circuits
CMOS VLSI Design 4th Ed.
132
Prefix Algorithms
CMOS VLSI Design 4th Ed.
133
Prefix Algorithms
 Prefix adders can be synthesized by human or
computer as well.
 Starting from a serial structure, one can use
compression rules and expansion rules to obtain
new graphs.
 Can generate all previous graphs except PPA-KS.
CMOS VLSI Design 4th Ed.
134
 Ideal N-bit tree adder would have
– L = log N logic levels
– Fanout never exceeding 2
– No more than one wiring track between levels
 Describe adder with 3-D taxonomy (l, f, t)
– Logic levels:
L+l
– Fanout:
2f + 1
– Wiring tracks:
2t
 Known tree adders sit on plane defined by
l + f + t = L-1
CMOS VLSI Design 4th Ed.
135
l (Logic Levels)
3 (7)
Brent-Kung
f (Fanout)
2 (6)
Sklansky
3 (9)
1 (5)
2 (5)
1 (3)
0 (2)
0 (4)
0 (1)
1 (2)
2 (4)
Kogge-Stone
3 (8)
t (Wire Tracks)
CMOS VLSI Design 4th Ed.
136
Han-Carlson
15 14 13 12 11 10
9
8
7
6
5
4
3
15:14
13:12
11:10
9:8
7:6
5:4
3:2
15:12
13:10
11:8
9:6
7:4
5:2
3:0
15:8
13:6
11:4
9:2
7:0
5:0
2
1
0
1:0
15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0
CMOS VLSI Design 4th Ed.
137
Knowles [2, 1, 1, 1]
15 14 13 12 11 10
9
8
7
6
5
4
3
2
15:14 14:13 13:12 12:11 11:10 10:9
9:8
8:7
7:6
6:5
5:4
4:3
3:2
2:1
15:12 14:11 13:10
3:0
2:0
15:8
14:7
13:6
12:9
11:8 10:7
9:6
8:5
7:4
6:3
5:2
4:1
12:5
11:4 10:3
9:2
8:1
7:0
6:0
5:0
4:0
1
0
1:0
15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0
CMOS VLSI Design 4th Ed.
138
15 14 13 12 11 10
15:14
13:12
15:12
11:10
9
9:8
11:8
15:8
13:8
15:8
13:0
8
7
7:6
5
5:4
7:4
7:0
11:0
6
4
3
3:2
2
1
0
1:0
3:0
5:0
9:0
15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0
CMOS VLSI Design 4th Ed.
139
Taxonomy Revisited
(b) Sklansky
15
15 14 13 12 11 10
9
8
7
6
5
4
3
2
1
14
15:14
15:14
13:12
11:10
9:8
7:6
5:4
3:2
13
12
15:8
14:8
11:8 10:8
13:8
7:4
6:4
3:0
BrentKung
15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0
f (Fanout)
13:12
11:10
4
3
2
15:14 14:13 13:12 12:11 11:10 10:9
9:8
8:7
7:6
6:5
5:4
4:3
3:2
2:1
15:12 14:11 13:10
9:6
8:5
7:4
6:3
5:2
4:1
3:0
2:0
12:9
11:8 10:7
1
15:8
13:8
15:8
13:0
15:14
0 (2)
0
1:0
0 (4)
0 (1)
13:12
11:10
15:12
14:7
13:6
12:5
11:4 10:3
9:2
8:1
7:0
6:0
5:0
4
3
2
7:6
5:4
3:2
7:0
Knowles
[4,2,1,1]
9:0
8:0
7:0
6:0
5:0
4:0
7
6
5
4
3
2
3:0
2:0
9
8
1
0
HanCarlson
9:8
7:6
5:4
3:2
7:4
15:8
1:0
3:0
7:0
11:0
9:0
5:0
2 (4)
(d) Han-Carlson
15 14 13
(c) Kogge-Stone
15 14 13 12 11 10
13:6
15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0
HanCarlson
Knowles
[2,1,1,1]
14:7
9
8
7
6
5
4
3
2
9:8
8:7
7:6
6:5
5:4
4:3
3:2
2:1
12:9
11:8 10:7
9:6
8:5
7:4
6:3
5:2
4:1
3:0
2:0
12:5
11:4 10:3
9:2
8:1
7:0
6:0
5:0
4:0
1
12 11 10
9
8
7
6
5
4
3
2
1
0
0
1:0
Kogge3 (8)
Stone
15:14
13:12
11:10
9:8
7:6
5:4
3:2
15:12
13:10
11:8
9:6
7:4
5:2
3:0
15:8
13:6
11:4
9:2
7:0
5:0
1:0
t (Wire Tracks)
15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0
0:0
5:0
New
(1,1,1)
15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0
15:8
1:0
1:0
1 (2)
15:12 14:11 13:10
0
3:0
4:0
15:14 14:13 13:12 12:11 11:10 10:9
1
9:0
11:8
13:0
15:8
5
7:4
11:0
15 14 13 12 11 10
1 (3)
5
9:8
11:8
15:0 14:0 13:0 12:0 11:0 10:0
2 (5)
6
6
1 (5)
(e) Knowles [2,1,1,1]
7
7
(a) Brent-Kung
2 (6)
3 (9)
8
8
3 (7)
Sklansky
9
9
l (Logic Levels)
2:0
12:8
15 14 13 12 11 10
10
1:0
15:12
15:12 14:12
11
0
15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0
CMOS VLSI Design 4th Ed.
140




– Multilevel versions of adders possible
• CSKA, CSLA, CIA
– Arbitrary combination of speed-up techniques possible.
– Often used combinations: CLA – CSLA
– Influence of logic styles (dynamic logic, pass transistor logic)
– Efficient transistor level implementation of ripple-carry chains
(Manchester chain)
– Combinations of speed-up techniques make sense.
• Much higher design effort
– Many efficient implementations exist in the literature.
CMOS VLSI Design 4th Ed.
141
 Higher valency is a poor choice in static CMOS logic
since each stage has higher delay.
 However, if the stages are built using domino logic, it
could prove to be an advantage.
 Nodes with large fanouts or long wires could use
buffers.
 The prefix trees can also be internally pipelined.
CMOS VLSI Design 4th Ed.
142
Transistor Level
CMOS VLSI Design 4th Ed.
143
Transistor Level
CMOS VLSI Design 4th Ed.
144
Transistor Level
CMOS VLSI Design 4th Ed.
145
CMOS VLSI Design 4th Ed.
146
Sparse Trees
 Building a prefix tree to compute carries in every bit
is expensive in terms of power.
 An alternative is to compute carries into short groups
such as s = 2,3,8, or 16 bits.
 Meanwhile, pairs of s-bit adders precompute the
sums assuming both carries-in of 0 and 1 to each
group.
 It is a hybrid between a prefix adder and carry select
CMOS VLSI Design 4th Ed.
147
 Sparse tree adder with s = 3
CMOS VLSI Design 4th Ed.
148
Carry-Select Implementation
CMOS VLSI Design 4th Ed.
149
 Intel Valency-2 Sklansky sparse tree adder with s=4
CMOS VLSI Design 4th Ed.
150
 Valency-3 Kogge-Stone sparse tree adder with s=3
CMOS VLSI Design 4th Ed.
151
 Ling discovered a technique to remove one series
transistor from the critical group generate path at the
expense of another XOR gate in the sum
precomputation.
 Define a pseudo-generate Hi:j = Gi + Gi-1:j This is a
simpler computation.
Ki Hi: j  KiGi  KiGi1: j  Gi  KiGi1: j  Gi: j
 Define a pseudo-propagate signal I that is a shifted
version of propagate. I i: j  K i 1: j 1

H i: j  H i: k  I i: k H k 1: j
I i: j  I i: k I k 1: j
CMOS VLSI Design 4th Ed.
152
 Finally, the sums are computed by
Si  Pi  K i1H i1:0 
Si  H i1:0 Pi  K i1 H i1:0 Pi 

CMOS VLSI Design 4th Ed.
153
CMOS VLSI Design 4th Ed.
154
Comparison
 Standard-cell implementation, 0.8mm technology
CMOS VLSI Design 4th Ed.
155
Comparison
CMOS VLSI Design 4th Ed.
156
Summary
Choose the best one for your application.
Architecture
Classification Logic
Levels
Max
Tracks
Fanout
Cells
Carry-Ripple
N-1
1
1
N
Carry-Skip n=4
N/4 + 5
2
1
1.25N
Carry-Inc. n=4
N/4 + 2
4
1
2N
Brent-Kung
(L-1, 0, 0)
2log2N – 1
2
1
2N
Sklansky
(0, L-1, 0)
log2N
N/2 + 1
1
0.5 Nlog2N
Kogge-Stone
(0, 0, L-1)
log2N
2
N/2
Nlog2N
CMOS VLSI Design 4th Ed.
157
CMOS VLSI Design 4th Ed.
158