### Uncle – An RTL Approach to Asynchronous Design

```Uncle – An RTL Approach to
Asynchronous Design
Presenter: Chi-Chuan Chuang
Date : 2012.12.20
Outline

Introduction
◦ C-element
◦ Null convention logic (NCL)
◦ NCL asynchronous systems

UNCLE synthesis flow
◦ From RTL to gates
◦ Ack generation
◦ Net buffering
◦ Latch balancing
◦ Relaxation, cell merging

Comparisons
Conclusion
C-element
• Commonly used asynchronous logic component
• Hysteresis
• Implementations
◦ Semi-static: with two cross-coupled inverters
◦ Static: doesn't rely on feedback inverters
◦ Gate-level: depends on which gates are used
C-element (cont.)
[Figure: semi-static implementation]
C-element (cont.)
[Figures: static and gate-level implementations]
Null convention logic
• Dual-rail
• Delay-insensitive logic style
• Based on threshold logic
• Uses 27 fundamental threshold gates with 2 to 4 inputs
• Hysteresis provides state-holding capability

Null convention logic (cont.)

• Definitions of threshold gate
◦ set: equation that determines the gate function
◦ hold1: all inputs ORed together
◦ reset: complement of hold1
◦ hold0: complement of set
• Z = set + (Z⁻ ∙ hold1), where Z⁻ is the previous output value
• Z′ = reset + (Z⁻′ ∙ hold0)

• Example: implementing TH23
◦ set = AB + BC + AC
◦ hold1 = A + B + C
◦ reset = A′B′C′
◦ hold0 = A′B′ + B′C′ + A′C′
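• Worked trace (illustrative, not from the original slides): evaluating Z = set + (Z⁻ ∙ hold1) for TH23 over an assumed input sequence shows the hysteresis
◦ ABC = 000, Z⁻ = 0: set = 0, hold1 = 0, so Z = 0
◦ ABC = 100, Z⁻ = 0: set = 0, hold1 = 1, so Z = 0 (threshold of 2 not reached)
◦ ABC = 110, Z⁻ = 0: set = 1, so Z = 1
◦ ABC = 100, Z⁻ = 1: set = 0, hold1 = 1, so Z = 1 (output held)
◦ ABC = 000, Z⁻ = 1: set = 0, hold1 = 0, so Z = 0 (output resets)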
Null convention logic (cont.)

[Figure: comparison of two types of dual-rail (DR) AND2]
[Table: the 27 basic NCL macros]
NCL asynchronous systems

• Data-driven approach
◦ Uses NCL gates for both registers and control

• Control-driven approach
◦ Uses Balsa-style registers and control
Data-driven approach

• Uses dual-rail latches with acknowledge signals ki and ko to control the datapath
Dual-rail latches

• Dual-rail latches
◦ C_0 = C-element with async reset to 0
◦ C_1 = C-element with async reset to 1
◦ t_d/f_d = dual-rail in
◦ ko = ackout
◦ t_q/f_q = dual-rail out
◦ ki = ackin
• Types of latch
◦ drlatn
◦ drlatr
◦ drlats
Dual-rail latches (cont.)
[Figures: drlatn, drlatr, drlats]
Data-driven approach (cont.)

• Finite state machine
◦ The middle half-latch contains the initial data
◦ All ports and registers are read and written every cycle
Control-driven approach
• Control network is separate from the datapath
to the register

UNCLE synthesis flow
• Both data-driven and control-driven styles are supported
• Lower-level synthesis tool
• Verilog as its input language

From RTL to Gates
• RTL is transformed to a gate-level netlist using commercial synthesis tools
• The target library read by the tool contains:
◦ AND2, XOR2, OR2, inverter
◦ D flip-flop (DFF), D-latch (DLAT)
◦ Special gates (T-elements, S-elements, …)
◦ Complex gates that have been mapped into NCL
• Gates have unit delays for timing
• Area is proportional to transistor count

Ack Generation

• Data-driven
◦ Each latch receives an ack signal from each destination latch of its output

• Control-driven
◦ Each control element receives an ack signal from each destination latch

• A simple ack merging algorithm:
◦ Any latches having at least one common destination have their ack networks merged

• An ack checker step is included at the end of the flow to check ack network validity
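• Illustrative example of ack merging (latch names are hypothetical, not from the slides):
◦ Latch A drives destinations {X, Y}; latch B drives {Y, Z}; latch C drives {W}
◦ A and B share destination Y, so their ack networks are merged
◦ C shares no destination with A or B, so it keeps its own ack network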
Net Buffering
• Timing data uses a non-linear delay model (NLDM)
• The signal-net target transition time used for all examples in this paper is approximately equivalent to a 1X inverter driving four separate 4X inverters
• Gate sizing
• Build a buffer tree with inverters

Latch Balancing
• For the data-driven style: moves half-latches in the netlist to balance data delays with ack delays
• Ack delay
◦ Depends on the number of destinations, which sets the completion network depth

• Data delay
◦ Depends on the data logic complexity
Latch Balancing (cont.)
• Generally results in more transistors, as the datapath width increases moving towards the source registers
• Requires more latches, with an increase in the ack network size
• Implemented by an iterative heuristic algorithm

Latch Balancing (cont.)
• Several sorting/pruning stages based on data/ack/cycle delays are used to find the latches most likely to improve performance if pushed
• Chosen latches are pushed one gate level, and the affected ack networks are rebuilt
• Latches that feed only primary outputs are ineligible

Latch Balancing (cont.)
• Works appropriately for FSMs
• Has problems with linear pipelines if latches are pushed in one direction only

Relaxation and Cell Merging

• Relaxation is a technique that
◦ Looks for redundant paths from a primary input (PI) to a primary output (PO)
◦ Finds gates that don't have to be fully expanded to dual-rail versions but can be implemented by eager versions that require fewer transistors

• Cell merging
◦ A cell merging step is performed in which adjacent gates with no fanout are merged into more complex gates
◦ Area-driven
Example RTL Statements
Comparison

• GCD16 with different Uncle versions

Uncle ver.       DD      DD/NB   DD/LB/NB  CD      CD/NB
transistors      16192   16226   20128     8658    8662
  *              1.87    1.87    2.32      1.00    1.00
cyc. time (ns)   105.7   86.0    64.9      75.7    62.4
  *              1.69    1.38    1.04      1.21    1.00
energy (pJ)      32.4    35.3    49.7      10.2    10.8
  *              3.17    3.44    4.85      1.00    1.05

• Conditional port activity caused the data-driven designs to be large and slow
• Latch balancing helped DD performance
• Control-driven produced the best results
DD: data-driven, CD: control-driven, LB: latch balanced, NB: net buffered, *: ratio to best
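• Worked example of the "*" (ratio to best) rows, using the values in the table above:
◦ transistors, DD: 16192 / 8658 (best, CD) ≈ 1.87
◦ cyc. time, DD/LB/NB: 64.9 / 62.4 (best, CD/NB) ≈ 1.04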
Comparison (cont.)

• GCD16: Uncle versus Balsa

                 Balsa   Uncle (CD/NB)
transistors      11455   8662
  *              1.32    1.00
cyc. time (ns)   85.2    62.4
  *              1.37    1.00
energy (pJ)      13.7    10.8
  *              1.27    1.00

increasing transistor count
improved performance
Comparison (cont.)

• Viterbi decoder design
◦ Branch Metric Unit (BMU)
  ▪ Just combinational logic
  ▪ With a half-latch at the output for UNCLE ack
◦ Path Metric Unit (PMU)
  ▪ A set of parallel accumulator-like registers, resulting in many parallel three-half-latch loops
◦ History Unit (HU)
  ▪ Has three 16-entry register files (4-bit, 2-bit, and 1-bit)
  ▪ An outer loop writes the registers and can conditionally trigger an inner while loop that contains register read/write operations and executes a variable number of iterations
Comparison (cont.)

• Viterbi Branch Metric Unit comparison
◦ Combinational only

                 Balsa   Uncle (CD/NB)
transistors      9040    5338
  *              1.69    1.00
cyc. time (ns)   9.30    8.87
  *              1.05    1.00
energy (pJ)      2.33    1.35
  *              1.73    1.00

• The Uncle version is just combinational logic with a half-latch on the output
• The Balsa version used loop splitting to split the combinational logic into concurrent blocks, which increased the parallelism of internal computations at the cost of more transistors
Comparison (cont.)

• Uncle's Viterbi Path Metric Unit (PMU)

Uncle ver.       DD/NB   DD/NB/LB  DD/NB/LB+  CD/NB
transistors      20184   21778     24561      18838
  *              1.07    1.16      1.30       1.00
cyc. time (ns)   13.4    13.4      6.9        13.3
  *              1.93    1.93      1.00       1.91
energy (pJ)      5.1     5.7       6.8        4.6
  *              1.12    1.24      1.48       1.00

LB+: latch balanced, with two sets of half-latches added to the RTL (one in the FSM loop and one on the output port)
Comparison (cont.)

• Viterbi Path Metric Unit comparison

                 Balsa   Uncle (DD/NB/LB+)
transistors      38328   24561
  *              1.56    1.00
cyc. time (ns)   9.39    6.94
  *              1.35    1.00
energy (pJ)      9.73    6.81
  *              1.43    1.00

Comparison (cont.)

• Viterbi History Unit comparison

                    Balsa   Uncle CD/NB  Uncle CD
transistors         21819   16471        16425
  *                 1.33    1.00         1.00
V1 cyc. time (ns)   10.8    6.8          8.4
  *                 1.60    1.00         1.25
V1 energy (pJ)      1.34    1.17         1.07
  *                 1.26    1.09         1.00
V2 cyc. time (ns)   230.7   161.3        192.0
  *                 1.43    1.00         1.19
V2 energy (pJ)      25.4    19.6         18.7
  *                 1.36    1.05         1.00

Comparison (cont.)

• Viterbi decoder comparison between Balsa and Uncle

                 Balsa   Uncle (DD/NB/LB+)
transistors      71370   46752
  *              1.53    1.00
cyc. time (ns)   22.0    17.3
  *              1.27    1.00
energy (pJ)      15.0    10.5
  *              1.43    1.00

• The Uncle decoder uses the DD/NB/LB+ PMU RTL
Comparison (cont.)
                          Balsa                         Uncle
Combinational synthesis   Yes                           Yes
Control synthesis         Yes                           Data-driven only
Logic style               Different dual-rail styles,   NCL only
                          bundled data
Behavioral simulation     Yes                           Limited
Area optimizations        No                            Relaxation, limited cell merging, ack sharing;
                                                        RTL style allows area/perf. trade-off; net buffering
Timing model              Fixed delay                   NLDM
Conclusion
• Requires more effort from the designer than Balsa, but can yield a higher-quality design
• If the performance of an always-active module is the goal, the data-driven style is the better choice
• The control-driven style is better for modules with conditional port activity

Appendix : Teak
• Teak is a successor toolset to Balsa that uses a data-driven style
• One of Teak's goals is to automatically insert latch stages and balance delays for optimum throughput
• Teak is a fairly new tool with only one public release

Reference




Uncle – An RTL Approach to Asynchronous Design
ASYNC12 PowerPoint presentation: Uncle – An RTL Approach to Asynchronous Design
Design of Asynchronous Circuits Using Synchronous