Industrial Research Requirements and Challenges

Logic Emulation and Prototyping:
It’s the Interconnect
(Rent rules)
Mike Butts
RAMP at Stanford, August 2010
In the beginning
• I’ve always been a computer architect.
• Before the ASIC (early 1980s) we built computers with off-the-shelf chips.
– Am2901 bit slices, PALs, 7400 logic. Just hook up some parts and run it now.
Full-speed wire-wrapped prototypes. When it ran it shipped.
Design Verification: It doesn’t crash.
Debug visibility: scope, maybe LA.
Design revision: wire-wrap gun.
Project time: months, not years.
Example: Kurzweil 1978
– Nova clone for Kurzweil Reading Machine
– 2901s, 74F TTL, 16Kb DRAMs, 4 MHz clock
– When the prototype ran the reading machine
app for three days without crashing,
I released the design to manufacturing.
Mike Butts - RAMP - August, 2010
Then came the ASIC Tapeout
• Must get the design perfect before tapeout
• Emergence of EDA, design capture, logic simulation: “Daisy/Mentor/Valid”
• Simulation is very slow,
must write testbenches,
can’t run the real app.
• This makes the design
process very conservative.
Crimps architect’s style.
• To me EDA has always
been a bit of a video game.
FPGAs Emerge!
• Real hardware! We can prototype again!
• But simulators are automatic, and
FPGA tools are strange and hard.
What if we had an automatic box of FPGAs
that plugs into an ASIC socket. Emulate!
• Many FPGAs are needed. How to
interconnect? Extend the row-column
FPGA architecture:
64 CLBs, 1986
US 5,109,353,
First Logic Emulator Product
• Quickturn RPM: 1989
• Nearest-neighbor interconnect
• Hard to get expected logic capacity,
hard to manage delays.
• But it worked!
Sample, US 5,109,353, 1992
First big success: Intel P5
• Quickturn worked closely with Intel to emulate the original
Pentium microarchitecture: P5.
– Ten RPM systems were cabled together, and the design was manually
broken up into RPM-sized segments which were emulated.
“The emulator had one more benefit: blunting the spread of RISC.
At a technology forum for PC companies and software developers last November
(1991), (Intel VP Albert Yu) dialed it up and ran a
Lotus 1-2-3 spreadsheet from a terminal.
The crowd was astonished that a model was
already working. Six months later,
Compaq Computer Corp. scrubbed its plans
for a RISC-based PC.”
- Business Week 6/1/1992 “Inside Intel”
But row/column doesn’t scale
Logic circuit topology is not flat, 2D nearest-neighbor. Wires go anywhere.
FPGA pins get used up by nets that are just passing through. Long delays.
Quickturn RPM had serious capacity, placement and routing issues.
It turns out the wires and pins of an FPGA
are its most precious resource.
– 80-90% of FPGA transistors are interconnect.
– “We charge for the wires, the gates are free”
-- Altera VP Eng. Clive McCarthy, 1994
• Logic density follows Moore’s Law,
but packaging and pin counts do not.
– Not even the square root (perimeter).
• Logic emulators inevitably outstripped
FPGA pin counts. Why???
Rent’s Rule
• The problem of how many
pins to provide for each
partition of a system came up
in the IBM 1401 project, 1960.
• Ed Rent found this empirical
rule for the relationship
between pins per logic block
and the number of gates in
the block:
p = K g^r
where p = pins, g = gates,
r is the “Rent exponent”, and
K is the “Rent constant”.
Rent’s Rule
• IBM 1401 used a Standard Modular System (SMS) of logic modules,
backplanes and chassis, with standard pin counts. How to size? Rent’s Rule.
• Rent never published, but in 1971
Landman and Russo did.
B. S. Landman, R. L. Russo, On a Pin Versus Block Relationship
For Partitions of Logic Graphs, IEEE Trans. Comp., vol. C-20, 1971.
• Profound influence on system
architecture and CAD/EDA tools.
• Different Rent coefficients apply to
different environments.
• Empirical. Theory? Inconclusive.
– Exponent > 0.5: global connectivity.
– Constant > 1: net fanout.
• Rent’s Rule guided FPGA
emulation system architecture.
We used p = 2.5 g^0.57
IEEE Solid-State Circuits magazine, winter 2010
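As a rough illustration of the rule on this slide, the sketch below evaluates p = K g^r in Python with the coefficients quoted above (K = 2.5, r = 0.57). The function name `rent_pins` and the sample gate counts are illustrative assumptions, not taken from the talk.

```python
def rent_pins(gates: float, k: float = 2.5, r: float = 0.57) -> float:
    """Estimate the pins needed by a block of `gates` gates: p = K * g**r."""
    return k * gates ** r


if __name__ == "__main__":
    # Sample block sizes chosen only for illustration.
    for g in (1_000, 5_000, 50_000, 1_000_000):
        print(f"{g:>9} gates -> about {rent_pins(g):.0f} pins")

    # Doubling the gate count multiplies the pin estimate by 2**r.
    print(f"pin growth per doubling: {rent_pins(10_000) / rent_pins(5_000):.2f}x")
```

With these coefficients a 5,000-gate block comes out at a little over 320 pins, and each doubling of gate count raises the estimate by a factor of 2^0.57, roughly 1.48. Pin demand therefore keeps growing with logic capacity, only more slowly than the gate count itself.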
Emulators: Big Green Button
A logic emulator is automatic and universal.
It takes any arbitrary netlist and
implements it in standard hardware,
with little or no user intervention.
Uniform hardware, uniform-size FPGAs.
Design netlist is cut arbitrarily into many equal
partitions to keep the chips full.
– Balanced k-way partitioning (NP-hard)
This means Rent’s Rule applies.
M. Butts, “Emulators”, Wiley Encyclopedia of
Electrical and Electronics Engineering, 1999.
An FPGA prototype is manual and specific.
Hardware is usually chosen for one project, the
design is manually partitioned according to its
modular structure, FPGAs are sized accordingly.
System modules naturally have smaller pinouts
than arbitrary cuts. Rent’s Rule does not apply.
(Well, yes it does but weakly.)
G. Schelle, et al., Intel Nehalem Processor Core
Made FPGA Synthesizable, ACM FPGA 2010
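The contrast between arbitrary cuts and modular cuts can be sketched by counting pins per partition. The Python below is a minimal illustration under simple assumptions: a net that spans more than one partition costs one pin on each partition it touches, and the tiny two-module netlist, the gate names and the function name `partition_pins` are all made up for the example.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Sequence


def partition_pins(nets: Iterable[Sequence[str]],
                   part_of: Dict[str, Hashable]) -> Dict[Hashable, int]:
    """Count pins per partition: each cut net uses one pin on every partition it touches."""
    pins: Counter = Counter()
    for net in nets:
        parts = {part_of[gate] for gate in net}
        if len(parts) > 1:          # net crosses a partition boundary
            for p in parts:
                pins[p] += 1
    return dict(pins)


# Hypothetical design: two 4-gate modules joined by a single net (a3, b0).
nets = [("a0", "a1"), ("a1", "a2"), ("a2", "a3"), ("a0", "a3"),
        ("b0", "b1"), ("b1", "b2"), ("b2", "b3"), ("b0", "b3"),
        ("a3", "b0")]

# Manual, modular cut: one module per partition.
modular = {g: g[0].upper() for net in nets for g in net}

# Arbitrary balanced cut that ignores module boundaries.
arbitrary = {"a0": 0, "a1": 0, "b0": 0, "b1": 0,
             "a2": 1, "a3": 1, "b2": 1, "b3": 1}

if __name__ == "__main__":
    print("modular cut:  ", partition_pins(nets, modular))
    print("arbitrary cut:", partition_pins(nets, arbitrary))
```

Both cuts are balanced at four gates per partition, but the modular cut needs one pin per partition while the arbitrary cut needs five.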
Rent’s Rule says FPGA Pins are Precious
• XC3090: 640 LUTs, 5K gates.
Rent’s Rule says 325 pins,
FPGA has 144 pins, only 44%
• Lesson: FPGA pins are vital
to FPGA emulator capacity.
=> Separate interconnect
• Crossbar is ideal
– Interconnects any pins,
any way, with any fanout
– Uniform delay: one level
• Far too expensive: O(n^2)
• Far more fanout than needed,
average net fanout is 2 to 3.
• Doesn’t take advantage of
FPGA pin routability.
Butts, US 5,036,473, 1991
Partial Crossbar Interconnect
• Drop out most of the
crosspoints, leaving a
partial crossbar.
– Group FPGA pins
into subsets,
– Fully populate crosspoints
within each subset,
– Leave the rest out.
• For each net, find a subset
which can route it.
– High fanout nets first.
Map nets to FPGA pins accordingly.
Still uniform single-level delay.
Symmetrical, no placement needed.
Scalable: O(n)
Butts, US 5,036,473, 1991
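The routing idea on this slide can be sketched in a few lines of Python. This is a simplified model, not the patented method as implemented: each FPGA's pins are split evenly into subsets, one crossbar serves each subset, a net is represented only by the set of FPGAs it touches, and a greedy first-fit pass handles high-fanout nets first. All function names and the example sizes are assumptions.

```python
from typing import Dict, List, Optional, Sequence, Set


def route_partial_crossbar(nets: Sequence[Set[int]], num_fpgas: int,
                           pins_per_fpga: int, num_subsets: int) -> Dict[int, Optional[int]]:
    """Assign each net to a pin subset (crossbar), or None if no subset can route it."""
    pins_per_subset = pins_per_fpga // num_subsets
    # free[f][s] = unused pins of FPGA f in subset s
    free: List[List[int]] = [[pins_per_subset] * num_subsets for _ in range(num_fpgas)]
    assignment: Dict[int, Optional[int]] = {}
    # High-fanout nets first: they are the hardest to place.
    order = sorted(range(len(nets)), key=lambda i: len(nets[i]), reverse=True)
    for i in order:
        choice = None
        for s in range(num_subsets):
            # The subset works only if every FPGA on the net still has a free pin in it.
            if all(free[f][s] > 0 for f in nets[i]):
                choice = s
                break
        assignment[i] = choice
        if choice is not None:
            for f in nets[i]:
                free[f][choice] -= 1
    return assignment


def crosspoints_full(num_fpgas: int, pins_per_fpga: int) -> int:
    """Order-of-magnitude crosspoint count for one crossbar over all n pins: n**2."""
    n = num_fpgas * pins_per_fpga
    return n * n


def crosspoints_partial(num_fpgas: int, pins_per_fpga: int, num_subsets: int) -> int:
    """Crosspoints when only the crossbar inside each pin subset is populated."""
    per_xbar = num_fpgas * (pins_per_fpga // num_subsets)
    return num_subsets * per_xbar ** 2


if __name__ == "__main__":
    nets = [{0, 1, 2, 3}, {0, 1}, {2, 3}, {0, 2}, {0, 1}]
    print(route_partial_crossbar(nets, num_fpgas=4, pins_per_fpga=8, num_subsets=4))
    print(crosspoints_full(4, 8), "vs", crosspoints_partial(4, 8, 4))
```

In this toy case every net is routed through a single crossbar, so delay stays at one level, while the crosspoint count drops from 1024 to 256. If the crossbar size is held fixed and more subsets are added as pin counts grow, the total grows linearly with the number of crossbars, which is one way to read the O(n) claim.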
Partial Crossbar Systems
• Redraw: Group each subset’s
crosspoints into a crossbar chip
for that subset
Each crossbar has pins to
every FPGA, and vice versa.
Make crossbar chip or use cheap FPGA
• Multilevel for systems: second-level
crossbars on the backplane.
Max delay is three hops.
• Cost is slightly higher than O(n).
• Partial crossbar interconnect made
large-scale logic emulation practical.
Butts, US 5,036,473, 1991
History of FPGA Emulators, 1989-2000
Nearest-neighbor architecture
• Quickturn RPM (1989): First commercial emulator
• Virtual Machine Works (1994): Virtual Wires pin multiplexing
Partial Crossbar architecture
• Mentor Realizer (1989): First hardware, emulated Apple II mobo
• Mentor Realizer (1991): Proof-of-concept system prototype
– 8 logic boards (14 XC3090 FPGAs, 32 XC2018 xbars), 64 XC2018 2nd-level xbars
• Mentor sold this logic emulator technology to Quickturn (1992).
• Quickturn Enterprise (1993): First commercial partial crossbar emulator
– 11 logic boards (46 XC3090s, 46 custom xbars), 144 2nd-level xbars, 330K gates
• HP Teramac (1995): Configurable computing research machine: 1M gates
• Quickturn System Realizer (1995): XC4000 series, 2M gates
• Quickturn Mercury Plus (2000): Large custom emulation FPGA, 20M gates
FPGA Emulation Clocking Issues
• ASIC and custom chips have gated clocks, latches, many clock domains.
FPGAs can introduce their own violations.
• FPGA interconnect delay is very hard to manage.
– FPGAs use dedicated low-skew clock networks.
• Gated clocks: must run clock through
logic blocks. Hold-time violations:
clock gets sooner than the data.
• Latches: timing of both edges matters,
plus there’s latch transparency.
• How to reliably map these to FPGA? Re-synthesis.
– Map gated clocks to FPGA FF clock enables (which is the gate, which is the clock?)
– Map latches into flops, using 2x clocking.
• Emulators developed sophisticated design mapping techniques.
Emulator User Psychology
• Emulators were often hard to use, especially in the early days.
– First-time users + clocking issues = errors.
– Ultra-high pincount backplanes, cabling = errors.
• This trained users to blame the emulator.
• After weeks of effort, they finally get their
design up and running on the emulator.
A bug is found. What is their response?
a) “Wonderful! It found a bug in our design.
We’re getting value from all this expense.”
b) “It’s not our design, it’s your emulator.”
User starts running diagnostics and swapping boards.
Swap enough boards and guess what happens.....
Solutions: Locked board extractors, Better emulators.
Emulators have thousands
of pins per board
1995: Quickturn System Realizer
• Up to 990 FPGAs (Xilinx XC4013), custom crossbar chips
• Logic board: 45 FPGAs, 100 K gates
– 2500 pins to backplane, 900 pins in-circuit or LA
• Max system 22 boards
2M gates, 14 MB RAM
• Built-in LAPG
• 14K I/Os for
multiple systems
• Compiler 100KG/hr
• Two-level partial
crossbar connects 990
FPGAs in 3 hops max.
2000: Mercury Plus FPGA
• Custom FPGA for emulation
• Five-level partial crossbar
across entire 20M gate system:
– Logic cluster: full crossbar
– Two partial crossbar levels on-chip
– Two more levels in the system
• 10x faster compile
• Predictable capacity and delays
• 6-LUTs, FFs, RAMs
– hold time trimmers
• Full visibility, on-chip logic analyzer
• QT’s last FPGA emulator
FPGA Pin Shortage Gets Worse Over Time
• Using FPGAs directly in logic emulators falls to Rent’s Rule
– FPGA-based emulators were always starved for pins.
– Xilinx FPGAs from the beginning. Altera, other FPGAs are similar.
[Table not recovered: LCs (4-LUT), Rent pins, and real pins* by FPGA generation]
* ordinary pins only, SERDES latency is too long for logic emulation
FPGA Emulator Pin Multiplexing
• Multiple nets per pin,
slower design clock
Xilinx data book
• Quickturn:
– Asynchronous
using DDR IOBs
– Transparent to the emulated design
• VMW: Virtual Wires
– Synchronous to design
– Modify design netlist:
many levels
– Multiple clock domains?
Babb et al., “Logic Emulation
with Virtual Wires”, vol. 16,
pp. 609-626, 1997.
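The cost of sharing pins can be sketched as a time-slot schedule. The Python below is a deliberately simplified model of synchronous pin multiplexing, not the Virtual Wires algorithm itself: it assigns each inter-FPGA net a (pin, slot) pair in round-robin order and bounds the design clock by the pin clock divided by the number of slots. It ignores combinational paths that cross several chips ("many levels") and multiple clock domains. Function names and numbers are assumptions.

```python
import math
from typing import Dict, Sequence, Tuple


def schedule_virtual_wires(nets: Sequence[str],
                           physical_pins: int) -> Tuple[int, Dict[str, Tuple[int, int]]]:
    """Map each net to a (pin, time slot) pair; return the slot count and the schedule."""
    if physical_pins <= 0:
        raise ValueError("need at least one physical pin")
    slots = max(1, math.ceil(len(nets) / physical_pins))
    schedule = {net: (i % physical_pins, i // physical_pins)
                for i, net in enumerate(nets)}
    return slots, schedule


def max_design_clock_hz(io_clock_hz: float, slots: int) -> float:
    """Upper bound on the emulated design clock when each pin carries `slots` nets per cycle."""
    return io_clock_hz / slots


if __name__ == "__main__":
    nets = [f"n{i}" for i in range(10)]      # hypothetical cut nets
    slots, schedule = schedule_virtual_wires(nets, physical_pins=4)
    print(slots, "slots per design cycle")
    print(schedule)
    print(max_design_clock_hz(30e6, slots), "Hz design clock bound at a 30 MHz pin clock")
```

Ten nets over four pins need three slots, so the design clock can be at most one third of the pin clock in this model. Real systems lose more than that once multi-chip combinational paths are taken into account.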
Continuous to Discrete Time
• As FPGAs got further and further from Rent’s Rule, FPGA
emulators went to deeper and deeper pin multiplexing.
• Continuous time:
– Pure FPGA emulator runs in the continuous time of the design. Signals
propagate as in the real hardware, just with different delays.
• Continuous / discrete time mix:
– Pin-multiplexed FPGA emulator runs in an ad-hoc mix of
continuous and discrete time. Yet pins still mostly lie idle.
• Discrete time:
– Go all the way into discrete time == levelized simulation
• Now it’s a massively parallel computer
Processor-based Emulation
• Levelize netlist, evaluate all gates
every cycle, level-by-level.
• No branches: deep pipelining, fast,
massively parallel, very scalable.
• Compile-time net scheduling:
Emulated design escapes Rent’s Rule
• IBM Yorktown Simulation Engine
Monty Denneau, DAC 1982.
– “... high speed special purpose parallel
processor designed and built at the IBM
Thomas J. Watson Research Center to
simulate logical operation ... up to
2,000,000 gates at a rate exceeding 3
billion gate computations per second”
• IBM Engineering Verification Engine
Beece et al., DAC 1988.
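The levelize-then-evaluate scheme described on this slide can be sketched in Python. This is a minimal software model under stated assumptions (purely combinational netlist, one driver per net, a gate represented as a tuple of output net, function and input nets); it is not how any of the IBM or Quickturn engines were implemented, and the full-adder netlist is only an example.

```python
import operator
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

# (output net, gate function, input nets) -- representation assumed for this sketch
Gate = Tuple[str, Callable[..., int], Sequence[str]]


def levelize(gates: Sequence[Gate], primary_inputs: Iterable[str]) -> List[List[Gate]]:
    """Group gates into levels so every gate's inputs come from earlier levels."""
    known = set(primary_inputs)
    remaining = list(gates)
    levels: List[List[Gate]] = []
    while remaining:
        ready = [g for g in remaining if all(i in known for i in g[2])]
        if not ready:
            raise ValueError("combinational loop or undriven net")
        levels.append(ready)
        known.update(g[0] for g in ready)
        remaining = [g for g in remaining if g[0] not in known]
    return levels


def evaluate(levels: List[List[Gate]], inputs: Dict[str, int]) -> Dict[str, int]:
    """Evaluate every gate once per cycle, level by level, with no branching on data."""
    values = dict(inputs)
    for level in levels:
        # Gates within one level are independent, so they could run in parallel.
        values.update({out: fn(*(values[i] for i in ins)) for out, fn, ins in level})
    return values


# Example netlist: a full adder, listed out of order on purpose.
full_adder: List[Gate] = [
    ("cout", operator.or_, ("g", "p")),
    ("s", operator.xor, ("x", "cin")),
    ("p", operator.and_, ("x", "cin")),
    ("g", operator.and_, ("a", "b")),
    ("x", operator.xor, ("a", "b")),
]

if __name__ == "__main__":
    levels = levelize(full_adder, ["a", "b", "cin"])
    print([[g[0] for g in lvl] for lvl in levels])
    print(evaluate(levels, {"a": 1, "b": 1, "cin": 0}))
```

Because the evaluation order is fixed at compile time, the movement of each net value between processors can also be scheduled at compile time, which is the sense in which the emulated design escapes Rent's Rule.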
Quickturn CoBALT
Wm. Beausoleil et al., IBM
1997 commercialization of IBM engines
8M gates, 1 MHz emulation speed
IBM HW, QT front end compiler
Maps multi clock domains, latches, gated
clocks onto single faster clock, making
use of FPGA compiler experience
• Compiles 1M gates / hour
• Full custom 100 MHz
250 nm chip with
64 logic processors
• 65 chips / board
Processor-based Emulation in the 2000s
• IBM technology and team acquired by QT, then QT acquired by Cadence
• FPGA emulators dropped
• 2002: Palladium
128M gates, 0.75 MHz
Full visibility
Compile 30M gates / hour
• 2004: Palladium II
– 256M gates, 1.5 MHz
• 2007: Palladium III
– 256M gates, 2 MHz
• 2010: Palladium XP
– 2000M gates, 4 MHz
Emulation at NVIDIA
One of the largest emulation labs in the world
Early Emulation Success
• In 1995, CEO Jensen Huang “spent $1 million, a third of the company’s
cash, on a technology known as emulation, which allows engineers to play
with virtual copies of their graphics chips before they put them into
silicon. That allowed Nvidia to speed a new graphics chip to market every
six to nine months, a pace the company has sustained ever since.”
- Forbes, 1/7/08
• RIVA 128, or "NV3", was one of the first
consumer graphics processing units to
integrate 2D and 3D acceleration. When
announced in 1997, the market found the
specifications hard to believe: performance
superior to market-leader 3dfx. RIVA 128
shipped in volume, and the combination of its low cost and high
performance made it a popular choice for OEMs.
Emulation in 2005
The specific verification goals that were required
for the GeForce 6800 project include:
• Bring up a new generation of GPUs on an
accelerated verification platform in a one-week time frame. Derivative chips
must be brought up in a few days.
• Automate the Compile-Run-Debug process
so that ASIC design engineers could use an
accelerated verification platform.
• Verify GPU and frame-buffer/system-memory
• Validate AGP/PCI-bus interface functions.
• Ensure functionality at various levels of
abstraction (RTL and gates).
• Expand accelerated verification solution to
ATPG and BIST applications.
- Chip Design Magazine, January 2005
Emulation Today
• 2010: Cadence Palladium XP
• Up to 2 billion gates, up to 4 MHz, up to 512 users
– Compile up to 35M gates / hour on 1 PC
• Full visibility to all signals
• Integrates with logic and power simulation,
SystemC/C++ models, prototype hardware
• System integration steps used at NVIDIA:
– Design and verify the silicon itself.
• Power analysis is vital.
– Run silicon in the virtual system (such as
a PC), verify that the GPU works in a system.
– Run lots of software applications on the
virtualized platform.
- “NVidia Engineer Cites HW/SW Integration Challenges”, 5/5/10,
FPGA Prototyping today
• FPGA prototyping is widely used as a verification tool by chip
development projects (not to mention RAMP of course).
• Practical for one to four to maybe ten FPGAs.
– 2-4M gates each, typically 10 to 50 MHz
• Prototypes are rarely disclosed; two research efforts that were:
Nehalem CPU in
five FPGAs, 520 kHz
due to pin multiplexing,
18- to 24-way
(ACM FPGA ‘10)
Atom CPU in one
Virtex-5 LX330, 50 MHz
(ACM FPGA ‘09)
• State-of-the-art projects continue to rely heavily on processor-based emulation and FPGA prototyping for tapeouts.
• State-of-the-art tapeouts today cost $50-100M++.
– Only possible for established $B vendors.
– Very hard to get new chip startups funded.
• Therefore, ASIC project starts are dropping.
• FPGAs and GPUs are the only processing silicon
that scales with Moore’s Law (so far).
– Their vendors are the “foundries” for new HW efforts.
• Off-the-shelf chips: we’re coming full circle.
The Ultimate Interconnect
Human brain: 10^11 neurons, 10^14 to 10^15 total synapses, 20-40 W,
somewhat reconfigurable.
“The Brain Unveiled”, Technology Review, Nov-Dec, 2008
