ESP (Ensembles of Specialized Processors)

Report
Algorithms and Specializers for Provably Optimal
Implementations with Resiliency and Efficiency
Elad Alon, Krste Asanovic (Director),
Jonathan Bachrach, Jim Demmel, Armando Fox,
Kurt Keutzer, Borivoje Nikolic, David Patterson,
Koushik Sen, John Wawrzynek
[email protected]
http://aspire.eecs.berkeley.edu
UC Berkeley
Future Application Drivers
2
UC Berkeley
Compute Energy “Iron Law”
Performance =
Power * Energy Efficiency
(Tasks/Second) (Joules/Second) (Tasks/Joule)
 When power is constrained, need better energy
efficiency for more performance
 Where performance is constrained (real-time), want
better energy efficiency to lower power
Improving energy Efficiency is critical goal for all
future systems and workloads
3
UC Berkeley
Good News: Moore’s Law Continues
“Cramming more components onto integrated
circuits”, Gordon E. Moore, Electronics, 1965
4
UC Berkeley
L3 energy scaling ended in 2005
BadAnd
News:
Dennard (Voltage) Scaling Over
Moore, ISSCC Keynote, 2003
Dennard
Gordon Moore, ISSCC 2003
Moore, ISSCC Keynote, 2003
Scaling
Post-Dennard Scaling
CICC
Sept 10, 2012
Data courtesy S. Borkar/Intel 2011
5
UC Berkeley
1st Impact of End of Scaling:
End of Sequential Processor Era
6
UC Berkeley
Parallelism: A one-time gain
Use more, slower cores for better energy efficiency.
Either
 simpler cores, or
 run cores at lower Vdd/frequency
 Even simpler general-purpose microarchitectures?
- Limited by smallest sensible core
 Even Lower Vdd/Frequency?
- Limited by Vdd/Vt scaling, errors
 Now what?
7
UC Berkeley
2nd Impact of End of Scaling: “Dark Silicon”
Cannot switch all transistors at full frequency!
[Muller, ARM CTO, 2009]
No savior device technology on horizon.
Future energy-efficiency innovations must be above transistor level.
8
UC Berkeley
The End of General-Purpose Processors?
 Most computing happens in
specialized, heterogeneous
processors
- Can be 100-1000X more
efficient than general-purpose
processor
 Challenges:
- Hardware design costs
- Software development costs
NVIDIA Tegra2
9
UC Berkeley
The Real Scaling Challenge:
Communication
As transistors become smaller and cheaper,
communication dominates performance and energy
All scales:
 Across chip
 Up and down
memory hierarchy
 Chip-to-chip
 Board-to-board
 Rack-to-rack
10
UC Berkeley
ASPIRE: From Better to Best
Specialize and optimize communication
and computation across whole stack from
applications to hardware
 What is the best we can do?
- For a fixed target technology (e.g., 7nm)
 Can we prove a bound?
 Can we design implementation approaching bound?
 Provably Optimal Implementations
11
UC Berkeley
Communication-Avoiding Algorithms:
Algorithm Cost Measures
1.Arithmetic (FLOPS)
2.Communication: moving data between
- levels of a memory hierarchy (sequential case)
- processors over a network (parallel case).
CPU
Cache
CPU
DRAM
CPU
DRAM
CPU
DRAM
CPU
DRAM
DRAM
12
UC Berkeley
Modeling Runtime & Energy
13
UC Berkeley
A few examples of speedups
 Matrix multiplication
- Up to 12x on IBM BG/P for n=8K on 64K cores; 95% less communication
 QR decomposition (used in least squares, data mining, …)
- Up to 8x on 8-core dual-socket Intel Clovertown, for 10M x 10
- Up to 6.7x on 16-proc. Pentium III cluster, for 100K x 200
- Up to 13x on Tesla C2050 / Fermi, for 110k x 100
- Up to 4x on Grid of 4 cities (Dongarra, Langou et al)
- “infinite speedup” for out-of-core on PowerPC laptop
- LAPACK thrashed virtual memory, didn’t finish
 Eigenvalues of band symmetric matrices
- Up to 17x on Intel Gainestown, 8 core, vs MKL 10.0 (up to 1.9x sequential)
 Iterative sparse linear equations solvers (GMRES)
- Up to 4.3x on Intel Clovertown, 8 core
 N-body (direct particle interactions with cutoff distance)
- Up to 10x on Cray XT-4 (Hopper), 24K particles on 6K procs.
14
UC Berkeley
Modeling Energy: Dynamic
15
UC Berkeley
Modeling Energy: Memory Retention
16
UC Berkeley
Modeling Energy: Background Power
17
UC Berkeley
Energy Lower Bounds
18
UC Berkeley




Early Result:
Perfect Strong Scaling in Time and Energy
Every time you add processor, use its memory M too
Start with minimal number of procs: PM = 3n2
Increase P by factor c  total memory increases by factor c
Notation for timing model:
- γt , βt , αt = secs per flop, per word_moved, per message of size m
T(cP) = n3/(cP) [ γT+ βt/M1/2 + αt/(mM1/2) ] = T(P)/c
 Notation for energy model:
- γe , βe , αe = Joules for same operations
- δe = Joules per word of memory used per sec
- εe = Joules per sec for leakage, etc.
E(cP) = cP { n3/(cP) [ γe+ βe/M1/2 + αe/(mM1/2) ] + δeMT(cP)
+ εET(cP) } = E(P)
 Perfect scaling extends to n-body, Strassen, …
[IPDPS, 2013]
19
UC Berkeley
C-A Algorithms Not Just for HPC
 In ASPIRE, apply to other key application areas:
machine vision, databases, speech recognition,
software-defined radio, …
 Initial results on lower bounds of database join
algorithms
20
UC Berkeley
From C-A Algorithms to Provably
Optimal Systems?
 1) Prove lower bounds on communication for a
computation
 2) Develop algorithm that achieves lower bound on a
system
 3) Find that communication time/energy cost is >90%
of resulting implementation
 4) We know we’re within 10% of optimal!
 Supporting technique: Optimizing software stack and
compute engines to reduce compute costs and expose
unavoidable communication costs
21
UC Berkeley
ESP: An Applications Processor
Architecture for ASPIRE
Intel Ivy Bridge (22nm)
Qualcomm Snapdragon MSM8960 (28nm)
 Future server and mobile
SoCs will have many fixedfunction accelerators and a
general-purpose
programmable multicore
 Well-known how to
customize hardware
engines for specific task
 ESP challenge is using
specialized engines for
general-purpose code
22
UC Berkeley
ESP: Ensembles of Specialized Processors
 General-purpose hardware, flexible but inefficient
 Fixed-function hardware, efficient but inflexible
 Par Lab Insight: Patterns capture common operations
across many applications, each with unique
communication & computation structure
 Build an ensemble of specialized engines, each
individually optimized for particular pattern but
collectively covering application needs
 Bet: Will give us efficiency plus flexibility
- Any given core can have a different mix of these depending
on workload
23
UC Berkeley
Par Lab: Motifs common across apps
Audio
Recognition
Object
Recognition
Dense
Sparse
…
Graph
Scene
Analysis
Applications
Berkeley View “Dwarfs”
or Motifs
24
Motif (nee “Dwarf”) Popularity
UC Berkeley
(Red Hot / Blue Cool)
Computing Domains
Par Lab Apps
25
UC Berkeley
Architecting Parallel Software
Application
Identify the Software
Structure
•Pipe-and-Filter
•Agent-and-Repository
•Event-based
•Bulk Synchronous
•Map-Reduce
•Layered Systems
•Model-view controller
•Arbitrary Task Graphs
•Puppeteer
•Model-View-Controller
Identify the Key
Computations
• Graph Algorithms
• Dynamic programming
• Dense/Spare Linear Algebra
• Un/Structured Grids
• Graphical Models
• Finite State Machines
• Backtrack Branch-and-Bound
• N-Body Methods
• Circuits
• Spectral Methods
• Monte-Carlo
UC Berkeley
Mapping Software to ESP: Specializers
Object
Recognition
Dense
Scene
Analysis
Audio
Recognition
Sparse
…
Graph
Applications
Berkeley View “Dwarfs”
or Motifs
Specializers with SEJITS Implementations and Autotuning
ESP
Code
Glue
Code
Dense
Code
Sparse
Code
Graph
Code
ILP
Engine
Dense
Engine
Sparse
Engine
Graph ESP
Engine Core
 Capture desired functionality at high-level using
patterns in a productive high-level language
 Use pattern-specific compilers (Specializers) with
autotuners to produce efficient low-level code
 ASP specializer infrastructure, open-source download
27
UC Berkeley
Replacing Fixed Accelerators with
Programmable Fabric
Intel Ivy Bridge (22nm)
Qualcomm Snapdragon MSM8960 (28nm)
 Future server and mobile
SoCs will have many fixedfunction accelerators and a
general-purpose
programmable multicore
 Fabric challenge is
retaining extreme energy
efficiency while retaining
programmability
28
Strawman Fabric Architecture
UC Berkeley
R
R
M
A
R
M
A
M
A
R
A
A
M
A
R
M
R
M
A
R
R
M
A
M
A
R
A
R
M
R
M
A
M
A
M
R
M
A
R
A
R
R
M
A
M
A
M
R
 Will never have a C compiler
 Only programmed using pattern-based
DSLs
 More dynamic, less static than earlier
approaches
 Dynamic dataflow-driven execution
 Dynamic routing
 Large memory support
29
UC Berkeley
“Agile Hardware” Development
 Current hardware design slow and arduous
 But now have huge design space to explore
 How to examine many design points efficiently?
 Build parameterized generators, not point designs!
 Adopt and adapt best practices from Agile Software
- Complete LVS-DRC clean physical design of current version
every ~ two weeks (“tapein”)
- Incremental feature addition
- Test & Verification first step
30
UC Berkeley
Chisel: Constructing Hardware In a
Scala Embedded Language
 Embed a hardware-description language in Scala, using
Scala’s extension facilities
 A hardware module is just a data structure in Scala
 Different output routines can generate different types of
output (C, FPGA-Verilog, ASIC-Verilog) from same
hardware representation
 Full power of Scala for writing hardware generators
- Object-Oriented: Factory objects, traits, overloading etc
- Functional: Higher-order funcs, anonymous funcs, currying
- Compiles to JVM: Good performance, Java interoperability
31
Chisel Design Flow
UC Berkeley
Chisel Program
Scala/JVM
C++
code
C++ Compiler
Software
Simulator
FPGA
Verilog
FPGA Tools
FPGA
Emulation
ASIC
Verilog
ASIC Tools
GDS
Layout
32
UC Berkeley
Chisel is much more than an HDL
 The base Chisel system allows you to use the full power
of Scala to describe the RTL of a design, then generate
Verilog or C++ output from the RTL
 But Chisel can be extended above with domain-specific
languages (e.g., signal processing) for fabric
 Importantly, Chisel can also be extended below with
new backends or to add new tools or features (e.g.,
quantum computing circuits)
 Only ~6,000 lines of code in current version including
libraries!
 BSD-licensed open source at:
chisel.eecs.berkeley.edu
33
CORE 3
VC3
CORE 1
VC1
CORE 0
VC0
Processor
Site
Clock
test
DCDC
site test site
512KB
L2
VFIXED
CORE 2
VC2
Test
Sites
UC Berkeley
Many processor tapeouts in few years
with small group (45nm, 28nm)
SRAM
test site
34
UC Berkeley
Resilient Circuits & Modeling
 Future scaled technologies have high variability but want to
run with lowest-possible margins to save energy
 Significant increase in soft errors, need resilient systems
 Technology modeling to determine tradeoff between MTBF
and energy per task for logic, SRAM, & interconnect.
Techniques to reduce
operating voltage can be
worse for energy due to
rapid rise in errors
35
Algorithms and Specializers for Provably Optimal
Implementations with Resiliency and Efficiency
Software
UC Berkeley
Scene
Analysis
Audio
Recognition
Object
Recognition
Applications
Computational and Structural Patterns
Dense
C-A GEMM
Sparse
C-A SpMV
…
Pipe&Filter
Graph
C-A BFS
…
Map-Reduce
Communication-Avoiding Algorithms
Specializers with SEJITS Implementations and Autotuning
Glue
Code
Dense
Code
Sparse
Code
ESP
Code
Dense Sparse Graph ESP
Engine Engine Engine Core
Hardware Cache Coherence
Local Stores + DMA
Hardware Generators using Chisel HDL
ILP
Engine
Hardware
Graph
Code
C++
FPGA
Simulation
Emulation
Validation/Verification
ESP (Ensembles of
Specialized Processors)
Architecture
FPGA
ASIC
Computer
SoC
Implementation Technologies
Deep HW/SW
Design-Space
Exploration
36
UC Berkeley
ASPIRE Project
 Initial $15.6M/5.5 year funding from DARPA PERFECT
program
- Started 9/28/2012
- Located in Par Lab space + BWRC
 Looking for industrial affiliates (see Krste!)
 Open House today, 5th floor Soda Hall
Research funded by DARPA Award Number HR0011-12-2-0016. Approved for public release;
distribution is unlimited. The content of this presentation does not necessarily reflect the
position or the policy of the US government and no official endorsement should be inferred.
37

similar documents