Presentation - HPC User Forum

Runnemede: Disruptive Technologies for UHPC
John Gustafson
Intel Labs
HPC User Forum – Houston 2011
1
“We’re going to try to make the entire exascale machine cache-coherent.”
—Bill Dally, Nvidia
“Caches are for morons.”
—Shekhar Borkar, Intel
The battle lines are drawn…
2
Intel’s UHPC Approach
Design test chips with the idea of maximizing learning.
Very different from producing product roadmap processor designs.
Going from Peta to Exa is nothing like the last few 1000x increases…
3
Building with Today’s Technology
TFLOP Machine today

Component   Power    Basis
Compute     200W     200pJ per FLOP
Memory      150W     0.1B/FLOP @ 1.5nJ per Byte
Com         100W     100pJ com per FLOP
Disk        100W     10TB disk @ 1TB/disk @10W
Other       4450W    Decode and control, translations, power supply losses, cooling, etc.
Total       5KW

KW Tera, MW Peta, GW Exa?
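The component figures on this slide follow directly from the per-operation energies, and the closing question follows from multiplying the total by 1,000 and by 1,000,000. The sketch below is only a check of that arithmetic. It assumes the same technology is replicated linearly, and the names `budget_w` and `scale_linearly` are illustrative, not from the talk.

```python
# Rebuild the 1 TFLOP/s power budget from the per-operation energies on the slide.
FLOPS = 1e12  # sustained rate of the machine, in FLOP/s

budget_w = {
    "compute": 200e-12 * FLOPS,      # 200 pJ per FLOP
    "memory": 0.1 * 1.5e-9 * FLOPS,  # 0.1 B/FLOP at 1.5 nJ per byte
    "com": 100e-12 * FLOPS,          # 100 pJ of communication per FLOP
    "disk": (10 / 1) * 10,           # 10 TB at 1 TB per disk, 10 W per disk
    "other": 4450,                   # control, translations, supply losses, cooling
}
total_w = sum(budget_w.values())


def scale_linearly(watts_per_tflop, tflops):
    """Power if the same technology were simply replicated (an assumption)."""
    return watts_per_tflop * tflops


for name, tflops in [("Tera", 1), ("Peta", 1e3), ("Exa", 1e6)]:
    print(f"{name}: {scale_linearly(total_w, tflops):.3g} W")
```

Under that assumption the total moves from kilowatts to megawatts to gigawatts, which is the progression the slide questions.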
4
The Power & Energy Challenge

Component                                  TFLOP Machine today   TFLOP Machine then (with Exa Technology)
Other (control, losses, cooling, etc.)     4450W                 5W
Disk                                       100W                  ~3W
Com                                        100W                  ~5W
Memory                                     150W                  2W
Compute                                    200W                  5W
Total                                      5KW                   ~20W
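A quick check of what the target implies, using only the totals on this slide. It is a sketch that assumes an exaflop machine is one million such TFLOP units with no additional system-level overhead; the variable names are illustrative.

```python
# Totals per TFLOP/s taken from the slide.
today_w_per_tflop = 5000        # "5KW"
then_parts_w = [5, 3, 5, 2, 5]  # the five component figures (some marked "~")
then_w_per_tflop = sum(then_parts_w)

# Required improvement in energy efficiency.
improvement = today_w_per_tflop / then_w_per_tflop

# Linear extrapolation to an exaflop machine (assumption: no extra overhead).
TFLOPS_PER_EXAFLOP = 1_000_000
exaflop_mw = then_w_per_tflop * TFLOPS_PER_EXAFLOP / 1e6  # watts -> megawatts

print(f"improvement needed: {improvement:.0f}x")
print(f"exaflop machine: about {exaflop_mw:.0f} MW")
```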
5
Scaling Assumptions

Technology (High Volume)    45 nm    32 nm    22 nm    16 nm    11 nm    8 nm     5 nm
                            (2008)   (2010)   (2012)   (2014)   (2016)   (2018)   (2020)
Transistor density          1.75     1.75     1.75     1.75     1.75     1.75     1.75
Frequency scaling           15%      10%      8%       5%       4%       3%       2%
Vdd scaling                 -10%     -7.5%    -5%      -2.5%    -1.5%    -1%      -0.5%
SD Leakage scaling/micron   1X Optimistic to 1.43X Pessimistic

65 nm Core + Local Memory
  DP FP Add, Multiply; Integer Core, RF; Router: 5 mm² (50%)
  Memory 0.35MB: 5 mm² (50%)
  Total: 10 mm², 3 GHz, 6 GF, 1.8 W

8 nm Core + Local Memory
  DP FP Add, Multiply; Integer Core, RF; Router: 0.17 mm² (50%)
  Memory 0.35MB: 0.17 mm² (50%)
  Total: 0.34 mm² (~0.6mm), 4.6 GHz, 9.2 GF, 0.24 to 0.46 W
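The 8 nm core figures can be reproduced from the 65 nm core by compounding the per-generation factors in the scaling table. The sketch below assumes each column's factor applies on the transition into that node (six transitions from 65 nm to 8 nm) and that the core retires 2 FLOP per cycle, which is inferred from 6 GF at 3 GHz. Power is left out because the slide gives no capacitance scaling figure.

```python
# Per-generation factors for the 45, 32, 22, 16, 11 and 8 nm columns.
DENSITY_PER_GEN = 1.75
FREQ_GAIN = [0.15, 0.10, 0.08, 0.05, 0.04, 0.03]

# 65 nm starting point from the slide.
area_65_mm2 = 10.0
freq_65_ghz = 3.0
flops_per_cycle = 2  # assumption: one FP add plus one FP multiply per cycle

# Area shrinks by the density factor once per generation.
area_8_mm2 = area_65_mm2 / DENSITY_PER_GEN ** len(FREQ_GAIN)

# Frequency compounds the per-generation gains.
freq_8_ghz = freq_65_ghz
for gain in FREQ_GAIN:
    freq_8_ghz *= 1 + gain

gflops_8 = freq_8_ghz * flops_per_cycle

print(f"area  {area_8_mm2:.2f} mm^2")
print(f"clock {freq_8_ghz:.2f} GHz")
print(f"peak  {gflops_8:.2f} GF")
```

The results land within rounding of the slide's 0.34 mm², 4.6 GHz and 9.2 GF.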
6
Near Threshold Logic
[Figure: measurements plotted against Supply Voltage (V), 0.2 to 1.4 V, for 65nm CMOS at 50°C. Curves: Maximum Frequency (MHz), Total Power (mW), Energy Efficiency (GOPS/Watt), Active Leakage Power (mW). Annotations: Subthreshold Region, 320mV, 9.6X.]
H. Kaul et al, 16.6: ISSCC08
7
Revise DRAM Architecture
Energy cost today: ~150 pJ/bit
[Diagram labels: Signaling, M Control, DRAM Array, Page, Addr, RAS, CAS]

Traditional DRAM
• Activates many pages
• Lots of reads and writes (refresh)
• Small amount of read data is used
• Requires small number of pins

New DRAM architecture
• Activates few pages
• Read and write (refresh) what is needed
• All read data is used
• Requires large number of I/Os (3D)
8
Data Locality
Core-to-core communication on the chip: ~10 pJ per Byte
Chip to chip communication: ~100 pJ per Byte
Chip to memory communication: ~1.5 nJ per Byte, ~150 pJ per Byte
Data movement is expensive: keep it local
(1) Core to core, (2) Chip-to-chip, (3) Memory
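To make the cost of distance concrete, the sketch below converts the per-byte energies on this slide into watts for a fixed traffic rate. The traffic rate of 0.1 byte per FLOP at 1 TFLOP/s is borrowed from the earlier power budget slide, only the ~1.5 nJ figure is used for memory, and the names `PJ_PER_BYTE` and `movement_power_watts` are illustrative.

```python
# Energy to move one byte, in picojoules, as given on the slide.
PJ_PER_BYTE = {
    "core-to-core": 10,
    "chip-to-chip": 100,
    "chip-to-memory": 1500,  # ~1.5 nJ per byte
}


def movement_power_watts(bytes_per_second, pj_per_byte):
    """Power spent purely on moving data at the given rate."""
    return bytes_per_second * pj_per_byte * 1e-12


# Assumed traffic: 0.1 byte per FLOP on a 1 TFLOP/s machine.
traffic = 0.1 * 1e12

power = {level: movement_power_watts(traffic, pj) for level, pj in PJ_PER_BYTE.items()}
for level, watts in power.items():
    print(f"{level}: {watts:.0f} W")
```

The same traffic costs 150 times more power when it goes to memory than when it stays between cores on the chip.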
9
Disruptive Approach to Faults
We tend to assume that execution faults (soft errors, hard errors) are rare. And it’s a valid speculation. Currently.
Soon, we will need much more paranoia in hardware designs.
10
Road to Unreliability?

From Peta to Exa               Reliability Issues
1,000X parallelism             More hardware for something to go wrong
                               >1,000X intermittent faults due to soft errors
Aggressive Vcc scaling to      Gradual faults due to increased variations
reduce power/energy            More susceptible to Vcc droops (noise)
                               More susceptible to dynamic temp variations
                               Exacerbates intermittent faults (soft errors)
Deeply scaled technologies     Aging related faults
                               Lack of burn-in?
                               Variability increases dramatically

Resiliency will be the cornerstone
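The "more hardware for something to go wrong" point can be illustrated with the standard series-system model: if components fail independently at a constant rate, failure rates add, so the system mean time between failures is the component figure divided by the component count. The sketch below uses a purely hypothetical component MTBF. The number is not from the talk, and only the 1,000X ratio is.

```python
def system_mtbf_hours(component_mtbf_hours, n_components):
    """MTBF of a system that fails when any one of n independent,
    identical components fails (constant failure rates assumed)."""
    return component_mtbf_hours / n_components


# Hypothetical value, chosen only to make the ratio visible.
COMPONENT_MTBF_HOURS = 1_000_000

peta_components = 1_000                 # hypothetical baseline count
exa_components = peta_components * 1_000  # 1,000X parallelism

peta_mtbf = system_mtbf_hours(COMPONENT_MTBF_HOURS, peta_components)
exa_mtbf = system_mtbf_hours(COMPONENT_MTBF_HOURS, exa_components)

print(f"baseline system MTBF: {peta_mtbf:.0f} h")
print(f"1,000X system MTBF:   {exa_mtbf:.0f} h")
```

Whatever the real component figure is, the interval between failures shrinks by the same 1,000X under this model, before any effect of voltage scaling or aging is counted.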
11
Resiliency

Faults                 Example
Permanent faults       Stuck-at 0 & 1
Gradual faults         Variability, Temperature
Intermittent faults    Soft errors, Voltage droops
Aging faults           Degradation

Faults cause errors (data & control)
Datapath errors          Detected by parity/ECC
Silent data corruption   Need HW hooks
Control errors           Control lost (Blue screen)

Minimal overhead for resiliency
Layers: Applications; System Software; Programming system; Microcode, Platform; Microarchitecture; Circuit & Design
Steps: Error detection; Fault isolation; Fault confinement; Reconfiguration; Recovery & Adapt
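The difference between a detected datapath error and silent data corruption can be shown with a single even-parity bit, the simplest of the parity/ECC schemes the slide mentions. This is a minimal software sketch, not a description of any hardware in the talk, and the function names are illustrative.

```python
def parity(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of set bits."""
    return bin(word).count("1") % 2


def check(word: int, stored_parity: int) -> bool:
    """True if the word still matches the parity stored with it."""
    return parity(word) == stored_parity


word = 0b1011_0010
stored_parity = parity(word)

# A single bit flip (for example a soft error) changes the parity.
one_flip = word ^ (1 << 3)
one_flip_detected = not check(one_flip, stored_parity)

# Two flips cancel out in the parity bit and pass the check unnoticed.
two_flips = word ^ (1 << 3) ^ (1 << 5)
two_flips_detected = not check(two_flips, stored_parity)

print("one flip detected:", one_flip_detected)
print("two flips detected:", two_flips_detected)
```

The second case is corrupted data that passes the check, which is why silent data corruption needs hooks beyond simple parity.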
12
Execution Model and Codelets
• Codelet: code that can be executed non-preemptively with an “event-driven” model
• Shared memory model based on LC (Location Consistency, a generalized single-assignment model [GaoSarkar1980])
[Diagram layers: Programming Models/Systems (Rich); Sea of Codelets; Run Time System; Net; Cores; Peripherals/Devices; Hardware Abstraction; Advanced Hardware Monitoring]
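A toy illustration of the codelet idea as defined on this slide: a unit of code becomes runnable once the events it waits for have arrived, and then runs to completion without preemption. Everything here (`Codelet`, `run`, the event names) is hypothetical and single-threaded. It is not the Runnemede runtime or its API.

```python
from collections import deque


class Codelet:
    """A unit of work that fires once all awaited events have arrived."""

    def __init__(self, name, awaits, fn):
        self.name = name
        self.pending = set(awaits)  # events not yet seen
        self.fn = fn


def run(codelets, initial_events):
    """Deliver events one at a time; fire each codelet when it becomes ready."""
    events = deque(initial_events)
    waiting = list(codelets)
    order = []
    while events:
        event = events.popleft()
        for codelet in list(waiting):
            codelet.pending.discard(event)
            if not codelet.pending:
                waiting.remove(codelet)
                codelet.fn()                 # runs to completion, never preempted
                order.append(codelet.name)
                events.append(codelet.name)  # completion is itself an event
    return order


log = []
codelets = [
    Codelet("a", {"start"}, lambda: log.append("a ran")),
    Codelet("b", {"start"}, lambda: log.append("b ran")),
    Codelet("c", {"a", "b"}, lambda: log.append("c ran")),
]
order = run(codelets, ["start"])
print(order)
```

Codelet `c` never polls or blocks. It is scheduled only after both `a` and `b` have signalled completion, which is the event-driven behaviour the slide describes.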
13
Summary
Voltage scaling to reduce power and energy
• Explodes parallelism
• Cost of communication vs computation—critical balance
• Resiliency to combat side-effects and unreliability
Programming system for extreme parallelism
Application driven, HW/SW co-design approach
Self-awareness & execution model to harmonize
14
