Dark Silicon and the End of Multicore Scaling

The Dark Silicon Implications for Microprocessors
Karu Sankaralingam
University of Wisconsin-Madison
Collaborators: Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, and Doug Burger
Multicore Decade?
We have relied on multicore scaling for over five years.
[Timeline figure with ticks at 2000, 2005, 2010, and 2015, marking Pentium Extreme Dual-Core, Core 2 Quad-Core, and i7 980x Hex-Core, followed by "?"]
How much longer will it be our primary performance scaling technique?
Finding Optimal Multicore Designs
Comprehensive design space:
• Fixed area budget
• Fixed power budget
• Two sets of CMOS scaling projections
• Optimal core and diverse multicore organizations
• Parallel benchmarks
For the next five technology generations, we find the best-performing multicore from a comprehensive design space search for each of the PARSEC benchmarks.
Symmetric Multicore Projections
[Figure: speedup (0 to 20) vs. year (0 to 10) for the Target and Symmetric curves. Labeled values: 18x for the target; 3.4x in 10 years for symmetric multicores.]
Symmetric multicores alone will not sustain the multicore era.
Multicore Solutions
Asymmetric Topologies
[Figure: speedup (0 to 20) vs. year (0 to 10); curves: Target, Symmetric, Asymmetric. Labeled speedup: 3.5x.]
Multicore Solutions
Dynamic Topologies
[Figure: speedup (0 to 20) vs. year (0 to 10); curves: Target, Symmetric, Asymmetric, Dynamic. Labeled speedup: 3.5x.]
[Chakraborty (2008), Suleman et al (2009)]
Multicore Solutions
Composed/Fused Topologies
[Figure: speedup (0 to 20) vs. year (0 to 10); curves: Target, Symmetric, Asymmetric, Dynamic, Composed. Labeled speedup: 3.7x.]
[Ipek et al (2007), Kim et al (2007)]
Multicore Solutions
GPU-Style Cores
[Figure: speedup (0 to 20) vs. year (0 to 10); curves: Target, Symmetric, Asymmetric, Dynamic, Composed, GPU. Labeled speedup: 2.7x.]
Multicore Era Projections
[Figure: speedup (0 to 20) vs. year (0 to 10); curves: Target and Composed. Labeled speedups: 18x and 3.7x.]
The best designs speed up 14% per year rather than the recent trend of 34% per year.
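The 14% and 34% figures are consistent with compounding the labeled 10-year speedups (3.7x and 18x) annually. A minimal Python sketch of that arithmetic follows; the function name is illustrative, and reading the slide's rates as compound annual rates over exactly 10 years is an assumption.

```python
def annual_growth_rate(total_speedup: float, years: float) -> float:
    """Compound annual growth rate implied by a total speedup over `years`."""
    return total_speedup ** (1.0 / years) - 1.0


# Endpoints taken from the slide; the 10-year horizon is assumed.
best_design_rate = annual_growth_rate(3.7, 10)
target_rate = annual_growth_rate(18.0, 10)

print(f"best designs: {best_design_rate:.1%} per year")
print(f"target trend: {target_rate:.1%} per year")
```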
Why Diminishing Returns?
• Transistor area is still scaling
• Voltage and capacitance scaling have slowed
• Result: designs are power, not area, limited
Overview
Devices
• Find the best case technology scaling
Cores
• Find the best cores
Multicores
• Find the best multicore organization
Projections
• Predict best case multicore performance for
each technology generation
Device Scaling Projections
From 45 nm to 8 nm:
             Conservative     Optimistic
             [Borkar 2007]    [ITRS 2010]
Area         32x              32x
Power        4.5x             8.3x
Frequency    1.3x             3.9x
Modeling Ideal Core Power/Perf.
[Figure: power (TDP, watts, 0 to 30) vs. SPECmark score (0 to 40) for Intel Nehalem, AMD Shanghai, Intel Core, and Intel Atom processors; the Atom and Nehalem points are labeled.]
The Pareto frontier includes all optimal power/performance points.
Repeat using core area for optimal area/performance points.
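One way to read "all optimal power/performance points" is as the set of non-dominated designs: no other design delivers at least the same performance at no more power. The sketch below applies that plain filter in Python. The slides do not say how the frontier was actually derived, so the filter, the function name, and the sample points are illustrative assumptions, not measured data.

```python
def pareto_frontier(points):
    """Return the non-dominated (performance, power) points.

    A point is kept only if it has higher performance than every
    point that uses the same or less power.
    """
    frontier = []
    best_perf = float("-inf")
    # Sort by power ascending; break ties by performance descending.
    for perf, power in sorted(points, key=lambda p: (p[1], -p[0])):
        if perf > best_perf:
            frontier.append((perf, power))
            best_perf = perf
    return frontier


# Hypothetical (SPECmark score, watts) pairs, for illustration only.
designs = [(5, 2), (10, 6), (8, 9), (20, 15), (18, 20), (30, 25)]
frontier = pareto_frontier(designs)
print(frontier)
```

The same filter can be rerun with core area in place of power to obtain the area/performance frontier.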
Combining Device and Core Models
[Figure: power (TDP, watts, 0 to 30) vs. SPECmark score (0 to 40); device scaling maps the 45 nm frontier to the 32 nm frontier.]
What belongs in multicore model?
Styles
Number of Threads, Cache Sizes
Topologies
Pareto Frontiers
Area & Power Budget
Architectures
Cache & memory latencies, memory bandwidth
Area & Power / Performance Tradeoffs
Applications
PARSEC fparallel, Data Use
Multicore Speedup Model
\[
\text{Multicore Speedup} = \frac{1}{\dfrac{1 - f_{\text{parallel}}}{\text{Serial Speedup}} + \dfrac{f_{\text{parallel}}}{\text{Parallel Speedup}}}
\]
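The speedup model is an Amdahl's-law-style combination of the serial and parallel speedups, weighted by the parallel fraction. A minimal Python sketch follows; the function and argument names are illustrative, and the sample inputs are made up rather than taken from the study.

```python
def multicore_speedup(f_parallel: float,
                      serial_speedup: float,
                      parallel_speedup: float) -> float:
    """Overall speedup from the serial and parallel portions of a program."""
    if not 0.0 <= f_parallel <= 1.0:
        raise ValueError("f_parallel must lie in [0, 1]")
    serial_time = (1.0 - f_parallel) / serial_speedup
    parallel_time = f_parallel / parallel_speedup
    return 1.0 / (serial_time + parallel_time)


# Hypothetical inputs: 99% parallel code, no serial speedup,
# and a 100x speedup on the parallel portion.
example = multicore_speedup(0.99, 1.0, 100.0)
print(f"{example:.2f}x")
```

Even with 99% parallel code, the remaining 1% of serial work roughly halves the achievable speedup in this example.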
Multicore Performance Model
Performance is limited by:
Memory bandwidth
BWmax / (bytes fetched from memory per instruction)
and
Computation
Ncores × (core frequency / CPIexe) × core utilization
[Guz et al, 2009]
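The two limits can be combined by taking the smaller one. The Python sketch below does that, assuming the bandwidth term is expressed as bytes fetched from memory per instruction so that both limits come out in instructions per second. All names and sample numbers are illustrative assumptions.

```python
def multicore_performance(n_cores: int,
                          freq_hz: float,
                          cpi_exe: float,
                          utilization: float,
                          bw_max_bytes_per_s: float,
                          bytes_per_instruction: float) -> float:
    """Instructions per second under the compute and bandwidth limits."""
    compute_limit = n_cores * (freq_hz / cpi_exe) * utilization
    bandwidth_limit = bw_max_bytes_per_s / bytes_per_instruction
    return min(compute_limit, bandwidth_limit)


# Hypothetical chip: 8 cores at 3 GHz, CPIexe of 1.5, 80% utilization,
# 20 GB/s of memory bandwidth, 2 bytes of memory traffic per instruction.
perf = multicore_performance(8, 3e9, 1.5, 0.8, 2e10, 2.0)
print(f"{perf:.3g} instructions/s")
```

In this example the bandwidth limit is the smaller of the two, so adding cores would not raise performance.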
Core Utilization Model
Core utilization is limited by:
Fraction of Time Core is Ready to Issue
Number of Threads in Core / Number of Threads to Keep Busy
[Guz et al, 2009]
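Read as a ratio, the utilization limit says a core stays busy only if it holds enough threads to hide stalls. A short Python sketch follows; capping the ratio at 1 is an assumption added here (a core cannot be more than fully utilized), and the names are illustrative.

```python
def core_utilization(threads_in_core: float,
                     threads_to_keep_busy: float) -> float:
    """Fraction of time the core is ready to issue, capped at 1.0."""
    if threads_to_keep_busy <= 0:
        raise ValueError("threads_to_keep_busy must be positive")
    # Assumption: utilization saturates once enough threads are present.
    return min(1.0, threads_in_core / threads_to_keep_busy)


print(core_utilization(2, 8))
print(core_utilization(16, 8))
```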
Multicore Model & Pareto Frontiers
[Figure: A(q) and P(q) plotted against q (0 to 40), sampled at 100 points.]
Translating from SPECmark
1. From q, find the core's SPECmark speedup.
2. Frequency is linearly distributed from Atom to Nehalem.
3. Recall: the model predicts benchmark performance as f(benchmark characteristics, frequency, CPIexe).
4. Compute CPIexe such that Benchmark Speedup = SPECmark Speedup.
Area and Power Constraints
Ncores × A(q) ≤ Area Budget
Ncores × P(q) ≤ Power Budget
Dark silicon = 1 - Ncores / (number of cores that fit in the chip area)
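The constraints can be applied by counting how many cores each budget allows and powering the smaller number. The Python sketch below does this and reports dark silicon as the share of area-feasible cores that cannot be powered. That reading of "dark silicon" is an assumption, as are the names and sample budgets.

```python
import math


def core_limits(area_budget: float, power_budget: float,
                core_area: float, core_power: float):
    """Return (cores that can be powered, cores that fit in the area)."""
    fit_by_area = math.floor(area_budget / core_area)
    fit_by_power = math.floor(power_budget / core_power)
    n_cores = min(fit_by_area, fit_by_power)
    return n_cores, fit_by_area


def dark_silicon_fraction(n_cores: int, fit_by_area: int) -> float:
    """Share of the cores that fit on the chip but must stay powered off."""
    return 1.0 - n_cores / fit_by_area


# Hypothetical budgets: 100 mm^2 and 50 W, with 5 mm^2, 5 W cores.
n_cores, fit = core_limits(100.0, 50.0, 5.0, 5.0)
dark = dark_silicon_fraction(n_cores, fit)
print(n_cores, fit, dark)
```

Here 20 cores fit in the area but only 10 can be powered, so half of the chip is dark.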
Dark Silicon
[Figure: percentage of dark silicon (0% to 100%) at 8 nm under ITRS and conservative scaling, for blackscholes, bodytrack, canneal, ferret, streamcluster, and the geometric mean (GM). Labeled values: 71%, 51%, 17%, 26%.]
Sources of dark silicon: power + limited parallelism
Overall Performance
[Figure: speedup (0 to 20) vs. year (0 to 10); curves: Target, ITRS: All Topologies, Conservative: All Topologies, ITRS: Symmetric, Conservative: Symmetric. Labeled speedups: 18x, 16x, 8x, 6x, 3x. Annotation: fparallel = 0.99.]
Conclusions
Multicore performance gains are limited
Unicore Era → Multicore Era → ?
Need at least 18%-40% per generation from architecture alone without additional power
Specialization
Shrinking chips
Pervasive
Efficiency