CoScale: Coordinating CPU and Memory
System DVFS in Server Systems
Qingyuan Deng, David Meisner+, Abhishek Bhattacharjee,
Thomas F. Wenisch*, and Ricardo Bianchini
Rutgers University, +Facebook Inc., *University of Michigan
Server power challenges
[Chart: power breakdown (CPU, memory, other) as a percentage of server power for ILP, MID, MEM, and MIX workloads]
• CPU and memory power represent the vast majority of server power
Need to conserve both CPU and memory energy
• Related work
• Many prior works on CPU DVFS
• MemScale: Active low-power modes for memory [ASPLOS11]
• Uncoordinated DVFS causes poor behavior
• Conflicts, oscillations, instability
• May not generate the best energy savings
• Difficult to bound the performance degradation
• Need coordinated CPU and memory DVFS to achieve best results
• Challenge: Constrain the search space to good frequency combinations
CoScale: Coordinating CPU and memory DVFS
• Key goal
• Conserve significant energy while meeting performance constraints
• Hardware mechanisms
• New performance counters
• Frequency scaling (DFS) of the channels, DIMMs, DRAM devices
• Voltage & frequency scaling (DVFS) of memory controller, CPU cores
• Approach
• Online profiling to estimate performance and power consumption
• Epoch-based modeling and control to meet performance constraints
• Main result
• Energy savings of up to 24% (16% on average) within a 10% performance target; 4% on average within a 1% performance target
Outline
• Motivation and overview
• CoScale
• Results
• Conclusions
CoScale design
• Goal: Minimize energy under user-specified performance bound
• Approach: epoch-based OS-managed CPU / mem freq. tuning
• Each epoch (e.g., an OS quantum):
1. Profile performance and CPU/memory boundedness
   • Performance counters track memory CPI, CPU CPI, and cache performance
2. Efficiently search for the best frequency combination
   • Models estimate CPU/memory performance and power
3. Re-lock to the best frequencies; continue tracking performance
   • Slack: the delta between estimated and observed performance
4. Carry slack forward into the performance target for the next epoch
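The four steps above can be sketched as a control loop. This is a hedged illustration with invented callback names (`profiler`, `search`, `apply_freqs`), not the paper's actual OS implementation:

```python
# Hedged sketch of the epoch-based control loop; all names are invented
# placeholders, not the actual CoScale/OS interfaces.

def run_epochs(profiler, search, apply_freqs, perf_bound, num_epochs):
    """perf_bound: allowed performance degradation per epoch (e.g., 0.10)."""
    slack = 0.0  # positive slack: last epoch ran faster than its target
    for _ in range(num_epochs):
        # 1. Profile: counters give memory CPI, CPU CPI, cache behavior.
        profile = profiler.read_counters()
        # 2. Search: models pick the best frequency combination that keeps
        #    estimated degradation within this epoch's slack-adjusted target.
        target = perf_bound + slack
        core_freqs, mem_freq = search(profile, target)
        # 3. Re-lock frequencies, run the epoch, observe actual performance.
        estimated, observed = apply_freqs(core_freqs, mem_freq)
        # 4. Carry the estimation error forward as slack for the next epoch.
        slack = estimated - observed
    return slack
```

Positive slack (the epoch ran faster than its target) loosens the next epoch's target; negative slack tightens it.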
Frequency and slack management
[Figure: timeline over four epochs; each epoch profiles, estimates performance/energy via models, calculates slack vs. the target, and re-locks core and memory-subsystem (MC, bus, DRAM) frequencies; positive slack permits lower frequencies, negative slack forces higher ones]
Frequency search algorithm
[Figure: the full search space, memory frequency × per-core frequency]
• Offline exhaustive search is impractical: O(M × C^N)
  M: number of memory frequencies
  C: number of CPU frequencies
  N: number of CPU cores
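Plugging in the configuration evaluated later in the talk (10 memory frequencies, 10 core frequencies, 16 cores) shows the scale of the problem:

```python
# Size of the exhaustive search space O(M * C**N) for the evaluated setup:
# M = 10 memory frequencies, C = 10 core frequencies, N = 16 cores.
M, C, N = 10, 10, 16
print(M * C**N)  # 100000000000000000 -- 10**17 combinations per epoch
```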
Frequency search algorithm
[Figure: CoScale's walk through the search space; at each step, the component (memory or a core) whose next-lower frequency gives the best ratio is stepped down]
• Metric: ΔPower/ΔPerformance
• Core grouping: balance the impact of memory and cores
• Complexity: O(N·M + N²·C)
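A minimal sketch of this greedy descent, assuming simple callback-style power and performance models (all names are invented; the real algorithm additionally groups similarly-behaving cores to cut the candidates per step):

```python
# Sketch of a greedy frequency search in the spirit of CoScale: at each
# step, lower the one component frequency with the best ratio of power
# saved to performance given up, while staying within the bound.

def greedy_search(freqs, lower, power, perf_loss, bound):
    """freqs: dict of component -> current frequency (e.g., 'mem', 'core0').
    lower(c, f): next lower frequency for component c, or None if at minimum.
    power(freqs), perf_loss(freqs): model estimates for a combination.
    bound: maximum tolerable estimated performance loss."""
    freqs = dict(freqs)
    while True:
        best, best_ratio = None, 0.0
        for comp, f in freqs.items():
            nf = lower(comp, f)
            if nf is None:
                continue  # component already at its lowest frequency
            cand = dict(freqs, **{comp: nf})
            if perf_loss(cand) > bound:
                continue  # this step would violate the performance target
            dpow = power(freqs) - power(cand)           # power saved
            dperf = perf_loss(cand) - perf_loss(freqs)  # performance given up
            ratio = dpow / dperf if dperf > 0 else float('inf')
            if ratio > best_ratio:
                best, best_ratio = (comp, nf), ratio
        if best is None:
            return freqs  # no remaining move improves within the bound
        freqs[best[0]] = best[1]
```

Each iteration evaluates one candidate per component instead of the full cross product, which is what makes the search tractable per epoch.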
Outline
• Motivation and overview
• CoScale
• Results
• Conclusions
Methodology
• Detailed simulation
• 16 cores, 16MB LLC, 4 DDR3 channels, 8 DIMMs
• Multi-programmed workloads from SPEC suites
• Power modes
• Memory: 10 frequencies between 200 and 800 MHz
• CPU: 10 frequencies between 2.2GHz and 4GHz
• Power model
• Micron’s DRAM power model
• McPAT CPU power model
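As background on why voltage-and-frequency scaling pays off, a toy energy model (this illustrates the generic dynamic-power relation P ∝ C·V²·f, not the Micron or McPAT models used in the evaluation):

```python
# Toy illustration of why DVFS saves energy (not the paper's power models).
# Dynamic power ~ C * V^2 * f; assuming voltage scales linearly with
# frequency, power falls ~ f^3 while CPU-bound runtime grows only ~ 1/f.

def relative_energy(f_ratio, mem_bound_frac=0.0):
    """Energy at frequency f_ratio (relative to max) for a workload whose
    runtime is (1 - mem_bound_frac) CPU time plus mem_bound_frac memory
    time (memory time assumed unaffected by CPU frequency)."""
    power = f_ratio ** 3                                  # P ~ V^2 * f ~ f^3
    runtime = (1 - mem_bound_frac) / f_ratio + mem_bound_frac
    return power * runtime

# CPU-bound code at 80% frequency: ~64% of the energy for 25% more time.
print(round(relative_energy(0.8), 2))  # 0.64
```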
Results – energy savings and performance
[Charts: average energy savings (full-system, memory, and CPU energy) and performance overhead (multiprogram average and worst program in mix, against the performance-loss bound) for MEM, MID, ILP, MIX, and AVG workloads]
• Higher CPU energy savings on MEM; higher memory savings on ILP
• System energy savings of 16% (up to 24%); always within the performance bound
Alternative approaches
• Memory system DVFS only: MemScale
• CPU DVFS only
• Select the best combination of core frequencies
• Uncoordinated
• CPU & memory DVFS controllers make independent decisions
• Semi-coordinated
• CPU & memory DVFS controllers coordinate by sharing slack
• Offline
• Select the best combination of memory and core frequencies
• Unrealistic: the search space is exponential in the number of cores
Results – dynamic behavior
[Figure: per-epoch core frequency (GHz) and memory frequency (GHz) timelines over 25 epochs for (a) CoScale, (b) Uncoordinated, and (c) Semi-Coordinated]
Timeline of milc application in MIX2
Results – comparison to alternative approaches
[Charts: full-system energy savings and performance overhead (CPI increase; multiprogram average and worst in mix, against the performance-loss bound) for each approach]
• CoScale achieves energy savings comparable to Offline
• Uncoordinated fails to bound the performance loss
Results – Sensitivity Analysis
Impact of the performance bound
[Chart: system energy reduction and worst performance degradation for 1%, 5%, 10%, 15%, and 20% bounds; results for MID workloads]
Conclusions
• CoScale contributions:
• First coordinated DVFS strategy for CPU and memory
• New perf. counters to capture energy and performance
• Smart OS policy to choose best power modes dynamically
• Avg 16% (up to 24%) full-system energy savings
• Framework for coordination of techniques across components
• In the paper
• Details of search algorithm, performance counters, models
• Sensitivity analyses (e.g., rest-of-system power, prefetching)
• CoScale on in-order vs out-of-order CPUs
THANKS!
