CRUISE: Cache Replacement and Utility-Aware Scheduling
Aamer Jaleel, Hashem H. Najaf-abadi, Samantika Subramaniam,
Simon Steely Jr., Joel Emer
Intel Corporation, VSSAD
[email protected]
Architectural Support for Programming Languages and Operating Systems (ASPLOS 2012)
Motivation

• Shared last-level cache (LLC) is common as core counts increase
• As the number of concurrently running applications grows, contention for the shared cache grows

[Figure: cache hierarchies — single core (SMT) with a private L1 and LLC; dual core (ST/SMT) with private L1s and a shared LLC; quad-core (ST/SMT) with private L1s and L2s and a shared LLC]
Problems with LRU-Managed Shared Caches

• Conventional LRU policy allocates resources based on rate of demand
– Applications that have no cache benefit cause destructive cache interference

[Figure: misses per 1000 instructions (under LRU) and cache occupancy (0–100%) under LRU replacement for h264ref and soplex sharing a 2MB cache]
Addressing Shared Cache Performance

• Conventional LRU policy allocates resources based on rate of demand
– Applications that have no cache benefit cause destructive cache interference
• State-of-Art Solutions:
– Improve Cache Replacement (HW)
– Modify Memory Allocation (SW)
– Intelligent Application Scheduling (SW)

[Figure: same MPKI and cache-occupancy data for h264ref and soplex as on the previous slide]
HW Techniques for Improving Shared Caches

• Modify cache replacement policy
• Goal: Allocate cache resources based on cache utility NOT demand

[Figure: a two-core shared LLC managed by LRU vs. by intelligent LLC replacement]
SW Techniques for Improving Shared Caches I

• Modify OS memory allocation policy
• Goal: Allocate pages to different cache sets to minimize interference

[Figure: an intelligent memory allocator (OS) maps each core's pages to different sets of an LRU-managed LLC]
SW Techniques for Improving Shared Caches II

• Modify scheduling policy using the Operating System (OS) or hypervisor
• Goal: Intelligently co-schedule applications to minimize contention

[Figure: four cores (C0–C3) and two LRU-managed LLCs (LLC0, LLC1); co-scheduling decides which pairs of applications share each LLC]
SW Techniques for Improving Shared Caches

• Four applications (A, B, C, D) on a 4-core CMP: C0/C1 share LLC0, C2/C3 share LLC1
• Three possible schedules:
• A, B | C, D
• A, C | B, D
• A, D | B, C

[Figure: example throughputs of the three schedules (4.9, 5.5, 6.3 — optimal vs. worst differ by ~30%); Optimal / Worst schedule throughput ratio plotted across ~1500 workloads on the baseline system (4-core CMP, 3-level hierarchy, LRU-managed LLC): ~9% on average]
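For a four-core machine with two dual-core LLCs, the three schedules above are just the distinct ways of choosing a partner for the first application. A minimal sketch of that enumeration (hypothetical helper, not from the paper):

```python
def two_llc_schedules(apps):
    """Enumerate the distinct 2+2 splits of four applications across two
    LLCs.  Fixing apps[0] on the first LLC and choosing its partner
    avoids counting mirrored or reordered schedules twice."""
    first, rest = apps[0], apps[1:]
    schedules = []
    for partner in rest:
        others = tuple(x for x in rest if x != partner)
        schedules.append(((first, partner), others))
    return schedules

print(two_llc_schedules(["A", "B", "C", "D"]))
# three schedules: A,B | C,D   A,C | B,D   A,D | B,C
```

A co-scheduler would score each of the three splits (e.g., by predicted throughput) and pick the best; the slide's point is that the gap between the best and worst split can reach ~30%.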
Interactions Between Co-Scheduling and Replacement

• Existing co-scheduling proposals are evaluated on LRU-managed LLCs
• Question: Is intelligent co-scheduling necessary with improved cache replacement policies, e.g., DRRIP Cache Replacement [Jaleel et al., ISCA'10]?
Interactions Between Optimal Co-Scheduling and Replacement

(4-core CMP, 3-level hierarchy, per-workload comparison, 1365 4-core multi-programmed workloads)

• Category I: No need for intelligent co-scheduling under both LRU/DRRIP
• Category II: Require intelligent co-scheduling only under LRU
• Category III: Require intelligent co-scheduling only under DRRIP
• Category IV: Require intelligent co-scheduling under both LRU/DRRIP

[Scatter plot: Optimal / Worst schedule under DRRIP (y-axis, 1.00–1.28) vs. Optimal / Worst schedule under LRU (x-axis, 1.00–1.28), one point per workload]
Observation: the need for intelligent co-scheduling is a function of the replacement policy
• Category II workloads (require intelligent co-scheduling only under LRU): with LRU-managed LLCs, re-scheduling these workloads across the two LLCs makes a significant difference

[Scatter plot: Category II region highlighted — Optimal / Worst well above 1.00 under LRU, near 1.00 under DRRIP]
• With DRRIP-managed LLCs, the same Category II workloads perform well under any schedule

No re-scheduling is necessary for Category II workloads in DRRIP-managed LLCs
Opportunity for Intelligent Application Co-Scheduling

• Prior Art:
• Evaluated using inefficient cache policies (i.e., LRU replacement)
• Proposal — Cache Replacement and Utility-aware Scheduling:
• Understand how apps access the LLC (in isolation)
• Schedule applications based on how they can impact each other
• (Keep the LLC replacement policy in mind)
Memory Diversity of Applications (In Isolation)

Applications fall into four classes (assuming a 4MB shared LLC):
• Core Cache Fitting (CCF), e.g. povray
• LLC Friendly (LLCFR), e.g. bzip2
• LLC Thrashing (LLCT), e.g. bwaves
• LLC Fitting (LLCF), e.g. sphinx3

[Figure: per-class cache-hierarchy diagrams — cores with private L2s over a shared LLC]
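One way to picture the four classes is as a simple decision rule over an application's isolated LLC accesses and misses per kilo-instruction. The thresholds below, and the use of a half-cache MPKI to separate LLCF from LLCFR, are illustrative assumptions, not values from the paper:

```python
def classify(apki, mpki_full, mpki_half):
    """Classify an application's LLC utility measured in isolation.
    apki      - LLC accesses per kilo-instruction
    mpki_full - isolated misses per kilo-instruction with the whole LLC
    mpki_half - isolated MPKI when restricted to half the LLC ways
    Thresholds are illustrative, not taken from the paper."""
    LOW_APKI = 1.0    # assumed: few LLC accesses per kilo-instruction
    HIGH_MPKI = 5.0   # assumed: the cache is giving little benefit
    if apki < LOW_APKI:
        return "CCF"    # fits in the core's private caches
    if mpki_full >= HIGH_MPKI:
        return "LLCT"   # misses frequently even with the whole LLC
    if mpki_half >= HIGH_MPKI:
        return "LLCF"   # fits only when given the majority of the LLC
    return "LLCFR"      # benefits from, and can share, the LLC
```

The half-cache measurement is what distinguishes an LLC Fitting app (hurt badly when squeezed) from an LLC Friendly one (degrades gracefully); the RICE hardware described later provides exactly these isolated full- and half-cache statistics.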
Cache Replacement and Utility-aware Scheduling (CRUISE)

• Core Cache Fitting (CCF) Apps:
• Infrequently access the LLC
• Do not rely on the LLC for performance
• Co-scheduling multiple CCF jobs on the same LLC "wastes" that LLC
• Best to spread CCF applications across available LLCs

[Figure: two CCF apps placed on different LLCs of a 4-core, 2-LLC system]
Cache Replacement and Utility-aware Scheduling (CRUISE)

• LLC Thrashing (LLCT) Apps:
• Frequently access the LLC
• Do not benefit at all from the LLC
• Under LRU, LLCT apps degrade the performance of other applications
• Co-schedule LLCT apps with LLCT apps

[Figure: two LLCT apps packed onto the same LLC]
Cache Replacement and Utility-aware Scheduling (CRUISE)

• LLC Thrashing (LLCT) Apps:
• Frequently access the LLC
• Do not benefit at all from the LLC
• Under DRRIP, LLCT apps do not degrade the performance of co-scheduled apps
• Best to spread LLCT apps across available LLCs to efficiently utilize cache resources

[Figure: two LLCT apps placed on different LLCs]
Cache Replacement and Utility-aware Scheduling (CRUISE)

• LLC Fitting (LLCF) Apps:
• Frequently access the LLC
• Require the majority of the LLC
• Behave like LLCT apps if they do not receive the majority of the LLC
• Best to co-schedule LLCF with CCF applications (if present)
• If no CCF app is present, schedule with LLCF/LLCT

[Figure: an LLCF app paired with a CCF app on one LLC]
Cache Replacement and Utility-aware Scheduling (CRUISE)

• LLC Friendly (LLCFR) Apps:
• Rely on the LLC for performance
• Can share the LLC with similar apps
• Co-scheduling multiple LLCFR jobs on the same LLC does not hurt performance

[Figure: two LLCFR apps sharing one LLC]
CRUISE for LRU-managed Caches (CRUISE-L)

• Applications: LLCT, LLCT, LLCF, CCF
• Co-schedule apps as follows:
• Co-schedule LLCT apps with LLCT apps
• Spread CCF applications across LLCs
• Co-schedule LLCF apps with CCF
• Fill LLCFR apps onto free cores

[Figure: resulting schedule — LLCT, LLCT | LLCF, CCF on the two LLCs]
CRUISE for DRRIP-managed Caches (CRUISE-D)

• Applications: LLCT, LLCT, LLCFR, CCF
• Co-schedule apps as follows:
• Spread LLCT apps across LLCs
• Spread CCF apps across LLCs
• Co-schedule LLCF with CCF/LLCT apps
• Fill LLCFR apps onto free cores

[Figure: resulting schedule — LLCFR, LLCT | CCF, LLCT on the two LLCs]
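The CRUISE-L and CRUISE-D placement rules above can be sketched as a greedy scheduler. This is an illustrative sketch, not the paper's exact algorithm; the helper names and the least-loaded tie-breaking are assumptions:

```python
def cruise_schedule(apps, policy="DRRIP", n_llcs=2, cores_per_llc=2):
    """Greedy sketch of the CRUISE placement rules.  `apps` is a list
    of unique (name, cls) pairs, cls in {"LLCT","CCF","LLCF","LLCFR"}."""
    cls = dict(apps)                      # app name -> class
    llcs = [[] for _ in range(n_llcs)]

    def free():
        return [l for l in llcs if len(l) < cores_per_llc]

    def place_spread(name):
        # Least-loaded LLC first, so same-class apps end up apart.
        min(free(), key=len).append(name)

    def place_with(name, wanted):
        # Prefer an LLC already holding one of the wanted classes.
        for l in free():
            if any(cls[x] in wanted for x in l):
                l.append(name)
                return
        place_spread(name)

    def group(k):
        return [n for n, c in apps if c == k]

    if policy == "LRU":                   # CRUISE-L rules
        for n in group("LLCT"):
            place_with(n, {"LLCT"})       # pack thrashing apps together
        for n in group("CCF"):
            place_spread(n)               # spread CCF across LLCs
        for n in group("LLCF"):
            place_with(n, {"CCF"})        # pair LLCF with a CCF app
    else:                                 # CRUISE-D rules
        for n in group("LLCT"):
            place_spread(n)               # spread LLCT across LLCs
        for n in group("CCF"):
            place_spread(n)               # spread CCF across LLCs
        for n in group("LLCF"):
            place_with(n, {"CCF", "LLCT"})
    for n in group("LLCFR"):
        place_spread(n)                   # LLCFR fills the free cores
    return llcs
```

On the two slide examples this reproduces the pictured schedules: under LRU the two LLCT apps land together and LLCF pairs with CCF; under DRRIP the two LLCT apps land on different LLCs.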
Experimental Methodology

• System Model:
• 4-wide OoO processor (Core i7 type)
• 3-level memory hierarchy (Core i7 type)
• Application Scheduler
• Workloads:
• Multi-programmed combinations of SPEC CPU2006 applications
• ~1400 4-core multi-programmed workloads (2 cores/LLC)
• ~6400 8-core multi-programmed workloads (2 cores/LLC, 4 cores/LLC)
[Figure: baseline system — applications A, B, C, D on cores C0–C3; C0/C1 share LLC0, C2/C3 share LLC1]
CRUISE Performance on Shared Caches

(4-core CMP, 3-level hierarchy, averaged across all 1365 multi-programmed workload mixes)

[Bar chart: performance relative to worst schedule (1.00–1.10) for Random, Distributed Intensity (ASPLOS'10), CRUISE-L / CRUISE-D, and Optimal, on an LRU-managed and a DRRIP-managed LLC]

• CRUISE provides near-optimal performance
• The optimal co-scheduling decision is a function of the LLC replacement policy
Classifying Application Cache Utility in Isolation

How do you know an application's classification at run time?

✗ Profiling:
• Application provides memory intensity at run time
✗ HW Performance Counters:
• Assume isolated cache behavior is the same as shared cache behavior
• Periodically pause adjacent cores at runtime
✓ Proposal: Runtime Isolated Cache Estimator (RICE)
• Architecture support to estimate isolated cache behavior while still sharing the LLC
Runtime Isolated Cache Estimator (RICE)

• Assume a cache shared by 2 applications: APP0 and APP1
• A few dedicated sets monitor each application's isolated cache behavior: only APP0 fills to its sample sets (all other apps bypass them), and likewise only APP1 fills to its sample sets; the remaining sets are follower sets
• Per-application access and miss counters on the sampled sets compute the isolated hit/miss rates (apki, mpki) < P0, P1, P2, P3 >
• 32 sample sets per APP, 15-bit hit/miss counters

[Figure: set-level and high-level views of the cache, showing APP0/APP1 sample sets and follower sets]
Runtime Isolated Cache Estimator (RICE)

• Assume a cache shared by 2 applications: APP0 and APP1
• Additional sample sets monitor isolated cache behavior as if only half the cache were available: APP0 fills to only half the ways in those sets, while all other apps use them normally
• The half-cache estimate is needed to classify LLCF applications
• Per-application counters track full-cache (Access-F / Miss-F) and half-cache (Access-H / Miss-H) behavior to compute isolated hit/miss rates (apki, mpki) < P0, P1, P2, P3 >
• 32 sample sets per APP, 15-bit hit/miss counters

[Figure: set-level and high-level views of the cache with half-way sample sets and follower sets]
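The set-sampling idea can be sketched as a static mapping from set index to owning application. The LLC set count and the modulus-based selection below are assumptions for illustration; the slides only fix 32 sampled sets per application and 15-bit counters:

```python
NUM_SETS = 2048      # assumed LLC set count (not stated on the slide)
SAMPLER_SETS = 32    # per the slide: 32 sampled sets per application

def sampler_owner(set_index, num_apps=2):
    """Return the app id whose isolated behavior is measured in this
    set, or None for a follower set.  Evenly spaced sets are dedicated
    round-robin to the apps; only the owner fills into its sets, so
    hits/misses there approximate its behavior when running alone."""
    stride = NUM_SETS // (SAMPLER_SETS * num_apps)
    if set_index % stride != 0:
        return None                     # follower set: normal sharing
    return (set_index // stride) % num_apps

def on_llc_access(app, set_index, hit, counters, num_apps=2):
    """Bump the per-app access/miss counters (15-bit in hardware) when
    an access by `app` lands in one of its own sampler sets."""
    if sampler_owner(set_index, num_apps) == app:
        counters[app]["access"] += 1
        if not hit:
            counters[app]["miss"] += 1
```

Dividing the accumulated access and miss counts by the instructions retired in the sampling interval yields the isolated apki/mpki estimates that feed the CRUISE classification.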
Performance of CRUISE using RICE Classifier

[Bar chart: performance relative to worst schedule (0.95–1.30) for CRUISE with the dynamic RICE classifier, Distributed Intensity (ASPLOS'10), and Optimal]

• CRUISE using the dynamic RICE classifier is within 1–2% of optimal
Summary

• Optimal application co-scheduling is an important problem
• Useful for future multi-core processors and virtualization technologies
• Co-scheduling decisions are a function of the replacement policy
• Our Proposal:
• Cache Replacement and Utility-aware Scheduling (CRUISE)
• Architecture support for estimating isolated cache behavior (RICE)
• CRUISE is scalable and performs similar to optimal co-scheduling
• RICE requires negligible hardware overhead
Q&A