Meeting Midway: Improving CMP
Performance with Memory-Side Prefetching
Praveen Yedlapalli, Jagadish Kotra, Emre Kultursay, Mahmut Kandemir,
Chita R. Das and Anand Sivasubramaniam
The Pennsylvania State University
Summary
• In modern multi-core systems, an increasing number
of cores share common resources
– “Memory Wall”
• Application/core contention leads to interference
• Proposal: a novel memory-side prefetching scheme
– Mitigates interference while exploiting row buffer locality
• Average 10% improvement in application
performance
Outline
• Background
• Motivation
• Memory-Side Prefetching
• Evaluation
• Conclusion
Network On-Chip based CMP
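[Figure: NoC-based CMP. Labels: tiles containing C, R, L1, and L2; memory controllers MC0 to MC3; a request message and a response message traversing the network.]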
Memory Controller
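[Figure: a memory controller (MC) between the CPU and DRAM, with Bank 0 and Bank 1 and a row buffer per bank. A queue of requests (F21, G12, C41, B5, H22, B4) illustrates a row buffer hit and a row buffer conflict, the latter requiring a precharge followed by an activate.]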
Row Buffer Hit Rate
Impact of Interference
[Figure: row buffer hit rate (y-axis, 0 to 100) for two series, Individual and Mix-8.]
Latency Breakdown of L2 Miss
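[Figure: three pie charts (High MPKI, Moderate MPKI, Low MPKI) splitting L2 miss latency into On-chip, Off-chip Queueing, and Off-chip Access. Slice labels as extracted, with the slice-to-category mapping not recoverable: 22%, 18%, 35%, 46%, 60%, 19%, 43%, 49%, 8%.]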
Observations
• Memory requests from multiple cores
interleave at the memory controllers
– Row buffer locality of individual apps is lost
• Off-chip latency accounts for the majority of a
memory access
• On-chip network and caches are critical
– Cannot afford to pollute them
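The first observation can be illustrated with a toy model. The sketch below is not the simulator used in the talk; it assumes an open-page policy with one row buffer per bank, a round-robin interleaving of requests at the controller, and two made-up request streams. The names `row_buffer_hit_rate` and `interleave` are hypothetical.

```python
def row_buffer_hit_rate(requests):
    """Return the fraction of requests that hit the open row of their bank.

    Each request is a (bank, row) pair. One row buffer per bank is modeled,
    and a row stays open until a different row of that bank is requested.
    """
    open_row = {}  # bank -> currently open row
    hits = 0
    for bank, row in requests:
        if open_row.get(bank) == row:
            hits += 1
        else:
            open_row[bank] = row  # models precharge + activate of the new row
    return hits / len(requests) if requests else 0.0


def interleave(*streams):
    """Round-robin merge of per-core streams (assumed arrival order at the MC)."""
    merged = []
    for group in zip(*streams):
        merged.extend(group)
    return merged


# Two hypothetical cores, each streaming through its own row of bank 0.
core0 = [(0, 100)] * 8
core1 = [(0, 200)] * 8

alone = row_buffer_hit_rate(core0)
mixed = row_buffer_hit_rate(interleave(core0, core1))
print(f"alone: {alone:.3f}, interleaved: {mixed:.3f}")
```

Run alone, the stream misses only on its first access. Once the two streams alternate on the same bank, every access closes the row the other core needs, so the locality each application had in isolation disappears.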
What about Cache Prefetching?
• Not effective for large CMPs
• Agnostic to memory state
– Gap between caches and memory (62% latency increase)
• On-chip resource pollution
– Both caches and network (22% network latency increase)
• Difficulty of stream detection in S-NUCA
– Each L2 bank caters to only a portion of the address space
– Each L2 bank gets requests from multiple L1s
• Our memory-side prefetching scheme can work along with
core-side prefetching
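The stream-detection point can be made concrete with a small sketch. It assumes a static, line-interleaved S-NUCA mapping with 32 L2 banks and 64-byte lines; the slides do not state the actual mapping function, so `home_bank` and both constants are illustrative only.

```python
NUM_L2_BANKS = 32  # assumption: one L2 bank per core
LINE_SIZE = 64     # assumption: 64-byte cache lines


def home_bank(addr):
    """Static mapping: consecutive cache lines map to consecutive L2 banks."""
    return (addr // LINE_SIZE) % NUM_L2_BANKS


# One core walks 64 consecutive cache lines (a simple sequential stream).
stream = [0x10000 + i * LINE_SIZE for i in range(64)]

per_bank = {}
for addr in stream:
    per_bank.setdefault(home_bank(addr), []).append(addr // LINE_SIZE)

# Each bank observes only every 32nd line of the original stream.
print(len(per_bank), per_bank[0])
```

Under this mapping no single bank sees two adjacent lines of the stream, and in a real system the few lines it does see are mixed with requests from other L1s, which is why a per-bank stream detector has little to work with.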
Memory-Side Prefetching
• Objective 1
– Reduce off-chip access latency
• Objective 2
– Without increasing on-chip resource contention
Memory-Side Prefetching
What to Prefetch?
When to Prefetch?
Where to Prefetch?
What to Prefetch?
• Prefetch from an open row
– Minimizes overhead
• Looked at the line access patterns within a
row
[Figure: % of accesses per cache line within a row (x-axis: Line 0 to Line 60), shown for milc, libquantum, and omnetpp.]
When to Prefetch?
[Figure: Idle Periods. Histogram over idle-period length in cycles (x-axis: 1 to 33+); y-axis ticks 0, 500000, 1000000, with one off-scale bar labeled 5618579.]

Option                 Critical Path   Locality   # of Prefetches
Prefetch at RBC        Yes             No         High
Prefetch at RBH        No              Yes        Low
Prefetch at Row ACT    No              No         High
Prefetch at Idle       No              Yes        High
Where to Prefetch?
• Should be stored on-chip
• Prefetch buffers in the memory controllers
– To avoid on-chip resource pollution
• Organization
– Per-core
– Shared
Memory-Side Prefetching Optimizations
• Applications vary in memory behavior
• Prefetch Throttling
– Feedback
• Precharge on Prefetch
– Less likely to get a request
• Avert Costly Prefetches
– Waiting demand requests
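The slides say only that throttling is feedback-driven. As one possible reading, the sketch below adjusts a per-core prefetch degree from the fraction of prefetched lines that were later used. The class name, the epoch structure, and all thresholds are assumptions made for illustration and are not taken from the talk.

```python
class PrefetchThrottle:
    """Feedback-driven prefetch degree for one core (illustrative thresholds)."""

    def __init__(self, degree=4, min_degree=0, max_degree=8, low=0.25, high=0.75):
        self.degree = degree
        self.min_degree = min_degree
        self.max_degree = max_degree
        self.low = low    # below this accuracy, prefetch less
        self.high = high  # above this accuracy, prefetch more
        self.issued = 0
        self.useful = 0

    def record_issue(self, n=1):
        """Count prefetches issued during the current epoch."""
        self.issued += n

    def record_useful(self, n=1):
        """Count prefetched lines that a demand request later consumed."""
        self.useful += n

    def end_epoch(self):
        """Adjust the degree from this epoch's accuracy, then reset the counters."""
        if self.issued:
            accuracy = self.useful / self.issued
            if accuracy < self.low:
                self.degree = max(self.min_degree, self.degree // 2)
            elif accuracy > self.high:
                self.degree = min(self.max_degree, self.degree + 1)
        self.issued = 0
        self.useful = 0
        return self.degree


throttle = PrefetchThrottle()
throttle.record_issue(100)
throttle.record_useful(10)   # 10% accuracy: back off
print(throttle.end_epoch())
throttle.record_issue(100)
throttle.record_useful(90)   # 90% accuracy: ramp up
print(throttle.end_epoch())
```

Halving on poor accuracy and stepping up by one on good accuracy is a conservative choice for this sketch: an application with irregular accesses stops consuming bandwidth quickly, while a streaming application regains its degree gradually.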
Memory-Side Prefetching: Example
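[Figure: animated example. A memory controller (MC) sits between the CPU and DRAM (Bank 0, Bank 1) and holds per-core prefetch buffers for Core 0 to Core 3, with contents such as C32, C33, C34, C36; R12, R13, R14, R15; and F20, F21, F22, F23. The request queue contains F21, G12, C41, C26, H22, and A10. The request for A10 hits in the row buffer holding row A, and lines A11, A12, A13, A14 are prefetched from the row buffer into the prefetch buffer; a request for A11 also appears in the example.]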
Memory-Side Prefetching: Comparison
                             Cache Prefetcher       Existing Memory Prefetchers   Our Memory-side
                             [Lui et al. ILP ‘11]   [Lin HPCA ‘01]                Prefetcher
Memory State Aware           No                     Yes                           Yes
On-chip resource pollution   Yes                    Yes                           No
Accuracy                     No                     Yes                           Yes
Implementation
• Prefetch Buffer Implementation
– Organized as n per-core prefetch buffers
– 256 KB per Memory Controller (<3% compared to
LLC)
– < 1% Area and Power overhead
• Prefetch Request Timing
– Requests are generated internally by the memory
controller along with a read row buffer hit request
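The pieces described so far (prefetch from an open row, triggered by a read row buffer hit, into per-core buffers at the memory controller) can be put together in a minimal sketch. This is a functional toy and not the authors' implementation: it assumes next-line prefetching within the row, FIFO replacement, a fixed degree of 4, and 64 lines per row. `MemorySidePrefetcher` and its parameters are hypothetical names.

```python
from collections import OrderedDict

LINES_PER_ROW = 64  # assumption, loosely matching the Line 0..60 axis on the slides


class MemorySidePrefetcher:
    """Toy memory controller with per-core prefetch buffers."""

    def __init__(self, num_cores, buffer_lines=16, degree=4):
        # One buffer per core; OrderedDict gives simple FIFO eviction.
        self.buffers = [OrderedDict() for _ in range(num_cores)]
        self.buffer_lines = buffer_lines
        self.degree = degree
        self.open_row = {}  # bank -> currently open row

    def read(self, core, bank, row, line):
        """Serve a demand read and report where it was satisfied."""
        key = (bank, row, line)
        if key in self.buffers[core]:
            del self.buffers[core][key]  # the line is consumed by the demand read
            return "prefetch-buffer hit"
        if self.open_row.get(bank) == row:
            # Prefetches are generated alongside a read row buffer hit.
            self._prefetch(core, bank, row, line)
            return "row-buffer hit"
        self.open_row[bank] = row  # precharge + activate; no prefetch on this path
        return "row-buffer miss"

    def _prefetch(self, core, bank, row, line):
        """Copy the next `degree` lines of the open row into the core's buffer."""
        buf = self.buffers[core]
        last = min(line + self.degree, LINES_PER_ROW - 1)
        for nxt in range(line + 1, last + 1):
            buf[(bank, row, nxt)] = True
            if len(buf) > self.buffer_lines:
                buf.popitem(last=False)  # evict the oldest entry


mc = MemorySidePrefetcher(num_cores=4)
print(mc.read(core=0, bank=0, row=7, line=9))   # opens the row
print(mc.read(core=0, bank=0, row=7, line=10))  # hit; lines 11..14 are prefetched
print(mc.read(core=0, bank=0, row=7, line=11))  # served from the prefetch buffer
```

Because the prefetched lines stay in the controller until a demand request asks for them, nothing is pushed into the caches or the network ahead of time, which is the property the slides emphasize.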
Evaluation Platform
• Cores: 32 at 2.4 GHz
• Network: 8x4 2D mesh
• Caches: 32KB L1I; 32KB L1D; 1MB L2 per core
• Memory: 16GB DDR3-1600 with 4 Memory Channels
• GEMS simulator with GARNET
Evaluation Methodology
• Benchmarks:
– Multi-programmed: SPEC 2006 (WL1 to WL5)
– Multi-threaded: SPECOMP 2001 (WL6 & WL7)
• Metrics:
– Harmonic IPC
– Off-chip and On-chip Latencies
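The slides name the metric but do not define it. A common reading is the harmonic mean of per-application IPC, which the sketch below computes. That definition, the helper names, and the IPC values are assumptions for illustration; the numbers are made up and are not results from the evaluation.

```python
from statistics import harmonic_mean


def harmonic_ipc(per_app_ipc):
    """Harmonic mean of per-application IPC values (assumed definition)."""
    return harmonic_mean(per_app_ipc)


def improvement_pct(baseline, with_prefetch):
    """Percent change in harmonic IPC relative to the baseline run."""
    return 100.0 * (harmonic_ipc(with_prefetch) / harmonic_ipc(baseline) - 1.0)


# Made-up per-application IPCs for a three-application workload.
baseline = [0.5, 1.0, 2.0]
prefetch = [0.6, 1.0, 2.0]

print(f"{harmonic_ipc(baseline):.3f}")
print(f"{improvement_pct(baseline, prefetch):.1f}%")
```

The harmonic mean is dominated by the slowest application, so speeding up a memory-bound application moves the metric more than an equal absolute gain for an application that is already fast.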
IPC
[Figure: IPC improvement (y-axis, -10 to 20) for WL1 to WL7 and AVG under CSP, MSP, MSP-PUSH, IDLE-PUSH, and CSP+MSP; annotations "33.2" (off-scale bar) and "10%".]
Latency
[Figure: latency in cycles (y-axis, 0 to 600) for WL1 to WL7 and AVG under No Pref, CSP, MSP, MSP-PUSH, IDLE-PUSH, and CSP+MSP; annotation "-48.5%".]
L2 Hitrate
[Figure: L2 hit rate (y-axis, 0 to 100) for WL1 to WL7 and AVG under No Pref, CSP, MSP, and CSP+MSP.]
Row Buffer Hitrate
[Figure: row buffer hit rate (y-axis, 0 to 80) for WL1 to WL7 and AVG under No Pref, CSP, MSP, and CSP+MSP.]
Conclusion
• Proposed a new memory-side prefetcher
– Opportunistic
– Instantaneous knowledge of memory state
• Prefetching Midway
– Doesn’t pollute on-chip resources
• Reduces the off-chip latency by 48.5% and
improves performance by 6.2% on average
• Our technique can be combined with core-side prefetching to amplify the benefits
Thank You
• Questions?
