A Case for Refresh Pausing in DRAM Memory Systems
Prashant Nair
Chia-Chen Chou
Moinuddin Qureshi
Introduction
• Dynamic Random Access Memory (DRAM) used as main memory
• DRAM stores data as charge on capacitor
[Figure: DRAM chip; DRAM cells leak data through charge leakage]
DRAM is a volatile memory → charge leaks quickly
Refresh: Restoring Data in DRAM
DRAM maintains data by Refresh operations
[Figure: Refresh operations restore the charge on the cells of a DRAM chip]
JEDEC-specified DRAM retention time:
• 64ms (< 85°C)
• 32ms (> 85°C)
Time between refreshes of a row ≤ retention time
DRAM relies on Refresh for data integrity
Refresh: A Growing Problem
Time spent in Refresh is proportional to the number of rows
Increasing memory capacity → more time spent in Refresh
[Figure: Time spent in refresh vs. chip density: 1Gb 2.8%, 2Gb 5.1%, 4Gb 7.7%, 8Gb 9%, 16Gb ~18%, 32Gb ~36%]
The time spent doing Refresh increases with chip density
Refresh Blocks Reads
Memory unavailable for Read/Write during Refresh
[Figure: Timelines for requests A and B. With no refresh, B is serviced on arrival; when a REFRESH is in progress, B waits until it completes (interference due to refresh)]
Refresh blocks reads → higher read latency
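A minimal sketch of the interference shown above, assuming one rank that is idle apart from a single refresh occupying it for TRFC starting at `refresh_start_ns`. The function name is hypothetical, and the 350ns TRFC is the 8Gb value given on the Latency Wall slide.

```python
def read_wait_ns(arrival_ns: float, refresh_start_ns: float, trfc_ns: float) -> float:
    """Extra wait for a read that arrives while the rank is being refreshed.

    The rank is assumed busy only in [refresh_start_ns, refresh_start_ns + trfc_ns);
    a read arriving inside that window waits until the window closes.
    """
    refresh_end_ns = refresh_start_ns + trfc_ns
    if refresh_start_ns <= arrival_ns < refresh_end_ns:
        return refresh_end_ns - arrival_ns
    return 0.0


TRFC_NS = 350.0  # 8Gb TRFC from the Latency Wall slide

# A read arriving before the refresh starts is not delayed;
# a read arriving 50 ns into the refresh waits for the rest of it.
print(read_wait_ns(arrival_ns=-20.0, refresh_start_ns=0.0, trfc_ns=TRFC_NS))  # 0.0
print(read_wait_ns(arrival_ns=50.0, refresh_start_ns=0.0, trfc_ns=TRFC_NS))   # 300.0
```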
Impact of Refresh
[Figure: Performance loss and increase in read latency due to refresh for 8Gb, 16Gb, and 32Gb chips]
The impact of Refresh is significant and increasing
Our Goal: Reduce the Read Latency impact of Refresh
Outline
• Introduction & Motivation
• Refresh Operation: Background
• Refresh Pausing
• Evaluation
• Alternative Proposals
• Summary
Refresh Operation
[Figure: A DRAM bank with Rows 1 to n, each refreshed in turn]
Refresh operates at row granularity
Refresh Modes
• Burst Mode: all rows refreshed back-to-back once every 64ms
Memory unavailable until all rows finish refresh
• Distributed Mode: 8K refresh pulses spread over 64ms
Distributed mode reduces contention from Refresh
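The distributed-mode numbers can be checked with a short calculation. It assumes the 8K (8192) pulses are spread evenly across the retention window and that each pulse makes the rank unavailable for TRFC. It uses the 350ns TRFC quoted later for 8Gb chips and the 32ms window that applies above 85°C. Under those assumptions, the unavailable fraction comes out at about 9%, which is consistent with the value shown for 8Gb chips in the chip-density chart. The function names are hypothetical.

```python
REFRESH_PULSES = 8192  # "8K refresh pulses" per retention window


def trefi_ns(retention_ms: float, pulses: int = REFRESH_PULSES) -> float:
    """Interval between refresh pulses when they are spread evenly over the window."""
    return retention_ms * 1_000_000 / pulses


def refresh_busy_fraction(trfc_ns: float, retention_ms: float) -> float:
    """Fraction of time the rank is unavailable: one TRFC per TREFI."""
    return trfc_ns / trefi_ns(retention_ms)


print(trefi_ns(64))  # 7812.5   (below 85 C)
print(trefi_ns(32))  # 3906.25  (above 85 C)
print(f"{refresh_busy_fraction(350, 32):.1%}")  # 9.0% for an 8Gb chip
```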
Refresh Bundle
Every pulse refreshes a ‘Bundle of rows’
Chip size                      Rows in a refresh bundle (per bank)
512Mb                          1
1Gb                            2
2Gb                            4
4Gb or 8Gb (twin 4Gb die)      8
Refresh bundles currently have up to 8 rows, and the number is increasing
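Bundle size follows from the requirement that every row be refreshed once per 8K-pulse window. The sketch below assumes x8 DDR3 parts with 8K, 16K, 32K, and 64K rows per bank for the 512Mb through 4Gb densities; these row counts are not stated in the deck. The 8Gb twin-die part is treated as two 4Gb dies refreshed together, so it keeps the 4Gb bundle size. The dictionary and function names are hypothetical.

```python
REFRESH_PULSES = 8192  # refresh pulses per retention window

# Assumed rows per bank for x8 DDR3 devices (not stated in the deck).
ROWS_PER_BANK = {
    "512Mb": 8 * 1024,
    "1Gb": 16 * 1024,
    "2Gb": 32 * 1024,
    "4Gb": 64 * 1024,
}


def rows_per_bundle(rows_per_bank: int, pulses: int = REFRESH_PULSES) -> int:
    """Rows each pulse must refresh so every row is covered once per window (rounded up)."""
    return -(-rows_per_bank // pulses)


for size, rows in ROWS_PER_BANK.items():
    print(size, rows_per_bundle(rows))  # 512Mb 1 / 1Gb 2 / 2Gb 4 / 4Gb 8
```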
The Latency Wall of Refresh
TRFC is the time for which memory is unavailable during each refresh pulse
[Figure: Available and unavailable (TRFC) periods for 8Gb, 16Gb, and 32Gb chips]
Current 8Gb chips have a TRFC of 350ns, far greater than read latency
High TRFC → a read may wait for refresh for a long time
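The worst-case and mean extra wait caused by refresh can be estimated under a deliberately simple model. The model assumes a single read arriving at a uniformly random time and no queueing behind other requests. Both the model and the names are illustrative assumptions. Queueing behind a refresh would likely make the real average penalty larger than this estimate, while the worst case is a full TRFC.

```python
def refresh_wait_stats(trfc_ns: float, trefi_ns: float) -> tuple[float, float]:
    """Worst-case and mean extra wait for one read under a simple model.

    Assumes the read arrives at a uniformly random time within a TREFI period
    and ignores queueing behind other requests.
    """
    worst_ns = trfc_ns                   # read arrives just as the refresh starts
    p_collide = trfc_ns / trefi_ns       # chance the read lands inside a refresh
    mean_ns = p_collide * trfc_ns / 2    # mean residual refresh time is TRFC/2
    return worst_ns, mean_ns


worst, mean = refresh_wait_stats(trfc_ns=350.0, trefi_ns=3906.25)
print(worst, round(mean, 2))  # 350.0 15.68
```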
Refresh Pausing
Insight: Make Refresh Operations Interruptible
[Figure: Baseline system: request B arrives during a refresh and waits until it completes. Refresh Pausing: the refresh is interrupted, B is serviced, and the refresh then continues]
Pausing at an arbitrary point can cause data loss
Pausing Refresh reduces wait time for Reads
Refresh Pausing: When to Pause?
[Figure: A refresh pulse refreshes a bundle of 4 rows (a, b, c, d) in a bank through the row buffer; with Refresh Pausing, the refresh pauses after the current row so that Read X can be serviced]
Refresh is paused at a row boundary to service the read
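The benefit of pausing at a row boundary can be illustrated by comparing how long Read X waits with and without pausing. This sketch uses hypothetical parameters: a 4-row bundle refreshed one row at a time in equal 87.5ns slices (so TRFC = 350ns), ignoring the recovery overhead TREC. Arrival times are measured from the start of the refresh pulse. With pausing, the read waits only until the row in progress finishes.

```python
import math


def baseline_wait(arrival_ns: float, rows: int, row_ns: float) -> float:
    """Without pausing, a read waits until the whole bundle (TRFC) is refreshed."""
    trfc_ns = rows * row_ns
    return trfc_ns - arrival_ns if 0 <= arrival_ns < trfc_ns else 0.0


def pausing_wait(arrival_ns: float, rows: int, row_ns: float) -> float:
    """With pausing, a read waits only until the row in progress is refreshed."""
    trfc_ns = rows * row_ns
    if not 0 <= arrival_ns < trfc_ns:
        return 0.0
    next_row_boundary = math.ceil(arrival_ns / row_ns) * row_ns
    return next_row_boundary - arrival_ns


# Hypothetical: 4-row bundle, 87.5 ns per row (TRFC = 350 ns), Read X arrives at 100 ns.
print(baseline_wait(100.0, rows=4, row_ns=87.5))  # 250.0
print(pausing_wait(100.0, rows=4, row_ns=87.5))   # 75.0
```

The remaining rows of the bundle are still refreshed after Read X, so the bundle as a whole finishes later than in the baseline. The read gains latency at the expense of the refresh's completion time.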
Refresh Pausing: Interface Details
• Memory Controller generates a Refresh Enable (RE) signal
• Pausing requires ‘active low’ detection of RE
• One-way communication only (controller to DRAM)
[Figure: The memory controller drives Refresh Enable (RE) to the DRAM; RE low (0) pauses refresh and RE high (1) resumes it]
Refresh Pausing: Track a Paused Row
• The Row Address Counter increments the refresh row addresses
• Stop the increment using a simple AND gate
• Active-low Refresh Enable acts as 'Refresh Pause'
[Figure: DRAM refresh address generator: an incrementer advances the Row Address Counter, whose EN input is gated by RE, to produce the refresh bundle addresses]
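The counter gating described above can be modelled behaviourally. The sketch assumes one call to `step` per row-refresh slot. When RE is high, the counter refreshes its current row and increments. When RE is low, the AND gate withholds EN, so the address is held. `RefreshRowCounter` is a hypothetical software model of the slide's hardware, not a circuit description.

```python
from typing import Optional


class RefreshRowCounter:
    """Software model of the refresh row-address counter with an RE-gated enable.

    Each call to step() stands for one row-refresh slot. EN = slot AND RE, so
    with RE low (active-low pause) the counter holds its address.
    """

    def __init__(self, rows_per_bank: int) -> None:
        self.rows_per_bank = rows_per_bank
        self.row = 0  # address of the next row to refresh

    def step(self, re: bool) -> Optional[int]:
        """Refresh one row if RE is high; return the refreshed row, or None if paused."""
        if not re:  # AND gate output is 0: no refresh, no increment
            return None
        refreshed = self.row
        self.row = (self.row + 1) % self.rows_per_bank  # incrementer
        return refreshed


counter = RefreshRowCounter(rows_per_bank=8)
print([counter.step(re) for re in (True, True, False, False, True, True)])
# [0, 1, None, None, 2, 3]
```

After the two paused slots, refresh resumes at row 2 rather than restarting the bundle. This is why freezing the counter is enough to track a paused row.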
Refresh Pausing: Memory Scheduler
• The scheduler schedules Reads, Writes, and Refreshes
• Responsible for pausing Refresh to service Reads
• Keeps track of the refresh time completed before a pause
[Figure: Memory controller between processor and DRAM: the scheduler draws from the read and write queues, issues commands over the bus, and drives Refresh Enable]
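The scheduler's bookkeeping might look like the following sketch. It assumes that each decision point falls on a row boundary and that `elapsed_ns` reports the refresh time completed since the previous decision. `Action`, `InFlightRefresh`, and `schedule_step` are hypothetical names. Forced refreshes are never paused, as the Forced Refresh slide requires.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CONTINUE = "RE high: keep refreshing"
    PAUSE = "RE low: pause at this row boundary and serve reads"
    DONE = "refresh bundle complete"


@dataclass
class InFlightRefresh:
    trfc_ns: float          # refresh time needed for the whole bundle
    done_ns: float = 0.0    # refresh time completed so far, across pauses
    forced: bool = False    # forced refreshes cannot be paused

    def remaining_ns(self) -> float:
        return self.trfc_ns - self.done_ns


def schedule_step(refresh: InFlightRefresh, reads_pending: int, elapsed_ns: float) -> Action:
    """Record refresh progress, then decide the next Refresh Enable level.

    elapsed_ns is the refresh time completed since the last decision, assumed
    to end on a row boundary so that pausing is safe.
    """
    refresh.done_ns += elapsed_ns
    if refresh.remaining_ns() <= 0:
        return Action.DONE
    if reads_pending > 0 and not refresh.forced:
        return Action.PAUSE
    return Action.CONTINUE


ref = InFlightRefresh(trfc_ns=350.0)
print(schedule_step(ref, reads_pending=2, elapsed_ns=87.5), ref.remaining_ns())
# Action.PAUSE 262.5
```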
Forced Refresh
• Pausing can delay Refresh
[Figure: Refresh pulses over time; pulses not issued while reads/writes are serviced accumulate until a Forced Refresh is issued]
• JEDEC allows up to 8 refreshes to be postponed
• If 8 refreshes are pending, issue a 'Forced Refresh'
• Forced Refresh cannot be Paused
Forced Refresh ensures data integrity
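The forced-refresh rule can be captured with a pending-refresh counter. The sketch assumes that one refresh becomes due every TREFI and uses the limit of 8 postponed refreshes stated above. `RefreshDebt` and its methods are hypothetical names chosen for illustration.

```python
MAX_PENDING = 8  # JEDEC: at most 8 refreshes may be postponed


class RefreshDebt:
    """Counts refresh pulses that are due but have not been issued yet."""

    def __init__(self) -> None:
        self.pending = 0

    def refresh_due(self) -> None:
        """Called every TREFI, when another refresh pulse becomes due."""
        self.pending += 1

    def refresh_issued(self) -> None:
        """Called when a refresh pulse completes."""
        self.pending = max(0, self.pending - 1)

    def must_force(self) -> bool:
        """At the limit, the next refresh is a Forced Refresh that cannot be paused."""
        return self.pending >= MAX_PENDING


debt = RefreshDebt()
for _ in range(8):  # eight TREFI periods pass while refresh keeps being deferred
    debt.refresh_due()
print(debt.pending, debt.must_force())  # 8 True
```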
Experimental Setup
• Simulator: uSIMM from Memory Scheduling Championship (MSC)
• Workloads: MSC Suite
COMMERCIAL(5), PARSEC(9), BIOBENCH(2) and SPEC(2)
• Configuration:
Number of cores              4
Last-level cache             1MB
DRAM (DDR3)                  8 chips/rank, 8Gb/chip
Channels, ranks, banks       4, 2, 8
Refresh (baseline)           Distributed (JEDEC)
• Results presented for temperatures > 85°C (the paper also reports < 85°C)
Results: Read Latency
[Figure: Normalized read latency with Refresh Pausing and No Refresh for COMMERCIAL, SPEC, PARSEC, BIOBENCH, and GMEAN]
- Refresh Pausing gives ~7% read latency reduction for an 8Gb chip
Results: Performance
[Figure: Speedup with Refresh Pausing and No Refresh for COMMERCIAL, SPEC, PARSEC, BIOBENCH, and GMEAN]
- Refresh Pausing gives ~5% performance improvement for an 8Gb chip
Results: Impact of Chip Density
[Figure: Speedup with Refresh Pausing and No Refresh for 8Gb, 16Gb, and 32Gb chips]
Refresh Pausing becomes more effective as chip density increases
Elastic Refresh for Scheduling Refresh
[MICRO’10]
• Elastic Refresh waits for an idle period before issuing a refresh
• Estimates the average inter-arrival time of memory requests
[Figure: Timelines for requests A and B under No Refreshes, With Refreshes, and Elastic Refresh]
The "Wait and Watch" policy can increase wait times
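The following sketch shows how "wait and watch" can hurt. It is a deliberately simplified idle-threshold policy in the spirit of Elastic Refresh, not the published algorithm. The refresh starts once the rank has been idle for `idle_threshold` after request A completes. Request B waits if it arrives after that point but before the refresh ends. Times are in arbitrary units, and the function and its parameters are hypothetical.

```python
def wait_for_b(a_done: float, b_arrival: float, idle_threshold: float, trfc: float) -> float:
    """Wait seen by request B under a simplified idle-threshold refresh policy.

    The refresh is issued once the rank has been idle for idle_threshold after
    request A completes. If B arrives first, the refresh is deferred and B does
    not wait; otherwise B waits for whatever remains of the refresh.
    """
    refresh_start = a_done + idle_threshold
    if b_arrival < refresh_start:
        return 0.0
    refresh_end = refresh_start + trfc
    return max(0.0, refresh_end - b_arrival)


# Hypothetical units: idle threshold 3, refresh length 4.
print(wait_for_b(a_done=0.0, b_arrival=2.0, idle_threshold=3.0, trfc=4.0))  # 0.0
print(wait_for_b(a_done=0.0, b_arrival=4.0, idle_threshold=3.0, trfc=4.0))  # 3.0
```

A request arriving just after the watch period ends still pays almost a full TRFC, even though the refresh was already delayed.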
Comparison with Elastic Refresh
[Figure: Speedup with Elastic Refresh, Refresh Pausing, and No Refresh for COMMERCIAL, SPEC, PARSEC, BIOBENCH, and GMEAN]
Refresh Pausing outperforms Elastic Refresh
DDR4 proposals: x2 and x4 modes
Reduce the bundle size and use more bundles
[Figure: DDR3 distributed mode issues a refresh of length TRFC every TREFI; DDR4 x2 mode issues a refresh of length TRFC/2 every TREFI/2]
• In x2 mode, TREFI is halved (divided by 4 in x4 mode)
• In x2 mode, TRFC is halved (divided by 4 in x4 mode)
Fine-grained refresh reduces contention from Refresh
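The idealized scaling on this slide can be tabulated as below. The sketch assumes that TRFC and TREFI both divide exactly by the mode factor, starting from the 8Gb values used earlier (TRFC = 350ns, TREFI = 3906.25ns above 85°C). Under that assumption, the fraction of time spent refreshing is unchanged, while the longest a read can be blocked by a single refresh shrinks by 2x or 4x. In the JEDEC DDR4 specification, the per-command refresh time shrinks by less than this (for 8Gb devices, 260ns in x2 mode and 160ns in x4 mode), so the overall refresh overhead actually grows in the finer modes. `fine_grained` is a hypothetical helper.

```python
def fine_grained(trfc_ns: float, trefi_ns: float, mode: int) -> tuple[float, float, float]:
    """Idealized fine-grained refresh: TRFC and TREFI both divide by the mode factor."""
    if mode not in (1, 2, 4):
        raise ValueError("mode must be 1 (DDR3-style), 2 (x2), or 4 (x4)")
    trfc_m = trfc_ns / mode
    trefi_m = trefi_ns / mode
    return trfc_m, trefi_m, trfc_m / trefi_m  # last value: fraction of time refreshing


for mode in (1, 2, 4):
    trfc_m, trefi_m, busy = fine_grained(350.0, 3906.25, mode)
    print(f"x{mode}: TRFC={trfc_m}ns TREFI={trefi_m}ns busy={busy:.1%}")
# x1: TRFC=350.0ns TREFI=3906.25ns busy=9.0%
# x2: TRFC=175.0ns TREFI=1953.125ns busy=9.0%
# x4: TRFC=87.5ns TREFI=976.5625ns busy=9.0%
```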
Comparison with DDR4
[Figure: Speedup with DDR4 x2, DDR4 x4, Refresh Pausing, and No Refresh for 16Gb and 32Gb chips]
DDR4 modes (x2 and x4) are useful but not enough
Summary
• DRAM relies on Refresh for data integrity
• Time for Refresh increases with chip density
• Refresh blocks reads, increasing read latency
• Refresh Pausing: make Refresh interruptible
• Pausing provides a 5% improvement for 8Gb, increasing with higher density
• Also applicable to DDR4 (fine-grained refresh)
THANK YOU
Refresh+Read
• Reads operate on a rank
• Refreshes may also operate on the same rank
• DRAMs serve only a single request at a time
[Figure: The scheduler issues reads from the read queue and refreshes to the same rank]
Refresh Row Bundle
[Figure: Refresh timeline showing REFRESH commands separated by TREFI; each refreshes Rows 1 to n of a bundle within TRFC, with TREC marked]
• TRFC: time to refresh one bundle of rows
• TREC: current recovery time
• TREFI: time until the next bundle refresh
A larger refresh-row bundle implies a larger TRFC
DRAM Organization
Hierarchically organized as Channels, Ranks and Banks
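[Figure: A channel with Rank 1 and Rank 2; each rank consists of chips, each chip of banks, and each bank of rows; a READ is serviced over the channel]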
Refresh Modes
Burst and Distributed Mode
[Figure: Refresh of rows across the banks and chips of a rank in Burst Mode and Distributed Mode]
Distributed mode: only a few rows in all banks refresh at a time, so refresh is distributed in time
Burst mode: all rows in all banks refresh simultaneously
Transactions in DRAMs
Three transactions of concern
– Reads
– Writes
– Refreshes
[Figure: Reads, writes, and refreshes from the processor share the bus to DRAM; mismanagement of requests leads to collisions!]
A scheduler is needed to manage requests to DRAM
Temperature Sensitivity of Refresh Pausing
[Figure: Speedup (%) with No Refresh and with Refresh Pausing for 8Gb, 16Gb, and 32Gb chips, at < 85°C and > 85°C]
- Up to 22% increase in speedup for future chips
The savings from Refresh Pausing are higher at high operating temperatures
Auto and Self Refresh
• Special Refresh Modes for DRAMs
• Auto Refresh – Internal Counter issues pulses
in distributed fashion (CBR and RAS only)
• Self Refresh – DRAM is internally refreshed at
a power optimized rate (Activity == 0)
Self Refresh Modes are only used when DRAMs stay idle
Mitigating Penalty
• Pause a refresh bundle at row granularity
• TRPC = row cycle time + current recovery time
• Current recovery time is small for individual rows
• Thus refreshes can be made interruptible
a. Maximum Refresh penalty without pausing is TRFC
b. Maximum Refresh penalty with pausing is TRPC