ISCA_MTTOPs

Exploring Memory Consistency for
Massively Threaded Throughput-Oriented Processors
Blake Hechtman
Daniel J. Sorin
Executive Summary
• Massively Threaded Throughput-Oriented Processors (MTTOPs) like GPUs are being integrated on chips with CPUs and being used for general-purpose programming
• Conventional wisdom favors weak consistency on MTTOPs
• We implement a range of memory consistency models (SC, TSO and RMO) on MTTOPs
• We show that strong consistency is viable for MTTOPs
What is an MTTOP?
• Massively Threaded
– 4-16 core clusters
– 8-64 threads wide SIMD
– 64-128 deep SMT
– Thousands of concurrent threads
• Throughput-Oriented
– Sacrifice latency for throughput
  • Heavily banked caches and memories
  • Many cores, each of which is simple
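The thread counts on this slide multiply out as follows. This is a minimal sketch, assuming that concurrent hardware threads = core clusters × SIMD width × SMT depth; the function name is illustrative and not from the talk.

```python
def hardware_threads(core_clusters: int, simd_width: int, smt_depth: int) -> int:
    """Concurrent thread contexts, assuming one thread per SIMD lane per SMT slot."""
    return core_clusters * simd_width * smt_depth

# Low and high ends of the ranges quoted on the slide.
low = hardware_threads(core_clusters=4, simd_width=8, smt_depth=64)
high = hardware_threads(core_clusters=16, simd_width=64, smt_depth=128)
print(low, high)
```

Even the low end of the quoted ranges gives thousands of concurrent threads.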
Example MTTOP
[Figure: block diagram of an example MTTOP. Each core cluster has shared fetch and decode stages, a row of SIMD execution lanes (E), and an L1 cache. Many core clusters share a cache-coherent, banked L2, which connects to a memory controller.]
What is Memory Consistency?
Initially A = B = 0
Thread 0        Thread 1
ST B = 1        ST A = 1
LD r1, A        LD r2, B
Sequential Consistency: {r1,r2} = 0,1; 1,0; 1,1
Weak Consistency: {r1,r2} = 0,1; 1,0; 1,1; 0,0
→ enables store buffering
In this work, we explore hardware consistency models
MTTOP hardware concurrency seems likely to be constrained by Sequential Consistency (SC)
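The outcome sets on this slide can be checked by brute force. The sketch below is an illustrative model, not the hardware evaluated in the talk: each store is split into a hypothetical "buffer" event (ST) and a "drain" event (DRAIN) that makes it globally visible. Under SC the drain must precede the same thread's load; with store buffering it need not.

```python
from itertools import permutations

# Events are (kind, thread, x, y). ST buffers value y for address x,
# DRAIN makes that thread's buffered store to x globally visible,
# LD loads address y into register x.
ST0, DR0, LD0 = ("ST", 0, "B", 1), ("DRAIN", 0, "B", None), ("LD", 0, "r1", "A")
ST1, DR1, LD1 = ("ST", 1, "A", 1), ("DRAIN", 1, "A", None), ("LD", 1, "r2", "B")
EVENTS = [ST0, DR0, LD0, ST1, DR1, LD1]

# SC: a store is globally visible before the same thread's next load.
SC_ORDER = [(ST0, DR0), (DR0, LD0), (ST1, DR1), (DR1, LD1)]
# Store buffering: the load may run while the store still sits in the buffer.
WEAK_ORDER = [(ST0, DR0), (ST0, LD0), (ST1, DR1), (ST1, LD1)]

def run(schedule):
    """Execute one total order of events and return (r1, r2)."""
    mem = {"A": 0, "B": 0}
    buf = {0: {}, 1: {}}
    regs = {}
    for kind, tid, x, y in schedule:
        if kind == "ST":
            buf[tid][x] = y
        elif kind == "DRAIN":
            mem[x] = buf[tid].pop(x)
        else:
            # A load checks its own thread's buffer first, then memory.
            regs[x] = buf[tid].get(y, mem[y])
    return regs["r1"], regs["r2"]

def outcomes(order):
    """Collect (r1, r2) over every schedule that respects the ordering pairs."""
    results = set()
    for schedule in permutations(EVENTS):
        pos = {event: i for i, event in enumerate(schedule)}
        if all(pos[a] < pos[b] for a, b in order):
            results.add(run(schedule))
    return results

sc = outcomes(SC_ORDER)
weak = outcomes(WEAK_ORDER)
print(sorted(sc))
print(sorted(weak))
```

The only outcome that store buffering adds is r1 = r2 = 0, which requires both loads to execute before either store has drained.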
(CPU) Memory Consistency Debate
                  Strong Consistency    Weak Consistency
Performance       Slower                Faster
Programmability   Easier                Harder
• Conclusion for CPUs: trading off ~10-40% performance for programmability
– “Is SC + ILP = RC?” (Gniady ISCA99)
But does this conclusion apply to MTTOPs?
Memory Consistency on MTTOPs
• GPUs have undocumented hardware consistency models
• Intel MIC uses x86-TSO for the full chip with a directory cache coherence protocol
• MTTOP programming languages provide weak ordering guarantees
– OpenCL does not guarantee store visibility without a barrier or kernel completion
– CUDA includes a memory fence that can enable global store visibility
MTTOP Conventional Wisdom
• Highly parallel systems benefit from less ordering
– Graphics doesn’t need ordering
• Strong consistency seems likely to limit memory-level parallelism (MLP)
• Strong consistency is likely to suffer extra latencies
Weak ordering helps CPUs, but does it help MTTOPs?
It depends on how MTTOPs differ from CPUs …
Diff 1: Ratio of Loads to Stores
Weak consistency reduces the impact of store latency on performance
CPUs: prior work shows CPUs perform 2-4 loads per store
MTTOPs: [Figure: loads per store for MTTOPs, on a log-scale axis from 1 to 10000]
MTTOPs perform more loads per store → store latency optimizations will not be as critical to MTTOP performance
Diff 2: Outstanding L1 cache misses
Weak consistency enables more outstanding L1 misses per thread
                      CPU Core               MTTOP Core (CU/SM)
Threads per core      4                      64
SIMD width            4                      64
L1 miss rate          0.1                    0.5
SC misses per core    1.6 (too few misses)   2048 (enough misses)
…                     …                      …
RMO misses per core   6.4                    8192
MTTOPs have more outstanding L1 cache misses → thread reordering enabled by weak consistency is less important to handle memory latency
Diff 3: Memory System Latencies
Weak consistency enables reductions of store latencies
[Figure: pipeline diagrams of a CPU core and an MTTOP core cluster (fetch, decode, issue/select, ROB, execution units, LSQ), annotated with the memory latencies below]
                     L1             L2               Mem
CPU core             1-2 cycles     5-20 cycles      100-500 cycles
MTTOP core cluster   10-70 cycles   100-300 cycles   300-1000 cycles
MTTOPs have longer memory latencies → small latency savings will not significantly improve performance
Diff 4: Frequency of Synchronization
Weak consistency only re-orders memory ops between syncs
CPUs and MTTOPs:
  split problem into regions
  assign regions to threads
  do:
    work on local region
    synchronize
CPU local region: ~private cache size
MTTOP local region: ~private cache size / threads per cache
MTTOPs have more threads to compute a problem → each thread will have fewer independent memory ops between syncs.
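To put rough numbers on the two region sizes, the sketch below applies the slide's formulas. The MTTOP values (16KB L1D shared by a cluster of 8 SIMD lanes × 64 SMT contexts) come from the target system table later in this deck; the 32KB single-threaded CPU cache is a hypothetical value chosen only for comparison.

```python
def local_region_bytes(private_cache_bytes: int, threads_per_cache: int = 1) -> float:
    """Approximate per-thread working region: cache size split across its threads."""
    return private_cache_bytes / threads_per_cache

# Hypothetical CPU: 32KB private cache used by a single thread.
cpu_region = local_region_bytes(32 * 1024)
# MTTOP cluster from the target system: 16KB L1D shared by 8 * 64 threads.
mttop_region = local_region_bytes(16 * 1024, threads_per_cache=8 * 64)

print(cpu_region, mttop_region)
```

With only tens of bytes of cache per thread, an MTTOP thread has few independent memory operations to reorder before it reaches the next synchronization point.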
Diff 5: RAW Dependences Through Memory
Weak consistency enables store-to-load forwarding
CPUs
• Blocking for cache performance
• Frequent function calls
• Few architected registers
→ Many RAW dependences through memory
MTTOPs
• Coalescing for cache performance
• Inlined function calls
• Many architected registers
→ Few RAW dependences through memory
MTTOP algorithms have fewer RAW memory dependences → there is little benefit to being able to read from a write buffer
MTTOP Differences & Their Impact
• Other differences are mentioned in the paper
• How much do these differences affect the performance of memory consistency implementations on MTTOPs?
Memory Consistency Implementations
[Figure: four core-cluster pipelines (fetch, decode, SIMD execution lanes, L1), ordered from strongest to weakest consistency model]
Implementation (strongest to weakest)   Write buffering
SC simple                               No write buffer
SC wb                                   Per-lane FIFO write buffer, drained on LOADS
TSO                                     Per-lane FIFO write buffer, drained on FENCES
RMO                                     Per-lane CAM for outstanding write addresses
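The drain policies of the first three implementations can be sketched as a toy model. This is an illustration under stated assumptions, not the gem5 implementation from the talk: the `Lane` class and `litmus` helper are hypothetical, a shared dictionary stands in for the L1, background draining is not modeled, TSO loads are assumed to forward from the lane's own buffer, and the RMO CAM is omitted.

```python
from collections import deque

class Lane:
    """One SIMD lane's view of memory under a given consistency implementation."""

    def __init__(self, memory, model):
        assert model in ("SC_simple", "SC_wb", "TSO")
        self.memory = memory          # shared dict standing in for the L1
        self.model = model
        self.write_buffer = deque()   # FIFO of (address, value)

    def store(self, addr, value):
        if self.model == "SC_simple":
            self.memory[addr] = value  # no write buffer
        else:
            self.write_buffer.append((addr, value))

    def drain(self):
        # Write buffered stores to memory in FIFO order.
        while self.write_buffer:
            addr, value = self.write_buffer.popleft()
            self.memory[addr] = value

    def load(self, addr):
        if self.model == "SC_wb":
            self.drain()  # buffer is drained on LOADS
        # Assumed for TSO: forward the youngest matching buffered store.
        for buffered_addr, value in reversed(self.write_buffer):
            if buffered_addr == addr:
                return value
        return self.memory[addr]

    def fence(self):
        self.drain()  # buffer is drained on FENCES

def litmus(model, use_fences=False):
    """Run both stores, then both loads, of the A/B example; return (r1, r2)."""
    memory = {"A": 0, "B": 0}
    lane0, lane1 = Lane(memory, model), Lane(memory, model)
    lane0.store("B", 1)
    lane1.store("A", 1)
    if use_fences:
        lane0.fence()
        lane1.fence()
    return lane0.load("A"), lane1.load("B")

for model in ("SC_simple", "SC_wb", "TSO"):
    print(model, litmus(model))
print("TSO + fences", litmus("TSO", use_fences=True))
```

In this model only TSO without fences produces r1 = r2 = 0 for the earlier two-thread example; draining on loads keeps the buffered SC variant within the SC outcome set.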
Methodology
• Modified gem5 to support SIMT cores running a modified version of the Alpha ISA
• Looked at typical MTTOP workloads
– Had to port workloads to run in system model
• Ported Rodinia benchmarks
– bfs, hotspot, kmeans, and nn
• Handwritten benchmarks
– dijkstra, 2dconv, and matrix_mul
Target MTTOP System
Parameter                            Value
core clusters                        16 core clusters; 8 wide SIMD
core                                 in-order, Alpha-like ISA, 64 deep SMT
interconnection network              2D torus
coherence protocol                   Writeback MOESI protocol
L1I cache (shared by cluster)        perfect, 1-cycle hit
L1D cache (shared by cluster)        16KB, 4-way, 20-cycle hit, no local memory
L2 cache (shared by all clusters)    256KB, 8 banks, 8-way, 50-cycle hit
consistency model-specific features (give benefit to weaker models)
write buffer (SCwb and TSO)          perfect, instant access
CAM for store address matching       perfect, instant access
Results
MTTOP Consistency Model Performance Comparison
[Figure: bar chart of speedup (y-axis, 0 to 1.6) for SC, SC_WB, TSO, and RMO on 2dconv, barnes, bfs, dijkstra, fft, hotspot, kmeans, matrix_mul, and nn]
Results
MTTOP Consistency Model Performance Comparison
[Figure: the speedup bar chart (y-axis, 0 to 1.6) for SC, SC_WB, TSO, and RMO on 2dconv, barnes, bfs, dijkstra, fft, hotspot, kmeans, matrix_mul, and nn, with the annotation "Significant load reordering"]
Conclusions
• Strong consistency should not be ruled out for MTTOPs on the basis of performance
• Improving store performance with write buffers appears unnecessary
• Graphics-like workloads may get significant MLP from load reordering (dijkstra, 2dconv)
Conventional wisdom may be wrong about MTTOPs
Caveats and Limitations
• Results do not necessarily apply to all possible MTTOPs or MTTOP software
• Evaluation uses writeback caches, whereas current MTTOPs use write-through caches
• Future workloads have the potential to be more CPU-like
Thank you!
Questions?