CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA

Yooseong Kim and Aviral Shrivastava
Compiler and Microarchitecture Laboratory, Arizona State University
DAC 2011
Outline
- Introduction
- Preliminaries
- Motivating examples
- CuMAPz approach
- Experimental results and conclusions
Introduction
- The computational power of Graphics Processing Units (GPUs) has reached the teraFLOP scale.
- NVIDIA CUDA and OpenCL make GPGPU (General-Purpose computation on GPUs) programming easier.
- Performance is heavily affected by memory performance because of the large data sizes involved.
Introduction (cont.)
- Shared memory is as fast as registers and is the only fast on-chip memory where both reads and writes are enabled.
- Many factors affect performance: data reuse, global memory access coalescing, shared memory bank conflicts, and channel skew.
- This work develops CuMAPz (CUDA Memory Access Pattern analyZer) to analyze the memory performance of CUDA programs.
Preliminaries
- NVIDIA GPU architecture
- Comparisons between CPU and GPU
- CUDA programming
- Memory coalescing
- Execution of GPU threads
Architecture of NVIDIA GTX280
- A collection of 30 multiprocessors, each with 8 streaming processors.
- The 30 multiprocessors share one off-chip global memory.
  - Access time: about 300 clock cycles
- Each multiprocessor has an on-chip memory shared by its 8 streaming processors.
  - Access time: 2 clock cycles
Architecture diagram
Some differences between GPU and CPU

                            GPU (NVIDIA GeForce 8800 GTX)             CPU (Intel Pentium 4)
Cores / clock rate          128 / 575 MHz (core), 1.35 GHz (shader)   1 / 3.0 GHz
FLOPS                       345.6 GFLOPS                              ~12 GFLOPS
Memory bandwidth            86.4 GB/s (900 MHz memory clock,          6.4 GB/s (800 MHz memory clock,
                            384-bit interface, 2 issues)              32-bit interface, 2 issues)
Global memory access time   Slow (about 500 memory clock cycles)      Fast (about 5 memory clock cycles)
Abstract comparison of memory hierarchies between CPU and GPU (cont.)

CPU (Intel Pentium 4)   GPU (NVIDIA GeForce 8800 GTX)
Register                Register
Cache                   Texture cache or Constant cache
Main memory             Shared memory
Hard disk               Global memory, Texture memory, Constant memory
Memory coalescing
- Several memory transactions can be coalesced into one transaction when consecutive threads access consecutive memory locations.
- Because the access time of global memory is relatively large, achieving coalescing is important, as the sketch below illustrates.
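
A minimal sketch of the difference (illustrative code, not from the paper; array names and the stride parameter are assumptions):

// Consecutive threads touch consecutive addresses: the hardware can
// service a warp's loads and stores with one coalesced transaction.
__global__ void coalesced(float *out, const float *in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// With stride > 1 the warp's accesses are scattered across memory,
// so they split into many separate transactions.
__global__ void strided(float *out, const float *in, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i * stride] = in[i * stride];
}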
CUDA programming
- Compute Unified Device Architecture
- The CPU code does the sequential part.
- The highly parallel part is usually implemented in the GPU code, called a kernel.
- Calling a GPU function from the CPU code is called a kernel launch; a minimal sketch follows.
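
A minimal, self-contained sketch of a kernel and its launch (illustrative names and sizes, not code from the paper):

#include <cuda_runtime.h>

// Kernel: each thread scales one element of the array.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main(void) {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    // Kernel launch: 256 threads per block, enough blocks to cover n elements.
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}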
Execution of GPU threads
- Threads are grouped into thread blocks.
- Each thread block is assigned to a streaming multiprocessor (SM), which contains multiple scalar processors (SPs), to be executed.
- The actual execution of threads on SPs is done in groups of 32 threads, called warps; SPs execute one warp at a time (see the sketch below).
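
A small sketch of how a thread's warp can be identified inside a kernel (illustrative; the kernel and array names are assumptions, and warpSize is 32 on this hardware):

__global__ void whichWarp(int *warpIds) {
    // Flatten the thread index within the block, then divide by the warp size.
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    int warpId = tid / warpSize;  // warpSize is a CUDA built-in variable
    warpIds[blockIdx.x * blockDim.x * blockDim.y + tid] = warpId;
}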
Motivating examples
- What to fetch into shared memory?
- How to access shared memory?
- How to access global memory?
What to fetch into shared memory?
- A simple program that does not use shared memory; a hypothetical sketch of such a kernel follows.
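
The code figure on this slide did not survive extraction. Based on the index expressions cited on the following slides (row*MAX+col+1, out[...]), a hypothetical kernel of this shape might be:

// Hypothetical reconstruction of the slide's code figure (not available
// in this transcript); MAX, in, and out are assumed names.
#define MAX 1024
__global__ void noSharedMem(float *out, const float *in) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    // Every thread reads its neighbor directly from global memory.
    out[row * MAX + col] = in[row * MAX + col + 1];
}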
What to fetch into shared memory? (cont.)
- What if we fetch in[row*MAX+col+1] into shared memory? A sketch of this variant follows.
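
A hedged sketch of the shared-memory variant, continuing the hypothetical reconstruction above (all names assumed; bounds checks omitted for brevity):

#define MAX 1024
#define BLKDIM 16
__global__ void withSharedMem(float *out, const float *in) {
    __shared__ float s_in[BLKDIM][BLKDIM];
    int row = blockIdx.y * BLKDIM + threadIdx.y;
    int col = blockIdx.x * BLKDIM + threadIdx.x;
    // Fetch the off-by-one region into the shared buffer...
    s_in[threadIdx.y][threadIdx.x] = in[row * MAX + col + 1];
    __syncthreads();
    // ...then compute from shared memory instead of global memory.
    out[row * MAX + col] = s_in[threadIdx.y][threadIdx.x];
}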
What to fetch into shared memory? (cont.)
- Generally, higher data reuse should imply better performance, but that may not be true here.
- This counter-intuitive result is mainly caused by global memory access coalescing.
How to access shared memory?
- In Figure 2, data is accessed in a column-wise manner, as shown at Lines 4, 9, 11, and 16.
- What if we change it to a row-wise manner (i.e., s_in[tIdx.y][tIdx.x]) or skew the access pattern (i.e., __shared__ float s_in[BLKDIM][BLKDIM+1])?
How to access shared memory? (cont.)
- Shared memory bank conflicts occur when there are multiple requests to different addresses in the same bank. In this case, the requests are serialized (the padding sketch below shows a common remedy).
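
A sketch of the skewing idea from the previous slide: padding each row by one element shifts columns across banks, so column-wise accesses no longer collide (kernel and array names are assumptions; BLKDIM is the block dimension):

#define BLKDIM 16
__global__ void columnAccess(float *out, const float *in) {
    // The +1 padding skews each row by one bank, so threads reading a
    // column (s_in[threadIdx.x][threadIdx.y]) hit distinct banks.
    __shared__ float s_in[BLKDIM][BLKDIM + 1];
    int idx = blockIdx.x * BLKDIM * BLKDIM
            + threadIdx.y * BLKDIM + threadIdx.x;
    s_in[threadIdx.y][threadIdx.x] = in[idx];
    __syncthreads();
    out[idx] = s_in[threadIdx.x][threadIdx.y];  // column-wise, conflict-free
}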
How to access global memory?
- A programmer might have designed the global memory write reference at Line 18 in Figure 2 to be in a column-wise manner, as in out[col*MAX+row].
- The resulting unexpected slowdown is caused by channel skew: the ratio of the number of concurrent accesses to the most used channel to that of the least used channel.
Previous works
- [20] modeled the amount of parallelism employed in a program and the efficiency of a single kernel execution in a thread.
  - Did not consider memory performance; the analysis targets only compute-intensive benchmarks.
- [8] includes the effect of parallelism to hide global memory access latency.
  - Does not take branch divergence into account.
- [14][15][16][17][18] automate optimization of GPGPU applications.
- None of the above works comes up with a comprehensive performance metric to estimate the efficiency of memory access patterns.
CuMAPz overview
Data Reuse Profit Estimation
- CuMAPz maintains a counter for the number of times each shared memory buffer is accessed. The degree of data reuse is represented by a term, data reuse, as follows:
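
The formula on this slide was an image and is not reproduced in the transcript. A plausible sketch, assuming the counters described above (total accesses to each buffer, normalized by the cost of fetching the buffers), is:

\[ \textit{DataReuse} = \frac{\sum_{b \in B} \textit{accesses}(b)}{\sum_{b \in B} \textit{size}(b)} \]

where $B$ is the set of shared memory buffers; this is a hedged reconstruction, not the paper's exact definition.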
Coalesced Access Profit Estimation
- Due to coalescing, the actual transfer size that consumes bus bandwidth can differ from the size of the data requested by the threads. CuMAPz calculates the bandwidth utilization as follows:
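
The slide's formula is likewise unavailable; given the definition above, a natural sketch is the ratio of the data size requested by the threads to the size actually transferred over the bus:

\[ \textit{BandwidthUtil} = \frac{\text{bytes requested by threads}}{\text{bytes actually transferred}} \]

This ratio is 1 for perfectly coalesced accesses and decreases as transactions carry unused data.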
Channel Skew Cost Estimation
- Channel skew refers to the case where concurrent memory accesses are not evenly distributed across all the channels but are focused on only a few of them.
- When a kernel is launched, thread blocks are assigned to SMs in sequential order, so that adjacent blocks are executed on adjacent SMs. The assignment becomes unpredictable after the first round of scheduling, since the order in which thread blocks finish execution cannot be determined [13].
Channel Skew Cost Estimation (cont.)
- The impact of channel skew can be quantified as the skewness of the mapping of accesses to channels, which can be calculated as follows:
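
The formula did not survive extraction; following the definition given earlier (the ratio of concurrent accesses to the most used channel over the least used one), a sketch is:

\[ \textit{ChannelSkew} = \frac{\max_{c \in C} A_c}{\min_{c \in C} A_c} \]

where $C$ is the set of memory channels and $A_c$ is the number of concurrent accesses mapped to channel $c$.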
Bank Conflict Cost Estimation
- Similarly to global memory channels, the shared memory space is divided into multiple banks, and each bank can serve one address at a time.
- The efficiency of shared memory accesses is modeled as follows:
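
The slide's formula is again unavailable; since requests to different addresses in one bank serialize, a plausible sketch models efficiency as the inverse of the worst per-bank serialization within a warp:

\[ \textit{BankEfficiency} = \frac{1}{\max_{k \in K} R_k} \]

where $K$ is the set of shared memory banks and $R_k$ is the number of distinct addresses a warp requests from bank $k$ (conflict-free access gives efficiency 1).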
Branch Divergence Cost Estimation
- Branches are introduced when there is an uncovered region that is not buffered into shared memory, as shown at Lines 6 and 13 in Figure 2.
- When threads in a warp take different execution paths, all the paths are serialized.
- We simply model the impact of branch divergence as follows:
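
A hedged sketch consistent with the serialization just described (not necessarily the paper's exact formula): the cost of a warp grows with the number of distinct execution paths its threads take,

\[ \textit{DivergenceCost}(w) = \left|\{\text{execution paths taken by threads in warp } w\}\right| \]

so a warp with no divergence has cost 1.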
Overall Memory Performance Estimation
- The memory performance estimate combines the profit and cost terms above, calculated by the following formula:
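
The combining formula is not reproduced in the transcript; one plausible shape, assuming the profit terms multiply and the cost terms divide (a sketch only), is:

\[ \textit{MemPerf} = \frac{\textit{DataReuse} \times \textit{BandwidthUtil} \times \textit{BankEfficiency}}{\textit{ChannelSkew} \times \textit{DivergenceCost}} \]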
Experimental results
- Environments
  - Implemented in the C language.
  - CUDA driver version 3.2 on an NVIDIA Tesla C1060.
  - Benchmarks are from the benchmark suite in [6] and the CUDA SDK.
Two experiments
- Validation: studying the correlation between our memory performance estimation and the actual performance of the benchmarks across different access patterns.
- Performance optimization: trying to find the best way to access shared and global memory using CuMAPz and the previous technique [8].
Validation
Performance optimization
Runtime Considerations
- The time complexity of the CuMAPz analysis is O(|W|*|R|*|B|), where W, R, and B are the sets of all warps, global memory references, and shared memory buffers, respectively (see the loop-nest sketch below).
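
A sketch of why the complexity has this shape, assuming the analysis visits every (warp, reference, buffer) combination once (the structure is an assumption, not CuMAPz's actual source):

#include <vector>
struct Warp {}; struct MemRef {}; struct Buffer {};
// Hypothetical skeleton: one constant-time analysis step per triple
// yields O(|W| * |R| * |B|) total work.
long analyzeAll(const std::vector<Warp> &W, const std::vector<MemRef> &R,
                const std::vector<Buffer> &B) {
    long visited = 0;
    for (size_t w = 0; w < W.size(); ++w)          // |W| warps
        for (size_t r = 0; r < R.size(); ++r)      // |R| global references
            for (size_t b = 0; b < B.size(); ++b)  // |B| shared buffers
                ++visited;  // stands in for the per-combination analysis
    return visited;
}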
Limitations
- Compile-time analysis
  - Cannot handle any information that can only be determined at run-time.
- Assumes adequate occupancy
  - Occupancy is a measure of how many thread blocks can be scheduled on one SM so that the hardware is kept busy.
Conclusions
- The GPU is a new platform for high-performance computing.
- This work develops CuMAPz to analyze the memory performance of CUDA programs.
- It considers many aspects, such as data reuse, coalescing, bank conflicts, and channel skew.
- Experimental results show a very high correlation between the actual execution times and the CuMAPz estimates.
