A Performance Analysis Framework for Identifying Potential Benefits in GPGPU Applications

Jaewoong Sim, Aniruddha Dasgupta, Hyesoon Kim, Richard Vuduc
| Motivation
| GPUPerf: Performance analysis framework
  - Performance Advisor
  - Analytical Model
  - Frontend Data Collector
| Evaluations
| Conclusion
| GPGPU architectures have become very powerful.
| Programmers want to convert CPU applications to GPGPU applications.
| Case 1: 10x speed-up (CPU version → GPGPU version)
| Case 2: 1.1x speed-up (CPU version → GPGPU version)
| For case 1, programmers might wonder if 10x is the best speed-up.
| For case 2, programmers might wonder why the benefit is so poor.
  - Maybe the algorithm is not parallelizable.
  - Maybe the GPGPU code is not well optimized.
| Programmers want to optimize code whenever possible!
| Optimizing parallel programs is difficult^100!
[Figure: normalized performance of this kernel with one optimization at a time (Baseline, Shared Memory, SFU, Tight, UJAM; Shared Memory is the best for this kernel) and with combinations (Shared Memory + SFU, Shared Memory + Tight, Shared Memory + UJAM; a Shared Memory combination is still the best).]
| Most programmers apply optimization techniques one by one.
| Try one more optimization on top of Shared Memory. Which one to choose?
  - Programmers want to understand the benefit of Shared Memory + another optimization!
| Providing performance guidance is not easy.
  - Program analysis: obtain as much program information as possible
  - Performance modeling: have a sophisticated analytical model
  - User-friendly metrics: convert the performance analysis information into performance guidance
| We propose GPUPerf, a performance analysis framework.
  - Quantitatively predicts potential performance benefits
| In this talk, we will focus more on performance modeling and potential benefit metrics.
| Motivation
| GPUPerf: A Performance Analysis Framework
  - Performance Advisor
  - Analytical Model
  - Frontend Data Collector
| Evaluations
| Conclusion
| What is required for performance guidance?
  - Program analysis
  - Performance modeling
  - User-friendly metrics
[Diagram: the GPUPerf pipeline. A GPGPU kernel feeds the Frontend Data Collector, which produces ILP, #insts, etc. for the Analytical Model; the model output goes to the Performance Advisor, which produces benefit metrics.]
For clarity, each component will be explained in reverse order.
| Goal of the performance advisor
  - Convey performance bottleneck information
  - Estimate the potential gains from reducing the bottlenecks
| The performance advisor provides four potential benefit metrics, computed from our analytical model's output:
  - Bitilp : benefits of increasing ITILP
  - Bmemlp : benefits of increasing MLP
  - Bserial : benefits of removing serialization effects
  - Bfp : benefits of improving computing inefficiency
| Programmers can get an idea of the potential benefit of a GPGPU kernel.
| MWP (Memory Warp Parallelism)
  - Indicator of memory-level parallelism
[Diagram: 8 warps issuing memory requests; with MWP = 4, four warps' memory accesses overlap at a time.]
| CWP (Compute Warp Parallelism)
[Diagram: CWP = 3; the computation periods of three warps fit within one memory period.]
| Depending on MWP and CWP, the execution time is predicted by the MWP-CWP model [Hong and Kim, ISCA'09].
| The MWP-CWP model can predict general cases.
| Problem: it did not model corner cases, which are critical for predicting the benefits of different program optimizations!
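To make the two metrics concrete, here is a minimal Python sketch of MWP and CWP estimation, assuming the simplified formulas of the MWP-CWP model [Hong and Kim, ISCA'09]; the parameter names and the toy latency numbers are our own illustration, not values from this work.

```python
# A minimal sketch of MWP/CWP estimation, assuming the simplified
# MWP-CWP formulas; parameter names and numbers are illustrative only.

def mwp(mem_latency, departure_delay, n_warps, mwp_peak_bw):
    """Memory warp parallelism: warps whose memory accesses can overlap."""
    mwp_without_bw = mem_latency / departure_delay
    return min(mwp_without_bw, mwp_peak_bw, n_warps)

def cwp(mem_cycles, comp_cycles, n_warps):
    """Compute warp parallelism: warps whose computation fits in one memory period."""
    return min((mem_cycles + comp_cycles) / comp_cycles, n_warps)

# 8 warps, 400-cycle memory latency, 100-cycle departure delay:
print(mwp(400, 100, n_warps=8, mwp_peak_bw=6))          # -> 4.0 (MWP = 4)
print(cwp(mem_cycles=400, comp_cycles=200, n_warps=8))  # -> 3.0 (CWP = 3)
```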
| Our analytical model follows a top-down approach.
  - Easy-to-interpret model components
  - Relate them directly to performance bottlenecks
Texec = Tcomp + Tmem - Toverlap   (see the sketch below)
  - Texec : final execution time
  - Tcomp : computation time
  - Tmem : memory time
  - Toverlap : overlapped time
[Diagram: a timeline of 4 warps with MWP = 2; Comp and Mem periods interleave, and Toverlap is the part of the memory time hidden under computation.]
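The top-level equation can be written out directly; a minimal sketch, with cycle counts made up purely for illustration:

```python
# Texec = Tcomp + Tmem - Toverlap, written out directly; the cycle
# counts below are made up for illustration.

def t_exec(t_comp, t_mem, t_overlap):
    """Final execution time: compute time plus the non-hidden memory time."""
    return t_comp + t_mem - t_overlap

# If 300 of the 400 memory cycles are hidden under computation:
print(t_exec(t_comp=500, t_mem=400, t_overlap=300))  # -> 600
```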
| Tcomp is the amount of time to execute compute instructions.
Tcomp = Wparallel + Wserial
  - Wparallel : work executed in parallel (useful work)
  - Wserial : overhead due to serialization effects
| Wparallel is the amount of work that can be executed in parallel.
Wparallel = Total insts × Effective inst. throughput
Effective inst. throughput = average_instruction_latency / ITILP
  - (a function of warp_size, SIMD_width, and the number of pipeline stages)
| ITILP is inter-thread ILP: the number of instructions that can be executed in parallel in the pipeline.
ITILP = MIN(ILP × N, ITILPmax),  ITILPmax = avg_inst_lat / (warp_size / SIMD_width)
  - where N is the number of active warps (TLP); see the sketch below.
[Diagram: with ILP = 4/3, TLP = 1 gives ITILP = 4/3 (low ITILP, frequent stalls); TLP = 2 gives ITILP = 8/3; TLP = 3 gives ITILP = ITILPmax (the execution latency is already all hidden).]
[Figure: execution time (msec) vs. TLP (N). As TLP increases, execution time reduces until the latency is fully hidden; the new model tracks the actual measurements, while the MWP-CWP model does not.]
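A minimal sketch of Wparallel and ITILP, following the formulas above; the warp size, SIMD width, and instruction counts below are illustrative defaults, not this paper's measured values:

```python
# Wparallel = total insts x effective throughput, with
# effective throughput = avg_inst_lat / ITILP and
# ITILP = min(ILP x N, ITILPmax); defaults are illustrative only.

def itilp(ilp, n_warps, avg_inst_lat, warp_size=32, simd_width=16):
    """Inter-thread ILP, capped once the pipeline latency is fully hidden."""
    itilp_max = avg_inst_lat / (warp_size / simd_width)
    return min(ilp * n_warps, itilp_max)

def w_parallel(total_insts, avg_inst_lat, ilp, n_warps):
    """Useful parallel work in cycles."""
    eff_throughput = avg_inst_lat / itilp(ilp, n_warps, avg_inst_lat)
    return total_insts * eff_throughput

# With ILP = 4/3 as in the slide's example, ITILP grows with TLP
# (4/3, then 8/3, then ITILPmax) and Wparallel shrinks accordingly:
for n in (1, 2, 3):
    print(n, w_parallel(total_insts=1000, avg_inst_lat=8, ilp=4/3, n_warps=n))
```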
| Wserial represents the overhead due to serialization effects.
Wserial = Osync + OSFU + OCFDiv + Obank
  - Osync : synchronization overhead
  - OSFU : SFU contention overhead
  - OCFDiv : branch-divergence overhead
  - Obank : bank-conflict overhead
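Combined with the previous slide, Tcomp follows in two lines; a minimal sketch with placeholder overhead values:

```python
# Wserial = Osync + OSFU + OCFDiv + Obank, feeding Tcomp = Wparallel + Wserial;
# the overhead values below are placeholders, not measurements.

def w_serial(o_sync, o_sfu, o_cfdiv, o_bank):
    """Total serialization overhead."""
    return o_sync + o_sfu + o_cfdiv + o_bank

def t_comp(w_par, w_ser):
    """Computation time: useful parallel work plus serialization overhead."""
    return w_par + w_ser

print(t_comp(w_par=2000,
             w_ser=w_serial(o_sync=120, o_sfu=80, o_cfdiv=200, o_bank=40)))  # -> 2440
```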
| GPGPUs have SFUs, where expensive operations can be executed.
  - With a good ratio of regular instructions to SFU instructions, the SFU execution cost can be hidden (see the sketch below).
[Diagram: with a high inst-to-SFU ratio, SFU instructions execute in the shadow of regular instructions; with a low inst-to-SFU ratio, the OSFU overhead becomes visible.]
[Figure: execution time (msec) vs. number of SFU instructions per eight FMA instructions. The latency of SFU instructions is not completely hidden as the ratio grows; the new model captures this, while the MWP-CWP model does not.]
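The hiding effect can be captured with a toy model; the sketch below is our own simplification (a fixed "coverable" ratio), not the paper's OSFU formula:

```python
# Toy model (ours, not the paper's OSFU formula): SFU instructions that
# the regular instruction stream can cover are free; the excess is exposed.

def o_sfu(n_sfu, n_regular, sfu_cycles, hidden_ratio=8):
    """Visible SFU contention overhead contributing to Wserial."""
    coverable = n_regular / hidden_ratio   # SFU insts hidden per regular insts
    exposed = max(0.0, n_sfu - coverable)
    return exposed * sfu_cycles

print(o_sfu(n_sfu=1, n_regular=8, sfu_cycles=16))  # -> 0.0  (fully hidden)
print(o_sfu(n_sfu=4, n_regular=8, sfu_cycles=16))  # -> 48.0 (overhead exposed)
```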
| Tmem represents the amount of time spent on memory requests and transfers.
Tmem = Effective mem. requests × AMAT
[Diagram: with 4 memory requests and MWP = 2, Tmem = 4·MEM / 2; with MWP = 1, Tmem = 4·MEM / 1.]
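A minimal sketch, assuming (as the diagram suggests) that MWP requests proceed in parallel, so the request stream is effectively divided by MWP:

```python
# Tmem = effective memory requests x AMAT; with MWP requests in flight,
# the effective number of serialized requests is n_requests / MWP.

def t_mem(n_requests, amat, mwp):
    """Memory time in cycles."""
    return (n_requests / mwp) * amat

MEM = 400  # illustrative average memory access time (cycles)
print(t_mem(4, MEM, mwp=2))  # -> 800.0  (4*MEM / 2)
print(t_mem(4, MEM, mwp=1))  # -> 1600.0 (4*MEM / 1)
```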
| Toverlap represents how much of the memory cost can be hidden by multithreading.
  - Case MWP ≥ CWP: Toverlap ≈ Tmem
[Diagram: MWP = 3, CWP = 3; all the memory costs are overlapped with computation.]
  - Case CWP > MWP: Toverlap ≈ Tcomp
[Diagram: MWP = 2, CWP = 4; the computation cost is hidden by the memory cost.]
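The two cases collapse into one branch; a minimal sketch wired back into the top-level equation:

```python
# Toverlap ≈ Tmem when MWP >= CWP, and ≈ Tcomp when CWP > MWP,
# per the two cases above; the numbers are illustrative.

def t_overlap(t_comp, t_mem, mwp, cwp):
    """Approximate overlap between computation and memory time."""
    return t_mem if mwp >= cwp else t_comp

t_c, t_m = 500, 400
print(t_c + t_m - t_overlap(t_c, t_m, mwp=3, cwp=3))  # -> 500 (memory fully hidden)
print(t_c + t_m - t_overlap(t_c, t_m, mwp=2, cwp=4))  # -> 400 (compute hidden by memory)
```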
| The time metrics are converted into potential benefit metrics (see the sketch below).
  - Tfp : ideal computation cost
  - Tmem_min : ideal memory cost
[Potential benefit chart: comp cost vs. mem cost. Starting from the current kernel's Tcomp, the gaps Bserial, Bitilp, and Bfp mark successive reductions toward the ideal computation cost Tfp; on the memory axis, Bmemlp marks the reduction from the optimized kernel's Tmem' toward Tmem_min.]
Benefit metric : benefits of
  - Bmemlp : increasing MLP
  - Bserial : removing serialization effects
  - Bitilp : increasing inter-thread ILP
  - Bfp : improving computing efficiency
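Read off the chart, each benefit metric is the gap closed by removing one bottleneck at a time; the sketch below is our reading of the chart, not GPUPerf's exact equations, and every input number is made up:

```python
# Benefit metrics as successive gaps (our reading of the chart, not the
# paper's exact equations); all inputs are made-up cycle counts.

def benefit_metrics(t_comp, w_serial, w_par_at_itilp_max, t_fp,
                    t_mem_prime, t_mem_min):
    return {
        "B_serial": w_serial,                                 # remove serialization
        "B_itilp": (t_comp - w_serial) - w_par_at_itilp_max,  # raise ITILP to its max
        "B_fp": w_par_at_itilp_max - t_fp,                    # reach ideal compute cost
        "B_memlp": t_mem_prime - t_mem_min,                   # reach ideal memory cost
    }

print(benefit_metrics(t_comp=2440, w_serial=440, w_par_at_itilp_max=1500,
                      t_fp=900, t_mem_prime=800, t_mem_min=500))
```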
Frontend Data Collector
[Diagram: a CUDA executable is processed by Ocelot [Diamos et al., PACT'10], the Compute Visual Profiler, an Instruction Analyzer (IA) on the CUDA binary (CUBIN), and static analysis tools; the collected information (#insts, occupancy, #SFU_insts, ILP, MLP, ...) is fed into our analytical model.]
| Detailed information from emulating PTX executions (Ocelot):
  - #insts, global LD/ST requests, cache info
| Architecture-related information from H/W counters (Compute Visual Profiler):
  - occupancy
| Information from CUDA binaries (CUBIN) instead of PTX, reflecting low-level compiler optimizations (Instruction Analyzer and static analysis tools):
  - #SFU insts, #sync insts, loop counters, ILP, MLP
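As an illustration of what flows between the components, here is a hypothetical container for the collected inputs; the field names are ours, not GPUPerf's actual interface:

```python
# Hypothetical record of the frontend data collector's outputs; field
# names are illustrative, not GPUPerf's actual interface.

from dataclasses import dataclass

@dataclass
class KernelMetrics:
    n_insts: int          # Ocelot PTX emulation
    n_global_ldst: int    # Ocelot PTX emulation
    cache_info: dict      # Ocelot PTX emulation
    occupancy: float      # Compute Visual Profiler (H/W counters)
    n_sfu_insts: int      # Instruction Analyzer on the CUBIN
    n_sync_insts: int     # Instruction Analyzer on the CUBIN
    ilp: float            # static analysis
    mlp: float            # static analysis

m = KernelMetrics(n_insts=1_000_000, n_global_ldst=120_000, cache_info={},
                  occupancy=0.67, n_sfu_insts=50_000, n_sync_insts=4_000,
                  ilp=4/3, mlp=2.0)
```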
| Motivation
| GPUPerf: A Performance Analysis Framework
  - Performance Advisor
  - Analytical Model
  - Frontend Data Collector
| Evaluations
| Conclusion
| NVIDIA C2050 (Fermi architecture)
| FMM (Fast Multipole Method): approximation of the n-body problem [winner of the 2010 Gordon Bell Prize at Supercomputing]
  - Optimizations: Prefetching (pref), SFU (rsqrt), Vector Packing (vecpack), Loop Unrolling / Unroll-Jam (ujam), Shared Memory (shmem), and Loop optimization (tight); 44 optimization combinations in total
| Parboil benchmarks, Reduction (in the paper)
[Figure: actual speedup over no optimizations for all 44 optimization combinations; the Vector Packing + Shared Memory + Unroll-Jam + SFU combination shows the best performance.]
[Figure: actual vs. predicted speedups over no optimizations for the 44 optimization combinations.]
| Our model follows the actual speed-up trend quite well.
| Our model correctly pinpoints the best optimization combination that improves the kernel.
[Figure: normalized Bfp (computing inefficiency; higher is worse) alongside actual speedup, for baseline, vecpack, vecpack_rsqrt, vecpack_rsqrt_shmem, vecpack_rsqrt_shmem_ujam, and vecpack_rsqrt_shmem_ujam_pref.]
| Bfp implies that the kernel could be improved via optimizations.
| The small Bfp value indicates that adding Prefetching (pref) does not lead to further performance improvement.
| We propose GPUPerf, a performance analysis framework.
  - Frontend data collector, analytical model, and performance advisor.
| The performance advisor provides potential benefit metrics (Bmemlp, Bserial, Bitilp, Bfp), which can guide performance tuning for GPGPU code.
  - The 44 optimization combinations in FMM are predicted well.
| Future work: the potential benefit metrics can serve as inputs to compilers.