A Survey of the
Current State of
the Art in SIMD:
Or, How much wood could a woodchuck chuck if a
woodchuck could chuck n pieces of wood in parallel?
Wojtek Rajski, Nels Oscar,
David Burri, Alex Diede
Introduction
• We have seen how to improve performance through exploitation of:
  • Instruction-level parallelism
  • Thread-level parallelism
• One other form of parallelism we have not discussed is data-level parallelism.
Introduction
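• Flynn's Taxonomy
  • An organization of computer architectures based on their instruction and data streams
  • Divides all architectures into 4 categories:
    1. SISD (single instruction, single data)
    2. SIMD (single instruction, multiple data)
    3. MISD (multiple instruction, single data)
    4. MIMD (multiple instruction, multiple data)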
Introduction
• Implementations of SIMD
  • Prevalent in GPUs
  • SIMD extensions in CPUs
  • Embedded systems and mobile platforms
Introduction
• Software for SIMD
  • Many libraries utilize and encapsulate SIMD
  • Adopted in these areas
    o Graphics
    o Signal Processing
    o Video Encoding/Decoding
    o Some scientific applications
Introduction
• SIMD implementations fall into three high-level categories:
1. Vector Processors
2. Multimedia Extensions
3. Graphics Processors
Introduction
• Going forward:
  • Streaming SIMD Extensions (MMX/SSE/AVX)
    o Similar technology in GPUs
  • Compiler techniques for DLP
  • Problems in the world of SIMD
Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for
x86 computers. This figure assumes that two cores per chip for MIMD will be added every two years and the
number of operations for SIMD will double every four years. Copyright © 2011, Elsevier Inc.
SIMD in Hardware
• Register Size/Hardware changes
• Intel Core i7 example
• The ‘Roofline’ model
• Limitations of streaming extensions in a CPU
SIMD in Hardware
• Streaming SIMD requires some basic components
  o Wide registers
    ▪ Rather than 32 bits, have 64-, 128-, or 256-bit wide registers.
  o Additional control lines
  o Additional ALUs to handle simultaneous operations on operands of up to 16 bytes
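To make the idea of a wide register concrete, here is a minimal sketch in portable C. The `vec128` type and the functions `vec128_add` and `lanes_per_register` are hypothetical names introduced only for illustration; real hardware performs the lane-wise add in a single instruction rather than a loop.

```c
/* Hypothetical model of a 128-bit register holding four 32-bit floats. */
typedef struct { float lane[4]; } vec128;

/* How many elements fit in one register of the given width. */
int lanes_per_register(int register_bits, int element_bits) {
    return register_bits / element_bits;
}

/* One "SIMD add": every lane is added independently.
   In hardware the extra ALUs do these four additions at once. */
vec128 vec128_add(vec128 a, vec128 b) {
    vec128 r;
    for (int k = 0; k < 4; k++)
        r.lane[k] = a.lane[k] + b.lane[k];
    return r;
}
```

For example, `lanes_per_register(256, 32)` gives 8, which is why a 256-bit register can hold eight single-precision values.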
Hardware
Figure 4.4 Using multiple functional units to improve the performance of a single vector add instruction, C = A
+ B. The vector processor (a) on the left has a single add pipeline and can complete one addition per cycle. The vector
processor (b) on the right has four add pipelines and can complete four additions per cycle. The elements within a
single vector add instruction are interleaved across the four pipelines. The set of elements that move through the
pipelines together is termed an element group.
Intel i7
• The Intel Core i7
  o Superscalar processor
  o Contains several SIMD extensions
    ▪ 16 x 256-bit wide registers, plus physical registers in the pipeline.
    ▪ Support for 2- and 3-operand instructions
The Roofline Model of Performance
• The Roofline model of performance aggregates:
  • floating-point performance,
  • operational intensity, and
  • memory performance
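The relationship between these three quantities can be sketched as a single formula: attainable performance is the smaller of the peak floating-point rate and the memory bandwidth multiplied by the operational intensity. The function name `roofline_gflops` and its parameters are assumptions made for this sketch.

```c
/* Attainable GFLOP/s under the Roofline model.
   peak_gflops : peak floating-point performance (GFLOP/s)
   peak_bw_gbs : peak memory bandwidth (GB/s)
   intensity   : operational intensity (FLOPs per byte moved) */
double roofline_gflops(double peak_gflops, double peak_bw_gbs, double intensity) {
    double memory_bound = peak_bw_gbs * intensity;  /* the slanted part of the roof */
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

With illustrative (not measured) numbers, a machine with a 16 GFLOP/s peak and 8 GB/s of bandwidth is memory-bound at an intensity of 0.5 FLOPs/byte and compute-bound at 4 FLOPs/byte.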
The Roofline Model of Performance
Opteron X2
Limitations
• Memory Latency
• Memory Bandwidth
• The actual amount of vectorizable code
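The last limitation can be illustrated with an Amdahl's-law style estimate, which is an assumption brought in for this sketch rather than something the slides state. Here `f` is the fraction of run time that is vectorizable and `width` is the number of elements processed per SIMD instruction; `simd_speedup` is a hypothetical name.

```c
/* Idealized overall speedup when a fraction f of the work runs
   `width` times faster and the rest stays scalar.
   Ignores memory latency and bandwidth limits. */
double simd_speedup(double f, double width) {
    return 1.0 / ((1.0 - f) + f / width);
}
```

Under this model, even a 4-wide unit gives well under 2x overall when only half of the code vectorizes.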
SIMD at the software level
• SIMD is not a new field.
• But more focus has been brought to it by the GPGPU movement.
SIMD at the software level
• CUDA
  • Developed by Nvidia
  • Compute Unified Device Architecture
  • Restricted to GPUs with chips from Nvidia
  • Graphics cards G8x and newer
  • Provides both high- and low-level APIs
SIMD at the software level
• OpenCL
  • Initially developed by Apple, now maintained by the Khronos Group
  • Open to any vendor that decides to support it
  • Designed to execute across GPUs and CPUs
  • Graphics cards G8x and newer
  • Provides both high- and low-level APIs
SIMD at the software level
• Direct Compute
  • Developed by Microsoft
  • Open to any vendor that supports DirectX 11
  • Windows only
  • Graphics cards GTX400 and HD5000
  • Intel’s Ivy Bridge will also be supported
Compiler Optimization
• Not everyone programs in SIMD-based languages.
• But C and Java were never designed with SIMD in mind.
• Compiler technology had to improve to catch code with vectorizable instructions.
Compiler Optimization
• Before optimization can begin
  • Data dependencies have to be understood
  • But only dependencies within the vector window size matter
  • Vector window size - the amount of data processed in parallel by one SIMD instruction
Compiler Optimization
• Before optimization can begin
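  • Example (vector window of 4 elements):
for (int i = 0; i < 16; i++) {
    c[i] = c[i+1];     /* dependence distance 1 */
    c[i] = c[i+16];    /* dependence distance 16 */
}
for (int i = 0; i < 16; i += 4) {
    c[i]   = c[i+1];
    c[i+1] = c[i+2];   /* (Wrong) distance 1 is inside the window */
    c[i+2] = c[i+3];   /* (Wrong) */
    c[i+3] = c[i+4];   /* (Wrong) */
    c[i]   = c[i+16];  /* distance 16 is outside the window */
    c[i+1] = c[i+17];
    c[i+2] = c[i+18];
    c[i+3] = c[i+19];
}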
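The rule that only dependencies inside the vector window matter can be sketched as a simple distance check. This is a deliberately conservative simplification (it ignores the direction of the dependence), and `dependence_matters` is a hypothetical helper, not part of any compiler.

```c
#include <stdlib.h>

/* Returns 1 if a loop-carried dependence of the given distance (in
   elements) falls inside a window of `window` elements, 0 otherwise.
   A distance of 0 is not loop-carried and is ignored here. */
int dependence_matters(int distance, int window) {
    int d = abs(distance);
    return d != 0 && d < window;
}
```

For the example above with a window of 4, distance 1 is flagged and distance 16 is not.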
Compiler Optimization
• Framework for vectorization
  o Prelude
  o Loop
  o Postlude
  o Cleanup
Compiler Optimization
• Framework for vectorization
  • Prelude
    • Loop-independent variables are prepared for use.
    • Run-time checks confirm that vectorization is possible.
  • Loop
    • Vectorizable instructions are performed in the same order as in the original code.
    • The loop could be split into multiple loops.
    • Vectorizable sections could be split by more complex code in the original loop.
Compiler Optimization
• Framework for vectorization
  o Postlude
    ▪ All loop-independent variables are returned.
  o Cleanup
    ▪ Non-vectorizable iterations of the loop are run.
    ▪ These include the remainder of vectorizable instructions that do not fit evenly into the vector size.
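The four phases can be sketched on a small reduction in portable C. This is only an illustration under stated assumptions: the window size `VL`, the function name `scaled_sum`, and the use of small arrays in place of real vector registers are all choices made for this sketch.

```c
#include <stddef.h>

#define VL 4  /* assumed vector window: four elements per "instruction" */

/* Computes the sum of a[i] * s over n elements. */
long scaled_sum(const int *a, int s, size_t n) {
    /* Prelude: put loop-invariant values into vector form. */
    long vs[VL], acc[VL];
    for (int k = 0; k < VL; k++) { vs[k] = s; acc[k] = 0; }

    /* Loop: each outer iteration stands in for one group of vector
       instructions. The bound check doubles as the run-time test
       that a full vector of work is available. */
    size_t i = 0;
    for (; i + VL <= n; i += VL)
        for (int k = 0; k < VL; k++)
            acc[k] += a[i + k] * vs[k];

    /* Postlude: fold the vector accumulator back into a scalar. */
    long sum = 0;
    for (int k = 0; k < VL; k++) sum += acc[k];

    /* Cleanup: leftover iterations that do not fill a whole vector. */
    for (; i < n; i++) sum += (long)a[i] * s;
    return sum;
}
```

With ten elements, two full groups go through the vector loop and the last two elements are handled by the cleanup loop.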
Compiler Optimization
• Compiler techniques
  • Loop Level Automatic Vectorization
  • Basic Block Level Automatic Vectorization
  • In the presence of control flow
Compiler Optimization
• Loop Level Automatic Vectorization
  1. Find the innermost loop that can be vectorized.
  2. Transform the loop and create vector instructions.
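Original Code
for (i = 0; i < 1024; i+=1)
    C[i] = A[i]*B[i];
Vectorized Code
for (i = 0; i < 1024; i+=4) {
    vA = vec_ld(&A[i]);      /* load A[i..i+3] */
    vB = vec_ld(&B[i]);      /* load B[i..i+3] */
    vC = vec_mul(vA, vB);    /* four multiplies in one instruction */
    vec_st(vC, &C[i]);       /* store C[i..i+3] */
}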
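The vectorized loop above is pseudocode. A runnable approximation in portable C might look like the following, where the `v4f` type and the helpers `v4_load`, `v4_mul`, and `v4_store` are hypothetical stand-ins for real vector instructions, and `multiply_arrays` is a name chosen for this sketch.

```c
#include <stddef.h>

typedef struct { float lane[4]; } v4f;  /* hypothetical 4-lane vector value */

static v4f v4_load(const float *p) {
    v4f v;
    for (int k = 0; k < 4; k++) v.lane[k] = p[k];
    return v;
}

static v4f v4_mul(v4f a, v4f b) {
    v4f r;
    for (int k = 0; k < 4; k++) r.lane[k] = a.lane[k] * b.lane[k];
    return r;
}

static void v4_store(v4f v, float *p) {
    for (int k = 0; k < 4; k++) p[k] = v.lane[k];
}

/* C[i] = A[i] * B[i], four elements per iteration.
   Elements past the last full group of four are left untouched. */
void multiply_arrays(float *C, const float *A, const float *B, size_t n) {
    for (size_t i = 0; i + 4 <= n; i += 4) {
        v4f vA = v4_load(&A[i]);
        v4f vB = v4_load(&B[i]);
        v4f vC = v4_mul(vA, vB);
        v4_store(vC, &C[i]);
    }
}
```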
Compiler Optimization
• Basic Block Level Automatic Vectorization
  1. The innermost loop is unrolled by the size of the vector window.
  2. Isomorphic scalar instructions are packed into vector instructions.
Original Code
for (i = 0; i < 1024; i+=1)
    C[i] = A[i]*B[i];
After Unrolling (isomorphic statements, ready to be packed)
for (i = 0; i < 1024; i+=4) {
    C[i]   = A[i]*B[i];
    C[i+1] = A[i+1]*B[i+1];
    C[i+2] = A[i+2]*B[i+2];
    C[i+3] = A[i+3]*B[i+3];
}
Compiler Optimization
• In the presence of control flow
  1. Apply predication
  2. Apply method from above
  3. Remove vector predication
  4. Remove scalar predication
Original Code
for (i = 0; i < 1024; i+=1){
    if (A[i] > 0)
        C[i] = B[i];
    else
        D[i] = D[i-1];
}
After Predication
for (i = 0; i < 1024; i+=1){
    P = A[i] > 0;
    NP = !P;
    C[i] = B[i];      (P)
    D[i] = D[i-1];    (NP)
}
Compiler Optimization
• In the presence of control flow
After Vectorization
for (i = 0; i < 1024; i+=4){
    vP = A[i:i+3] > (0,0,0,0);
    vNP = vec_not(vP);
    C[i:i+3] = B[i:i+3];     (vP)
    (NP1,NP2,NP3,NP4) = vNP;
    D[i]   = D[i-1];         (NP1)
    D[i+1] = D[i];           (NP2)
    D[i+2] = D[i+1];         (NP3)
    D[i+3] = D[i+2];         (NP4)
}
After Removing Predicates
for (i = 0; i < 1024; i+=4){
    vP = A[i:i+3] > (0,0,0,0);
    vNP = vec_not(vP);
    C[i:i+3] = vec_sel(C[i:i+3], B[i:i+3], vP);
    (NP1,NP2,NP3,NP4) = vNP;
    if (NP1) D[i]   = D[i-1];
    if (NP2) D[i+1] = D[i];
    if (NP3) D[i+2] = D[i+1];
    if (NP4) D[i+3] = D[i+2];
}
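The effect of predication followed by a select can be checked with a small portable C sketch. It is an approximation under stated assumptions: the loops start at i = 1 so that D[i-1] stays in bounds, the window size `VL` is 4, and the names `branchy` and `if_converted` are hypothetical.

```c
#include <stddef.h>

#define VL 4  /* assumed vector window: four elements */

/* Reference: the original loop with control flow. */
void branchy(const int *A, const int *B, int *C, int *D, size_t n) {
    for (size_t i = 1; i < n; i++) {
        if (A[i] > 0) C[i] = B[i];
        else          D[i] = D[i - 1];
    }
}

/* If-converted version: one group of VL elements per outer iteration. */
void if_converted(const int *A, const int *B, int *C, int *D, size_t n) {
    size_t i = 1;
    for (; i + VL <= n; i += VL) {
        int p[VL];
        /* Compare all lanes: models vP = A[i:i+3] > (0,0,0,0). */
        for (int k = 0; k < VL; k++) p[k] = A[i + k] > 0;
        /* Select per lane: models vec_sel(C, B, vP). */
        for (int k = 0; k < VL; k++) C[i + k] = p[k] ? B[i + k] : C[i + k];
        /* The D recurrence stays scalar and guarded, in ascending order. */
        for (int k = 0; k < VL; k++)
            if (!p[k]) D[i + k] = D[i + k - 1];
    }
    /* Cleanup for elements that do not fill a whole group. */
    for (; i < n; i++) {
        if (A[i] > 0) C[i] = B[i];
        else          D[i] = D[i - 1];
    }
}
```

The update of C becomes branch-free, while the update of D carries a dependence of distance 1 and therefore remains a sequence of guarded scalar statements.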
CPU vs GPU
• The GPU as we know it today was introduced by Nvidia in 1999
• Popularity has increased in recent years
VisionTek GeForce 256 [Wikipedia]
Nvidia GeForce GTX590 [Nvidia]
CPU vs GPU
• Theoretical GFLOP/s & Bandwidth
[Nvidia, NVIDIA CUDA C Programming Guide]
CPU vs GPU
• Intel Core i7 Nehalem Die Shot
[NVIDIA’s Fermi: The First Complete GPU Computing Architecture]
CPU vs GPU
Game, Little Big Planet [http://trendygamers.com]
CPU vs GPU
• OpenGL Graphics Pipeline
[Wojtek Palubicki; http://pages.cpsc.ucalgary.ca/~wppalubi/]
CPU vs GPU
• CPU SIMD vs. GPU SIMD
  • Intel’s Sandy Bridge architecture:
    • 256-bit AVX --> 8 single-precision values processed in parallel
  • Nvidia’s Fermi: up to 512 CUDA cores (16 multiprocessors of 32 cores each) performing raw mathematical operations in parallel
CPU vs GPU
• Nvidia’s Fermi
Source: http://www.legitreviews.com/article/1193/2/
CPU vs GPU
• Nvidia’s Fermi
[Nvidia; NVIDIA’s Next Generation CUDA
Compute Architecture: Fermi]
Standardization Problems and
Industry Challenges
[Widescreen Wallpapers; http://widescreen.dpiq.org/30__AMD_vs_Intel_Challenge.htm]
Standardization Problems and
Industry Challenges
• 1998
  o AMD - 3DNow!
  o Intel - SSE instruction set a year later (1999), without supporting 3DNow!
  o Intel won this battle since SSE was better
Standardization Problems and
Industry Challenges
• 2001
o Intel - Itanium processor (64-bit, parallel computing instruction set)
o AMD - Its own 64-bit instruction set (backward compatible)
o AMD won this time because of its backward compatibility.
• 2007
o AMD - SSE5
o Intel - AVX
Standardization Problems and
Industry Challenges
• Example: fused-multiply-add (FMA)
  o d = a + b*c
• AMD
  o Has supported FMA4 since 2011
  o FMA4 - 4-operand form
• Intel
  o Will support FMA3 in 2013 with Haswell
  o FMA3 - 3-operand form
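The difference between the two operand forms can be sketched in plain C. The names `fma4_form` and `fma3_form` are hypothetical, and the sketch only models where the result goes: a real FMA instruction rounds once, whereas `a + b * c` in C may round twice.

```c
/* 4-operand form: the destination is separate from the three
   sources, so a, b and c all survive the operation. */
double fma4_form(double a, double b, double c) {
    return a + b * c;   /* d = a + b*c */
}

/* 3-operand form: the destination is also one of the sources,
   so the old value of *a is overwritten. */
void fma3_form(double *a, double b, double c) {
    *a = *a + b * c;    /* a = a + b*c */
}
```

Code written against one form cannot run unchanged on hardware that only offers the other, which is the maintenance burden described on the next slide.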
Standardization Problems and
Industry Challenges
• This causes
  • More work for the programmer
  • Code that is practically impossible to maintain
Standardization required!
Conclusion
• SIMD processors exploit data-level parallelism, increasing performance.
• The hardware requirements are easily met as transistor size decreases.
• HPC languages have been created to give programmers access to high- and low-level SIMD operations.
Conclusion
• Compiler technology has improved to recognize some potential SIMD operations in serial code.
• The utility of SIMD instructions in modern microprocessors is diminishing, except in special-purpose applications, due to standardization problems and industry in-fighting.
• The increasing adoption of GPGPU computing has the potential to supplant SIMD-type instructions in the CPU.
• On-chip GPUs appear to be on the horizon, so wider really is better.
