CUDA at the University of Akron

Prepared 6/23/2011 by T. O’Neil for 3460:677, Fall 2011, The
University of Akron.
 Your own PCs running G80 emulators
 Better debugging environment
 Sufficient for the first couple of weeks
 Your own PCs with a CUDA-enabled GPU
 NVIDIA boards in department
 GeForce family of processors for high-performance
gaming
 Tesla C2070 for high-performance computing – no
graphics output (?) and more memory
CUDA at the University of Akron – Slide 2
Description                    Card Models       Where Available
Low Power                      Ion               Netbooks in CAS 241.
Consumer Graphics Processors   GeForce 8500GT    Add-in cards in Dell Optiplex 745s
                               GeForce 9500GT    in department.
                               GeForce 9600GT
2nd Generation GPUs            GeForce GTX275    In Dell Precision T3500s in department.
Fermi GPUs                     GeForce GTX480    In select Dell Precision T3500s in department.
                               Tesla C2070       In Dell Precision T7500 Linux server
                                                 (tesla.cs.uakron.edu)
CUDA at the University of Akron – Slide 3
 Basic building block is a “streaming multiprocessor”
 different chips have different numbers of these SMs:
Product          SMs   Compute Capability
GeForce 8500GT    2    v. 1.1
GeForce 9500GT    4    v. 1.1
GeForce 9600GT    8    v. 1.1
CUDA at the University of Akron – Slide 4
 Basic building block is a “streaming multiprocessor”
with
 8 cores, each with 2048 registers
 up to 128 threads per core
 16KB of shared memory
 8KB cache for constants held in device memory
 different chips have different numbers of these SMs:
Product   SMs   Bandwidth   Memory   Compute Capability
GTX275    30    127 GB/s    1-2 GB   v. 1.3
CUDA at the University of Akron – Slide 5
 each streaming multiprocessor has
 32 cores, each with 1024 registers
 up to 48 threads per core
 64KB of shared memory / L1 cache
 8KB cache for constants held in device memory
 there’s also a unified 384KB L2 cache
 different chips again have different numbers of SMs:
Product       SMs   Bandwidth   Memory     Compute Capability
GTX480        15    180 GB/s    1.5 GB     v. 2.0
Tesla C2070   14    140 GB/s    6 GB ECC   v. 2.1
CUDA at the University of Akron – Slide 6
Feature                                                                v. 1.1   v. 1.3, 2.x
Integer atomic functions operating on 64-bit words in global memory     no         yes
Integer atomic functions operating on 32-bit words in shared memory     no         yes
Warp vote functions                                                      no         yes
Double-precision floating-point operations                               no         yes
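As a rough illustration of two of these features, the sketch below (kernel name and data layout are invented for this example) uses an integer atomic on shared memory and the warp vote function __any(); it needs a card from the v. 1.3 column of this table (e.g. the GTX275), not one of the v. 1.1 cards.

// Sketch only: shared-memory atomics and warp votes are not available on v. 1.1 devices.
__global__ void countPositives(const int *in, int *out, int n)
{
    __shared__ int blockCount;                  // per-block counter in shared memory
    if (threadIdx.x == 0) blockCount = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int isPos = (i < n && in[i] > 0);

    atomicAdd(&blockCount, isPos);              // integer atomic on a 32-bit word in shared memory
    int anyInWarp = __any(isPos);               // warp vote: nonzero if any thread in the warp set isPos
    (void)anyInWarp;                            // not used further; shown only to illustrate the feature

    __syncthreads();
    if (threadIdx.x == 0) atomicAdd(out, blockCount);   // accumulate the block total into global memory
}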
CUDA at the University of Akron – Slide 7
Feature                                                            v. 1.1, 1.3   v. 2.x
3D grid of thread blocks                                                no          yes
Floating-point atomic addition operating on 32-bit words
in global and shared memory                                             no          yes
__ballot()                                                              no          yes
__threadfence_system()                                                  no          yes
__syncthreads_count(), __syncthreads_and(), __syncthreads_or()          no          yes
Surface functions                                                       no          yes
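To make a couple of these concrete, here is a hedged kernel sketch (names invented for illustration) using __ballot() and __syncthreads_count(); both sit in the v. 2.x column, i.e. they need the GTX480 or Tesla C2070.

// Sketch only: __ballot() and __syncthreads_count() require a v. 2.x device.
// warpMasks is assumed to hold one entry per warp in the grid.
__global__ void tallyNonzero(const float *in, unsigned int *warpMasks, int *blockCounts, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int pred = (i < n && in[i] != 0.0f);

    unsigned int mask = __ballot(pred);         // one bit per warp lane whose predicate is true
    if ((threadIdx.x & 31) == 0)
        warpMasks[i / 32] = mask;               // first lane of each warp records the mask

    int count = __syncthreads_count(pred);      // how many threads in the whole block satisfied pred
    if (threadIdx.x == 0)
        blockCounts[blockIdx.x] = count;
}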
CUDA at the University of Akron – Slide 8
Spec                                                           Value
Maximum x- or y-dimension of a grid of thread blocks           65536
Maximum dimensionality of thread block                         3
Maximum z-dimension of a block                                 64
Warp size                                                      32
Maximum number of resident blocks per multiprocessor           8
Constant memory size                                           64 K
Cache working set per multiprocessor for constant memory       8 K
Maximum width for 1D texture reference bound to
linear memory                                                  2^27
Maximum width, height and depth for a 3D texture reference
bound to linear memory or a CUDA array                         2048 x 2048 x 2048
Maximum number of textures that can be bound to a kernel       128
Maximum number of instructions per kernel                      2 million
CUDA at the University of Akron – Slide 9
Spec                                                    v. 1.1   v. 1.3   v. 2.x
Maximum number of resident warps per multiprocessor       24       32       48
Maximum number of resident threads per multiprocessor     768      1024     1536
Number of 32-bit registers per multiprocessor             8 K      16 K     32 K
CUDA at the University of Akron – Slide 10
Spec                                                    v. 1.1, 1.3   v. 2.x
Maximum dimensionality of grid of thread blocks               2          3
Maximum x- or y-dimension of a block                         512        1024
Maximum number of threads per block                          512        1024
Maximum amount of shared memory per multiprocessor           16 K       48 K
Number of shared memory banks                                 16         32
Amount of local memory per thread                            16 K       512 K
Maximum width for 1D texture reference bound to
a CUDA array                                                 8192       32768
CUDA at the University of Akron – Slide 11
Spec                                                       v. 1.1, 1.3         v. 2.x
Maximum width and number of layers for a 1D
layered texture reference                                  8192 x 512          16384 x 2048
Maximum width and height for 2D texture reference
bound to linear memory or a CUDA array                     65536 x 32768       65536 x 65536
Maximum width, height, and number of layers for a
2D layered texture reference                               8192 x 8192 x 512   16384 x 16384 x 2048
Maximum width for a 1D surface reference bound to
a CUDA array                                               Not supported       8192
Maximum width and height for a 2D surface reference
bound to a CUDA array                                      Not supported       8192 x 8192
Maximum number of surfaces that can be bound to a kernel   Not supported       8
CUDA at the University of Akron – Slide 12
 CUDA (Compute Unified Device Architecture) is
NVIDIA’s program development environment:
 based on C with some extensions (a minimal example follows this list)
 C++ support increasing steadily
 FORTRAN support provided by PGI compiler
 lots of example code and good documentation – 2-4
week learning curve for those with experience of
OpenMP and MPI programming
 large user community on NVIDIA forums
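To give a feel for what "C with some extensions" means in practice, here is a minimal, hedged example (the kernel and sizes are invented for illustration): __global__, the built-in thread/block indices, the <<<...>>> launch syntax, and the cudaMalloc/cudaMemcpy runtime calls are the main additions to ordinary C.

#include <stdio.h>
#include <cuda_runtime.h>

// CUDA extension: __global__ marks a kernel that runs on the GPU.
__global__ void addOne(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // built-in thread coordinates
    if (i < n) data[i] += 1;
}

int main(void)
{
    const int n = 256;
    int h[256], *d;
    for (int i = 0; i < n; ++i) h[i] = i;

    cudaMalloc((void **)&d, n * sizeof(int));                   // allocate device memory
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);  // host -> device

    addOne<<<n / 64, 64>>>(d, n);                               // CUDA extension: kernel launch

    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(d);

    printf("h[0] = %d, h[%d] = %d\n", h[0], n - 1, h[n - 1]);   // expect 1 and 256
    return 0;
}

It can be compiled with something like nvcc -o addone addone.cu (file name hypothetical).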
CUDA at the University of Akron – Slide 13
 When installing CUDA on a system, there are 3
components:
 driver
 low-level software that controls the graphics card
 usually installed by sys-admin
 toolkit
 nvcc CUDA compiler
 some profiling and debugging tools
 various libraries
 usually installed by sys-admin in /usr/local/cuda
CUDA at the University of Akron – Slide 14
 SDK
 lots of demonstration examples
 a convenient Makefile for building applications
 some error-checking utilities
 not supported by NVIDIA
 almost no documentation
 often installed by user in own directory
CUDA at the University of Akron – Slide 15
 Remotely access the front end:
ssh tesla.cs.uakron.edu
 ssh sends your commands over an encrypted stream so
your passwords, etc., can’t be sniffed over the network
CUDA at the University of Akron – Slide 16
 The first time you do this:
 After login, run
/root/gpucomputingsdk_3.2.16_linux.run
and just take the default answers to get your own
personal copy of the SDK.
 Then:
cd ~/NVIDIA_GPU_Computing_SDK/C
make -j12 -k
will build all that can be built.
CUDA at the University of Akron – Slide 17
 The first time you do this:
 Binaries end up in:
~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release
 In particular, the header file <cutil_inline.h> is in
~/NVIDIA_GPU_Computing_SDK/C/common/inc
 Can then get a summary of technical specs and
compute capabilities by executing
~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery
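deviceQuery prints this information for every GPU it finds; if you would rather query it from your own code, a minimal sketch using the runtime API (error checking omitted) looks roughly like this:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, compute capability %d.%d, %d multiprocessors, %.0f MB global memory\n",
               dev, prop.name, prop.major, prop.minor,
               prop.multiProcessorCount, prop.totalGlobalMem / (1024.0 * 1024.0));
    }
    return 0;
}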
CUDA at the University of Akron – Slide 18
 Two choices:
 use nvcc within a standard Makefile
 use the special Makefile template provided in the SDK
 The SDK Makefile provides some useful options:
 make emu=1
 uses an emulation library for debugging on a CPU
 make dbg=1
 activates run-time error checking
 In general just use a standard Makefile
CUDA at the University of Akron – Slide 19
GENCODE_ARCH := -gencode=arch=compute_10,code=\"sm_10,compute_10\" \
                -gencode=arch=compute_13,code=\"sm_13,compute_13\" \
                -gencode=arch=compute_20,code=\"sm_20,compute_20\"
INCLOCS := -I$(HOME)/NVIDIA_GPU_Computing_SDK/shared/inc \
           -I$(HOME)/NVIDIA_GPU_Computing_SDK/C/common/inc
LIBLOCS := -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib \
           -L$(HOME)/NVIDIA_GPU_Computing_SDK/C/lib
LIBS := -lcutil_x86_64

<progName>: <progName>.cu <progName>.cuh
	nvcc $(GENCODE_ARCH) $(INCLOCS) <progName>.cu $(LIBLOCS) $(LIBS) -o <progName>
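With a concrete (and purely hypothetical) program name such as matmul substituted for <progName>, the target becomes:

matmul: matmul.cu matmul.cuh
	nvcc $(GENCODE_ARCH) $(INCLOCS) matmul.cu $(LIBLOCS) $(LIBS) -o matmul

so typing make matmul rebuilds the binary, for all three architectures listed in GENCODE_ARCH, whenever the .cu or .cuh file changes.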
CUDA at the University of Akron – Slide 20
 Parallel Thread Execution (PTX)
 Virtual machine and ISA
 Programming model
 Execution resources and state
CUDA Tools and Threads – Slide 2
 Any source file containing CUDA extensions must be
compiled with NVCC
 NVCC is a compiler driver
 Works by invoking all the necessary tools and compilers
like cudacc, g++, cl, …
 NVCC outputs
 C code (host CPU code)
 Must then be compiled with the rest of the application using
another tool
 PTX
 Object code directly, or PTX source interpreted at runtime
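As a hedged illustration of this flow (file names are hypothetical), nvcc can be driven end to end or asked for the PTX intermediate explicitly:

nvcc -arch=sm_13 -c kernel.cu -o kernel.o      # host + device code compiled in one step
nvcc -arch=sm_13 -ptx kernel.cu -o kernel.ptx  # emit the PTX intermediate for inspection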
CUDA Tools and Threads – Slide 22
 Any executable with CUDA code requires two dynamic
libraries
 The CUDA runtime library (cudart)
 The CUDA core library (cuda)
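For example, if the final link is done with the host compiler rather than nvcc, the runtime library is named explicitly (the paths assume the default /usr/local/cuda install and the file names are hypothetical); the core library libcuda ships with the driver and is only linked directly when using the driver API:

g++ main.o kernel.o -L/usr/local/cuda/lib64 -lcudart -o myapp   # runtime API link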
CUDA Tools and Threads – Slide 23
 An executable compiled in device emulation mode
(nvcc –deviceemu) runs completely on the host
using the CUDA runtime
 No need of any device and CUDA driver
 Each device thread is emulated with a host thread
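A hedged example (program name hypothetical); note that device emulation was deprecated and later removed, so this applies only to toolkit versions that still provide -deviceemu:

nvcc -deviceemu myprog.cu -o myprog_emu   # build a CPU-only emulation binary
./myprog_emu                              # runs without a GPU or CUDA driver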
CUDA Tools and Threads – Slide 24
 Running in device emulation mode, one can
 Use host native debug support (breakpoints, inspection, etc.)
 Access any device-specific data from host code and vice versa
 Call any host function from device code (e.g. printf) and vice versa (see the sketch below)
 Detect deadlock situations caused by improper usage of __syncthreads
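A hedged sketch of the printf point (kernel name invented): built with -deviceemu this calls the host C library directly; on real hardware, device-side printf only became available with compute capability 2.0.

#include <stdio.h>

// Sketch only: compile with nvcc -deviceemu so the device threads run as host threads.
__global__ void debugKernel(const int *data, int n)
{
    if (threadIdx.x < n)
        printf("thread %d sees value %d\n", threadIdx.x, data[threadIdx.x]);
}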
CUDA Tools and Threads – Slide 25
 Emulated device threads execute sequentially, so
simultaneous access of the same memory location by
multiple threads could produce different results
 Dereferencing device pointers on the host or host
pointers on the device can produce correct results in
device emulation mode, but will generate an error in
device execution mode
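A hedged sketch of the second point (variable names invented): the commented-out line is the kind of host-side dereference of a device pointer that emulation mode silently tolerates but real device execution does not.

#include <cuda_runtime.h>

int main(void)
{
    int h_x = 42;
    int *d_x = 0;
    cudaMalloc((void **)&d_x, sizeof(int));

    /* *d_x = h_x; */                                              // only "works" under -deviceemu
    cudaMemcpy(d_x, &h_x, sizeof(int), cudaMemcpyHostToDevice);    // portable in both modes

    cudaFree(d_x);
    return 0;
}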
CUDA Tools and Threads – Slide 26
 Results of floating-point computations will differ slightly because of
 Different compiler outputs and instruction sets
 Use of extended precision for intermediate results
 There are various options to force strict single precision on
the host
CUDA Tools and Threads – Slide 27
 New Visual Studio-based GPU integrated development environment
 http://developer.nvidia.com/object/nexus.html
 Available in Beta (as of October 2009)
CUDA Tools and Threads – Slide 28
 Based on original material from
 http://en.wikipedia.org/wiki/CUDA, accessed 6/22/2011.
 The University of Akron: Charles Van Tilburg
 The University of Illinois at Urbana-Champaign
 David Kirk, Wen-mei W. Hwu
 Oxford University: Mike Giles
 Stanford University
 Jared Hoberock, David Tarjan
 Revision history: last updated 6/23/2011.
CUDA at the University of Akron – Slide 29