PPT - Caltech

CS179: GPU Programming
Lecture 10: GPU-Accelerated Libraries
Today
 Some useful libraries:
 cuRAND
 cuBLAS
 cuFFT
cuRAND
 Oftentimes, we want random data
 Simulations often need entropy to behave realistically
 How to obtain on GPU?
 No rand(), or simple equivalent
 Could use a pseudo-random function with inputs based on thread properties
 Ex.: float r = cosf(999 * threadIdx.x + 123 * threadIdx.y)
 Works okay, but not great
cuRAND
 What you could do with your current tools:
 Generate N random numbers on CPU
 Allocate space on GPU
 Memcpy to GPU
 Not bad -- if we want to do this only once
 Issues:
 Number generation is synchronous
 Memcpy can be slow
 Much better if the random data is generated on the GPU and lives only there
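The three steps above (generate on CPU, allocate on GPU, memcpy) can be sketched as follows. This is only a minimal illustration: the helper name uploadHostRandoms, the buffer size, and the use of rand() are assumptions made for the example, not part of the lecture.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: fill a device buffer with n floats in [0, 1]
// generated serially on the CPU. Returns nullptr on failure; the caller
// releases the buffer with cudaFree().
float *uploadHostRandoms(size_t n, unsigned int seed) {
    std::vector<float> host(n);
    srand(seed);
    for (size_t i = 0; i < n; ++i) {
        // Step 1: generate on the CPU, one value at a time.
        host[i] = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
    }

    // Step 2: allocate space on the GPU.
    float *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return nullptr;

    // Step 3: blocking host-to-device copy.
    if (cudaMemcpy(dev, host.data(), n * sizeof(float),
                   cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(dev);
        return nullptr;
    }
    return dev;
}

int main() {
    const size_t n = 1 << 20;
    float *devData = uploadHostRandoms(n, 1234u);
    if (devData == nullptr) {
        fprintf(stderr, "allocation or copy failed\n");
        return 1;
    }
    // Kernels that consume devData would be launched here.
    cudaFree(devData);
    printf("uploaded %zu random floats\n", n);
    return 0;
}
```

Both the generation loop and the cudaMemcpy keep the host busy, which is the cost the Issues list points at.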
cuRAND
 Solution: cuRAND
 CUDA random number library
 Works on both host and device
 Lots of different distributions
 Uniform, normal, log-normal, Poisson, etc.
cuRAND
 Performance
cuRAND
Host API
 Using on the host:
 Call from host
 Host allocates the output buffer on the GPU (cudaMalloc)
 Generates random numbers on GPU
 Several pseudorandom generators available
 Several random distributions available
cuRAND
Host API
 Functions to know:
 curandCreateGenerator(&g, GEN_TYPE)
 GEN_TYPE = CURAND_RNG_PSEUDO_DEFAULT,
CURAND_RNG_PSEUDO_XORWOW
 Doesn’t particularly matter, differences are small
 curandSetPseudoRandomGeneratorSeed(g, SEED)
 Again, SEED doesn’t matter too much, just pick one (ex.: time(NULL))
 curandGenerate______(…)
 Depends on distribution
 Ex.: curandGenerate(g, src, n),
curandGenerateNormal(g, src, n, mean, stddev)
 curandDestroyGenerator(g)
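A minimal sketch of the four host API calls listed above, in order. The helper name meanOfDeviceUniforms and the buffer size are assumptions for illustration. The output buffer is device memory from cudaMalloc, which is what a generator made with curandCreateGenerator() writes to. Link with -lcurand.

```cuda
#include <cstdio>
#include <ctime>
#include <vector>
#include <cuda_runtime.h>
#include <curand.h>

// Hypothetical helper: generate n uniform floats on the GPU with the cuRAND
// host API and return their mean (expected to be close to 0.5).
// Returns -1 on any error.
double meanOfDeviceUniforms(size_t n, unsigned long long seed) {
    float *devData = nullptr;
    if (cudaMalloc(&devData, n * sizeof(float)) != cudaSuccess) return -1.0;

    curandGenerator_t gen;
    if (curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT)
            != CURAND_STATUS_SUCCESS) {
        cudaFree(devData);
        return -1.0;
    }
    curandSetPseudoRandomGeneratorSeed(gen, seed);

    // Uniform floats in (0, 1], written straight into device memory.
    curandStatus_t status = curandGenerateUniform(gen, devData, n);

    // Copied back only to inspect the values; a real program would usually
    // leave them on the GPU for a kernel to consume.
    std::vector<float> host(n);
    cudaError_t err = cudaMemcpy(host.data(), devData, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);

    curandDestroyGenerator(gen);
    cudaFree(devData);
    if (status != CURAND_STATUS_SUCCESS || err != cudaSuccess) return -1.0;

    double sum = 0.0;
    for (float v : host) sum += v;
    return sum / static_cast<double>(n);
}

int main() {
    double mean = meanOfDeviceUniforms(1 << 20, (unsigned long long)time(NULL));
    printf("mean of generated values: %f\n", mean);
    return mean < 0.0 ? 1 : 0;
}
```

Swapping curandGenerateUniform for curandGenerateNormal(gen, devData, n, mean, stddev) gives normally distributed values; with pseudorandom generators that call needs an even n.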
cuRAND
Host API
 curandGenerate() launches asynchronously
 Much faster than serial CPU generation
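 src must match the generator: device pointer for curandCreateGenerator(), host pointer for curandCreateGeneratorHost()
 With a host generator, we still need to copy the data to the GPU
 Either way, all values must be generated and stored before the kernel runs
 Introduces some undesired overhead
 Might need more memory than we can pass in one go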
 Solution: cuRAND device API
cuRAND
Device API
 Supports RNG on kernels
 Do not need to generate random data before kernel
 We don’t have to copy and store all data at once
 Stores RNG states completely on GPU
 Host still has to allocate the device memory for the states (cudaMalloc)
cuRAND
Device API
 Example:
curandState *devStates;
cudaMalloc(&devStates, sizeof(curandState) * nThreads);
kernel<<<gD, bD, sM>>>(devStates, …);
cudaFree(devStates);  // don’t forget to free!
cuRAND
Device API
 Example continued:
// On the device:
__global__ void kernel(curandState *states, …) {
    int id = … // calculate thread id
    curand_init(seed, id, 0, &states[id]);
    // generate random value in range (0, 1]
    v[id] = curand_uniform(&states[id]);
    // transform to range (a, b]
    v[id] = v[id] * (b - a) + a;
}
cuRAND
Device API
 Note the difference between cuRAND states and the actual values
 A state stores the generator’s seed and its position in the sequence, not a random number
 Numbers aren’t generated until curand_<DISTRIBUTION>(&state) is called
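To make the state/value distinction concrete, here is a minimal sketch that splits the work into two kernels: one only initializes states, the other draws values from them. The kernel names, the helper countOutOfRange, and the launch sizes are assumptions for illustration, not the lecture's code.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Initialize one RNG state per thread. No random numbers are produced here.
__global__ void setupStates(curandState *states, unsigned long long seed, int n) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) curand_init(seed, id, 0, &states[id]);
}

// Draw one value per thread and map it from (0, 1] onto (a, b].
__global__ void fillUniform(curandState *states, float *out, int n,
                            float a, float b) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= n) return;
    curandState local = states[id];     // work on a register copy
    float u = curand_uniform(&local);   // the number is generated here
    out[id] = u * (b - a) + a;
    states[id] = local;                 // save the advanced state for later draws
}

// Hypothetical helper: returns how many of n generated values fall outside
// [a, b], or -1 on a CUDA error.
int countOutOfRange(int n, float a, float b, unsigned long long seed) {
    curandState *devStates = nullptr;
    float *devOut = nullptr;
    if (cudaMalloc(&devStates, n * sizeof(curandState)) != cudaSuccess) return -1;
    if (cudaMalloc(&devOut, n * sizeof(float)) != cudaSuccess) {
        cudaFree(devStates);
        return -1;
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    setupStates<<<blocks, threads>>>(devStates, seed, n);
    fillUniform<<<blocks, threads>>>(devStates, devOut, n, a, b);

    std::vector<float> host(n);
    cudaError_t err = cudaMemcpy(host.data(), devOut, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(devOut);
    cudaFree(devStates);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (float v : host) {
        if (v < a || v > b) ++bad;
    }
    return bad;
}

int main() {
    int bad = countOutOfRange(1 << 16, -2.0f, 3.0f, 1234ULL);
    printf("values outside [a, b]: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Because fillUniform writes the advanced state back, launching it again yields the next value of each thread's sequence instead of repeating the first one.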
cuRAND
Overview
 Can generate numbers on either host or device
 Whether generating on host or device, the host must allocate the device memory
 Many different generators and distributions available
 Check out these for more details:
 http://docs.nvidia.com/cuda/curand/host-api-overview.html
 http://docs.nvidia.com/cuda/curand/device-api-overview.html
cuBLAS
 Linear algebra is extremely important in many applications
 Physics, engineering, mathematics, computer graphics, networking,
…
 Anything STEM, really
 Linear algebra systems are oftentimes HUGE
 Ex.: Inverting a matrix of size 10^6 x 10^6 would take a while on a CPU…
 Linear algebra systems are oftentimes parallelizable
 Element a[0][0] doesn’t care about what a[1][0] will be, just what it was
 Linear algebra is a perfect candidate for GPU
cuBLAS
 cuBLAS: CUDA’s linear algebra library
 Based on BLAS (Basic Linear Algebra Subprograms)
 Supports all 152 standard BLAS routines
 Works pretty similarly to BLAS
cuBLAS
 Performance
cuBLAS
 Several levels of BLAS:
 BLAS1: Handles vector & vector-vector functions
 Sum, min, max, etc.
 Add, scale, dot, etc.
 BLAS2: Handles matrix-vector functions
 Multiplication, generally
 BLAS3: Handles matrix-matrix functions
 Multiplication, adding, etc.
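As a small illustration of the lowest level, the sketch below chains two Level 1 routines, cublasSaxpy (scale and add) and cublasSdot (dot product). It assumes the handle-based API from cublas_v2.h; the helper name saxpyThenDot and the input values are made up for the example. Level 2 (e.g. cublasSgemv) and Level 3 (e.g. cublasSgemm) calls follow the same pattern. Link with -lcublas.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper: computes y = 2x + y (saxpy) and then dot(x, y).
// Returns the dot product, or -1 on failure.
float saxpyThenDot() {
    const int n = 4;
    const std::vector<float> hx = {1.0f, 2.0f, 3.0f, 4.0f};
    const std::vector<float> hy = {10.0f, 20.0f, 30.0f, 40.0f};

    float *dx = nullptr, *dy = nullptr;
    if (cudaMalloc(&dx, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&dy, n * sizeof(float)) != cudaSuccess) {
        cudaFree(dx);
        return -1.0f;
    }

    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) {
        cudaFree(dx);
        cudaFree(dy);
        return -1.0f;
    }

    // Copy the vectors to the device (stride 1 on both sides).
    cublasSetVector(n, sizeof(float), hx.data(), 1, dx, 1);
    cublasSetVector(n, sizeof(float), hy.data(), 1, dy, 1);

    // Level 1, vector-vector: y = alpha * x + y, so y becomes {12, 24, 36, 48}.
    const float alpha = 2.0f;
    cublasSaxpy(handle, n, &alpha, dx, 1, dy, 1);

    // Level 1, reduction: 1*12 + 2*24 + 3*36 + 4*48.
    float dot = 0.0f;
    cublasStatus_t status = cublasSdot(handle, n, dx, 1, dy, 1, &dot);

    cublasDestroy(handle);
    cudaFree(dx);
    cudaFree(dy);
    return status == CUBLAS_STATUS_SUCCESS ? dot : -1.0f;
}

int main() {
    printf("dot = %f\n", saxpyThenDot());  // prints "dot = 360.000000"
    return 0;
}
```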
cuBLAS
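 Using is fairly simple
 Create a handle before using cuBLAS
 cublasCreate(&handle) (legacy API: cublasInit())
 Call whatever functions you need from host code; cuBLAS launches the GPU kernels
 Destroy the handle after you’re done with cuBLAS
 cublasDestroy(handle) (legacy API: cublasShutdown())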
 Check out the following for more info:
 http://docs.nvidia.com/cuda/cublas/index.html
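A minimal end-to-end sketch of that usage pattern with a Level 3 call, assuming the handle-based API in cublas_v2.h (cublasCreate / cublasDestroy) and not the legacy cublasInit / cublasShutdown pair. The helper name gemm2x3x2 and the matrices are made up for the example. Note that cuBLAS stores matrices in column-major order.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper: computes C = A * B for a 2x3 matrix A and a 3x2
// matrix B. Returns C in column-major order, or an empty vector on failure.
std::vector<float> gemm2x3x2() {
    const int m = 2, k = 3, n = 2;
    // Column-major storage: each column is contiguous.
    const std::vector<float> hA = {1, 4, 2, 5, 3, 6};     // A = [1 2 3; 4 5 6]
    const std::vector<float> hB = {7, 9, 11, 8, 10, 12};  // B = [7 8; 9 10; 11 12]
    std::vector<float> hC(m * n, 0.0f);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cublasHandle_t handle;
    bool ok = cudaMalloc(&dA, hA.size() * sizeof(float)) == cudaSuccess &&
              cudaMalloc(&dB, hB.size() * sizeof(float)) == cudaSuccess &&
              cudaMalloc(&dC, hC.size() * sizeof(float)) == cudaSuccess;
    if (!ok || cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) {
        cudaFree(dA); cudaFree(dB); cudaFree(dC);  // cudaFree(nullptr) is a no-op
        return {};
    }

    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(float), cudaMemcpyHostToDevice);

    // C = alpha * A * B + beta * C; leading dimensions are the row counts.
    const float alpha = 1.0f, beta = 0.0f;
    cublasStatus_t status = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                        m, n, k, &alpha, dA, m, dB, k,
                                        &beta, dC, m);

    cudaError_t err = cudaMemcpy(hC.data(), dC, hC.size() * sizeof(float),
                                 cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (status != CUBLAS_STATUS_SUCCESS || err != cudaSuccess) return {};
    return hC;  // column-major {58, 139, 64, 154}, i.e. C = [58 64; 139 154]
}

int main() {
    std::vector<float> c = gemm2x3x2();
    if (c.empty()) return 1;
    printf("C = [%g %g; %g %g]\n", c[0], c[2], c[1], c[3]);
    return 0;
}
```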
cuBLAS
 Alternative: cuSPARSE
 Another CUDA linear algebra library
 Generally works well when dealing with sparse matrices (most entries are 0)
 Works pretty well even with dense vectors
cuFFT
 Another concept with lots of applications, scalability, and parallelizability: the Fourier transform
 Commonly used in physics, signal processing, etc.
 Oftentimes needs to be real-time
 Makes great use of GPU
cuFFT
 Supports 1D, 2D, or 3D Fourier Transforms
 1D transforms can have up to 128 million elements
 Based on Cooley-Tukey and Bluestein FFT algorithms
 Similar API to FFTW, if familiar
 Thread-safe, streamed, asynchronous execution
 Supports both in-place and out-of-place transforms
 Supports real, complex, float, double data
cuFFT
 Performance
cuFFT
 Usage is fairly simple
 Allocate space on the GPU
 Same old cudaMalloc() call
 Create a cuFFT plan
 Tells dimension, sizes, and data types
 cufftPlan3d(&plan, nx, ny, nz, TYPE)
 TYPE = CUFFT_C2C, CUFFT_C2R, CUFFT_R2C (complex to complex, complex to real, real to complex)
cuFFT
 Execute the plan
 cufftExecC2C(plan, in_data, out_data, CUFFT_FORWARD)
 Replace C2C with your plan type (cufftExecR2C and cufftExecC2R take no direction argument)
 Can replace CUFFT_FORWARD with CUFFT_INVERSE
 Destroy plan, clean up data
 cufftDestroy(plan)
 cudaFree(in_data), cudaFree(out_data)
 Check out more here:
 http://docs.nvidia.com/cuda/cufft/index.html
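The allocate / plan / execute / destroy sequence can be sketched as a round trip: a forward 3D transform followed by the inverse, done in place. The helper name roundTripMaxError, the 8x8x8 size, and the test signal are assumptions for illustration. cuFFT transforms are unnormalized, so the result is divided by the element count before comparing. Link with -lcufft.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helper: forward then inverse C2C transform of an nx*ny*nz
// volume. Returns the largest deviation from the input after rescaling,
// or -1 on failure.
float roundTripMaxError(int nx, int ny, int nz) {
    const int total = nx * ny * nz;
    std::vector<cufftComplex> host(total);
    for (int i = 0; i < total; ++i) {
        host[i].x = static_cast<float>(i % 7);  // arbitrary real test signal
        host[i].y = 0.0f;
    }

    // Allocate space on the GPU and upload the input.
    cufftComplex *dev = nullptr;
    if (cudaMalloc(&dev, total * sizeof(cufftComplex)) != cudaSuccess) return -1.0f;
    cudaMemcpy(dev, host.data(), total * sizeof(cufftComplex),
               cudaMemcpyHostToDevice);

    // Create the plan: dimensions, sizes, and data type.
    cufftHandle plan;
    if (cufftPlan3d(&plan, nx, ny, nz, CUFFT_C2C) != CUFFT_SUCCESS) {
        cudaFree(dev);
        return -1.0f;
    }

    // Execute in place: the same pointer is input and output.
    cufftResult fwd = cufftExecC2C(plan, dev, dev, CUFFT_FORWARD);
    cufftResult inv = cufftExecC2C(plan, dev, dev, CUFFT_INVERSE);

    std::vector<cufftComplex> back(total);
    cudaError_t err = cudaMemcpy(back.data(), dev, total * sizeof(cufftComplex),
                                 cudaMemcpyDeviceToHost);

    // Destroy the plan and clean up the data.
    cufftDestroy(plan);
    cudaFree(dev);
    if (fwd != CUFFT_SUCCESS || inv != CUFFT_SUCCESS || err != cudaSuccess) {
        return -1.0f;
    }

    // inverse(forward(x)) equals total * x, so rescale before comparing.
    float maxErr = 0.0f;
    for (int i = 0; i < total; ++i) {
        maxErr = fmaxf(maxErr, fabsf(back[i].x / total - host[i].x));
        maxErr = fmaxf(maxErr, fabsf(back[i].y / total - host[i].y));
    }
    return maxErr;
}

int main() {
    float err = roundTripMaxError(8, 8, 8);
    printf("max round-trip error: %g\n", err);
    return (err >= 0.0f && err < 1e-3f) ? 0 : 1;
}
```

An out-of-place transform only needs a second device buffer passed as the output argument.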
GPU-Accelerated Libraries
 Many more available
 https://developer.nvidia.com/gpu-accelerated-libraries
 OpenCV: Computer vision library (has GPU acceleration libraries)
 NPP: Performance primitives library, helps with signal/image processing
 Check them out!
 Best practice for learning:
 Check out documentation
 Check out examples
 Modify example code
 Repeat above until familiar, then use in your own code!
