CS179: GPU Programming
Lecture 8: More CUDA Runtime
Today
 CUDA arrays for textures
 CUDA runtime
 Helpful CUDA functions
CUDA Arrays
 Recall texture memory
 Used to store large data
 Stored on GPU
 Accessible to all blocks,
threads
CUDA Arrays
 Used Texture memory for buffers (lab 3)
 Allows vertex data to remain on GPU
 How else can we access texture memory?
 CUDA arrays
CUDA Arrays
 Why CUDA arrays over normal arrays?
 Better caching, including 2D caching
 Spatial locality
 Supports wrapping/clamping
 Supports filtering
CUDA Linear Textures
 “Textures” but in global memory
 Usage:
 Step 1: Create texture reference
 texture<TYPE> tex
 TYPE = float, float3, int, etc.
 Step 2: Bind memory to texture reference
 cudaBindTexture(offset, tex, devPtr, size);
 Step 3: Get data on device via tex1Dfetch
 tex1Dfetch(tex, x);
 x is the element index (not the byte offset) we want to read!
 Step 4: Clean up after finished
 cudaUnbindTexture(tex)
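Putting the four steps together, a minimal sketch might look like this (the kernel, buffer names, and launch sizes are illustrative):

```cuda
// Step 1: create the texture reference (must be at file scope)
texture<float> tex;

__global__ void readKernel(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(tex, i);  // Step 3: fetch by element index
}

void runExample(float *devPtr, float *devOut, int n) {
    // Step 2: bind the global-memory buffer to the texture reference
    cudaBindTexture(NULL, tex, devPtr, n * sizeof(float));

    readKernel<<<(n + 255) / 256, 256>>>(devOut, n);

    // Step 4: clean up after finished
    cudaUnbindTexture(tex);
}
```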
CUDA Linear Textures
 Texture reference properties:
 texture<type, dim, mode> texRef
 type = float, int, float3, etc.
 dim = # of dimensions (1, 2, or 3)
 mode =
 cudaReadModeElementType: standard read
 cudaReadModeNormalizedFloat: maps 0->0.0, 255->1.0 for ints->floats
CUDA Linear Textures
 Important warning:
 Textures are in a global space of memory
 Threads can read and write to texture at same time
 This can cause synchronization problems!
 Do not rely on thread running order, ever
CUDA Linear Textures
 Other limitations:
 Only 1D, can make indexing and caching a bit less convenient
 Pitch may not be ideal for a 2D array
 Not read-write
 Solution: CUDA arrays
CUDA Arrays
 Live in texture memory space
 Access via texture fetches
CUDA Arrays
 Step 1: Create channel description
 Tells us texture attributes
 cudaCreateChannelDesc(int x, int y, int z, int w, enum mode)
 x, y, z, w are the number of bits per component
 mode is cudaChannelFormatKindFloat, etc.
CUDA Arrays
 Step 2: Allocate memory
 Must be done dynamically
 Use cudaMallocArray(cudaArray **array, const cudaChannelFormatDesc *desc, size_t width, size_t height)
 Most global memory functions work with CUDA arrays too
 cudaMemcpyToArray, etc.
CUDA Arrays
 Step 3: Create texture reference
 texture<TYPE, dim, mode> texRef -- just as before
 Parameters must match channel description where applicable
 Step 4: Edit texture settings
 Settings are encoded as texRef struct members
CUDA Arrays
 Step 5: Bind the texture reference to array
 cudaBindTextureToArray(texRef, array)
 Step 6: Access texture
 Similar to before, now we have more options:
 tex1D(texRef, x)
 tex2D(texRef, x, y)
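The six steps above can be sketched end-to-end. This is a minimal illustration; the kernel, the setup function, and the data sizes are placeholders:

```cuda
// Step 3: texture reference matching the channel description below
texture<float, 2, cudaReadModeElementType> texRef;

__global__ void copyKernel(float *out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        out[y * w + x] = tex2D(texRef, x, y);  // Step 6: texture fetch
}

void setup(const float *hostData, int w, int h) {
    // Step 1: channel description -- one 32-bit float component
    cudaChannelFormatDesc desc =
        cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);

    // Step 2: allocate the CUDA array and copy data into it
    cudaArray *array;
    cudaMallocArray(&array, &desc, w, h);
    cudaMemcpyToArray(array, 0, 0, hostData,
                      w * h * sizeof(float), cudaMemcpyHostToDevice);

    // Step 4: edit texture settings (default shown explicitly)
    texRef.filterMode = cudaFilterModePoint;

    // Step 5: bind the texture reference to the array
    cudaBindTextureToArray(texRef, array);
}
```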
CUDA Arrays
 Final Notes:
 Coordinates can be normalized to [0, 1] if in float mode
 Filter modes: nearest point or linear
 Tells CUDA how to blend texture
 Wrap vs. clamp:
 Wrap: out of bounds accesses wrap around to other side
 Ex.: (1.5, 0.5) -> (0.5, 0.5)
 Clamp: out of bounds accesses set to border value
 Ex.: (1.5, 0.5) -> (1.0, 0.5)
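These settings are just struct members on the texture reference. A sketch of configuring normalized coordinates, linear filtering, and mixed wrap/clamp addressing:

```cuda
texture<float, 2, cudaReadModeElementType> texRef;

void configureTexture() {
    texRef.normalized = true;                     // coordinates in [0, 1]
    texRef.filterMode = cudaFilterModeLinear;     // blend neighboring texels
    texRef.addressMode[0] = cudaAddressModeWrap;  // x wraps around
    texRef.addressMode[1] = cudaAddressModeClamp; // y clamps to the border
}
```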
CUDA Arrays
[Figure: point sampling vs. linear sampling]
CUDA Arrays
[Figure: wrap vs. clamp addressing]
CUDA Runtime
 Nothing new, every function cuda____ is part of the runtime
 Lots of other helpful functions
 Many runtime functions based on making your program
robust
 Check properties of card, set up multiple GPUs, etc.
 Necessary for multi-platform development!
CUDA Runtime
 Starting the runtime:
 Simply call a cuda_____ function!
 CUDA can waste a lot of resources
 Stop CUDA with cudaThreadExit()
 Called automatically on CPU exit, but you may want to call earlier
CUDA Runtime
 Getting devices and properties:
 cudaGetDeviceCount(int * n);
 Returns # of CUDA-capable devices
 Can use to check if machine is CUDA-capable!
 cudaSetDevice(int n)
 Makes device n the currently used device
 cudaGetDeviceProperties(cudaDeviceProp *prop, int n);
 Loads data from device n into prop
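A short sketch tying these three calls together, enumerating every device and printing a few of its properties:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n;
    cudaGetDeviceCount(&n);
    if (n == 0) {
        printf("No CUDA-capable device found\n");
        return 1;
    }
    for (int i = 0; i < n; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, %zu bytes global memory, %d multiprocessors\n",
               i, prop.name, prop.totalGlobalMem, prop.multiProcessorCount);
    }
    return 0;
}
```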
Device Properties
 char name[256]: ASCII identifier of GPU
 size_t totalGlobalMem: Total global memory available
 size_t sharedMemPerBlock: Shared memory available per block
 int regsPerBlock: How many registers we have per block
 int warpSize: size of our warps
 size_t memPitch: maximum pitch allowed for array allocation
 int maxThreadsPerBlock: maximum number of threads/block
 int maxThreadsDim[3]: maximum sizes of a block
Device Properties
 int maxGridSize[3]: maximum grid sizes
 size_t totalConstMem: maximum available constant memory
 int major, int minor: major and minor versions of CUDA
support
 int clockRate: clock rate of device in kHz
 size_t textureAlignment: memory alignment required for
textures
 int deviceOverlap: Does this device allow for memory
copying while kernel is running? (0 = no, 1 = yes)
 int multiprocessorCount: # of multiprocessors on device
Device Properties
 Uses?
 Actually get values for memory, instead of guessing
 Program to be accessible for multiple systems
 Can get the best device
Device Properties
 Getting the best device:
 Pick a metric (Ex.: most multiprocessors could be good)
int num_devices, device;
cudaGetDeviceCount(&num_devices);
if (num_devices > 1) {
    int max_mp = 0, best_device = 0;
    for (device = 0; device < num_devices; device++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, device);
        int mp_count = prop.multiProcessorCount;
        if (mp_count > max_mp) {
            max_mp = mp_count;
            best_device = device;
        }
    }
    cudaSetDevice(best_device);
}
Device Properties
 We can also use this to launch multiple GPUs
 Each GPU must have its own host thread
 Multithread on CPU, each thread calls different device
 Set device on thread using cudaSetDevice(n);
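One way to sketch the one-host-thread-per-GPU pattern, here with pthreads (the worker body and thread count are placeholders):

```cuda
#include <pthread.h>

// Each host thread selects its own GPU; all subsequent CUDA calls
// made from that thread go to that device.
void *worker(void *arg) {
    int device = *(int *)arg;
    cudaSetDevice(device);
    // ... cudaMalloc, kernel launches, cudaMemcpy for this GPU ...
    return NULL;
}

void launchAllDevices(int num_devices) {
    pthread_t threads[16];
    int ids[16];
    for (int i = 0; i < num_devices; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < num_devices; i++)
        pthread_join(threads[i], NULL);
}
```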
CUDA Runtime
 Synchronization Note:
 Most calls to GPU/CUDA are asynchronous
 Some are synchronous (usually things dealing with memory)
 Can force synchronization:
 cudaThreadSynchronize()
 Blocks until the device is done
 Good for error checking, timing, etc.
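A common error-checking sketch built on this: synchronize after an asynchronous kernel launch, then ask the runtime whether anything went wrong (kernel name and launch parameters are placeholders):

```cuda
myKernel<<<gridDim, blockDim>>>(args);  // launch returns immediately
cudaThreadSynchronize();                // block until the device finishes
cudaError_t err = cudaGetLastError();   // did the kernel fail?
if (err != cudaSuccess)
    printf("Kernel failed: %s\n", cudaGetErrorString(err));
```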
CUDA Events
 Great for timing!
 Can place event markers in CUDA to measure time
 Example code:
cudaEvent_t start, stop;
cudaEventCreate(&start); cudaEventCreate(&stop);
cudaEventRecord(start, 0);
// DO SOME GPU CODE HERE
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsed_time;
cudaEventElapsedTime(&elapsed_time, start, stop);
cudaEventDestroy(start); cudaEventDestroy(stop);
CUDA Streams
 Streams manage concurrency and ordering
 Ex.: call malloc, then kernel 1, then kernel 2, etc.
 Calls in different streams are asynchronous!
 Don’t know when each stream is where in code
Using Streams
 Create stream
 cudaStreamCreate(cudaStream_t *stream)
 Copy memory using async calls:
 cudaMemcpyAsync(…, cudaStream_t stream)
 Call in kernel as another parameter:
 kernel<<<gridDim, blockDim, sMem, stream>>>
 Query if stream is done:
 cudaStreamQuery(cudaStream_t stream)
 returns cudaSuccess if stream is done, cudaErrorNotReady otherwise
 Block process until a stream is done:
 cudaStreamSynchronize(cudaStream_t stream)
 Destroy stream & cleanup:
 cudaStreamDestroy(cudaStream_t stream)
Using Streams
 Example:
cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);
for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < 2; ++i)
    myKernel<<<100, 512, 0, stream[i]>>>(outputDevPtr + i * size,
                                         inputDevPtr + i * size, size);
for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);
cudaThreadSynchronize();
Next Time
 Lab 4 Recitation:
 3D Textures
 Pixel Buffer Objects (PBOs)
 Fractals!