NVVP, Existing Libraries, Q/A

• Text book / resources
• Eclipse Nsight, NVIDIA Visual Profiler
• Available libraries
• Questions
• Certificate dispersal
• (Optional) Multiple GPUs: Where’s PixelWaldo?

TEXT BOOK
Programming Massively Parallel Processors: A Hands-on Approach
David Kirk, Wen-mei Hwu

NVIDIA DEVELOPER ZONE
• Early access to updated drivers / updates
• Heavily curated help forum
• Requires registration and approval (nearly automated)
• developer.nvidia.com

US!
• We’re pretty passionate about this GPU computing stuff.
• Collaboration is cool
• If you think you’ve got a problem that can benefit from GPU computation, we may have some ideas.

Eclipse Nsight
• IDE with an Eclipse foundation
• CUDA-aware syntax highlighting / suggestions / recognition
• Hooked into NVVP

NVIDIA Visual Profiler (NVVP)
• Deep profiling of every aspect of GPU execution (memory bandwidth, branch divergence, bank conflicts, compute / transfer overlap, and more!)
• Provides suggestions for optimization
• Graphical view of GPU performance

Nsight and NVVP are available on our cuda# machines
ssh -X <user>@<cuda machine>
Nsight demo on Week 3 code

Why re-invent the wheel?
• There are many GPU-enabled tools built on CUDA that are already available
• These tools have been extensively tested for efficiency and in most cases will outperform custom solutions
• Some require CUDA-like code structure

Linear Algebra, cuBLAS
CUDA-enabled basic linear algebra subroutines
• GPU-accelerated version of the complete standard BLAS library
• Provided with the CUDA toolkit. Code examples are also provided
• Callable from C and Fortran

Linear Algebra, CULA, MAGMA
CULA and MAGMA extend BLAS
• CULA (Paid)
  CULA-dense: LAPACK and BLAS implementations, solvers, decompositions, basic matrix operations
  CULA-sparse: sparse matrix specialized routines, specialized storage structures, iterative methods
• MAGMA (Free, BSD) (Fortran bindings)
  LAPACK and BLAS implementations, developed by the same dev. team as LAPACK.
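
To make the cuBLAS slide above concrete, here is a minimal sketch of calling one BLAS routine (SAXPY, y = alpha * x + y) from C++ through the cuBLAS v2 API. The helper name `saxpy_with_cublas` and its error-handling style are assumptions made for illustration; they are not taken from the workshop code. It would need to be compiled with nvcc and linked against cuBLAS (for example with `-lcublas`).

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper (not from the workshop source):
// computes y = alpha * x + y on the GPU via cuBLAS.
// Returns false if the inputs are inconsistent or any CUDA/cuBLAS call fails.
bool saxpy_with_cublas(float alpha, const std::vector<float>& x, std::vector<float>& y) {
    if (x.size() != y.size() || x.empty()) return false;
    const int n = static_cast<int>(x.size());

    float* d_x = nullptr;
    float* d_y = nullptr;
    cublasHandle_t handle = nullptr;

    // Allocate device buffers and create the cuBLAS context.
    bool ok = cudaMalloc((void**)&d_x, n * sizeof(float)) == cudaSuccess
           && cudaMalloc((void**)&d_y, n * sizeof(float)) == cudaSuccess
           && cublasCreate(&handle) == CUBLAS_STATUS_SUCCESS;

    // Copy both vectors to the device, run SAXPY, copy the result back.
    ok = ok
      && cublasSetVector(n, sizeof(float), x.data(), 1, d_x, 1) == CUBLAS_STATUS_SUCCESS
      && cublasSetVector(n, sizeof(float), y.data(), 1, d_y, 1) == CUBLAS_STATUS_SUCCESS
      && cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1) == CUBLAS_STATUS_SUCCESS
      && cublasGetVector(n, sizeof(float), d_y, 1, y.data(), 1) == CUBLAS_STATUS_SUCCESS;

    // Release resources whether or not the computation succeeded.
    if (handle) cublasDestroy(handle);
    cudaFree(d_x);
    cudaFree(d_y);
    return ok;
}
```

A caller would pass two equally sized host vectors, for instance `saxpy_with_cublas(2.0f, x, y)`, and read the updated values from `y` when the function returns true. No kernel is written by hand here, which is the point of the "Why re-invent the wheel?" slide.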

IMSL Fortran/C Numerical Library
Large collection of mathematical and statistical GPU-accelerated functions
• Free evaluation, paid extension
• http://www.roguewave.com/products/imsl-numerical-libraries/fortran-library.aspx

Image/Signal Processing: NVIDIA Performance Primitives
1900 image processing and 600 signal processing algorithms
• Free and provided with the CUDA toolkit, code examples included.
• Can be used in tandem with visualization libraries like OpenGL, DirectX.

CUDA without the CUDA: Thrust Library
Thrust is a high level interface to GPU computing.
• Offers template-interface access to sort, scan, reduce, etc.
• A production tested version is provided with the CUDA toolkit.

Python and CUDA
PyCUDA
• Python interface to CUDA functions.
• Simply a collection of wrappers, but effective.
NumbaPro (Paid)
• Announced this year at GTC 2013, native CUDA Python compiler
• Python = 4th major CUDA language

R and CUDA
R+GPU
• Package with accelerated alternatives for common R statistical functions
Rpud / rpudplus
• Package with accelerated alternatives for common R statistical functions
Rcuda
• … Package with accelerated alternatives for common R statistical functions

Where’s Pixel-Waldo?
Motivation: Given two images which contain a unique suspect and a number of distinct bystanders, identify the suspect by pairwise comparison.
This is hard. We’ll simplify the problem by reducing the targets to pixel triples.

0: Upload an image and a list to store targets to each GPU.
   GPU0: f.bmp, targets 0|0|0|…
   GPU1: s.bmp, targets 0|0|0|…

1: Find all positions of potential targets (triples) within each image using both GPUs independently.
   GPU0: f.bmp, targets 11 | 143 | 243 | …
   GPU1: s.bmp, targets 3 | 1632 | 54321 | …

2: Allow GPU0 to access GPU1 memory, use both images and target lists to compare potential suspects.
   GPU0: f.bmp, targets 11 | 143 | 243 | …, result 0|0
   GPU1 (read by GPU0 over the PCI bus): s.bmp, targets 3 | 1632 | 54321 | …

3: Print the positions of the single matching suspect.
   GPU0: f.bmp, targets 11 | 143 | 243 | …, result 132 | 629 (returned to the CPU over the PCI bus)

Walk through the source code. Things to note:
• This is un-optimized and known to be inefficient, but the concepts of asynchronous streams, GPU context switching, universal addressing, and peer-to-peer access are covered
• Source code requires the tclap library to compile.
• Source code will be made available in a GitHub repository after the workshop.
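
Since the workshop source is not reproduced here, the following is only a rough sketch of the device switching and peer-to-peer access that steps 0 to 3 rely on. It is not the Pixel-Waldo code: the kernel `count_matches` and the helper `count_matches_peer` are hypothetical names, the data is two plain integer arrays rather than images, and asynchronous streams are left out to keep the example short. It assumes GPU0 and GPU1 support peer access to each other and returns -1 otherwise.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Runs on GPU0. `remote` points into GPU1 memory; with peer access enabled and
// unified virtual addressing, the kernel can dereference it directly.
__global__ void count_matches(const int* local, const int* remote, int n, int* count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && local[i] == remote[i]) atomicAdd(count, 1);
}

// Hypothetical helper: returns how many positions satisfy a[i] == b[i],
// or -1 if two peer-capable GPUs are unavailable or a CUDA call fails.
int count_matches_peer(const std::vector<int>& a, const std::vector<int>& b) {
    if (a.size() != b.size() || a.empty()) return -1;
    const int n = static_cast<int>(a.size());
    const size_t bytes = n * sizeof(int);

    int devices = 0, can_access = 0;
    if (cudaGetDeviceCount(&devices) != cudaSuccess || devices < 2) return -1;
    if (cudaDeviceCanAccessPeer(&can_access, 0, 1) != cudaSuccess || !can_access) return -1;

    int *d_a = nullptr, *d_b = nullptr, *d_count = nullptr;
    int result = -1;

    // Switch to GPU1 and upload the second array there.
    bool ok = cudaSetDevice(1) == cudaSuccess
           && cudaMalloc((void**)&d_b, bytes) == cudaSuccess
           && cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    // Switch to GPU0 and upload the first array plus a zeroed counter.
    ok = ok && cudaSetDevice(0) == cudaSuccess
            && cudaMalloc((void**)&d_a, bytes) == cudaSuccess
            && cudaMalloc((void**)&d_count, sizeof(int)) == cudaSuccess
            && cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
            && cudaMemset(d_count, 0, sizeof(int)) == cudaSuccess;

    // Allow the current device (GPU0) to read GPU1 memory.
    bool peer_enabled = false;
    if (ok) {
        cudaError_t e = cudaDeviceEnablePeerAccess(1, 0);
        peer_enabled = (e == cudaSuccess);
        if (e == cudaErrorPeerAccessAlreadyEnabled) cudaGetLastError();  // clear the error state
        ok = peer_enabled || e == cudaErrorPeerAccessAlreadyEnabled;
    }

    if (ok) {
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        count_matches<<<blocks, threads>>>(d_a, d_b, n, d_count);
        // The blocking copy below also waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(&result, d_count, sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    // Clean up on each device in turn.
    if (peer_enabled) cudaDeviceDisablePeerAccess(1);
    cudaSetDevice(0); cudaFree(d_a); cudaFree(d_count);
    cudaSetDevice(1); cudaFree(d_b);
    cudaSetDevice(0);
    return ok ? result : -1;
}
```

The pattern to notice is that `cudaSetDevice` selects which GPU subsequent allocations, copies, and launches apply to, while `cudaDeviceEnablePeerAccess` is what lets a kernel running on GPU0 take a pointer that was allocated on GPU1. A fuller version along the lines of the slides would presumably issue the per-GPU uploads and searches in separate streams so both devices work at the same time.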