Report

Ray Tracing in CUDA Andrei Monteiro Marcelo Gattass Assignment 3 June 2010 Topics Motivation Related Work Grid Construction Ray Tracing in CUDA Results Conclusion References Motivation Ray Tracing is a technique for generating an image by launching rays for each pixel and calculating its intersections with the scene objects. Simulates several effects naturally, such as reflection and refraction, producing a very high degree of virtual realism. Computationally expensive. Can use different acceleration structures. (Kd-trees, Uniform grid, BVH) Why CUDA? Designed for General-Purpose Computing. Construction of Grid is faster than other structures (e.g. Kd-trees, BVH). Provides natural compactness, avoiding memory waste in contrast with the stencil routing algorithm using GLSL. Use of shared memory speed up construction. Fast data transfers. Atomic operations Motivation Related Work Uniform Grid A Parallel Algorithm for Construction of Uniform Grids, Kalojanov, J. GPU-Accelerated Uniform Grid Construction for Ray Tracing Dynamic Scenes, Ivson, P., Duarte, L., Celes, W. Ray Tracing Understanding the Efficiency of Ray Traversal on GPUs, Aila, T. NVIDIA. Ray Tracing on Programmable Graphics Hardware, Purcell, T. Ray Tracing Animated Scenes using Coherent Grid Traversal, Wald et al. Grid Construction Uniform Grid Speed Up simulation, avoids going through every scene object to test intersection. Supports Dynamic Scenes. Each voxel contains a list of primitives The ray traverses the grid. Grid Resolution Grid Construction Algorithm in CUDA: 1. 2. 3. 4. 5. Insert triangles in voxels Calculate grid hash table Sort the pairs Write cell start and end Reorder particles Grid Construction Insert triangles in voxels Bounding Box (corners) Check triangle plane Intersection Avoids more than the same reference of the triangle inside the voxel. Contained in same voxel 4 times Contained in more than one voxel Grid Construction 9 3 Grid Hash Table 1. 6 6 Pair Cell Index – Particle Index. E.g. Cell Dimension = 3, Grid resolution = 3x3 7 4 1 0 1 2 3 4 5 6 7 8 440 120 840 570 030 880 310 810 710 5 7 6 0 0 PARTICLES: 2 3 0 8 0 4 3 5 8 1 3 6 HASH: 4 0 5 7 3 8 1 2 2 Cell Index 0 1 2 3 4 5 6 7 8 Particle Index 2 9 Grid Construction Sorting the Pairs 1. 2. 3. In order to calculate the cells´start and end, it is necessary to order particles in respect to cell indices which they belong. Actually, the application sorts the previous hash table with respect to their keys, or cell indices. Use of Radix Sort from CUDA SDK. Grid Construction Sorting the Pairs 1. Sort Hash table by key values (cell indices). HASH: 4 0 5 7 3 8 1 2 2 Cell Index 0 1 2 3 4 5 6 7 8 Particle Index 2 2 3 4 5 7 8 Cell Index 7 8 4 0 2 3 5 Particle Index Sorted HASH: 0 1 1 6 Grid Construction 1. Finding Cell Start/End and Reordering Particles. Sorted HASH: 0 0 1 0 Cell Index [thread_id - 1] 2 1 ≠ 3 1 4 1 5 1 6 2 7 2 8 2 Cell Index CellStart [Cell Index[thread_id]] = 2 Cell 0: end = 2 Cell 1: start = 2 Cell Index [thread_id] ... Current Thread CellEnd [Cell Index [thread_id -1]] = 2 Cell Start/End: 0 0/2 1 2/6 2 6/... 3 4 5 6 7 8 ... Cell Index Cell Start / End Ray Tracing in CUDA Can be easily parallelized Each thread is responsible for one pixel / ray intersection. Problems that slow performance: Internal Loops Cause threads to diverge. Random memory access Causes bank conflicts, non-coalesce reading Ray Tracing in CUDA Kernels 1. 2. 3. 4. 5. 6. 7. Build grid (if scene changes) Setup rays (if camera moved) Lauch Rays Get Hits Get Shadow Hits Get Reflection Hits (repeat) Shade Ray Tracing in CUDA Setup Rays Calculate the ray equation for each pixel. One thread per pixel p(t ) o td Ray Tracing in CUDA Lauch Rays Calculates the ray-grid intersection, if any. For rays that do not intersect the grid, they are discarded for the next steps. Returns the first cell intersection and the parameters for traversing the grid. p(t) Ray Tracing in CUDA Get Hits The most expensive steps of the simulation. Typical algorithm: While (not hit or ray inside grid) { Traverse cell; if (! Cell empty) { for each triangle in cell { get hit (); } } } Problem: Causes too much thread divergency. Solution: Use while-while algorithm Causes less divergency Ray Tracing in CUDA while- while algorithm while-while trace(): while ray not terminated while node does not contain primitives traverse to the next node while node contains untested primitives perform a ray-primitive intersection test Ray Tracing in CUDA Get Hits Traversal Algorithm: 3D DDA if (nextx < nexty) nextx += deltax X += 1; else nexty += deltay Y += 1; process_grid(X, Y); Ray Tracing in CUDA Get Hits Triangle Intersection Möller Enables face culling. Greatly increased performance Careful: Triangles can be in more than one voxel, so it´s necessary to check if the intersection point is in the current voxel. Ray Tracing in CUDA Increase efficiency The internal loops make threads diverge and thus lower performance. To contour this problem, NVIDIA researcher T. Aila included a method called Persistent Threads in CUDA. The idea is to keep threads busy while at least one of them is not done. Increased performance depends on the GPU. 9800 GX2: 2.2x increase GTX 480: 3.0x increase Ray Tracing in CUDA Persistent threads implementation code Ray Tracing in CUDA Shade Linear Interpolation using baricentric coordinates Texture Normal Texture Used CUDA 3D Texture to support variable number of scene textures. Phong Shading lr k dr I r (rr ) I r I ar k dr lr k sr I r (rt ) n ˆ ˆ ˆ ˆ I I k f l k n L l k r L k I ( r ) ( 1 o ) I ( r ) g r g ag dg s g dg g sg r g t luzes I I k l k I (r ) b ab db b sb b t lb k db I b (rr ) Results Real-Time Ray Tracing Performance Depends on: Grid resolution Number of primitives Camera in/outside grid Shadow Pass Reflection Passes (1 or more times) Scenes with reflections and many primitives vary about 20~30 fps Results Results Results Results Results Results Results Results Conclusion The user was able to replicate physical effects. CUDA is slower compared to other languages (e.g. GLSL) if not optimizing and use its maximum optimization resources. There are still several optimizations pending in this work. Math CUDA threads and kernels Too much memory used References Kalojanov, J. A Parallel Algorithm for Construction of Uniform Grids. High Performance Graphics, 2009. Retrieved in Apr 21 2010. Ivson, P., Duarte, L., Celes, W., GPU-Accelerated Uniform Grid Construction for ray Tracing Dynamic Scenes. Understanding the Efficiency of Ray Traversal on GPUs, Aila, T. NVIDIA Research. Retrieved May 23, 2010. Ray Tracing on Programmable Graphics Hardware, Purcell, T. Stanford University. Retrieved May 28, 2010. Ray Tracing Animated Scenes using Coherent Grid Traversal, Wald et al. SCI Institute, University of Utah. Retrieved May 25, 2010 NVIDIA CUDA Programming Guide. V. 2.0, 2008. Retrieved Mar 29, 2010.