### Parallel Algorithm for Construction of Uniform Grids - PUC-Rio

```Ray Tracing in CUDA
Andrei Monteiro
Marcelo Gattass
Assignment 3
June 2010
Topics
Motivation
 Related Work
 Grid Construction
 Ray Tracing in CUDA
 Results
 Conclusion
 References

Motivation

Ray Tracing is a technique for generating an image by launching rays for each
pixel and calculating its intersections with the scene objects.

Simulates several effects naturally, such as reflection and refraction, producing a
very high degree of virtual realism.

Computationally expensive.

Can use different acceleration structures. (Kd-trees, Uniform grid, BVH)

Why CUDA?






Designed for General-Purpose Computing.
Construction of Grid is faster than other structures (e.g. Kd-trees, BVH).
Provides natural compactness, avoiding memory waste in contrast with the stencil
routing algorithm using GLSL.
Use of shared memory speed up construction.
Fast data transfers.
Atomic operations
Motivation
Related Work

Uniform Grid



A Parallel Algorithm for Construction of Uniform Grids,
Kalojanov, J.
GPU-Accelerated Uniform Grid Construction for Ray
Tracing Dynamic Scenes, Ivson, P., Duarte, L., Celes, W.
Ray Tracing


Understanding the Efficiency of Ray Traversal on GPUs,
Aila, T. NVIDIA.
Ray Tracing on Programmable Graphics Hardware,
Purcell, T.

Ray Tracing Animated Scenes using Coherent Grid
Traversal, Wald et al.
Grid Construction
Uniform Grid






Speed Up simulation, avoids going
through every scene object to test
intersection.
Supports Dynamic Scenes.
Each voxel contains a list of primitives
The ray traverses the grid.
Grid Resolution
Grid Construction

Algorithm in CUDA:
1.
2.
3.
4.
5.
Insert triangles in voxels
Calculate grid hash table
Sort the pairs
Write cell start and end
Reorder particles
Grid Construction

Insert triangles in voxels

Bounding Box (corners)

Check triangle plane Intersection

Avoids more than the same reference of the triangle inside the
voxel.
Contained in same
voxel 4 times
Contained in more
than one voxel
Grid Construction
9
3
Grid Hash Table
1.
6
6

Pair Cell Index – Particle Index.

E.g. Cell Dimension = 3,
Grid resolution = 3x3
7
4
1
0
1
2
3
4
5
6
7
8
440
120
840
570
030
880
310
810
710
5
7
6
0
0
PARTICLES:
2
3
0
8
0
4
3
5
8
1
3
6
HASH:
4
0
5
7
3
8
1
2
2
Cell Index
0
1
2
3
4
5
6
7
8
Particle Index
2
9
Grid Construction

Sorting the Pairs
1.
2.
3.
In order to calculate the cells´start and
end, it is necessary to order particles in
respect to cell indices which they belong.
Actually, the application sorts the previous
hash table with respect to their keys, or
cell indices.
Use of Radix Sort from CUDA SDK.
Grid Construction
Sorting the Pairs
1.
Sort Hash table by key values (cell indices).

HASH:
4
0
5
7
3
8
1
2
2
Cell Index
0
1
2
3
4
5
6
7
8
Particle Index
2
2
3
4
5
7
8
Cell Index
7
8
4
0
2
3
5
Particle Index
Sorted HASH:
0
1
1
6
Grid Construction
1.
Finding Cell Start/End and Reordering Particles.
Sorted HASH:
0
0
1
0
2
1
≠
3
1
4
1
5
1
6
2
7
2
8
2
Cell Index
Cell 0: end = 2
Cell 1: start = 2
CellEnd [Cell Index [thread_id -1]] = 2
Cell Start/End:
0
0/2
1
2/6
2
6/...
3
4
5
6
7
8
... Cell Index
Cell Start / End
Ray Tracing in CUDA

Can be easily parallelized


Each thread is responsible for one pixel /
ray intersection.
Problems that slow performance:

Internal Loops


Random memory access

Ray Tracing in CUDA

Kernels
1.
2.
3.
4.
5.
6.
7.
Build grid (if scene changes)
Setup rays (if camera moved)
Lauch Rays
Get Hits
Get Reflection Hits
(repeat)
Ray Tracing in CUDA

Setup Rays

Calculate the ray equation for each pixel.


 
p(t )  o  td
Ray Tracing in CUDA

Lauch Rays


Calculates the ray-grid intersection, if any. For rays that do not
intersect the grid, they are discarded for the next steps.
Returns the first cell intersection and the parameters for
traversing the grid.
p(t)
Ray Tracing in CUDA

Get Hits
The most expensive steps of the simulation.
Typical algorithm:


While (not hit or ray inside grid) {
Traverse cell;
if (! Cell empty) {
for each triangle in cell {
get hit ();
}
}
}


Problem: Causes too much thread divergency.
Solution: Use while-while algorithm

Causes less divergency
Ray Tracing in CUDA

while- while algorithm
while-while trace():
while ray not terminated
while node does not contain primitives
traverse to the next node
while node contains untested primitives
perform a ray-primitive intersection test
Ray Tracing in CUDA

Get Hits

Traversal Algorithm: 3D DDA
if (nextx < nexty)
nextx += deltax
X += 1;
else
nexty += deltay
Y += 1;
process_grid(X, Y);
Ray Tracing in CUDA

Get Hits

Triangle Intersection
Möller
 Enables face culling.
 Greatly increased performance
 Careful:


Triangles can be in more than one voxel, so it´s
necessary to check if the intersection point is in the
current voxel.
Ray Tracing in CUDA

Increase efficiency


The internal loops make threads diverge and thus
lower performance.
To contour this problem, NVIDIA researcher T. Aila
included a method called Persistent Threads in
CUDA.


The idea is to keep threads busy while at least one of
them is not done.
Increased performance depends on the GPU.


9800 GX2: 2.2x increase
GTX 480: 3.0x increase
Ray Tracing in CUDA

Ray Tracing in CUDA


Linear Interpolation using baricentric coordinates



Texture


Normal
Texture
Used CUDA 3D Texture to support variable number of scene
textures.
  lr   k dr 
  I r (rr ) 
 I r   I ar   k dr 
 lr   k sr 
 I r (rt ) 
   
     
   




n
ˆ
ˆ
ˆ
ˆ
I

I

k

f
l

k
n

L

l

k
r

L

k
I
(
r
)

(
1

o
)
I
(
r
)
  g r 
 g   ag   dg   s   g   dg 
 g   sg  r
 g t 
luzes


 I  I  k 
   
l  k 


 I (r ) 
 b   ab   db 
 b   sb 
 b t 
  lb   k db 
  I b (rr ) 
 


Results

Real-Time Ray Tracing Performance

Depends on:
Grid resolution
 Number of primitives
 Camera in/outside grid
 Reflection Passes (1 or more times)


Scenes with reflections and many
Results
Results
Results
Results
Results
Results
Results
Results
Conclusion

The user was able to replicate physical effects.
 CUDA is slower compared to other languages
(e.g. GLSL) if not optimizing and use its
maximum optimization resources.
 There are still several optimizations pending in
this work.



Math
Too much memory used
References

Kalojanov, J. A Parallel Algorithm for Construction of Uniform Grids.
High Performance Graphics, 2009. Retrieved in Apr 21 2010.

Ivson, P., Duarte, L., Celes, W., GPU-Accelerated Uniform Grid
Construction for ray Tracing Dynamic Scenes.

Understanding the Efficiency of Ray Traversal on GPUs, Aila, T.
NVIDIA Research. Retrieved May 23, 2010.

Ray Tracing on Programmable Graphics Hardware, Purcell, T.
Stanford University. Retrieved May 28, 2010.

Ray Tracing Animated Scenes using Coherent Grid Traversal, Wald
et al. SCI Institute, University of Utah. Retrieved May 25, 2010

NVIDIA CUDA Programming Guide. V. 2.0, 2008. Retrieved Mar 29,
2010.
```