### pptx 16 - Advanced Graphics

```Jeremiah van Oosten
Reinier van Oeveren




Introduction
Related Works
◦ Prefix Sums and Scans
◦ Recursive Filtering
◦ Summed-Area Tables
Problem Definition
Parallelization Strategies
◦
◦
◦
◦
Baseline (Algorithm RT)
Block Notation
Inter-block Parallelism
Kernel Fusion (Algorithm 2)

Overlapping

Summed-Area Tables


◦ Causal-Anticausal overlapping (Algorithm 3 & 4)
◦ Row-Column Causal-Anitcausal overlapping (Algorithm 5)
◦ Overlapped Summed-Area Tables (Algorithm SAT)
Results
Conclusion


Linear filtering is
commonly used to blur,
sharpen or down-sample
images.
A direct implementation
evaluating a filter of
support d on a h x w
image has a cost of
O(hwd).


The cost of the image filter can be reduced
using a recursive filter in which case previous
results can be used to compute the current
value:
1
=  − −1
2
Cost can be reduced to O(hwr) where r is the
number of recursive feedbacks.

At each step, the filter produces an output
element by a linear combination of the input
element and previously computed output
elements.

=

=
(, )

−
−
=1
0(hwr)
Continue…

Applications of recursive filters
◦ Low-pass filtering like Gaussian kernels
◦ Inverse Convolution (X =  ∗ ,  =  ∗  −1 )
◦ Summed-area tables
recursive filters
input
blurred


Recursive filters can be causal or anticausal
(or non-causal).
Causal filters operate on previous values.

=  −
−
=1

Anticausal filters operate on “future” values.

=  −
+
=1

Anticausal filters operate on “future” values.

=
(, )

=  −

+
=1
Continue…

It is often required to perform a sequence of
recursive image filters.
P
P’
U
V
X
Y
Z
E
E’
• Independent Columns
• Causal
= (, )
• Anticausal
= (, )
• Independent Rows
• Causal
=   (′, )
• Anticausal
=   (′, )




The naïve approach to solving the sequence of recursive filters
does not sufficiently utilize the processing cores of the GPU.
The latest GPU from NVIDIA has 2,668 shader cores. Processing
even large images (2048x2048) will not make full use of all
available cores.
Under utilization of the GPU cores does not allow for latency
hiding.
We need a way to make better utilization of the GPU without
increasing IO.

In the paper “GPU-Efficient Recursive
Filtering and Summed-Area Tables” by Diego
Nehab et. al. they introduce a new
algorithmic framework to reduce memory
bandwidth by overlapping computation over
the full sequence of recursive filters.

Partition the image into 2D blocks of size
.

A prefix sum
=  + −1





Simple case of a first-order recursive filter.
A scan generalizes the recurrence using an
arbitrary binary associative operator.
Parallel prefix-sums and scans are important
building blocks for numerous algorithms.
[Iverson 1962; Stone 1971; Blelloch 1989;
Sengupta et. al. 2007]
An optimized implementation comes with the
CUDPP library [2011].

A generalization of the prefix sum using a
weighted combination of prior outputs.

=  −
−
=1


This can be implemented as a scan operation
with redefined basic operators.
Ruijters and Thevenaz [2010] exploit
parallelisim across the rows and columns of
the input.

Sung and Mitra [1986] use block parallelism
and split the computation into two parts:
◦ One computation based only on the block data
assuming a zero initial conditions.
◦ One computation based only on the initial
conditions and assuming zero block data.
=  0,  + (, 0)
Summed-area tables enable the averaging
rectangular regions of pixel with a constant
+ UL
LL
+
width
UR
height

LR
−  −  +
=
ℎ ∙ ℎℎ

The paper titled “Fast Summed-Area Table
Generation…” from Justin Hensley et. al.
(2005) describes a method called recursive
doubling which requires multiple passes of
the input image. (A 256x256 image requires
16 passes to compute).
Image A
Image B
Image A
Image B

In 2010, Justin Hensley
extended his 2005
implementation to compute
per pass and storing the
result in intermediate shared
memory. Now a 256x256
image only required 4 passes
pass.


Casual recursive filters of order  are
characterized by a set of  feedback
coefficients  in the following manner.
Given a prologue vector  ∈ ℝ and an input
vector  ∈ ℝℎ of any size ℎ the filter
produces the output:
= (, )

Such that  ∈ ℝℎ ( has the same size as the
input ).

Causal recursive filters depend on a prologue
vector  ∈ ℝ

=  −
−
=1

Similar for the anitcausal filter. Given an input
vector  ∈ ℝℎ and an epilogue vector  ∈ ℝ , the
output vector  = (, ) is defined by:

=  −
′ +
=1

For row processing, we define an extended
casual filter   and anticausal filter  .

=  (′, )

=  (, ′)

With these definitions, we are able to formulate the
problem of applying the full sequence of four
recursive filters (down, up, right, left).
P
P’
U
V
X
Y
Z
E
E’
• Independent Columns
• Causal
= (, )
• Anticausal
= (, )
• Independent Rows
• Causal
=   (′, )
• Anticausal
=   (′, )

The goal is to implement this algorithm on
the GPU to make full use of all available
resources.
◦ Maximize occupancy by splitting the problem up to
make use of all cores.
◦ Reduce I/O to global memory.


Must break the dependency chain in order to
Primary design goal: Increase the amount of
parallelism without increasing memory I/O.
 Baseline
algorithm ‘RT’
 Block notation
 Inter-block parallelism
 Kernel fusion
Independent row and column processing




Step RT1: In parallel for each column in ,
apply  sequentially and store .
Step RT2: In parallel for each column in ,
apply  sequentially and store .
Step RT1: In parallel for each row in ,
apply   sequentially and store .
Step RT1: In parallel for each row in ,
apply  sequentially and store .
input
stages
output
column processing
row processing
ℎ


Completion takes (

Bandwidth usage in total is 8ℎ

ℎ

=
=
=
=
=
4r ) steps
streaming multiprocessors
number of cores (per processor)
width of the input image
height of the input image
order of the applied filter


Partition input image into    blocks
◦  = number of threads in warp (=32)
What means what?
◦ ,
=    block in matrix  with index ,
◦ −,  =    column-prologue submatrix
◦ +,  =    column-epilogue submatrix
For rows we have (similar) transposed operators:
,−  and  ,+

Tail and head operators: selecting prologueand epilogue-shaped submatrices from
,  =  ,
,  , =  ,
,  =  ,
,  , =  ,

Result: blocked version of problem definition
= (, ),  = (, ),  =  (′, ),  =  (′, )
,  =  −,  , ,
,  =  ,  , +,
,  =   ,−  , ,
,  =  ,  ,  ,+
Superposition
(based on linearity)
Effects of the input and prologue/epilogue on
the output can be computed independently
,  =  ,  +  ,
,  =  ,  +  ,
Express as matrix products
For any ,  is the    identity matrix

,
,
,
,
=
=
=
=
,
,
,
,
=
=
=
=

( ) =
( ) =
Precomputed
matrices that depend only on the feedback

,  =     ,  =
coefficients of filters  and  respectively. Details in paper.
Perform block computation independently
=  −  ,
output block
Prologue −
/ superposition
tail of prev. output block (−  )
=  ,
+  −  ,
=  ,
first term
()
+  −  ,
second term
,  −
incomplete causal output
− ()
=   +

− () (1)
Recall:   =  −  ,
(2)
Algorithm 1
1.1
In parallel for all m, compute and store each
1.2
Sequentially for each m, compute and store the   according
to (1) and using the previously computed
1.3
In parallel for all m, compute & store output block
using (2) and the previously computed − ()
Processing all rows and columns using causal and anti-causal
filter pairs requires 4 successive applications of algorithm 1.

There are
independent tasks: hides memory access latency.

However.. The memory bandwidth usage is now  +
Significantly more than algorithm RT () 
can be solved

.



Original idea: Kirk & Hwu [2010]
Use output of one kernel as input for the
next without going through global memory.
Fused kernel: code from both kernels but
keep intermediate results in shared mem.




Use Algorithm 1 for all filters, do fusing.
Fuse last stage of  with first stage of
Fuse last stage of  and first stage of
Fuse last stage of   with first stage of
We aimed for bandwidth reduction. Did it work?

Algorithm 1:

Algorithm 2:
16
12 +
ℎ

16
9+
ℎ

yes, it did!
output
stages
input
*
fix
fix
fix
fix
* for the full algorithm in text, please see the paper



Further I/O reduction is still possible: by recomputing
intermediary results instead of storing in memory.
More bandwidth reduction: 6 +
No. of steps:
ℎ
(

(14 +
1
4

22

ℎ
(=good)
Bandwidth usage is less than Algorithm RT(!) but involves
more computations  *But.. future hardware may tip the
balance in favor of more computations.



Overlapping is introduced to reduce IO to
global memory.
It is possible to work with twice-incomplete
anticausal epilogues , (), computed
directly from the incomplete causal output
block , ().
This is called casual-anticausal overlapping.

Recall that we can express the filter so that
the input and the prologue or epilogue can be
together.
(, ) =  ,  +  ,  ,
(, ) =  ,  + (, )

Using the previous properties, we can split
the dependency chains of anticausal
epilogues.

Which can be further simplified to:


Where the twice-incomplete  is such that
Each twice-incomplete epilogue  () depends
only on the corresponding input block  () and
therefore they can all be computed in parallel
already in the first pass. As a byproduct of that
same pass, we can compute and store the  ()
that will be needed to obtain  (). With  (),
we can compute all  () in the following pass.
1.
2.
3.
4.
In parallel for all , compute and store  ()
and  ().
Sequentially for each , compute and store the
() using the previously computed  ().
Sequentially for each , compute and store
() using the previously computed −1 ()
and  ().
In parallel for all , compute each causal
output block  () using the previously
computed −1 (). Then compute and store
each anticausal output block  () using the
previously computed +1 ().

Algorithm 3 computes row and columns in
separate passes. Fusing these two stages,
results in algorithm 4.
1.
2.
3.
4.
5.
6.
7.
In parallel for all  and , compute and store the , () and , ().
Sequentially for each , but in parallel for each , compute and store
the , () using the previously computed , ().
Sequentially for each , but in parallel for each , compute and store
the , () using the previously computed −1, ( ) and , ().
In parallel for all  and , compute , () using the previously
computed −1, (). Then compute and store the , () using the
previously computed +1, (). Finally, compute and store both
() and   ().
,
,
Sequentially for each , but in parallel for each , compute and store
() from  ().
the ,
,
Sequentially for each , but in parallel for each , compute and store
() using the previously computed

each ,
,−1 () and , ().
In parallel for all  and , compute , () using the previously

computed ,−1
() and , (). Then compute and store the , ()
using the previously computed ,+1 ().
input
output stages
fix both

fix both
◦ Eliminates reading and writing causal results
 Both in column and in row processing
◦ Modest increase in computation


There is still one source of inefficiency in
algorithm 4. We wait until the complete
block ,  is available in stage 4 before

computing incomplete ,
and twice
incomplete ,
().
We can overlap row and column computations
and work with thrice-incomplete transposed
epilogues obtained directly during algorithm
4 stage 1.


Below is the formula for completing the
thrice-incomplete transposed prologues:
The thrice-incomplete  satisfies

To complete the four-times-incomplete
transposed epilogues of :
1.
2.
3.
4.
5.
6.
In parallel for all  and , compute and store each , (),

, (), ,
(), and ,
().
In parallel for all , sequentially for each , compute and store
the , () using the previously computed −1, ().
In parallel for all , sequentially for each , compute and store
, () using the previously computed −1, () and +1, ().
In parallel for all , sequentially for each , compute and store

,
() using the previously computed ,
(), −1, ( ), and
+1, ().
In parallel for all , sequentially for each , compute and store

,
() using the previously computed ,
(), ,−1
(),

−1, (), and +1, ().
In parallel for all  and , successively compute , (),
, (), , (), and , () using the previously computed

−1, (), +1, (), ,−1
(), and ,+1
(). Store , ().
input
output stages
fix all!

◦ Eliminates reading and writing column results
◦ Modest increase in computation
Start from input and global borders
Compute & store incomplete borders
Compute & store incomplete borders
Compute & store incomplete borders
Compute & store incomplete borders
Compute & store incomplete borders
Compute & store incomplete borders
Compute & store incomplete borders
Compute & store incomplete borders
All borders in global memory
Fix incomplete borders
Fix twice-incomplete borders
Fix thrice-incomplete borders
Fix four-times-incomplete borders
Done fixing all borders
Finish causal columns
Finish anticausal columns
Finish causal rows
Finish anticausal rows
Store results to global memory
Done!



A summed-area table is obtained using prefix
sums over columns and rows.
The prefix-sum filter  is a special case firstorder causal recursive filter (with feedback
coefficient 1 = −1).
We can directly apply overlapping to optimize
the computation of summed-area tables.

In blocked form, the problem is to obtain
output  from input  where the blocks
satisfy the relations:

,  =   ,−1
, , ()
,  = (−1,  , ,  )


Using the strategy developed for causalanticausal overlapping, computing  and
using overlapping becomes easy.
In the first stage, we compute the incomplete
output blocks , () and , () directly
from the input.
,  = (, ,  )
,  =   , , ()

We store only the incomplete prologues

, () and ,
(). Then we complete them
using:
,  = −1,  + , ()

,
= ,−1
+  −1,


+ ,
()
Scalar  −1,  denotes the sum of all
entries in vector −1,  .
1.
2.
3.
4.
In parallel for all  and , compute and store

the , () and ,
().
Sequentially for each , but in parallel for each
, compute and store the , () using the
previously computed, (). Compute and store
(,  ).
Sequentially for each , but in parallel for each
() using the
, compute and store the ,
() and
previously computed −1, (), ,
(,  ).
In parallel for all  and , compute , () then
compute and store , () using the previously
().
computed , () and ,

S.1 Reads the input then computes and stores
the incomplete prologues , () (red) and

,
(blue).

S.2 Completes the prologues , () (red) and
computes scalars  −1,
(yellow).


S.3 Completes prologues ,
()

S.4 Reads the input and completed
prologues, then computes and stores the
final summed-area table.
First-order filter benchmarks
• Alg. RT is the baseline implementation
• Alg. 4 adds causal-anticausal overlapping
• Ruijters et al. 2010 “GPU prefilter […]”
• Eliminates 4hw of IO
• Modest increase in computation
• Alg. 2 adds block parallelism & tricks
• Sung et al. 1986 “Efficient […] recursive […]” • Alg. 5 adds row-column overlapping
• Blelloch 1990 “Prefix sums […]”
• Eliminates additional 2hw of IO
• + tricks from GPU parallel scan algorithms
• Modest increase in computation
Cubic B-Spline Interpolation
(GeForce GTX 480)
7
5
4
2
RT
Throughput (GiP/s)
6
5
Alg.
5
4
4
3
2
2
1
RT
64
2
128
2
256
2
2
512
1024
Input size (pixels)
2
2048
2
4096
2
Step
Complexity
Max. # of
Memory
Bandwidth
Summed-area table benchmarks
• First-order filter, unit coefficient, no anticausal component
• Harris et al 2008, GPU Gems 3
• Hensley 2010, Gamefest
• “Parallel prefix-scan […]”
• Multi-scan + transpose + multiscan
• Implemented with CUDPP
• “High-quality depth of field”
• Multi-wave method
• Our improvements
+ specialized row and column kernels
+ save only incomplete borders
+ fuse row and column stages
Summed-area Table
(GeForce GTX 480)
9
Overlapped SAT
Improved Hensley [2010]
Hensley [2010]
Harris et al [2008]
8
Throughput (GiP/s)
7
• Overlapped SAT
6
• Row-column overlapping
5
4
3
2
1
64
2
128
2
256
2
2
512
1024
Input size (pixels)
2
2048
2
4096
2



The paper describes an efficient algorithmic
framework that reduces memory bandwidth
over a sequence of recursive filters.
It splits the input into blocks that are
processed in parallel on the GPU.
Overlapping the causal, anticausal, row and
columns filters to reduce IO to global memory
which leads to substantial performance gains.


Difficult to understand theoretically
Complex implementation
Questions?
Alg. RT (0.5 GiP/s)
Alg. 2 (3 GiP/s)
baseline
+ block parallelism
Alg. 4 (5 GiP/s)
Alg. 5 (6 GiP/s)
+ causal-anticausal overlapping
+ row-column overlapping
```