Report: GPU-Efficient Recursive Filtering and Summed-Area Tables

Jeremiah van Oosten, Reinier van Oeveren

Contents: Introduction ◦ Related Works (Prefix Sums and Scans; Recursive Filtering; Summed-Area Tables) ◦ Problem Definition ◦ Parallelization Strategies (Baseline Algorithm RT; Block Notation; Inter-block Parallelism; Kernel Fusion, Algorithm 2) ◦ Overlapping (Causal-Anticausal Overlapping, Algorithms 3 & 4; Row-Column Causal-Anticausal Overlapping, Algorithm 5; Overlapped Summed-Area Tables, Algorithm SAT) ◦ Results ◦ Conclusion

Introduction

Linear filtering is commonly used to blur, sharpen or down-sample images. A direct implementation evaluating a filter of support d on an h x w image has a cost of O(hwd). This cost can be reduced using a recursive filter, in which previously computed results are used to compute the current value. At each step, the filter produces an output element by a linear combination of the input element and previously computed output elements:

    y_i = x_i − Σ_{k=1}^{r} a_k y_{i−k}

The cost is thereby reduced to O(hwr), where r, the order of the filter (the number of recursive feedbacks), is typically much smaller than the support d.

Applications of recursive filters:
◦ low-pass filtering, such as approximating Gaussian kernels;
◦ inverse convolution (given Z = X ∗ F, recover X = Z ∗ F^{−1});
◦ summed-area tables.

[Figure: an input image and its recursively blurred result.]

Recursive filters can be causal or anticausal (or non-causal). Causal filters operate on previously computed values:

    y_i = x_i − Σ_{k=1}^{r} a_k y_{i−k}

Anticausal filters operate on "future" values; writing z = R(x, e):

    z_i = x_i − Σ_{k=1}^{r} a'_k z_{i+k}

It is often required to perform a sequence of recursive image filters on an input X, with prologues P, P' and epilogues E, E':
• Independent columns: causal Y = F(P, X); anticausal Z = R(Y, E)
• Independent rows: causal U = F^T(P', Z); anticausal V = R^T(U, E')

The naïve approach to solving the sequence of recursive filters does not sufficiently utilize the processing cores of the GPU. The latest GPU from NVIDIA has 2,688 shader cores; processing one row or column per thread, even for large images (2048x2048), will not make full use of all available cores.
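To make the recurrences concrete, here is a small pure-Python sketch (our own illustration, not code from the paper; the function names and test signals are invented). It implements a causal filter of arbitrary order r and derives the anticausal filter by running the same recurrence backwards:

```python
def causal_filter(x, a, prologue=None):
    """y[i] = x[i] - sum_{k=1..r} a[k-1] * y[i-k], with the r values
    preceding the input supplied by `prologue` (zeros if None)."""
    r = len(a)
    hist = list(prologue) if prologue is not None else [0.0] * r
    y = []
    for xi in x:
        yi = xi - sum(a[k] * hist[-(k + 1)] for k in range(r))
        y.append(yi)
        hist.append(yi)
    return y

def anticausal_filter(x, a, epilogue=None):
    """z[i] = x[i] - sum_{k=1..r} a[k-1] * z[i+k]: the same recurrence,
    run right-to-left, with `epilogue` supplying the values past the end."""
    e = list(reversed(epilogue)) if epilogue is not None else None
    return list(reversed(causal_filter(list(reversed(x)), a, e)))
```

With r = 1 and a = [-1.0], the causal filter degenerates to the prefix sum y_k = x_k + y_{k−1} that reappears later for summed-area tables.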
Under-utilization of the GPU cores does not allow for latency hiding. We need a way to make better use of the GPU without increasing I/O. In the paper "GPU-Efficient Recursive Filtering and Summed-Area Tables", Diego Nehab et al. introduce a new algorithmic framework that reduces memory bandwidth by overlapping computation over the full sequence of recursive filters. The key idea is to partition the image into 2D blocks that can be processed in parallel.

Related Works

Prefix Sums and Scans

A prefix sum,

    y_k = x_k + y_{k−1}

is the simple case of a first-order recursive filter. A scan generalizes the recurrence to an arbitrary binary associative operator. Parallel prefix sums and scans are important building blocks for numerous algorithms [Iverson 1962; Stone 1971; Blelloch 1989; Sengupta et al. 2007]. An optimized implementation comes with the CUDPP library [2011].

Recursive Filtering

A recursive filter is a generalization of the prefix sum using a weighted combination of prior outputs:

    y_i = x_i − Σ_{k=1}^{r} a_k y_{i−k}

This can be implemented as a scan operation with redefined basic operators. Ruijters and Thévenaz [2010] exploit the parallelism across the rows and columns of the input. Sung and Mitra [1986] use block parallelism and split the computation into two parts:
◦ one based only on the block data, assuming zero initial conditions, and
◦ one based only on the initial conditions, assuming zero block data:

    y = F(0, x) + F(p, 0)

Summed-Area Tables

Summed-area tables enable averaging rectangular regions of pixels with a constant number of reads. Sampling the table at the four corners UL, UR, LL and LR of a rectangle,

    average = (LR − LL − UR + UL) / (width · height)

The paper "Fast Summed-Area Table Generation [...]" by Justin Hensley et al. (2005) describes a method called recursive doubling, which requires multiple passes over the input image (a 256x256 image requires 16 passes). In 2010, Hensley extended the 2005 implementation to compute shaders, taking more samples per pass and storing intermediate results in shared memory: a 256x256 image then requires only 4 passes when reading 16 samples per pass.
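The constant-number-of-reads rectangle query can be illustrated with a short pure-Python sketch (ours, for illustration; `build_sat` uses a simple serial scan, precisely the kind of computation the rest of the report parallelizes):

```python
def build_sat(img):
    """Summed-area table: sat[i][j] = sum of img[0..i][0..j], inclusive."""
    h, w = len(img), len(img[0])
    sat = [[0.0] * w for _ in range(h)]
    for i in range(h):
        row_sum = 0.0
        for j in range(w):
            row_sum += img[i][j]                        # prefix sum along the row
            sat[i][j] = row_sum + (sat[i - 1][j] if i > 0 else 0.0)
    return sat

def rect_sum(sat, top, left, bottom, right):
    """Sum over the inclusive rectangle with four reads: LR - LL - UR + UL."""
    total = sat[bottom][right]                          # LR
    if top > 0:
        total -= sat[top - 1][right]                    # UR
    if left > 0:
        total -= sat[bottom][left - 1]                  # LL
    if top > 0 and left > 0:
        total += sat[top - 1][left - 1]                 # UL
    return total
```

Dividing `rect_sum` by the rectangle's width times height gives the average, as in the formula above.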
Problem Definition

Causal recursive filters of order r are characterized by a set of feedback coefficients a_k. Given a prologue vector p ∈ R^r and an input vector x ∈ R^h of any size h, the filter produces the output y = F(p, x), where y ∈ R^h has the same size as the input:

    y_i = x_i − Σ_{k=1}^{r} a_k y_{i−k}

The prologue p supplies the r output values that precede the input. Similarly for the anticausal filter: given an input vector x ∈ R^h and an epilogue vector e ∈ R^r, the output vector z = R(x, e) is defined by:

    z_i = x_i − Σ_{k=1}^{r} a'_k z_{i+k}

For row processing, we define the extended (transposed) causal filter F^T and anticausal filter R^T:

    U = F^T(P', Z),  V = R^T(U, E')

With these definitions, we can formulate the problem of applying the full sequence of four recursive filters (down, up, right, left) to the input X with prologues P, P' and epilogues E, E':
• Independent columns: causal Y = F(P, X); anticausal Z = R(Y, E)
• Independent rows: causal U = F^T(P', Z); anticausal V = R^T(U, E')

The goal is to implement this algorithm on the GPU while making full use of all available resources:
◦ Maximize occupancy by splitting the problem up to make use of all cores.
◦ Reduce I/O to global memory.

We must break the dependency chain in order to increase task parallelism. The primary design goal: increase the amount of parallelism without increasing memory I/O. The development proceeds through the baseline algorithm 'RT', block notation, inter-block parallelism, and kernel fusion.

Baseline (Algorithm RT)

Algorithm RT exploits the independent row and column processing:
Step RT1: In parallel for each column in X, apply F sequentially and store Y.
Step RT2: In parallel for each column in Y, apply R sequentially and store Z.
Step RT3: In parallel for each row in Z, apply F^T sequentially and store U.
Step RT4: In parallel for each row in U, apply R^T sequentially and store V.

With s streaming multiprocessors of c cores each, an image of width w and height h, and filter order r, completion takes on the order of 4rhw/(sc) steps, and the total bandwidth usage is 8hw, since each of the four stages reads and writes the full image.

Block Notation

Partition the input image into b x b blocks, where b is the number of threads in a warp (b = 32).
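As a point of reference, the four RT passes can be sketched in pure Python for a first-order filter pair (an illustrative toy, not the paper's CUDA implementation; the names and the scalar coefficients `a` and `a_prime` are ours):

```python
def scan_causal(x, a):
    """First-order causal pass: y[i] = x[i] - a * y[i-1] (zero prologue)."""
    y, prev = [], 0.0
    for xi in x:
        prev = xi - a * prev
        y.append(prev)
    return y

def scan_anticausal(x, a):
    """First-order anticausal pass: z[i] = x[i] - a * z[i+1] (zero epilogue)."""
    return list(reversed(scan_causal(list(reversed(x)), a)))

def algorithm_rt(img, a, a_prime):
    """Baseline RT schedule: four full passes, one task per column or row.
    Each pass reads and writes the whole image, hence 8*h*w of traffic."""
    h, w = len(img), len(img[0])
    # RT1 + RT2: causal then anticausal filter down/up each column
    cols = [scan_anticausal(scan_causal([img[i][j] for i in range(h)], a), a_prime)
            for j in range(w)]
    tmp = [[cols[j][i] for j in range(w)] for i in range(h)]
    # RT3 + RT4: causal then anticausal filter along each row
    return [scan_anticausal(scan_causal(row, a), a_prime) for row in tmp]
```

With a = −1 and a′ = 0 the four passes reduce to column and row prefix sums, i.e. the result is a summed-area table.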
The block notation is as follows:
◦ B_{m,n}(X): the block of matrix X with index m,n
◦ P_{m−1,n}(Y): the column-prologue submatrix (the last r rows of the block above)
◦ E_{m+1,n}(Z): the column-epilogue submatrix (the first r rows of the block below)
For rows we have similar transposed operators: P^T_{m,n−1}(U) and E^T_{m,n+1}(V).

The tail and head operators select prologue- and epilogue-shaped submatrices from a block:

    P_{m,n}(Y) = T(B_{m,n}(Y)),  E_{m,n}(Z) = H(B_{m,n}(Z))

with transposed variants T^T and H^T for rows. The result is a blocked version of the problem definition Y = F(P, X), Z = R(Y, E), U = F^T(P', Z), V = R^T(U, E'):

    B_{m,n}(Y) = F(P_{m−1,n}(Y), B_{m,n}(X))
    B_{m,n}(Z) = R(B_{m,n}(Y), E_{m+1,n}(Z))
    B_{m,n}(U) = F^T(P^T_{m,n−1}(U), B_{m,n}(Z))
    B_{m,n}(V) = R^T(B_{m,n}(U), E^T_{m,n+1}(V))

Superposition (based on linearity): the effects of the input and of the prologue/epilogue on the output can be computed independently,

    F(p, x) = F(p, 0) + F(0, x)
    R(x, e) = R(0, e) + R(x, 0)

and the prologue/epilogue terms can be expressed as matrix products with precomputed matrices that depend only on the feedback coefficients of filters F and R respectively (details in the paper).

Inter-block Parallelism (Algorithm 1)

Each block computation can now be performed independently. Let B_m(Ȳ) = F(0, B_m(X)) denote the incomplete causal output of block m, computed with a zero prologue. By superposition, the completed prologue (the tail of the previous output block) and the completed output block satisfy:

    P_m(Y) = T(B_m(Ȳ)) + A_F P_{m−1}(Y)     (1)
    B_m(Y) = B_m(Ȳ) + F(P_{m−1}(Y), 0)      (2)

where A_F is one of the precomputed matrices.

Algorithm 1:
1.1 In parallel for all m, compute and store each incomplete block B_m(Ȳ).
1.2 Sequentially for each m, compute and store the P_m(Y) according to (1), using the previously computed P_{m−1}(Y).
1.3 In parallel for all m, compute and store each output block B_m(Y) using (2) and the previously computed P_{m−1}(Y).

Processing all rows and columns with causal and anticausal filter pairs requires 4 successive applications of Algorithm 1. There are now many independent tasks, which hides memory-access latency. However, the memory bandwidth usage is significantly more than Algorithm RT, because the incomplete results make an extra round trip through global memory.

Kernel Fusion (Algorithm 2)

This can be solved by kernel fusion (original idea: Kirk & Hwu [2010]): use the output of one kernel as the input of the next without going through global memory. A fused kernel contains the code of both kernels but keeps the intermediate results in shared memory. Algorithm 2 uses Algorithm 1 for all four filters and fuses the adjacent stages:
◦ the last stage of F with the first stage of R,
◦ the last stage of R with the first stage of F^T,
◦ the last stage of F^T with the first stage of R^T.

We aimed for bandwidth reduction. Did it work? The four applications of Algorithm 1 use roughly 16hw of bandwidth, while Algorithm 2 uses roughly 9hw plus small border terms in hw/b. Yes, it did!
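Algorithm 1's three stages can be sketched in pure Python for a first-order filter on a 1-D signal (our illustration; on the GPU, stages 1.1 and 1.3 are parallel kernels, and for order r the scalar factors below become the precomputed matrices of equation (1)):

```python
def block_filter_order1(x, a, b=32):
    """Algorithm 1 sketch for the first-order causal filter
    y[i] = x[i] - a * y[i-1], on a 1-D signal split into blocks of length b.
    (The 2-D algorithm applies this to every column of a block row.)"""
    blocks = [x[i:i + b] for i in range(0, len(x), b)]
    # Stage 1.1 (parallel over blocks): incomplete outputs F(0, B_m),
    # i.e. each block filtered with a zero prologue.
    incomplete = []
    for blk in blocks:
        y, prev = [], 0.0
        for xi in blk:
            prev = xi - a * prev
            y.append(prev)
        incomplete.append(y)
    # Stage 1.2 (sequential over blocks): complete the prologues.  For order 1
    # the matrix A_F reduces to the scalar (-a)**b, so the tail of completed
    # block m is the incomplete tail plus (-a)**b times the prologue.
    prologues, p = [], 0.0
    for y in incomplete:
        prologues.append(p)
        p = y[-1] + ((-a) ** len(y)) * p
    # Stage 1.3 (parallel over blocks): superposition B_m(Y) = F(0,B_m) + F(p,0),
    # where F(p, 0)[i] = (-a)**(i+1) * p for a first-order filter.
    out = []
    for y, pm in zip(incomplete, prologues):
        out.extend(yi + ((-a) ** (i + 1)) * pm for i, yi in enumerate(y))
    return out
```

Only the r-row prologues travel between blocks, which is what makes the per-block work independent.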
(For the full fused algorithm in text, please see the paper.)

Further I/O reduction is still possible by recomputing intermediate results instead of storing them in memory. This reduces bandwidth further, to roughly 6hw plus border terms (good), at the cost of a higher step count (bad*). Bandwidth usage then drops below Algorithm RT(!), but at the price of more computation.
*Future hardware may tip the balance in favor of more computations.

Causal-Anticausal Overlapping (Algorithms 3 & 4)

Overlapping is introduced to further reduce I/O to global memory. It is possible to work with twice-incomplete anticausal epilogues E_{m,n}(Ẑ), computed directly from the incomplete causal output blocks B_{m,n}(Ȳ). This is called causal-anticausal overlapping.

Recall that we can express the filters so that the effects of the input and of the prologue or epilogue can be computed independently and later added together:

    F(p, x) = F(p, 0) + F(0, x)
    R(x, e) = R(x, 0) + R(0, e)

Using these properties, we can split the dependency chains of the anticausal epilogues, and simplify further so that each twice-incomplete epilogue E_m(Ẑ) depends only on the corresponding input block. All of them can therefore be computed in parallel already in the first pass. As a byproduct of that same pass, we can compute and store the P_m(Ȳ) that will be needed to obtain the P_m(Y). With the P_m(Y), we can complete all E_m(Z) in the following pass.

Algorithm 3:
1. In parallel for all m, compute and store P_m(Ȳ) and E_m(Ẑ).
2. Sequentially for each m, compute and store the P_m(Y) using the previously computed P_{m−1}(Y).
3. Sequentially for each m (in descending order), compute and store E_m(Z) using the previously computed E_{m+1}(Z) and P_{m−1}(Y).
4. In parallel for all m, compute each causal output block B_m(Y) using the previously computed P_{m−1}(Y); then compute and store each anticausal output block B_m(Z) using the previously computed E_{m+1}(Z).

Algorithm 3 computes rows and columns in separate passes. Fusing these two stages results in Algorithm 4:
1. In parallel for all m and n, compute and store the P_{m,n}(Ȳ) and E_{m,n}(Ẑ).
2. Sequentially for each m, but in parallel for each n, compute and store the P_{m,n}(Y) using the previously computed P_{m−1,n}(Y).
3. Sequentially for each m (in descending order), but in parallel for each n, compute and store the E_{m,n}(Z) using the previously computed E_{m+1,n}(Z) and P_{m−1,n}(Y).
4. In parallel for all m and n, compute B_{m,n}(Y) using the previously computed P_{m−1,n}(Y); then compute and store the B_{m,n}(Z) using the previously computed E_{m+1,n}(Z). Finally, compute and store both P^T_{m,n}(Û) and E^T_{m,n}(V̂).
5. Sequentially for each n, but in parallel for each m, compute and store the P^T_{m,n}(U) using the previously computed P^T_{m,n−1}(U).
6. Sequentially for each n (in descending order), but in parallel for each m, compute and store the E^T_{m,n}(V) using the previously computed E^T_{m,n+1}(V) and P^T_{m,n−1}(U).
7. In parallel for all m and n, compute B_{m,n}(U) using the previously computed P^T_{m,n−1}(U) and B_{m,n}(Z); then compute and store the B_{m,n}(V) using the previously computed E^T_{m,n+1}(V).

Algorithm 4 adds causal-anticausal overlapping, which eliminates reading and writing the causal results, both in column and in row processing, at the cost of a modest increase in computation.

Row-Column Causal-Anticausal Overlapping (Algorithm 5)

There is still one source of inefficiency in Algorithm 4: we wait until the complete block B_{m,n}(Z) is available in stage 4 before computing the incomplete P^T_{m,n}(Û) and twice-incomplete E^T_{m,n}(V̂). We can instead overlap the row and column computations and work with thrice-incomplete transposed prologues P^T_{m,n}(Ǔ) and four-times-incomplete transposed epilogues E^T_{m,n}(V̌), obtained directly from the input during stage 1. The paper gives the formulas for completing these thrice- and four-times-incomplete quantities.

Algorithm 5:
1. In parallel for all m and n, compute and store each P_{m,n}(Ȳ), E_{m,n}(Ẑ), P^T_{m,n}(Ǔ), and E^T_{m,n}(V̌).
2. In parallel for all n, sequentially for each m, compute and store the P_{m,n}(Y) using the previously computed P_{m−1,n}(Y).
3. In parallel for all n, sequentially for each m (in descending order), compute and store the E_{m,n}(Z) using the previously computed E_{m+1,n}(Z) and P_{m−1,n}(Y).
4. In parallel for all m, sequentially for each n, compute and store the P^T_{m,n}(U) using the previously computed P^T_{m,n−1}(U), P_{m−1,n}(Y), and E_{m+1,n}(Z).
5. In parallel for all m, sequentially for each n (in descending order), compute and store the E^T_{m,n}(V) using the previously computed E^T_{m,n+1}(V), P^T_{m,n−1}(U), P_{m−1,n}(Y), and E_{m+1,n}(Z).
6. In parallel for all m and n, successively compute B_{m,n}(Y), B_{m,n}(Z), B_{m,n}(U), and B_{m,n}(V) using the previously computed P_{m−1,n}(Y), E_{m+1,n}(Z), P^T_{m,n−1}(U), and E^T_{m,n+1}(V). Store B_{m,n}(V).

Algorithm 5 adds row-column overlapping, which eliminates reading and writing the column results, again at the cost of a modest increase in computation.

In summary, a full run proceeds in two block passes over the image. First pass: starting from the input and the global borders, load each block into shared memory and compute and store all levels of incomplete borders. Border fix-up: with all borders in global memory, fix the incomplete, twice-incomplete, thrice-incomplete and four-times-incomplete borders in turn. Second pass: load each block into shared memory again, finish the causal and anticausal columns, then the causal and anticausal rows, and store the results to global memory. Done!

Overlapped Summed-Area Tables (Algorithm SAT)

A summed-area table is obtained using prefix sums over columns and rows. The prefix-sum filter is a special case of a first-order causal recursive filter (with feedback coefficient a_1 = −1). We can therefore directly apply overlapping to optimize the computation of summed-area tables. In blocked form, the problem is to obtain the output V from the input X, where the blocks satisfy the relations:

    B_{m,n}(Y) = F(P_{m−1,n}(Y), B_{m,n}(X))
    B_{m,n}(V) = F^T(P^T_{m,n−1}(V), B_{m,n}(Y))

Using the strategy developed for causal-anticausal overlapping, computing Y and V with overlapping becomes easy. In the first stage, we compute the incomplete output blocks B_{m,n}(Ȳ) and B_{m,n}(V̂) directly from the input:

    B_{m,n}(Ȳ) = F(0, B_{m,n}(X))
    B_{m,n}(V̂) = F^T(0, B_{m,n}(Ȳ))

We store only the incomplete prologues P_{m,n}(Ȳ) and P^T_{m,n}(V̂). Then we complete them using:

    P_{m,n}(Y) = P_{m−1,n}(Y) + P_{m,n}(Ȳ)
    P^T_{m,n}(V) = P^T_{m,n−1}(V) + s(P_{m−1,n}(Y)) + P^T_{m,n}(V̂)

where the scalar s(P_{m−1,n}(Y)) denotes the sum of all entries in the vector P_{m−1,n}(Y).

Algorithm SAT then proceeds in four stages, the first of which, S.1, reads the input and, in parallel for all m and n, computes and stores the incomplete prologues P_{m,n}(Ȳ) and P^T_{m,n}(V̂).
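The blocked SAT can be sketched in pure Python (our illustration; the lists of per-block borders stand in for what the real implementation keeps in global memory, and the corner scalar plays the role of s(·) above). The sketch assumes block-aligned image sizes:

```python
def overlapped_sat(img, b=2):
    """Blocked summed-area table in the spirit of stages S.1-S.4."""
    h, w = len(img), len(img[0])
    assert h % b == 0 and w % b == 0, "sketch assumes block-aligned sizes"
    mb, nb = h // b, w // b
    # S.1 (parallel over blocks): one read of the input; store only borders:
    #   colsum[m][n][j]: sum of the block's rows, per column j
    #   rowsum[m][n][i]: sum of the block's columns, per row i
    colsum = [[[sum(img[m * b + i][n * b + j] for i in range(b)) for j in range(b)]
               for n in range(nb)] for m in range(mb)]
    rowsum = [[[sum(img[m * b + i][n * b + j] for j in range(b)) for i in range(b)]
               for n in range(nb)] for m in range(mb)]
    # S.2 (sequential down the block rows): complete the column prologues;
    # above[m][n][j] = sum of everything above block (m,n) in column j.
    above = [[[0.0] * b for _ in range(nb)] for _ in range(mb)]
    for n in range(nb):
        for m in range(1, mb):
            above[m][n] = [above[m - 1][n][j] + colsum[m - 1][n][j] for j in range(b)]
    # S.3 (sequential across the block columns): complete the row prologues and
    # the corner scalar = sum of all entries above-left of the block (the s(.) role).
    left = [[[0.0] * b for _ in range(nb)] for _ in range(mb)]
    corner = [[0.0] * nb for _ in range(mb)]
    for m in range(mb):
        for n in range(1, nb):
            left[m][n] = [left[m][n - 1][i] + rowsum[m][n - 1][i] for i in range(b)]
            corner[m][n] = corner[m][n - 1] + sum(above[m][n - 1])
    # S.4 (parallel over blocks): read the input again and emit the final SAT
    # as the local 2-D prefix sum plus the three border corrections.
    sat = [[0.0] * w for _ in range(h)]
    for m in range(mb):
        for n in range(nb):
            for i in range(b):
                for j in range(b):
                    local = sum(img[m * b + ii][n * b + jj]
                                for ii in range(i + 1) for jj in range(j + 1))
                    sat[m * b + i][n * b + j] = (local
                        + sum(above[m][n][jj] for jj in range(j + 1))
                        + sum(left[m][n][ii] for ii in range(i + 1))
                        + corner[m][n])
    return sat
```

The key property mirrored here is that the input is read only twice (once in S.1 and once in S.4), while only small per-block borders are exchanged in between.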
S.2: Sequentially for each m, but in parallel for each n, complete and store the prologues P_{m,n}(Y) using the previously computed P_{m−1,n}(Y), and compute and store the scalars s(P_{m,n}(Y)).
S.3: Sequentially for each n, but in parallel for each m, complete and store the prologues P^T_{m,n}(V) using the previously computed P^T_{m,n−1}(V), s(P_{m−1,n}(Y)), and P^T_{m,n}(V̂).
S.4: In parallel for all m and n, read the input and the completed prologues, compute B_{m,n}(Y), then compute and store the final summed-area table B_{m,n}(V) using the previously computed P_{m−1,n}(Y) and P^T_{m,n−1}(V).

Results

First-order filter benchmarks:
• Alg. RT is the baseline implementation [Ruijters et al. 2010, "GPU prefilter [...]"].
• Alg. 2 adds block parallelism [Sung et al. 1986, "Efficient [...] recursive [...]"; Blelloch 1990, "Prefix sums [...]"] plus tricks from GPU parallel scan algorithms.
• Alg. 4 adds causal-anticausal overlapping: eliminates 4hw of I/O with a modest increase in computation.
• Alg. 5 adds row-column overlapping: eliminates an additional 2hw of I/O, again with a modest increase in computation.

[Chart: Cubic B-Spline interpolation throughput (GiP/s) versus input size, 64² to 4096² pixels, on a GeForce GTX 480, comparing Alg. RT, Alg. 2, Alg. 4 and Alg. 5.]
[Table: step complexity, maximum number of threads, and memory bandwidth per algorithm; see the paper.]
Summed-area table benchmarks (first-order filter, unit coefficient, no anticausal component):
• Harris et al. 2008, "Parallel prefix-scan [...]" (GPU Gems 3): multi-scan + transpose + multi-scan, implemented with CUDPP.
• Hensley 2010, "High-quality depth of field" (Gamefest): multi-wave method.
• Improved Hensley [2010]: our improvements, i.e. specialized row and column kernels, saving only incomplete borders, and fused row and column stages.
• Overlapped SAT: row-column overlapping.

[Chart: Summed-area table throughput (GiP/s) versus input size, 64² to 4096² pixels, on a GeForce GTX 480, comparing Overlapped SAT, Improved Hensley [2010], Hensley [2010] and Harris et al. [2008].]

Conclusion

The paper describes an efficient algorithmic framework that reduces memory bandwidth over a sequence of recursive filters. It splits the input into blocks that are processed in parallel on the GPU, and overlaps the causal, anticausal, row and column filters to reduce I/O to global memory, which leads to substantial performance gains. The drawbacks are that the framework is difficult to understand theoretically and complex to implement.

Questions?

• Alg. RT (0.5 GiP/s): baseline
• Alg. 2 (3 GiP/s): + block parallelism
• Alg. 4 (5 GiP/s): + causal-anticausal overlapping
• Alg. 5 (6 GiP/s): + row-column overlapping