FPGA vs GPU Performance Comparison on the - Guy Tel-Zur

FPGA vs GPU Performance
Comparison on the Implementation
of FIR Filters
GPGPU platforms
• GP - General Purpose computation using GPU
• GPU - Graphics Processing Unit
• CUDA and OpenCL are both frameworks for taskbased and data-based general purpose parallel
execution. Their architectures show a great
similarity. The key difference between these two
frameworks is that OpenCL is a cross-platform
framework (implemented on CPUs, GPUs, DSPs
and etc.); whereas, CUDA is supported only by
Memory Hierarchy
• Memory hierarchy of GPGPU architectures show similarity with CPU
• At the bottom level of the memory hierarchy the slowest but the
largest capacity memory type resides. This type of memory is
named as global memory in CUDA terminology. A typical global
memory is 2 or 4 gigabyte size and resides outside of the GPU chip
• Global memories are usually manufactured using DRAM
• Constant memory is another memory type in CUDA devices and is
optimized for broadcasting operations, so that it can be accessed
faster than global memory.
• Like caches in CPU memory hierarchy, there is a faster but smaller
memory type in CUDA memory hierarchy called as shared memory.
• Registers are other storage units in CUDA memory hierarchy which
are private for each thread. Registers have the smallest latency and
maximum throughput, but their amount is very limited.
Memory Hierarchy
CUDA Memory Hierarchy
FIR Filter Overview
• FIR filter structure is constructed from its transfer
function and linear difference equation which is
obtained from taking inverse Z-transform of the
transfer function of the filter
FIR Filter Overview
• The output stream y(n) is calculated by
multiplying the input signal [x(n), x(n-1), …
x(n-M+1)] with the corresponding filter
coefficients [b0, b1, … ,bM-1] and adding all
the multiplication results together.
FIR Filter Overview
GPU Implementation
• Three different implementation techniques are
designed to compare the performance of GPUs with
the FPGAs.
• Two of the designs are implemented using CUDA.
• The other design is an OpenCL kernel implementation.
• The first CUDA design is a naïve and simple kernel that
does not include any significant optimization.
• The other optimized CUDA kernel uses shared memory
and coalesce global memory accesses.
• The third one is just an OpenCL version of the highly
optimized CUDA FIR filter implementation.
Basic CUDA FIR Filter
FPGA Implementation
• Three different implementation techniques are selected to
synthesize FIR filters on various FPGAs: Direct-form,
symmetric-form, and distributed arithmetic.
• It is possible to achieve massive level of parallelism by
utilizing multiplier sources of the FPGAs. Most Xilinx FPGAs
have DSP48 macro blocks embedded in their chips. These
slices have 18x18-bit multiplier units with pre-accumulator,
48-bit accumulators and selection multiplexers in order to
speed up DSP operations.
• For the direct-form and symmetric-form FPGA
implementations Xilinx’s DSP48 macro slices are utilized.
FPGA Implementation
• Distributed arithmetic (DA) technique is an efficient method for
implementing multiplication operations without using the DSP
macro blocks of the FPGA. In the DA technique, the coefficients of
the FIR filter is represented in two’s complement binary form and
all possible sum combinations of the filter coefficients are stored in
look-up tables (LUT). Using classical shift-adder method the
multiplication operation can be performed effectively without using
multiplier units of the FPGA. We used 4-input LUTs of the FPGA to
implement the DA form of the FIR filter structure.
• We chose three different FPGAs to compare the performance
results of the FIR filters. Utilized FPGAs and their properties are
given in Table 1. Xilinx ISE v14.1 software is used to synthesize the
Results and Discussions
GPU and CPU Performance Results of FIR Filter
Application (Million Samples per Second)
• FIR filter order has a noticeable effect on performance.
• For lower order FIR filters both FPGA and GPU achieved better
performance with respect to higher order FIR filters.
• Serialization due to the lack of enough multiplier units is the main
performance decrease reason for FPGAs.
• Logic resource capacity of an FPGA is another limiting factor to
implement high order FIR filters
• FPGAs have relatively lower prices than GPUs, yet GPUs enjoy the
ease of programmability where FPGAs are still tough to program.
• In general, FPGA performance is higher than GPU when the FIR
filter is fully parallelized on FPGA device.
• However, GPU outperforms FPGA when the FIR filter has to be
implemented with serial parts on FPGA.
• No!

similar documents