ARM-Optimized JPEG Decoder

HW/SW Implementation of
JPEG Decoder
Division of Labor
 Software
 Profiling – Arindam/Eric
 Timing analysis – Arindam/Eric
 Interface to hardware - Arindam
 Test data for hardware - Eric
 Hardware – Mert
 C to Verilog Conversion
 Scheduling & Resource Allocation on FPGA
 Bus Communication Interface
Outline
 What is JPEG?
 Project Description
 JPEG Algorithm
 Profile Data
 Software Design
 Hardware Design
 Results
 Conclusion
What is JPEG?
 Image codec released by the Joint Photographic
Experts Group in 1992
Joint committee between the ISO/IEC JTC1 and ITU-T
standards committees
 Informally used to describe the file format JPEG-
encoded images are packed in
The file format specified in the original standard,
JPEG Interchange Format (JIF), is rarely used
Exif and JFIF, both based on JIF, are commonly used instead
What is JPEG? (cont.)
 Optimized for realistic images and photographs
 Color transitions should be smooth for best results
 Lossy compression, which can be tuned to produce
compressions of varying quality and size
Up to 20:1 with little visible loss in quality for appropriate images
Better ratios on photographs than other formats such as GIF, but
slower to compress and decompress
Has a lossless mode, but it is not widely used
Project Description
 Selected an existing software JPEG implementation
that we could modify to improve performance
 Criteria
Small enough to be easily understood and modified
Reasonably fast, but not optimized
Project Description (cont.)
 Most common JPEG implementation out there is
libjpeg, from the Independent JPEG Group
Fast, but hard to modify due to its complexity
 Various other open source implementations
 Tiny Jpeg Decoder
 jpeg-compressor
Project Description (cont.)
 We ended up choosing NanoJPEG, written by Martin
Reasonably fast, but not optimized
Very small code size (< 1000 lines) in a single file
Easy to understand
 I/O
 Decompresses grayscale or YCbCr images
 Outputs grayscale or RGB raw images
 Other details
 Written in C
 No floating point
JPEG Algorithm
 Step 1
 Convert the image to the YCbCr color space (typically
from RGB)
Y for brightness
Cb and Cr for blue and red color components
 The human eye is less sensitive to color changes than
it is to brightness changes
JPEG takes advantage of this
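A minimal sketch of the Step 1 color conversion, assuming the JFIF full-range equations (Y = 0.299R + 0.587G + 0.114B, with Cb and Cr centered on 128) scaled to 8-bit fixed point. The names ycbcr_t, sat_u8 and rgb_to_ycbcr are hypothetical and are not taken from any decoder discussed here.

```c
#include <stdint.h>

/* Hypothetical container for one converted pixel. */
typedef struct { uint8_t y, cb, cr; } ycbcr_t;

/* Saturate an integer to the 0..255 range. */
static uint8_t sat_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* RGB -> YCbCr with coefficients scaled by 256 (assumed JFIF equations).
 * The +128 rounds to nearest; the (128 << 8) term adds the chroma offset
 * before shifting, so every shifted value is non-negative. */
ycbcr_t rgb_to_ycbcr(uint8_t r, uint8_t g, uint8_t b)
{
    ycbcr_t out;
    out.y  = sat_u8(( 77 * r + 150 * g +  29 * b + 128) >> 8);
    out.cb = sat_u8((-43 * r -  85 * g + 128 * b + 128 + (128 << 8)) >> 8);
    out.cr = sat_u8((128 * r - 107 * g -  21 * b + 128 + (128 << 8)) >> 8);
    return out;
}
```

Gray inputs (R = G = B) map to Cb = Cr = 128, so all of the color information ends up in the two chroma channels that the next step downsamples.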
JPEG Algorithm (cont.)
 Step 2
 Downsample the color data (CbCr) by averaging
pixels together horizontally and vertically
Factor of two horizontally (along rows)
Factor of one or two vertically (along columns)
Total data can thus be reduced by 1/3 or 1/2
 Imperceptible loss in quality
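The averaging in Step 2 can be sketched as below. This is an illustration only: the function name downsample_chroma and its parameters are hypothetical, and the plane dimensions are assumed to be exact multiples of the downsampling factors.

```c
#include <stdint.h>

/* Downsample one chroma plane by averaging.
 * src: w x h samples, row-major.
 * dst: (w / 2) x (h / fy) samples, row-major.
 * The horizontal factor is fixed at 2; fy (vertical factor) is 1 or 2.
 * Assumes w is even and h is a multiple of fy. */
void downsample_chroma(const uint8_t *src, int w, int h, int fy, uint8_t *dst)
{
    int ow = w / 2;
    int oh = h / fy;
    int n = 2 * fy;                 /* samples averaged per output value */

    for (int oy = 0; oy < oh; oy++) {
        for (int ox = 0; ox < ow; ox++) {
            int sum = 0;
            for (int dy = 0; dy < fy; dy++)
                for (int dx = 0; dx < 2; dx++)
                    sum += src[(oy * fy + dy) * w + (ox * 2 + dx)];
            dst[oy * ow + ox] = (uint8_t)((sum + n / 2) / n);  /* round */
        }
    }
}
```

With fy = 1 each chroma plane shrinks to half its size, so Y + Cb + Cr goes from 3 units to 2; with fy = 2 each chroma plane shrinks to a quarter, giving 1.5 units.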
JPEG Algorithm (cont.)
 Step 3
 For each component, split the pixel data into 8x8 blocks
 Run each block through a discrete cosine transform
 End up with a matrix containing one DC value and
63 AC components
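To make Step 3 concrete, here is a naive (slow) 8x8 DCT-II sketch built from one 1-D routine applied first to rows and then to columns. It uses floating point for readability, unlike an integer-only decoder, and it omits any level shift of the input. The names COS16, cos16, dct8 and dct8x8 are hypothetical.

```c
/* cos(k * pi / 16) for k = 0..8 */
static const double COS16[9] = {
    1.0, 0.9807852804, 0.9238795325, 0.8314696123, 0.7071067812,
    0.5555702330, 0.3826834324, 0.1950903220, 0.0
};

/* cos(k * pi / 16) for any k >= 0, using the period of 32 and symmetry. */
static double cos16(int k)
{
    k %= 32;
    if (k > 16)
        k = 32 - k;                               /* cos(2pi - t) =  cos(t) */
    return k <= 8 ? COS16[k] : -COS16[16 - k];    /* cos(pi - t)  = -cos(t) */
}

/* 1-D 8-point DCT-II. The stride lets the same routine walk a row
 * (stride 1) or a column (stride 8) of a row-major 8x8 block. */
static void dct8(const double *in, double *out, int stride)
{
    for (int u = 0; u < 8; u++) {
        double sum = 0.0;
        for (int n = 0; n < 8; n++)
            sum += in[n * stride] * cos16((2 * n + 1) * u);
        out[u * stride] = 0.5 * (u == 0 ? 0.7071067812 : 1.0) * sum;
    }
}

/* 2-D DCT of one block: out[0] is the DC value, out[1..63] the AC terms. */
void dct8x8(const double in[64], double out[64])
{
    double tmp[64];
    for (int r = 0; r < 8; r++)
        dct8(in + 8 * r, tmp + 8 * r, 1);   /* row pass */
    for (int c = 0; c < 8; c++)
        dct8(tmp + c, out + c, 8);          /* column pass */
}
```

A perfectly flat block produces only a DC value (8 times the pixel value) and zero AC components, which is why smooth image regions compress so well.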
JPEG Algorithm (cont.)
 Step 4
 Divide each cell of the matrix by values defined in a
quantization matrix, then round to the nearest integer
 The quantization matrix has customizable values
The larger the values, the more cells are reduced to zero, and
hence lost
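A small sketch of Step 4 and of its inverse as used during decoding. The function names are hypothetical, and the rounding rule (ties away from zero) is an assumption, since the slides only say "round to the nearest".

```c
/* Integer division rounded to nearest, ties away from zero (assumed rule).
 * q must be positive. */
static int div_round(int a, int q)
{
    return a >= 0 ? (a + q / 2) / q : -((-a + q / 2) / q);
}

/* Encoder side: divide each DCT coefficient by its quantization entry. */
void quantize(const int coef[64], const int qtable[64], int out[64])
{
    for (int i = 0; i < 64; i++)
        out[i] = div_round(coef[i], qtable[i]);
}

/* Decoder side: multiply back. Whatever the rounding removed stays lost. */
void dequantize(const int q[64], const int qtable[64], int out[64])
{
    for (int i = 0; i < 64; i++)
        out[i] = q[i] * qtable[i];
}
```

For example, with a table entry of 16 a coefficient of 9 quantizes to 1 and comes back as 16, while a coefficient of -7 quantizes to 0 and is dropped entirely.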
JPEG Algorithm (cont.)
 Step 5
 Take the reduced blocks and perform Huffman
encoding (or Arithmetic encoding) to eliminate
redundant values
Lossless compression
 Step 6
 Wrap data in a standard file format, along with
compression data including quantization and
Huffman tables
JPEG Algorithm (cont.)
 Decoding is simply the reverse of the encoding
Recover the quantized matrices (Huffman decoding)
Multiply each by the quantization matrix
Run an inverse DCT (IDCT)
Upsample the color data and convert to RGB
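The last decode step can be done without floating point. The sketch below assumes the JFIF full-range equations with coefficients scaled by 256 (1.402, 0.344, 0.714 and 1.772 become 359, 88, 183 and 454). It is an illustration with hypothetical names, not a copy of any particular decoder's routine.

```c
#include <stdint.h>

/* Convert an 8.8 fixed-point value to a saturated 8-bit sample. */
static uint8_t fix8_to_u8(int v)
{
    if (v < 0)
        return 0;
    v >>= 8;
    return (uint8_t)(v > 255 ? 255 : v);
}

/* YCbCr -> RGB in integer arithmetic (assumed JFIF equations).
 * rgb[0] = R, rgb[1] = G, rgb[2] = B. The +128 rounds to nearest. */
void ycbcr_to_rgb(uint8_t y, uint8_t cb, uint8_t cr, uint8_t rgb[3])
{
    int yy = y << 8;        /* luma in 8.8 fixed point */
    int b = cb - 128;       /* chroma recentered on zero */
    int r = cr - 128;

    rgb[0] = fix8_to_u8(yy + 359 * r + 128);
    rgb[1] = fix8_to_u8(yy - 88 * b - 183 * r + 128);
    rgb[2] = fix8_to_u8(yy + 454 * b + 128);
}
```

Neutral chroma (Cb = Cr = 128) yields R = G = B = Y, so grayscale content passes through unchanged.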
Profile Data
 Profiled NanoJPEG on sample image with armsd
 55.10% of total time spent converting the image to
RGB and upsampling
Logically separate from decode phase
 38.34% of total time spent decoding the 8x8 blocks
 So block decoding is really 85.39% of the time not spent
converting/upsampling (38.34 / (100 - 55.10))
 Row and column IDCTs were about half of the block
decode time
These were our main focus for speedup, since they took about 42% of
decode time and were an obvious candidate for FPGA implementation
Software Design
[Figure: block decoding code, showing the row and column IDCT calls]
Software Design
 Interface –
 Write 8x8 integers to FPGA addresses 0xD3000100-0xD30001FF
 Read 8x8 integers from 0xD3000200-0xD30002FF (output of RowIDCT)
 Read 8x8 bytes from 0xD3000300-0xD300033F (output of ColIDCT)
 Code –
 Replace calls to IDCT functions with reads/writes to FPGA addresses
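A hedged sketch of what the replacement code might look like with memory-mapped access. The three offsets come from the address map above. Everything else is an assumption made for illustration: the function name, the word-wide reads, the byte order used to unpack the column output, and the absence of any ready/busy handshake.

```c
#include <stddef.h>
#include <stdint.h>

#define FPGA_BASE    0xD3000000u  /* base of the address map above */
#define OFF_BLOCK_IN 0x100u       /* 64 x 32-bit coefficients, written */
#define OFF_ROW_OUT  0x200u       /* 64 x 32-bit row IDCT results, read */
#define OFF_COL_OUT  0x300u       /* 64 x 8-bit col IDCT results, read */

/* Send one 8x8 block and read the results back.
 * base:    start of the FPGA window; on the target this would be
 *          (volatile uint32_t *)FPGA_BASE.
 * row_out: optional copy of the row IDCT results, may be NULL.
 * out:     64 result bytes, read as 16 words and unpacked with the low
 *          byte first (assumed packing). */
void fpga_idct_block(volatile uint32_t *base, const int32_t in[64],
                     int32_t *row_out, uint8_t out[64])
{
    volatile uint32_t *wr = base + OFF_BLOCK_IN / 4;
    volatile uint32_t *rr = base + OFF_ROW_OUT / 4;
    volatile uint32_t *rc = base + OFF_COL_OUT / 4;

    for (int i = 0; i < 64; i++)            /* 64 bus writes */
        wr[i] = (uint32_t)in[i];

    if (row_out != NULL)
        for (int i = 0; i < 64; i++)        /* 64 bus reads */
            row_out[i] = (int32_t)rr[i];

    for (int i = 0; i < 16; i++) {          /* 16 bus reads */
        uint32_t w = rc[i];
        out[4 * i + 0] = (uint8_t)(w & 0xFFu);
        out[4 * i + 1] = (uint8_t)((w >> 8) & 0xFFu);
        out[4 * i + 2] = (uint8_t)((w >> 16) & 0xFFu);
        out[4 * i + 3] = (uint8_t)((w >> 24) & 0xFFu);
    }
}
```

Passing the base pointer as a parameter keeps the routine testable on a host machine against an ordinary array standing in for the FPGA window.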
Hardware Design - Architecture
1. ARM writes row 0
2. Row IDCT on row 0 while ARM writes row 1
3. …
4. Row IDCT on row 7 while ARM reads row 0
5. Col IDCT on cols 0 - 7 while ARM reads rest of the block
6. ARM reads colIDCT results
Storage: 8x8x32b BLOCK register file and 8x8x8b COL_OUT register file
Hardware Design - Optimizations
 Register Files are used instead of RAMs to allow
random access to any word in the block matrix
 Arithmetic operations were distributed across multiple
stages to share resources and therefore reduce area
 Column IDCT and Row IDCT have a lot of common
operations –
  Use only a single datapath for both = Core IDCT
Hardware Design – Core IDCT
Hardware Design – Optimizations (2)
 The hardware speed is limited by the ARM – FPGA
bus transactions (block transfers).
  Optimize bus state machine:
Started with the 6-state bus state machine from Lab 2
Reduced it to only 3 states
 Total # of FPGA cycles per 8x8 block process:
 3 x (64 Writes + (64+16) Reads ) = 432 Cycles
 432 Cycles for 8 Row and 8 Column IDCTs
Results
 Hardware produces correct outputs in simulation
 Integrated system does not yet match simulation
 Communication overhead between ARM and FPGA
is the major bottleneck
 Expected speed-up:
 ARM: 8 x 60 + 8 x 120 = 1440 ARM Cycles (optimistic approximation)
 FPGA: 3 x (64 Writes + (64+16) Reads ) = 432 FPGA Cycles
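The two estimates above can be reproduced with a short calculation. The helper names are hypothetical. Reading 60 and 120 as per-call cycle counts for the row and column IDCT is an interpretation of the slide, and treating each bus state as one FPGA cycle follows the "3 x" factor in the formula.

```c
/* FPGA cycles per block: every bus transaction costs one cycle per
 * state of the bus state machine. */
unsigned fpga_cycles(unsigned bus_states, unsigned writes, unsigned reads)
{
    return bus_states * (writes + reads);
}

/* Software cycles per block: 8 row IDCT calls plus 8 column IDCT calls,
 * with assumed per-call costs. */
unsigned arm_cycles(unsigned row_cost, unsigned col_cost)
{
    return 8u * row_cost + 8u * col_cost;
}
```

fpga_cycles(3, 64, 64 + 16) gives 432 and arm_cycles(60, 120) gives 1440. The ratio of about 3.3 is a real speed-up only if ARM and FPGA cycles take the same time, which the slides do not state.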
Conclusion
 Work completed
 Parallelized the IDCT routines of each block decode in the FPGA
 Work to be completed
 Get interface working
 What we would have done differently
 Use DMA to reduce communication overhead even more
 Parallelize ARM and FPGA block processing
 Additional speed-up possible by moving njConvert
(upsampling & color conversion) into FPGA
