
Floating-Point Data Compression at 75 Gb/s on a GPU
Molly A. O'Neil and Martin Burtscher
Department of Computer Science
March 2011

Introduction

- Scientific simulations on HPC clusters run on interconnected compute nodes and produce and transfer large amounts of floating-point data.
- Data storage and transfer are expensive and slow.
- Compute nodes have multiple cores but only one network link.
- Interconnects are getting faster: Lonestar (Texas Advanced Computing Center) offers 40 Gb/s InfiniBand, and speeds of up to 100 Gb/s are coming soon.

Introduction (cont.)

- Compression brings reduced storage and faster transfers, but it is only useful when done in real time, so that the network can be saturated with compressed data. This requires a compressor tailored to the hardware's capabilities.
- GFC is a compression algorithm for IEEE 754 double-precision data, designed specifically for GPU hardware (CUDA). It provides a reasonable compression ratio and operates above the throughput of emerging networks.

Lossless Data Compression

- Dictionary-based (Lempel-Ziv family) [gzip, lzop]
- Variable-length entropy coders (Huffman, arithmetic coding)
- Run-length encoding [fax]
- Transforms (Burrows-Wheeler) [bzip2]
- Special-purpose floating-point compressors (prediction and leading-zero suppression) [FPC, FSD, PLMI]
- None of these offers real-time speeds for state-of-the-art networks.

GFC Algorithm

- GPUs require thousands of parallel activities, but compression is a generally serial operation.
- GFC therefore divides the data into n chunks that are processed in parallel; for best performance, n is chosen to match the maximum number of resident warps.
- Each chunk is composed of 32-word subchunks, one double per warp thread.
- The previous subchunk provides the prediction values.

Dimensionality

- Many scientific data sets display dimensionality: coordinates from multiple dimensions are interleaved.
- An optional dimensionality parameter to GFC determines the index of the previous subchunk to use as the prediction.
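The subchunk-based prediction described above can be sketched in plain C. This is an illustrative CPU-side reading of the slides, not the published CUDA code: `gfc_residual` and the layout constant are invented names. With dimensionality `dim`, element i of a subchunk is predicted by element i of the subchunk `dim` positions earlier, and the compressor encodes the 64-bit integer difference of the bit patterns:

```c
#include <stdint.h>
#include <stddef.h>

#define SUBCHUNK 32  /* one 64-bit word per thread of a 32-thread warp */

/* Residual for element i of subchunk s, with dimensionality dim:
   subtract the corresponding word of the subchunk dim positions back.
   Subchunks with no predecessor are predicted with 0. */
static uint64_t gfc_residual(const uint64_t *data, size_t s, size_t i,
                             size_t dim)
{
    uint64_t cur  = data[s * SUBCHUNK + i];
    uint64_t pred = (s >= dim) ? data[(s - dim) * SUBCHUNK + i] : 0;
    return cur - pred;  /* integer difference of the doubles' bit patterns */
}
```

Decompression reverses the step by adding the decoded residual back to the same prediction, so compressor and decompressor only need to agree on the dimensionality parameter.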
GFC Algorithm (cont.)

[Figure]

GPU Optimizations

- Low thread divergence (few if statements); some branches are short enough to be predicated.
- Memory accesses are coalesced by packing/unpacking the data in shared memory (for compute capability < 2.0).
- Very little inter-thread communication and synchronization: only a prefix sum.
- Warp-based implementation.

Evaluation Method

- System: two quad-core 2.53 GHz Xeons and an NVIDIA FX 5800 GPU (CC 1.3).
- 13 datasets of real-world data (19 – 277 MB): observational data, simulation results, and MPI messages.
- Comparisons: compression ratio vs. 5 compressors in common use; throughput vs. pFPC, the fastest known CPU compressor.

Compression Ratio

- Harmonic mean compression ratio: 1.188 (range: 1.01 – 3.53).
- Low, as expected for floating-point data, but in line with the other algorithms, and largely independent of the number of chunks.
- When done in real time, compression at this ratio can greatly speed up MPI applications: 3% – 98% speed-up [Ke et al., SC'04].

[Chart: harmonic mean compression ratios of bzip2, gzip, lzop, FPC, pFPC, and GFC]

Throughput

- Compression: 75 – 87 Gb/s (harmonic mean: 77.9 Gb/s).
- Decompression: 90 – 121 Gb/s (harmonic mean: 96.6 Gb/s).
- 4x faster than pFPC on 8 cores (2 CPUs), and an improvement over pFPC's compression ratio vs. performance trend.

[Chart: GFC compression and decompression throughput vs. pFPC on 8 cores, plotted against harmonic mean compression ratio]
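The prefix sum mentioned under GPU Optimizations is the one point where threads must cooperate. In this sketch's reading of the slides, once each thread knows how many bytes its compressed value occupies, a prefix sum over those counts gives every thread its write offset into the compressed subchunk. A serial CPU version (the GPU would use shared memory or warp-level primitives; the function name is illustrative):

```c
/* Exclusive prefix sum: offsets[i] = counts[0] + ... + counts[i-1].
   With per-thread compressed byte counts as input, offsets[i] is the
   position where thread i writes its bytes; the returned running sum
   is the total size of the compressed subchunk. */
static int exclusive_prefix_sum(const int *counts, int *offsets, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        offsets[i] = sum;
        sum += counts[i];
    }
    return sum;  /* total compressed bytes */
}
```

Because each chunk is processed by a single warp, this is also the only synchronization GFC needs; no communication ever crosses warp boundaries.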
NEW: Fermi Throughput

- Fermi improvements: faster, simpler memory accesses and hardware support for a count-leading-zeros operation.
- Compression ratio: 1.187.
- Compression: 119 – 219 Gb/s (harmonic mean: 167.5 Gb/s).
- Decompression: 169 – 219 Gb/s (harmonic mean: 180.3 Gb/s).
- Compresses over 9.5x faster than pFPC on 8 x86 cores.

[Chart: per-dataset compression and decompression throughput in Gb/s]

Summary

- The GFC algorithm chunks up the data; each warp processes a chunk iteratively in 32-word subchunks, with no communication required between warps.
- Minimum throughput of 75 Gb/s (compression) and 90 Gb/s (decompression) on a GTX-285, and 119 Gb/s and 169 Gb/s on Fermi, at a compression ratio of 1.19.
- The CUDA source code is freely available at http://www.cs.txstate.edu/~burtscher/research/GFC/

Conclusions

- The GPU can compress much faster than the PCIe bus can transfer the data.
- But the PCIe bus will become faster, CPU and GPU are increasingly on a single die (AMD), and GPU-to-GPU and GPU-to-NIC transfers may be coming (NVIDIA).
- GFC is the first compressor with the potential to deliver real-time floating-point data compression for current and emerging network speeds.
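The hardware count-leading-zeros support credited for part of the Fermi speed-up (exposed in CUDA as the `__clzll` intrinsic) accelerates exactly the leading-zero suppression step: the compressor stores only the non-zero low-order bytes of each residual. A CPU sketch using GCC/Clang's `__builtin_clzll` in place of the GPU instruction:

```c
#include <stdint.h>

/* Number of whole leading zero bytes (0-8) in a 64-bit residual; only
   the remaining 8 - leading_zero_bytes(r) bytes need to be stored.
   Note: __builtin_clzll(0) is undefined, so 0 is handled separately. */
static int leading_zero_bytes(uint64_t r)
{
    if (r == 0)
        return 8;
    return __builtin_clzll(r) / 8;
}
```

Small residuals, the common case when the predictor is good, thus shrink to a short count field plus one or two payload bytes, which is where the compression comes from.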