Report

Mesh-free numerical methods for large-scale engineering problems, and the basics of GPU-based computing
D. Stevens, M. Lees, P. Orsini
Supervisors: Prof. Henry Power, Herve Morvan
January 2008

Outline:
- Meshless numerical methods using Radial Basis Functions (RBFs): basic RBF interpolation; a brief overview of the work done in our research group, under the direction of Prof. Henry Power; example results; large-scale future problems
- GPU computing: introduction to the principle of using graphics processors (GPUs) as floating-point co-processors; current state of GPU hardware and software; basic implementation strategy for numerical simulations

GABARDINE - EU project
GABARDINE: Groundwater Artificial recharge Based on Alternative sources of wateR: aDvanced INtegrated technologies and managEment
- Aims to investigate the viability of artificial recharge of groundwater aquifers, and to produce decision-support mechanisms
- Partners from Portugal, Germany, Greece, Spain, Belgium, Israel and Palestine
- The University of Nottingham is developing numerical methods to handle: phreatic aquifers (with associated moving surfaces); transport processes; unsaturated-zone problems (with nonlinear governing equations)

RBF Interpolation Methods
An interpolant is built from N radial basis functions centred on the data locations, plus an optional polynomial term:
  u(x) = Σ_{j=1..N} λ_j φ(‖x − x_j‖) + p_{m−1}(x)
giving N unknown weights λ_j for the N data values f_i. Typical basis functions include the thin-plate spline φ(r) = r^m ln(r) and the generalised multiquadric φ(r) = (r² + c²)^{m/2}.
Applying the interpolation condition at each of the N test locations x_i gives a linear system for the weights:
  Σ_{j} λ_j φ(‖x_i − x_j‖) = f_i,  i = 1, ..., N
[Figure: profiles of typical radial basis functions]

Kansa's Method
For a boundary-value problem
  L[u(x)] = f(x) in Ω,  B[u(x)] = g(x) on ∂Ω
the solution is expanded as u(x) = Σ_{j=1..N} λ_j φ(‖x − x_j‖), and the governing and boundary operators are collocated at the interior and boundary nodes.
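The plain interpolation step described above can be sketched in a few lines of Python/NumPy (a minimal 1D illustration, not the project's code; the shape parameter c = 0.5 and the test function are arbitrary choices):

```python
import numpy as np

def mq(r, c=0.5):
    # Multiquadric basis phi(r) = sqrt(r^2 + c^2); c is the shape parameter
    return np.sqrt(r**2 + c**2)

# N test locations x_i and data values f_i (here a 1D example)
x = np.linspace(0.0, 1.0, 15)
f = np.sin(2.0 * np.pi * x)

# Dense N x N collocation matrix A_ij = phi(|x_i - x_j|)
A = mq(np.abs(x[:, None] - x[None, :]))
lam = np.linalg.solve(A, f)            # solution weights lambda_j

# Continuous reconstruction over the whole domain
xe = np.linspace(0.0, 1.0, 200)
ue = mq(np.abs(xe[:, None] - x[None, :])) @ lam
```

Note that the system matrix is densely populated, which is exactly the O(N²) cost and conditioning issue discussed later.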
For example, with the diffusion operator L = ∂²/∂x_j∂x_j, collocation gives
  Σ_j λ_j L_x[φ(‖x_i − x_j‖)] = f_i at the interior nodes
  Σ_j λ_j B_x[φ(‖x_i − x_j‖)] = g_i at the boundary nodes

Hermitian Method
For the same problem, L[u(x)] = f(x) in Ω, B[u(x)] = g(x) on ∂Ω, the operators are also applied to the basis functions:
  u(x) = Σ_{j=1..NB} λ_j B_ξ[φ(‖x − x_j‖)] + Σ_{j=NB+1..N} λ_j L_ξ[φ(‖x − x_j‖)]
Collocating as before gives the block system
  [ B_x B_ξ[φ_ij]   B_x L_ξ[φ_ij] ] λ_j = g_i
  [ L_x B_ξ[φ_ij]   L_x L_ξ[φ_ij] ] λ_j = f_i
which is a SYMMETRIC SYSTEM.

RBF - Features and Drawbacks
- Partial derivatives are obtained cheaply and accurately by differentiating the (known) basis functions
- This leads to a highly flexible formulation, allowing boundary operators to be implemented exactly and directly
- Once the solution weights are obtained, a continuous solution can be reconstructed over the entire solution domain
- However, a densely populated linear system must be solved to obtain the solution weights
- This leads to high computational cost - O(N²) - and to numerical conditioning issues with large systems, setting a practical limit of ~1000 points

Formulation: LHI method
Initial-value problem:
  ∂u/∂t = L[u] in Ω,  B[u(x)] = g(x) on ∂Ω,  with initial data u(x) = f(x)
Hermitian interpolation using solution values, and boundary operator(s) if present:
  u(x) = Σ_{j=1..N} λ_j φ(‖x − x_j‖) + Σ_{j=N+1..N+NB} λ_j B_ξ[φ(‖x − x_j‖)] + p_{m−1}(x)

LHI method formulation (continued)
Form N systems, each based on a local support:
  H^(k) λ^(k) = d^(k)
with
  H = [ φ              B_ξ[φ]             P_{m−1}        ]
      [ B_x[φ]         B_x B_ξ[φ]         B_x[P_{m−1}]   ]
      [ P_{m−1}^T      (B_x P_{m−1})^T    0              ]
and data d = (u_i, g(x_i), 0)^T. Each H is a small (~10 × 10) matrix.
Hence, the solution can be reconstructed in the vicinity of local system k via
  u^(k)(x) = R^(k)(x) · λ^(k)
where R(x) = ( φ(‖x − x_i‖), B_ξ[φ(‖x − x_i‖)], p_{m−1}(x) ).

CV-RBF (Modified CV Scheme)
For L[ψ(x)] = f(x) in Ω, B[ψ(x)] = g(x) on ∂Ω:
- Classic CV approach: apply the internal operator; apply the Dirichlet operator; polynomial interpolation to compute the flux
- CV-RBF approach: apply the internal operator; apply the boundary operator; RBF interpolation to compute the flux

Simulation Workflow
- Dataset generation: CAD, GridGen, Triangle
- RBF-specific pre-processing (our code)
- Meshless simulation: RBF and CVRBF solvers (our code)
- Post-processing: TecPlot

Meshless Convection-Diffusion: Validation
Both methods have been validated against a series of 1D and 3D, linear and nonlinear, advection-diffusion-reaction problems.
[Figure: calculated vs analytical solutions with the LHI method for Pe = 50 (c = 0.12), Pe = 100 (c = 0.08) and Pe = 200 (c = 0.03), and for k = 120 (c = 0.04)]

CV-RBF: Infiltration well + Pumping
- Domain: Lx = 75 m, Ly = 31 m, Lz = 50 m; well diameter d = 3 m
- Pumping location: 25 m from the infiltration well (height y = 15 m)
- Infiltration/pumping rate: I = P = qA = 20 m³/h
- Soil properties: Kx = Kz = 1.0 m/h, Ky = 0.3 m/h, porosity 0.2
- 3D model: mesh of 60,000 cells; boundary conditions prescribed at y = 0, with ∂ψ/∂n = 0 everywhere else
Results: piezometric head contours and streamlines at t = 30 h (planes at y = 29 m and z = 25 m; length scale 100 m). Maximum displacements: Δh_well = 0.2 m, Δh_pump = 0.85 m.

LHI: Infiltration model - Setup
Solving Richards' equation (mixed form, with storage term in brackets):
  [ S0 θ(ψ)/θs + θ'(ψ) ] ∂ψ/∂t = ∂/∂x_i [ K(ψ) (Ks)_ij ( ∂ψ/∂x_j + ∂z/∂x_j ) ]
- Infiltrate over a 10 m × 10 m region at the ground surface; infiltration pressure = 2 m
- Zero-flux boundary at the base ('solid wall'); fixed pressure distribution at the sides
- Initial pressure: ψ = (8.0 − z) m in the saturated zone; ψ = −1.0 m in the unsaturated zone

LHI: Infiltration model - Soil properties
- Saturated conductivity: Ks = (1.0, 1.0, 0.2) m/h
- Storativity: S0 = 10⁻⁴
- Van Genuchten-type soil representation:
    θ(ψ) = θr + (θs − θr) [1 + (ψ/hg)²]^(−1/2)
    K(ψ) = Ks [1 + (ψ/hg)²]^(−3/2)
  with θr = 0.15, θs = 0.45
[Figure: water content θ(ψ) and relative hydraulic conductivity K(ψ)/Ks against water pressure]

LHI: Infiltration model - Results
- 11,585 points arranged in 17 layers
- 'Short' runs: solution to 48 hours; 'long' run: solution to 160 hours

Richards' equation - First example
Using the steady-state example given in Tracy (2006)* for the solution of Richards' equation, with
  K(ψ) = Ks e^(αψ),  θ(ψ) = θr + (θs − θr) e^(αψ),  and zero storage term
on the domain 0 ≤ x ≤ 15.24 m, 0 ≤ y ≤ 15.24 m, 0 ≤ z ≤ 15.24 m, with boundary conditions
  ψ(x) = (1/α) ln[ e^(α h_r) + (1 − e^(α h_r)) sin(πx/15.24) sin(πy/15.24) ] on the top face
  ψ(x) = h_r on all other faces
* F. T. Tracy, "Clean two- and three-dimensional analytical solutions of Richards' equation for testing numerical solvers"

Richards' equation - First example (results)
Computed with (11 × 11 × 11) and (21 × 21 × 21) uniformly spaced points.
[Figure: calculated vs analytical hydraulic head against z, for α = 0.164 and α = 0.328, on both point densities]
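The soil-property curves above can be sketched as follows (Python/NumPy). The exponents and the scaling pressure hg = 0.5 m are illustrative assumptions reconstructed from the slides, not values taken from the project's code:

```python
import numpy as np

# theta_r, theta_s from the slides; hg is an assumed scaling pressure
THETA_R, THETA_S, H_G = 0.15, 0.45, 0.5

def water_content(psi):
    # theta(psi) = theta_r + (theta_s - theta_r) * [1 + (psi/hg)^2]^(-1/2)
    # for psi < 0; theta = theta_s when saturated (psi >= 0)
    psi = np.asarray(psi, dtype=float)
    theta = THETA_R + (THETA_S - THETA_R) / np.sqrt(1.0 + (psi / H_G) ** 2)
    return np.where(psi >= 0.0, THETA_S, theta)

def relative_conductivity(psi):
    # K(psi)/Ks = [1 + (psi/hg)^2]^(-3/2) for psi < 0; 1 when saturated
    psi = np.asarray(psi, dtype=float)
    kr = (1.0 + (psi / H_G) ** 2) ** -1.5
    return np.where(psi >= 0.0, 1.0, kr)
```

Both curves decay smoothly as the soil dries (ψ → −∞), which is the nonlinearity that makes the unsaturated-zone problem demanding.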
Richards' equation - First example (error analysis)

  α       Points      L2 error norm   Max error   FV max error   Improvement factor
  0.164   11x11x11    1.72e-2         3.16e-2
          21x21x21    3.52e-3         7.09e-3     6.62e-2        9.34
  0.328   11x11x11    1.61e-2         3.20e-2
          21x21x21    4.24e-3         8.57e-3     1.11           129.5
  0.492   11x11x11    1.77e-2         2.43e-2
          21x21x21    5.16e-3         7.26e-3     5.13           706.6

(The improvement factor is the finite-volume max error divided by the LHI max error on the 21³ grid.)
- Good improvement over the finite-volume results from the Tracy paper, particularly with rapidly decaying K and θ functions
- Reasonable convergence rate with increasing point density

Future work
Future work will focus on large-scale problems, including:
- Regional-scale models of real-world experiments in Portugal and Greece
- Country-scale models of aquifer pumping and seawater intrusion across Israel and Gaza
The large problem size will require a parallel implementation for efficient solution - hence our interest in HPC and GPUs. Practical implementation will require the parallelisation of large, sparsely populated, iterative matrix solvers. To our knowledge, we are the only group working on large-scale hydrology problems using meshless numerical techniques.

GPU Computing
GPU: Graphics Processing Unit. Originally designed to accelerate floating-point-heavy calculations in computer games, e.g.
pixel-shader effects (lighting, reflection/refraction and other effects), geometry setup (character meshes, etc.) and solid-body physics (not yet widely adopted).
- Massively parallel architecture - currently up to 128 floating-point processing units
- Recent hardware (Nov 2006) and software (Feb 2007) advances have allowed programmable processing units (rather than units specialised for pixel or vertex processing)
- This has led to "General Purpose GPUs" - 'GPGPUs'

GPU Computing (continued)
- GPUs are extremely efficient at handling add-multiply instructions in small 'packets' - usually the main computational cost in numerical simulations
- Floating-point capacity outstrips CPUs, in both theoretical capacity and efficiency (if properly implemented)
- Modern GPUs effectively work like a shared-memory cluster: extremely fast (~1000 MHz vs ~400 MHz), dedicated onboard memory, with sizes currently ranging up to 1.5 GB (in addition to system memory)

CUDA - BLAS and FFT libraries
The CUDA toolkit is a C-language development environment for CUDA-enabled GPUs. Two libraries are implemented on top of CUDA: Basic Linear Algebra Subprograms (BLAS) and Fast Fourier Transform (FFT) - pre-parallelised routines.
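As an illustration of the blocked add-multiply workloads these libraries parallelise, a tiled matrix multiplication can be sketched in serial Python/NumPy (illustrative only; in CUDA each tile-product would map to a thread block, and the real work is done by the CUBLAS kernels):

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    # C = A @ B computed tile-by-tile: each (i, j) tile of C accumulates
    # products of matching tiles of A and B -- the decomposition a GPU
    # kernel uses to give each group of processors an independent,
    # cache-sized sub-problem.
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                C[i0:i0 + tile, j0:j0 + tile] += (
                    A[i0:i0 + tile, p0:p0 + tile] @ B[p0:p0 + tile, j0:j0 + tile]
                )
    return C
```

The tiles are independent in (i0, j0), which is what makes the O(N³) algorithm map so well onto many parallel processing units.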
Available examples/demos:
•Parallel bitonic sort
•Matrix multiplication
•Matrix transpose
•Performance profiling using timers
•Parallel prefix sum (scan) of large arrays
•Image convolution
•1D DWT using Haar wavelet
•Graphics interoperation examples
•CUDA BLAS and FFT examples
•CPU-GPU C and C++ code integration
•Binomial option pricing
•Black-Scholes option pricing
•Monte-Carlo option pricing
•Parallel Mersenne Twister
•Parallel histogram
•Image denoising
•Sobel edge-detection filter

TESLA - GPUs for HPC
- Deskside (2 GPUs, 3 GB) and rackmount (4 GPUs, 6 GB) options
- With ~500 Gflops per GPU, that is ~2 Teraflops per rack
- The deskside unit is listed at around $4200

GPU Computing - Some results
GPU specs:

  GPU model   Processors   Clock speed   Memory size   Memory bus   Cost
  8600GTS     32           675 MHz       512 MB        256-bit      ~£75
  8800GT      112          600 MHz       512 MB        256-bit      ~£150
  8800GTX     128          575 MHz       768 MB        320-bit      ~£250

We use CUDA - nVidia's C-based API for GPUs.
Example: multiplication of densely populated matrices, an O(N³) algorithm. The matrices are broken down into vector portions and sent to the GPU stream processors.
[Figure: improvement factor of the 8800GTX over various CPUs (Pentium 4 3 GHz; E2160 1.86 GHz dual, £55; E6850 3.0 GHz dual, £190; QX9650 3 GHz quad, £690) against matrix size, up to 6000 rows/columns]
Note: the performance of dual-core and quad-core CPUs is approximated from an idealised parallelisation (i.e. 100% efficiency).

More GPU propaganda:
Good anecdotal evidence for improvement in real-world simulations is available from those who have switched to GPU computing:
- Dr. John Stone, Beckman Institute for Advanced Science and Technology, NAMD virus simulations: 110 CPU-hours on an SGI Itanium supercomputer reduced to 47 minutes with a single GPU - quoted as a 240-fold speedup
- Dr. Graham Pullan, University of Cambridge, CFD of turbine blades (LES and RANS models): 40x absolute speedup switching from a CPU cluster to 'a few' GPUs, using 10 million cells on the GPU, up from 500,000 on the CPU cluster

Closing words
- GPU performance is advancing at a much faster rate than CPU performance, and this is expected to continue for some time yet
- With CUDA and its BLAS library, exploiting the parallelism of GPUs is in some cases easier than traditional MPI approaches
- Later this year: 2-3 times the performance of current hardware (over 1 TFLOP per card), and native 64-bit capability

More info: www.nvidia.com/tesla  www.nvidia.com/cuda  www.GPGPU.org