### School of Civil Engineering

```Mesh-free numerical methods for large-scale engineering
problems.
Basics of GPU-based computing.
D.Stevens, M.Lees, P. Orsini
Supervisors: Prof. Henry Power, Herve Morvan
January 2008
Outline:

Meshless numerical methods using Radial Basis Functions (RBFs)

Basic RBF interpolation
 Brief overview of the work done in our research group, under the direction
of Prof. Henry Power
 Example results
 Large scale future problems

GPU computing

Introduction to the principle of using graphics processors (GPUs), as
floating point co-processors
 Current state of GPU hardware and software
 Basic implementation strategy for numerical simulations
GABARDINE - EU project

GABARDINE:

Groundwater Artificial recharge Based on Alternative sources of
wateR: aDvanced INtegrated technologies and managEment

Aims to investigate the viability of artificial recharge in groundwater
aquifers, and produce decision support mechanisms

Partners include:

University of Nottingham is developing numerical methods to handle:

Portugal, Germany, Greece, Spain, Belgium,
Israel, Palestine
Phreatic aquifers (with associated moving surfaces)
 Transport processes
 Unsaturated zone problems (with nonlinear Governing equations)
RBF Interpolation Methods

N
 
u  x    j  x   j
j 1
Pm1 x 
N unknowns…
fi  f i 
1.2
 (r )  r ln(r )
m
i  1,...,N

 (r )  r  c
2
1
0.8

m
2 2
0.6
0.4
0.2
0
-1

-0.8
-0.6
-0.4
-0.2
Apply the above at each of the N test locations:
  x     
i
j
j
 fi
0
0.2
0.4
0.6
0.8
1

Kansa’s Method
N
ux    j 
j 1
L[u ( x)]  f ( x)
in 
B[u ( x)]  g ( x)
on 
Collocating:
eg.
2
L  2
x j

x  j

(Diffusion operator)
 Bx [ij ] 
 gi 


 L [  ]   j   f 
 i
 x ij 
Hermitian Method

L[u ( x)]  f ( x)
in 
B[u ( x)]  g ( x)
on 
Operators are also applied to basis functions:
NB
 
u x    j B  x   j
j 1
   L x   
 Bx B [ ij ] Bx L [ ij ] 
 gi 


 L B [ ] L L [ ]   j   f 
x 
ij 
 i
 x  ij
N
j  NB 1
j
j
SYMMETRIC SYSTEM
RBF – Features and Drawbacks

Partial derivatives are obtained cheaply and accurately by
differentiating the (known) basis functions

Leads to a highly flexible formulation allowing boundary operators to
be implemented exactly and directly

Once solution weights are obtained, a continuous solution can be
reconstructed over the entire solution domain

A densely populated linear system must be solved to obtain the
solution weights

Leads to high computational cost – O(N2), and numerical conditioning
issues with large systems, setting a practical limit of ~1000 points
Formulation: LHI method
Initial value problem:
B[u ( x)]  g ( x)
u ( x)  f ( x)
on 
in 
u
 Lu
t

Hermitian Interpolation using solution values, and boundary
operator(s) if present
NB
 
u x    j B  x   j
j 1
    x     P
N
j  NB 1
j
j
m 1
LHI method formulation cont…

Form N systems, based on a local support:
H ( k ) ( k )  d ( k )
 
B  

H   Bx   Bx B  
 Pm 1T B Pm 1 T


Pm 1 

Bx Pm 1 
0 
 ui 
d   g ( xi ) 
 0 
H ~ 10 x 10 matrix
Hence, can reconstruct the solution in the vicinity of local
system k via:
u ( k ) ( x)  R ( k ) ( x) .
where:
R( x)   x  i , B  x  i , pm1 ( x)
CV-RBF (Modified CV Scheme)

Classic CV approach  CV-RBF approach
Apply internal
operator
Apply Dirichlet
operator
Apply Boundary
operator
Polynomial interpolation
to compute the flux
RBF interpolation
to compute the flux
L   x   f  x 
on

B   x   g  x 
on

x 3
Simulation Workflow
Our Code
Dataset Generation
GridGen
Simulation
RBF specific
Pre-Processing
TecPlot
RBF
Triangle
Meshless
Post Processing
CVRBF
Meshless
Convection-Diffusion: Validation

Both methods have been validated against a series of 1D and 3D linear
and nonlinear advection-diffusion reaction problems, eg:
Pe = 200 - LHIM (c = 0.03)
Pe = 100 - LHIM (c = 0.08)
Pe = 50 - LHIM (c = 0.12)
2.0
2.0
2.0
Analytical
Analytical
1.8
Analytical
1.8
Calculated
1.8
Calculated
Calculated
1.6
1.6
1.6
1.4
1.4
1.2
1.2
1.0
1.0
1.4
0.8
0.75
0.80
0.85
0.90
0.95
1.00
0.8
0.75
1.2
1.0
0.80
0.85
0.90
0.95
k = 120 - c = 0.04
Calculated
Analytic
300
250
200
150
100
50
0
0.0
0.2
0.4
0.6
0.8
1.0
1.00
0.8
0.75
0.80
0.85
0.90
0.95
1.00
CV-RBF: Infiltration well + Pumping
Lx  75m,
Well diameter:
Ly  31m,
Lz  50m
d  3m
Pumping location: 25m from the infiltration well (height y=15m)
Infiltration-Pumping rate:
Soil properties:
m3
I  P  qA  20
h
Kx  Kz  1.0 m h
K y  0.3 m h
  0.2
CV-RBF: Infiltration well + Pumping

3D model: Mesh (60000 cells) and BC
y  0:
Boundary conditions
Everywhere else

0
n
  31m
CV-RBF: Infiltration well + Pumping

Piezometric head contour and streamlines at t=30h
plane at y=29m
plane at z=25m
Length scale: 100m
Maximum Displacements:
dhwell  0.2m dhpump  0.85m
LHI: Infiltration model - Setup
Solving Richards’ equation:

 
1
 2  ( ) K ( ) 


 (K s )ij


t  S0 p   '( )  
xi x j
z 

Infiltrate over 10m x 10m region at
ground surface

Infiltration pressure = 2m

Zero-flux boundary at base (‘solid wall’)
Fixed pressure distribution at sides

Initial pressure:
  (8.0  z ) m
  1.0 m
Saturated zone
Unsaturated zone
LHI: Infiltration model – Soil
properties
Ks  (1.0, 1.0, 0.2) mh1

Saturated conductivity:

Storativity:

Using Van-Genuchten soil representation:
S0  104
 h2 
 ( )   r  ( s   r ) 1  2 
 hg 

S0 p  S0
1
2
 ( )
s
 h2 
K ( )  K s 1  2 
 hg 

3
2
Relative Hydraultic Conductivity
Water Content
1
0.5
0.9
0.45
Water Content
0.7
0.35
0.3
0.25
0.6
0.5
0.4
0.3
0.2
0.2
0.1
0.15
0
0.1
-1.2
-1
-0.8
-0.6
-0.4
-0.2
Water pressure (m)
r  0.15,
s  0.45,
Relative conductivity
0.8
0.4
0
-1.2
-1
-0.8
-0.6
Water pressure (m)
-0.4
-0.2
0
LHI: Infiltration model Results

11,585 points arranged in 17 layers

‘Short’ runs: solution to 48 hours

‘Long’ run: solution to 160 hours
Richards’ equation - First
example

Using the steady-state example given in Tracy (2006)* for the
solution of Richards’ equation, with:
K ( )  K s e
 ( )   r  ( s   r )e
S0 p  0

On a domain:

With:
 (x) 
0  x  15.24m
0  y  15.24m
0  z  15.24m

 x   y  
ln ehr  (1  ehr ) sin
 sin
 
 
 15.24   15.24  
1
 (x)  15.24
Top face
All other faces
* F.T.Tracy, Clean two- and three-dimensional analytical solutions of Richards’ equation for testing numerical solvers
Richards’ equation - First
example
With:
(11 x 11 x 11) and (21 x 21 x 21) uniformly spaced points
α = 0.164
N = 11
N = 22
0
0
2
4
6
8
10
12
14
16
0
18
-2
-4
-4
-6
Analytical
Calculated
-8
-10
-12
0
-2
2
4
6
8
10
12
14
-6
16
18
Analytical
Calculated
-8
-10
-12
-14
-14
-16
-16
-18
-18
z (m)
z (m)
α = 0.328
N =11
N = 22
0
0
0
2
4
6
8
10
12
14
16
0
18
-2
-4
4
6
8
10
12
14
16
18
-4
-6
Analytical
Calculated
-8
-10
-8
-10
-12
-12
-14
-14
-16
-16
-18
Analytical
Calculated
-6
2
-2
-18
z (m)
z (m)
Richards’ equation - First
example (error analysis)
L2 error norm
Max error
11*11*11
1.72E-02
3.16E-02
21*21*21
3.52E-03
7.09E-03
11*11*11
1.61E-02
3.20E-02
21*21*21
4.24E-03
8.57E-03
11*11*11
1.77E-02
2.43E-02
21*21*21
5.16E-03
7.26E-03
Alpha = 0.164
Finite Volume
– max error
Improvement
factor:
6.62e-2
9.34
1.11
129.5
5.13
706.6
Alpha = 0.328
Alpha = 0.492

Good improvement over finite volume results from Tracy paper,
particularly with rapidly decaying K and θ functions

Reasonable convergence rate, with increasing point density
Future work

Future work will focus on large-scale problems, including:

Regional scale models of real-world experiments in Portugal and
Greece
 Country-scale models of aquifer pumping and seawater intrusion
across Israel and Gaza

The large problem size will require a parallel implementation for
efficient solution – hence our interest in HPC and GPUs

Practical implementation will require the parallelisation of large,
sparsely-populated, iterative matrix solvers

To our knowledge, we are the only group working on large-scale
hydrology problems using meshless numerical techniques
GPU Computing:

GPU: Graphics Processing Unit

Originally designed to accelerate floating-point heavy calculations in
computer games eg.

Pixel shader effects (Lighting effects, reflection/refraction, other effects)
 Geometry setup (character meshes etc)
 Solid-body physics (…not yet widely adapted)

Massively parallel architecture – currently up to 128 floating point
processing units

Recent hardware (from Nov 2006) and software (Feb 2007)
advances have allowed programmable processing units (rather
than units specialised for pixel or vertex processing)

Has led to "General Purpose GPUs" - ‘GPGPUs’
GPU Computing:

GPUs are extremely efficient at handling add-multiply instructions in
small ‘packets’ (usually the main computational cost in numerical
simulations)

FP capacity outstrips CPUs, in both theoretical capacity and efficiency
(if properly implemented)
GPU Computing:

Modern GPUs effectively work like a shared-memory cluster:

GPUs have an extremely fast (~1000Mhz vs ~400Mhz), dedicated
onboard memory

Onboard memory sizes currently range up to 1.5Gb (in addition to
system memory)
CUDA - BLAS and FFT libraries

The CUDA toolkit is a C language
development environment for CUDAenabled GPUs.

Two libraries implemented on top of
CUDA:

Basic Linear Algebra System
(BLAS)
 Fast Fourier Transform (FFT)

Pre-parallelised routines.
Available Examples/Demos:
•Parallel bitonic sort
•Matrix multiplication
•Matrix transpose
•Performance profiling using timers
•Parallel prefix sum (scan) of large arrays
•Image convolution
•1D DWT using Haar wavelet
•graphics interoperation examples
•CUDA BLAS and FFT examples
•CPU-GPU C and C++ code integration
•Binomial Option Pricing
•Black-Scholes Option Pricing
•Monte-Carlo Option Pricing
•Parallel Mersenne Twister
•Parallel Histogram
•Image Denoising
•Sobel Edge Detection Filter
TESLA - GPUs for HPC

Deskside (2 GPUs 3GB) and Rackmount (4 GPUs 6GB) options

With ~500Gflops per GPU, that is ~2 Teraflops per rack

Deskside is listed at around \$4200
GPU Computing – Some results
GPU specs:
GPU
model
Processors
GPU Clock
speed
Memory
size
Memory
bus
COST
8600GTS
32
675Mhz
512Mb
256bit
~£75
8800GT
112
600Mhz
512Mb
256bit
~£150
8800GTX
128
575Mhz
768Mb
320bit
~£250

Use CUDA – nVidia’s C-based API for GPUs

Example: Multiplication of densely populated matrices


O(N3) algorithm…
Matrices are broken down into vector portions, and sent to the GPU
stream processors
Various CPUs vs GPU
Improvement factor over CPU
Improvement of 8800GTX over various CPUs
200
Pentium 4 3Ghz
180
E2160 (1.86Ghz dual)
- £55
E6850 (3.0Ghz dual) £190
- £690
160
140
120
100
80
60
40
20
0
0
1000
2000
3000
4000
5000
6000
Matrix size ( column or row size )

Note: Performance of dual-core and quad-core CPUs is
approximated from an idealised parallelisation (ie. 100% efficiency)
More GPU propaganda:

Good anecdotal evidence for improvement in real-world simulations is
available from those who have switched to GPU computing:

Dr. John Stone, Beckmann Institute of Advanced Technology, NAMD
virus simulations:

110 CPU hours on SGI Itanium supercomputer => 47minutes with a
single GPU
 Represents a 240-fold speedup

Dr. Graham Pullan, University of Cambridge, CFD with turbine blades
(LES and RANS models)
40x absolute speedup switching from a CPU cluster to ‘a few’ GPUs
 Use 10 million cells on GPU, up from 500,000 on CPU cluster

Closing words

GPU performance is advancing at a much faster rate than CPUs. This is
expected to continue for some time yet

With CUDA and BLAS, exploiting parallelism of the GPUs is in some
cases easier than traditional MPI approaches

Later this year:

2-3 times the performance of current hardware (over 1 TFLOP per card)
 Native 64bit capability
