### Sparse Matrix Methods • Day 1: Overview • Day 2: Direct methods • Day 3: Iterative methods • • • • • • The conjugate gradient algorithm Parallel conjugate gradient and.

```Sparse Matrix Methods
• Day 1: Overview
• Day 2: Direct methods
• Day 3: Iterative methods
•
•
•
•
•
•
Parallel conjugate gradient and graph partitioning
Preconditioning methods and graph coloring
Domain decomposition and multigrid
Krylov subspace methods for other problems
Complexity of iterative and direct methods
SuperLU-dist:
Iterative refinement to improve solution
Iterate:
• r = b – A*x
• backerr = maxi ( ri / (|A|*|x| + |b|)i )
•
•
•
•
•
if backerr < ε or backerr > lasterr/2 then stop iterating
solve L*U*dx = r
x = x + dx
lasterr = backerr
repeat
Usually 0 – 3 steps are enough
(Matlab demo)
• iterative refinement
Convergence analysis of iterative refinement
Let C = I – A(LU)-1
[ so A = (I – C)·(LU) ]
x1 = (LU)-1b
r1 = b – Ax1 = (I – A(LU)-1)b = Cb
dx1 = (LU)-1 r1 = (LU)-1Cb
x2 = x1+dx1 = (LU)-1(I + C)b
r2 = b – Ax2 = (I – (I – C)·(I + C))b = C2b
...
In general, rk = b – Axk = Ckb
Thus rk  0 if |largest eigenvalue of C| < 1.
The Landscape of Sparse Ax=b Solvers
Direct
A = LU
Iterative
y’ = Ay
More General
Nonsymmetric
Symmetric
positive
definite
Pivoting
LU
GMRES,
QMR, …
Cholesky
Conjugate
More Robust
More Robust
Less Storage
D
x0 = 0, r0 = b, p0 = r0
for k = 1, 2, 3, . . .
αk = (rTk-1rk-1) / (pTk-1Apk-1)
xk = xk-1 + αk pk-1
rk = rk-1 – αk Apk-1
βk = (rTk rk) / (rTk-1rk-1)
pk = rk + βk pk-1
step length
approx solution
residual
improvement
search direction
• One matrix-vector multiplication per iteration
• Two vector dot products per iteration
• Four n-vectors of working storage
• Eigenvalues: Au = λu
{ λ1, λ2 , . . ., λn}
• Cayley-Hamilton theorem:
(A – λ1I)·(A – λ2I) · · · (A – λnI) = 0
ciAi =
Σ
0in
Therefore
so
0 for some ci
Σ
1in
A-1 =
(–ci/c0) Ai–1
• Krylov subspace:
Therefore if Ax = b, then x = A-1 b and
x  span (b, Ab, A2b, . . ., An-1b) = Kn (A, b)
• Krylov subspace: Ki (A, b) = span (b, Ab, A2b, . . ., Ai-1b)
for i = 1, 2, 3, . . .
find xi  Ki (A, b)
such that ri = (Axi – b)  Ki (A, b)
• Notice ri  Ki+1 (A, b), so ri  rj for all j < i
• Similarly, the “directions” are A-orthogonal:
(xi – xi-1 )T·A· (xj – xj-1 ) = 0
• The magic: Short recurrences. . .
A is symmetric => can get next residual and direction
from the previous one, without saving them all.
• In exact arithmetic, CG converges in n steps
(completely unrealistic!!)
• Accuracy after k steps of CG is related to:
• consider polynomials of degree k that are equal to 1 at 0.
• how small can such a polynomial be at all the eigenvalues of A?
• Thus, eigenvalues close together are good.
• Condition number: κ(A) = ||A||2 ||A-1||2 = λmax(A) / λmin(A)
• Residual is reduced by a constant factor by
O(κ1/2(A)) iterations of CG.
(Matlab demo)
• CG on grid5(15) and bcsstk08
• n steps of CG on bcsstk08
• Lay out matrix and vectors by rows
• Hard part is matrix-vector product
y = A*x
P0 P1
P2
P3
x
P0
• Algorithm
Each processor j:
Compute y(j) = A(j,:)*x
• May send more of x than needed
• Partition / reorder matrix to reduce
communication
y
P1
P2
P3
(Matlab demo)
• 2-way partition of eppstein mesh
• 8-way dice of eppstein mesh
Preconditioners
•
Suppose you had a matrix B such that:
1. condition number κ(B-1A) is small
2. By = z is easy to solve
•
Then you could solve (B-1A)x = B-1b instead of Ax = b
•
•
•
•
B = A is great for (1), not for (2)
B = I is great for (2), not for (1)
Domain-specific approximations sometimes work
B = diagonal of A sometimes works
•
Or, bring back the direct methods technology. . .
(Matlab demo)
• bcsstk08 with diagonal precond
Incomplete Cholesky factorization (IC, ILU)
x
A
RT
R
• Compute factors of A by Gaussian elimination,
but ignore fill
• Preconditioner B = RTR  A, not formed explicitly
• Compute B-1z by triangular solves (in time nnz(A))
• Total storage is O(nnz(A)), static data structure
• Either symmetric (IC) or nonsymmetric (ILU)
(Matlab demo)
• bcsstk08 with ic precond
Incomplete Cholesky and ILU: Variants
• Allow one or more “levels of fill”
• unpredictable storage requirements
1
4
1
4
2
3
2
3
• Allow fill whose magnitude exceeds a “drop tolerance”
• may get better approximate factors than levels of fill
• unpredictable storage requirements
• choice of tolerance is ad hoc
• Partial pivoting (for nonsymmetric A)
• “Modified ILU” (MIC): Add dropped fill to diagonal of U or R
• A and RTR have same row sums
• good in some PDE contexts
Incomplete Cholesky and ILU: Issues
• Choice of parameters
• good: smooth transition from iterative to direct methods
• tradeoff: time per iteration (more fill => more time)
vs # of iterations (more fill => fewer iters)
• Effectiveness
• condition number usually improves (only) by constant factor
(except MIC for some problems from PDEs)
• still, often good when tuned for a particular class of problems
• Parallelism
• Triangular solves are not very parallel
• Reordering for parallel triangular solve by graph coloring
(Matlab demo)
• 2-coloring of grid5(15)
Sparse approximate inverses
B-1
A
• Compute B-1  A explicitly
• Minimize || B-1A – I ||F
(in parallel, by columns)
• Variants: factored form of B-1, more fill, . .
• Good: very parallel
Support graph preconditioners: example
G(A)
[Vaidya]
G(B)
• A is symmetric positive definite with negative off-diagonal nzs
• B is a maximum-weight spanning tree for A
(with diagonal modified to preserve row sums)
• factor B in O(n) space and O(n) time
• applying the preconditioner costs O(n) time per iteration
Support graph preconditioners: example
G(A)
G(B)
• support each edge of A by a path in B
• dilation(A edge) = length of supporting path in B
• congestion(B edge) = # of supported A edges
• p = max congestion, q = max dilation
• condition number κ(B-1A) bounded by p·q (at most O(n2))
Support graph preconditioners: example
G(A)
G(B)
• can improve congestion and dilation by adding a few
strategically chosen edges to B
• cost of factor+solve is O(n1.75), or O(n1.2) if A is planar
• in recent experiments [Chen & Toledo], often better than
drop-tolerance MIC for 2D problems, but not for 3D.
Domain decomposition (introduction)
A=
B
0
E
0
C
F
ET
FT G
• Partition the problem (e.g. the mesh) into subdomains
• Use solvers for the subdomains B-1 and C-1 to
precondition an iterative solver on the interface
• Interface system is the Schur complement:
S = G – ET B-1E – FT C-1F
• Parallelizes naturally by subdomains
(Matlab demo)
• grid and matrix structure for overlapping 2-way
partition of eppstein
Multigrid
(introduction)
• For a PDE on a fine mesh, precondition using a solution on
a coarser mesh
• Use idea recursively on hierarchy of meshes
• Solves the model problem (Poisson’s eqn) in linear time!
• Often useful when hierarchy of meshes can be built
• Hard to parallelize coarse meshes well
• This is just the intuition – lots of theory and technology
Other Krylov subspace methods
• Nonsymmetric linear systems:
• GMRES:
for i = 1, 2, 3, . . .
find xi  Ki (A, b) such that ri = (Axi – b)  Ki (A, b)
But, no short recurrence => save old vectors => lots more space
• BiCGStab, QMR, etc.:
Two spaces Ki (A, b) and Ki (AT, b) w/ mutually orthogonal bases
Short recurrences => O(n) space, but less robust
• Convergence and preconditioning more delicate than CG
• Active area of current research
• Eigenvalues: Lanczos (symmetric), Arnoldi (nonsymmetric)
The Landscape of Sparse Ax=b Solvers
Direct
A = LU
Iterative
y’ = Ay
More General
Nonsymmetric
Symmetric
positive
definite
Pivoting
LU
GMRES,
QMR, …
Cholesky
Conjugate
More Robust
More Robust
Less Storage
Complexity of direct methods
Time and
space to solve
any problem
on any wellshaped finite
element mesh
n1/2
n1/3
2D
3D
Space (fill):
O(n log n)
O(n 4/3 )
Time (flops):
O(n 3/2 )
O(n 2 )
Complexity of linear solvers
Time to solve
model problem
(Poisson’s
equation) on
regular mesh
n1/2
n1/3
2D
3D
Sparse Cholesky:
O(n1.5 )
O(n2 )
CG, exact arithmetic:
O(n2 )
O(n2 )
CG, no precond:
O(n1.5 )
O(n1.33 )
CG, modified IC:
O(n1.25 )
O(n1.17 )
CG, support trees:
O(n1.20 )
O(n1.75 )
O(n)
O(n)
Multigrid:
```