### Modified CUDA Gaussian Elimination

```By Xinggao Xia and Jong Chul Lee
Applications of Systems of Linear Equations

Biology

Physics

Chemistry
n : the number of elements in matrix
Technique
Multiplications/Divisions
Gauss-Jordan
n3/2
n3/2
Gaussian Elimination
n3/3
n3/3
Cramer’s Rule
n4/3
n4/3
Table 1: Computational Complexity of Various Solving Techniques
Computational complexity increases tremendously as
the dimension of matrix increases.
Gaussian Elimination solver has obvious advantage
in terms of complexity as the matrix size increases.
Ax  B   A B 
Iteration No.1
1

m 21   1 1
m 31   1 1
pivot
1
2
3
1 1 

4  1
9 1 
Iteration No.2
1

pivot
0
m 32   2  0
1
1
2
1 1 

3  2
8 0 
Iteration No.3
1

0
0

1
1
0
1 1 

3  2
2 4 
1

0
0

1
1
0
1 1 

3  2
1 2 
Normalization
Iteration No.1
1
0
:
0
0
0
Iteration No.2
1
0
:
0
0
0
1
0
:
0
0
……
……
Iteration No.N
1
0
:
0
0
0
1
0
:
0
0
1
0 1
: 0 1
0 : 0 1
Inter-iteration parallelism
For Iteration i
A[j][]=A[j][] –m[j][i]*matrix pivot row
Multiplier array m must be determined before each iteration
m0i 1
0
m1i
:
m2i
0
m3i
0
0
Perfectly fit CUDA architecture
Modified Gaussian Elimination is considered for CUDA linear
equations solver
More parallelism
No back substitution
Partial pivoting guarantees accuracy of solution
Initial state
1

m 21   1 1
m 31   1 1
pivot
1
2
3
1 1 

4  1
9 1 
Iteration No.1
1

pivot
0
m 32   2  0
1
1
2
Iteration No.2
1 1 

3  2
8 0 
1

0
0

1
1
0
1 1 

3  2
2 4 
Modified Gaussian Elimination
Initial state
1

m 21   1 1
m 31   1 1
pivot
1
2
3
1 1 

4  1
9 1 
Iteration No.1
1

pivot
0
m 32   2  0
1
1
2
1 1 

3  2
8 0 
Iteration No.2
Iteration No.3
1

0
0

1

0
0

0
1
0
2 3 

3  2
2 4 
0
1
0
0 7 

0  8
1 2 
For iteration ith
Row j=Row j-mj*Row i
Column i
Row i
Gaussian
Elimination
0
:
0
0
1
0
0
:
0
elimination
in modified
Gaussian
Elimination
Back
Substitution
Gaussian Elimination
Iteration No.1
1
0
:
0
0
0
Iteration No.2
1
0
:
0
0
0
1
0
:
0
0
……
1
0
:
……
0
0
0
Iteration No.N-1
1
0
:
0
0
1
0 1
: 0 1
0 : 0 11
Gaussian
Linear
Solver
Modified Gaussian
Elimination
Iteration No.1
1
0
:
0
0
0
Iteration No.2
1
0
:
0
0
0
0
1
0
:
0
0
……
1
0
:
……
0
0
0
Iteration No.N
0
1
0
:
0
0
0
0
1
0
:
0
0
:
0
11
:
0
0
:
:
0
1
0
0
:
:
:
0
1
Modified
Gaussian
Linear
Solver
For (i=0; i<N; i++)
{
Partial pivoting
{
Transfer the ith column back to host;
Search the maximum of this column and return the index; (Host)
Switches rows if necessary; (Device)
}
Determine the multiplier column; (Device)
Modified Gaussian elimination; (Device)
}
Normalize the solution; (Device)
Transfer solution back to host;
 Matrix handling like modified Gaussian Elimination kernel, each thread
handles an operation of A[j][i]=A[j][i]-mj*A[i][j] for iteration ith, use two
dimensional grid and block, total of N*N threads in the kernel
 Row or column handling like partial pivoting and others, each thread for
one elements in the row or column, use one dimentsional grid and block,
total of N threads in the kernel
CudaMemcpy:Device to Host
d_temp
h_temp
c
Kernel1
Host: search
maximum is c
0
:
0
0
a
b
c
:
x
Minimizing Device Host
transportation: Switching
rows by kernel
Kernel2
d_temp
Kernel4
0
:
0
0
a
b
c
:
x
Kernel3
For ith iteration
N
Iteration i data partitioning
B(0,0)
B(1,1)
…… B(N-1,1)
T(0,0) T(0,1) ………………T(0,M-1)
:
:
:
B(i,j)
B(0,N-1)
Column i
Row i
0
:
0
0
1
0
0
:
0
T(0,0)
T(0,1)
:
:
:
:
:
:
:
T(0,M-1)
BLOCK_SIZE
N
B(0,1)
BLOCK_SIZE
Multiplier Column m
N
B(0,0)
B(1,1)
……
B(N-1,1)
N
B(0,1)
:
:
:
B(i,j)
Shared
Memory
B(0,N-1)
Row i
Shared
Memory
For ith iteration: A[j][i]=A[j][i]-mj*A[i][j]
Platform Configuration:
GPU: GeForce 8400 GS
1 SM, 8 cores, Clock rate 1.40GHz
Clock rate 2.39GHz
GPU implementation (Global or
shared) is much slower than CPU
implementation(1SM)
Try to mimic Tesla (16SM) by
scaling GPU time by 16
512
1024
2048
4096
47
403
5214
46098
Serial Modified Gaussian Linear Solver
71
564
8412
69949
Global Memory (1SM)
1718
13488 108916 862580
Shared Memory (1SM)
662
4806
38923
312787
Global Memory (scaled by 16)
107
843
6807
53911
Shared Memory (scaled by 16)
41
300
2433
19549
Time (ms)
80000
70000
60000
50000
40000
30000
20000
10000
0
Time (ms)
2048
4096
Time (ms)
900
800
700
600
500
400
300
200
100
0
Matrix size
80000
70000
60000
50000
40000
30000
20000
10000
0
Matrix size
Modified GE
Global(16SM)
Shared(16SM)
512
1024
Matrix size
Modified GE
Global(16SM)
Shared(16SM)
512
1024
2048
4096
 CPU prefers traditional GE solver
than modified GE solver
 GPU shared implementation is
always 2-3 times faster than global
implementation
 GPU(16SM) shared implementation
is around 2 times speedup compared to
For 1024 case (1SM), global memory implementation time is
13488ms, shared implementation is 4806ms
Method
#Calls
GPU(usec)
%GPU time
Global
GE_kernel
1024
1.3e+07
99.11
Shared
GE_kernel
1024
4.8e+06
97.6
gld uncoalesced
gld coalesced
%uncoalesced rate
Global
1048576
131072
89
Shared
61440
73728
45
Conclusion:
Linear equations solver based on modified Gaussian Elimination
is implemented on CUDA
Shared memory is about 3 times faster than global memory
implementation