### ppt - University of Illinois at Chicago

```Lecture # 3: A Primer on How to
Design Parallel Algorithms
Shantanu Dutt
University of Illinois at Chicago
Parallel Algorithm Design—A Process/Data View


Break up the computation in either a data parallel or functional parallel way depending on
the problem at hand
Data parallel breakup:
Easier to generalize than functional breakup, and more structured
 Input data parallel (e.g., the max and general associative computation, finite-element
computations—the heat equation problem) or
 Output data parallel (e.g., matrix multiplication)
 Each data partition defines a computational partition or process that works on that partition
 1st attempt: breakup somehow, determine how to communicate to solve the problem
 2nd attempt: breakup in a way that the reqd. communication complexity (topology indep.) is min.
 3rd attempt: breakup so that commun. complexity on a particular topology is minimized
I/P data: Linear or unstructured (e.g., associative functions): Easy to partition into P (N/P)-size pieces

1
P
2
I/P data: Dimensional or Structured (e.g., finite-elt. grid): Can be partitioned by same dimensionality or in fewer
dimensions. Choice based on resulting communication pattern and the target topology
Inter-process
communication
Finite elt. grid:
(a) 2D (full dimensional) partition
(b) 1D (reduced dimensional) partition
An example parallel algorithm for a finite element
computation

Simple, structured and sparse communication needed.

Example: Heat Equation -

The initial temperature is zero on the boundaries and high in
the middle

The boundary temperature is held at zero.

The calculation of an element is dependent upon its neighbor
elements
data1 data2
Fundamentals of Parallel Processing,
Ashish Agrawal, IIT Kanpur
3
…...
data N
1.
2.
3.
4.
5.
6.
7.
8.
9.
1.
2.
3.
4.
5.
6.
7.
8.
find out if I am MASTER or WORKER
if I am MASTER
initialize array
send each WORKER starting info and
subarray
do until all WORKERS converge
gather from all WORKERS convergence
data
signal
end do
14.
15.
16.
17.
18.
19.
20.
update border of my portion of solution array
determine if my solution has converged
if so {send MASTER convergence signal
recv. from MASTER convergence signal}
end do }
send MASTER results
endif
Serial Code –
repeat
do y=2, N-1
do x=2, M-1
u2(x,y)=u1(x,y)+cx*[u1(x+1,y) + u1(x-1,y)] +
cy*[u1(x,y+1)} + u1(x,y-1)] /* cx, cy are const.
enddo
enddo
Master (can be one of the workers)
u1 = u2;
until u1 ~ u2.
else if I am WORKER
receive from MASTER starting info and
subarray
do until solution converged {
update time
send (non-blocking?) neighbors my border
info
info
Workers
update interior of my portion of solution
array (see comput. given in the serial code)
wait for non-block. commun. (if any) to
complete
Code from: Fundamentals of Parallel
Processing, A. Agrawal, IIT Kanpur
Problem
Grid
4
Parallel Algorithm Design—A Process/Data View: Dimensional Partitioning
Inter-process
communication
Finite elt. grid:
(a) 2D (full dimensional) partition




(b) 1D (reduced dimensional) partition
A full dimensional partition will have the smallest circumference (amount of data that
needs to be communicated to neighboring processes) to area (total data elements per
process = N/P) ratio, and thus has the least topology-independent communication cost
A 2D partition (Fig. a) of a 2D grid of nxn points, gives us [n/sqrt(P)]x [n/sqrt(P)]
partitions, where n=sqrt(N). Thus total data that needs to be communicated (per
iteration of a finite element computation) is 2* [n/sqrt(P)] (for corner partitions) to 4*
[n/sqrt(P)] (for interior partitions).
A 1D partition (Fig. b) of a 2D grid of nxn points, gives us nx(n/P) partitions, requiring a
data amount of n to 2n to be communicated which is about sqrt(P)/2 times more data
than for a 2D partition.
However, if one considers the topology of the target parallel system, a 1D partition is
suitable for both a linear and 2D topology, while a 2D partition is suitable for a 2D
topology but is expensive for a linear topology in terms of average message hops (and
thus also in terms of more contention/conflict)
An example parallel algorithm for a finite element
computation: Analysis
data1 data2









…...
data N
Analysis:
Let L be the max dist. of a grid point from the heat source. The # of iterations for convergence will be
a function f(L) of L.
Assume linear partition of i/p data and near-neighbor communication only (possible for most
topologies for linear data partitioning).
Per iteration computation time per process = Q(N/P). Comm. time per iter. per process = Q(sqrt(N)).
All computations in an iteration occur in parallel (there is no data dependency among the
computations in each process for the same iteration). Commun. also occurs completely in parallel in
an iteration. Also, near-neighbor  no contention. So parallel time per iteration is Q(N/P + sqrt(N))
[is sometimes also written as Q(max(N/P, sqrt(N))].
Number of parallel iterations?
Parallel Time Tp(P) =
Seq. time Tseq = Tp(1) =
Speedup Sp(P) = Efficiency Ep(P) =
Fundamentals of Parallel Processing,
Ashish Agrawal, IIT Kanpur
6
Parallel Algorithm Design—A Process/Data View: Mapping to a Multiprocessor
Finite elt. grid:
1
5
2
6
3
7
4
Processes
mapped to data
(& thus comput.)
1
2
3
4
5
6
7
8
8
Row major
(a) 2D (full dimensional) partition
order is used
to map 2D 4-hop messages
data partition
to a 1D
processor
1
2
3
array
(b) 1D (reduced dimensional) partition
4
5
6
7
8
(c) Communication patterns when mapping a 2D data partition (orange) and a 1D data partition (blue) to a linear array.
 Assumptions:



Total grid points: N in a sqrt(N)xsqrt(N) arrangement
Commun. time = ts + k*th + m*tw, where ts is the startup time at source and target procs, k = # of
hops, th the decision time per switch, m = msg. size in “flits” (# of bits that can be transmitted in
parallel), tw is propagation delay per flit. ts does not affect congestion, so for congestion/delay
purposes we approx. commun. time here by k*th + m*tw
Analysis: 1D data partition & linear topology: All commun. is near neighbor and
hence parallel per iteration and w/o contention: Parallel commun. time per
iteration in a finite-element computation = th + sqrt(N)*tw ~ Q(sqrt(N)) for large N
Parallel Algorithm Design—A Process/Data View: Mapping to a Multiprocessor
Finite elt. grid:
1
5
Row major
order is used
to map 2D
data partition
to a 1D
processor
array
2
6
3
7
4
Processes
mapped to data
(& thus comput.)
2
3
2
3
4
5
6
7
8
8
(a) 2D (full dimensional) partition
4-hop messages
1
1
(b) 1D (reduced dimensional) partition
4
5
6
7
8
(c) Communication patterns when mapping a 2D data partition (orange) and a 1D data partition (blue) to a linear array
 Analysis: 2D data partition & linear topology:
 Near neighbor commun. part is parallel and w/o contentions. So parallel commun. time per iteration of
FE computation= th + (sqrt(N)/2)*tw = Q(sqrt(N)/2), for large N.
 2nd dim. communication is for a distance of 4 hops (generally about sqrt(P)-hops):



Msg. from 4  8 will take time 4* th + (sqrt(N)/4)*tw ~ Q(sqrt(N)/4) for large N. Assuming cut-
through/wormhole routing in which an entire path is first reserved and then the data passes through via multiple
interconnects as if it is a single interconnect, msgs from 3  7, 2  6, 1  5 are blocked until 4  8
communication is completed. After this is finished, say, 3  7 comm. occurs, while the others are blocked.
So commun. is sequentialized for 4-hop messages along an intersecting route. So commun. along dim. 2 of the data
is not fully parallel (they are started in parallel but get sequentialized due to contention) and takes a total parallel
time of 4(4* th + (sqrt(N)/4)*tw ) ~ Q(sqrt(N) for large N, similar in order notation to 1D partiton, but will
actually take more time in terms of the th terms (16* th vs. th ).
Total parallel commun. time is Q(1.5 sqrt(N)), worse than for 1D partitioning.What would this be for a general P?
Parallel Algorithm Design—A Process/Data View: Mapping to a Multiprocessor
Finite elt. grid:
1
5
2
6
3
7
4
Processes
mapped to data
(& thus comput.)
1
2
3
4
5
6
7
8
8
(a) 2D (full dimensional) partition
(b) 1D (reduced dimensional) partition
1
2
3
4
1
2
3
4
5
6
7
8
8
7
6
5
(c) Communication patterns when mapping a 2D data partition (orange) and a 1D data partition (blue), & corresponding
processes to a 2D array/mesh. The latter is equivalent to embedding a linear array onto a 2D array to minimize average
dilation (the factor by which the length of a near-neighbor commun. or link increases in the host/target topology.
 Analysis: 1D data partition & 2D topology: All commun. is near neighborr and
hence parallel per iteration and w/o contention: Parallel commun. time per
iteration in a finite-element computation = th + sqrt(N)*tw ~ Q(sqrt(N)) for large N
 Analysis: 2D data partition & 2D topology: All commun. is near neighborr and
hence w/o contention. However, each processor has two communication steps
that are partly sequentialized:

Worst-case: Parallel commun. time per iteration in a FE computation = th + (sqrt(N)/2)*tw + th +
(sqrt(N)/4)*tw ~ Q(0.75*sqrt(N)) for large N

Best (close to Average) case: The bulk of both commun. which is data transfer takes place in
parallel but the startup time (ts) is sequentialized. So parallel commun. time per iteration
= max(th + (sqrt(N)/2)*tw , th + (sqrt(N)/4)*tw ) ~ Q(0.5*sqrt(N)) for large N. For general P?
Parallel Algorithm Design—A Process/Data View: O/P Data Partitioning
•
Output data partitioning: E.g., Matrix multiplication
A
X
B
=
Each process computes
a block of data of C (one or
more Cij’s)
C
Issues of input data distribution among the processes:
 If to compute elt. Cij, row Ai and col Bj are available at the process then no
communication needed to move input data to the right processes
 If each process computing a block of C has similar blocks of A, B data, then data
needs to be moved around the processes so that the one computing Cij has Ai
and Bj available. This will require partial all-to-all broadcasts; see below.
block of A data it has to processes in its
“row” and similarly the block of B data to
processes in its column.
B
• After the above stage, every process
A
computing each elt. Cij, will have row Ai
and col. Bj
• The time complexity of the above phase
will depend on the interconn. topology of
the multiprocessor

Parallel Algorithm Design—Strategy Summary




Data partitioning: defines processes; based partly on amount & pattern of
communication needed and target topology
Determine communication pattern required
Granularity of breakup: N/P for i/p data parallel or M/P for o/p data parallel can
either be decided based on given P (straightforward) or based on overhead (i.e., P
to be determined) to minimize either parallel time Tp or efficiency Ep or work
efficiency WEp (see below)
Map processes to processors in target topology to:




reduce average message hops (try to maximize near-neighbor communication)
maximize parallelization of message passing across processes
minimize conflicts among message paths (conflict depends partly on type of routing “flow
control”—store-and-forward or wormhole/cut-through; we assumed the latter in previous
analysis)
Overall Goals:




minimize parallel time Tp (or speedup Sp)
maximize efficiency Ep = Sp/P
maximize work efficiency WEp = Sp/WSp, where WSp is total work scale-up (correlated to
total power consumed) across the P processors compared to the total sequential work Ws in
the computation = Wp/Ws, where Wp is the total work done across P processors (= total
time across P procs)
More on parallelization metrics later
MIMD/SPMD Coding Style
Loop
Do some computation;
send data and/or