### Foundation to Parallel Programming

- Introduction to Parallel Programming
- Approaches for Parallel Programs
- Parallel Programming Model
### Parallel Programming is a Complex Task

- Parallel software developers handle issues such as:
  - non-determinism
  - communication
  - synchronization
  - data partitioning and distribution
  - fault tolerance
  - heterogeneity
  - shared or distributed memory
  - race conditions
### Users' Expectations from Parallel Programming Environments

- Parallel computing can only be widely successful if parallel software meets the expectations of its users, such as:
  - provide architecture/processor-type transparency
  - provide network/communication transparency
  - be easy to use and reliable
  - provide support for fault tolerance
  - accommodate heterogeneity
  - assure portability
  - provide support for traditional high-level languages
  - be capable of delivering increased performance
  - provide parallelism transparency
### 2 Methods of Constructing Parallel Programs

The sequential computation:

```c
for (i = 0; i < N; i++) A[i] = b[i] * b[i+1];
for (i = 0; i < N; i++) c[i] = A[i] + A[i+1];
```

(a) Using library routines (e.g., MPI, PVM). The parallel program calls `my_process_id()`, `number_of_processes()`, and `barrier()`:

```c
id = my_process_id();
p = number_of_processes();
for (i = id; i < N; i = i + p) A[i] = b[i] * b[i+1];
barrier();
for (i = id; i < N; i = i + p) c[i] = A[i] + A[i+1];
```

(b) Extending a sequential language (e.g., Fortran 90 array syntax):

```fortran
A(0:N-1) = b(0:N-1) * b(1:N)
c = A(0:N-1) + A(1:N)
```

(c) Adding compiler directives (annotations):

```c
#pragma parallel
#pragma shared(A, b, c)
#pragma local(i)
{
#pragma pfor iterate(i = 0; N; 1)
    for (i = 0; i < N; i++) A[i] = b[i] * b[i+1];
#pragma synchronize
#pragma pfor iterate(i = 0; N; 1)
    for (i = 0; i < N; i++) c[i] = A[i] + A[i+1];
}
```
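The directive-based style of (c) survives today as OpenMP. Below is a minimal sketch of the same two loops; the function name `twin_loops` and the array sizing are my assumptions (`b` and `A` need N+1 elements, with `A[N]` preset by the caller, because the second loop reads `A[i+1]`):

```c
#include <stddef.h>

/* Same computation as the two loops above, with OpenMP directives.
   The implicit barrier at the end of the first "omp for" plays the
   role of the explicit synchronize/barrier directive.
   Assumes: b and A have N+1 elements, A[N] initialized by caller. */
void twin_loops(size_t N, const double *b, double *A, double *c)
{
    size_t i;
    #pragma omp parallel shared(A, b, c) private(i)
    {
        #pragma omp for
        for (i = 0; i < N; i++) A[i] = b[i] * b[i+1];
        /* implicit barrier here */
        #pragma omp for
        for (i = 0; i < N; i++) c[i] = A[i] + A[i+1];
    }
}
```

Without an OpenMP compiler flag the pragmas are ignored and the code runs serially, producing the same result.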
### 3 Parallelism Issues

3.1 Homogeneity of processes

- SIMD: all processes execute the same instruction at the same time.
- MIMD: different processes may execute different instructions at the same time.
- SPMD: the processes are homogeneous; multiple processes execute the same code on different data.
- MPMD: the processes are heterogeneous; multiple processes execute different code.

Basic constructs:

- `parbegin S1 S2 S3 ... Sn parend`, where S1 ... Sn may be different code (MPMD).
- `parbegin S1 S2 S3 ... Sn parend`, where S1 ... Sn are the same code (SPMD).
- `parfor (i=1; i<=n; i++) S(i)` (a parallel loop).
SPMD programs can be constructed in several ways:

- A parallel loop over a routine: `parfor (i=0; i<=N; i++) foo(i)`
- A parallel loop with an explicit body:

```c
parfor (i = 0; i <= N; i++) {
    C[i] = A[i] + B[i];
}
```

- Library routines, with each process taking a strided share of the indices:

```c
pid = my_process_id();
numproc = number_of_processes();
parfor (i = pid; i <= N; i = i + numproc) foo(i)
```

- Array syntax: `C = A + B`
- A `forall` loop: `forall (i=1,N) C[i]=A[i]+B[i]`
- Launching N copies of the same executable: `run A -numnodes N`
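The library-routine variant can be made concrete in plain C. `spmd_add` below is a hypothetical worker in which the process id and process count are passed in explicitly (a real runtime would supply them via `my_process_id()` and `number_of_processes()`):

```c
/* SPMD worker: each of numproc processes executes this same code,
   covering indices pid, pid+numproc, pid+2*numproc, ...
   (a cyclic distribution of the n iterations). */
void spmd_add(int pid, int numproc, int n,
              const int *A, const int *B, int *C)
{
    for (int i = pid; i < n; i += numproc)
        C[i] = A[i] + B[i];
}
```

Calling it once per process id (serially, for illustration) covers every index exactly once.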
MPMD programs can be constructed in several ways:

- `parbegin S1 S2 S3 parend`, where S1, S2, and S3 are different code.
- Launching different executables on different nodes:

```
run S1 on node1
run S2 on node2
run S3 on node3
```

- Dispatching inside an SPMD loop:

```c
parfor (i = 0; i < 3; i++) {
    if (i == 0) S1
    if (i == 1) S2
    if (i == 2) S3
}
```

Here S1, S2, and S3 are sequential-language programs.
3.2 Static and dynamic parallelism

- Static: all parallelism is specified before execution, e.g. `parbegin P, Q, R parend`.
- Dynamic: processes are created at run time, e.g. with `fork`:

```
while (C > 0) begin
    fork(foo(C));
    C := boo(C);
end
```
A fork/join example:

```
Process A:
begin
    Z := 1;
    fork(B);
    T := foo(3);
end

Process B:
begin
    fork(C);
    X := foo(Z);
    join(C);
    output(X+Y);
end

Process C:
begin
    Y := foo(Z);
end
```

- fork: spawn a child process.
- join: force the parent process to wait for the child.
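With POSIX threads the A/B/C example can actually be run. Here `foo` is a hypothetical placeholder (foo(v) = v*v), and, unlike the slide, the driver also joins B so that the output can be read safely:

```c
#include <pthread.h>

/* Hypothetical stand-in for the slide's foo(). */
static int foo(int v) { return v * v; }

static int Z, X, Y, T, result;

static void *proc_C(void *arg)          /* Process C */
{
    (void)arg;
    Y = foo(Z);
    return 0;
}

static void *proc_B(void *arg)          /* Process B */
{
    pthread_t c;
    (void)arg;
    pthread_create(&c, 0, proc_C, 0);   /* fork(C) */
    X = foo(Z);
    pthread_join(c, 0);                 /* join(C): wait for Y */
    result = X + Y;                     /* output(X+Y) */
    return 0;
}

/* Process A: sets Z, forks B, continues with its own work. */
int run_fork_join(void)
{
    pthread_t b;
    Z = 1;
    pthread_create(&b, 0, proc_B, 0);   /* fork(B) */
    T = foo(3);                         /* A's own computation */
    pthread_join(b, 0);  /* not in the slide: A must outlive B here */
    return result;       /* foo(1) + foo(1) */
}
```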
3.3 Process grouping

3.4 Partitioning and allocation

Parallelism exists at several levels:

- instruction-level parallelism
- block-level parallelism
- process-level parallelism
- task-level parallelism
### Levels of Parallelism

[Figure: levels of parallelism, from whole functions (`func1()`, `func2()`, `func3()`) down to individual statements (`a(0)=..`, `b(0)=..`); exploited by PVM/MPI at task level, by compilers at loop level, and by CPU hardware at instruction level]

- Large grain: program — PVM/MPI
- Medium grain (control level): function
- Fine grain (data level): loop — compiler
- Very fine grain (multiple issue): instruction — hardware
### Responsible for Parallelization

| Grain Size | Code Item | Parallelised by |
|---|---|---|
| Very fine | Instruction | Processor |
| Fine | Loop/instruction block | Compiler |
| Medium | (Standard one-page) function | Programmer |
| Large | Program/separate heavy-weight process | Programmer |
### 4 Interaction and Communication Issues

4.1 Types of interaction

Communication: operations that transfer data between two or more processes, via

- shared variables
- parameters passed from parent to child process
- message passing
Synchronization:

- Atomic synchronization:

```
parfor (i := 1; i < n; i++) {
    atomic { x := x + 1; y := y - 1 }
}
```

- Control synchronization (barriers, critical regions):

```
parfor (i := 1; i < n; i++) {
    Pi
    barrier
    Qi
}

parfor (i := 1; i < n; i++) {
    critical { x := x + 1; y := y + 1 }
}
```

- Data synchronization (locks, conditional critical regions, ...):

```
parfor (i := 1; i < n; i++) {
    lock(S);
    x := x + 1;
    y := y - 1;
    unlock(S)
}
```
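The same flavours can be sketched with OpenMP in C. This is a sketch under the assumption of an OpenMP compiler; without OpenMP the pragmas are simply ignored and the code runs serially, with the same final values:

```c
/* Each construct from the pseudocode above, in OpenMP form. */
void sync_demo(int n, long *x, long *y)
{
    #pragma omp parallel for
    for (int i = 1; i < n; i++) {
        /* atomic{ x:=x+1; y:=y-1 }: an OpenMP atomic covers only a
           single update, so a critical section is the closer match */
        #pragma omp critical
        { *x += 1; *y -= 1; }
    }

    #pragma omp parallel for
    for (int i = 1; i < n; i++) {
        #pragma omp atomic
        *x += 1;                 /* single-location atomic update */
    }

    /* a barrier separates phases inside one parallel region: */
    #pragma omp parallel
    {
        /* ... P_i ... */
        #pragma omp barrier
        /* ... Q_i ... */
    }
}
```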
Aggregation: combining the partial results computed by the individual processes through a sequence of supersteps, e.g.

- reduction
- scan (prefix)

Example: an inner product via a reduction:

```
parfor (i := 1; i < n; i++) {
    X[i] := A[i] * B[i];
    inner_product := aggregate_sum(X[i]);
}
```
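A runnable counterpart of `aggregate_sum`, using an OpenMP reduction clause (the function name is mine; without an OpenMP flag the loop simply runs serially):

```c
/* Inner product: each iteration computes a partial product, and the
   reduction clause aggregates the partial sums into one result. */
double inner_product(int n, const double *A, const double *B)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += A[i] * B[i];
    return sum;
}
```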
4.2 Modes of interaction

[Figure: interaction among processes P1, P2, ..., Pn]
4.3 Patterns of interaction

- One-to-one: point-to-point
- One-to-many: broadcast, scatter
- Many-to-one: gather, reduce
- Many-to-many: total exchange, scan, permutation/shift
[Figure: interaction patterns among three processes holding P1 = 1, P2 = 3, P3 = 5]

- (a) Point-to-point (one-to-one): P1 sends a value to P3.
- (b) Broadcast (one-to-many): P1 sends a value to all processes.
- (c) Scatter (one-to-many): P1 sends one value to each node.
- (d) Gather (many-to-one): P1 receives one value from each node.
- (e) Total exchange (many-to-many): every node sends a distinct value to every node.
- (f) Shift (permutation, many-to-many): every node sends a value to its neighbour.
- (g) Reduce (many-to-one): P1 obtains the sum 1 + 3 + 5 = 9.
- (h) Scan (many-to-many): P1 obtains 1, P2 obtains 1 + 3 = 4, P3 obtains 1 + 3 + 5 = 9.
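Patterns (g) and (h) are easy to check with a short sequential sketch (the function is illustrative, not a parallel implementation):

```c
/* Sequential check of reduce (g) and inclusive scan (h):
   reduce leaves the total in *total; scan[i] holds the sum of the
   first i+1 values. */
void reduce_and_scan(int n, const int *v, int *total, int *scan)
{
    int acc = 0;
    for (int i = 0; i < n; i++) {
        acc += v[i];
        scan[i] = acc;   /* inclusive prefix sum */
    }
    *total = acc;        /* reduction result */
}
```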
### Approaches for Parallel Programs Development

- Implicit parallelism
  - Supported by parallel languages and parallelizing compilers that take care of identifying parallelism, scheduling calculations, and placing data.
- Explicit parallelism
  - Based on the assumption that the user is often the best judge of how parallelism can be exploited for a particular application.
  - The programmer is responsible for most of the parallelization effort: task decomposition, mapping tasks to processors, and the communication structure.
### Strategies for Development

[Figure: strategies for parallel program development]
### Parallelization Procedure

Sequential computation → Decomposition → Assignment → Process elements → Orchestration → Mapping → Processors
### Sample Sequential Program

FDM (Finite Difference Method):

```c
...
loop {
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            a[i][j] = 0.2 * (a[i][j-1] + a[i][j+1]
                           + a[i-1][j] + a[i+1][j] + a[i][j]);
        }
    }
}
...
```
### Parallelize the Sequential Program

- Decomposition

```c
...
loop {
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            a[i][j] = 0.2 * (a[i][j-1] + a[i][j+1]
                           + a[i-1][j] + a[i+1][j] + a[i][j]);
        }
    }
}
...
```
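One way to realize the decomposition is by blocks of rows. The sketch below writes into a separate output array (a Jacobi-style step, which is an assumption: the slide updates `a` in place) and sweeps only interior points so the stencil stays in bounds; each PE would call it with its own row range:

```c
/* Row-block decomposition of the stencil above, on an N x N grid
   stored row-major in a flat array.  Rows [row_lo, row_hi) are this
   PE's share; only interior points are updated. */
void fdm_rows(int N, const double *a, double *out,
              int row_lo, int row_hi)
{
    for (int i = row_lo; i < row_hi; i++)
        for (int j = 1; j < N - 1; j++)
            out[i*N + j] = 0.2 * (a[i*N + (j-1)] + a[i*N + (j+1)]
                                + a[(i-1)*N + j] + a[(i+1)*N + j]
                                + a[i*N + j]);
}
```

A caller would split rows 1 .. N-2 evenly among the process elements.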
- Assignment: the work (e.g., the rows of `a`) is divided equally among the process elements (PEs).
- Orchestration: the PEs need to communicate and to synchronize.
- Mapping: the PEs are mapped onto the processors of a multiprocessor.
### Parallel Programming Models

- Sequential Programming Model
- Shared Memory Model (Shared Address Space Model)
  - DSM
- Message Passing Model
  - PVM
  - MPI
- Hybrid Model
  - Mixing shared and distributed memory models
  - Using OpenMP and MPI together
### Sequential Programming Model

- Functional
  - Naming: can name any variable in the virtual address space; hardware (and perhaps compilers) does the translation to physical addresses.
  - Ordering: sequential program order.
- Performance
  - Relies (mostly) on dependences on single locations: dependence order.
  - Compiler: reordering and register allocation.
  - Hardware: out-of-order execution, pipeline bypassing, write buffers.
  - Transparent replication in caches.
### Shared Memory Programming Model

[Figure: processes communicate through a shared variable X; one process executes write(X) and another reads X]
- Naturally provided on a wide range of platforms
  - History dates at least to precursors of mainframes in the early 60s
  - Wide range of scale: a few to hundreds of processors
- Popularly known as the shared memory machine or model
  - Ambiguous: memory may be physically distributed among processors
- Any processor can directly reference any memory location
  - Communication occurs implicitly as a result of loads and stores
- Convenient:
  - Location transparency
  - Similar programming model to time-sharing on uniprocessors
    - Except processes run on different processors
    - Good throughput on multiprogrammed workloads
- Process: virtual address space plus one or more threads of control
- Portions of the address spaces of processes are shared; in effect, a collection of processes communicating through the shared portion

[Figure: address spaces of processes P0 ... Pn, each with a private portion, all mapping a common shared portion of physical memory]

- Writes to the shared portion are visible to the other processes
- Natural extension of the uniprocessor model: conventional memory operations for communication; special atomic operations for synchronization
- The OS uses shared memory to coordinate processes
### Communication Hardware for SAS

- Also a natural extension of the uniprocessor
- Already have processors, one or more memory modules, and I/O controllers connected by a hardware interconnect of some sort
  - Memory capacity increased by adding modules, I/O by adding controllers and devices

[Figure: processors, memory modules (Mem), and I/O controllers attached through an interconnect]
### History of SAS Architecture

- "Mainframe" approach
  - Motivated by multiprogramming
  - Extends the crossbar used for memory bandwidth and I/O
  - Originally cost was limited by processors; later, by the crossbar
  - Bandwidth scales with p
  - High incremental cost; use multistage networks instead

[Figure: processors P and I/O controllers connected to memory modules M through a crossbar]

- "Minicomputer" approach
  - Almost all microprocessor systems have a bus
  - Motivated by multiprogramming
  - Used heavily for parallel computing
  - Called symmetric multiprocessor (SMP)
  - Latency larger than for a uniprocessor
  - Bus is the bandwidth bottleneck
    - Caching is key: the coherence problem
  - Low incremental cost

[Figure: processors with caches ($) and I/O controllers sharing a single bus to memory]
### Example: Intel Pentium Pro Quad

[Figure: four P-Pro modules (CPU, 256-KB L2 $, interrupt controller, bus interface) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz); a memory controller and MIU drive 1-, 2-, or 4-way interleaved DRAM; PCI bridges connect PCI buses and I/O cards]

- Highly integrated, targeted at high volume
- Low latency and bandwidth
### Example: SUN Enterprise

[Figure: CPU/memory cards (two processors with $ and $2 caches, memory controller) and I/O cards (SBUS slots, 100bT, SCSI, 2 FiberChannel) on the Gigaplane bus (256 data, 41 address, 83 MHz)]

- 16 cards of either type: processors + memory, or I/O
- Memory is on the processor cards themselves, but all memory is accessed over the bus, so the machine is symmetric
- Higher bandwidth, higher latency bus
### Scaling Up

[Figure: "dance hall" organization (processors with caches on one side of the network, memory modules on the other) versus distributed memory (a memory module attached to each processor)]

- Problem is the interconnect: cost (crossbar) or bandwidth (bus)
- Dance hall: bandwidth still scalable, but at lower cost than a crossbar
  - Latencies to memory are uniform, but uniformly large
- Distributed memory, or non-uniform memory access (NUMA)
  - Construct a shared address space out of simple message transactions
### Shared Address Space Model

- Naming
  - Any process can name any variable in the shared space
- Operations
  - Loads and stores, plus those needed for ordering
- Simplest ordering model
  - Within a process/thread: sequential program order
  - Across threads: some interleaving (as in time-sharing)
### Synchronization

- Mutual exclusion (locks)
  - No ordering guarantees
- Event synchronization
  - Ordering of events to preserve dependences
  - e.g. producer → consumer of data
  - 3 main types:
    - point-to-point
    - global
    - group
### MP Programming Model

[Figure: a process on node A executes send(Y); the message carries the value of Y into Y' in a process on node B]
### Message-Passing Programming Model

[Figure: process P executes "send X, Q, t", which must match a receive posted by process Q; the data moves from P's local address space to Q's]

- Send specifies the data buffer to be transmitted and the receiving process
- Recv specifies the sending process and the application storage to receive into
- Memory-to-memory copy, but processes must be named
- A user process names only local data and entities in the process/tag space
- In the simplest form, the send/recv match achieves a pairwise synchronization event
- Many overheads: copying, buffer management, protection
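The send/recv pairing can be imitated on one machine with OS pipes (POSIX; the function name and the "+1" computation are illustrative): the parent plays process P, the child plays process Q, and the blocking read provides the pairwise synchronization described above.

```c
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Two-sided message passing sketched with pipes.  Tags and matching
   are omitted; each pipe carries one int in program order, which is
   what a matching send/recv pair guarantees. */
int pingpong(int x)
{
    int to_q[2], to_p[2], y = 0, reply = -1;
    pipe(to_q);
    pipe(to_p);
    pid_t pid = fork();
    if (pid == 0) {                          /* process Q */
        read(to_q[0], &y, sizeof y);         /* recv Y from P */
        y += 1;                              /* Q's computation */
        write(to_p[1], &y, sizeof y);        /* send Y back to P */
        _exit(0);
    }
    write(to_q[1], &x, sizeof x);            /* send X to Q */
    read(to_p[0], &reply, sizeof reply);     /* recv reply from Q */
    waitpid(pid, 0, 0);
    return reply;
}
```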
### Message Passing Architectures

- Complete computer as building block, including I/O
  - Communication via explicit I/O operations
- Programming model: directly access only the private address space (local memory); communication via explicit messages
- Programming model more removed from basic hardware operations
  - Library or OS intervention
### Evolution of Message-Passing Machines

- Early machines: FIFO on each link
  - Synchronous ops
  - Replaced by DMA, enabling nonblocking ops
    - Buffered by system at destination until recv
- Diminishing role of topology
  - Store-and-forward routing
  - Cost is in the node-network interface
  - Simplifies programming

[Figure: hypercube topology with nodes labeled 000 ... 111]
### Example: IBM SP-2

[Figure: SP-2 node with Power 2 CPU, L2 $, memory bus, memory controller, and 4-way interleaved DRAM; the NIC (i860, DMA, NI, DRAM) sits on the MicroChannel I/O bus; nodes connect to a general interconnection network formed from 8-port switches]

- Made out of essentially complete RS6000 workstations
- Network interface integrated in the I/O bus (bandwidth limited by the I/O bus)
  - Doesn't need to see memory references
### Example: Intel Paragon

[Figure: Paragon node with two i860 processors (each with L1 $) on a 64-bit, 50 MHz memory bus, memory controller, 4-way interleaved DRAM, DMA, driver, and NI; nodes attach to a 2D grid network with a processing node at every switch, over links 8 bits wide at 175 MHz, bidirectional. Sandia's Intel Paragon XP/S-based supercomputer]

- Network interface integrated in the memory bus, for performance
### Message Passing Programming Model

- Naming
  - Processes can name private data directly
- Operations
  - Explicit communication: send and receive
  - Send transfers data from the private address space to another process
  - Receive copies data from a process into the private address space
  - Must be able to name processes
- Ordering
  - Program order within a process
  - Send and receive can provide point-to-point synchronization between processes
- A global address space can be constructed (e.g., process number plus address within that process's space)
  - But there are no direct operations on these names
Design Issues Apply at All Layers
• Programming model’s position provides
constraints/goals for system
• In fact, each interface between layers supports
or takes a position on
– Naming model
– Set of operations on names
– Ordering model
– Replication
– Communication performance
### Naming and Operations

- Naming and operations in the programming model can be directly supported by lower levels, or translated by compiler, libraries, or OS
- Example
  - Shared virtual address space in the programming model
  - Hardware interface supports a shared physical address space
    - Direct support by hardware through v-to-p mappings, no software layers
### Naming and Operations (Cont'd)

- Hardware supports independent physical address spaces
  - At the system/user interface: can provide SAS through the OS
    - v-to-p mappings only for data that are local
    - remote data accesses incur page faults; brought in via page fault handlers
  - Or through compilers or runtime, so above the sys/user interface
### Naming and Operations (Cont'd)

- Example: implementing message passing
- Direct support at the hardware interface
  - But match and buffering benefit from more flexibility
- Support at the sys/user interface or above in software (almost always)
  - Hardware interface provides basic data transport
  - Send/receive built in software for flexibility (protection, buffering)
  - Or lower interfaces provide SAS, and send/receive are built on top of it
### Naming and Operations (Cont'd)

- Need to examine the issues and tradeoffs at every layer
  - Frequencies and types of operations, costs
- Message passing
  - No assumptions on orders across processes except those imposed by send/receive pairs
- SAS
  - How processes see the order of other processes' references defines the semantics of SAS
- Ordering is very important and subtle
58
Naming and Operations (Cont’d)
• Uniprocessors play tricks with orders to gain
parallelism or locality
• These are more important in multiprocessors
• Need to understand which old tricks are valid,
and learn new ones
• How programs behave, what they rely on, and
hardware implications
### Message Passing Model / Shared Memory Model

|              | Message Passing | Shared Memory |
|--------------|-----------------|---------------|
| Architecture | any             | SMP or DSM    |
| Programming  | difficult       | easier        |
| Performance  | good            | better (SMP), worse (DSM) |
| Cost         | less expensive  | very expensive (SunFire 15K: $4,140,830) |
### Parallelization Paradigms

- Master-Worker/Slave
- Single-Program Multiple-Data (SPMD)
- Pipelining
- Divide and Conquer
- Speculation
### Master-Worker/Slave Model

- The master decomposes the problem into tasks, distributes them to the workers, and gathers the partial results to produce the final result.
- Task distribution can be:
  - Static
  - Dynamic: used when the number of tasks is larger than the number of CPUs, when the tasks are only known at runtime, or when the CPUs are heterogeneous.
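A dynamic master-worker scheme can be sketched with POSIX threads: the "bag of tasks" is a shared index protected by a mutex, so a fast worker simply grabs more tasks. All names here are illustrative:

```c
#include <pthread.h>

#define NTASKS 8

static pthread_mutex_t bag_lock = PTHREAD_MUTEX_INITIALIZER;
static int next_task;
static int results[NTASKS];

/* Worker: repeatedly grab the next task from the bag (dynamic
   distribution) and store its partial result. */
static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&bag_lock);
        int t = next_task < NTASKS ? next_task++ : -1;
        pthread_mutex_unlock(&bag_lock);
        if (t < 0) break;                /* bag is empty */
        results[t] = t * t;              /* the "task": square t */
    }
    return 0;
}

/* Master: spawn workers (at most 16), then gather the partial
   results into a final sum. */
int master(int nworkers)
{
    pthread_t tid[16];
    int sum = 0;
    next_task = 0;
    for (int w = 0; w < nworkers; w++) pthread_create(&tid[w], 0, worker, 0);
    for (int w = 0; w < nworkers; w++) pthread_join(tid[w], 0);
    for (int t = 0; t < NTASKS; t++) sum += results[t];
    return sum;
}
```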
### Single-Program Multiple-Data

- The most commonly used model.
- Each process executes the same piece of code, but on a different part of the data: the data are split among the available processors.
- Also known as geometric/domain decomposition or data parallelism.
### Pipelining

- Suitable for fine-grained parallelism.
- Also suitable for applications involving multiple stages of execution that need to operate on a large number of data sets.
### Divide and Conquer

- A problem is divided into two or more subproblems; each subproblem is solved independently, and the results are combined.
- 3 operations: split, compute, and join.
- Divide and conquer works like a master-worker scheme in which the master does both the split and the join operation (a "hierarchical" master-worker technique).
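The three operations are visible in a minimal sketch (sequential; in a parallel version the two recursive calls are what would be forked to different processors):

```c
/* Divide and conquer sum of v[lo..hi):
   split the range in half, compute the halves recursively,
   join by adding the two partial results. */
long dc_sum(const int *v, int lo, int hi)
{
    if (hi - lo <= 1)                  /* compute: base case */
        return hi > lo ? v[lo] : 0;
    int mid = lo + (hi - lo) / 2;      /* split */
    return dc_sum(v, lo, mid)          /* join: combine halves */
         + dc_sum(v, mid, hi);
}
```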
### Speculative Parallelism

- Used when it is quite difficult to achieve parallelism through one of the previous approaches.