### Parallel Programming

```FOUNDATION TO PARALLEL
PROGRAMMING
2
CONTENT
• 并行程序设计简介
• 并行程序设计模型
• 并行程序设计范型
3
Parallel Programming is a Complex Task
• 并行软件开发人员面临的问题:
– 不确定性
– 通讯
– 同步
– 划分与分发
– 负载平衡
– 容错
– 竞争
– 死锁
– ...
4
Levels of Parallelism
PVM/MPI
Compilers
CPU
func1 ( )
{
....
....
}
a ( 0 ) =..
b ( 0 ) =..
+
func2 ( )
{
....
....
}
a ( 1 )=..
b ( 1 )=..
x
func3 ( )
{
....
....
}
a ( 2 )=..
b ( 2 )=..
Code-Granularity
Code Item
Large grain
Program
Medium grain
(control level)
Fine grain
(data level)
Loop (Compiler)
Very fine grain
(multiple issue)
With hardware
5
Responsible for Parallelization
Grain Size Code Item
Parallelised by
Very Fine
Instruction

Fine
Loop/Instruction block

Medium
(Standard one page) Function

Large
Program/Separate heavy-weight
process

6
Parallelization Procedure
Assignment
Decomposition
Sequential Computation
Process Elements
Mapping
Processors
Orchestration
7
Sample Sequential Program
FDM (Finite Difference Method)
…
loop{
for (i=0; i<N; i++){
for (j=0; j<N; j++){
a[i][j] = 0.2 * (a[i][j-1] +
+ a[i+1][j] + a[i][j]);
}
}
}
…
a[i][j+1] + a[i-1][j]
8
Parallelize the Sequential Program
• Decomposition
…
loop{
for (i=0; i<N; i++){
for (j=0; j<N; j++){
a[i][j] = 0.2 * (a[i][j-1] + a[i][j+1]
+ a[i-1][j] + a[i+1][j] + a[i][j]);
}
}
}
…
9
Parallelize the Sequential Program
• Assignment
PE
equally among
process elements
PE
PE
PE
10
Parallelize the Sequential Program
• Orchestration
PE
need to
communicate
and to
synchronize
PE
PE
PE
11
Parallelize the Sequential Program
• Mapping
PE
PE
PE
PE
Multiprocessor
12
Parallel Programming Models
• Sequential Programming Model
• Shared Memory Model (Shared Address Space
Model)
•
•
•
•
DSM
Cilk
• Message Passing Model
• PVM
• MPI
• Functional Programming
• MapReduce
13
Parallel Programming Models
• Partitioned Global Address Space Programming
(PGAS) Languages
•
UPC, Coarray Fortran, Titanium
• Languages and Paradigm for Hardware
Accelerators
•
CUDA, OpenCL
• Hybrid: MPI + OpenMP + CUDA/OpenCL
trends
Scalar Application
Vector
MPP System, Message Passing: MPI
Distributed
memory
Multi core nodes: OpenMP,…
Shared
Memory
Accelerator (GPGPU,
FPGA): Cuda,
OpenCL,..
Hybrid codes
15
Sequential Programming Model
• Functional
• Naming: Can name any variable in virtual
• Hardware (and perhaps compilers) does
• Ordering: Sequential program order
16
Sequential Programming Model
• Performance
• Rely on dependences on single location
(mostly): dependence order
• Compiler: reordering and register allocation
• Hardware: out of order, pipeline bypassing,
write buffers
• Transparent replication in caches
17
Programming Model
System
(Process)
(Process)
write(X)
X
Shared variable
18
Model
• 变量命名
• 任何进程在共享空间里可以命名任何变量
• Operations
• Loads and stores, plus those needed for ordering
• Simplest Ordering Model
• 在进程/线程内: sequential program order
• 线程之间: 存在交叉 (类似于分时里面的交叉)
19
Synchronization
• Mutual exclusion (locks)
• No ordering guarantees
• Event synchronization
• Ordering of events to preserve dependences
• e.g. producer —> consumer of data
20
MP Programming Model
Node A
Node B
process
process
send (Y)
Y
Y’
message
21
Message-Passing Programming
Model
Match
Send X, Q, t
Local process
Local process
ProcessP
Process Q
• Send指定待传输的数据缓存以及接受进程
• Recv指定发送进程以及存放接受数据的存储空间
• 用户进程只能在进程地址空间里命名局部变量和实体
• 存在许多开销：拷贝、缓存管理、保护
22
Message Passing Programming Model
• 命名
– 进程可以直接命名局部变量
– 不存在共享地址空间
• Operations
– Send从私有空间传输数据到另外一个进程
– 必须能够命名进程
23
Message Passing Programming Model
• Ordering
• 进程里面由程序确定顺序
• 可以构建全局地址空间
• 例如：进程id + 进程地址空间内部地址
• 但对其不存在直接操作
Functional Programming
• 函数操作不会更改数据结构，而是创建新的数据结

• 原来数据始终未改
• 数据流动未明确在程序设计中确定
• 操作的顺序并不重要
Functional Programming
fun foo(l: int list) =
sum(l) + mul(l) + length(l)
Order of sum() and mul(), etc does not matter – they do
not modify l
GPU
• Graphical Processing Unit
• 一个GPU由大量的核组成，比如上百个核.
• 但通常CPU包含 2, 4, 8或12个核
• Cores? – 芯片里至少共享内存或L1 cache的处理

• General Purpose computation using GPU in
applications other than 3D graphics
• GPU accelerates critical path of application
CPU v/s GPU
GPU and CPU
• Typically GPU and CPU coexist in a heterogeneous
setting
• “Less” computationally intensive part runs on CPU
(coarse-grained parallelism), and more intensive parts run
on GPU (fine-grained parallelism)
• NVIDIA’s GPU architecture is called CUDA (Compute
Unified Device Architecture) architecture, accompanied by
CUDA programming model, and CUDA C language
What is CUDA?
CUDA: Compute Unified Device
Architecture.
A parallel computing architecture
developed by NVIDIA.
The computing engine in GPU.
instruction set and memory of the
parallel computation elements in GPUs.
Processing Flow
CUDA的处理流:
从主存拷贝数据到GPU内

CPU启动GPU上的计算进

GPU在每个核上并行执行
从GPU内存拷贝结果到主

CUDA Programming Model
Definitions:
Device = GPU
Host = CPU
Kernel =
function that
runs on the
device
CUDA Programming Model
A kernel is executed by a grid of thread
blocks
that can cooperate with each other by:
Sharing data through shared memory
Synchronizing their execution
cooperate
Parallel portions of an application are
executed on the device as kernels
One kernel is executed at a time
Differences between CUDA and CPU threads
Instant switching
CUDA uses 1000s of threads to achieve efficiency
Multi-core CPUs can use only a few
A CUDA kernel is executed by an array of threads
All threads run the same code
Each thread has an ID that it uses to compute memory
Minimal Kernels
Manage memory
CPU v/s GPU
• Most parallel programs are written using either:
• Message passing with a SPMD model (MPI)
• Usually for scientific applications with C++/Fortran
• Scales easily
• Usually for non-scientific applications
• Easier to program, but less scalable performance
• Partitioned Global Address Space (PGAS) Languages take the
best of both
• SPMD parallelism like MPI
• Local/global distinction, i.e., layout matters
39/86
How does PGAS compare to other models?
Message passing
Shared Memory
PGAS
MPI
OpenMP
UPC, CAF, X10
• 计算在多个places执行.
• Place包含可以被运端进程

• 数据在生命周期里存在于

• 一个place的数据可以指向另

• 数据结构 (e.g. arrays) 可以分

A place expresses locality.
40
PGAS Overview
• “Partitioned Global
View” (or PGAS)
Space: 每一线程可

• Partitioned: 将全局

• 实现
• GA Library from PNNL
• Unified Parallel C (UPC),
FORTRAN 2009
• X10, Chapel
• 概念
• 内存和结构
• Partition and mapping
• Local and non-local
accesses
• Collective operations and
“Owner computes”
41
Memories and Distributions
• Software Memory
• Distinct logical storage area in a computer
program (e.g., heap or stack)
• For parallel software, we use multiple
memories
• Structure
• Collection of data created by program
execution (arrays, trees, graphs, etc.)
• Partition
• Division of structure into parts
• Mapping
• Assignment of structure parts to memories
42
Software Memory Examples
• Executable Image at
right
• Memories
• Static memory
• data segment
• Heap memory
• Holds allocated structures
• Explicitly managed by
programmer (malloc, free)
• Stack memory
• Holds function call records
• Implicitly managed by
runtime during execution
43
Affinity and Nonlocal Access
• Affinity是线程与内存的关

• 如果线程与内存存在关系，

• 这些的内存称为局部内存
• 非局部访问
• Part B in Memory 1

• 非局部访问通常通过进程

45
Programming Methods
Count
Memory
Count
Nonlocal Access
1
1
N/A
Either 1 or p
1
N/A
p
p
No. Message required.
1 (host) +
p (device)
2 (Host +
device)
No. DMA required.
UPC, FORTRAN
p
p
Supported.
X10
n
p
Supported.
Sequential
OpenMP
MPI
CUDA
Hybrid (MPI+OpenMP+CUDA+…
• Take the positive off all models
• Exploit memory hierarchy
• Many HPC applications are adopting this model
• Mainly due to developer inertia
• Hard to rewrite million of source lines
Hybrid parallel programming
Python: Ensemble simulations
MPI: Domain partition
OpenMP: External loop partition
48
Design Issues Apply at All Layers
• Programming model’s position provides
constraints/goals for system
• In fact, each interface between layers
supports or takes a position on
– Naming model
– Set of operations on names
– Ordering model
– Replication
– Communication performance
49
Naming and Operations
Naming and operations in programming
model can be directly supported by lower
levels, or translated by compiler, libraries
or OS
Example: Shared virtual address space in
programming model
Hardware interface supports shared physical
Direct support by hardware through v-to-p mappings, no
software layers
50
Naming and Operations (Cont’d)
Hardware supports independent physical
system/user interface: can provide SAS through
OS
 v-to-p mappings only for data that are local
 remote data accesses incur page faults; brought in via page
fault handlers
Or through compilers or runtime, so above sys/user
interface
51
Naming and Operations (Cont’d)
Example: Implementing Message Passing
Direct support at hardware interface
Support at sys/user interface or above in
software (almost always)
Hardware interface provides basic data transport
Send/receive built in software for flexibility
(protection, buffering)
Or lower interfaces provide SAS, and
send/receive built on top with buffers and
52
Naming and Operations (Cont’d)
• Need to examine the issues and tradeoffs at every
layer
• Frequencies and types of operations, costs
• Message passing
• No assumptions on orders across processes except those
• SAS
• How processes see the order of other processes’ references
defines semantics of SAS
• Ordering very important and subtle
53
Ordering model
• Uniprocessors play tricks with orders to gain
parallelism or locality
• These are more important in multiprocessors
• Need to understand which old tricks are valid,
and learn new ones
• How programs behave, what they rely on, and
hardware implications
54
• Single-Program Multiple-Data (SPMD)
• Pipelining
• Divide and Conquer
• Speculation.
Master Worker/Slave Model
• Master将问题分解

• 映射/负载平衡
• 静态
• 动态
Static
55
Single-Program Multiple-Data
• 每一进程执行同样的

• 领域分解，数据并行
56
57
Pipelining
• 适合
• 细粒度的并行
• 多阶段执行的应用

• 问题分解成多个子问题，每

• 3种操作: split, compute, 和
join.