8. Cache Coherence, Consistency, Synchronization

Report
Altix 4700
ccNUMA Architecture
• Distributed Memory - Shared address space
Altix HLRB II – Phase 2
• 19 partitions with 9728 cores
• Each with 256 Itanium dual-core processors, i.e., 512 cores
– Clock rate 1.6 GHz
– 4 Flops per cycle per core
– 12.8 GFlop/s per processor (6.4 GFlop/s per core)
• 13 high-bandwidth partitions
– Blades with 1 processor (2 cores) and 4 GB memory
– Frontside bus 533 MHz (8.5 GB/sec)
• 6 high-density partitions
– Blades with 2 processors (4 cores) and 4 GB memory.
– Same memory bandwidth.
• Peak performance: 62.3 TFlop/s (6.4 GFlop/s per core)
• Memory: 39 TB
Memory Hierarchy
• L1D
• 16 KB, 1 cycle latency, 25.6 GB/s bandwidth
• cache line size 64 bytes
• L2D
• 256 KB, 6 cycles, 51 GB/s
• cache line size 128 bytes
• L3
• 9 MB, 14 cycles, 51 GB/s
• cache line size 128 bytes
Interconnect
• NUMAlink 4
• 2 links per blade
• Each link 2 × 3.2 GB/s bandwidth
• MPI latency 1-5µs
Disks
• Direct attached disks (temporary large files)
• 600 TB
• 40 GB/s bandwidth
• Network attached disks (Home Directories)
• 60 TB
• 800 MB/s bandwidth
Environment
• Footprint: 24 m x 12 m
• Weight: 103 metric tons
• Electrical power: ~1 MW
NUMAlink Building Block
[Figure: NUMAlink building block — level-1 NUMAlink 4 routers each connect a group of compute blades (8 cores per group in high-bandwidth partitions, 16 cores in high-density partitions); I/O blades with PCI/FC attach the block to a SAN switch and 10 GE.]
Blades and Rack
Interconnection in a Partition
Interconnection of Partitions
• Gray squares
• 1 partition with 512 cores
• L: Login B:Batch
• Lines
• 2 NUMALink4 planes with 16 cables
• each cable: 2 × 3.2 GB/s
Interactive Partition
[Figure: partition layout — 4 OS cores; login blocks of 16, 12, and 4 cores; a 12-core batch block; remaining 16-core blocks.]
• Login cores
• 32 for compile & test
• Interactive batch jobs
• 476 cores
• managed by PBS
– daytime interactive usage
– small-scale and nighttime batch processing
– single partition only
• High-density blades
• 4 cores per memory interface
18 Batch Partitions
[Figure: partition layout — 4 OS cores and batch blocks of 8 (16) cores each.]
• Batch jobs
• 510 (508) cores
• managed by PBS
• large-scale parallel jobs
• single or multi-partition jobs
• 5 partitions with high-density blades
• 13 partitions with high-bandwidth blades
Bandwidth
[Chart: MPI bandwidth in MB/s (0–3000), intra-node vs. inter-node.]
Coherence Implementation
• SHUB2 supports up to 8192 SHUBs (32,768 cores)
• Coherence domain up to 1024 SHUBs (4096 cores)
• SGI term: "sharing mode"
• Directory with one bit per SHUB
• Multiple shared copies are supported.
• Accesses from other coherence domains
• SGI term: "exclusive sharing mode"
• Always translated into exclusive accesses
• Only a single copy is supported
• Directory stores the address of the SHUB (13 bits)
SHMEM Latency Model for Altix
• SHMEM get latency is the sum of:
• 80 nsec for the function call
• 260 nsec memory latency
• 340 nsec for the first hop
• 60 nsec per hop
• 20 nsec per meter of NUMAlink cable
• Example
• 64-processor system: max hops is 4, max total cable length is 4 m.
• Total SHMEM get latency:
1000 nsec = 80 + 260 + 340 + 60×4 + 20×4
Parallel Programming Models
• Intra-host (512 cores, one Linux image): OpenMP, Pthreads, MPI
• Intra-coherency-domain (4096 cores) and across the entire machine: MPI, SHMEM™, global shared segments
[Figure: two Linux images of the Altix system in separate coherency domains, communicating through global segments.]
Barrier Synchronization
• Frequent in OpenMP, SHMEM, MPI single sided ops
(MPI_Win_fence)
• Tree-based implementation using multiple fetch-op
variables to minimize contention on SHUB.
• Using uncached load to reduce NUMAlink traffic.
[Figure: two CPUs attached to a HUB holding the fetch-op variable; the HUB connects to a router.]
Programming Models
• OpenMP within one Linux image
• MPI
• SHMEM
• Shared segments (System V and Global Shared Memory)
SHMEM
• Can be used in MPI programs where all processes
execute the same code.
• Enables access within and across partitions.
• Static data and symmetric heap data (shmalloc or shpalloc)
• info: man intro_shmem
Example
#include <mpp/shmem.h>
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    long source[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
    static long target[10];   /* static data is symmetric */
    int myrank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 0) {
        /* put 10 elements into target on PE 1 */
        shmem_long_put(target, source, 10, 1);
    }
    shmem_barrier_all();      /* sync sender and receiver */
    if (myrank == 1)
        printf("target[0] on PE %d is %ld\n", myrank, target[0]);

    MPI_Finalize();
    return 0;
}
Global Shared Memory Programming
• Allocation of a shared memory segment via collective
GSM_alloc.
• Similar to memory mapped files or System V shared
segments. But these are limited to a single OS instance.
• GSM segment can be distributed across partitions.
– GSM_ROUNDROBIN: Pages are distributed round-robin
across processes
– GSM_SINGLERANK: Places all pages near to a single process
– GSM_CUSTOM_ROUNDROBIN: Each process specifies how
many pages should be placed in its memory.
• Data structures can be placed in this memory segment
and accessed from all processes with normal load and
store instructions.
Example
#include <mpi.h>
#include <mpi_gsm.h>
#include <stdio.h>
#include <stdlib.h>

#define ARRAY_LEN 1024   /* example size */

int main(int argc, char **argv)
{
    int rank, rc, i;
    int placement = GSM_ROUNDROBIN;
    int flags = 0;
    size_t size = ARRAY_LEN * sizeof(int);
    int *shared_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    rc = GSM_Alloc(size, placement, flags, MPI_COMM_WORLD, &shared_buf);

    // Have one rank initialize the shared memory region
    if (rank == 0) {
        for (i = 0; i < ARRAY_LEN; i++)
            shared_buf[i] = i;
    }
    MPI_Barrier(MPI_COMM_WORLD);

    // Have every rank verify it can read from the shared memory
    for (i = 0; i < ARRAY_LEN; i++) {
        if (shared_buf[i] != i) {
            printf("ERROR!! element %d = %d\n", i, shared_buf[i]);
            printf("Rank %d - FAILED shared memory test.\n", rank);
            exit(1);
        }
    }

    MPI_Finalize();
    return 0;
}
Summary
• Altix 4700 is a ccNUMA system
• >60 TFlop/s
• MPI messages sent with two-copy or single-copy
protocol
• Hierarchical coherence implementation
• Intranode
• Coherence domain
• Across coherence domains
• Programming models
• OpenMP
• MPI
• SHMEM
• GSM
The Compute Cube of LRZ
[Figure: the LRZ compute cube — cooling towers on the roof; column-free supercomputer hall; access bridge; server/network room; archive/backup; HVAC; electrical supply.]
