NCAR’s
Data-Centric Supercomputing Environment
Yellowstone
December 21, 2011
Anke Kamrath, OSD Director/CISL
[email protected]
Overview
• Strategy
– Moving from Process to Data-Centric Computing
• HPC/Data Architecture
– What we have today at ML
– What’s planned for NWSC
• Storage/Data/Networking – Data in Flight
– WAN and High-Performance LAN
– Central Filesystem Plans
– Archival Plans
Evolving the Scientific Workflow
• Common data movement issues
– Time consuming to move data between systems
– Bandwidth to archive system is insufficient
– Lack of sufficient disk space
• Need to evolve data management techniques
– Workflow management systems
– Standardize metadata information
– Reduce/eliminate duplication of datasets (e.g., CMIP5)
• User Education
– Effective methods for understanding workflow
– Effective methods for streamlining workflow
Traditional Workflow
Process Centric Data Model
Evolving Scientific Workflow
Information Centric Data Model
Current Resources @ Mesa Lab
GLADE
BLUEFIRE
LYNX
NWSC: Yellowstone Environment
• Yellowstone – HPC resource, 1.55 PFLOPS peak
• GLADE – Central disk resource, 11 PB (2012), 16.4 PB (2014)
• Geyser & Caldera – DAV clusters
• High Bandwidth Low Latency HPC and I/O Networks – FDR InfiniBand and 10Gb Ethernet
• NCAR HPSS Archive – 100 PB capacity, ~15 PB/yr growth
• 1Gb/10Gb Ethernet (40Gb+ future)
• Science Gateways – RDA, ESG
• Data Transfer Services
• Remote Vis
• Partner Sites
• XSEDE Sites
NWSC-1 Resources in a Nutshell
• Centralized Filesystems and Data Storage Resource (GLADE)
– >90 GB/sec aggregate I/O bandwidth, GPFS filesystems
– 10.9 PetaBytes initially -> 16.4 PetaBytes in 1Q2014
• High Performance Computing Resource (Yellowstone)
– IBM iDataPlex Cluster with Intel Sandy Bridge EP† processors with
Advanced Vector Extensions (AVX)
– 1.552 PetaFLOPs – 29.8 bluefire-equivalents – 4,662 nodes – 74,592
cores
– 149.2 TeraBytes total memory
– Mellanox FDR InfiniBand full fat-tree interconnect
• Data Analysis and Visualization Resource (Geyser & Caldera)
– Large Memory System with Intel Westmere EX processors
• 16 nodes, 640 cores, 16 TeraBytes memory, 16 NVIDIA Kepler GPUs
– GPU-Computation/Vis System with Intel Sandy Bridge EP processors
with AVX
• 16 nodes, 128 SB cores, 1 TeraByte memory, 32 NVIDIA Kepler GPUs
– Knights Corner System with Intel Sandy Bridge EP processors with AVX
• 16 nodes, 128 SB cores, 992 KC cores, 1 TeraByte memory - Nov’12 delivery
† “Sandy Bridge EP” is the Intel® Xeon® E5-2670
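The headline Yellowstone figures above can be cross-checked from the per-node configuration. The following is a minimal sketch, assuming 16 cores and 32 GB per node and a 2.6 GHz clock (the values given on the Yellowstone slide), decimal units, and 8 double-precision floating-point operations per core per cycle with AVX. That last figure is the usual Sandy Bridge peak rate and is an assumption here, not something stated in this deck.

```python
# Back-of-the-envelope check of the Yellowstone headline numbers.
NODES = 4_662
CORES_PER_NODE = 16
MEM_PER_NODE_GB = 32
CLOCK_GHZ = 2.6
FLOPS_PER_CYCLE = 8  # assumption: AVX add + multiply, 4 doubles each per cycle

total_cores = NODES * CORES_PER_NODE
# cores * GHz * flops/cycle gives GFLOPS; divide by 1e6 for PFLOPS
peak_pflops = total_cores * CLOCK_GHZ * FLOPS_PER_CYCLE / 1e6
total_mem_tb = NODES * MEM_PER_NODE_GB / 1000  # decimal TB

print(f"cores:  {total_cores:,}")
print(f"peak:   {peak_pflops:.3f} PFLOPS")
print(f"memory: {total_mem_tb:.1f} TB")
```

Under these assumptions the sketch reproduces the 74,592 cores, 1.552 PFLOPS, and 149.2 TB quoted above.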
GLADE
• 10.94 PB usable capacity -> 16.42 PB usable (1Q2014)
• Estimated initial file system sizes
– collections ≈ 2 PB (RDA, CMIP5 data)
– scratch ≈ 5 PB (shared, temporary space)
– projects ≈ 3 PB (long-term, allocated space)
– users ≈ 1 PB (medium-term work space)
• Disk Storage Subsystem
– 76 IBM DCS3700 controllers & expansion drawers
• 90 2-TB NL-SAS drives/controller
• add 30 3-TB NL-SAS drives/controller (1Q2014)
• GPFS NSD Servers
– 91.8 GB/s aggregate I/O bandwidth; 19 IBM x3650 M4 nodes
• I/O Aggregator Servers (GPFS, GLADE-HPSS connectivity)
– 10-GbE & FDR interfaces; 4 IBM x3650 M4 nodes
• High-performance I/O interconnect to HPC & DAV
– Mellanox FDR InfiniBand full fat-tree
– 13.6 GB/s bidirectional bandwidth/node
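The usable capacities on this slide can be related to the drive counts. The sketch below is illustrative only: it computes raw capacity from the controller and drive figures, derives the usable-to-raw ratio implied by the 10.94 PB figure (the deck does not state the RAID layout, so the ratio is inferred rather than assumed), and applies the same ratio to the 1Q2014 expansion. Decimal units (1 PB = 1000 TB) are an assumption.

```python
# Raw vs. usable capacity for GLADE, from the drive counts on the slide.
CONTROLLERS = 76

initial_raw_tb = CONTROLLERS * 90 * 2   # 90 x 2-TB drives per controller
added_raw_tb = CONTROLLERS * 30 * 3     # +30 x 3-TB drives per controller (1Q2014)

initial_raw_pb = initial_raw_tb / 1000
final_raw_pb = (initial_raw_tb + added_raw_tb) / 1000

# Usable/raw ratio implied by the slide's 10.94 PB initial usable capacity.
usable_ratio = 10.94 / initial_raw_pb
final_usable_pb = final_raw_pb * usable_ratio

print(f"initial raw:   {initial_raw_pb:.2f} PB")
print(f"usable ratio:  {usable_ratio:.2f}")
print(f"final raw:     {final_raw_pb:.2f} PB")
print(f"final usable:  {final_usable_pb:.2f} PB")
```

The implied ratio is close to 0.80, and carrying it forward lands within rounding of the 16.42 PB quoted for 1Q2014.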
NCAR Disk Storage Capacity Profile
[Chart: Total Centralized Filesystem Storage – total usable storage (PB, axis 0 to 18) by date, Jan-10 to Jan-16; series: bluefire, GLADE (Mesa Lab), GLADE (at NWSC)]
Yellowstone
NWSC High-Performance Computing Resource
• Batch Computation
– 4,662 IBM dx360 M4 nodes – 16 cores, 32 GB memory per node
– Intel Sandy Bridge EP processors with AVX – 2.6 GHz clock
– 74,592 cores total – 1.552 PFLOPs peak
– 149.2 TB total DDR3-1600 memory
– 29.8 Bluefire equivalents
• High-Performance Interconnect
– Mellanox FDR InfiniBand full fat-tree
– 13.6 GB/s bidirectional bw/node
– <2.5 µs latency (worst case)
– 31.7 TB/s bisection bandwidth
• Login/Interactive
– 6 IBM x3650 M4 Nodes; Intel Sandy Bridge EP processors with AVX
– 16 cores & 128 GB memory per node
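The interconnect figures on this slide are consistent with one another, as the sketch below illustrates. It assumes a 4-lane FDR link at 14.0625 Gb/s per lane with 64b/66b encoding (standard FDR parameters, not stated in the deck), and it assumes the bisection bandwidth is counted as half the compute nodes times the bidirectional per-node rate, which is one common convention for a full fat-tree.

```python
# FDR InfiniBand per-node bandwidth and fat-tree bisection bandwidth.
NODES = 4_662
LANES = 4                 # assumption: 4x FDR link per node
LANE_GBITS = 14.0625      # FDR signaling rate per lane, Gb/s
ENCODING = 64 / 66        # 64b/66b line encoding

per_direction_gbytes = LANES * LANE_GBITS * ENCODING / 8   # Gb/s -> GB/s
bidirectional_gbytes = 2 * per_direction_gbytes

# Use the slide's rounded 13.6 GB/s so the result is comparable to the
# quoted bisection figure.
SLIDE_BIDIR_GBYTES = 13.6
bisection_tbytes = (NODES / 2) * SLIDE_BIDIR_GBYTES / 1000

print(f"per node, bidirectional: {bidirectional_gbytes:.1f} GB/s")
print(f"bisection bandwidth:     {bisection_tbytes:.1f} TB/s")
```

With these assumptions the per-node rate rounds to 13.6 GB/s and the bisection bandwidth to 31.7 TB/s, matching the slide.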
NCAR HPC Profile
[Chart: Peak PFLOPs at NCAR (axis 0.0 to 1.5) by date, Jan-04 to Jan-16; systems: IBM POWER4/Colony (bluesky), IBM BlueGene/L (frost, with a later frost upgrade), IBM p575/8 (78) POWER5/HPS (bluevista), IBM p575/16 (112) POWER5+/HPS (blueice), IBM Power 575/32 (128) POWER6/DDR-IB (bluefire), Cray XT5m (lynx), IBM iDataPlex/FDR-IB (yellowstone); annotation: yellowstone is 30x Bluefire performance]
Geyser and Caldera
NWSC Data Analysis & Visualization Resource
• Geyser: Large-memory system
– 16 IBM x3850 nodes – Intel Westmere-EX processors
– 40 cores, 1 TB memory, 1 NVIDIA Kepler Q13H-3 GPU
per node
– Mellanox FDR full fat-tree interconnect
• Caldera: GPU computation/visualization system
– 16 IBM x360 M4 nodes – Intel Sandy Bridge EP/AVX
– 16 cores, 64 GB memory, 2 NVIDIA Kepler Q13H-3 GPUs
per node
– Mellanox FDR full fat-tree interconnect
• Knights Corner system (November 2012 delivery)
– Intel Many Integrated Core (MIC) architecture
– 16 IBM Knights Corner nodes
– 16 Sandy Bridge EP/AVX cores, 64 GB memory,
1 Knights Corner adapter per node
– Mellanox FDR full fat-tree interconnect
Erebus
Antarctic Mesoscale Prediction System (AMPS)
• IBM iDataPlex Compute Cluster
– 84 IBM dx360 M4 Nodes; 16 cores, 32 GB
– Intel Sandy Bridge EP; 2.6 GHz clock
– 1,344 cores total – 28 TFLOPs peak
– Mellanox FDR InfiniBand full fat-tree
– 0.54 Bluefire equivalents
• Login Nodes
– 2 IBM x3650 M4 Nodes
– 16 cores & 128 GB memory per node
• Dedicated GPFS filesystem
– 57.6 TB usable disk storage
– 9.6 GB/sec aggregate I/O bandwidth
Erebus, on Ross Island, is Antarctica’s
most famous volcanic peak and is one
of the largest volcanoes in the world –
within the top 20 in total size and
reaching a height of 12,450 feet.
Yellowstone Software
• Compilers, Libraries, Debugger & Performance Tools
– Intel Cluster Studio (Fortran, C++, performance & MPI libraries,
trace collector & analyzer) 50 concurrent users
– Intel VTune Amplifier XE performance optimizer 2 concurrent users
– PGI CDK (Fortran, C, C++, pgdbg debugger, pgprof) 50 conc. users
– PGI CDK GPU Version (Fortran, C, C++, pgdbg debugger, pgprof)
for DAV systems only, 2 concurrent users
– PathScale EKOPath (Fortran, C, C++, PathDB debugger)
20 concurrent users
– Rogue Wave TotalView debugger 8,192 floating tokens
– IBM Parallel Environment (POE), including IBM HPC Toolkit
• System Software
– LSF-HPC Batch Subsystem / Resource Manager
• IBM has purchased Platform Computing, Inc. (developers of LSF-HPC)
– Red Hat Enterprise Linux (RHEL) Version 6
– IBM General Parallel Filesystem (GPFS)
– Mellanox Unified Fabric Manager
– IBM xCAT cluster administration toolkit
NCAR HPSS Archive Resource
• NWSC
– Two SL8500 robotic libraries (20k cartridge capacity)
– 26 T10000C tape drives (240 MB/sec I/O rate each) and
T10000C media (5 TB/cartridge, uncompressed) initially;
+20 T10000C drives ~Nov 2012
– >100 PB capacity
– Current growth rate ~3.8 PB/year
– Anticipated NWSC growth rate ~15 PB/year
• Mesa Lab
– Two SL8500 robotic libraries (15k cartridge capacity)
– Existing data (14.5 PB):
• 1st & 2nd copies will be ‘oozed’ to new media @
NWSC, begin 2012
– New data @ Mesa:
• Disaster-recovery data only
– T10000B drives & media to be retired
– No plans to move Mesa Lab SL8500 libraries (more costly
to move than to buy new under AMSTAR Subcontract)
Plan to release an “AMSTAR-2” RFP 1Q2013, with target for first
equipment delivery during 1Q2014 to further augment the NCAR HPSS
Archive.
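The archive figures above can be tied together with some simple arithmetic. The sketch below is a rough illustration only: it uses decimal units and uncompressed cartridge capacity, treats the 14.5 PB of existing data as occupying the NWSC libraries once migrated, and ignores second copies. Those simplifications are assumptions made here, not statements from the deck.

```python
# NWSC HPSS archive: capacity, aggregate drive rate, and a naive fill time.
CARTRIDGE_SLOTS = 20_000       # two SL8500 libraries at NWSC
TB_PER_CARTRIDGE = 5           # T10000C media, uncompressed
DRIVES_INITIAL = 26
DRIVES_ADDED = 20              # planned ~Nov 2012
MB_PER_SEC_PER_DRIVE = 240
GROWTH_PB_PER_YEAR = 15        # anticipated NWSC growth rate
EXISTING_PB = 14.5             # existing Mesa Lab data

capacity_pb = CARTRIDGE_SLOTS * TB_PER_CARTRIDGE / 1000
initial_drive_gbytes = DRIVES_INITIAL * MB_PER_SEC_PER_DRIVE / 1000
expanded_drive_gbytes = (DRIVES_INITIAL + DRIVES_ADDED) * MB_PER_SEC_PER_DRIVE / 1000

# Naive estimate: one copy of everything, constant growth, no compression.
years_to_fill = (capacity_pb - EXISTING_PB) / GROWTH_PB_PER_YEAR

print(f"library capacity:      {capacity_pb:.0f} PB")
print(f"drive rate (26):       {initial_drive_gbytes:.2f} GB/s")
print(f"drive rate (46):       {expanded_drive_gbytes:.2f} GB/s")
print(f"naive years to fill:   {years_to_fill:.1f}")
```

The 20k cartridge slots at 5 TB each account for the 100 PB capacity figure; the fill-time estimate is only as good as the simplifications listed above.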
Yellowstone Physical Infrastructure
Resource: # Racks
• Yellowstone
– 65 × iDataPlex Racks (72 nodes per rack)
– 9 × 19” Racks (9 Mellanox FDR core switches)
– 1 × 19” Rack (login, service, management nodes)
• GLADE
– 20 × NSD Server, Controller and Storage Racks
– 1 × 19” Rack (I/O aggregator nodes, management, IB & Ethernet switches)
• Geyser & Caldera
– 1 × iDataPlex Rack (GPU-Comp & Knights Corner)
– 2 × 19” Racks (Large Memory, management, IB switch)
• Erebus (AMPS)
– 1 × iDataPlex Rack
– 1 × 19” Rack (login, IB, NSD, disk & management nodes)
Total Power Required: ~2.13 MW
• Yellowstone: ~1.9 MW
• GLADE: 0.134 MW
• Geyser & Caldera: 0.056 MW
• Erebus (AMPS): 0.087 MW
Yellowstone allocations (% of resource)
NCAR’s 29% represents 170 million core-hours per year for Yellowstone alone
(compared to less than 10 million per year on Bluefire) plus a similar fraction
of the DAV and GLADE resources.
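To put the 170 million core-hour figure in context, the sketch below compares it with the theoretical maximum for 74,592 cores running all year. The ratio it derives is an implied delivery fraction (covering downtime, maintenance, and scheduling overhead); the deck does not state such a factor, so it is an inference from the two quoted numbers.

```python
# Core-hours available on Yellowstone vs. the quoted NCAR allocation.
CORES = 74_592
HOURS_PER_YEAR = 24 * 365
NCAR_SHARE = 0.29
NCAR_CORE_HOURS = 170e6        # quoted on the slide

theoretical_core_hours = CORES * HOURS_PER_YEAR
# Total deliverable core-hours implied by "29% = 170 million".
implied_deliverable = NCAR_CORE_HOURS / NCAR_SHARE
implied_fraction = implied_deliverable / theoretical_core_hours

print(f"theoretical:         {theoretical_core_hours / 1e6:.0f} M core-hours/yr")
print(f"implied deliverable: {implied_deliverable / 1e6:.0f} M core-hours/yr")
print(f"implied fraction:    {implied_fraction:.2f}")
```

The quoted allocation corresponds to roughly 90% of the theoretical maximum being delivered to users.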
Yellowstone Schedule
[Gantt chart: Current Schedule, with biweekly ticks from 31 Oct through 15 Oct of the following year; phases: Infrastructure Preparation; Storage & InfiniBand Delivery & Installation; Test Systems Delivery & Installation; Production Systems Delivery; Integration & Checkout; Acceptance Testing; ASD & early users; Production Science]
Data in Flight to NWSC
• NWSC Networking
• Central Filesystem
– Migrating from GLADE-ML to GLADE-NWSC
• Archive
– Migrating HPSS data from ML to NWSC
BiSON to NWSC
• Initially three 10G circuits active
– Two 10G connections back to Mesa Lab for internal traffic
– One 10G direct to FRGP for general Internet2 / NLR / Research and Education traffic
• Options for dedicated 10G connections for high performance computing to other BiSON members
• System is engineered for 40 individual lambdas
– Each lambda can be a 10G, 40G, or 100G connection
• Independent lambdas can be sent each direction around the ring (two ADVA shelves at NWSC – one for each direction)
– With a major upgrade system could support 80 lambdas
– 100Gbps * 80 channels * 2 paths = 16Tbps
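The capacity arithmetic on this slide generalizes to other lambda counts and rates. A minimal sketch, where the function name `ring_capacity_tbps` is a hypothetical helper introduced for illustration:

```python
# Aggregate capacity of the optical ring for a given lambda configuration.
def ring_capacity_tbps(lambdas: int, gbps_per_lambda: int, paths: int = 2) -> float:
    """Total capacity in Tb/s across both directions around the ring."""
    return lambdas * gbps_per_lambda * paths / 1000

initial_active_gbps = 3 * 10                    # three 10G circuits at start
engineered_max = ring_capacity_tbps(40, 100)    # 40 lambdas, all at 100G
upgraded_max = ring_capacity_tbps(80, 100)      # after the major upgrade

print(f"initially active: {initial_active_gbps} Gb/s")
print(f"engineered max:   {engineered_max} Tb/s")
print(f"upgraded max:     {upgraded_max} Tb/s")
```

The upgraded case reproduces the 16 Tbps on the slide; the 40-lambda case gives half that if every lambda runs at 100G.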
High performance LAN
• Data Center Networking (DCN)
– high speed data center computing with 1G and 10G client facing ports for supercomputer, mass storage, and other data center components
– redundant design (e.g., multiple chassis and separate module connections)
– future option for 100G interfaces
• Juniper awarded RFP after NSF approval
– Access switches: QFX3500 series
– Network core and WAN edge: EX8200 series switch/routers
– Includes spares for lab/testing purposes
• Juniper training for NETS staff early Jan-2012
• Deploy Juniper equipment late Jan-2012
Moving Data… Ugh
Migrating GLADE Data
• Temporary work spaces (/glade/scratch, /glade/user)
– No data will automatically be moved to NWSC
• Allocated project spaces (/glade/projxx)
– New allocations will be made for the NWSC
– No data will automatically be moved to NWSC
– Data transfer option so users may move data they need
• Data collections (/glade/data01, /glade/data02)
– CISL will move data from ML to NWSC
– Full production capability will need to be maintained during the
transition
• Network Impact
– Current storage max performance is 5 GB/s
– Can sustain ~2 GB/s for reads while under a production load
– Moving 400 TB will take a couple of days, but doing so will saturate a 20 Gb/s network link
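The network-impact estimate can be checked with a short calculation. This sketch assumes decimal units and a constant transfer rate with no protocol overhead, which is optimistic; `transfer_days` and `link_utilization` are hypothetical helper names.

```python
# Time to move the GLADE data collections and the resulting link load.
def transfer_days(data_tb: float, rate_gbytes_per_s: float) -> float:
    """Days needed to move data_tb terabytes at a constant rate in GB/s."""
    seconds = data_tb * 1000 / rate_gbytes_per_s
    return seconds / 86_400

def link_utilization(rate_gbytes_per_s: float, link_gbits_per_s: float) -> float:
    """Fraction of the link consumed by a transfer at the given rate."""
    return rate_gbytes_per_s * 8 / link_gbits_per_s

days_sustained = transfer_days(400, 2.0)        # ~2 GB/s under production load
util_sustained = link_utilization(2.0, 20)      # share of a 20 Gb/s link
util_peak = link_utilization(5.0, 20)           # at the 5 GB/s storage maximum

print(f"400 TB at 2 GB/s: {days_sustained:.1f} days")
print(f"link use at 2 GB/s: {util_sustained:.0%}")
print(f"link use at 5 GB/s: {util_peak:.0%}")
```

At the sustained read rate the transfer takes a little over two days and fills about 80% of a 20 Gb/s link; at the storage system's peak rate it would demand twice what the link can carry.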
Migrating Archive
• What Migrates?
– Data: 15 PB and counting…
– Format: MSS to HPSS
– Tape: Tape B (1TB tapes) to Tape C (5TB tapes)
– Location: ML to NWSC
• HPSS at NWSC to become primary site in Spring 2012
– 1 day outage when metadata servers get moved
– ML HPSS will remain as Disaster Recovery Site
– Data Migration will take until early 2014
– Will throttle migration to not overload network
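The scale of the archive migration can be sketched as follows. The cartridge counts follow directly from the tape capacities above, assuming full, uncompressed cartridges and a single copy. The average rate depends on the migration window: the two-year duration used here is a hypothetical value chosen to match "Spring 2012 to early 2014" loosely, not a figure from the deck.

```python
# Rough sizing of the MSS-to-HPSS tape migration.
DATA_PB = 15
SOURCE_TB_PER_TAPE = 1     # Tape B
TARGET_TB_PER_TAPE = 5     # Tape C
MIGRATION_DAYS = 730       # hypothetical: roughly two years

data_tb = DATA_PB * 1000
source_tapes = data_tb // SOURCE_TB_PER_TAPE
target_tapes = data_tb // TARGET_TB_PER_TAPE

# Average sustained rate needed to finish within the assumed window.
avg_rate_mbytes = data_tb * 1_000_000 / (MIGRATION_DAYS * 86_400)

print(f"source cartridges: {source_tapes:,}")
print(f"target cartridges: {target_tapes:,}")
print(f"average rate:      {avg_rate_mbytes:.0f} MB/s")
```

Under these assumptions the required average rate is about that of a single T10000C drive, which leaves room to throttle the migration as planned.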
WHEW… A LOT OF WORK AHEAD!
QUESTIONS?
