SSD – Applications, Usage Examples
Gordon Summer Institute
August 8-11, 2011
Mahidhar Tatineni
San Diego Supercomputer Center
SAN DIEGO SUPERCOMPUTER CENTER
Overview
• Introduction to flash hardware and benefits
• Flash usage scenarios
• Examples of applications tested on Dash,
Trestles compute nodes and Dash/Gordon I/O
nodes.
• Flash access/remote mounts on Dash, Trestles,
and Gordon.
Gordon Architecture Bridges the Latency Gap
[Figure: latency in seconds (lower is better) versus data capacity in GB (higher is better) across the application space: L1 cache (KB), L2 cache (KB), L3 cache (MB), DDR3 memory (10's of GB), Quick Path Interconnect (10's of GB), QDR InfiniBand interconnect (100's of GB), I/O to flash node FS (64 I/O nodes, 300 TB Intel SSD), and I/O to traditional HPC FS (Data Oasis, Lustre 4PB PFS).]
Flash Drives are a Good Fit for Data Intensive Computing
                          Flash Drive     Typical HDD   Good for Data Intensive Apps
Latency                   < .1 ms         10 ms         ✔
Bandwidth (r/w)           250/170 MB/s    100 MB/s      ✔
IOPS (r/w)                35,000/2000     100           ✔
Power consumption         2-5 W           6-10 W        ✔
MTBF                      1M hours        1M hours      -
Price/GB                  $2/GB           $.50/GB       -
Endurance                 2-10PB          N/A           ✔
Total Cost of Ownership   *** The jury is still out ***
Apart from the differences between HDD and SSD, it is uncommon to find local storage “close” to the compute nodes. We have found this attractive on our Trestles cluster, which has local flash on the compute nodes but is used for traditional HPC applications (not high IOPS).
• Dash uses Intel X25-E SLC drives and Trestles has X25-M MLC drives.
• The performance specs of the Intel flash drives to be deployed in Gordon are similar to those of the X25-M, except that they will have higher endurance.
Flash Usage Scenarios
• Node local scratch for I/O during a run
• Very little or no code changes required
• Ideal if there are several threads doing I/O simultaneously
and often.
• Examples: Gaussian, Abaqus, QCHEM
• Caching of partial or complete dataset in
analysis, search, and visualization tasks
• Loading entire database into flash
• Use flash via a filesystem
• Use raw device [DB2]
Flash as Local Scratch
• Applications which do a lot of local scratch I/O
during computations. Examples: Gaussian,
Abaqus, QCHEM
• Using flash is very straightforward. For example
on Trestles where local SSDs are available:
• Gaussian:
GAUSS_SCRDIR=/scratch/$USER/$PBS_JOBID
• Abaqus: scratch=/scratch/$USER/$PBS_JOBID
• When a lot of cores (up to 32 on Trestles) are
doing I/O and reading/writing constantly, the
SSDs can make a significant difference.
• Parallel filesystems not ideal for such I/O.
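A minimal sketch of the per-job scratch setup described above, written so that it also runs outside a batch job. The /scratch/$USER/$PBS_JOBID layout and the GAUSS_SCRDIR variable come from the slide; the SCRATCH_ROOT variable, the setup_scratch function name, and the fallback to a temporary directory are assumptions added for illustration.

```shell
#!/bin/bash
# Sketch: create a per-job scratch directory on node-local flash and point
# Gaussian at it. SCRATCH_ROOT and setup_scratch are hypothetical names.

setup_scratch() {
    # /scratch is the node-local SSD path quoted on the slide (Trestles).
    local root="${SCRATCH_ROOT:-/scratch}"
    local user="${USER:-$(id -un)}"
    local jobid="${PBS_JOBID:-nojob.$$}"
    local dir="$root/$user/$jobid"

    # Assumption: fall back to a temporary directory when the flash path
    # cannot be created (for example, when trying this on a login node).
    if ! mkdir -p "$dir" 2>/dev/null; then
        dir="$(mktemp -d)"
    fi
    printf '%s\n' "$dir"
}

GAUSS_SCRDIR="$(setup_scratch)"
export GAUSS_SCRDIR
echo "Gaussian scratch directory: $GAUSS_SCRDIR"
```

For Abaqus the same directory would be passed as the scratch= setting shown on the slide instead of being exported as an environment variable.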
Flash as local scratch space provides 1.5x-1.8x speedup over local disk for Abaqus
• Standard Abaqus test cases (S2A1, S4B) were run on Dash with 8 cores to compare performance between local hard disk and SSDs. Benchmark performance was as follows:
Benchmark   Local disk   SSDs
S4B         2536s        1748s
S2A1        811s         450s
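The speedup range in the slide title can be checked directly from the Abaqus timings above. The sketch below divides the local-disk time by the SSD time for each benchmark; the speedup function name is an illustrative choice, not part of the original material.

```shell
#!/bin/bash
# speedup HDD_SECONDS SSD_SECONDS -> prints the ratio to two decimals.
speedup() {
    LC_ALL=C awk -v hdd="$1" -v ssd="$2" 'BEGIN { printf "%.2f\n", hdd / ssd }'
}

s4b="$(speedup 2536 1748)"    # S4B:  local disk 2536s, SSDs 1748s
s2a1="$(speedup 811 450)"     # S2A1: local disk 811s,  SSDs 450s

echo "S4B  speedup: ${s4b}x"   # prints 1.45x
echo "S2A1 speedup: ${s2a1}x"  # prints 1.80x
```

Rounded to one decimal place, these ratios give the 1.5x and 1.8x quoted in the title.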
Reverse-Time-Migration Application
• Acoustic Imaging Application
  • Used to create images of sub-surface structures
  • Oil and gas companies use RTM to plan drilling investments
  • This computational research is sponsored by a commercial user
• Correlation between source data and recorded data
  • Forward-propagated seismic waves
  • Backward-propagated seismic waves
  • Correlation between seismic waves illuminates reflection/diffraction points
• Temporary Storage Requirements
  • Snapshots stored for correlation
  • Example
    • 400^3 max grid points
    • 20000 msec
    • ~60GB temporary storage used
[Figure: Example: Computation-IO Profile, 400x20000 on HDD. Pie chart with segments for Write, Read, and Computation; values shown are 26%, 54%, and 20% (the pairing of values to segments is not preserved).]
Reverse-Time-Migration on Flash*
• Storage comparison on batch nodes
  • Spinning disk (HDD), flash drives (flash), parallel file system (GPFS)
• Local flash drive outperforms other storages
  • Avg 7.2x IO speedup vs HDD
  • Avg 3.9x IO speedup vs GPFS
• IO-node RAID’d-flash
  • Comparison with RAID’d drives
  • 16 Intel drives
  • 4 Fusion-io cards
  • Raided flash achieves 2.2x speedup compared to single drive
[Figure: IO time (sec) for HDD, GPFS, and Flash across test cases 400x20000, 800x2480, 1200x720, and 1600x304.]
* Done by Pietro Cicotti, SDSC
Local SSD to Cache Partial/Full Dataset
• Load partial/full dataset into flash.
• Typically needs application modification to write
data into flash and do all subsequent reads from
flash.
• Example: Munagala-Ranade Breadth First
Search (MR-BFS) code:
• Generation phase -> puts the data in flash.
• Multiple MR-BFS runs read and process data.
• Multiple threads reading, benefits from low latency of
SSDs.
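The generate-once, read-many pattern above can be sketched in shell as follows. This is not the MR-BFS code; it only illustrates staging a dataset onto flash a single time and then serving every subsequent run from the flash copy. The stage_to_flash function, the graph.dat file name, and the temporary directories standing in for the parallel file system and the SSD are all hypothetical.

```shell
#!/bin/bash
# stage_to_flash SRC FLASH_DIR
# Copy SRC into FLASH_DIR unless a copy of the same size is already there,
# then print the path of the flash-resident copy.
stage_to_flash() {
    local src="$1" flash_dir="$2"
    local dest="$flash_dir/$(basename "$src")"
    mkdir -p "$flash_dir" || return 1
    if [ ! -f "$dest" ] || [ "$(wc -c < "$src")" -ne "$(wc -c < "$dest")" ]; then
        cp "$src" "$dest" || return 1
    fi
    printf '%s\n' "$dest"
}

# Demo: throwaway directories stand in for the parallel file system (pfs)
# and the node-local flash (flash).
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/pfs"
printf 'edge 1 2\nedge 2 3\nedge 3 1\n' > "$demo_root/pfs/graph.dat"

# Staging happens once ...
cached="$(stage_to_flash "$demo_root/pfs/graph.dat" "$demo_root/flash")"

# ... and every later run reads only from the flash copy.
for run in 1 2 3; do
    edges="$(wc -l < "$cached")"
    echo "run $run read $((edges)) edges from $cached"
done
```

Calling stage_to_flash again with the same arguments skips the copy, so repeated jobs on the same node pay the staging cost only once.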
Flash case study – Breadth First Search*
• Implementation of the breadth-first search (BFS) graph algorithm developed by Munagala and Ranade
• Benchmark problem: BFS on graph containing 134 million nodes
• Use of flash drives reduced I/O time by a factor of 6.5. As expected, no measurable impact on non-I/O operations
• Problem converted from I/O bound to compute bound
[Figure: MR-BFS serial performance, 134217726 nodes. Stacked bars of I/O time and non-I/O time, t (s), for SSDs and HDDs.]
* Done by Sandeep Gupta, SDSC
Flash for caching: Case study – Parallel Streamline
Visualization
Camp et al, accepted to IEEE Symp. on Large-Scale Data Analysis and
Visualization (LDAV 2011)
Databases on Flash
• Database performance benefits from low latency I/O
from flash
• Two options for setting up database:
• Load database on flash based filesystem, already tested on Dash I/O
nodes.
• DB2 with direct native access to flash memory (coming soon!).
LIDAR Data Lifecycle
[Figure: LIDAR data lifecycle with stages labeled Waveform Data (D. Harding, NASA), Point Cloud Dataset, Bare earth DEM, Full-featured DEM, and Portal.]
OpenTopography is a “cloud” for topography data and tools
LIDAR benchmarking* and experiments on a Dash I/O node
• Experiments with LIDAR point cloud data with data sizes ranging from 1GB to 1TB using DB2.
• Experiments to be performed include:
  • Load times: time to load each dataset
  • Single user Selection times: for selecting 6%, 12%, 50% of data
  • Single user Processing times: for DEM generation on selected data
  • Multiuser: for a fixed dataset size (either 100GB or 1TB), run selections and processing for multiple concurrent users, e.g. 2, 4, 8, 16 concurrent users
  • Logical nodes testing: for a fixed dataset size (100GB or 1TB), DB2 has the option of creating multiple “logical nodes” on a given system (“physical node”). Test what the optimal number of logical nodes is on an SSD node
*Chaitan Baru’s group at SDSC.
Flash case study – LIDAR
• Remote sensing technology used to map geographic features with high resolution
• Benchmark problem: Load 100 GB data into single table, then count rows. DB2 database instance
• Flash drives 1.5x (load) to 2.4x (count) faster than hard disks
[Figure: t (s) for SSDs and HDDs on four tests: 100GB Load, 100GB Load FastParse, 100GB Count(*) Cold, 100GB Count(*) Warm.]
Flash case study – LIDAR
• Remote sensing technology used to map geographic features with high resolution
• Comparison of runtimes for concurrent LIDAR queries obtained with flash drives (SSD) and hard drives (HDD) using the Alaska Denali-Totschunda data collection.
• Impact of SSDs was modest, but significant when executing multiple simultaneous queries
[Figure: t (s) for SSDs and HDDs with 1, 4, and 8 concurrent queries.]
PDB – protein interaction query
• First step in analysis involves reduction of a 150-million-row database table to one million rows. Use of flash drives reduced query time to 3 minutes, a 10x speedup over hard disk
• Dash I/O node configuration
• Four 320 GB Fusion-io Drives configured as 1.2 TB
RAID 0 device running an XFS file system
• Two quad-core Intel Xeon E5530 2.40 GHz
processors and 48 GB of DDR3-1066 memory
Accessing Flash/SSDs on Dash, Trestles
System          Configuration                           SSD                   HDD   PFS               Network
Dash – batch*   16 nodes; 2 quad-core Intel Nehalem     64GB, node local      Yes   GPFS/Data Oasis   IB-DDR
                (8 cores/node); 48GB/node
Dash – vSMP*    16 nodes; 2 quad-core Intel Nehalem     1TB (64x16            N/A   GPFS              IB-DDR
                (8 cores/node); 48GB/node. Memory       aggregated)
                aggregated to 768GB via vSMP
Dash I/O node   4 nodes; 2 quad-core Intel Nehalem      1 TB (64*16)          N/A   N/A               N/A
                (8 cores/node); 48GB/node; large SSD    per node
Trestles*       324 nodes; 4 eight-core AMD             120GB, node local     N/A   Data Oasis        IB-QDR
                Magny-Cours/node (32 cores/node);       drives
                64GB/node
Sample Script on Dash
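#!/bin/bash
#PBS -N PBStest
#PBS -l nodes=1:ppn=8
#PBS -l walltime=01:00:00
#PBS -o test-normal.out
#PBS -e test-normal.err
#PBS -m e
#PBS -M [email protected]
#PBS -V
#PBS -q batch
# Work in the per-job directory on the node-local scratch space.
cd /scratch/mahidhar/$PBS_JOBID
# Stage the input data from the home directory into the scratch directory.
cp -r /home/mahidhar/COPYBK/input /scratch/mahidhar/$PBS_JOBID
# Run on all 8 cores of the node from the scratch directory.
mpirun_rsh -hostfile $PBS_NODEFILE -np 8 test.exe
# Copy the results back to the home directory.
cp out.txt /home/mahidhar/COPYBK/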
Dash Prototype vs. Gordon
                            Dash                         Gordon
Number of Compute Nodes     64                           1,024
Number of I/O Nodes         4                            64
Compute node processors     Intel Nehalem                Intel Sandy Bridge
Compute node memory         48 GB                        64 GB
I/O node flash              Intel X25-E SLC              Intel eMLC
Flash Capacity per Node     1 TB                         4.8 TB
vSMP Supernode Size         16 nodes/768GB               32 nodes/2TB
InfiniBand Network          Single Rail, Fat Tree, DDR   Dual Rail, 3D Torus, QDR
Resource Management         Torque                       SLURM
When considering benchmark results and scalability, keep in mind that nearly every major feature of Gordon will be an improvement over Dash.
Accessing Flash on Gordon
• Majority of the flash disk will be in the 64 Gordon I/O
nodes. Each I/O node will have ~ 4.8TB of flash.
• Flash from I/O nodes will be made available to non-vSMP
compute nodes via the IB network and iSER
implementations. Two options will be available:
• XFS filesystem mounted locally on each node.
• Oracle Cluster Filesystem (OCFS)
• vSMP software will aggregate the flash from the I/O
node(s) included in the vSMP nodes. The aggregated
flash filesystem will be available as local scratch on the
node.
Flash performance needs to be freed from the I/O nodes
[Figure: system diagram with two callouts, “Application is here” and “Flash is here”.]
Alphabet Soup of networking protocols, and file systems
• SRP - SCSI over RDMA
• iSER - iSCSI over RDMA
• NFS over RDMA
• NFS/IP over IB
• XFS – via iSER devices
• Lustre
• OCFS – via iSER devices
• PVFS
• OrangeFS
• Others…
In our effort to maximize flash performance we have tested most of these.
BTW: Very few people doing this!
Exporting Flash Performance using iSER: Sequential
[Figure: iSER Implementations: Sequential. Bandwidth (MB/s) for TGTD and TGTD+ on four tests: mt-seq-read, ep-seq-read, mt-seq-write, ep-seq-write.]
Exporting Flash Performance using iSER: Random
[Figure: iSER Implementation: Random. IOPS for TGTD and TGTD+ on four tests: mt-rnd-read, ep-rnd-read, mt-rnd-write, ep-rnd-write.]
Flash performance – parallel file system
• Performance of Intel Postville Refresh SSDs (16 drives, RAID 0) with OCFS (Oracle Cluster File System)
• I/O done simultaneously from 1, 2, or 4 compute nodes
• MT = multi-threaded
• EP = embarrassingly parallel
[Figure: OCFS Sequential access. Bandwidth (MB/s) for MT-RD, MT-WR, EP-RD, and EP-WR on 1, 2, and 4 nodes.]
[Figure: OCFS Random access. IOPS for MT-RD, MT-WR, EP-RD, and EP-WR on 1, 2, and 4 nodes.]
Flash performance – serial file system
• Performance of Intel Postville Refresh SSDs (16 drives, RAID 0) with XFS
• I/O done simultaneously from 1, 2, or 4 compute nodes
• MT = multi-threaded
• EP = embarrassingly parallel
[Figure: XFS Sequential access. Bandwidth (MB/s) for MT-RD, MT-WR, EP-RD, and EP-WR on 1, 2, and 4 nodes.]
[Figure: XFS Random access. IOPS for MT-RD, MT-WR, EP-RD, and EP-WR on 1, 2, and 4 nodes.]
Summary
• The early hardware has allowed us to test
applications, protocols and file systems.
• I/O profiling tools and running different
application flash usage scenarios have helped
optimize application I/O performance.
• Performance test results point to iSER, OCFS,
and XFS as the right solutions for exporting
flash.
• Further work required to integrate into user
documentation, systems scripts, and the SLURM
resource manager.
Discussion
• Attendee I/O access pattern/method.
Thank you!
For more information
http://gordon.sdsc.edu
[email protected]
Mahidhar Tatineni
[email protected]