An Analysis of Node Sharing on HPC
Clusters using XDMoD/TACC_Stats
Joseph P. White, Ph.D.
Scientific Programmer - Center for Computational Research
University at Buffalo, SUNY
XSEDE14, July 13–18, 2014
Outline
• Motivation
• Overview of tools (XDMoD, tacc_stats)
• Background
• Results
• Conclusions
• Discussion
TECHNOLOGY AUDIT SERVICE
CoAuthors
• Robert L. DeLeon (UB)
• Thomas R. Furlani (UB)
• Steven M. Gallo (UB)
• Matthew D. Jones (UB)
• Amin Ghadersohi (UB)
• Cynthia D. Cornelius (UB)
• Abani K. Patra (UB)
• James C. Browne (UTexas)
• William L. Barth (TACC)
• John Hammond (TACC)
Motivation
• Node sharing benefits:
– increases throughput by up to 26%
– increases energy efficiency by up to 22% (Breslow et al.)
• Node sharing disadvantages:
– resource contention
• Number of cores per node increasing
• Ulterior motive:
– Prove toolset
• A. D. Breslow, L. Porter, A. Tiwari, M. Laurenzano, L. Carrington, D. M. Tullsen, and A. E. Snavely.
The case for colocation of HPC workloads. Concurrency and Computation: Practice and Experience, 2013.
http://dx.doi.org/10.1002/cpe.3187
Tools
• XDMoD
– NSF funded open source tool that provides a wide range of
usage and performance metrics on XSEDE systems
– Web-based interface
– Powerful charting features
• tacc_stats
– Low-overhead collection of system-wide performance data
– Runs on every node of a resource; collects data at job start,
at job end, and periodically during the job:
• CPU usage
• Hardware performance counters
• Memory usage
• I/O usage
Data flow
XDMoD Data Sources
Background
• CCR's HPC resource "Rush"
– 8000+ cores
– Heterogeneous cluster: 8, 12, 16 or 32 cores per node
– InfiniBand
– Panasas parallel filesystem
– SLURM resource manager
• node sharing enabled by default
• cgroup plugin to isolate jobs
• Academic computing center: higher % of smaller
jobs than large XSEDE resources
• All data from Jan - Feb 2014 (~370,000 jobs)
Number of jobs by job size
Results
• Exclusive jobs: where no other jobs ran concurrently on
the allocated node(s) (left-hand side of plots)
• Shared jobs: where at least one other job was running
on the allocated node(s) (right-hand side of plots)
– Process memory usage
– Total OS memory usage
– LLC read miss rates
– Job exit status
– Parallel filesystem bandwidth
– InfiniBand interconnect bandwidth
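The exclusive/shared definitions above amount to an interval-overlap test per node. The following is a minimal sketch of that classification, not the XDMoD implementation; the `Job` record, its field names, and the sample jobs are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical job record; field names are assumptions for this sketch."""
    job_id: str
    start: int          # start time, seconds since epoch
    end: int            # end time, seconds since epoch
    nodes: frozenset    # names of the allocated nodes


def is_shared(job, all_jobs):
    """Return True if any other job ran concurrently on one of job's nodes."""
    for other in all_jobs:
        if other.job_id == job.job_id:
            continue
        # Two half-open intervals overlap when each starts before the other ends.
        overlaps_in_time = other.start < job.end and job.start < other.end
        if overlaps_in_time and job.nodes & other.nodes:
            return True
    return False


# Made-up example: "a" and "b" overlap on node n1; "c" runs alone on n2.
jobs = [
    Job("a", 0, 100, frozenset({"n1"})),
    Job("b", 50, 150, frozenset({"n1", "n2"})),
    Job("c", 200, 300, frozenset({"n2"})),
]
labels = {j.job_id: ("shared" if is_shared(j, jobs) else "exclusive") for j in jobs}
```

The pairwise scan is quadratic in the number of jobs; for a dataset of several hundred thousand jobs, grouping jobs by node and sorting by start time first would likely be needed.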
Memory usage per core
• (MemUsed - FilePages - Slab) from
/sys/devices/system/node/node0/meminfo
[Figure: memory usage per core (GB); panels: exclusive jobs (left), shared jobs (right)]
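The (MemUsed - FilePages - Slab) expression above can be sketched as a small parser for the per-NUMA-node meminfo text. This is an illustration only, not the tacc_stats code; the function names, the sample values, the summation over several NUMA nodes, and the treatment of GB as 1024^2 kB are all assumptions.

```python
import re


def process_memory_kb(meminfo_text):
    """Return MemUsed - FilePages - Slab (in kB) for one NUMA node's meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        # Lines look like: "Node 0 MemUsed:        16777216 kB"
        match = re.match(r"Node\s+\d+\s+(\w+):\s+(\d+)", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields["MemUsed"] - fields["FilePages"] - fields["Slab"]


def memory_per_core_gb(meminfo_texts, cores):
    """Sum the estimate over NUMA nodes (an assumption) and divide by core count."""
    total_kb = sum(process_memory_kb(text) for text in meminfo_texts)
    return total_kb / (1024 ** 2) / cores


# Made-up sample in the format of /sys/devices/system/node/node0/meminfo.
SAMPLE = """\
Node 0 MemTotal:       25165824 kB
Node 0 MemFree:         8388608 kB
Node 0 MemUsed:        16777216 kB
Node 0 FilePages:       6291456 kB
Node 0 Slab:            2097152 kB
"""

used_kb = process_memory_kb(SAMPLE)
per_core = memory_per_core_gb([SAMPLE], cores=8)
```

Subtracting FilePages and Slab removes page cache and kernel slab allocations, so the remainder approximates memory held by processes rather than by the OS.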
Total memory usage per core
(4GB/core nodes)
[Figure: total memory usage per core (GB); panels: exclusive jobs (left), shared jobs (right)]
Last level cache (LLC) read miss rate per socket
• UNC_LLC_MISS:READ on Intel Westmere uncore
• Gives upper bound estimate of DRAM bandwidth
[Figure: LLC read miss rate (10^6/s); panels: exclusive jobs (left), shared jobs (right)]
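The link between the LLC read miss rate and DRAM bandwidth can be illustrated with a short calculation. This sketch assumes each LLC read miss transfers one 64-byte cache line from DRAM; the function name, parameters, and counter values are hypothetical and not taken from tacc_stats.

```python
CACHE_LINE_BYTES = 64  # assumed cache line size on the Intel Westmere nodes


def dram_read_bandwidth_bound(misses_start, misses_end, interval_s,
                              line_bytes=CACHE_LINE_BYTES):
    """Estimate an upper bound on DRAM read bandwidth in bytes per second.

    misses_start and misses_end are readings of the LLC read miss counter
    taken interval_s seconds apart.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    miss_rate = (misses_end - misses_start) / interval_s  # misses per second
    return miss_rate * line_bytes


# Made-up readings: 6e9 misses over a 600 s sampling interval (10^7 misses/s).
bound = dram_read_bandwidth_bound(0, 6_000_000_000, 600.0)
```

With these made-up numbers the bound is 640 MB/s per socket, which gives a sense of scale for the 10^6/s units on the plot axis.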
Job exit status reported by SLURM
[Figure: fraction of jobs (0 to 1) by exit status (Successful, Killed, Failed), for exclusive jobs and shared jobs]
Panasas parallel filesystem write rate per node
[Figure: write rate per node (B/s); panels: exclusive jobs (left), shared jobs (right)]
InfiniBand write rate per node
• Peaks truncated:
– ~45,000 for exclusive jobs
– ~80,000 for shared jobs
[Figure: write rate, Log10(B/s); panels: exclusive jobs (left), shared jobs (right)]
Conclusions
• Little difference on average between the
shared and exclusive jobs on Rush
• Majority of jobs have resource usage much
less than max available
• Have created data collection/processing
software that facilitates easy evaluation of
system usage
Discussion
• Limitations of current work
– Unable to determine impact (if any) on job wall
time
– Comparing overall average values for jobs
– Shared node job statistics are convolved
– Exit code is not a reliable way to determine failure
Future work
• Use Application Kernels to get detailed
analysis of interference
• Many more metrics now available:
– FLOPS
– CPU clock cycles per instruction (CPI)
– CPU clock cycles per L1D cache load (CPLD)
• Add support for per job metrics on shared
nodes.
• Study classes of applications
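The two ratio metrics named above, CPI and CPLD, are both cycle counts divided by an event count. A minimal sketch follows, assuming the three counts are counter deltas collected over the same interval; the function name, argument names, and values are hypothetical.

```python
def ratio_metrics(cycles, instructions, l1d_loads):
    """Compute CPI and CPLD from hardware counter deltas over one interval."""
    if instructions <= 0 or l1d_loads <= 0:
        raise ValueError("event counts must be positive")
    return {
        "cpi": cycles / instructions,   # clock cycles per instruction
        "cpld": cycles / l1d_loads,     # clock cycles per L1D cache load
    }


# Made-up counter deltas for illustration.
metrics = ratio_metrics(cycles=3_000_000, instructions=2_000_000, l1d_loads=500_000)
```

Lower values of either ratio indicate that the cores retire more work per clock cycle, so comparing them between shared and exclusive runs of the same application is one way interference could show up.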
Questions
• BOF: XDMoD: A Tool for Comprehensive
Resource Management of HPC Systems
– 6:00pm - 7:00pm tomorrow. Room A602
• XDMoD
– https://xdmod.ccr.buffalo.edu/
• tacc_stats
– http://github.com/TACCProjects/tacc_stats
• Contact info – [email protected]
Acknowledgments
• This work is supported by the National Science
Foundation under grant number OCI 1203560
and grant number OCI 1025159 for the
Technology Audit Service (TAS) for XSEDE.