Market Update - HPC User Forum

April 2010
Dearborn, MI
Panel Members
• Alex Akkerman, Ford Motor Company
• Sharan Kalwani, KAUST
• Steve Feldman, CD-adapco
• Matt Dunbar, Simulia
• Uwe Schramm, Altair Engineering
• Li Zhang, Livermore Software Technology
• Barbara Hutchings, ANSYS, Inc.
• Martin McNamee, MSC Software
Panel Format
•4 Questions
Provided ahead of time
•2 minutes per question for each
•Follow-up and Audience after
each participant had a chance to
Q1. Applications Scalability…..
•Please share with the
audience, briefly – the issues
surrounding Applications
Scalability and how is this
being addressed?
Q1. Applications Scalability…..
Our solvers scale reasonably well to 512 cores or
more with very large problems.
Very few actually use this many on one analysis
Primary solver bottlenecksMemory bandwidth
Unbalanced work loads
Untapped speed potentials
Parallel Meshing
Parallel post-processing
Parallel I/O
Different algorithms, re-evaluation of methods
Q1. Applications Scalability…..
• Fundamental limitations on scalability remain scalar
sections of code (Amdahl’s law) and load balance
Solution is still developer time and effort which is being
Looking at ways to improve developer efficiency through use
of better programming models (primarily from Intel and
Past programming model changes were either too limited
(OpenMP), too immature, or simply ineffective
Newer models driven by need to bring multi-core execution to
commodity applications have more promise
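The scalar-section limit mentioned on this slide can be made concrete with Amdahl's law. The following is a minimal sketch, not taken from the panel: the 1% serial fraction and the core counts are illustrative assumptions, and the function name is hypothetical.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Ideal speedup on `cores` cores when `serial_fraction` of the
    single-core run time cannot be parallelized (Amdahl's law)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


if __name__ == "__main__":
    # Assumed 1% scalar code; the speedup can never exceed 1 / 0.01 = 100.
    for cores in (8, 64, 512):
        print(cores, round(amdahl_speedup(0.01, cores), 1))
```

Under these assumed numbers, even a small scalar section caps the achievable speedup well below the core count, which is why the slide points to developer effort on the remaining serial code.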
Q1. Applications Scalability…..
• We’re talking finite element solver applications
• Two classes of solvers
  – Iterative schemes
  – Matrix inversion schemes
• Issues: Scalability, Quality, Repeatability, Data
transfer, Hardware configurations, Hardware access
• Addressed: Optimal domain decomposition,
Computational methods that scale well, Solver
architecture, Focus on certain hardware,
Q1. Applications Scalability…..
Data summation order for different MPI causes
errors – LS-DYNA uses fixed order.
Modified/refined model decomposes in different way
changing results – ‘Cut lines’ are preserved from 1st
Scaling for 128 processors not always good.
Hybrid LS-DYNA runs SMP within processor and
MPP between processors.
Results consistent with increased # SMP threads.
Simple command line to execute.
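The summation-order point above follows from floating-point addition not being associative. The sketch below is an illustration only and is not LS-DYNA's implementation: the rank numbers, the partial sums, and both function names are hypothetical, and the values are chosen to exaggerate the rounding effect.

```python
def reduce_in_arrival_order(partials):
    """Sum (rank, value) pairs in whatever order they arrived."""
    total = 0.0
    for _rank, value in partials:
        total += value
    return total


def reduce_in_fixed_order(partials):
    """Sum the same pairs, but always in ascending rank order."""
    total = 0.0
    for _rank, value in sorted(partials, key=lambda p: p[0]):
        total += value
    return total


# Hypothetical partial sums from four ranks, seen in two arrival orders.
arrival_a = [(0, 1.0e16), (1, 1.0), (2, -1.0e16), (3, 1.0)]
arrival_b = [(1, 1.0), (3, 1.0), (0, 1.0e16), (2, -1.0e16)]

if __name__ == "__main__":
    print(reduce_in_arrival_order(arrival_a), reduce_in_arrival_order(arrival_b))
    print(reduce_in_fixed_order(arrival_a), reduce_in_fixed_order(arrival_b))
```

Summing in arrival order gives different totals for the two runs, while the fixed order gives the same total every time. A fixed order makes the result repeatable; it does not by itself make it more accurate.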
Q1. Applications Scalability…..
• Solver scaling continues to expand
• CFD to 1000, Structures to 100
• Especially key to accelerating transients
• Need to address the scalar bottlenecks
Across full simulation process (meshing, I/O, certain
solver physics, visualization)
Hybrid parallel algorithms for multi-core/mixed-core
• Distributed/shared memory, OpenMP vs. MPI
Support for latest communication technologies
• QDR IB, iWARP for 10GigE, etc.
Q2. Licensing Model.....
• As the hardware technology shift to multicore processors continues and even
accelerates, the licensing models of many
ISV codes become a serious problem for
your customers. Per core licensing becomes
exceedingly unaffordable and limiting in
ability to improve and even maintain the
levels of performance of recent past.
Panel participants – How can you help your
customers become more competitive given current
technology trends?
Q2. Licensing Model.....
• Modified ANSYS HPC licensing in 2009
• Tied to the value of HPC
• New scalable licensing enables ‘extreme/unlimited
parallel’ for high fidelity; minimizes the licensing
“penalty” on higher core count processors
Enterprise access is key
• Hardware located anywhere, users located
Owned, rented, IaaS
• Interchangeable across physics
• Buy once, deploy once
Q2. Licensing Model.....
One Code Strategy – LS-OPT, LS-PrePost, Dummies,
barriers & head forms FEA models all available as part
of LS-DYNA distribution with no additional license keys
Ultimate Value: Multi-physics & multi-stage
capabilities in one scalable code.
Flexibility: 4 core license allows 4 one core jobs or
one 4 core job.
Steeply decreasing licensing fees per core as the #
processors increase.
Unlimited core site license.
Q2. Licensing Model.....
• Not have licensing based on number of
• Per use token-licensing
• Addressing thru special license decay
Multi-run environments
Massive computation
Q2. Licensing Model.....
Have asked this question internally and am presenting
collective responses
Two factors in license price
Parallel development and testing are more expensive than scalar
With SIMULIA typical sale is annual license, so, on the one hand,
sales force is motivated to maintain a good relationship with
customer, but on the other hand the sales force is fearful of
“revenue erosion” from “free parallel”
SIMULIA sales team view existing licensing model which
rewards parallel execution with lower “per core hour” execution
Requires greater base token pool
Requires “revenue neutral” shift in licensing model
Great volume of sales (more customers)
Q2. Licensing Model.....
Our “Power” session licenses are independent of the
number of cores used for a single analysis
Our “Cloud license” model is also independent of the
number of cores and of the number of simultaneous
analyses. You pay only for what you use and we do not
care where you run.
We make our clients more competitive by adding
value with each release:
1. Cut the total engineering time required for analysis.
Engineering time is far more expensive than
computer time.
2. Enlarge the universe of problems that our tools can
be employed to analyze while working to make all our
analysis more accurate.
Q3. New Technology Adoption….…
•We notice a considerable lag in
adoption of new technologies (e.g.
FPGA, GPGPU) in the
Manufacturing CAE space. Please
elaborate on what are the issues
and your response.
Q3. New Technology Adoption….…
New technologies come with lots of hype and little
infrastructure. It takes time for languages, compilers,
debuggers,… to mature and standardize. We are not in a
position to rewrite 1M lines in assembly language every
time some new device appears.
New technologies are not always applicable to our
particular needs.
When we see a technology with reasonable potential
for return on investment, we partner with the
technology providers, watch the literature, assign
researchers… and it does not always pay off.
Q3. New Technology Adoption….…
Adoption of GPGPU, and to greater extent, FPGA has high
programming cost for large, general purpose codes
Result is that GPGPU focus tends to be on acceleration of
obvious bottlenecks, preferably, with low code line counts
Drawback for parallel codes is that often greater parallel gains
are in same areas, so gains from GPGPU are considerably less
for parallel codes than for scalar codes
Even where adoption is underway, keeping x86 and GPGPU
code (CUDA/OpenCL) results in two code bases
SIMULIA is accelerating obvious code with GPGPU, and
working internally and with partners to find better programming
Q3. New Technology Adoption….…
• Technology needs to be a fit for certain
computational methods – memory, data
• We’re trying, but the gains do not justify
the effort
• Technology is not where it needs to be
• Lack of standards
Q3. New Technology Adoption….…
LSTC currently is evaluating the impact of GPUs
on the performance of implicit LS-DYNA. It is applied
to the innermost computational kernel of the sparse
matrix factorization.
GPUs offer high performance for certain
computational kernels.
Performance is subject to the overhead cost of
transferring the data to the GPU and the results back
from the GPU.
Performance will no longer degrade for REAL*8
arithmetic when the Nvidia Fermi GPUs become
LSTC hopes to have the GPU implementation in
Implicit around mid-year.
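The transfer-overhead point above can be put into a simple timing model. This is a rough sketch under stated assumptions and not an LSTC measurement: all timings, the kernel speedup factor, and the function name are hypothetical.

```python
def offload_speedup(kernel_s: float, other_s: float,
                    gpu_kernel_speedup: float, transfer_s: float) -> float:
    """Overall speedup when only the kernel is moved to the GPU.

    kernel_s: CPU time spent in the offloaded kernel (seconds)
    other_s: CPU time spent everywhere else (unchanged by the offload)
    gpu_kernel_speedup: how much faster the kernel alone runs on the GPU
    transfer_s: time to copy inputs to the GPU and results back
    """
    baseline = kernel_s + other_s
    offloaded = other_s + kernel_s / gpu_kernel_speedup + transfer_s
    return baseline / offloaded


if __name__ == "__main__":
    # Assumed run: 80 s in the factorization kernel, 20 s elsewhere,
    # and a kernel that is 10x faster on the GPU.
    print(offload_speedup(80.0, 20.0, 10.0, 0.0))   # no transfer cost
    print(offload_speedup(80.0, 20.0, 10.0, 12.0))  # 12 s of transfers
```

In this model the offload only pays off when the kernel time saved exceeds the transfer time, which is one reason the work targets the innermost, most compute-dense kernel.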
Q3. New Technology Adoption….…
• Establishing ROI is critical - and unclear
  – Moving technology target (CPUs vs GP-GPUs)
• Substantial investment required
  – Only a subset of operations map to GPU without
    significant algorithm changes
    – Bottleneck associated with memory access
      to/from off-CPU boards; not enough memory to
      offload “entire algorithm”
  – Lack of “off the shelf” vendor libraries; multiple
    development environments (OpenCL / CUDA)
• Some “low hanging fruit” (e.g., matrix factorization)
  – Available now (beta) on GP-GPUs
Q4. Breakthrough Performance…..
•Could you please comment on
how your products could
potentially evolve near- or mid-term, leading to substantially
higher levels of performance for
your customers?
Q4. Breakthrough Performance…..
• New, more scalable solvers
• With promise to extend scaling to 1000s+ of cores
• Robustness is key (takes time)
• Vector processing paradigms (multicore, GPU)
Parallel execution of multiple design points
• Full automation of parametric updates
• Human productivity and compute throughput
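Parallel execution of multiple design points can be sketched as a parameter sweep in which every point is an independent run. The example below is only an illustration: `run_design_point` is a hypothetical stand-in for launching a real solver job, and the thickness parameter and mass formula are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def run_design_point(thickness_mm: float) -> dict:
    """Hypothetical stand-in for one solver run at a given design point.
    A real workflow would submit a solver job here and parse its results."""
    return {"thickness_mm": thickness_mm, "mass": 2.5 * thickness_mm}


def sweep(points, max_workers: int = 4):
    """Evaluate all design points concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_design_point, points))


if __name__ == "__main__":
    for result in sweep([1.0, 2.0, 4.0]):
        print(result)
```

Because the design points do not depend on each other, throughput in a setup like this grows with the number of concurrent runs rather than with the scaling of any single analysis.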
Q4. Breakthrough Performance…..
New features continuously implemented (Electromagnetics,
Acoustics, Frequency response, Compressible/incompressible
fluids, Isogeometric elements).
Multiscale capabilities under development to have initial
release this year.
Hybrid MPI/OpenMP promises major scalability boost at high
# processors for both explicit and implicit solutions – scaling
to 1000’s of nodes for both explicit & implicit solvers.
Replace prototype testing by simulation:
Strict modeling guidelines for analysts
A single FE model for crash, NVH, durability, etc.
Advance in Constitutive models, Contact, FSI with SPH, ALE,
Particle methods, Sensors and Control Systems, and complete
compatibility with NASTRAN
Manufacturing simulations (in LS-DYNA, Moldflow, etc.) to
provide initial conditions for crash simulations.
Q4. Breakthrough Performance…..
• Not only the solver runtime is important,
but how the solver use impacts design
• New paradigms of designing products
• Integration of design methods with solvers
• Advancing use of multi-CPU
• Advancing numerical techniques
Q4. Breakthrough Performance…..
Abaqus/Explicit is unlikely to see major breakthrough in near to mid
term, but will show steady incremental improvement
Customer base execution of Abaqus/Standard for large jobs
exceeded customer adoption about 3 years ago (large model scaling
to 128 to 256 cores)
Takes time for customers to get credible performance data and to
change hardware available in order to adapt to a shift in scalability
For implicit FEA hard to get away from “Nastran node” for several
SIMULIA investigating next possible jumps in performance
For Abaqus/Standard working on “strong scaling” gains (i.e. deliver
scalability throughout problem size range)
Beyond the “more cores” approach potential of GPGPU is great, but
need a programming breakthrough
Q4. Breakthrough Performance…..
Our goal is to cut down the TOTAL simulation time.
Meshing/CAD interfacing. We have gone from weeks
of preparation time to hours. We have made enormous
breakthroughs in our ability to process “dirty” CAD.
Post-processing – recently cut the time to output a
specific set of plots from 40 hours to 1 hour.
Strategies to deal with larger models and transients,
including use of parallel I/O.
Customization and integration with the client’s own
workflows and processes.
Solver efficiency alone is not the only important
measure of performance.
