ESX Performance Troubleshooting

VMware Technical Support
Broomfield, Colorado
Confidential
© 2009 VMware Inc. All rights reserved
What is slow performance?
•What does slow performance mean?
• Application responds slowly - latency
• Application takes longer time to do a job – throughput
Both are related to time
•Interpretation varies wildly
• Slower than expectation
• Throughput is low
• Latency is high
• Throughput, latency fine but uses excessive resources (efficiency)
•What are high latency, low throughput, and excessive
resource usage?
• These are subjective and relative
Bandwidth, Throughput, Goodput, Latency
Bandwidth vs. Throughput
• Higher Bandwidth does not guarantee Throughput.
• Low Bandwidth is a bottleneck for higher Throughput
Throughput vs. Goodput
• Higher Throughput does not mean higher Goodput
• Low Throughput is indicative of lower Goodput
Efficiency = Goodput/Bandwidth
Throughput vs. Latency
• Low Latency does not guarantee higher Throughput and vice versa
• Throughput or Latency alone can dominate performance
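As a rough illustration of the efficiency formula above, here is a minimal sketch. The link numbers are hypothetical and only show that a busy link (high throughput) can still deliver modest goodput.

```python
def efficiency(goodput: float, bandwidth: float) -> float:
    """Efficiency = Goodput / Bandwidth (both in the same unit)."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    return goodput / bandwidth


# Hypothetical numbers, in Mb/s, for a single link.
bandwidth = 1000.0   # what the link can carry
throughput = 800.0   # what is actually transferred, including overhead
goodput = 600.0      # the part of the transfer that is useful to the application

print(f"link utilization: {throughput / bandwidth:.0%}")
print(f"efficiency:       {efficiency(goodput, bandwidth):.0%}")
```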
[Figure: Bandwidth, Throughput, Goodput, and Latency]
How to measure performance?
Higher throughput does not necessarily mean higher
performance – Goodput could be low
Throughput is easy to measure, but Goodput is not
How do we measure performance?
• Performance is actually never measured
• We could only quantify different metrics that affect performance.
These metrics describe the state of: CPU, memory, disk and
network
Performance Metrics
CPU
• Throughput: MIPS (%used), Goodput: useful instructions
• Latency: Instruction Latency (cache latency, cache miss)
Memory
• Throughput: MB/Sec, Goodput: useful data
• Latency: nanosecs
Storage
• Throughput: MB/Sec, IOPS, Goodput: useful data
• Latency: Seek time
Networking
• Throughput: MB/Sec, IO/Sec, Goodput: useful traffic
• Latency: microseconds
Hardware and Performance
CPU
• Processor Architecture: Intel XEON, AMD Opteron
• Processor cache – L1, L2, L3, TLB
• Hyperthreading
• NUMA
Hardware and Performance
Processor Architecture
• Clock speeds of one architecture are not comparable with those of another
 P-III outperforms P4 on a clock by clock basis
 Opteron outperforms P4 on a clock by clock basis
• A higher clock speed is not always beneficial
 Bigger cache or better architecture may outperform higher clock speeds
• Processor memory communication is often the performance
bottleneck
 Processor wastes hundreds of instruction cycles while waiting on memory
access
 Caching alleviates this issue
Hardware and Performance
Processor Cache
• Cache reduces memory access latency
• Bigger cache increases cache hit probability
• Why not build bigger cache ?
 Expensive
 Cache access latency increases with cache size
• Cache is built in stages – L1, L2, L3 – with varying cache access
latency
• ESX benefits from larger cache sizes
• L3 cache seems to boost performance of networking workloads
Hardware and Performance
TLB – Translation Lookaside Buffer
• Every running process needs virtual address (VA) to physical
address (PA) translation
• Historically this translation was done entirely from page tables in
memory
• Since memory access is significantly slower and a process needs
this translation on every memory access, the TLB was introduced
• The TLB is hardware circuitry that caches VA to PA mappings
• When a VA is not in the TLB, a TLB miss occurs and the mapping has
to be loaded from the page tables in memory (load latency)
• Performance of application depends on effective use of TLB
• TLB is flushed during context switch
Hardware and performance
Hyperthreading
• Introduced with Pentium 4 and Xeon processors
• Allows simultaneous execution of two threads on a single processor
• HT maintains separate architectural states for the same processor
but shares underlying processor resources such as execution units
and cache
• HT strives to improve throughput by taking advantage of processor
stalls on the logical processor
• HT performance could be worse than UniProcessor (non-HT)
performance if the threads have a high cache hit rate (more than 50%)
Hardware and Performance
Multicores
• Cores have their own L1 Cache
• L2 Cache is shared between processors
• Cache coherency is relatively faster compared to SMP systems
• Performance scaling is same as SMP systems
Hardware and performance
NUMA
• Memory contention increases as the number of processors increases
• NUMA alleviates memory contention by localizing memory per
processor
Hardware and Performance - Memory
Node Interleaving
• Opteron processors support two types of memory access: NUMA
mode and Node Interleaving mode
• Node interleaving mode alternates memory pages between
processor nodes so that memory latencies are made uniform.
This can offer performance improvements to systems that are not
NUMA aware
• Single-core Opteron systems have only one core per NUMA node.
• An SMP VM on ESX running on a single-core Opteron system has
to access memory across the NUMA boundary, so SMP VMs may
benefit from Node Interleaving
• On dual-core Opteron systems a single NUMA node has two
cores, so NUMA mode can be turned on.
Hardware and Performance – I/O devices
I/O Devices
• PCI-E, PCI-X, PCI
 PCI at 66MHz – 533 MB/s
 PCI-X at 133 MHz – 1066 MB/s
 PCI-X at 266 MHz – 2133 MB/s
 PCI-E bandwidth depends on the number of Lanes, x16 Lanes - 4GB/s,
each Lane adds 250 MB/s.
• PCI bus saturation – dual port, quad port devices
 In PCI protocol the bus bandwidth is shared by all the devices in the bus.
Only one device could communicate at a time.
 PCI-E allows parallel full duplex transmission with the use of Lanes
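The bus figures above follow from simple arithmetic. The sketch below assumes a 64-bit parallel bus for PCI and PCI-X (the width is not stated on the slide, but it is the width that reproduces the quoted numbers) and the 250 MB/s per lane quoted for PCI-E. The function names are illustrative only.

```python
def parallel_bus_mb_per_s(clock_mhz: float, width_bits: int = 64) -> float:
    """Peak bandwidth of a shared parallel bus: one transfer of width_bits per clock."""
    return clock_mhz * width_bits / 8


def pcie_mb_per_s(lanes: int, per_lane_mb_per_s: float = 250.0) -> float:
    """Peak PCI-E bandwidth per direction, using the per-lane figure from the slide."""
    return lanes * per_lane_mb_per_s


# Nominal clocks; the slide rounds them to 66, 133 and 266 MHz.
for name, clock in [("PCI", 66.66), ("PCI-X", 133.33), ("PCI-X", 266.66)]:
    print(f"{name} at {clock:.0f} MHz: {parallel_bus_mb_per_s(clock):.0f} MB/s")

for lanes in (1, 4, 8, 16):
    print(f"PCI-E x{lanes}: {pcie_mb_per_s(lanes):.0f} MB/s")
```

Note that the parallel bus figure is shared by every device on the bus, while the PCI-E figure applies per link.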
Hardware and Performance – I/O Devices
SCSI
• Ultra3/Ultra 160 SCSI – 160 MB/s
• Ultra320 SCSI – 320 MB/s
• SAS 3Gbps – 300 MB/s duplex
FC
• Speed constrained by Medium, Laser wavelength
• Link Speeds: 1G FC – 200 MB/s, 2G – 400 MB/s, 4G – 800 MB/s,
8G – 1600 MB/s
ESX Architecture
Performance Perspective
ESX Architecture – Performance Perspective
CPU Virtualization – Virtual Machine Monitor
• ESX does not trap and emulate every instruction; the x86 architecture
does not allow this
• System calls and Faults are trapped by the monitor
• Guest code runs in one of three contexts
 Direct execution
 Monitor code (fault handling)
 Binary Translation (BT - non virtualizable instructions)
• BT behaves much like a JIT compiler
• Previously translated code fragments are stored in translation cache
and reused – saves translation overhead
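The benefit of the translation cache can be sketched as plain memoization. This is an illustration only, not the monitor's implementation: the class name, the `translate` callback and the fragment addresses are all hypothetical.

```python
class TranslationCache:
    """Caches translated code fragments keyed by guest address (illustrative)."""

    def __init__(self, translate):
        self._translate = translate   # hypothetical, expensive translation step
        self._cache = {}
        self.translations = 0         # how many times we paid the translation cost

    def fetch(self, fragment_addr: int) -> str:
        if fragment_addr not in self._cache:
            self._cache[fragment_addr] = self._translate(fragment_addr)
            self.translations += 1
        return self._cache[fragment_addr]


# A loop that keeps returning to the same fragment is translated only once.
tc = TranslationCache(lambda addr: f"translated@{addr:#x}")
for addr in (0x1000, 0x2000, 0x1000, 0x1000, 0x2000):
    tc.fetch(addr)
print(f"fragments executed: 5, translations performed: {tc.translations}")
```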
ESX Architecture – Performance Implications
Virtual Machine Monitor – Performance implications
• Programs that don’t fault or invoke system calls run at near native
speeds – e.g. gzip
• Micro-benchmarks that do nothing but invoke system calls will incur
nothing but monitor overhead
• Translation overhead varies with different privileged instructions.
Translation cache tries to offset some of the overhead.
• Applications will have varying amount of monitor overhead
depending on their call stack profile.
• Call stack profile of an application can vary depending on its
workload, errors and other factors.
• It is hard to generalize monitor overheads for any workload. Monitor
overheads measured for an application are strictly applicable only to
“Identical” test conditions.
ESX Architecture – Performance Perspective
Memory virtualization
• Modern OSes set up page tables for each running process. x86
paging hardware (TLB) caches VA – PA mappings
• Page table shadowing – additional level of indirection
 VMM maintains PA – MA mappings in a shadow table
 Allows the guest to use x86 paging hardware with the shadow table
• MMU updates
 VMM write protects shadow page tables (trace)
 When the guest updates page table, monitor kicks in (page fault) and
keeps shadow page table consistent with the physical page table
• Hidden page faults
 Trace faults are hidden to the guest OS - monitor overhead.
 Hidden page faults are similar to TLB misses on native environments
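The extra level of indirection can be pictured as composing two mappings. The sketch below uses plain dictionaries and made-up page numbers. It assumes the shadow table maps VA directly to MA, which is what lets the paging hardware use it. Function names are hypothetical.

```python
def build_shadow(guest_pt: dict, pmap: dict) -> dict:
    """Compose guest VA -> PA with VMM PA -> MA into a shadow VA -> MA table."""
    return {va: pmap[pa] for va, pa in guest_pt.items()}


def guest_update(guest_pt: dict, shadow: dict, pmap: dict, va: int, new_pa: int) -> None:
    """Model of a traced write: the guest changes its page table and the
    monitor (the hidden page fault) brings the shadow entry back in sync."""
    guest_pt[va] = new_pa
    shadow[va] = pmap[new_pa]


guest_pt = {0x10: 0x1, 0x11: 0x2}            # guest page table: VA -> PA
pmap = {0x1: 0xA0, 0x2: 0xB7, 0x3: 0xC4}     # VMM mapping: PA -> MA
shadow = build_shadow(guest_pt, pmap)

guest_update(guest_pt, shadow, pmap, va=0x11, new_pa=0x3)
print({hex(va): hex(ma) for va, ma in shadow.items()})
```

Every call to `guest_update` stands for work the monitor does on the guest's behalf, which is why MMU-heavy workloads see more overhead.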
ESX Architecture – Performance Perspective
Page table shadowing
ESX Architecture – Performance Implications
Context Switches
• On Native hardware TLB is flushed during a context switch. Newly
switched process will incur TLB miss on first memory access.
• VMM caches Page Table Entries (PTE) during context switches
(caching MMU). We try to keep the Shadow PTE consistent with the
Physical PTE
• If there are lots of processes running in the guest, and they context
switch frequently, VMM may run out of PT caching.
Workload=terminalservices increases this cache size (vmx).
Process creation
• Every new process created requires new PT mapping. MMU
updates are frequent
• Shell scripts that spawn commands can cause MMU overhead
ESX Architecture – Performance Perspective
I/O Path
ESX Architecture – Performance Perspective
I/O Virtualization
• I/O devices are non-virtualizable and are therefore emulated for
the guest OS
• VMkernel handles Storage and Networking devices directly as they
are performance critical in server environments. CDROM, floppy
devices are handled by the service console.
• I/O is interrupt driven and therefore incurs monitor overhead. All I/O
goes through VMkernel and involves a context switch from VMM to
VMKernel
• Latency of networking device is lower and therefore delay due to
context switches can hamper throughput
• VMkernel fields I/O interrupts and delivers it to correct VM. From
ESX 2.1, VMKernel delivers the interrupts to the idle processor.
ESX Architecture – Performance Perspective
Virtual Networking
• Virtual NICs
 Queue buffer could overflow
- if the pkt tx/rx rate is high
- VM is not scheduled frequently
 VMs are scheduled when they have packets for delivery
 Idle VMs still receive broadcast frames. Wastes CPU resources.
 Guest Speed/duplex settings are irrelevant.
• Virtual Switches don’t learn MAC addresses
 VMs register MAC address, virtual switch knows the location of the MAC
• VMnics
 Listens for the MAC addresses that are registered by the VMs.
 Layer 2 Broadcast frames are passed above
ESX Architecture – Performance Perspective
NIC Teaming
• Teaming only provides outbound load balancing
• NICs with different capabilities could be teamed
 Least common Capability in the bond is used
• Out-MAC mode scales with number of VMs/virtual NICs. Traffic from
a single virtual NIC is never load balanced.
• Out-IP scales with the number of Unique TCP/IP sessions.
• Incoming traffic can come on the same NIC. Link aggregation on the
physical switches provides inbound load balancing.
• Packet reflections can cause performance hits in the guest OS. No
empirical data available.
• We fail back when the link comes up again.
 Performance could be affected if the link flip-flops.
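The difference between Out-MAC and Out-IP scaling can be shown with a small sketch. The CRC32 hash and the function names are assumptions chosen for illustration. The slides do not describe the hash ESX actually uses.

```python
import zlib


def pick_uplink_out_mac(src_mac: str, n_uplinks: int) -> int:
    """Out-MAC style: the uplink depends only on the virtual NIC's MAC address."""
    return zlib.crc32(src_mac.encode()) % n_uplinks


def pick_uplink_out_ip(src_ip: str, dst_ip: str, n_uplinks: int) -> int:
    """Out-IP style: the uplink depends on the source/destination IP pair."""
    return zlib.crc32(f"{src_ip}->{dst_ip}".encode()) % n_uplinks


uplinks = 2
mac = "00:50:56:00:00:01"                      # hypothetical virtual NIC
destinations = [f"10.0.1.{i}" for i in range(1, 21)]

mac_choices = {pick_uplink_out_mac(mac, uplinks) for _ in destinations}
ip_choices = {pick_uplink_out_ip("10.0.0.1", d, uplinks) for d in destinations}

print(f"Out-MAC: one virtual NIC uses {len(mac_choices)} uplink(s)")
print(f"Out-IP:  one virtual NIC uses {len(ip_choices)} uplink(s)")
```

With Out-MAC, a single virtual NIC always lands on one uplink no matter how many sessions it opens. With Out-IP, its sessions can spread across uplinks.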
ESX Architecture – Performance Perspective
vmxnet optimizations
• vmxnet handles cluster of packets at once – reduces context
switches and interrupts
• Clustering kicks in only when the packet receive/transmit rate is
high.
• vmxnet shares memory area with VMkernel – reduces copying
overhead
• vmxnet can take advantage of TCP checksum and Segmentation
offloading (TSO)
• NIC Morphing – allows loading the vmxnet driver for a vlance virtual
device. Probes a new register with the vlance device.
• Performance of a NIC-morphed vlance device is the same as the
performance of a vmxnet virtual device.
ESX Architecture – Performance Perspective
SCSI performance
• Queue depth determines the SCSI throughput. When the queue is
full, SCSI I/O’s are blocked limiting effective throughput.
• Stages of Queuing
 Buslogic/LSILogic -> VMkernel Queue -> VMkernel Driver Queue depth -> Device Firmware Queue -> Queue depth of the LUN
• Sched.numrequestOutstanding – number of outstanding I/O
commands per VM – see KB 1269
• Buslogic driver in Windows limits the queue depth to 1 – see KB
1890
• Registry settings available for maximizing queue depth for LSILogic
adapter (Maximum Number of Concurrent I/Os)
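Why queue depth bounds throughput can be sketched with Little's law (outstanding I/Os = IOPS x latency). The stage names follow the queuing stages above. The depths and the 5 ms latency are hypothetical values, not ESX defaults.

```python
def effective_queue_depth(stage_depths: dict) -> int:
    """The narrowest stage in the chain limits how many I/Os can be outstanding."""
    return min(stage_depths.values())


def max_iops(queue_depth: int, latency_s: float) -> float:
    """Little's law upper bound: IOPS <= outstanding I/Os / per-I/O latency."""
    return queue_depth / latency_s


# Hypothetical depths for each stage of the chain.
stages = {
    "guest adapter (Buslogic/LSILogic)": 32,
    "VMkernel queue": 16,
    "VMkernel driver queue": 32,
    "device firmware queue": 256,
    "LUN queue": 64,
}

depth = effective_queue_depth(stages)
print(f"effective queue depth: {depth}")
print(f"IOPS ceiling at 5 ms latency: {max_iops(depth, 0.005):.0f}")
print(f"IOPS ceiling with a guest depth of 1: {max_iops(1, 0.005):.0f}")
```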
ESX Architecture – Performance Perspective
VMFS
• Uses larger block sizes (1MB default)
 Larger block size reduces Metadata size – metadata is completely cached
in memory
 Near native speeds are possible, because metadata overhead is removed
 Fewer I/O operations. Improves read-ahead cache hits for sequential
reads
• Spanning
 Data is filled to the other LUN sequentially after overflow. There is no
striping.
 Does not offer performance improvements.
• Distributed Access
 Multiple ESX hosts can access the VMFS volume, only one ESX host
updates the meta-data
ESX Architecture – Performance Perspective
VMFS
• Volume Locking
 Metadata updates are performed through locking mechanism
 SCSI reservation is used to lock the volume
 Do not confuse this locking with the file level locks implemented in the
VMFS volume for different access modes
• SCSI reservation
 SCSI reservation blocks all I/O operations until the lock is released by the
owner
 SCSI reservation is held usually for a very short time and released as
soon as the update is performed
 SCSI reservation conflict happens when SCSI reservation is attempted on
a volume that is already locked. This usually happens when multiple ESX
hosts contend for metadata updates
ESX Architecture – Performance Perspective
VMFS
• Contention for metadata updates
 Redo log updates from multiple ESX hosts
 Template deployment with redo log activity
 Anything that changes/modifies file permission on every ESX host
• VMFS 3.0 uses new volume locking mechanism that significantly
reduces the number of SCSI reservations used
ESX Architecture – Performance Perspective
Service Console
• Service console can share Interrupt resources with VMkernel.
Shared interrupt lines reduce performance of I/O devices – KB 1290
• MKS is handled in the service console in ESX 2.x, and its
performance is determined by the resources available in the COS
• The default Min CPU allocated is 8% and may not be sufficient if
there are lots of VMs running
• Memory recommendations for the service console do not account for
memory that will be used by the agents
• Scalability of VMs is limited by the COS in ESX 2.x. ESX 3.x avoids this
problem with userworlds for VMkernel.
Understanding ESX Resource
Management & Over-Commitment
ESX Resource Management
Scheduling
• Only one VCPU runs on a CPU at any time
• Scheduler tries to run the VM on the same CPU as much as possible
• Scheduler can move VMs to other processors when it has to meet the CPU
demands of the VM
Co-scheduling
• SMP VMs are co-scheduled, i.e. all the VCPUs run on their own
PCPUs/LCPUs simultaneously
• Co-scheduling facilitates synchronization/communication between
processors, like in the case of spinlock wait between CPUs
• Scheduler can run a VCPU without the other for a short period of time (1.5
ms)
• Guest could halt the co-scheduled CPU, if it is not using it, but Windows
doesn’t seem to halt the CPU – wastes CPU cycles
ESX Resource Management
NUMA Scheduling
• Scheduler tries to schedule the world within the same NUMA node
so that cross NUMA migrations are fewer
• If a VM’s memory pages are split between NUMA nodes, the
memory scheduler slowly migrates all the VM’s pages to the local
node. Over time the system becomes completely NUMA balanced.
• On NUMA architecture, CPU utilization per NUMA node gives better
idea of CPU contention
• While factoring %ready, factor the CPU contention within the same
NUMA node.
ESX Resource Management
Hyperthreading
• Hyperthreading support was added in ESX 2.1, recommended
• Hyperthreading increases scheduler’s flexibility especially in the
case of running SMP VMs with UP VMs
• A VM scheduled on a LCPU is charged only half the “package
seconds”
• Scheduler tries to avoid scheduling a SMP VM onto the logical
CPUS of the same package
• A high priority VM may be scheduled to a package with one of its
LCPUs halted – this prevents other running worlds from using the
same package
ESX Resource Management
HTSharing
• Controls hyperthreading behavior with individual VMs.
• htsharing=any
 Virtual CPUs could be scheduled on any LCPUs. Most flexible option for the
scheduler.
• htsharing=none
 Excludes sharing of LCPUs with other VMs. VM with this option gets a full package
or never gets scheduled.
 Essentially this excludes the VM from using logical CPUs (useful for the security
paranoid). Use this option if an application in the VM is known to perform poorly with
HT.
• htsharing=internal
 Applies to SMP VMs only. This is same as none, but allows sharing the same
package for the VCPUs of the same VM. Best of both worlds for SMP VMs.
 For UP VMs this translates to none
ESX Resource Management
HT Quarantining
• ESX uses P4 Performance counters to constantly evaluate HT
performance of running worlds
• If a VM appears to interact badly with HT, the VM is automatically
placed into a quarantining mode (i.e. htsharing is set to none)
• If the bad events disappear, the VM is automatically pulled back
from quarantining mode
• Quarantining is completely transparent
ESX Resource Management
CPU affinity
• Defines a subset of LCPUs/PCPUs that a world could run on
• Useful to
 partition server between departments
 troubleshoot system reliability issues
 For manually setting NUMA affinity in ESX 1.5.x
 applications that benefit from cache affinity
• Caveats
 Worlds that don’t have affinity can run on any CPU, so they have better chance of
getting scheduled
 Affinity reduces the scheduler's ability to maintain fairness – min CPU guarantees
may not be possible under some circumstances
 NUMA optimizations (page migrations) are excluded for VMs that have CPU affinity
(can enforce manual memory affinity)
 SMP VMs should not be pinned to LCPUs
 Disallows vMotion operations
ESX Resource Management
Proportional Shares
• Shares are used only when there is a resource contention
• Unused shares (shares of a halting/idling VM) are partitioned across
active VMs.
• In ESX 2.x shares operate on a flat namespace
• Changing shares of one world affects the effective CPU cycles
received by other running worlds.
• If VMs use a different share scale then shares for other worlds
should be changed to the same scale
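How shares divide CPU under contention can be sketched as a simple proportion. This is a simplification that assumes every active VM wants as much CPU as it can get and ignores min/max settings. The VM names, share values and capacity are hypothetical.

```python
def cpu_entitlement(shares: dict, active: set, capacity_mhz: float) -> dict:
    """Split capacity among active VMs in proportion to their shares.
    Idle VMs get nothing, so their shares are effectively redistributed."""
    total = sum(shares[vm] for vm in active)
    if total == 0:
        return {vm: 0.0 for vm in shares}
    return {
        vm: (capacity_mhz * shares[vm] / total) if vm in active else 0.0
        for vm in shares
    }


shares = {"web": 1000, "mail": 1000, "db": 2000}

print(cpu_entitlement(shares, {"web", "mail", "db"}, 4000.0))
print(cpu_entitlement(shares, {"web", "mail"}, 4000.0))   # db is idle
```

The second call shows why changing or idling one world affects what every other running world receives.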
ESX Resource Management
Minimum CPU
• Guarantees CPU resources when the VM requests them
• Unused resources are not wasted; they are given to other worlds that
require them.
• Setting min CPU to 100% (200% in case of SMP) ensures that the
VM is not bound by the CPU resource limits
• Using min CPU is favored over using CPU affinity or proportional
shares
• Admission control verifies if Min CPUs could be guaranteed when
the VM is powered on or VMotioned
ESX Resource Management
Demystifying “Ready” time
• A powered-on VM can be running, halted, or in a ready state
• Ready time signifies the time spent by a VM on the run queue waiting to be
scheduled
• Ready time accrues when more than one world wants to run at the same
time on the same CPU
 PCPU, VCPU over-commitment with CPU intensive workloads
 Scheduler constraints - CPU affinity settings
• Higher ready time increases response time or job completion time
• Total accrued ready time is not useful
 A VM could have accrued ready time during its runtime without incurring performance
loss (for example during boot)
• %ready = ready time accrual rate
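Treating %ready as an accrual rate means comparing two samples of accrued ready time rather than reading the total. A minimal sketch, assuming the accrued ready time is available in seconds at the start and end of a sampling interval. The sample values are hypothetical.

```python
def pct_ready(ready_start_s: float, ready_end_s: float, interval_s: float) -> float:
    """Percentage of the sampling interval the VM spent waiting on the run queue."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return 100.0 * (ready_end_s - ready_start_s) / interval_s


# A VM that accrued 120 s of ready time at boot but only 1 s in the last 5 s.
print(f"%ready over the last interval: {pct_ready(120.0, 121.0, 5.0):.1f}")

# The same total with no new accrual is not a current problem.
print(f"%ready with no new accrual:    {pct_ready(121.0, 121.0, 5.0):.1f}")
```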
ESX Resource Management
Demystifying “Ready” time
• There are no good/bad values for %ready.
 Depends on the priority of the VMs - latency sensitive applications may
require less or no ready time
• Ready time could be reduced by increasing the priority of the VM
 Allocate more shares, set minCPU, remove CPU affinity
ESX Resource Management
Unexplained “Ready” time
• If the VM accrues ready time while there are enough CPU resources
then it is called “Unexplained Ready time”
• There is some belief in the field that such a thing actually exists –
hard to prove or disprove
• Very hard to determine if CPU resources are available when ready
time accrues
 CPU utilization is not a good indicator of CPU contention
 Burstiness is very hard to determine
 NUMA boundaries – All VMs may contend within the same NUMA node
 Misunderstanding of how scheduler works
ESX Resource Management
Resource Management in ESX 3.0
• Resource Pools
 Extends hierarchy. Shares operate within the resource pool domain.
• MHz
 Resource allocations are absolute, based on clock cycles. %-based
allocations could vary with processor speeds.
• Clusters
 Aggregates resources from multiple ESX hosts
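The point about MHz-based allocation can be seen with one line of arithmetic. The clock speeds below are hypothetical.

```python
def pct_to_mhz(pct: float, core_mhz: float) -> float:
    """Convert a percentage-based CPU allocation into absolute MHz for one core."""
    return core_mhz * pct / 100.0


# The same 25% allocation means different amounts of work on different hosts.
for core_mhz in (2000.0, 3000.0):
    print(f"25% of a {core_mhz:.0f} MHz core = {pct_to_mhz(25.0, core_mhz):.0f} MHz")
```

An allocation expressed directly in MHz stays the same when a VM moves between hosts with different processor speeds.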
Resource Over-Commitment
CPU Over-Commitment
• Scheduling
 Too many things to do!
 Symptoms: high %ready
 Judicious use of SMP
• CPU utilization
 Too much to do!
 Symptoms: 100% CPU
 Things to watch
- Misbehaving applications inside the guest
- Do not rely on Guest CPU utilization – halting issues, timer interrupts
- Some applications/services seem to impact guest halting behavior. No longer tied
to SMP HALs.
Resource Over-Commitment
CPU Over-Commitment
• Higher CPU utilization does not necessarily mean lower
performance.
 Application’s progress is not affected by higher CPU utilization
 However if higher CPU utilization is due to monitor overheads then it may
impact performance by increasing latency
 When there is no headroom (100% CPU), performance degrades
• 100% CPU utilization and %ready are almost identical – both delay
application progress
• CPU Over-Commitment could lead to other performance problems
 Dropped network packets
 Poor I/O throughput
 Higher latency, poor response time
Resource Over-Commitment
Memory Over-Commitment
• Guest Swapping - Warning
 Guest page faults while swapping.
 Performance is affected by both guest swapping and due to monitor overhead
handling page faults.
 Additional disk I/O
• Ballooning – Serious
• VMkernel Swapping - Critical
• COS Swapping - Critical
 VMX process could stall and affect the progress of the VM
 VMX could be a victim of random process killed by the kernel
 COS requires additional CPU cycles, for handling frequent page faults and disk I/O
• Memory shares determine the rate of ballooning/swapping
Resource Over-Commitment
Memory Over-Commitment
• Ballooning
 Ballooning/swapping stalls processor, increases delay
 Windows VMs touch all allocated memory pages during boot. Memory
pages touched by the guest can be reclaimed only by ballooning
 Linux guests touch memory pages on demand. Ballooning kicks in only
when the guest is under complete memory pressure
 Ballooning could be avoided by using min=max
 /proc/vmware/sched/mem
- size <> sizetgt indicates memory pressure
- mctl > mctlgt – ballooning out (giving away pages)
- mctl < mctlgt – ballooning in (taking in pages)
 Memory shares affect ballooning rate
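The comparisons listed for /proc/vmware/sched/mem can be written down as a small classifier. The sketch takes the values as plain numbers and does not parse the proc node, whose exact layout is not shown here. The function names are hypothetical, and the field names are taken as written on the slide.

```python
def under_memory_pressure(size: int, sizetgt: int) -> bool:
    """size <> sizetgt indicates memory pressure."""
    return size != sizetgt


def balloon_state(mctl: int, mctlgt: int) -> str:
    """Interpret the current balloon size against its target."""
    if mctl > mctlgt:
        return "ballooning out (giving away pages)"
    if mctl < mctlgt:
        return "ballooning in (taking in pages)"
    return "balloon at target"


# Hypothetical readings for one VM.
print(under_memory_pressure(size=512, sizetgt=448))
print(balloon_state(mctl=0, mctlgt=64))
```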
Resource Over-Commitment
Memory Over-Commitment
• VMKernel Swapping
 Processor stalls due to VMkernel swapping are more expensive than
ballooning (due to disk I/O)
 Do not confuse this with
- Swap reservation: Swap is always reserved for worst case scenario if
min<> max, reservation = max – min
- Total swapped pages: Only current swap I/O affects performance
 /proc/vmware/sched/mem-verbose
- swpd – total pages swapped
- swapin, swapout – swap I/O activity
 SCSI I/O delays during VMKernel I/O swapping could result in system
reliability issues
Resource Over-Commitment
I/O bottlenecks
• PCI Bus saturation
• Target device saturation
 Easy to saturate storage arrays if the topology is not designed correctly for load
distribution
• Packet drops
 Effective throughput reduces
 Retransmissions can cause congestion
 Window size scales down in the case of TCP
• Latency affects throughput
 TCP is very sensitive to Latency and packet drops
• Broadcast traffic
 Multicast and broadcast traffic sent to all VMs.
• Keep an eye on Pkts/sec and IOPS and not just bandwidth consumption
ESX Performance
Application Performance issues
ESX Performance – Application Issues
Before we begin
• From the VM's perspective, a running application is just an x86 workload.
• Any application performance tuning that makes the application run more
efficiently will help
• Application performance can vary between versions
 New version could be more or less efficient
 Tuning recommendations could change
• Application behavior could change based on its configuration
• Application performance tuning requires intimate knowledge of how the
application behaves
• Nobody at VMware specializes in application performance tuning
 Vendors should optimize their software with the thought that the hardware resources
could be shared by other Operating Systems.
 TAP program
- SpringSource (unit of VMware) – Provides developer support for API scripting
ESX Performance – Application issues
Citrix
• Roughly 50-60% monitor overhead – takes 50-60% more CPU cycles than
on the native machine
• The maximum number of users limit is hit when the CPU is maxed out –
roughly 50% of users as would be seen on native environment with an
apples to apples comparison.
• Citrix Logon delays
 This could happen even on native machines when roaming profiles are configured.
Refer to Citrix and MS KB articles
 Monitor overhead can introduce logon delays
• Workarounds
 Disable com ports, workload=terminalservices, disable unused apps, scale
horizontally
• ESX 3.0 improves Citrix performance – roughly 70-80% of native
performance
ESX Performance – Application issues
Database performance
• Scales well with vSMP – recommended
 Exceptions: Pervasive SQL – not optimized for SMP
• Two key parameters for database workloads
 Response time
- Transaction logs
 CPU utilization
• Understanding SQL performance is complex. Most enterprise
databases run some sort of query optimizer that changes the SQL
Engine parameters dynamically
 Performance will vary with run time. Typically benchmarking is done after
priming the database
• Memory resource is key. SQL performance can vary a lot depending
on the available memory.
ESX Performance – Application Issues
Lotus Domino Server
• One of the better performing workloads. 80-90% of direct_exec
• CPU and I/O intensive
• Scalability issues – Not a good idea to run all domino servers on the
same ESX server.
ESX Performance – Application Issues
16-bit applications
• 16-bit applications on Windows NT/2000 and above run in a
sandboxed virtual machine
• 16-bit apps depend on segmentation – possible monitor overhead.
• Some 16-bit apps seem to spin in an idle loop instead of halting the CPU
 Consumes excessive CPU cycles
• No performance studies done yet
 No compelling application
ESX Performance – Application Issues
Netperf – throughput
• Max Throughput is bound by a variety of parameters
 Available Bandwidth, TCP window size, available CPU cycles
• VM incurs additional CPU overhead for I/O
• CPU utilization for networking varies with
 Socket buffer size, MTU – affects the number of I/O operations performed
 Driver – vmxnet consumes fewer CPU cycles
 Offloading features – depending on the driver settings and NIC
capabilities
• For most applications, throughput is not the bottleneck
 Measuring throughput and improving it may not always resolve the
underlying performance issue
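The interaction between available bandwidth, TCP window size and latency can be sketched with the usual window/RTT bound. The link speed, window size and round-trip times below are hypothetical, and the sketch ignores CPU limits and packet loss.

```python
def max_tcp_throughput_mbps(bandwidth_mbps: float, window_bytes: int, rtt_s: float) -> float:
    """A single TCP stream can have at most one window in flight per round trip,
    so throughput is bounded by min(bandwidth, window / RTT)."""
    window_limit_mbps = window_bytes * 8 / rtt_s / 1e6
    return min(bandwidth_mbps, window_limit_mbps)


link_mbps = 1000.0
window = 65535          # bytes, a window without window scaling

for rtt_ms in (0.1, 1.0, 10.0):
    mbps = max_tcp_throughput_mbps(link_mbps, window, rtt_ms / 1000.0)
    print(f"RTT {rtt_ms:>4} ms -> at most {mbps:.1f} Mb/s")
```

Added latency lowers the ceiling even when the link itself is far from saturated.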
ESX Performance – Application Issues
Netperf – Latency
• Latency plays an important role for many applications
• Latency can increase
 When there are too many VMs to schedule
 VM is CPU bound
 Packets are dropped and then re-transmitted
ESX Performance – Application Issues
Compiler Workloads
• MMU intensive: Lots of new processes created, context switched,
and destroyed.
• SMP VM may hurt performance
 Many compiler workloads are not optimized for SMP
 Process threads could ping-pong between the vCPUs
• Workarounds:
 Disable NPTL
 Try UP (don’t forget to change the HAL)
 Workload=terminalservices might help
ESX Performance Forensics
ESX Performance Forensics
Troubleshooting Methodology
• Understand the problem.
 Pay attention to all the symptoms
 Pay less attention to subjective metrics.
• Know the mechanics of the application
 Find how the application works
 What resources it uses, and how it interacts with the rest of the system
• Identify the key bottleneck
 Look for clues in the data and see if that could be related to the symptoms
 Eliminate CPU, Disk I/O, Networking I/O, Memory bottlenecks by running
tests
• Running the right test is critical.
ESX Performance Forensics
Isolating memory bottlenecks
• Ballooning
• Swapping
• Guest MMU overheads
ESX Performance Forensics
Isolating Networking Bottlenecks
• Speed/Duplex settings
• Link state flapping
• NIC Saturation /Load balancing
• Packet drops
• Rx/Tx Queue Overflow
ESX Performance Forensics
Isolating Disk I/O bottlenecks
• Queue depth
• Path thrashing
• LUN thrashing
ESX Performance Forensics
Isolating CPU bottlenecks
• CPU utilization
• CPU scheduling contention
• Guest CPU usage
• Monitor Overhead
ESX Performance Forensics
Isolating Monitor overhead
• Procedures for release builds
 Collect performance snapshots
• Monitor Components
ESX Performance Forensics
Collecting Performance Snapshots
• Duration
• Delay
• Proc nodes
• Running esxtop on performance snapshots
ESX Performance Forensics
Collecting Benchmarking numbers
• Client side benchmarks
• Running benchmarks inside the guest
ESX Performance
Troubleshooting - Summary
ESX Performance Troubleshooting - Summary
Key points
• Address real performance issues. A lot of time can be wasted spinning
wheels on theoretical benchmarking studies
• Real performance issues could be easily described by the end user who
uses the application
• There is no magical configuration parameter that will solve all performance
problems
• ESX performance problems are resolved by
 Re-architecting the deployment
 Tuning application
 Applying workarounds to circumvent bad workloads
 Moving to a newer version that addresses a known problem
• Understanding Architecture is the key
 Understanding both ESX and application architecture is essential to resolve
performance problems
Questions?
