VMware vCenter Operations Manager

Karoly Szalai, Technical Support Engineer
© 2009 VMware Inc. All rights reserved
 What is vCOPs and why is it good for me?
 An example scenario
 Counters and badges
Managing Performance/Capacity in vSphere: the basics
What is vCOPs? Is this just another monitoring system? Boring!
We already have the best (Nagios, Zabbix, HP OpenView, etc.)
No, it’s more than just a monitoring system!
Is it healthy?
• Every VM & ESX
performing well?
Network, Disk?
• Are they behaving
• Any fault on any
Is it enough?
• Enough CPU, RAM,
Network, Disk?
Future risk?
• Time remaining?
• Capacity
• Where are the
“Stress points”
in time?
Is it optimised?
• Which VMs need
• What are my key
• How much can I
claim back from
“fat” VMs?
• How many more
VMs can I put
without impacting
vCOPs is built to complement vCenter
Is it healthy = Health
• Workload
• Anomalies
• Faults
Is it enough = Risk
Time remaining
Capacity remaining
Stress period
Is it optimised = Efficiency
What can we reclaim?
Density, key ratio!
Daily update at midnight!
Bird’s-eye view
This is a small environment
 1 vCenter
 1 Datacenter
 2 clusters
 4 hosts
 9 VMs (including off)
 2 datastores
Visibility across vCenters
Everyday task: performance troubleshooting
 You got an email from the app team, saying the main intranet application was slow
• The email arrived 1 hour ago. It said the application was slow for about 1 hour and was ok after that
• (So it was slow between 1 and 2 hours ago, but it’s ok now. Helpful, isn’t it?)
• You just checked. Everything has indeed been ok for the past hour.
• The application spans 10 VMs in 2 different clusters, 4 datastores and 1 RDM
• You are not familiar with the application. You don’t know which apps run on each VM, as you have no access to the Guest OS
• Your environment: 1 VC, 4 clusters, 35 hosts, 500+ VMs, 30 datastores, 1 midrange array, 10GE FCoE
How do you approach/solve this with just vCenter?
What do you do?
A: smile, as this will be a nice challenge for your TAM/BCS/MCS engineer
B: no sweat, you’re VCDX,CCIE, ITIL master + you can fix your storage fw with a hex editor. You’re born for this
C: send a text: “Honey, this evening is cancelled, I got a better offer”
D: Buy the app team dinner, and tell them to keep quiet.
Everyday task: performance troubleshooting
 The minimum you need to prove
• Performance problem is not caused by your infrastructure, not by your VMware
• Infrastructure: VMware + Storage + Network
• Application: VM + App inside the VMs
 What you should be able to prove
• For each VM, the following was ok during the incident: CPU, RAM, Disk, Network
• The shared infrastructure was also healthy: ESXi, datastore, overall platform
 Ideally you can prove
• Show the exact application-level counters that were slow, with the underlying infrastructure-level counter that caused it = Root Cause Analysis
Challenge 1: details are lost after 1 hour
The first problem: vCenter stores only 1 hour’s worth of data in depth. After an hour, a lot of detail is no longer available!
In real-time performance we have info on 2 cores + 16 different counters
In past-day stats we have only the CPU info of the VM and only 6 counters! A typical ESX host has 12–24 cores. What if the problem is with vSMP?
Challenge 1: details are lost after 1 hour
[Screenshots: Memory counters and Disk counters available in vCenter for data less than 1 hour old vs. more than 1 hour old]
In the meantime in vCOPs
Challenge 2: vSphere and applications
Here is the second challenge: vSphere has no application-awareness!
 You have little idea which VMs make up the application
 or what services are running on each VM
 The only thing you can do is group them via a vApp, as in vCOPs:
In the meantime in vCOPs
Same application
• Health is 89, so it’s good
• It’s been good in the past
6 hours
• The app consists of 4
components: distribution,
analysis, collection and
• We know there are only
2 VMs. So you’re getting
app-level data here!
• You can double click on
each metric to dig
deeper, but full HD
resolution recommended
• You can configure your
tab as you like it.
Another plus is Infrastructure navigator
 Infrastructure navigator is a separate
component in vCOPs (enterprise or
higher level)
 VIN can answer the following
• How many VMs make up this application?
• What services are running on each VM?
• Who is talking to whom? Using what
• Which VMs are protected with DR? You
can even tell which SRM protection group
and SRM protection plan are involved.
 VIN requires vCenter 5, as it relies on
web client (new UI standards)
Analysing data in vCenter can be hard or misleading
Hey! There is an
alarm with high
memory usage!
It’s above 90% for
more than 5 mins!
Analysing data in vCenter can be hard or misleading
Let’s check the performance data in vCenter!
Here is a common example of why a deep understanding of vSphere makes a big difference.
As we can see, this host
needs more RAM, doesn’t it?
It’s using 92% for more than a
In the meantime in vCOPs
Configured memory: 16,383 MB
Demand: 5,574 MB (36% of Usable)
Usage: 15,147 MB (98% of Usable)
Usable: 15,430 MB
Normal demand: 4,672 – 8,843 MB
Plenty of headroom! It just saves us from
a costly RAM upgrade project!
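The arithmetic behind this slide can be checked with a short sketch. It reads the slide's figures as whole megabytes and takes usable memory as 15,430 MB, which is an assumption chosen because it reproduces the 36% and 98% shown. The function and variable names are made up for illustration and are not part of any vCOPs API.

```python
# Figures taken from the slide, in MB (usable_mb is an assumed reading).
usable_mb = 15430
demand_mb = 5574
usage_mb = 15147


def percent_of(value_mb, total_mb):
    """Express value_mb as a percentage of total_mb."""
    return 100.0 * value_mb / total_mb


demand_pct = percent_of(demand_mb, usable_mb)
usage_pct = percent_of(usage_mb, usable_mb)
# Headroom measured against demand, not against usage.
headroom_mb = usable_mb - demand_mb

print(f"Demand: {demand_pct:.0f}% of usable")
print(f"Usage:  {usage_pct:.0f}% of usable")
print(f"Headroom based on demand: {headroom_mb} MB")
```

Judged on usage alone the host looks full. Judged on demand, roughly two thirds of usable memory is still free.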
Counters and badges
 A vCenter farm with only 50 ESXi hosts and 500 VMs will have more than 10,000 counters!
• It is impossible to look at them all, so let vCOPs analyse them.
 vCenter presents raw counters
• E.g. what does a Ready time of 1500 in the Real Time chart mean? Is a value of 2000 in the Real Time chart better than a value of 75000 in the Daily chart?
• Is memory usage of 90% at ESXi level good or bad?
• Is 300 IOPS good or bad for datastore XYZ?
 A single counter can be misleading
• Low CPU usage does not mean the VM is getting the CPU, if there is a limit, contention or co-stop.
• Disk performance is measured with different counters at multiple layers (VM, kernel, physical)
 Different counters have different units
• GHz, %, MB, kbps, IO/s, ms
• This makes analysis even more complex
Derived counters
• Standardise the scale into 0 - 100
• Can be >100 if demand is unmet
• 1 universal unit, minimises the “translation” in our head
• Universal. Apply to CPU, RAM, Disk, Net etc.
• Counters are derived using sophisticated formulas, not just
• For the same counter, different objects use different formulas
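To illustrate why raw counters need "translation", here is a minimal sketch that converts CPU Ready summation values (milliseconds) into a percentage of the sampling interval. The 20-second real-time and 300-second past-day intervals are vSphere's usual defaults and are treated here as assumptions. The function name is hypothetical. Values are totals for the chart line, so for a multi-vCPU VM you would also divide by the number of vCPUs.

```python
REALTIME_INTERVAL_S = 20   # assumed real-time chart sampling interval
DAILY_INTERVAL_S = 300     # assumed past-day chart sampling interval


def ready_percent(ready_ms, interval_s):
    """Convert a CPU Ready summation (ms) into % of the sampling interval."""
    return 100.0 * ready_ms / (interval_s * 1000.0)


realtime_1500 = ready_percent(1500, REALTIME_INTERVAL_S)
realtime_2000 = ready_percent(2000, REALTIME_INTERVAL_S)
daily_75000 = ready_percent(75000, DAILY_INTERVAL_S)

print(f"1500 ms in Real Time chart = {realtime_1500:.1f}% ready")
print(f"2000 ms in Real Time chart = {realtime_2000:.1f}% ready")
print(f"75000 ms in Daily chart    = {daily_75000:.1f}% ready")
```

Under these assumptions the three raw values only become comparable once they are on the same percentage scale.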
Thresholds: vCOPs does it differently
 vCenter sets static thresholds, which can be misleading
• During peak time, it is common for a VM to reach high utilisation
• A static threshold will generate alerts when it should not
• vSphere admins quickly learn to ignore them, defeating the purpose of alerts to begin with
• During non-peak time, it might be abnormal for a VM to reach even 50% utilisation
• A static threshold will not generate alerts when it should
 vCenter only sets high thresholds
• Do you have any threshold for when CPU or RAM utilisation drops below 5%?
• A drop in IOPS across the entire storage array might be a sign of a terrible day ahead
• It will not alert when:
• Utilisation drops from 75% to 1% when it should not
• Utilisation changes from 5% to 75% when it should not
• We need to plot both the upper and the lower range!
 Each VM differs. The same VM differs depending on day/time
• Intelligence is required to analyse each metric and its expected “normal” behaviour
Dynamic threshold & alerts
 vCenter Operations uses dynamic thresholds
• They are dynamic and personalized down to the individual metric.
• They vary from object to object. 1000 VMs will each have their own thresholds.
• They vary over time. The same CPU Usage counter has different thresholds at different times. This caters for peaks. See the chart below.
• They vary from metric to metric. On an ESX host with 12 cores, each core can have its own CPU Usage threshold.
• You can set hard thresholds if you need to.
• This needs the Enterprise edition. It comes with no static thresholds defined.
• Steps: http://virtual-red-dot.blogspot.com/2012/01/vcenter-operations-5-hard-threshold.html
Notice the range varies
in size
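The idea of a per-metric, time-dependent range can be sketched as follows. This is not the vCOPs algorithm, which is not described in this deck. It is a deliberately simple stand-in that assumes a band of mean ± 2 standard deviations per hour of day. All names and sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def learn_bands(samples, k=2.0):
    """Learn a (low, high) band per hour of day from (hour, value) samples."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    bands = {}
    for hour, values in by_hour.items():
        if len(values) < 2:
            continue  # stdev needs at least two observations
        centre = mean(values)
        spread = stdev(values)
        bands[hour] = (centre - k * spread, centre + k * spread)
    return bands


def is_anomalous(bands, hour, value):
    """True if value falls outside the learned band for that hour."""
    low, high = bands[hour]
    return value < low or value > high


# Hypothetical CPU usage history: busy at 09:00, idle at 03:00.
history = [(9, 70), (9, 75), (9, 80), (3, 4), (3, 5), (3, 6)]
bands = learn_bands(history)

print(is_anomalous(bands, 9, 78))  # inside the 09:00 band
print(is_anomalous(bands, 9, 1))   # sudden drop during peak time
print(is_anomalous(bands, 3, 75))  # spike during quiet time
```

Because the band has a lower edge as well as an upper edge, both the drop from 75% to 1% and the jump from 5% to 75% are flagged, which a single static high threshold would miss.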
Badges – Health
 Answers complex questions like:
• How is the entire virtual data center doing?
• For every cluster, host, datastore, what’s their health?
 Health is the current operational state
• It represents what is wrong now and should be addressed
within 1 day. Thus Health needs to be scored such that if
it’s red, then it really needs attention.
 Weather Map
• Simple way to check that entire farm is healthy
• Shows health of all parent and child objects
• Each square can be VM, ESX, datastore, cluster datacenter,
75 – 100: Normal behaviour
50 – 75: The object is experiencing some problems.
25 – 50: The object might have serious problems. Check, and take action as soon as possible
0 – 25: The object is either not functioning properly or will stop functioning soon
Badges – Workload
 Answers complex questions like:
• For every object, how is Demand vs Supply?
• For every single VM, is CPU/Memory/Disk/Network
• Is any VM not getting what it is entitled to or requires?
• What’s the normal workload range for every object in our vDC?
 Workload is not utilisation or usage
• More accurate than utilisation, as it takes into account more factors than just utilisation
 Workload = (Demand/Entitlement)
• Entitlement is dynamic. Affected by shares, limit, etc.
• Demand ≠ Usage
• Usage may mean passive usage (RAM page is there but no
0 – 80
Workload is not high.
80 – 90
The object is experiencing some
high resource workloads.
90 – 95
Workload on the object is
approaching its capacity in ≥1 areas.
write/read at all
• Score is Max(CPU, RAM, Disk IO, Net IO)
Workload on the object is at or over its
capacity in ≥1 areas.
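A minimal sketch of the Workload formula on this slide: Demand/Entitlement per resource, with the badge score taken as the maximum across resources. The resource names, sample numbers and the lower bound of 95 for the top band are assumptions for illustration. The slide does not show how vCOPs measures demand or entitlement.

```python
def workload_score(demand, entitlement):
    """Return (badge score, per-resource workload %) from demand and entitlement."""
    per_resource = {
        name: 100.0 * demand[name] / entitlement[name] for name in demand
    }
    return max(per_resource.values()), per_resource


def workload_band(score):
    """Map a score onto the bands listed on the slide (top band assumed > 95)."""
    if score > 95:
        return "at or over capacity"
    if score > 90:
        return "approaching capacity"
    if score > 80:
        return "some high resource workloads"
    return "not high"


# Hypothetical VM: units only need to match within each resource.
demand = {"cpu": 1800, "ram": 3000, "disk_io": 40, "net_io": 10}
entitlement = {"cpu": 2000, "ram": 8000, "disk_io": 200, "net_io": 100}

score, detail = workload_score(demand, entitlement)
print(score, workload_band(score), detail)
```

If demand exceeds entitlement for any resource, the score goes above 100, which is how unmet demand shows up.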
Badges – Anomalies
 Answers complex questions like:
• Is our vDC doing as usual? Are there any unexpected
changes (as we have dynamic environment)?
• Which VMs, ESX, cluster, datastore etc are behaving
• … and exactly which counters are the culprits?
 Identifying metric abnormalities
• It needs to learn dynamic ranges of “Normal” for each metric, so give it >3 cycles per metric
• A month-end job means it needs 3 months
• Normal range changes after configuration or application
 Anomalies score
• High number of anomalies:
• Usually an indication of a problem
• Demand change
• Application team changed code/app
• KPI (Key Performance Indicator) metrics impact the anomalies more than non-KPI metrics
0 – 50: Normal Anomaly range
50 – 75: The score exceeds the normal range.
75 – 90: The score is very high.
> 90: Most of the metrics are beyond their thresholds. This object might not be working properly or will stop working
Badges – Faults
 Answers complex questions like:
• What fault do we experience in our vDC?
• For every object, what faults does it have?
 Specific knowledge of which vCenter events
• Which events affect Availability and Performance of which object?
• Pulled from active vCenter events
• Example:
• Loss of redundancy in NICs or HBAs
• Memory checksum errors
• HA failover problems.
• Each fault has a default score
• Highest individual Fault Score drives the Fault object
 Best Practices
• Do not change Fault Threshold
• Use Alerts View to manage Faults. You can Filter it to just show Faults.
0 – 25: No fault is registered on the object
25 – 50: Faults of low importance happen on the object
50 – 75: Faults of high importance happen on the object
> 75: Faults of critical importance happen on the object
Badges – Risk
 Answers complex questions like:
• Do we have risk from performance or capacity in our
vDC? If yes, where are they and how serious?
• Which objects are at risk? What is the specific risk?
 Risk Score takes into account
• Time Remaining
• Capacity Remaining
• Stress
 Risk is an early warning system
• Identifies potential problems that could eventually hurt
the performance
• The Risk Chart shows Risk score over the last 7 days,
giving a view of trend
0 – 50: No problems are expected in the future.
50 – 75: There is a low chance of future problems or a potential problem might occur in the far future.
75 – 100: There is a chance of a more serious problem or a problem might occur in the medium-term future.
The chances of a serious future problem are high or a problem might occur in the near future.
Badges – Time remaining
 Answer complex questions like:
• How much time do we have before we need to buy
more server, storage, network before performance
starts to degrade or we run out of capacity?
• For every cluster, VM, datastore, how much time do we
 Measures time remaining before each
resource type reaches its capacity
• Memory
• Disk (IOPS & Space)
• Network I/O
 Early warning of upcoming provisioning
• Based on Score Provisioning buffer. Default value is 30 days
• Set in “Capacity & Time Remaining” section
Time remaining
50 – 100
> 2x SP Buffer (60 days)
25 – 50
< 2x SP Buffer
Near SP Buffer
< SP buffer (30 days)
Badges – Capacity remaining
 Answer complex questions like:
• How many more VM can we put without impacting
performance or using up capacity?
• For every cluster, VM, datastore, which components (CPU,
RAM, Disk, Network) would run out first?
 Early warning system
• A low score of 1 means you still have >30 days.
• Measures how many more VMs can be placed on the object
 Percentage of Total VM “Slots” Remaining
• Based on the average size of the VM on the object (e.g. VM
• Each object has its OWN VM profile size: Host, Cluster,
Datacenter, Etc.
 From the table, notice value is not linear
• It is also not the same as the Time Remaining threshold.
• A value of 30 means >120 days for capacity but around 40
days for time.
Capacity remaining
>120 days
5 – 10
60 – 120 days
30 – 60 days
<30 days
Capacity remaining calculation
 Determine capacity constraint resources
 Deployed or Powered On VMs
• Powered off VMs only use disk space resources
• Powered on VMs use ALL of the 4 resources
 Calculation example:
• The limit is 40 more VMs
• We have 9 deployed VMs
• 40/(40+9) = 81%
 You can drill down to see details
• You can check all 9 components as shown on right
• This helps to answer the question which components have how
many days or VM left
• Summary = min (all 9 components)
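The calculation example above can be sketched in a few lines. The component names and their values are hypothetical and only three of the nine components are shown. The 40 and 9 come from the slide. Note that 40/(40+9) is 81.6%, so the 81% on the slide matches the value with the decimals dropped.

```python
def capacity_remaining_pct(vms_remaining, vms_deployed):
    """Percentage of total VM slots that are still free."""
    total_slots = vms_remaining + vms_deployed
    return 100.0 * vms_remaining / total_slots


def summary_vms_remaining(per_component):
    """The most constrained component sets the overall limit."""
    return min(per_component.values())


# Hypothetical 'VMs remaining' per component.
components = {"cpu_demand": 55, "memory_demand": 40, "disk_space": 120}

limit = summary_vms_remaining(components)
pct = capacity_remaining_pct(limit, 9)
print(f"{limit} more VMs fit; capacity remaining = {pct:.1f}%")
```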
Badges – Stress
 Answer complex questions like:
• In our vDC, do we have stress points or periods? How bad is it?
• For every cluster, VM, datastore, which ones are experiencing
stress and how bad is it?
 Measures long-term or chronic workload (6 weeks)
• Chart shows a weekly breakdown of Stress for each day/hour, averaged over the last 6 weeks
• Workloads > 70% = “Stressed”
• Threshold Configurable as per screenshot below
Normal score. No action needed
Some of the object resources are
not enough to meet the demands.
5 – 30
The object is experiencing regular
resource shortage.
Most of the resources on the object are
constantly insufficient. The object might
stop functioning properly.
Stress Calculation
Stress Zone
 Stress Score is a % and is based on area of Workload Above “Stress Line”
Threshold compared to the Total Capacity of the object
• Stress Score = (Stress area / Stress Zone) *100
• But max value can be > 100% as the workload can be >100.
 Example
• Stress Line is 70% Workload
• 12% of the area is above the 70% threshold
• Stress Score is 12
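One way to read the formula above is sketched below. It assumes the Stress Zone is the band between the Stress Line and 100% of capacity over the whole period, and that the Stress area is the part of the workload curve above the line. That interpretation, the function name and the sample series are assumptions, not the documented vCOPs calculation.

```python
def stress_score(workloads, stress_line=70.0, capacity=100.0):
    """Area of workload above the stress line as a % of the stress zone."""
    stress_area = sum(max(0.0, w - stress_line) for w in workloads)
    stress_zone = (capacity - stress_line) * len(workloads)
    return 100.0 * stress_area / stress_zone


# Hypothetical hourly workload samples: one hour runs hot at 88%.
samples = [60.0, 60.0, 88.0, 60.0, 60.0]
score = stress_score(samples)
print(score)

# Workload above 100 pushes the score past 100.
overloaded = stress_score([160.0])
print(overloaded)
```

With these sample values 12% of the zone is covered, which reproduces the Stress Score of 12 from the example.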
Badges – Efficiency
 Answer complex questions like:
• Are there optimization opportunities in our vDC?
• How well do we do in terms of VM provisioning? Do
we get them right?
 Efficiency Score factors
• Reclaimable waste
• Density ratio
 Graph Depicts VMs by Percent
• Optimal – Optimally Provisioned VMs
• Waste – Over Provisioned VMs
• Stress – Under Provisioned VMs
• Not used in Efficiency Calculation (see Risk)
The efficiency is good. The resource use on the selected object is optimal.
10 – 25: The efficiency is good, but can be improved. Some resources are not fully
0 – 10: The resources on the selected object are not used in the most optimal way.
The efficiency is bad. Many resources are
Badges – Reclaimable waste
 Answer complex questions like:
• Did we over-provision the VMs in terms of CPU, RAM and Disk? If yes, what’s the degree of over-provisioning?
• For every cluster, VM, datastore, what can we reclaim?
 It identifies the amount of reclaimable
• Memory
• Disk
 Reclaimable Waste = Reclaimable Capacity / Deployed Capacity
• Waste Score = Max(CPU Waste Score, RAM Waste Score, Disk Space Waste Score)
• Disk calculation can also include old snapshots and templates
0 – 50: No resources are wasted on the selected object.
50 – 75: Some resources can be used better.
75 – 100: Many resources are underused
Most of the resources on the selected object are wasted.
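The two formulas on this slide can be combined in a short sketch: a waste ratio per resource, then the maximum across CPU, RAM and disk space. The numbers are hypothetical and the function name is made up for illustration. How vCOPs decides what counts as reclaimable is not covered here.

```python
def waste_score(reclaimable, deployed):
    """Return (overall waste score, per-resource waste %)."""
    per_resource = {
        name: 100.0 * reclaimable[name] / deployed[name] for name in deployed
    }
    return max(per_resource.values()), per_resource


# Hypothetical cluster: vCPUs, GB of RAM, GB of disk space.
deployed = {"cpu": 16, "ram": 32, "disk": 500}
reclaimable = {"cpu": 4, "ram": 24, "disk": 100}

score, detail = waste_score(reclaimable, deployed)
print(score, detail)
```

Here RAM drives the score, so memory is the first place to look when right-sizing.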
Badges – Density
 Answer complex questions like:
• How high can we push our consolidation
ratio before we experience performance
• Now that’s a million dollar question! 
• For every datacenter, cluster, ESXi, what
are our key ratios and how much head room
do we have?
 Contrasts Actual vs Ideal Density
• Identify Optimal Resource Deployment
Before Contention Occurs
• Ideal is based on demand, not simple
• High Density is good. 100 is not too high.
Good consolidation
10 – 25
Some resources are not fully
0 – 10
The consolidation for many resources is
The resource consolidation is extremely
Badge thresholds
There are 2 different thresholds: VM and Infra (ESXi, Cluster, Datastore, etc)
Notice that a Major badge has a different threshold to its minor
Even “similar” badges have different thresholds. Notice Time remaining and Capacity remaining have very different
Using badges together
 Workload High & Anomalies Low & Stress High
• Workload – Object is Running Hot. Potentially Starving for Resources
• Anomalies – Normal Behavior for this timeframe
• Stress – Object is often running under high Workload.
• Add resources
 Workload High & Anomalies Low & Stress Low
• Workload – Object is Running Hot. Potentially Starving for Resources
• Anomalies – Normal Behavior for this timeframe
• Stress – Object usually has enough resources
• Not likely a big problem… a cyclical workload spike?
 Workload High & Anomalies High
• Workload – Object is Running Hot. Potentially Starving for Resources
• Anomalies – Abnormal behavior for this timeframe
• Something is amiss! Immediate attention.
 If there are Alert and Fault too, then it is a sign of a major issue
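The combinations on this slide amount to a small decision table, sketched below. The function name, the boolean inputs and the wording of the returned strings are illustrative. What counts as "high" for each badge depends on the thresholds configured in your environment.

```python
def interpret_badges(workload_high, anomalies_high,
                     stress_high=False, alert_or_fault=False):
    """Turn the badge combinations from the slide into a suggested reading."""
    if not workload_high:
        return "Not covered by this slide"
    if anomalies_high:
        if alert_or_fault:
            return "Sign of a major issue"
        return "Something is amiss: immediate attention"
    if stress_high:
        return "Add resources"
    return "Not likely a big problem: possibly a cyclical workload spike"


print(interpret_badges(True, False, stress_high=True))
print(interpret_badges(True, False, stress_high=False))
print(interpret_badges(True, True))
print(interpret_badges(True, True, alert_or_fault=True))
```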
… at the end
 This is not all! We are just scratching the surface.
• Heat map / Cold map: 2 dimensional chart, great way to show a lot of info on 1
screen about all cluster/host/VM
• Planning: gives visibility for the next 6 months. CPU/memory demand, Disk I/O, Network I/O
• Alerts: normal vs smart alert
• Smart alert relies on the advanced analytics instead of simple raw counters. Not
static, based on Dynamic Threshold. Can do SNMP, SMTP, file.
• Performance chart!
• Capacity management
• Historical utilization trends, resources have been requested vs. needed, how many
VMs fit in my farm?
• Forecast: when will I run out of capacity? What if I add/remove/reconfigure capacity?
• Change events correlated with Performance: enable operations to quickly
understand and resolve performance issues
