Performance Management in the virtual world v2.1

Report
Performance Management in the Virtual World
Singapore, Q1 2013
1
Document Information
 This deck is part of a series.
• Part 1 is “Management in the Virtual World: a technical introduction.”
• http://communities.vmware.com/docs/DOC-17841
• Part 2 is “Resource Management in the Virtual World”
• http://communities.vmware.com/docs/DOC-17417
• Part 3 is “Performance Management in the Virtual World”
• http://communities.vmware.com/docs/DOC-22034
• Part 4 is “Capacity Management in the Virtual World”
• http://communities.vmware.com/docs/DOC-21791
• Part 5 is “Chargeback in the Virtual World”
• http://communities.vmware.com/docs/DOC-18593
• Part 6 is “Configuration Management in the Virtual World”
• To be written 
 Related documents
• DR 2.0: a new school of thought
• http://communities.vmware.com/docs/DOC-19992
• Sample Designs for vSphere
• http://communities.vmware.com/docs/DOC-19627
2
This is long and technical material. Use the Section feature to see how it is organised, and read the speaker notes.
Authors & Audience
Iwan ‘e1’ Rahabok VCAP-DCD, TOGAF Certified
Staff SE, Strategic Accounts, VMware
[email protected] | Linkedin.com/in/e1ang
Co-author wanted, needed, appreciated. 
This presentation is created for VMware Administrators. It assumes knowledge of vSphere and vCenter Operations.
3
VM CPU: The 4 States
5
How a VM gets its resource
(Diagram: a vertical scale from 0 up to Provisioned, with marks for Limit, Entitlement, Contention, Usage, Demand and Reservation.)
6
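To make these levels concrete, here is a minimal sketch of how they can relate. The arithmetic is an illustrative assumption for teaching, not vSphere's actual scheduler logic; the `host_free` parameter and all formulas are invented.

```python
# Illustrative only: invented formulas, not vSphere's actual scheduler math.

def entitlement(provisioned, limit, reservation, demand, host_free):
    """Entitlement sits between Reservation and min(Limit, Provisioned),
    rising toward Demand when the host has spare capacity."""
    upper = min(provisioned, limit)            # Limit caps what can be granted
    wanted = min(demand, upper)
    granted = min(wanted, reservation + host_free)
    return max(granted, reservation)           # Reservation is guaranteed

def usage_and_contention(demand, ent):
    usage = min(demand, ent)                   # a VM cannot use more than its entitlement
    contention = max(0, demand - ent)          # unmet demand surfaces as contention
    return usage, contention

# A VM provisioned with 8 GHz, limited to 6, reserving 2, demanding 7,
# on a host with only 1 GHz spare:
ent = entitlement(8, 6, 2, 7, host_free=1)
usage, contention = usage_and_contention(7, ent)
```

The point of the sketch: Usage can be well below Demand without the Guest OS ever knowing, and the gap shows up as Contention.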
Contention: Derived Metric

Counter           Unit
CPU Co-Stop       millisecond
CPU Latency       %
RAM Balloon       KB
RAM Zipped        KB
RAM Swap          KB
CPU Contention    %
RAM Contention    %
7
What do we care about at each layer?

VM layer:
• CPU: Utilisation (%) of each vCore, Run-Queue
• RAM: Utilisation (%), Swap (%), Ballooning (%)
• Disk: Utilisation (Mbps), IOPS, Latency (ms)
• Network: Utilisation (Mbps), Packet drop (qty)

SDDC layer:
• CPU: Utilisation (%) of each pCore, Latency (%), Co-Stop
• RAM: Utilisation (%), Ballooning (%)
• Disk: Utilisation (Gbps), IOPS, Latency (ms)
• Network: Utilisation (Gbps), Packet drop
8
The 2 Sides of Performance

Performance Troubleshooting: fixing a specific issue, now.
Performance Management: preventing problems, the big picture, the future.
9
Performance Troubleshooting: a day in the life…
 You got an email from the app team, saying the main Intranet application was slow.
• The email was 1 hour ago. The email stated that it was slow for 1 hour, and it was ok after that.
• So it was slow between 1-2 hours ago, but ok now.
• You did a check. Everything is indeed ok in the past 1 hour.
• The application spans 10 VMs in 2 different clusters, 4 datastores and 1 RDM
• You are not familiar with the applications. You do not know what apps run on each VM as you have no access to the Guest OS.
• Your environment: 1 VC, 4 clusters, 30 hosts, 300 VM, 20 datastores, 1 midrange array, 10 GE
Test your vSphere knowledge!
How do you solve/approach this with just vSphere?
What do you do?
 A: Smile, as this will be a nice challenge for your TAM/BCS/MCS/RE 
 B: No sweat, you’re VCDX + CCIE + ITIL Master. You’re born for this.
 C: SMS your wife, “Honey, I’m staying overnight at the datacenter  “
 D: Take a blood pressure medicine so it won’t shoot up.
 E: Buy the app team very nice dinner, and tell them to keep quiet.
10
Performance Troubleshooting: a day in the life…
 The minimum you need to prove
• The performance problem was not caused by your infrastructure, or at least not by your VMware platform.
• Infrastructure = VMware + Storage + Network
• Application = VM + App inside the VM
 What you need to prove
• For each of the 10 VM, the following was ok between 1-2 hours ago: CPU, RAM, Disk, network
• To strengthen the above, prove that:
• The shared infrastructure was also healthy: relevant ESX, relevant Datastore
• The overall platform was also healthy.
• No relevant faults that happened 1-2 hours ago.
 What challenges do you face in vSphere to do the above?
• Group discussion: what limitations do you face, if vCenter + vMA + PowerGUI + RVTools is all you have?
 The ideal you need to prove
• Give the list of ports (that the 10 VM use) to network team to ensure the firewall is not dropping them.
• Show the exact application-level counters that are slow, with the underlying infrastructure-level counters that
caused them. In other words: application-specific + root-cause analysis.
11
Performance Troubleshooting: Approach

Is the VM configured with enough resource?
• CPU: low run queue
• RAM: no swapping
• Network: no dropped packets
• Disk: latency below SLA

What is the VM & platform utilisation?
• VM utilisation: CPU, RAM, Disk, Net
• Platform utilisation: CPU, RAM, Disk, Net

Is the platform a bottleneck?
• CPU: Latency + Co-Stop below SLA. CPU is not waiting for Memory or Disk.
• RAM: 0 swapping, 0 balloon, 0 compression
• Disk: KAVG + DAVG below SLA. 0 aborts, 0 resets.
• Network: no dropped packets
12
VM: is it configured with enough resource?
 CPU
• Run-Queue within the Guest
• Contention
• Latency
• Co-Stop
 RAM
• Swapping within the Guest
• Contention
• Balloon
• Compression
• Swap
 Disk
• Read Latency & Write Latency
 Network
• Dropped packets, for both sending and receiving

All these counters should be below SLA. SLA for Tier 1 differs from Tier 3.
How will you prove that the VM’s Demand is being met?
13
Infra: Is the platform a bottleneck?
 CPU
• Contention
• Co-Stop
• Latency
• Demand
 RAM
• Contention
• Balloon
• Compression
• Swap
 Disk
• DAVG: Read Latency & Write Latency
• KAVG: Kernel Latency
 Network
• Dropped packets, for both sending and receiving

All these counters should be below SLA. SLA for Tier 1 differs from Tier 3.
How will you prove that your Infra is coping well?
14
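These per-counter checks lend themselves to automation. A sketch follows: the counter names are invented, and the threshold numbers follow the tiering example used later in this deck.

```python
# Sketch of an automated SLA check. Counter names are invented; the tier
# thresholds follow the tiering example used later in this deck.

SLA = {
    1: {"cpu_contention_pct": 2, "ram_contention_pct": 0,
        "disk_latency_ms": 10, "dropped_packets": 0},
    2: {"cpu_contention_pct": 4, "ram_contention_pct": 5,
        "disk_latency_ms": 20, "dropped_packets": 0},
    3: {"cpu_contention_pct": 6, "ram_contention_pct": 10,
        "disk_latency_ms": 30, "dropped_packets": 0},
}

def violations(tier, counters):
    """Return the counters that breached the tier's SLA in the sampled window."""
    limits = SLA[tier]
    return {name: value for name, value in counters.items() if value > limits[name]}

# A Tier-1 VM sampled over the window between 1 and 2 hours ago:
sample = {"cpu_contention_pct": 3.1, "ram_contention_pct": 0,
          "disk_latency_ms": 8, "dropped_packets": 0}
breaches = violations(1, sample)   # only CPU contention breached Tier 1
```

Running the same sample against Tier 3 returns no breaches: the same counters mean different things in different tiers, which is why the SLA must be part of the check.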
Demo
 Custom Dashboards
• Dashboard 1: Performance Troubleshooting
• Dashboard 2: Utilisation
• Dashboard 3: Generic Performance troubleshooting
• Custom Dashboard creation
• Demo of how the above dashboards were created.
 Application dependency
• Provided by vCenter Infrastructure Navigator
15
Custom Dashboard: Performance Troubleshooting
16
Custom Dashboard: Utilisation
17
Performance Management: Approach

Is it healthy?
• Every VM & ESX performing well? CPU, RAM, Network, Disk?
• Are they behaving expectedly?
• Any fault on any component?

Is it enough?
• Enough CPU, RAM, Network, Disk?
• Capacity remaining?
• Where are the “stress points” in time?

Future risk?
• Time remaining?

Is it optimised?
• Which VMs need adjustment?
• What are my key ratios?
• How much can I claim back from “fat” VMs?
• How many more VMs can I put without impacting performance?
18
Direct Mapping by vCenter Operations
 Is it healthy = Health
• Workload
• Anomalies
• Faults
 Is it enough = Risk
• Time remaining
• Capacity remaining
• Stress period
 Is it optimised = Efficiency
• What can we reclaim?
• Density. Key ratios for management
 Daily update at midnight
 Goes beyond Performance
• Capacity
• Compliance
• Application Dependency
19
Exercise 1: Big Picture
 Your CIO wants a dashboard that shows the entire environment (VMs, compute,
storage, spanning many physical DC) in 1 easy to understand chart.
He wants to see:
• the higher the utilisation, the bigger the object
• the less healthy the object, the brighter its color (e.g. red)
• the more time it has in terms of capacity, the further it is on the horizon.
• the more oversized the object, the higher it is on the chart.
 Example 2
• He wants to see just VM from Production Datacenter
• the worse the compliance to company standard, the bigger the object
• faults should be highlighted in color. The worse the fault, the brighter the color of the object
• the more time it has in terms of capacity, the further it is on the horizon.
• the more abnormal the behaviour of the object, the higher it is on the chart.
20
Answer
21
Exercise 2: Big Picture
 You are in charge of a large virtual platform spanning 10 datacenters.
• It has 10,000 VM, 600 ESXi, 500 datastores.
• At 90% virtualised, your platform is by far the largest in the company.
 CIO wants to have a dashboard that shows her the health of entire infrastructure at a
glance.
• She wants information to be color coded. Green, Yellow, Amber, Red.
• She wants to know the following
• Utilisation, which must cover CPU, RAM, Disk, Network.
• Issues, such as faults
• Abnormal behaviour
How will you show the above?
22
Managing >10,000 virtual objects
23
Visibility across vCenters
24
Demo
 vSphere UI
• Dashboard
• Scoreboard
• Main Tabs
• Configuration
25
Counters and Badges
 A vCenter farm with 100 VM and 10 ESX will have >50,000 counters!
• It is not humanly possible to look at them, let alone analyse them.
 vCenter presents raw counters
• e.g. What does a Ready Time of 1500 in the Real Time chart mean? Is a value of 2000 in the Real Time chart better than a value of 75000 in the Daily chart?
• e.g. Is memory.usage at 90% at ESXi level good or bad?
• e.g. Is an IOPS of 300 good or bad for datastore XYZ?
 A single counter can be misleading
• e.g. Low CPU usage does not mean the VM is getting the CPU, if there is Limit, Contention and Co-Stop.
• e.g. To see disk performance, we need to see multiple counters at multiple layers (VM, kernel, physical).
 Different counters have different units
• GHz, %, MB, kbps, ops/sec, ms
• This makes analysis even more complex.
 Badges standardise the scale into 0 – 100
• 1 universal unit, minimising the “translation” in our head.
• Can be >100 if demand is unmet.
26
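A quick sketch of why raw counters mislead: Ready Time is a millisecond total per sampling interval (20 seconds in real-time charts; past-day charts roll up to 5-minute samples), so the raw numbers only become comparable after converting them to a percentage of the interval. The helper below is illustrative.

```python
# Why raw counters mislead: Ready Time is a millisecond total per sampling
# interval (20 s in real-time charts; past-day charts use 5-minute rollups),
# so only the percentage of the interval is comparable across charts.

def ready_pct(ready_ms, interval_s=20, vcpus=1):
    """Ready summation (ms) -> % of the interval the VM waited for a CPU."""
    return ready_ms / (interval_s * 1000 * vcpus) * 100

investigate = ready_pct(1500)    # 7.5% of a 20 s real-time sample
rt = ready_pct(2000)             # 10% of a 20 s real-time sample
daily = ready_pct(75000, 300)    # 25% of a 5-minute daily-chart sample
# So 2000 in the real-time chart is actually better than 75000 in the
# daily chart, even though the raw number looks far smaller.
```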
Derived Counters
• Universal: apply to CPU, RAM, Disk, Net, etc.
• Derived using a sophisticated formula, not just aggregated.
• For the same counter, different objects use different formulas.
Samples of Derived Metric: Health
 Health Score of an Object = MAX (Abnormal Workload, Faults)
• Abnormal Workload per Metric = Geometric Mean (MAX (Abnormality (Capacity/Entitlement), Abnormality (Demand/Usage)),
Workload)
• Abnormal Workload per Object = Score Aggregation (Abnormal Workload per Metric)
• Fault depends on the object as every object is different:
Cluster = HA Issues = MAX (HA Insufficient Failover Resources, HA Failover In Progress, HA Cannot Find Master)
Host = MAX (Hardware Issues, HA Issues)
Hardware Issues = MAX (Network Issues, Storage Issues, Compute Issues, CIM Issues)
Network Issues = MAX (Network, DVPort, VMNic)
Network = Max_of_all_instances (Network Device)
DVPort = Max_of_all_instances (DVPort Device)
VMNic = Max_of_all_instances (VMNic Device)
Storage Issues = MAX(Storage, SCSI, VMFS heartbeat, NFS server, CIM Storage)
Storage = Max_of_all_instances (Storage Device)
SCSI = Max_of_all_instances (SCSI Device)
VMFS heartbeat = Max_of_all_instances (VMFS heartbeat Device)
NFS server = Max_of_all_instances (NFS server Device)
Compute Issues = MAX (Error, PCIe)
CIM Issues = MAX (Processor, Memory, Fan, Voltage, Temperature, Power, System Board, Battery, Other
Health, IPMI, BMC)
HA Issues = HA Host Status
VM = MAX (FT Issues, HA Issues)
27
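A minimal sketch of the MAX-style rollup above: a parent's fault score is simply the worst score among its children, applied recursively down the tree. The tree contents below are invented for illustration.

```python
# Sketch of the MAX-style rollup above: a parent's fault score is simply
# the worst score among its children. The tree below is invented.

def rollup(node):
    """node is either a leaf score or a dict of named children."""
    if isinstance(node, dict):
        return max(rollup(child) for child in node.values())
    return node

host_faults = {
    "hardware": {
        "network": {"vmnic0": 0, "vmnic1": 75},  # lost NIC redundancy
        "storage": {"hba0": 0},
        "compute": 0,
    },
    "ha": 0,
}
host_score = rollup(host_faults)  # the NIC fault drives the host's fault score
```

The design choice matters: MAX (rather than an average) means one serious fault cannot be diluted by many healthy siblings.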
Threshold: a shift in mindset needed
 vCenter sets “static” thresholds, which can be misleading
• During peak, it is common for a VM to reach high utilisation.
• A static threshold will generate alerts when it should not.
• The vSphere admin quickly learns to ignore them, defeating the purpose of alerts to begin with.
• During non-peak, it might be abnormal for a VM to reach even 50% utilisation.
• A static threshold will not generate alerts when it should have.
 vCenter only sets a high threshold
• Do you set a static threshold for when CPU or RAM utilisation drops below 5%? 
• A drop in the entire array’s storage IOPS might be a sign of a terrible day ahead.
• It will not alert when these happen:
• Utilisation drops from 75% to 1% when it should not.
• Utilisation changes from 5% to 70% when it should not.
• We need to plot both the upper range and the lower range.
 But each VM differs. And the same VM differs depending on day/time… 
• Intelligence is required to analyse each metric and its expected “normal” behaviour.
28
Dynamic threshold & alerts
 vCenter Operations uses dynamic thresholds
• They are dynamic and personalised down to the individual metric.
• They vary from object to object: 1000 VMs will each have their own thresholds.
• They vary from time to time: the same CPU Usage counter has a different threshold at different times. This caters for peaks. See the chart below.
• They vary from metric to metric: on an ESX with 12 cores, each core has its own CPU Usage threshold.
• You can fix hard thresholds if you need to.
• This needs the Enterprise edition. It comes with no static threshold defined.
• Steps  http://virtual-red-dot.blogspot.com/2012/01/vcenter-operations-5-hard-threshold.html
Notice the range varies in size.
29
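As a toy illustration of the idea (not the product's actual algorithms), the sketch below learns a separate "normal" band per hour of day, similar in spirit to a sigma-style approach that assumes hourly cycles. All sample data is invented.

```python
# Toy dynamic threshold: learn a separate "normal" band per hour of day.
# Similar in spirit to a sigma algorithm assuming hourly cycles; the real
# product selects among several far more sophisticated algorithms.
from statistics import mean, stdev

def hourly_bands(samples, k=2.0):
    """samples: list of (hour, value). Returns {hour: (lower, upper)}."""
    by_hour = {}
    for hour, value in samples:
        by_hour.setdefault(hour, []).append(value)
    return {h: (mean(v) - k * stdev(v), mean(v) + k * stdev(v))
            for h, v in by_hour.items() if len(v) > 1}

def is_abnormal(hour, value, bands):
    lower, upper = bands[hour]
    return not (lower <= value <= upper)

# Busy at 09:00, idle at 02:00 -- each hour gets its own range:
history = [(9, 70), (9, 75), (9, 72), (2, 5), (2, 6), (2, 4)]
bands = hourly_bands(history)
# 72% CPU at 09:00 is normal here, while 50% at 02:00 is an anomaly that a
# single static "alert above 80%" threshold would never catch.
```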
Dynamic Threshold Analysis
 DT analysis runs nightly
• New dynamic thresholds are computed for each metric.
 Data categorization
• Tries to identify the stat as linear, multinomial, step function, etc.
• If one of those matches, that DT function is used.
 Otherwise: competition
• Sigma: assumes hourly cycles
• CCPD: tries to find normal cycles
• ACPD: tries to find abnormal cycles
• The winner is assigned based on metric trending accuracy.
 The same metric may get a different DT function on different days.

(Diagram: for each metric, Data Categorization selects Linear DT, Multinomial DT, Sparse Sigma DT, Step Function DT or Quantile Sigma DT; otherwise Sigma, CCPD and ACPD compete via DT Scoring to produce the Dynamic Thresholds.)
30
Dynamic Threshold: Algorithm

One of the DT algorithms models the metric’s transition probabilities as Dirichlet-distributed:

(P_{1,1}, P_{1,2}, …, P_{m,m}) ~ Dirichlet(α_{1,1}, α_{1,2}, …, α_{m,m}, α_{0,0})

with density

f(p_{1,1}, …, p_{m,m}) = [ Γ(α) / ( Γ(α_{0,0}) ∏_{i,j} Γ(α_{i,j}) ) ] · ∏_{i,j} p_{i,j}^{α_{i,j} − 1} · (1 − Σ_{i,j} p_{i,j})^{α_{0,0} − 1}

where α = α_{0,0} + Σ_{i,j} α_{i,j}, Σ_{i,j} p_{i,j} ≤ 1, 0 ≤ p_{i,j} ≤ 1, and Γ(z) = ∫₀^∞ t^{z−1} e^{−t} dt.

The marginal distribution of the i-th row of the transition matrix J is itself Dirichlet: for i = 1, …, m−1 its parameters are (α_{i,1}, α_{i,2}, …, α_{i,m}) plus the remaining mass α − Σ_j α_{i,j}, and for i = m they are (α_{m,1}, α_{m,2}, …, α_{m,m}, α_{0,0}).

It is pretty difficult for a human to beat the computer in analysing this data.
The above is one of the many algorithms applied by vCenter Operations.
31
Analytics
• 7 different analytics areas.
• For the DT feature, there are 8 algorithms.
• Only in the Enterprise Edition.
• These advanced features create Smart Alerts.
32
Discussion Point
Raw Counters vs Derived Counters
Dynamic Threshold vs Static Threshold
33
Performance Management: FAQ
• How is the entire virtual data center doing? What’s the degree of their health?
• For every cluster, host, datastore, what’s their health?
• For every single VM, is CPU/Memory/Disk/Network bound?
• Is any VM not getting what it is entitled to?
• What’s the normal workload range for every object in our vDC?
• Is our vDC doing business as usual today? Or is today a turbulent day with lots of unexpected changes?
• Which VMs, ESX, cluster, datastore, etc are behaving abnormally?
• …. and exactly which counters are the culprits?
• What faults do we experience in our vDC?
• In our vDC, do we have stress points or periods? How bad is it?
• For every cluster, VM, datastore, which ones are experiencing stress and how bad is it?
• How much time do we have before we need to buy more servers, storage, or network, before performance starts
to degrade or we run out of capacity?
• For every cluster, VM, datastore, how much time do we have?
• How many more VMs can we put without impacting performance or using up capacity?
• For every cluster, VM, datastore, which components (CPU, RAM, Disk, Network) would run out first?
• What’s the degree of our compliance?
34
Badge – Health
 Answer complex questions like:
• How is the entire virtual data center doing? What’s the
degree of their health?
• For every cluster, host, datastore, what’s their health?
 Health is a current Operational State.
• It represents what is wrong now that should be
addressed within 1 day. Thus Health needs to be scored
such that if it is red, then it really needs attention.
 Weather Map
• Simple way to check that entire farm is healthy
• For child object, it is replaced with Health Trend
• Shows Health of all parent and child objects
• Each square can be VM, ESX, datastore, cluster, datacenter,
vCenter.
Value       Explanation
75 – 100    Normal behaviour.
50 – 75     The object experiences some problems.
25 – 50     The object might have serious problems. Check and take action as soon as possible.
0 – 25      The object is either not functioning properly or will stop functioning soon.
35
Badge – Workload
 Answer complex questions like:
• For every object, how is Demand vs Supply?
• For every single VM, is CPU/Memory/Disk/Network bound?
• Is any VM not getting what it is entitled to?
• What’s the normal workload range for every object in our vDC?
 Workload is not utilisation or usage
• More accurate than utilisation, as it takes into account more factors than just utilisation.
 Workload = Demand / Entitlement
• Entitlement is dynamic. Affected by shares, limit, etc.
• Demand ≠ Usage. Usage may mean passive usage, e.g. the RAM page is there but no read/write.
• Score is Max (CPU, RAM, Disk IO, Net IO), to bring up the attention.

Value       Explanation
0 – 80      Workload is not high.
80 – 90     The object is experiencing some high resource workloads.
90 – 95     Workload on the object is approaching its capacity in ≥1 area.
>95         Workload on the object is at or over its capacity in ≥1 areas.
36
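The Workload badge formula can be sketched in a few lines; the numbers below are invented, and units only need to match per resource.

```python
# Sketch of the Workload badge: per-resource Demand / Entitlement in %,
# with the badge taking the worst resource. Numbers are invented.

def workload_score(demand, entitlement):
    """Returns (badge score, per-resource breakdown)."""
    per_resource = {r: demand[r] / entitlement[r] * 100 for r in demand}
    return max(per_resource.values()), per_resource

score, detail = workload_score(
    demand={"cpu": 3.8, "ram": 6.0, "disk_io": 40, "net_io": 10},
    entitlement={"cpu": 4.0, "ram": 8.0, "disk_io": 100, "net_io": 100},
)
# CPU is the constrained resource here. Because entitlement (not configured
# size) is the denominator, the score can exceed 100 when demand is unmet.
```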
Badge – Anomalies
 Answer complex questions like:
• Is our vDC doing business as usual today? Or is today a turbulent day with lots of unexpected changes?
• Which VMs, ESX, cluster, datastore, etc are behaving abnormally?
• …. and exactly which counters are the culprits?
 Identifying metric abnormalities
• It needs to learn dynamic ranges of “Normal” for each metric, so give it >3 cycles per metric.
• A month-end job means it needs 3 months.
• The normal range changes after configuration or application changes.
 Anomalies score
• A high number of anomalies is usually an indication of a problem:
• Demand change
• Application team changed code/app
• KPI metrics impact the Anomalies score more than non-KPI metrics.

Value       Explanation
0 – 50      Normal Anomaly range.
50 – 75     The score exceeds the normal range.
75 – 90     The score is very high.
> 90        Most of the metrics are beyond their thresholds. This object might not be working properly or will stop working soon.
37
This virtual DC spans multiple vCenters. vCenter Ops shows all the counters that are behaving abnormally.
38
Badge – Faults
 Answer complex questions like:
• What faults do we experience in our vDC?
• For every object, what faults does it have?
 Specific knowledge of which vCenter Events
• Which events affect Availability and Performance of which object?
• Pulled from active vCenter events.
• Examples:
• Loss of redundancy in NICs or HBAs
• Memory checksum errors
• HA failover problems
• Each fault has a default score (e.g. 25, 50, 75, 100).
• The highest individual fault score drives the object’s Fault score.
 Best Practices:
• Do not change the Faults Threshold.
• Use the Alerts View to manage Faults. Filter it to just show Faults.

Value       Explanation
0 – 25      No fault is registered on the object.
25 – 50     Faults of low importance happen on the object.
50 – 75     Faults of high importance happen on the object.
> 75        Faults of critical importance happen on the object.
39
Badge – Risk
 Answer complex questions like:
• Do we have risk from performance and capacity in
our vDC? If yes, where are they and can you
quantify the seriousness?
• Which objects are at risk? What is the specific
risk?
 Risk Score takes into account
• Time Remaining
• Capacity Remaining
• Stress
 Risk is an early warning system.
• Identifies potential problems that could eventually
hurt the performance
• The Risk Chart shows Risk score over the last 7
days, giving a view of the trend.
40
Value       Explanation
0 – 50      No problems are expected in the future.
50 – 75     There is a low chance of future problems, or a potential problem might occur in the far future.
75 – 100    There is a chance of a more serious problem, or a problem might occur in the medium-term future.
100         The chances of a serious future problem are high, or a problem might occur in the near future.
Badge – Time Remaining
 Answer complex questions like:
• How much time do we have before we need to buy more servers, storage, or network, before performance starts to degrade or we run out of capacity?
• For every cluster, VM, datastore, how much
time do we have?
 Measures time remaining before each
resource type reaches its capacity
• CPU
• Memory
• Disk (IOPS & Space)
• Network I/O
 Early warning of upcoming provisioning
needs
• Based on Score Provisioning buffer. Default
value is 30 days.
• Set in “Capacity & Time Remaining” section
41
Value       Time remaining
50 – 100    > 2x SP Buffer (60 days)
25 – 50     < 2x SP Buffer
< 25        Near SP Buffer
0           < SP Buffer (30 days)
Badge – Capacity Remaining
 Answer complex questions like:
• How many more VMs can we put without impacting performance or using up capacity?
• For every cluster, VM, datastore, which components (CPU, RAM, Disk, Network) would run out first?
 Early warning system
• Even a low score of 1 means you still have >30 days.
• Measures how many more VMs can be placed on the object.
 Percentage of Total VM “Slots” Remaining
• Based on the average size of the VMs on the object (the VM profile).
• Each object has its OWN VM profile size: Host, Cluster, Datacenter, etc.
 From the table, notice the value is not linear
• It is also not the same as the Time Remaining threshold.
• A value of 30 means >120 days for capacity, but around 40 days for time.

(Screenshot: 333 more VMs correlates to 77% Capacity Remaining for this object.)

Value       Capacity remaining
> 10        > 120 days
5 – 10      60 – 120 days
0 – 5       30 – 60 days
0           < 30 days
42
Capacity Remaining Calculation
 Determine Capacity Constraint Resource
 Deployed or Powered On VMs
• Powered-off VMs only use disk space resources.
• Powered-on VMs use ALL of the 4 resources.
 Calculation Example Shown:
• Limiting Resource is Disk Space with 333 VMs
available
• Use the Deployed VM number of 99 to do the
calculation for percentage space remaining
• Determine Capacity Remaining
• 333 / (333 + 99) = 77%
43
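The worked example above translates directly into code:

```python
# Capacity remaining per the example above: the share of VM "slots" still
# free on the constraining resource (disk space in this case).

def capacity_remaining_pct(slots_free, vms_deployed):
    return round(slots_free / (slots_free + vms_deployed) * 100)

pct = capacity_remaining_pct(slots_free=333, vms_deployed=99)  # 333/432 = 77%
```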
Capacity and Time details
 You can drill down to see details
• You can check the 9 components, as
shown on the right.
• This helps answer which components have how many days or VMs left!
• Summary = Min (all 9 components)
44
Badge – Stress
 Answer complex questions like:
• In our vDC, do we have stress points or periods? How bad is it?
• For every cluster, VM, datastore, which ones are experiencing stress and how bad is it?
 Measures long-term or chronic workload (6 weeks)
• The chart shows a breakdown of Stress for each day/hour, averaged over the last 6 weeks.
• Workloads > 70% = “Stressed”
• The threshold is configurable, as per the screenshot below.

Value       Explanation
0 – 1       Normal score. No action needed.
1 – 5       Some of the object’s resources are not enough to meet the demands.
5 – 30      The object is experiencing regular resource shortage.
> 30        Most of the resources on the object are constantly insufficient. The object might stop functioning properly.
45
Stress Calculation

(Chart: workload over 6 weeks; the Stress Zone is the band between the 70% Workload Line and 100%, and 12% of that zone sits above the line.)

 Stress Score is a % based on the area of Workload above the “Stress Line” threshold, compared to the Stress Zone of the object.
• Stress Score = (Stress area / Stress Zone) × 100
• But the max value can be > 100%, as workload can be > 100.
 Example
• The Stress Line is 70% Workload.
• 12% of the zone’s area is above the 70% threshold.
• Stress Score is 12.
46
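The formula can be sketched as follows; the 6-week sample series is invented to reproduce the 12% example.

```python
# Stress score sketch: area of workload above the stress line, as a
# fraction of the stress zone (the band between the line and 100%).

def stress_score(workload_series, stress_line=70):
    over = sum(max(0, w - stress_line) for w in workload_series)  # stress area
    zone = (100 - stress_line) * len(workload_series)             # stress zone
    return over / zone * 100

# Hourly workload over 6 weeks: mostly 60%, with hot spells at 85%:
samples = [60] * 760 + [85] * 240
score = stress_score(samples)  # 12, as in the example above
```

Note how the score reflects duration as well as height: brief spikes barely move it, while sustained time above the line accumulates quickly.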
Badge – Efficiency
 Answer complex questions like:
• Are there optimization opportunities in our vDC?
• How well do we do in terms of VM provisioning? Do we get them right?
 Efficiency Score factors
• Reclaimable waste
• Density ratio
 Graph depicts VMs by percent
• Optimal – optimally provisioned VMs
• Waste – over-provisioned VMs
• Stress – under-provisioned VMs
• Not used in the Efficiency calculation (see Risk)
 Three resources considered
• CPU
• Memory
• Disk Space
 Note: VMs can appear in both Stress and Waste.

Value       Explanation
> 25        The efficiency is good. The resource use on the selected object is optimal.
10 – 25     The efficiency is good, but can be improved. Some resources are not fully used.
0 – 10      The resources on the selected object are not used in the most optimal way.
0           The efficiency is bad. Many resources are wasted.
47
Badge – Reclaimable Waste
 Answer complex questions like:
• Did we over-provision the VMs in terms of CPU, RAM and Disk? If yes, what’s the degree of over-provisioning?
• For every cluster, VM, datastore, what can we reclaim?
 It identifies the amount of reclaimable resources
• CPU
• Memory
• Disk
 Reclaimable Waste = Reclaimable Capacity / Deployed Capacity
• Waste Score = Max(CPU Waste Score, RAM Waste Score, Disk Space Waste Score)
• The disk calculation can also include old snapshots and templates.

Value       Explanation
0 – 50      No resources are wasted on the selected object.
50 – 75     Some resources can be used better.
75 – 100    Many resources are underused.
100         Most of the resources on the selected object are wasted.
48
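A sketch of the two formulas above, with invented capacity numbers:

```python
# Reclaimable Waste sketch: per-resource reclaimable / deployed capacity,
# with the badge taking the worst resource. All numbers are invented.

def waste_score(reclaimable, deployed):
    """Returns (badge score, per-resource breakdown), both in %."""
    scores = {r: reclaimable[r] / deployed[r] * 100 for r in deployed}
    return max(scores.values()), scores

score, per_resource = waste_score(
    reclaimable={"cpu_ghz": 10, "ram_gb": 96, "disk_gb": 400},
    deployed={"cpu_ghz": 50, "ram_gb": 128, "disk_gb": 2000},
)
# RAM dominates here: 75% of the deployed RAM is reclaimable, so the badge
# flags RAM as the first place to right-size.
```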
Badge – Density
 Answer complex questions like:
• How high can we push our consolidation ratio before we experience performance problems?
• Now that’s a million dollar question! 
• For every datacenter, cluster, ESXi, what are our key ratios and how much head room do we have?
 Contrasts Actual vs Ideal Density
• Identify optimal resource deployment before contention occurs.
• Ideal is based on demand, not simply configuration.
• High Density is good. 100 is not too high.

Value       Explanation
> 25        Good consolidation.
10 – 25     Some resources are not fully consolidated.
0 – 10      The consolidation for many resources is low.
0           The resource consolidation is extremely low.
49
Badge Thresholds
Disable Color Threshold by
Clicking the Level Off
50
Using badges together
 Workload High & Anomalies Low & Stress High  Add resources.
• Workload – object is running hot, potentially starving for resources.
• Anomalies – normal behavior for this timeframe.
• Stress – object is often running under high workload.
 Workload High & Anomalies Low & Stress Low  Not likely a big problem… a cyclical workload spike?
• Workload – object is running hot, potentially starving for resources.
• Anomalies – normal behavior for this timeframe.
• Stress – object usually has enough resources.
 Workload High & Anomalies High  Something is amiss! Immediate attention.
• Workload – object is running hot, potentially starving for resources.
• Anomalies – abnormal behavior for this timeframe.
 If there are Alerts and Faults too, then it is a sign of a major issue.
51
Discussion Point
Is Badge the way to go?
Are these the right 12 badges?
What other badges do you need?
52
Heat Map
 Built-in heat maps
• Basic: Storage (space, IO), CPU, RAM, Network
• Advanced (or composite): Health, Workload, Capacity
 Custom heat map or cold map
• Since we can change the color, we can actually create a cold map.
• In a cold map, the bigger the square, the colder (less utilised) it is. The bluer it is, the less utilised it is.
• Hence it focuses on Waste.

A great way to show a lot of information on 1 screen. A heat map can quickly highlight information, as it can present relative information. It is good for relative comparison among VMs.

A heat map is a 2-dimensional chart, so it takes 2 parameters. You cannot choose >2 data series. For example, you cannot show the following at the same time:
• IOPS, Latency and Throughput. (These 3 also have different units, so they are hard to combine using a Super Metric.)
• ESX, VM and Datastore.
53
Storage: Datastore, VM, Workload & latency
 Since all the datastores are on the same array, how do we quickly tell the relative
workload generated by every one of them?
• This answers: which datastores are heavily loaded?
 For each of these datastores, how do we know the relative workload generated by
the VM?
• This answers: which VMs dominate within a datastore?
 For every VM, how do we know its performance is within a reasonable range?
• This answers: which VM has a storage bottleneck?
 How do we show all the above data in one page, without the need to show a lot of
numbers?
• And we still want to be able to drill down to each VM and datastore.
54
Each square is a VM. They are grouped by datastore.
Bigger square: bigger throughput
Color: latency.
55
Storage: Throughput & Latency at cluster level
 Which cluster is generating high storage workload?
 Are they getting the SLA they asked for? What’s the latency? The cluster owner wants to
know that his entire cluster is getting <10 ms latency.
 We expect these X, Y, Z clusters to be doing little work. Can we prove this?
Basically, the same concept from
previous slide, but looking from cluster
point of view as Cluster & Datastore has
a Many-to-Many relationship.
56
Storage: Throughput & Latency at cluster level
57
Storage: Throughput & Latency at host level
58
Storage: Throughput & Latency at VM level
Can we show at VM level now?
That’s why you need a 24” monitor 
59
Storage: Space (GB) & Latency
 Any big VM that is not getting the SLA we agreed on?
60
Storage: Datastore space contention
 Do we have space contention at any of the datastore? If yes, how bad is the
contention?
• While we use thin provisioning at the vSphere level (and thick at the array level), we still have a risk of running out of space
due to snapshots, vRAM increase, new VMs, new vDisks, storage vMotion, Storage DRS, etc.
• The higher the contention, the brighter the color.
 Are we running low on capacity in those datastore with high contention?
This requires custom heat map.
We can do a variant of this heat map.
61
Storage: Space contention
 We use thin provisioning
This is a variant of the previous slide’s heat map.
In this variant, we answer the question: are the datastores of the same standard size?
CPU: Contention vs Usage at cluster level
 Which clusters are doing the most work? Which are not doing much?
 How is the CPU workload on every cluster?
 For each of those clusters, can we see if there is CPU contention?
63
CPU: Contention vs Usage at host level
 Same questions as the previous slide, but for hosts.
 We can expect some “drill down” in this heat map.
64
CPU: Contention vs Usage at VM level
Can we show at VM level now?
That’s why you need a 24” full HD
monitor 
65
VM Health
 Heat Map 1: Current Health
• Are all the VMs healthy? Especially those VMs which have high workload!
• Which VMs are experiencing problems?
• Are more demanding VMs less healthy?
• Can we see this by cluster? By host?
 Heat Map 2: Future Health
• Will all the VMs be okay in future (30 days)? Need to check CPU, RAM, Disk IO, Disk Space and
network for every single VM!
• For those VMs which are not ok, can we be specific on which value will run out first? Can we
“drill down” to individual VM?
66
VM: color by health, size by workload
67
VM: color by capacity, size by workload
 This is now showing a future projection. We can see that the VM “vCenter 5” is red: its capacity will run out within 30
days. So we click on it to drill down.
68
Drill down to specific VM
 Screenshot below shows vCenter 5. We can see that it will need more vCPU as it will max out in 10 days.
 We can go as far as 6 months out. This is good enough, as you should not buy hardware >6 months in advance. Buying far
ahead makes sense in the physical world, where capacity is fixed, but is unwise in the virtual world.
69
Discussion Point
Which heat maps are useful for you?
What other heat maps or cold maps do you need?
70
Monitoring the big workload
 You have convinced your CIO to virtualise the remaining 50% of the servers.
 Your CIO needs you to prove, supported by performance charts, that the platform has
served every VM well, meeting the SLA in the past quarter.
• Tier 1 cluster SLA: 2% CPU Contention, 0% RAM Contention, 10 ms disk latency, 0 dropped packets.
• Tier 2 cluster SLA: 4% CPU Contention, 5% RAM Contention, 20 ms disk latency, 0 dropped packets.
• Tier 3 cluster SLA: 6% CPU Contention, 10% RAM Contention, 30 ms disk latency, 0 dropped packets.
 You have 500 VMs on 50 ESXi hosts, 8 clusters, 40 datastores, and 5 RDMs.
 You must prove that:
• Not a single Tier 1 VM had >2% CPU Contention in the past quarter. The underlying ESXi hosts also had <2% CPU contention.
• Not a single Tier 1 VM had >10 ms disk latency in the past quarter. The underlying ESXi hosts also had <10 ms disk latency.
• Etc., for each tier and each component (CPU, RAM, Disk, Network).
What kind of charts do you need to show?
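One way to frame the evidence: reduce each VM's quarter to its peak values and compare those against the tier SLA. A minimal sketch, assuming hypothetical VM records (the SLA table mirrors the tiers above):

```python
# Sketch: check each VM's quarterly-peak metrics against its tier SLA.
# The SLA table mirrors the slide; the VM records are made up.

TIER_SLA = {  # tier: (max CPU contention %, max RAM contention %, max disk latency ms)
    1: (2.0, 0.0, 10.0),
    2: (4.0, 5.0, 20.0),
    3: (6.0, 10.0, 30.0),
}

def sla_breaches(vms):
    """vms: list of dicts with a tier and quarterly-peak metrics.
    Returns the names of VMs that violated their tier's SLA."""
    breaches = []
    for vm in vms:
        cpu_max, ram_max, disk_max = TIER_SLA[vm["tier"]]
        if (vm["cpu_contention"] > cpu_max
                or vm["ram_contention"] > ram_max
                or vm["disk_latency_ms"] > disk_max
                or vm["dropped_packets"] > 0):
            breaches.append(vm["name"])
    return breaches

vms = [
    {"name": "db01", "tier": 1, "cpu_contention": 1.5,
     "ram_contention": 0.0, "disk_latency_ms": 12.0, "dropped_packets": 0},
    {"name": "web01", "tier": 3, "cpu_contention": 5.0,
     "ram_contention": 2.0, "disk_latency_ms": 15.0, "dropped_packets": 0},
]
# db01 breaches its Tier 1 disk-latency SLA (12 ms > 10 ms); web01 is fine.
```

This is exactly the kind of aggregation a super metric (next section) can do continuously, instead of you assembling it from 500 individual VM charts after the fact.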
71
Super Metrics
72
See you 1 April!
VMware office
© 2010 VMware Inc. All rights reserved
Demo
 Super Metric
• Editor
• Package
• Attaching to object
74
Discussion Point
Think of super metrics that you need.
Explain why and how you will need them.
75
Implementation Approach
Define who needs what → Create Super Metrics → Create Applications → Create Tags → Create Heat Maps → Create Dashboards
 Begin with the end in mind
• Every Super Metric must serve a particular role.
• Role, not individual. A person can and will have many heat maps/dashboards.
• Decide if you need the following non-standard info:
• Application-level & Guest-OS-level info
• Info from physical machines (UNIX, x64, etc.)
• Info from physical storage and network (switch, firewall, router, etc.)
 Think in terms of applications
• A great way to complement vSphere, as vCenter does not have this object.
76
Who needs to see what
CIO or CTO
• Simple dashboard.
• Big picture. Tends to be application focused.
• No absolute data; normalised to 0–100.
• Focus on the long term.
• Averaged data: a 30-minute spike will not show up.
• Updated daily.
Group Head (e.g. Head of Infra, Head of Apps)
Dept Head (e.g. Head of Storage, Head of Server, Head of Network, Head of Databases)
Admin/Architect (e.g. Storage Admin, Network Admin, App Owner, VM Owner)
• Rich dashboard. Ideally a Full HD screen.
• Specific info.
• Absolute data + normalised data.
• Focus on the short term.
• Actual data: a 5-minute spike will be visible.
• Updated every 2 minutes.
77
Who needs to see what (samples)
CIO
• Health of overall IT in the past 1 month.
• Health of key applications in the past 1 month.
CTO
• As above, but with more technical content, tailored to the CTO.
Head of Applications
• Health of all key apps in the past 1 month, with the ability to drill down 1 level for each app.
• Capacity projection for all key apps.
Head of Infrastructure
• Health of Storage.
• Health of Network: max dropped packets for the entire infra.
• Health of Servers (VMware and physical).
• Health of VMs.
Head of Storage
• A higher-level, simpler dashboard than the Storage Admin’s.
Head of Network
• Max throughput across the entire infrastructure.
• Max dropped packets.
VMware Team
An App Owner
• The infra is providing each of the VMs in my app with the resources it needs.
78
Designing Super Metric
 Leverage existing derived metrics.
 Leverage objects for which vCenter cannot provide performance data.
• Applications, Resource Pools, Folders and Locations can now have performance counters.
 Minimise static alerts.
 Know what a good range is for the end result.
 Build a simple table to avoid super metric sprawl and duplication of existing metrics.
• Below is an example, showing 2 Super Metrics.
Name: VM SLA
• Purpose: Shows that a VM gets the resources it wants from the infrastructure, based on the defined SLA.
• Target Role: VM Owner
• Formula: VM SLA = 100% - Max (CPU, RAM, Disk, Network), where:
• CPU = CPU Contention %
• RAM = RAM Ballooning %
• Disk = % above threshold latency (Tier 1 Disk SLA is 10 ms; Tier 2 is 20 ms; Tier 3 is 30 ms)
• Network = Packet Drop %
• Good Range: >99% (Tier 1 cluster), >97% (Tier 2 cluster), >95% (Tier 3 cluster)
Name: Infra SLA
• Purpose: Shows that the underlying infra has the resources for all the VMs on it.
• Target Role: VMware Admin
• Formula: Infra SLA = 100% - Max (Host Cluster, Datastore Cluster)
79
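The VM SLA formula can also be expressed directly in code. A minimal sketch; the way disk latency above the tier threshold is converted into a percentage is my assumption for illustration, not the deck's exact definition:

```python
# Sketch of the "VM SLA" super metric:
#   VM SLA = 100% - Max(CPU, RAM, Disk, Network)
# Disk is expressed as "% above threshold latency"; the proportional
# conversion below is an illustrative assumption.

DISK_SLA_MS = {1: 10.0, 2: 20.0, 3: 30.0}  # per-tier disk latency SLA

def vm_sla(cpu_contention, ram_balloon, disk_latency_ms, packet_drop, tier):
    """All inputs except disk_latency_ms and tier are percentages."""
    threshold = DISK_SLA_MS[tier]
    # % above threshold latency, floored at 0 when within SLA
    disk_pct = max(0.0, (disk_latency_ms - threshold) / threshold * 100.0)
    worst = max(cpu_contention, ram_balloon, disk_pct, packet_drop)
    return 100.0 - worst

# A healthy Tier 1 VM: every component well inside its SLA
score = vm_sla(cpu_contention=0.5, ram_balloon=0.0,
               disk_latency_ms=8.0, packet_drop=0.0, tier=1)  # 99.5
```

A score of 99.5 sits inside the Tier 1 "good range" of >99%: the worst-behaving component (0.5% CPU contention) drives the whole score, which is the point of using Max rather than an average.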
Custom Heat Map or Cold Map
CPU
• Heat Map: Resource pools, size by CPU utilisation
• Cold Map: Least utilised VMs: size by vCPU count, color by RAM + CPU usage (a Super Metric)
RAM
• Heat Map: Most RAM-intensive VMs, grouped by ESXi. Size by RAM utilisation, color by health.
Disk
• Heat Map: Most disk-intensive VMs, grouped by ESXi. Size by disk utilisation, color by health.
• Cold Map: Least utilised datastores: size by GB, color by % of free space.
Network
• Heat Map: Most network-intensive VMs, grouped by ESXi. Size by network utilisation, color by health.
• Cold Map: Most idle VMs, grouped by host.
Capacity
• Heat Map: VMs with file systems that will run out soon. Color by % left, size by GB left.
Health
• Heat Map: VM health, grouped by cluster. Color by health, size by workload.
 Design considerations
• Use Super Metrics so the info is richer.
• Group VMs by one consistent hierarchy only. If you group by cluster, it won’t make sense to further group by datastore, as one datastore can span multiple clusters.
80
Demo
 vSphere UI
• Operations
• Details
• Cluster
• Host
• VM
• Datastore
• Events
• All Metrics
• VC Ops Chart vs vCenter chart
 Custom UI
81
Thank you
vCenter Operations presents the datastore with all the details. It also shows the estimated max IOPS!
83
Storage in vCenter Operations
Automatic learning of storage performance.
Calculates both the Demand and the Normal rate.
84
vSphere 5 Performance Chart (fat client)
Can only choose 1 component at a time, e.g. cannot show CPU and RAM at the same time.
85
vSphere 5 Performance Chart (fat client)
Can only show 1 chart at a time.
Hence can only show 2 units at a time.
86
vCenter Operation charts
Can show more than 1 chart at a time. Can combine/split charts.
Can show different data types from different objects.
Lines are color coded, showing when a threshold is breached.
87
CPU counters
88
89
90
91
Demand: Derived Metric
The chart below shows Demand in action.
I generated IOPS on a local datastore, resulting in a spike in latency (read latency went up from 3 ms to 60 ms).
Demand correspondingly went up from 4 to 100!
92
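One way to understand why Demand jumped from 4 to 100 while the IOPS stayed modest: if rising latency is treated as shrinking effective capacity, the same IOPS becomes a much larger share of what the datastore can actually deliver. The model below is purely illustrative (vC Ops' real Demand calculation is not public), and the numbers are chosen to reproduce the 4-to-100 behaviour:

```python
# Illustrative model only — NOT the actual vC Ops Demand algorithm.
# Idea: express load as a % of the rate the device can sustain, where the
# sustainable rate shrinks when delivered latency exceeds the acceptable level.

def demand_pct(observed_iops, max_iops_at_good_latency,
               latency_ms, latency_sla_ms):
    # Latency above SLA proportionally reduces effective capacity,
    # so Demand rises even at the same observed IOPS.
    penalty = max(1.0, latency_ms / latency_sla_ms)
    effective_capacity = max_iops_at_good_latency / penalty
    return min(100.0, observed_iops / effective_capacity * 100.0)

# Latency fine (3 ms vs a 10 ms budget) → Demand stays low
low = demand_pct(observed_iops=200, max_iops_at_good_latency=5000,
                 latency_ms=3.0, latency_sla_ms=10.0)    # 4.0
# Latency spikes to 60 ms → effective capacity collapses, Demand saturates
high = demand_pct(observed_iops=2000, max_iops_at_good_latency=5000,
                  latency_ms=60.0, latency_sla_ms=10.0)  # 100.0
```

The hypothetical datastore capacity (5000 IOPS) and the 10 ms latency budget are assumptions; the takeaway is that Demand is latency-aware, unlike a raw utilisation counter.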
Cluster Overview
93
94
95