The Case for Drill-Ready Cloud Computing
Vision Paper
Tanakorn Leesatapornwongsa and Haryadi S. Gunawi
Cloud Services
• Cheap
• Convenient
• Reliable
Yahoo Mail Disruption
• Hardware failures
• Wrong failover
• Disruptions
  – Some users could not access their mail
  – Some users saw wrong notifications
  – Several days to recover
Outlook Disruption
• Hardware failures
  – Caching servers
• Failover to backend servers worked correctly
• Requests flooded the backend servers
• Service went down
• Microsoft needed to change its software infrastructure
Cloud Outages

Outage       | Root Event        | Supposedly Tolerable Failure | Incorrect Recovery      | Major Outage
Amazon EBS   | Network misconfig | Network partition            | Re-mirroring storm      | Clusters collapsed
Gmail        | Upgrade event     | Servers offline              | Bad request routing     | All routing servers down
App Engine   | Power failure     | 25% machines offline         | Bad failover            | All user apps were degraded
Skype        | Overload          | 30% nodes failed             | Positive feedback loop  | Almost all nodes failed
Google Drive | Network bug       | Network offline              | Timeout during failover | 33% of requests affected
Outlook      | Caching failure   | Failover to backend          | Request flooding        | 7-hour outage
Yahoo Mail   | Hardware failures | Servers offline              | Buggy failover          | 1% of users affected
Journey of Cloud Dependability Research
Fault-Tolerant Systems
Complex failures
• Hard to handle and to implement recovery for correctly
• Recovery protocols are very complex
• Recovery code is one of the most bug-prone parts of a system
Offline Testing
• Thoroughly verify recovery mechanisms
• Fault injection, model checking, stress testing, etc.
• "Mini cluster" that represents production runs
• Testing and production environments are different
  – Cluster, workload, failure
• Orders of magnitude different in scale
  – Facebook used 100 machines to mimic a 3000-machine production run [2011]
• Small start-ups forego the luxury
  – Many tests are much smaller than this
[Figure: a test workload on a mini cluster vs. the real workload in a production run]
Diagnosis
• Helps administrators pinpoint and reproduce the causes of outages
• BUT
  – Post-mortem; does not prevent disruptions
  – Passive approach; waits for outages to happen before diagnosing
Online Testing and Failure Drills
• Administrators inject failures online while customers send real requests
• Users outnumber testers
• Exercises real, deep scenarios
[Figure: administrators inject failures online as customer requests flow through the live system]
A Missing Piece
Employee: "Boss, let's inject failures online using Chaos Monkey."
Boss: "Hmm … Dear beloved customers, thank you for trusting our services, but we accidentally lost your data because of the failure drills that we ran …"
Future of Failure Drills
• Current drills: a team of engineers standing by
• Future: drill-ready clouds
Drill-Ready Cloud Computing
• Automatic failure drills and automatic cancellation
• In a safe, efficient, and easy manner
• Ideally, no engineering effort required
Drill-Ready Cloud Computing
• The administrator writes a drill spec; the drill-ready system takes care of drill mode: failure injection and cancellation (a sketch of this loop follows below)
• Example drill spec: "Kill 25%. If it disrupts, revert back."
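Note: a minimal sketch of what "the system takes care" could mean in practice, assuming a controller with inject/revert hooks. The spec fields, hook names, and SLA check below are illustrative assumptions, not an API from the paper.

```python
import random
import time

# Hypothetical drill spec; field names are illustrative assumptions.
SPEC = {
    "kill_fraction": 0.25,       # "Kill 25%"
    "check_interval_secs": 0.5,
    "max_duration_secs": 3,
}

def sla_violated():
    # Placeholder for "if it disrupts": a real system would check
    # latency/availability metrics against the SLA here.
    return random.random() < 0.05

def run_drill(nodes, spec, inject, revert):
    """Inject failures per the spec, monitor, and auto-revert."""
    victims = random.sample(nodes, int(len(nodes) * spec["kill_fraction"]))
    inject(victims)                          # enter drill mode
    deadline = time.time() + spec["max_duration_secs"]
    try:
        while time.time() < deadline:
            if sla_violated():               # cancellation condition met
                return "cancelled"
            time.sleep(spec["check_interval_secs"])
        return "completed"
    finally:
        revert(victims)                      # always restore normal state

nodes = [f"node-{i}" for i in range(20)]
print("drill", run_drill(nodes, SPEC,
                         inject=lambda v: print("killing", v),
                         revert=lambda v: print("reviving", v)))
```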
Outline
• Safety
• Efficiency
• Usability
• Generality
Safety
Learn about failure implications without suffering through them
• Learn whether data can be lost
  – But without actually losing the data
• Learn whether the SLA can be violated
  – But without violating it for a long time
Safety Solutions
• Normal and drill states
  – Existing systems are not drill-aware
  – "Maintaining 2 states", a normal topology and a drill topology, is the first, most important requirement for drill-ready clouds
  – Revert back to the normal state easily (see the sketch below)
[Figure: normal topology vs. drill topology]
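A minimal sketch of the two-state idea, assuming a system whose membership view can be swapped: drill-time changes go to a shadow copy, so cancelling a drill is a cheap swap back and the normal state is never touched. The class and method names are illustrative.

```python
class DrillableMembership:
    """Keeps a normal view and a drill view of cluster membership.
    Reads during a drill see the drill view; cancelling swaps back."""

    def __init__(self, nodes):
        self.normal = set(nodes)   # ground-truth topology
        self.drill = None          # shadow topology, present only during a drill

    def start_drill(self, killed):
        self.drill = set(self.normal) - set(killed)   # copy, then mutate the copy

    def live_nodes(self):
        return self.drill if self.drill is not None else self.normal

    def cancel_drill(self):
        self.drill = None          # revert: the normal state was never modified

m = DrillableMembership(["a", "b", "c", "d"])
m.start_drill(killed=["d"])
assert m.live_nodes() == {"a", "b", "c"}
m.cancel_drill()
assert m.live_nodes() == {"a", "b", "c", "d"}
```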
Safety Solutions
• Drill state isolation
• Self-cancellation (sketched below)
  – Real failures can happen during the drill
  – A drill master commands drill agents
  – What if a network partition occurs? Agents are left in a limbo state
  – Self-cancellation: agents revert the drill when they cannot contact the master
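A minimal sketch of agent-side self-cancellation, assuming the agent periodically checks reachability of the master; the heartbeat interval and missed-beat threshold are illustrative assumptions.

```python
import time

def agent_loop(master_alive, revert_drill,
               heartbeat_interval=1.0, max_missed=3):
    """Drill agent: keep the injected failure alive only while the master
    is reachable. After max_missed missed heartbeats, assume the worst
    (e.g., a partition) and self-cancel by reverting to the normal state."""
    missed = 0
    while True:
        if master_alive():
            missed = 0
        else:
            missed += 1
            if missed >= max_missed:
                revert_drill()     # self-cancellation: no command needed
                return "self-cancelled"
        time.sleep(heartbeat_interval)

# Usage with a master that disappears after two heartbeats:
beats = iter([True, True, False, False, False])
print(agent_loop(lambda: next(beats, False),
                 lambda: print("reverting injected failure"),
                 heartbeat_interval=0.01))
```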
Safety Solutions
• Drill state isolation
• Self-cancellation
• Safe drill specification
  – Check whether the specification can run safely (see the sketch below)
  – A drill spec states: what failures? how long? cancellation conditions? etc.
  – Example: "Kill 25%. If the SLA is violated, revert back."
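One way to read "check whether the specification can run safely": statically compare the spec against the system's redundancy before injecting anything. The rule below (never kill as many nodes as the replication factor) is an assumed example policy, not the paper's.

```python
def spec_is_safe(spec, cluster_size, replication_factor):
    """Reject specs that could turn a drill into real data loss:
    killing >= replication_factor nodes may destroy every replica
    of some key, so the drill must stay below that bound."""
    to_kill = int(cluster_size * spec["kill_fraction"])
    if to_kill >= replication_factor:
        return False, f"killing {to_kill} nodes may wipe all {replication_factor} replicas"
    if "cancel_condition" not in spec:
        return False, "spec has no cancellation condition"
    return True, "ok"

spec = {"kill_fraction": 0.25, "cancel_condition": "sla_violated"}
print(spec_is_safe(spec, cluster_size=12, replication_factor=3))  # kills 3 -> unsafe
print(spec_is_safe(spec, cluster_size=8,  replication_factor=3))  # kills 2 -> safe
```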
Efficiency
• Failures trigger data migration
• Monetary cost
  – Bandwidth
  – Storage space
• System performance
  – Affects users
Efficiency Solutions
• Low-overhead drill setup and cleanup
  – Do we need to do a real key re-balance? It depends on the objective of the test
  – Yes, if we want to see the impact of background re-balance
  – No, if we want to measure performance when we lose 2 nodes
  – The cost of setup and cleanup depends on the drill objectives, so drill objectives must be part of the drill specification (see the sketch below)
[Figure: key ranges re-balanced across nodes while clients read/write data and the system checks "SLA okay?"]
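A sketch of how the drill objective could steer setup cost, assuming a simple range-partitioned store: a "re-balance impact" drill actually moves the victims' key ranges, while a "degraded performance" drill only masks the nodes and moves no data. All names and the objective strings are illustrative assumptions.

```python
def plan_drill_setup(objective, key_ranges, killed_nodes):
    """Pick the cheapest setup that still answers the drill's question.
    key_ranges: dict mapping node -> (lo, hi) range of keys it owns."""
    if objective == "rebalance_impact":
        # Expensive path: actually migrate the victims' ranges so the
        # background re-balance traffic is real and measurable.
        moves = [(n, key_ranges[n]) for n in killed_nodes]
        return {"migrate": moves, "mask_only": []}
    if objective == "degraded_performance":
        # Cheap path: just hide the nodes; no data migration is needed to
        # observe client latency with the nodes gone.
        return {"migrate": [], "mask_only": list(killed_nodes)}
    raise ValueError(f"unknown drill objective: {objective}")

ranges = {"n1": (1, 15), "n2": (16, 30), "n3": (31, 45), "n4": (46, 60)}
print(plan_drill_setup("rebalance_impact", ranges, ["n3"]))
print(plan_drill_setup("degraded_performance", ranges, ["n3", "n4"]))
```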
Efficiency Solutions
• Low-overhead drill setup and cleanup
• Cheap drill specification
  – Smarter, cheaper specifications can exploit replication progress status
  – If replication is 50% correct → assume that the rest is correct
  – Stop halfway and report success (sketched below)
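A minimal sketch of the "stop halfway" idea, assuming the drill can sample the replication status of individual keys; the 50% threshold comes from the slide, everything else is illustrative.

```python
def drill_replication_check(keys, replica_ok, stop_fraction=0.5):
    """Verify re-replication key by key, but stop early: once a
    stop_fraction prefix is fully correct, assume the remaining keys
    follow the same code path and report success without paying for them."""
    checked = 0
    for key in keys:
        if not replica_ok(key):
            return False, f"key {key} under-replicated"
        checked += 1
        if checked / len(keys) >= stop_fraction:
            return True, f"first {checked}/{len(keys)} keys correct; assuming the rest"
    return True, "all keys correct"

keys = list(range(100))
print(drill_replication_check(keys, replica_ok=lambda k: True))
```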
Usability Solutions
• Declarative drill specification language
  – Needs to describe results and be easy to read and write
  – Example (parsed in the sketch below):
    During peak load
    Kill 5% of machines
    If the SLA is violated > 1 min, cancel the drill
    If recovery is 50% good, stop the drill and report success
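To make such a spec executable, it could be compiled into a plain structure that the drill master acts on. This mini-parser and its keyword grammar are assumptions invented to illustrate what "easy to read and write" might compile to; the paper does not define a concrete syntax.

```python
# Hypothetical grammar: "during <trigger>", "kill <pct>%",
# "if <condition> <action>" -- all keywords are illustrative.
SPEC_TEXT = """\
during peak_load
kill 5%
if sla_violated > 1 mins cancel
if recovery >= 50% report_success
"""

def parse_spec(text):
    """Turn the declarative drill spec into a dict the drill master can act on."""
    spec = {"triggers": [], "actions": []}
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue
        if words[0] == "during":
            spec["triggers"].append(words[1])
        elif words[0] == "kill":
            spec["kill_fraction"] = float(words[1].rstrip("%")) / 100
        elif words[0] == "if":
            spec["actions"].append({"condition": " ".join(words[1:-1]),
                                    "action": words[-1]})
    return spec

print(parse_spec(SPEC_TEXT))
```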
Generality Solutions
• Elasticity drill
• Configuration change drill
• Software upgrade drill
• Security attack drill
Conclusion
• Drill-ready cloud computing
  – A new reliability paradigm
• We are sketching a first draft
• We want your FEEDBACK
Thank You
http://ucare.cs.uchicago.edu
