overview talk - UCSB Computer Science

Report
Scalable, Consistent, and
Elastic Database Systems for
Cloud Platforms
Sudipto Das
Computer Science, UC Santa Barbara
[email protected]
Sponsors:
Web replacing Desktop
Sudipto Das [email protected]}
2
Paradigm shift in Infrastructure
Sudipto Das [email protected]}
3
Cloud computing

Computing infrastructure
and solutions delivered as a
service
◦ Industry worth USD150 billion by
2014*

Contributors to success
◦ Economies of scale
◦ Elasticity and pay-per-use pricing

Popular paradigms
◦ Infrastructure as a Service (IaaS)
◦ Platform as a Service (PaaS)
◦ Software as a Service (SaaS)
*http://www.crn.com/news/channel-programs/225700984/cloud-computing-services-market-to-near-150-billion-in-2014.htm
Sudipto Das [email protected]}
4
Databases for cloud platforms
Data is central to applications
 DBMSs are mission critical component in
cloud software stack

◦ Manage petabytes of data, drive revenue
◦ Serve a variety of applications (multitenancy)

Data needs for cloud applications
◦ OLTP systems: store and serve data
◦ Data analysis systems: decision support,
intelligence
Sudipto Das [email protected]}
5
Application landscape
 Social
gaming
 Rich
content and
mash-ups
 Managed
applications
 Cloud
application
platforms
Sudipto Das [email protected]}
6
Challenges for OLTP systems
 Scalability
◦ While ensuring efficient transaction
execution!
 Lightweight
Elasticity
◦ Scale on-demand!
 Self-Manageability
◦ Intelligence without a human controller!
Sudipto Das [email protected]}
7
Two approaches to scalability

Scale-up
◦ Preferred in classical
enterprise setting (RDBMS)
◦ Flexible ACID transactions
◦ Transactions access a single node

Scale-out
◦ Cloud friendly (Key value
stores)
◦ Execution at a single server
 Limited functionality & guarantees
◦ No multi-row or multi-step
transactions
Sudipto Das [email protected]}
8
Why care about transactions?
confirm_friend_request(user1, user2)
{
begin_transaction();
update_friend_list(user1, user2, status.confirmed);
update_friend_list(user2, user1, status.confirmed);
end_transaction();
}
Simplicity in application design
with ACID transactions
Sudipto Das [email protected]}
9
confirm_friend_request_A(user1, user2) {
try {
update_friend_list(user1, user2, status.confirmed);
} catch(exception e) {
report_error(e);
return;
}
try {
update_friend_list(user2, user1, status.confirmed);
} catch(exception e) {
revert_friend_list(user1, user2);
report_error(e);
return;
}
}
confirm_friend_request_B(user1, user2) {
try{
update_friend_list(user1, user2, status.confirmed);
} catch(exception e) {
report_error(e);
add_to_retry_queue(operation.updatefriendlist, user1, user2, current_time());
}
try {
update_friend_list(user2, user1, status.confirmed);
} catch(exception e) {
report_error(e);
add_to_retry_queue(operation.updatefriendlist, user2, user1, current_time());
}
}
Sudipto Das [email protected]}
10
Challenge: Transactions at Scale
Scale-out
Key Value Stores
RDBMSs
ACID transactions
Sudipto Das [email protected]}
11
Challenge: Lightweight Elasticity
Provisioning on-demand and not for peak
Resources
Capacity
Demand
Resources
Optimize operating cost!
Capacity
Demand
Time
Time
Traditional Infrastructures
Deployment in the Cloud
Unused resources
Slide Credits: Berkeley RAD Lab
Sudipto Das [email protected]}
12
Challenge: Self-Manageability

Managing a large distributed system
◦
◦
◦
◦
◦
◦

Detecting failures and recovering
Coordination and synchronization
Provisioning
Capacity planning
…
“A large distributed system is a Zoo”
Cloud platforms inherently multitenant
◦ Balance conflicting goals
 Minimize operating cost while ensuring good performance
Sudipto Das [email protected]}
13
Contributions for OLTP systems
 Transactions
at Scale
◦ ElasTraS [HotCloud 2009, UCSB TR 2010]
◦ G-Store [SoCC 2010]
 Lightweight
Elasticity
◦ Albatross [VLDB 2011]
◦ Zephyr [SIGMOD 2011]
 Self-Manageability
◦ Pythia [in progress]
Sudipto Das [email protected]}
14
Contributions
Data Management
Analytics
Transaction Processing
Ricardo
[SIGMOD ‘10]
Dynamic
partitioning
MD-HBase
[MDM ‘11]
Best Paper
Runner up
G-Store
[SoCC ‘10]
Anonimos
[ICDE ‘10],
[TKDE]
Static
partitioning
ElasTraS
[HotCloud ‘09]
[TR ‘10]
Albatross [VLDB ‘11]
Zephyr [SIGMOD ‘11]
This talk
Pythia [in progress]
Sudipto Das [email protected]}
Novel
Architectures
Hyder
[CIDR ‘11]
Best Paper
CoTS
[ICDE ‘09],
[VLDB ‘09]
TCAM
[DaMoN ‘08]
15
Transactions at Scale
Scale-out
Key Value Stores
RDBMSs
ACID transactions
Sudipto Das [email protected]}
16
Scale-out with static partitioning

Table level partitioning (range, hash)
◦ Distributed transactions

Partitioning the Database schema
◦ Co-locate data items accessed together
◦ Goal: Minimize distributed transactions
Sudipto Das [email protected]}
17
Scale-out with static partitioning

Table level partitioning (range, hash)
◦ Distributed transactions

Partitioning the Database schema
◦ Co-locate data items accessed together
◦ Goal: Minimize distributed transactions

Scaling-out with static partitioning
◦
◦
◦
◦
ElasTraS [HotCloud 2009, TR 2010]
Cloud SQL Server [ICDE 2011]
Megastore [CIDR 2011]
Relational Cloud [CIDR 2011]
Sudipto Das [email protected]}
18
Dynamically formed partitions

Access patterns change, often rapidly
◦ Online multi-player gaming applications
◦ Collaboration based applications
◦ Scientific computing applications
Not amenable to static partitioning
 How to get the benefit of partitioning
when accesses do not statically partition?

◦ Ours is the first solution to allow that
Sudipto Das [email protected]}
19
Online Multi-player Games
ID
Name
$$$
Player Profile
Sudipto Das [email protected]}
20
Score
Online Multi-player Games
Execute transactions
on player profiles while
the game is in progress
Sudipto Das [email protected]}
21
Online Multi-player Games
Partitions/groups
are dynamic
Sudipto Das [email protected]}
22
Online Multi-player Games
Hundreds of thousands
of concurrent groups
Sudipto Das [email protected]}
23
Data Fusion for dynamic partitions
[G-Store, SoCC 2010]
Transactional access to a group of data
items formed on-demand
 Challenge: Avoid distributed transactions!
 Key Group Abstraction

◦ Groups are small
◦ Groups execute non-trivial no. of transactions
◦ Groups are dynamic and on-demand

Groups are dynamically formed tenant
databases
Sudipto Das [email protected]}
24
Transactions on Groups
Without distributed transactions
Grouping Protocol
Key
Group
Ownership
of keys at a
single node
One key selected as the
leader
 Followers transfer
ownership of keys to leader

Sudipto Das [email protected]}
25
Why is group formation hard?

Guarantee the contract between
leaders and followers in the presence of:
◦ Leader and follower failures
◦ Lost, duplicated, or re-ordered messages
◦ Dynamics of the underlying system

How to ensure efficient and ACID
execution of transactions?
Sudipto Das [email protected]}
26
Grouping protocol
Log entries
Follower(s)
Create
Request
Leader
L(Joining)
J
JA
L(Joined)
JAA
Group Opns
L(Creating) L(Joined)
Time

L(Free)
D
L(Deleting)
DA
L(Deleted)
Delete
Request
Conceptually akin to “locking”
◦ Locks held by groups
Sudipto Das [email protected]}
27
Efficient transaction processing

How does the leader execute transactions?
◦ Caches data for group members  underlying data
store equivalent to a disk
◦ Transaction logging for durability
◦ Cache asynchronously flushed to propagate updates
◦ Guaranteed update propagation
Leader
Transaction Manager
Cache Manager
Log
Asynchronous update
Propagation
Followers
Sudipto Das [email protected]}
28
Prototype: G-Store [SoCC 2010]
An implementation over Key-value stores
Application Clients
Transactional Multi-Key Access
Grouping middleware layer resident on top of a key-value store
Grouping Transaction
Layer
Manager
Grouping Transaction
Layer
Manager
Grouping Transaction
Layer
Manager
Key-Value Store Logic
Key-Value Store Logic
Key-Value Store Logic
Distributed Storage
G-Store
Sudipto Das [email protected]}
29
G-Store Evaluation

Implemented using HBase
◦ Added the middleware layer
◦ ~10000 LOC
Experiments in Amazon EC2
 Benchmark: An online multi-player game
 Cluster size: 10 nodes
 Data size: ~1 billion rows (>1 TB)
 For groups with 100 keys

◦ Group creation latency: ~10 – 100ms
◦ More than 10,000 groups concurrently created
Sudipto Das [email protected]}
30
G-Store Evaluation
Group creation latency
Group creation throughput
Sudipto Das [email protected]}
31
Lightweight Elasticity
Provisioning on-demand and not for peak
Resources
Capacity
Demand
Resources
Optimize operating cost!
Capacity
Demand
Time
Time
Traditional Infrastructures
Deployment in the Cloud
Unused resources
Slide Credits: Berkeley RAD Lab
Sudipto Das [email protected]}
32
Elasticity in the Database tier
Load Balancer
Application/
Web/Caching
tier
Database tier
Sudipto Das [email protected]}
33
Live database migration

Migrate a database partition (or tenant)
in a live system
◦ Optimize operating cost
◦ Resource orchestration in multitenant
systems

Different from
◦ Migration between software versions
◦ Migration in case of schema evolution
Sudipto Das [email protected]}
34
VM migration for DB elasticity

One DB partition-per-VM
◦ Pros: allows fine-grained load
balancing
◦ Cons
 Performance overhead
 Poor consolidation ratio [Curino et
al., CIDR 2011]

VM
VM
VM
Hypervisor
Multiple DB partitions in
a VM
◦ Pros: good performance
◦ Cons: Migrate all partitions 
Coarse-grained load balancing
Sudipto Das [email protected]}
VM
Hypervisor
35
Live database migration

Multiple partitions share the same
database process
◦ Shared process multitenancy

Migrate individual partitions ondemand in a live system
◦ Virtualization in the database tier

Straightforward solution
◦
◦
◦
◦
Stop serving partition at the source
Copy to destination
Start serving at the destination
Expensive!
Sudipto Das [email protected]}
36
Migration cost measures

Service un-availability
◦ Time the partition is unavailable

Number of failed requests
◦ Number of operations failing/transactions
aborting

Performance overhead
◦ Impact on response times

Additional data transferred
Sudipto Das [email protected]}
37
Two common DBMS architectures

Decoupled storage
architectures
◦ ElasTraS, G-Store, Deuteronomy,
MegaStore
◦ Persistent data is not migrated
◦ Albatross [VLDB 2011]

Shared nothing architectures
◦ SQL Azure, Relational Cloud,
MySQL Cluster
◦ Migrate persistent data
◦ Zephyr [SIGMOD 2011]
Sudipto Das [email protected]}
38
Why is live DB migration hard?

Persistent data must be migrated (GBs)
◦ How to ensure no downtime?

Nodes can fail during migration
◦ How to guarantee correctness during
failures?
 Transaction atomicity and durability
 Recover migration state after failure

Transactions execute during migration
◦ How to guarantee serializability?
 Transaction correctness equivalent to normal operation
Sudipto Das [email protected]}
39
Our approach: Zephyr
[SIGMOD 2011]

Migration executed in phases
◦ Starts with transfer of minimal information to
destination (“wireframe”)

Database pages used as granule of
migration
◦ Unique page ownership


Source and destination concurrently
execute transactions in one migration phase
Minimal transaction synchronization
 Guaranteed serializability

Logging and handshaking protocols
Sudipto Das [email protected]}
40
Simplifying assumptions
 For this talk
◦ Transactions access a single partition
◦ No replication
◦ No structural changes to indices

Extensions in the paper [SIGMOD 2011]
◦ Relaxes these assumptions
Sudipto Das [email protected]}
41
Design overview
P1
Owned Pages
P2
P3
Pn
Active transactions
TS1,…,
TSk
Source
Destination
Page owned by Node
Page not owned by Node
Sudipto Das [email protected]}
42
Init mode
Freeze indices and migrate wireframe
P1
Owned Pages
Active transactions
P2
P3
P1
P2
P3
Pn
Pn
Un-owned Pages
TS1,…,
TSk
Source
Destination
Page owned by Node
Page not owned by Node
Sudipto Das [email protected]}
43
What is an index wireframe?
Source
Destination
Sudipto Das [email protected]}
44
Dual mode
Requests for un-owned pages can block
P1
P2
P3
Pn
Old, still active
transactions
P3 accessed by
TDi
P1
P2
P3
P3 pulled
from source
Pn
TSk+1,…,
TSl
TD1,…,
TDm
Source
Destination
Index wireframes remain frozen
New transactions
Page owned by Node
Page not owned by Node
Sudipto Das [email protected]}
45
Finish mode
Pages can be pulled by the destination, if needed
P1
P2
P3
P1
P2
P3
Pn
P1, P2, …
pushed from
source
Pn
TDm+1,…
, TDn
Completed
Source
Destination
Page owned by Node
Page not owned by Node
Sudipto Das [email protected]}
46
Normal operation
Index wireframe un-frozen
P1
P2
P3
Pn
TDn+1,…,
TDp
Source
Destination
Page owned by Node
Page not owned by Node
Sudipto Das [email protected]}
47
Artifacts of this design

Once migrated, pages are never pulled back
by source
◦ Abort transactions at source accessing the
migrated pages

No structural changes to indices during
migration
◦ Abort transactions (at both nodes) that make
structural changes to indices

Destination “pulls” pages on-demand
◦ Transactions at the destination experience higher
latency compared to normal operation
Sudipto Das [email protected]}
48
Serializability

Only concern is “dual mode”
◦ Init and Finish: only one node is executing
transactions



Local predicate locking of internal index and
exclusive page ownership  no phantoms
Strict 2PL  Transactions are locally
serializable
Pages transferred only once
◦ No Tdest  Tsource conflict dependency

Guaranteed serializability
Sudipto Das [email protected]}
49
Recovery

Transaction recovery
◦ For every database page, Tsrc  Tdst
◦ Recovery: transactions replayed in conflict
order

Migration recovery
◦ Atomic transitions between migration modes
 Developed logging and handshake protocols
◦ Every page has exactly one owner
 Bookkeeping at the index level
Sudipto Das [email protected]}
50
Correctness

In the presence of arbitrary repeated
failures, Zephyr ensures:
◦ Updates made to database pages are consistent
◦ Failure does not leave a page without an owner
◦ Both source and destination are in the same
migration mode

Guaranteed termination and starvation
freedom
Sudipto Das [email protected]}
51
Implementation

Prototyped using an open source OLTP
database H2
◦
◦
◦
◦

Supports standard SQL/JDBC API
Serializable isolation level
Tree Indices
Relational data model
Modified the database engine
◦ Added support for freezing indices
◦ Page migration status maintained using index
◦ ~6000 LOC

Tungsten SQL Router migrates JDBC
connections during migration
Sudipto Das [email protected]}
52
Results Overview

Downtime (partition unavailability)
◦ S&C: 3 – 8 seconds (needed to migrate, unavailable
for updates)
◦ Zephyr: No downtime. Either source or
destination is available

Service interruption (failed operations)
◦ S&C: ~100s – 1,000s. All transactions with updates
are aborted
◦ Zephyr: ~10s – 100s. Order of magnitude less
interruption

Minimal operational and data transfer
overhead
Sudipto Das [email protected]}
53
Failed Operations
Order of
magnitude
fewer failed
operations
Sudipto Das [email protected]}
54
Concluding Remarks
 Major
enabling
technologies
◦ Scalable distributed
database infrastructure
 ElasTraS
◦ Dynamically formed
data partitions
 G-Store
◦ Live database migration
 Albatross, Zephyr
Sudipto Das [email protected]}
55
Future Directions

Self-managing controller for large
multitenant database infrastructures

Novel data management architectures
◦ Leveraging advances from novel hardware
◦ Convergence of transactional and analytics
systems for real-time intelligence

Putting human-in-the-loop: Leveraging
crowd-sourcing
Sudipto Das [email protected]}
56
Thank you!
Collaborators
UCSB:
Divy Agrawal, Amr El Abbadi, Ömer Eğecioğlu
Shashank Agarwal, Shyam Antony, Aaron Elmore,
Shoji Nishimura (NEC Japan)
Microsoft Research Redmond:
Phil Bernstein, Colin Reid
IBM Almaden:
Yannis Sismanis, Kevin Beyer, Rainer Gemulla,
Peter Haas, John McPherson

similar documents