Data Management in the Cloud

Paul Szerlip
The rise of data
• Think about this
  o For the past two decades, the largest generator of data was humans; now it's our devices
• Cheap sensors
  o Cellphones are packed with sensory information
    - Images, video, audio, etc.
• Expensive sensors
  o DZero, high energy physics, generates 1 TB a day
• How do you deal with that much data?
[1,2]
Data in the cloud
• Storing the data
  o Bigtable, S3, NoSQL, etc.
• Processing the data
  o MapReduce, Hadoop, etc.
Good data management in the cloud
• Availability
  o Accessible in cases of partial network failure or datacenter failure
• Scalability
  o Support for massive database sizes, spread across many servers
• Elasticity
  o Scaling up and scaling down
• Performance
  o Efficient system storage utilization
• Multitenancy
  o Many applications on the same hardware
Good data management (continued)
• Load and Tenant Balancing
  o Moving load between servers
• Fault Tolerance
  o Tolerating network or hardware failures
• Running in a heterogeneous environment
  o Dealing with hardware degradation
• Flexible query interface
  o Supporting both SQL and non-SQL query languages
Overarching Themes
• Frustration with ACID on the cloud
  o Hard to maintain ACID guarantees (atomicity, consistency, isolation, durability) with data replication over large geographic distances [1]
  o CAP: Consistency, Availability, Tolerance to partitions; choose 2
• Rise of NoSQL (a misnomer) [2]
  o Eventual consistency can be okay; some ACID properties are relaxed or left to application developers
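
To make "eventually consistent" concrete, here is a minimal sketch of two replicas that accept writes independently and converge once they exchange state. The `Replica` class, its method names, and the last-write-wins rule based on a caller-supplied timestamp are illustrative assumptions, not taken from any system discussed in these slides.

```python
class Replica:
    """Hypothetical replica that resolves conflicts by last-write-wins."""

    def __init__(self):
        # key -> (timestamp, value)
        self.store = {}

    def write(self, key, value, timestamp):
        # Keep the incoming value only if it is newer than what we hold.
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None

    def sync_from(self, other):
        # Anti-entropy pass: replay the other replica's entries locally.
        for key, (timestamp, value) in other.store.items():
            self.write(key, value, timestamp)


east, west = Replica(), Replica()
east.write("cart", "1 item", timestamp=1)
west.write("cart", "2 items", timestamp=2)
stale = east.read("cart")   # replicas disagree until they sync
east.sync_from(west)
west.sync_from(east)
```

Between the write and the sync, a reader of `east` sees the older value; whether that window is acceptable is the decision left to the application developer.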
Investigating 3 Systems
• Bigtable (Google)
  o And quick look at MapReduce
• Amazon: S3/SimpleDB
• Open source NoSQL alternatives:
  o Cassandra (key-value)
  o MongoDB (document)
Bigtable
• Distributed storage designed to scale to petabyte-size databases spread across thousands of servers [1]
• Used extensively by Google
• Not fully relational
  o "Sparse, distributed, persistent multidimensional sorted map" [1]
• Uses Google File System (GFS) under the hood
• Indexed using row keys
  o Tablet = range of row keys, used for load balancing
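
As a rough illustration of the "sorted map" and of tablets as row-key ranges, the sketch below models cells keyed by (row key, column, timestamp) and locates a tablet by binary search over split points. The class `SortedMapTable`, the function `find_tablet`, and the split keys are hypothetical names chosen for this example; it is not Bigtable's API.

```python
import bisect


class SortedMapTable:
    """Toy sparse map: only cells that were written take up space."""

    def __init__(self):
        # (row_key, column, timestamp) -> value
        self._cells = {}

    def put(self, row_key, column, timestamp, value):
        self._cells[(row_key, column, timestamp)] = value

    def get_latest(self, row_key, column):
        # Collect every stored version of the cell, then pick the newest.
        versions = [(ts, v) for (r, c, ts), v in self._cells.items()
                    if r == row_key and c == column]
        return max(versions, key=lambda pair: pair[0])[1] if versions else None

    def row_keys(self):
        # Rows are exposed in sorted order, as in a sorted map.
        return sorted({r for (r, _, _) in self._cells})


def find_tablet(split_keys, row_key):
    """Return the index of the tablet whose row-key range holds row_key.

    split_keys is a sorted list; tablet i covers keys from split_keys[i-1]
    (inclusive) up to split_keys[i] (exclusive).
    """
    return bisect.bisect_right(split_keys, row_key)


table = SortedMapTable()
table.put("com.example/index", "contents:", 1, "<html>v1</html>")
table.put("com.example/index", "contents:", 2, "<html>v2</html>")
table.put("org.sample/home", "contents:", 1, "<html>home</html>")
splits = ["g", "p"]  # three tablets: keys < "g", "g" <= keys < "p", keys >= "p"
```

Because row keys are kept in sorted order, moving a tablet to another server only requires handing over one contiguous key range.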
Bigtable Diagram [2]
Bigtable
• GFS
  o SSTable
    - Provides a persistent, immutable, ordered map
  o Chubby provides the locking mechanism
    - Ensures there is a single master
    - Stores the location of Bigtable data
    - Stores schema information and access control lists
• Each Bigtable cluster has one master and many tablet servers
  o Master assigns tablets to different tablet servers, dynamically based on server load
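
The slides do not describe how the master decides where a tablet goes, so the following is only one plausible policy, sketched for illustration: a greedy assignment that always hands the next-heaviest tablet to the currently least-loaded server. The function name `assign_tablets`, the load numbers, and the server names are all assumptions.

```python
import heapq


def assign_tablets(tablet_loads, servers):
    """Greedy load balancing: heaviest tablet first, least-loaded server first."""
    # Min-heap of (total load so far, server name).
    heap = [(0, server) for server in servers]
    heapq.heapify(heap)
    assignment = {server: [] for server in servers}
    for tablet, load in sorted(tablet_loads.items(), key=lambda kv: -kv[1]):
        total, server = heapq.heappop(heap)
        assignment[server].append(tablet)
        heapq.heappush(heap, (total + load, server))
    return assignment


loads = {"t1": 5, "t2": 3, "t3": 2, "t4": 2}  # hypothetical per-tablet load
plan = assign_tablets(loads, ["ts-a", "ts-b"])
```

A real master would also react to servers joining or failing; this sketch covers only a one-shot assignment.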
MapReduce
• Introduced by Google in 2004 [1]
• Often used to operate on Bigtable data [1]
• A means to process large amounts of data in a distributed environment in a highly parallelized manner
MapReduce Steps
1. Input files are split into M pieces, and multiple copies of the program are started on the cluster
2. One copy is the master; it assigns the M map tasks and R reduce tasks to idle workers
3. A map worker reads the contents of its file split and passes them to the map function; results are buffered in memory
4. Buffered results are periodically written to local disk, partitioned into R regions by the partitioning function, and their locations are passed to the master
MapReduce (continued)
5. A reduce worker is notified of these locations, reads the buffered data from the map workers, and sorts it so that identical keys are grouped together
6. The reduce worker passes each key and its intermediate values to the Reduce function; the output is appended to a final output file
7. After all map and reduce tasks have completed, the master wakes up the user program
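
The steps above can be sketched in a few lines of single-process Python. This is a simplified illustration using word count as the job: there is no master, no worker scheduling, and no disk, and the names `map_fn`, `reduce_fn`, `partition`, and `run_mapreduce` are made up for the example.

```python
from collections import defaultdict


def map_fn(_split_name, text):
    # Map: emit (word, 1) for every word in the split.
    for word in text.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    # Reduce: sum the counts collected for one key.
    return key, sum(values)


def partition(key, r):
    # Deterministic partitioning function mapping a key to one of R regions.
    return sum(ord(ch) for ch in key) % r


def run_mapreduce(splits, r):
    # Steps 3-4: run each map task and partition its output into R regions.
    regions = [defaultdict(list) for _ in range(r)]
    for name, text in splits.items():
        for key, value in map_fn(name, text):
            regions[partition(key, r)][key].append(value)
    # Steps 5-6: each reduce task sorts its keys, then reduces each group.
    output = []
    for region in regions:
        for key in sorted(region):
            output.append(reduce_fn(key, region[key]))
    return dict(output)


# Step 1: the input is already split into M = 2 pieces.
splits = {"split-0": "the cat", "split-1": "the dog The"}
counts = run_mapreduce(splits, r=2)
```

Because every occurrence of a key lands in the same region, each reduce task can work without talking to the others, which is what makes the reduce phase parallelizable.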
S3 - Simple Storage Service
• "Infinite" store for objects of variable size [1]
• Organized in 2 levels
  o Buckets
    - Like folders, you can save any number of objects in them
  o Objects
    - Byte container (up to 5 GB) and metadata (up to 2 KB)
• Limited search
  o Single bucket, name only
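
The two-level layout and the name-only search can be pictured with a small in-memory model. This is not the S3 API; `ToyObjectStore` and its methods are hypothetical, and the size limits are simply the figures quoted on this slide.

```python
class ToyObjectStore:
    """In-memory stand-in for a bucket/object store (illustrative only)."""

    MAX_OBJECT_BYTES = 5 * 1024 ** 3   # 5 GB, as quoted on the slide
    MAX_METADATA_BYTES = 2 * 1024      # 2 KB, as quoted on the slide

    def __init__(self):
        # bucket name -> {object name -> (data, metadata)}
        self._buckets = {}

    def create_bucket(self, bucket):
        self._buckets.setdefault(bucket, {})

    def put_object(self, bucket, name, data, metadata=None):
        metadata = dict(metadata or {})
        meta_size = sum(len(k.encode()) + len(v.encode())
                        for k, v in metadata.items())
        if len(data) > self.MAX_OBJECT_BYTES:
            raise ValueError("object too large")
        if meta_size > self.MAX_METADATA_BYTES:
            raise ValueError("metadata too large")
        self._buckets[bucket][name] = (bytes(data), metadata)

    def get_object(self, bucket, name):
        return self._buckets[bucket][name]

    def list_objects(self, bucket, prefix=""):
        # The only search: object names within a single bucket.
        return sorted(n for n in self._buckets[bucket] if n.startswith(prefix))


store = ToyObjectStore()
store.create_bucket("photos")
store.put_object("photos", "2011/cat.jpg", b"\xff\xd8", {"camera": "phone"})
store.put_object("photos", "2011/dog.jpg", b"\xff\xd8")
store.put_object("photos", "2012/fish.jpg", b"\xff\xd8")
```

Note that there is no way to ask "which objects have camera = phone": metadata travels with the object but is not searchable in this model.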
SimpleDB
• Organized into domains (tables) where you can insert data, get data, or run queries [1]
• Each domain has items, which are described by attribute name/value pairs
• No schema
• API Access
  o CreateDomain, DeleteDomain, PutAttributes, DeleteAttributes, GetAttributes, and Select
• Meant for fast reads
• Keeps multiple copies of the domains
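
The domain/item/attribute model can be mimicked with nested dictionaries. The sketch below borrows the operation names from the slide but is not the SimpleDB client API: `ToySimpleDB` is hypothetical, each attribute holds a single value for simplicity, and `select` takes a plain name/value pair rather than a query string.

```python
class ToySimpleDB:
    """Schema-less domains of items described by attribute name/value pairs."""

    def __init__(self):
        # domain -> {item name -> {attribute name -> value}}
        self._domains = {}

    def create_domain(self, domain):
        self._domains.setdefault(domain, {})

    def delete_domain(self, domain):
        self._domains.pop(domain, None)

    def put_attributes(self, domain, item, attributes):
        # No schema: any item may carry any set of attributes.
        self._domains[domain].setdefault(item, {}).update(attributes)

    def get_attributes(self, domain, item):
        return dict(self._domains[domain].get(item, {}))

    def delete_attributes(self, domain, item, names):
        attrs = self._domains[domain].get(item, {})
        for name in names:
            attrs.pop(name, None)

    def select(self, domain, name, value):
        # Simplified query: items whose attribute `name` equals `value`.
        return sorted(item for item, attrs in self._domains[domain].items()
                      if attrs.get(name) == value)


db = ToySimpleDB()
db.create_domain("users")
db.put_attributes("users", "alice", {"city": "Orlando", "plan": "free"})
db.put_attributes("users", "bob", {"city": "Orlando"})
db.put_attributes("users", "carol", {"city": "Tampa", "age": "31"})
db.delete_attributes("users", "alice", ["plan"])
```

Items in the same domain carry different attribute sets here, which is what "no schema" means in practice.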
NoSQL
• What does this mean?
  o More about relaxing ACID than being "No" SQL [2]
• Lots of open source NoSQL systems
  o Zynga was big on NoSQL
• Why use them?
  o Excellent elasticity
  o Flexible data models, often schema-less
  o Cheap relative to an RDBMS (if you have lots of frequent, small writes)
Types of NoSQL
• Key-value
  o Redis, Cassandra, etc.
• Document store
  o CouchDB, MongoDB, etc.
• Graph dbs, object stores
  o Won't go into these much
Cassandra
• Highly scalable, eventually consistent, distributed, structured, key-value store [1]
• Open sourced by Facebook (2008) [1]
• ColumnFamily based
  o Column is a tuple of {key, value, timestamp}
  o ColumnFamilies contain many columns, all referenced by row-key
• Kind of like a hybrid of Dynamo and Bigtable [1]
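
The column and ColumnFamily structure described above maps naturally onto a named tuple and a nested dictionary. The sketch is a data-model illustration only, not Cassandra's client API; the class names mirror the slide's terms, and keeping the column with the newest timestamp on conflict is stated here as an assumption of the example.

```python
from typing import NamedTuple


class Column(NamedTuple):
    """A column is a tuple of {key, value, timestamp}."""
    key: str
    value: str
    timestamp: int


class ColumnFamily:
    """Many columns grouped per row, all referenced by row key."""

    def __init__(self, name):
        self.name = name
        # row_key -> {column key -> Column}
        self._rows = {}

    def insert(self, row_key, column):
        row = self._rows.setdefault(row_key, {})
        current = row.get(column.key)
        # Assumption: the column with the newest timestamp wins.
        if current is None or column.timestamp >= current.timestamp:
            row[column.key] = column

    def get(self, row_key, column_key):
        column = self._rows.get(row_key, {}).get(column_key)
        return column.value if column else None

    def columns(self, row_key):
        # Column keys of one row, in sorted order.
        return sorted(self._rows.get(row_key, {}))


users = ColumnFamily("Users")
users.insert("user42", Column("email", "old@example.com", timestamp=1))
users.insert("user42", Column("email", "new@example.com", timestamp=5))
users.insert("user42", Column("email", "late@example.com", timestamp=3))
users.insert("user42", Column("name", "Pat", timestamp=2))
```

The write with timestamp 3 arrives last but loses to the one with timestamp 5, which shows why the timestamp is part of every column.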
MongoDB
• Document-oriented
  o High read/write throughput
  o High availability
  o Scalability
  o Flexible query language
References
• [1] Sakr, S., Liu, A., Batista, D.M., Alomari, M., A Survey of Large Scale Data Management Approaches in Cloud Environments, IEEE Communications Surveys & Tutorials, 2011.
• [2] Cloud Computing: Theory and Practice (our lecture notes)
• [3] http://www.mongodb.org/display/DOCS/Intro