PPT

Report
Dynamo
Highly Available Key-Value Store
Dennis Kafura – CS5204 – Operating Systems
1
Dynamo
Context

Core e-commerce services need scalable and reliable storage for
massive amounts of data



Size and scalability require a storage architecture that is






n x 100 of services
n x 100,000 concurrent sessions on key services
highly decentralized
high component count
commodity hardware
High component count creates reliability problems (“treats
failure handling as the normal case”)
Address reliability problems by replication
Replication raises issues of:


Consistency (replicas differ after failure)
Performance


When to enforce consistency (on read, on write)
Who enforces consistency (client, storage system)
Dennis Kafura – CS5204 – Operating Systems
2
Dynamo

System Elements
Maintain state of services with




Used only internally



High reliability requirements
Latency-sensitive performance
Control tradeoff between consistency and performance
Can leverage characteristics of services and workloads
Non-hostile environment (no security requirements)
Simple key-value interface
Applications do not require more complicated (e.g. database)
semantics or hierarchical name space
 Key is unique identifier for data item; Value is a binary object (blob)
 No operations over multiple data items



Adopts weaker model of consistency (eventual consistency) in favor
of higher availability
Service level agreements (SLA)
At 99.9% percentile
Key factors: service latency at a given request rate
Example: response time of 300ms for 99.9% of requests at peak
client load of 500 requests per second
 State manage (storage) efficiency a key factor in SLAs



Dennis Kafura – CS5204 – Operating Systems
3
Dynamo
Design Considerations

Consistency vs. availability


Strict consistency means that data is unavailable in case of
failure to one of the replicas
To improve availability,


use weaker form of consistency (eventual consistency)
allow optimistic updates (changes propagate in the background)


Conflicts



Can lead to conflicting changes which must be detected and
resolved
Dynamo applications require “always writeable” storage
Perform conflict detection/resolution on reads
Other factors



Incremental scalability
Symmetry/decentralization (P2P organization/control)
Heterogeneity (not all servers the same)
Dennis Kafura – CS5204 – Operating Systems
4
Dynamo
Design Overview
Dennis Kafura – CS5204 – Operating Systems
5
Dynamo
Partitioning

Interface

get(key)
Returns context and
 A single object or a list of conflicting objects


put(key, context, object)


Context from previous read
Object placement/replication


MD5 hash of key yields
128 bit identifier
Consistent hashing
preference list
Dennis Kafura – CS5204 – Operating Systems
6
Dynamo
Versioning

Failure free operation
put

replicas
What to do in case of failure?
?
put
Dennis Kafura – CS5204 – Operating Systems
replicas
7
Dynamo
Versioning

Object content is treated as immutable and an
update operation creates a new version
put
v2
v1
v2
v1
v2
Dennis Kafura – CS5204 – Operating Systems
v1
8
Dynamo
Versioning

Versioning can lead to inconsistency

Due to network partitioning
put
v2
v1
v2
v1
v1
Dennis Kafura – CS5204 – Operating Systems
9
Dynamo
Versioning

Versioning can lead to inconsistency

Due to concurrent updates
v2b v2a v1
puta
v2a
v2b v1
putb
v2b v2a
Dennis Kafura – CS5204 – Operating Systems
v1
10
Dynamo
Object Resolution




Uses vector-clocks
Conflicting versions
passed to application as
output of get operation
Application resolves
conflicts and puts a
new (consistent)
version
Inconsistent version
rare: 99.94% of get
operations saw exactly
one version
Dennis Kafura – CS5204 – Operating Systems
11
Dynamo
Handling get/put operations

Operating handled by coordinator:


First among the top N nodes in the preference list
Located by



Quorum voting






call to load balancer (no Dynamo-specific node needed in application
but may require extra level of indirection)
Direct call to coordinator (via Dynamo-specific client library)
R nodes must agree to a get operation
W nodes must agree to a put operation
R+W > N
(N, R, W) can be chosen to achieve desired tradeoff
Common configuration is (3,2,2)
“Sloppy quorum”
Top N’ healthy nodes in the preference list
Coordinator is first in this group
Replicas sent to node contain a “hint” indicating the
(unavailable) original node that should hold the replica
 Hinted replicas are stored by available node and sent forwarded
when original node recovers.



Dennis Kafura – CS5204 – Operating Systems
12
Dynamo
Replica synchronization



Accelerates detection of
inconsistent replicas using
Merkle tree
Separate tree maintained by
each node for each key range
Adds overhead to maintain
Merkle trees
Dennis Kafura – CS5204 – Operating Systems
13
Dynamo
Ring membership






Nodes are explicitly added to/removed from a ring
Membership, partitioning, and placement information
propagates via periodic exchanges (a gossip protocol)
Existing nodes transfer key ranges to newly added
node or receive key ranges from exiting nodes
Nodes eventually know key ranges of its peers and
can forward requests to them
Some “seed” nodes are well-known
Nodes failures detected by lack of responsiveness and
recovery detected by periodic retry
Dennis Kafura – CS5204 – Operating Systems
14
Dynamo
Partition/Placement Strategies
Strategy
Placement
Partition
1
T random tokens per node
Consecutive tokens create a partition
2
T random tokens per node
Q equal sized partitions
3
Q/S tokens per node
Q equal sized partitions
S = number of nodes
Dennis Kafura – CS5204 – Operating Systems
15
Dynamo
Strategy Performance Factors

Strategy 1

Bootstrapping of new node is lengthy
It must acquire its key ranges from other nodes
 Other nodes process scanning/transmission of key
ranges for new node as background activities
 Has taken a full day during peak periods



Numerous nodes many have to adjust their Merkle
trees when a new node joins/leaves system
Archival process difficult
Key ranges may be in transit
 No obvious synchronization/checkpointing structure

Dennis Kafura – CS5204 – Operating Systems
16
Dynamo
Strategy Performance Factors

Strategy 2



Decouples partition and placement
Allows changing of placement scheme at run-time
Strategy 3


Decouples partition and placement
Faster bootstrapping/recovery and ease of archiving
because key ranges can be segregates into
different files that can be shared/archived
separately
Dennis Kafura – CS5204 – Operating Systems
17
Dynamo
Partition Strategies - Performance



Strategies have different tuning parameters
Fair comparison: evaluate the skew in their load distributions for a
fixed amount of space to maintain membership information
Strategy 3 superior
Dennis Kafura – CS5204 – Operating Systems
18
Dynamo
Client- vs Server-Side Coordination



Any node can coordinate read requests; write requests handled by coordinator
State-machine for coordination can be in load balancing server or incorporated
into client
Client-driven coordination has lower latency because it avoids extra network
hop (redirection)
Dennis Kafura – CS5204 – Operating Systems
19

similar documents