NETS 212: Scalable and Cloud Computing
Storage at Facebook
December 3, 2013
© 2013 A. Haeberlen, University of Pennsylvania

Announcements

- Second midterm on Tuesday next week
  - Comparable to first midterm (open-book, closed-Google, ...)
  - Covers all material up to, and including, the December 5 lecture
  - Practice questions at the end of Thursday's lecture

- Project: Any questions?
  - You should have a first running prototype by the end of this week
  - Don't try to finish this at the last minute! Please test early and often!
  - If you're 'stuck' on something or would like feedback on something, please see me (or one of the TAs) during office hours!

Reminder: Facebook award

- The team with the best PennBook application will receive an award (sponsored by Facebook)
  - Criteria: Architecture/design, supported features, stability, performance, security, scalability, deployment
  - Winning teams get Facebook backpacks and/or T-shirts
  - Winners will be announced on the course web page

Menu for today

- Two recent research papers from Facebook
  - They're not as secretive as some other companies
  - Lots of papers with reasonable details about their systems: https://www.facebook.com/publications

- Paper #1: A scalable storage system + cache for the social graph (improves on Memcache)
  - True graph data model - not a KV store like Memcache
  - Impressive performance: 1B reads/sec, 1M writes/sec!

- Paper #2: A special storage system for their images (where caching doesn't help as much)
  - Reason: "Long tail" of requests that can't easily be cached

What you should look out for

- Not so much the exact details of how it works
  - Highly technical, and somewhat problem-specific

- Rather, what principles they used!
  - How did they make it scale?
  - Why did they make the design decisions the way they did?
  - What kinds of problems did they face?
  - Etc.

- Also, I am hoping to give you a sense of how things "really work" today (as far as we know)
TAO: Facebook's Distributed Data Store
for the Social Graph
Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov,
Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, Mark Marchukov,
Dimitri Petrov, Lovro Puzar, Yee Jiun Song, Venkat Venkataramani
USENIX Annual Technical Conference 2013

Motivation

- From David's guest lecture:
  - Social graph stored in mySQL databases
  - Memcache used as a (scalable) lookaside cache
  - This is great - but can we do even better?

- Some challenges with this design:
  - Inefficient edge lists: A key-value cache is not a good fit for the edge lists in a graph; need to always fetch the entire list
  - Distributed control logic: Cache control logic is run on clients that don't communicate with each other
    - More failure modes; difficult to avoid "thundering herds" (→ leases)
  - Expensive read-after-write consistency: In the original design, writes always have to go to the 'master'
    - Can we write to caches directly, without inter-regional communication?

Goals for TAO

- Provide a data store with a graph abstraction (vertexes and edges), not keys+values

- Optimize heavily for reads
  - Remember what David said? More than 2 orders of magnitude more reads than writes!

- Explicitly favor efficiency and availability over consistency
  - Slightly stale data is often okay (for Facebook)
  - Communication between data centers in different regions is expensive

Thinking about related objects

[Figure: example graph with nodes Facebook, Alice, Sunita, Mikhail, Jose, and Magna Carta, connected by edges labeled fan-of and friend-of. Images by Jojo Mendoza, Creative Commons licensed.]

Remember this slide?

- We can represent related objects as a labeled, directed graph

- Entities are typically represented as nodes; relationships are typically edges
  - Nodes all have IDs, and possibly other properties
  - Edges typically have values, possibly IDs and other properties

TAO's data model

- Facebook's data model is exactly like that!
  - Focuses on people, actions, and relationships
  - These are represented as vertexes and edges in a graph

- Example: Alice visits a landmark with Bob
  - Alice 'checks in' with her mobile phone
  - Alice 'tags' Bob to indicate that he is with her
  - Cathy adds a comment
  - David 'likes' the comment
  - All of these become vertexes and edges in the graph

TAO's data model and API

- TAO "objects" (vertexes)
  - 64-bit integer ID (id)
  - Object type (otype)
  - Data, in the form of key-value pairs
- Object API: allocate, retrieve, update, delete

- TAO "associations" (edges)
  - Source object ID (id1)
  - Association type (atype)
  - Destination object ID (id2)
  - 32-bit timestamp
  - Data, in the form of key-value pairs
- Association API: add, delete, change type

- Associations are unidirectional
  - But edges often come in pairs (each edge type has an 'inverse type' for the reverse edge)

(A small sketch of this data model in code follows below.)
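
The following is a minimal sketch of how the objects and associations from this slide could be modeled. The field names (id, otype, id1, atype, id2, time) come from the slide; the class names, example types, and inverse-type table are illustrative assumptions, not Facebook's actual code.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class TAOObject:
    id: int                      # 64-bit integer ID
    otype: str                   # object type, e.g. "user" or "checkin" (examples assumed)
    data: Dict[str, str] = field(default_factory=dict)   # key-value payload

@dataclass
class TAOAssociation:
    id1: int                     # source object ID
    atype: str                   # association type
    id2: int                     # destination object ID
    time: int                    # 32-bit timestamp
    data: Dict[str, str] = field(default_factory=dict)   # key-value payload

# Edges often come in pairs: each association type has an 'inverse type'
# (the concrete names below are made up for illustration).
INVERSE_TYPE = {"tagged": "tagged_at", "liked": "liked_by", "authored": "authored_by"}
```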

Example: Encoding in TAO

[Figure: the check-in example encoded as TAO objects and associations; edges carry data as KV pairs, and each edge type has an inverse edge type.]

Association queries in TAO

- TAO is not a general graph database
  - Has a few specific (Facebook-relevant) queries 'baked into it'
  - Common query: Given an object and an association type, return an association list (all the outgoing edges of that type)
    - Example: Find all the comments for a given checkin

- Optimized based on knowledge of Facebook's workload
  - Example: Most queries focus on the newest items (posts, etc.)
  - There is creation-time locality → can optimize for that!

- Queries on association lists: (usage sketch below)
  - assoc_get(id1, atype, id2set, t_low, t_high)
  - assoc_count(id1, atype)
  - assoc_range(id1, atype, pos, limit)  → "cursor"
  - assoc_time_range(id1, atype, high, low, limit)
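
A sketch of how a client might use this query API to render the newest comments on a checkin. The call names and parameters follow the slide; the `tao` client object, the "comment" association type, and the example ID are hypothetical.

```python
def newest_comments(tao, checkin_id, limit=10):
    # Association lists are ordered by creation time (newest first),
    # so a range query starting at position 0 returns the most recent edges.
    total = tao.assoc_count(checkin_id, "comment")
    comments = tao.assoc_range(checkin_id, "comment", pos=0, limit=limit)
    return total, comments

# Usage (assuming 'tao' is a connected client and 632 is a checkin ID):
# total, newest = newest_comments(tao, 632)
```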

TAO's storage layer

- Objects and associations are stored in mySQL

- But what about scalability?
  - Facebook's graph is far too large for any single mySQL DB!!

- Solution: Data is divided into logical shards (sketch below)
  - Each object ID contains a shard ID
  - Associations are stored in the shard of their source object
  - Shards are small enough to fit into a single mySQL instance!
  - A common trick for achieving scalability
  - What is the 'price to pay' for sharding?
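
A minimal sketch of the sharding idea. The slide only says that the shard ID is embedded in the object ID; the particular encoding below (low bits, 4096 shards) and the shard-to-database mapping are assumptions for illustration.

```python
NUM_SHARDS = 4096                    # hypothetical number of logical shards

def shard_of(object_id: int) -> int:
    # Every object ID embeds its shard ID (here: the low bits), so
    # routing a request needs no extra lookup.
    return object_id % NUM_SHARDS

def db_for(object_id: int, shard_to_db: dict) -> str:
    # Many logical shards are mapped onto far fewer mySQL instances.
    return shard_to_db[shard_of(object_id)]

# Associations live in the shard of their *source* object (id1), so an
# entire association list can be served from a single shard:
def shard_of_assoc(id1: int, atype: str) -> int:
    return shard_of(id1)
```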

Caching in TAO (1/2)

- Problem: Hitting mySQL is very expensive
  - But most of the requests are read requests anyway!
  - Let's try to serve these from a cache

- TAO's cache is organized into tiers
  - A tier consists of multiple cache servers (number can vary)
  - Sharding is used again here → each server in a tier is responsible for a certain subset of the objects+associations
  - Together, the servers in a tier can serve any request!
  - In-memory cache for objects, associations, and association counts (!)

- Clients directly talk to the appropriate cache server (routing sketch below)
  - Avoids bottlenecks!
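
A sketch of how a client could pick the right cache server within a tier. The slide only says that each server owns a subset of the shards; the modulo routing rule and server names below are illustrative assumptions.

```python
NUM_SHARDS = 4096    # hypothetical; matches the storage layer's shard count

def cache_server_for(object_id: int, tier_servers: list) -> str:
    # Each server in the tier owns a subset of the shards; here we map
    # shard -> server with a simple modulo rule (an illustrative choice).
    shard = object_id % NUM_SHARDS
    return tier_servers[shard % len(tier_servers)]

# Usage: the client contacts the owning cache server directly, so there
# is no central frontend that could become a bottleneck.
# server = cache_server_for(632, ["cache-01", "cache-02", "cache-03"])
```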

Caching in TAO (2/2)

- How does the cache work?
  - New entries are filled on demand
  - When the cache is full, the least recently used (LRU) object is evicted
  - The cache is "smart": if it knows that an object has zero associations of some type, it can answer a range query for that type itself
    - Could this have been done in Memcached? If so, how? If not, why not?

- What about write requests? (sketch below)
  - Need to go to the database (write-through)
  - But what if we're writing a bidirectional edge?
    - The inverse edge may be stored in a different shard → need to contact that shard!
  - What if a failure happens while we're writing such an edge?
    - You might think that there are transactions and atomicity...
    - ... but in fact, they simply leave the 'hanging edges' in place (why?)
    - An asynchronous repair job takes care of them eventually
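
A sketch of the write-through path for a bidirectional edge. The `db` and `cache` handles, the error type, and the repair-queue helper are hypothetical; the behavior on failure (leave the hanging edge, fix it asynchronously) follows the slide.

```python
def add_bidirectional_assoc(db, cache, id1, atype, id2, inverse_atype, now, data=None):
    data = data or {}
    # Write-through: the forward edge goes to the shard of id1 ...
    db.insert_assoc(id1, atype, id2, now, data)
    cache.update_assoc(id1, atype, id2, now, data)
    try:
        # ... and the inverse edge to the (possibly different) shard of id2.
        db.insert_assoc(id2, inverse_atype, id1, now, data)
        cache.update_assoc(id2, inverse_atype, id1, now, data)
    except ConnectionError:
        # No distributed transaction: the forward edge is left 'hanging',
        # and an asynchronous repair job fixes it up eventually.
        repair_queue_enqueue(id1, atype, id2, inverse_atype, now, data)

def repair_queue_enqueue(*args):
    # Placeholder for the asynchronous repair job's queue (hypothetical).
    pass
```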

Leaders and followers

- How many machines should be in a tier?
  - Too many is problematic: more prone to hot spots, etc.

- Solution: Add another level of hierarchy
  - Each shard can have multiple cache tiers: one leader, and multiple followers
  - The leader talks directly to the mySQL database
  - Followers talk to the leader
  - Clients can only interact with followers
  - The leader can protect the database from 'thundering herds'

Leaders/followers and consistency

- What happens now when a client writes? (sketch below)
  - The follower sends the write to the leader, who forwards it to the DB
  - Does this ensure consistency? No!
  - Need to tell the other followers about it!

- Write to an object → the leader tells followers to invalidate any cached copies they might have of that object

- Write to an association → we don't want to invalidate. Why?
  - Followers might have to throw away long association lists!
  - Solution: the leader sends a 'refill message' to followers
  - If a follower had cached that association, it asks the leader for an update

- What kind of consistency does this provide?
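
A sketch of how a leader could propagate writes to its followers: invalidations for objects, refill messages for associations, as described on the slide. The class layout, message names, and `db`/`follower` interfaces are assumptions for illustration.

```python
class Leader:
    def __init__(self, db, followers):
        self.db = db                # mySQL shard this leader fronts (hypothetical handle)
        self.followers = followers  # follower cache tiers for the same shard

    def write_object(self, obj_id, data):
        self.db.update_object(obj_id, data)
        # Objects are small: just tell the followers to drop their copies.
        for f in self.followers:
            f.send("invalidate_object", obj_id)

    def write_assoc(self, id1, atype, id2, time, data):
        self.db.insert_assoc(id1, atype, id2, time, data)
        # Association lists can be long: instead of invalidating them,
        # send a refill message; a follower that has the list cached
        # then asks the leader for the updated version.
        for f in self.followers:
            f.send("refill_assoc", (id1, atype))
```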

Scaling geographically

- Facebook is a global service. Does this work?

- No - the laws of physics are in the way!
  - Long propagation delays, e.g., between Asia and the U.S.
  - What tricks do we know that could help with this?

Scaling geographically

- Idea: Divide data centers into regions; have one full replica of the data in each region
  - What could be a problem with this approach?
  - Again, consistency!

- Solution: One region has the 'master' database; other regions forward their writes to the master (sketch below)
  - Database replication makes sure that the 'slave' databases eventually learn of all writes; plus invalidation messages, just like with the leaders and followers
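
A sketch of the cross-region write path under this design: reads stay local, writes are forwarded to the master region, and the local slave replica catches up later via replication. The class and method names are hypothetical.

```python
class RegionLeader:
    """Per-region leader for one shard (illustrative interface)."""

    def __init__(self, local_db, master_leader=None):
        self.local_db = local_db            # this region's mySQL replica
        self.master_leader = master_leader  # None if this region holds the master DB

    def read(self, obj_id):
        # Reads are always served within the region.
        return self.local_db.get(obj_id)

    def write(self, obj_id, data):
        if self.master_leader is None:
            # This region owns the master database: apply the write here.
            self.local_db.update(obj_id, data)
        else:
            # Forward the write to the master region; our slave DB learns of
            # it later via replication plus invalidation/refill messages.
            self.master_leader.write(obj_id, data)
```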

Handling failures

- What if the master database fails?
  - Can promote another region's database to be the master
  - But what about writes that were in progress during the switch?
  - What would be the 'database answer' to this?
  - TAO's approach:
  - Why is (or isn't) this okay in general / for Facebook?

Consistency in more detail

- What is the overall level of consistency?
  - During normal operation: eventual consistency (why?)
    - Refills and invalidations are delivered 'eventually' (typical delay is less than one second)
  - Within a tier: read-after-write (why?)

- When faults occur, consistency can degrade
  - In some situations, clients can even observe values 'go back in time'!
  - How bad is this (for Facebook specifically / in general)?

- Is eventual consistency always 'good enough'?
  - No - there are a few operations on Facebook that need stronger consistency (which ones?)
  - TAO reads can be marked 'critical'; such reads are handled directly by the master

Fault handling in more detail

- General principle: Best-effort recovery
  - Preserve availability and performance, not consistency!

- Database failures: Choose a new master
  - Might happen during maintenance, after crashes, replication lag

- Leader failures: Replacement leader
  - Route around the faulty leader if possible (e.g., go to the DB)

- Refill/invalidation failures: Queue messages
  - If a leader fails permanently, need to invalidate the cache for the entire shard

- Follower failures: Failover to other followers
  - The other followers jointly assume responsibility for handling the failed follower's requests

Production deployment at Facebook

- Impressive performance
  - Handles 1 billion reads/sec and 1 million writes/sec!

- Reads dominate massively
  - Only 0.2% of requests involve a write

- Most edge queries have zero results
  - 45% of assoc_count calls return 0...
  - ... but there is a heavy tail: 1% return >500,000! (why?)

- Cache hit rate is very high
  - Overall, 96.4%!

Summary

- The data model really does matter!
  - KV pairs are nice and generic, but you can sometimes get better performance by telling the storage system more about the kind of data you are storing in it (→ optimizations!)

- Several useful scaling techniques
  - "Sharding" of databases and cache tiers (not invented at Facebook, but put to great use)
  - Primary-backup replication to scale geographically

- Interesting perspective on consistency
  - On the one hand, quite a bit of complexity & hard work to do well in the common case (truly "best effort")
  - But also, a willingness to accept eventual consistency (or worse!) during failures, or when the cost would be high
Finding a needle in Haystack: Facebook's
photo storage
Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, Peter Vajgel
OSDI 2010

Motivation

- Facebook stores a huge number of images
  - In 2010, over 260 billion (~20PB of data)
  - One billion (~60TB) new uploads each week

- How to serve requests for these images?
  - Typical approach: Use a CDN (and Facebook does do that)

The problem

[Figure: request popularity distribution with a "long tail" of rarely requested images]

- When would the CDN approach work well?
  - If most requests were for a small # of images
  - But is this the case for Facebook photos?

- Problem: "Long tail" of requests for old images!
  - The CDN can help, but can't serve everything!
  - Facebook's system still needs to handle a lot of requests

Facebook pre-2010 (1/2)

- Images were kept on NAS devices
  - NAS = network-attached storage
  - The file system can be mounted on servers via NFS
  - The CDN can request images from the servers

Facebook pre-2010 (2/2)

- Problem: Accesses to metadata (directories, inodes, block maps, ...)
  - The OS needs to translate the file name to an inode number, read the inode from disk, etc., before it can read the file itself
  - Often more than 10 disk I/Os - and these are really slow! (The disk head needs to be moved, wait for the rotating media, ...)
    - Hmmm... what happens once they have SSDs?

- Result: Disk I/Os for metadata were limiting their read throughput
  - Could you have guessed that this was going to be the bottleneck?

What could be done?

- They considered various ways to fix this...
  - ... including kernel extensions, etc.

- But in the end they decided to build a special-purpose storage system for images
  - Goal: Massively reduce the size of the metadata

- When is this a good idea (compared to using an existing storage system)?
  - Pros and cons - in general?
  - ... and for Facebook specifically?

Haystack overview

- Three components:
  - Directory
  - Cache
  - Store
    - Physical volumes
    - Several of these make up a logical volume
    - Replication - why?

- When the user visits a page:
  - The web server constructs a URL for each image (sketch below)
  - http://(CDN)/(Cache)/(MachineID)/(Logical volume, photo)
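
A minimal sketch of assembling the image URL from the slide. Only the four-part structure comes from the slide; the hostnames, ID formats, and the slash-separated "volume/photo" encoding of the last component are hypothetical placeholders.

```python
def build_photo_url(cdn_host, cache_host, machine_id, logical_volume, photo_id):
    # http://(CDN)/(Cache)/(MachineID)/(Logical volume, photo)
    return (f"http://{cdn_host}/{cache_host}/{machine_id}"
            f"/{logical_volume}/{photo_id}")

# Usage with made-up values:
# url = build_photo_url("cdn.example.com", "cache-07", "store-42", 118, 8431207)
```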

Reading an image

http://(CDN)/(Cache)/(MachineID)/(Logical volume, photo)

- Read path: (sketch below)
  - The CDN tries to find the image first
  - If not found, it strips the CDN part off and contacts the Cache
  - If not found there either, the Cache strips off the cache part and contacts the specified Store machine
  - The Store machine finds the volume, and the photo within the volume

- All the necessary information comes from the URL!
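
A sketch of this layered read path, assuming the URL layout from the previous sketch; the `cdn`, `cache`, and `stores` handles and their methods are hypothetical.

```python
def read_photo(url, cdn, cache, stores):
    """Sketch of the read path; cdn/cache/stores are hypothetical handles."""
    # 1) The CDN tries to serve the image from its own cache.
    img = cdn.lookup(url)
    if img is not None:
        return img

    # 2) Strip the CDN part and ask the Haystack Cache.
    #    (Simplified parse of http://cdn/cache/machine/volume/photo)
    _, cache_host, machine_id, volume, photo = url.split("/")[2:7]
    img = cache.lookup(machine_id, volume, photo)
    if img is not None:
        return img

    # 3) Strip the cache part and ask the named Store machine, which
    #    locates the volume and the photo (needle) within it.
    return stores[machine_id].read(volume, photo)
```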

Uploading an image

- The user uploads an image to a web server

- The web server requests a write-enabled volume
  - Volumes become read-only when they are full, or for operational reasons

- The web server assigns a unique ID and sends the image to each of the physical volumes that belong to the chosen logical volume (sketch below)
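
A sketch of the upload path. The `directory` and `stores` interfaces and the ID allocator are assumptions; the key point from the slide is that the same image, with the same ID, is appended to every physical volume of the chosen logical volume.

```python
import itertools

_next_photo_id = itertools.count(1)   # hypothetical ID allocator

def upload_photo(directory, stores, image_bytes):
    # Ask for a logical volume that is still write-enabled.
    logical_volume, physical_volumes = directory.pick_writable_volume()

    # Assign a unique photo ID and append the image to every replica.
    photo_id = next(_next_photo_id)
    for store_machine, physical_volume in physical_volumes:
        stores[store_machine].append_needle(physical_volume, photo_id, image_bytes)
    return logical_volume, photo_id
```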

Haystack: The Directory

- The Directory serves four functions:
  - Maps logical to physical volumes
  - Balances writes across logical volumes, and reads across physical volumes
  - Determines whether a request should be handled by the CDN or by the Cache
  - Identifies the read-only volumes

- How does it do this?
  - Information is stored as usual: a replicated database, with Memcache to reduce latency

Haystack: The Cache

- Organized as a distributed hashtable
  - Remember the Pastry lecture?

- Caches images ONLY if: (sketch below)
  - 1) the request didn't come from the CDN, and
  - 2) the photo is stored on a write-enabled volume

- Why?!?
  - Post-CDN caching is not very effective (hence #1)
  - Photos tend to be most heavily accessed soon after they are uploaded (hence #2)
    - ... and file systems tend to perform best when they're either reading or writing (but not both at the same time!)
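
A tiny sketch of the caching rule as stated on the slide; the parameter names are hypothetical.

```python
def should_cache(request_came_from_cdn: bool, volume_is_write_enabled: bool) -> bool:
    # Cache only requests that bypassed the CDN (post-CDN caching is not
    # very effective) and that hit a write-enabled volume (recently
    # uploaded photos are the hottest ones).
    return (not request_came_from_cdn) and volume_is_write_enabled
```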

Haystack: The Store (1/2)

- Volumes are simply very large files (~100GB)
  - Few of them are needed → the in-memory data structures stay small

- Structure of each file: (sketch below)
  - A header, followed by a number of 'needles' (images)
  - Cookies included to prevent guessing attacks
  - Writes simply append to the file; deletes simply set a flag
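
A simplified sketch of the append-only volume file. The needle layout below (cookie, photo ID, size, deleted flag) is illustrative and omits fields of the real format; it shows the key idea from the slide: writes append, deletes only set a flag in place.

```python
import struct

NEEDLE_HEADER = struct.Struct("<QQIB")   # illustrative: cookie, photo ID, size, deleted flag

def append_needle(volume_path, cookie, photo_id, image_bytes):
    # Writes simply append to the end of the large volume file.
    with open(volume_path, "ab") as f:
        offset = f.tell()
        f.write(NEEDLE_HEADER.pack(cookie, photo_id, len(image_bytes), 0))
        f.write(image_bytes)
    return offset                         # the in-memory index remembers this offset

def delete_needle(volume_path, offset):
    # Deletes do not reclaim space; they only set the 'deleted' flag.
    with open(volume_path, "r+b") as f:
        f.seek(offset + NEEDLE_HEADER.size - 1)   # last header byte = flag
        f.write(b"\x01")
```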

Haystack: The Store (2/2)

- Store machines have an in-memory index
  - Maps photo IDs to offsets in the large files

- What to do when the machine is rebooted? (recovery sketch below)
  - Option #1: Rebuild the index by reading the files front-to-back
    - Is this a good idea?
  - Option #2: Periodically write the index to disk

- What if the index on disk is stale?
  - The file remembers where the last needle was appended
  - The server can start reading from there
  - It might still have missed some deletions - but the server can 'lazily' update the index when someone requests the deleted image
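
A sketch of recovering the in-memory index from an on-disk checkpoint plus the tail of the volume file, reusing the illustrative needle layout from the previous sketch; the JSON checkpoint format is an assumption.

```python
import json, struct

NEEDLE_HEADER = struct.Struct("<QQIB")   # cookie, photo ID, size, deleted flag (illustrative)

def recover_index(volume_path, checkpoint_path):
    # Load the (possibly stale) on-disk checkpoint of the index.
    with open(checkpoint_path) as f:
        checkpoint = json.load(f)
    index = {int(pid): off for pid, off in checkpoint["index"].items()}
    pos = checkpoint["last_appended_offset"]

    # Scan only the needles appended after the checkpoint was written.
    with open(volume_path, "rb") as f:
        f.seek(pos)
        while True:
            header = f.read(NEEDLE_HEADER.size)
            if len(header) < NEEDLE_HEADER.size:
                break
            cookie, photo_id, size, deleted = NEEDLE_HEADER.unpack(header)
            if not deleted:
                index[photo_id] = pos
            pos += NEEDLE_HEADER.size + size
            f.seek(pos)

    # Deletions that happened after the checkpoint may still be missed;
    # they are fixed up lazily when a deleted photo is requested.
    return index
```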

Recovery from failures

- Lots of failures to worry about
  - Faulty hard disks, defective controllers, bad motherboards...

- A 'pitchfork' service scans for faulty machines
  - Periodically tests the connection to each machine
  - Tries to read some data, etc.
  - If any of this fails, logical (!) volumes are marked read-only
    - Admins need to look into, and fix, the underlying cause

- A bulk sync service can restore the full state
  - ... by copying it from another replica
  - Rarely needed

How well does it work?

- How much metadata does it use?
  - Only about 12 bytes per image (in memory)
  - Comparison: an XFS inode alone is 536 bytes!

- More performance data in the paper
  - Cache hit rates: approx. 80%

Summary

- Different perspective from TAO's
  - Presence of a "long tail" → caching won't help as much

- Interesting (and unexpected) bottleneck
  - To get really good scalability, you need to understand your system at all levels!

- In theory, constants don't matter - but in practice, they do!
  - Shrinking the metadata made a big difference to them, even though it is 'just' a 'constant factor'
  - Don't (exclusively) think about systems in terms of big-O notation!

Stay tuned

Next time you will learn about: Special topics
