Partitioning Social Networks for
Fast Retrieval of Time-dependent Queries
Mindi Yuan, David Stein, Berenice Carrasco, Joana Trindade, Yi Lu
University of Illinois at Urbana-Champaign
Sameh Elnikety
Microsoft Research
Email questions, praise and complaints to Prof. Yi Lu <[email protected]>
Online Social Networking (OSN)
• OSNs differ from traditional web applications significantly
– Handle highly personalized content
• Retrieved pages differ from user to user; also vary over short time intervals
when new activities occur
– Highly interconnected data
• Immediate neighborhood and extended neighborhood (friend’s friend)
– Community
• Strong community structure among end users, albeit varying with time
• Scaling OSNs is a problem
– Interconnected nature
– Astounding growth rate
• Twitter grew by 10 times in one month and was forced to redesign and
re-implement its architecture several times
Existing Approaches
• Conventional vertical scaling
– Upgrade existing hardware
– Replicate fully on each machine
• Problems
– Expensive because of the cost of high-performance servers
– Or even infeasible
• Facebook alone requires hundreds of terabytes of memory
• Horizontal scaling
– Use a higher number of cheap commodity servers
– Partition the load among them
– Good for stateless traditional web applications
• Problematic with data back-end layer that maintains state
– OSN data cannot be partitioned into disjoint components
– Leads to costly inter-server communication
Inter-server communication
• Distributed queries are shown to reduce performance compared to
local queries [Pujol et al. 2010][Curino et al. 2010]
• Curino et al. reported that local queries double throughput
• Question to consider:
How much can we minimize inter-server communication, and at what cost?
1. Problem
2. Previous work
3. Our approach
4. Evaluation
Two extremes of the spectrum
• Cassandra
– Random partitioning with distributed hashing
– Consistency is easy to maintain since only one replica is kept
– Inter-server traffic increases at retrieval
• Slows down responses
• Leads to the “multi-get hole” problem, e.g. at Facebook, where increasing
the number of servers actually reduces throughput (see the sketch below)
• Replicating all users’ data on multiple servers
– Eliminates inter-server traffic but increases replication overhead
– Impossible to hold all data in memory on each server
– Delay in writing due to updates of all replicas
– Network traffic for maintaining consistency across replicas
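A back-of-the-envelope illustration (ours, not a figure from the poster) of why
random hashing hurts multi-get retrievals: if a retrieval fetches N messages
whose keys are hashed uniformly across S servers, the expected number of
distinct servers contacted is S(1 − (1 − 1/S)^N). A minimal Python sketch:

def expected_servers_contacted(n_messages: int, n_servers: int) -> float:
    """Expected number of distinct servers touched when n_messages keys
    are hashed uniformly at random across n_servers (balls into bins)."""
    return n_servers * (1 - (1 - 1 / n_servers) ** n_messages)

# Fetching 6 messages from a 16-server cluster already fans out widely:
print(expected_servers_contacted(6, 16))  # ~5.1 servers per retrieval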
Little Engines
The Little Engine(s) That Could: Scaling Online Social Networks. Pujol et al. SIGCOMM, 2010.
• Assumption
– Most operations are based on the data of a user and her neighbors
– The assumption is debatable: true for Twitter data, but not for Facebook
data, where a two-hop network is accessed for most operations (expanded on the
next slide)
• Guarantees that for all users in an OSN, their direct neighbors’ data are colocated on the same server
– Application developers can assume local semantics
– Scalability achieved with low memory and network I/O per server
• Drawback
– Popular data replicated on all servers
– Extension to two-hop network will require a much larger number of replicas
Two-hop network
• SPAR, proposed in Little Engines, considers the one-hop network
– Mainly focuses on Twitter-type models: a user follows all content of a group
of senders
– SPAR was tested on Facebook messages, but they are still modeled as a user
following all his friends, rather than the real-world scenario where a user
can access related activities of friends’ friends
• Two-hop networks
– For instance, Joana can access all activities between Jona and Nandana (see
the sketch below)
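To make “two-hop” concrete, here is a minimal sketch; the adjacency map and
the edges among the three users are purely hypothetical:

def two_hop_neighborhood(graph: dict, user: str) -> set:
    """All users reachable within two hops of `user` (friends and
    friends-of-friends), excluding the user herself."""
    one_hop = set(graph.get(user, ()))
    two_hop = set(one_hop)
    for friend in one_hop:
        two_hop.update(graph.get(friend, ()))
    two_hop.discard(user)
    return two_hop

# Toy example reusing the names from the slide (edges are made up):
graph = {"Joana": ["Jona"], "Jona": ["Joana", "Nandana"], "Nandana": ["Jona"]}
print(two_hop_neighborhood(graph, "Joana"))  # {'Jona', 'Nandana'}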
1. Problem
2. Previous work
3. Our approach
4. Evaluation
Our Approach
• Goal: can most requests be handled locally with far fewer replicas than the
approach in “little engines”?
– To have an approach scalable with two-hop networks
– To reduce the cost of maintaining consistency among replicas
• Main idea (a minimal sketch follows this slide):
– Instead of partitioning the social network, which contains all users who
could potentially send a message,
– we focus on the activity network, which contains only users who actually
send messages
– Partitioning is based on actual activities in the network so that most
requests can be served locally, which can be achieved with far fewer replicas
than under a strong locality guarantee
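To illustrate the distinction, a minimal sketch that derives a weighted
activity network from a message log; the (sender, receiver, timestamp) record
format is our assumption:

from collections import Counter

def build_activity_network(message_log):
    """Collapse a log of (sender, receiver, timestamp) records into a
    weighted activity network: only pairs that actually exchanged
    messages appear, with edge weight = number of messages."""
    edges = Counter()
    for sender, receiver, _ts in message_log:
        # Undirected pair: wall posts between u and v share one edge.
        edges[frozenset((sender, receiver))] += 1
    return edges

log = [("a", "b", 1), ("b", "a", 2), ("a", "c", 3)]
print(build_activity_network(log))  # edge {a,b}: weight 2, edge {a,c}: weight 1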
Activity Network 1/2
• Contains far fewer edges than the social network
– More amenable to partitioning and replication
• Even if a user does not participate in much activity with his direct
neighbors, there can still be abundant activity in his two-hop neighborhood
• Facebook data for Dec 2006 New Orleans network
– While most user pairs have only 1 post, the top 8% of users receive > 100
posts in their two-hop neighborhood
Activity Network 2/2
• There is strong time correlation among wall posts between the same
pairs of users
– This motivates a local, dynamic partitioning algorithm that adapts to the
temporal correlation
• Computation of auto-correlation (reconstructed below)
– Gather all messages in the period as a series m(1), …, m(k)
– Compute the auto-correlation function
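The formula itself did not survive extraction; a standard sample
auto-correlation over the series m(1), …, m(k), with sample mean \bar{m}, is
our reconstruction:

R(\tau) = \frac{\sum_{t=1}^{k-\tau} \bigl(m(t) - \bar{m}\bigr)\bigl(m(t+\tau) - \bar{m}\bigr)}
               {\sum_{t=1}^{k} \bigl(m(t) - \bar{m}\bigr)^{2}},
\qquad
\bar{m} = \frac{1}{k} \sum_{t=1}^{k} m(t)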
Activity Prediction Graph
• The example has 6 users in the network
• There is a message node between each pair of user nodes
• Each edge is assigned a weight that reflects (see the sketch below)
– The read frequency of a user (assuming such data has been collected)
– The likelihood of retrieving a message over this particular edge when the
most recent N messages are retrieved from the two-hop neighborhood
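A minimal sketch of building such a graph; how the two factors combine into
one weight is our illustrative choice (the slide lists the factors but not the
formula), and all names are hypothetical:

def build_apg(pairs, read_freq, retrieval_prob):
    """Activity Prediction Graph sketch: a node per user, a message node
    per communicating pair, and weighted edges from each user to the
    message nodes she may read."""
    nodes, edges = set(), {}
    for (u, v) in pairs:                  # pairs with actual activity
        m = ("msg", u, v)                 # message node between u and v
        nodes.update((u, v, m))
        for reader in (u, v):
            # Illustrative multiplicative combination of the two factors.
            edges[(reader, m)] = read_freq[reader] * retrieval_prob[(u, v)]
    return nodes, edges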
Partitioning Algorithm
• Periodic partitioning
– Use K-Metis to repartition the graph at fixed time epochs
– Between epochs, new users and new interactions are simply added to the
graph, keeping the cost of repartitioning low
• Local adaptive partitioning (see the sketch below)
– Takes advantage of the strong time correlation between messages
– Triggered when a retrieval results in remote accesses
– Keeps movements of nodes across partitions small
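A minimal sketch of the local adaptive step as we read it; the periodic step
would call a METIS binding (e.g. pymetis, an assumption on tooling), and the
greedy pull-to-reader policy here is illustrative:

def local_adapt(retrieved_nodes, partition_of, reader_partition):
    """Triggered when a retrieval touches remote partitions: pull the
    remotely stored nodes of this retrieval into the reader's partition.
    Movement stays small because only the nodes of the offending
    retrieval move; a real policy would also check partition load."""
    moved = []
    for node in retrieved_nodes:
        if partition_of[node] != reader_partition:
            partition_of[node] = reader_partition
            moved.append(node)
    return moved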
1. Problem
2. Previous work
3. Our approach
4. Evaluation
Evaluation Scenario
• Each retrieval involves the most recent 6 messages in a two-hop neighborhood
• We chose 6 because the available dataset is relatively small
– A total of 13,948 messages among 8,640 active users
• We consider the following algorithms (sketches of the hash baselines below):
– Random hashing based on sender (hash_p1)
– Random hashing based on sender-receiver pair (hash_p2)
– Periodic partitioning
– Local adaptive partitioning
– Retrospective algorithm (retro): instead of using the APG, which uses past
data to predict the activity for the current month, it uses the actual
activity in the month; it is the best one can do given the limitations of the
partitioning algorithm
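A minimal sketch of how we read the two hashing baselines; the concrete hash
function and the unordered treatment of the pair are our assumptions:

import hashlib

def _stable_hash(key: str, n_parts: int) -> int:
    # md5 keeps the placement reproducible across runs and machines.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n_parts

def hash_p1(sender: str, receiver: str, n_parts: int) -> int:
    """hash_p1 baseline: place a message by its sender only."""
    return _stable_hash(sender, n_parts)

def hash_p2(sender: str, receiver: str, n_parts: int) -> int:
    """hash_p2 baseline: place a message by the sender-receiver pair."""
    lo, hi = sorted((sender, receiver))
    return _stable_hash(lo + ":" + hi, n_parts)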
• Proportion of retrievals that access only one partition for all 6 messages
• The periodic algorithm performs much better than random hashing
• Close to retro, so the APG does provide relatively accurate predictions
• Proportion of retrievals that access at most 3 partitions for all 6 messages
• Most retrievals touch at most 3 partitions
• The local adaptive algorithm further improves over periodic partitioning
– At least 6.4 times more local queries than random hashing
– Without causing large movements of nodes as in the periodic case
• We also compare periodic partitioning on the APG against a graph that uses
weights from the one-hop neighborhood only
• The retrievals are performed in the two-hop neighborhood
• There is clearly an advantage in considering activities in the two-hop
neighborhood when constructing the graph
• The advantage is more significant as the number of partitions increases
• We proposed a dynamic, adaptive algorithm to predict activities on an
OSN and partition the network for fast retrieval of data
• The algorithm scales when information needs to be retrieved from the
two-hop neighborhood
• Ongoing work aims to improve data locality for most retrievals with a small
number of replicas
