Graph Data Mining with Map-Reduce
Nima Sarshar, Ph.D.
Intuit Inc.
[email protected]
Intuit, Graphs and Me
 Me:
 Large-scale graph data processing, complex networks analysis, graph algorithms …
 Intuit:
 QuickBooks, TurboTax, Mint.com, GoPayment, …
 Graphs @ Intuit:
 The Commercial Graph is the business “social network”
My Goals for this Talk
 You leave with your inner computer scientist tantalized:
 There is more to writing efficient Map-Reduce algorithms than counting words and merging logs
 You get a general sense of the state of the research
 I convince you of the need for a real graph processing package for Hadoop
 You know a bit about our work at Intuit
Plan
 Jump right to it with an example (enumerating triangles)
 Define the performance metrics (what are we optimizing for?)
 Give a classification of known “recipes”
 The triangle example with a new trick
 Personalized PageRank, connected components
 A list of other algorithms
Finding Triangles with Map-Reduce
 2 rounds of Map-Reduce jobs:
 The first round bins edges under their endpoints and emits the “potential triangles” to consider
 Another will check for the existence of the “closing” edge
[Figure: example graph on nodes 1–4 and the potential triangles generated under each node]
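To make the two rounds concrete, here is a minimal in-memory sketch, assuming the graph arrives as (u, v) edge pairs; the function names and the pure-Python shuffle() stand in for real Hadoop jobs and are illustrative, not from the original deck:

```python
from collections import defaultdict
from itertools import combinations

def round1_map(edges):
    # Bin each edge under both of its endpoints.
    for u, v in edges:
        yield u, (u, v)
        yield v, (u, v)

def round1_reduce(node, incident):
    # Emit one "potential triangle" (wedge) per pair of incident edges.
    neighbors = [w for e in incident for w in e if w != node]
    for a, b in combinations(sorted(set(neighbors)), 2):
        yield (a, b), ("wedge", node)

def round2_map(edges):
    # Re-emit every real edge as a witness for the "closing" edge check.
    for u, v in edges:
        yield tuple(sorted((u, v))), ("edge", None)

def shuffle(pairs):
    # Stand-in for the Map-Reduce shuffle: group values by key.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups.items()

def triangles(edges):
    wedges = [kv for k, vs in shuffle(round1_map(edges))
                 for kv in round1_reduce(k, vs)]
    for (a, b), values in shuffle(list(round2_map(edges)) + wedges):
        if any(tag == "edge" for tag, _ in values):   # closing edge exists
            for tag, node in values:
                if tag == "wedge":
                    yield tuple(sorted((a, b, node)))

print(sorted(triangles([(1, 2), (1, 3), (2, 3), (3, 4)])))
# -> [(1, 2, 3), (1, 2, 3), (1, 2, 3)]
```

Note the output: the single triangle surfaces once under each of its three vertices, which is exactly problem 1 on the next slide.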
Problems with this Approach
1. Each triangle will be detected 3 times – once under each of its 3 vertices
2. Too many “potential” triangles are created in the first reduce step.
 For a node with degree d: $\binom{d}{2} \sim O(d^2)$
 Total # of records: $\sum_{v \in V} d_v^2 = N \sum_k k^2 p_k = N \langle k^2 \rangle$, where $p_k = N_k / N$ is the fraction of nodes with degree k
Modified Algorithm [Cohen ‘08]
 For each triangle, exactly one potential triangle is created (under the lowest-value node)
[Figure: the same example; each edge is binned under its lower-numbered endpoint before wedges are generated]
The quadratic problem still persists
 This is neat. At least we are not triple counting
 But the quadratic problem still exists. The number of records is still $O(N \langle k^2 \rangle)$
 We want to avoid binning edges under high-degree nodes
 The ordering of nodes is arbitrary! Let the degree of a node define its order:
 Bin an edge under its LOW-DEGREE node
 Break ties arbitrarily, but consistently
[Figure: the example graph with each edge re-binned under its lower-degree endpoint]
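As a sketch, the binning rule can be a one-liner, assuming a precomputed degree map (the names here are illustrative):

```python
def bin_edge(u, v, degree):
    # Bin the edge under its low-degree endpoint; comparing
    # (degree, id) pairs breaks ties arbitrarily but consistently.
    return u if (degree[u], u) < (degree[v], v) else v

degree = {1: 3, 2: 2, 3: 4, 4: 1}
print(bin_edge(3, 4, degree))  # -> 4; the hub node 3 is avoided
```

Wedges are then generated only at a node that precedes both of its neighbors in this order, so a hub of degree d no longer contributes O(d²) records.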
The performance
 Worst case: $\Theta(M^{3/2})$ records vs. $\Theta(M^2)$
 The same as the best serial algorithm [Suri ‘11]
 The gain for “real” graphs is fairly substantial. If a graph is reasonably random, it cuts down to $N \langle k \rangle^2$ vs. $N \langle k^2 \rangle$
 For a heavy-tailed social graph (like our Commercial Graph), this can be fairly huge
Enumerating Rectangles
 Triangles will tell you the friends you have in common with another friend
 “People You May Know”: find another node, not connected to you, who has many friends in common with you. That node is a good candidate for “friendship”.
 If the graph is bipartite, this is the basis of user-based or content-based collaborative filtering
Generalization to Rectangles
 Ordering triangle nodes gives a unique equivalence class; generalization to rectangles requires a bit more work
 There are 4 classes for a rectangle
[Figure: the ordered triangle (A, B, C) and the four distinct node orderings of a rectangle on nodes 1–4]
Performance Metrics
 Computation:
 Total computation in all mappers and reducers
 Communication:
 How many bits are shuffled from the mappers to the reducers
 Number of map-reduce steps:
 You can work it into the above
 The overhead of running jobs
“Recipes” for Graph MR Algorithms
Roughly two classes of algorithms:
1. Partition-compute then merge
 Create smaller sub-graphs that each fit into a single machine’s memory
 Do the computation on the small graphs
 Construct the final answer from the answers to the small sub-problems
2. Compute-in-parallel then merge
Partition-Compute-Merge
Finding Triangles By Partitioning
[Suri ‘11]
1. Partition the nodes into b sets: $V = V_1 \cup V_2 \cup \dots \cup V_b$, with $V_i \cap V_j = \emptyset$ for $i \neq j$
2. For every 3 sets, $V_{i,j,k} = V_i \cup V_j \cup V_k$ with $i < j < k$, create a reducer
3. Send an edge to $V_{i,j,k}$ iff both its ends are in $V_{i,j,k}$
4. Detect triangles using a serial algorithm within each reducer
Example: b=4, V1={1}, V2={2}, V3={3}, V4={4}
[Figure: the example graph and the edges received by reducers V1,2,3, V1,3,4 and V2,3,4]
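A sketch of the mapper in step 3, assuming a simple hash partitioner over b ≥ 3 sets; each edge is replicated to the O(b) reducers whose triple contains both endpoints’ sets (names illustrative):

```python
from itertools import combinations

def node_partition(node, b):
    # Stand-in for any deterministic partitioner of nodes into b sets.
    return hash(node) % b

def partition_map(edges, b):
    for u, v in edges:
        pu, pv = sorted((node_partition(u, b), node_partition(v, b)))
        rest = [w for w in range(b) if w != pu and w != pv]
        if pu == pv:
            # Both ends in one set: every triple containing it qualifies.
            for j, k in combinations(rest, 2):
                yield tuple(sorted((pu, j, k))), (u, v)
        else:
            # Ends in two sets: every triple containing both qualifies.
            for k in rest:
                yield tuple(sorted((pu, pv, k))), (u, v)
```

Each reducer $V_{i,j,k}$ then runs the serial triangle finder on the edge set it receives.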
Analysis
 Every triangle is detected: all 3 vertices are guaranteed to be together in at least one reducer $V_{i,j,k}$
 Average # of edges in each reducer is $O(M / b^2)$
 Use an optimal serial triangle finder at each reducer. The total amount of work across all reducers is $(M / b^2)^{3/2} \times b^3 = O(M^{3/2})$
 The # of edges sent from the mappers to reducers (communication cost) is $O(bM) = O(M^{3/2})$ for $b = \sqrt{M}$
One Problem
 Each triangle may be detected multiple times. If all three vertices are mapped to the same partition, it will be detected $\binom{b-2}{2} \sim O(b^2)$ times
 This can be fixed with a similar ordering-of-nodes trick [Afrati ’12]
 Can be generalized to detect other small graph structures efficiently [Afrati ‘12]
Minimum Weight Spanning Tree
1. Partition the nodes into b sets
2. For every pair of sets, create a reducer
3. Send all edges that have both their ends in one pair to the corresponding reducer
4. Compute the minimum spanning tree of the graph in each reducer. Remove the other edges to sparsify the graph
5. Compute the MST of the sparsified graph
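A minimal in-memory sketch of this recipe, assuming weighted edges as (w, u, v) tuples and a precomputed node partition `part`; the per-reducer MSTs use a small Kruskal/union-find helper (all names illustrative):

```python
from itertools import combinations

def kruskal(nodes, edges):
    # Minimum spanning forest via Kruskal with a tiny union-find.
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    forest = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((w, u, v))
    return forest

def mst_by_partition(edges, part, b):
    kept = set()
    # Steps 2-4: one "reducer" per pair of node sets keeps only the
    # edges of its local minimum spanning forest.
    for i, j in combinations(range(b), 2):
        sub = [e for e in edges
               if part[e[1]] in (i, j) and part[e[2]] in (i, j)]
        nodes = {n for _, u, v in sub for n in (u, v)}
        kept.update(kruskal(nodes, sub))
    # Step 5: MST of the sparsified graph = MST of the full graph.
    all_nodes = {n for _, u, v in edges for n in (u, v)}
    return kruskal(all_nodes, sorted(kept))
```

Discarding an edge here is safe by the cycle property: an edge that is not in the minimum spanning forest of a subgraph containing it cannot be in the MST of the full graph.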
Compute-in-parallel and merge
Personalized PageRank
 Like the global PageRank, but with a random walker that jumps back to where it started with probability d
 For every v you will have a personalized PageRank vector of length N
 We usually keep only a limited number of top personalized PageRanks for each node
 It finds the influential nodes in the proximity of a given node
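In symbols, using a standard formulation consistent with the slide (not taken from it), the personalized vector of a source node v satisfies:

```latex
% \pi_v : personalized PageRank vector of source v (row vector)
% e_v  : indicator vector of v;  P : random-walk transition matrix
% d    : the restart ("jump back home") probability
\pi_v = d\, e_v + (1 - d)\, \pi_v P
```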
Monte Carlo Approximation
Simulate many random walks from every single node. For each walk:
1. A walk starting from node v is identified by v
 Keep track of <v, Uv,t> where Uv,t is the current end point at step t of the walk starting at node v
2. In each Map-Reduce step, advance the walk by 1 step
 Pick a random neighbor of Uv,t
3. Count the frequency of visits to each node
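One such Map-Reduce step, sketched in memory: walk records <v, Uv,t> are keyed by their current endpoint so each reducer sees the adjacency list it needs (the record tags and the restart value are illustrative assumptions):

```python
import random

RESTART = 0.15  # illustrative value for the restart probability d

def step_map(adjacency, walks):
    # Key adjacency lists and walk records by the walk's current endpoint.
    for node, neighbors in adjacency.items():
        yield node, ("adj", neighbors)
    for source, endpoint in walks:          # the <v, U_{v,t}> records
        yield endpoint, ("walk", source)

def step_reduce(node, values):
    # Advance every walk currently sitting at `node` by one step.
    values = list(values)
    neighbors = next(v for tag, v in values if tag == "adj")
    for tag, source in values:
        if tag == "walk":
            if not neighbors or random.random() < RESTART:
                yield source, source        # jump back to the walk's start
            else:
                yield source, random.choice(neighbors)
```

After T such steps, counting how often each node appears as an endpoint, grouped by source v, gives the Monte Carlo estimate of v’s personalized PageRank.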
One can do better [Das Sarma ‘08]
 This takes T Map-Reduce steps for a walk of length T
 We can cut it down to T^{1/2} by a simple “stitching” idea:
1. Do T/J random walks of length J from every node, for some J
2. To form a walk of length T, pick one of the T/J segments at random and jump to the end of the segment
3. Pick another random segment, etc.
4. If you arrive at a node twice, do not use the same segment (that’s why you need T/J segments)
Total iterations: J + T/J → minimized when J = T^{1/2} → O(T^{1/2})
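The balance point is a one-line optimization:

```latex
\frac{d}{dJ}\!\left(J + \frac{T}{J}\right) = 1 - \frac{T}{J^2} = 0
\;\Rightarrow\; J = \sqrt{T},
\qquad J + \frac{T}{J}\Big|_{J=\sqrt{T}} = 2\sqrt{T} = O\!\left(T^{1/2}\right)
```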
Exponential speed up [Bahmani ‘11]
 The stitching was done somewhat serially (at each step, one segment was stitched to another)
 Idea: stitch recursively, which results in exponentially expanding the walk/segment ratio
 It takes a few more tricks to make it work, but you can bring it down to O(log T)
Labeling Connected Components
 Assign the same ID to all nodes inside the same component
[Figure: example graph on nodes 1–6]
How do we do it on one machine?
1. i = 1
2. Pick a random node you have not picked before, assign it id = i and put it in a stack
3. Pop a node from the stack and push all of its neighbors we have not seen before onto the stack. Assign them id = i
4. If the stack is not empty go to 3; otherwise i → i + 1 and go to 2
Time and memory complexity: O(M).
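The same procedure as a short sketch, assuming the adjacency is a dict of neighbor lists (names illustrative):

```python
def label_components(adjacency):
    comp_id = {}
    i = 0
    for start in adjacency:           # steps 1-2: seed a new component
        if start in comp_id:
            continue
        i += 1
        comp_id[start] = i
        stack = [start]
        while stack:                  # step 3: pop, absorb unseen neighbors
            node = stack.pop()
            for nb in adjacency[node]:
                if nb not in comp_id:
                    comp_id[nb] = i
                    stack.append(nb)
    return comp_id

g = {1: [2], 2: [1, 5], 5: [2], 3: [4], 4: [3], 6: []}
print(label_components(g))  # -> {1: 1, 2: 1, 5: 1, 3: 2, 4: 2, 6: 3}
```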
In Map-Reduce: More Parallelism
 Instead of growing a frontier zone from a single seed, start growing it from all nodes. When two zones meet, merge them
Edge File: <v1,v2>, <v2,v3>, <v3,v4>
Zone File: <v1,z1>, <v2,z2>, <v3,z3>, <v4,z4>
[Figure: a path graph on nodes 1–4, each node starting in its own zone]
Game Plan
1. Bin the zone and edge files by node
2. Produce an edge-to-zone map (records like <[v1,v2],z1>)
3. Collect over edges to produce a zone-to-zone map (records like <z2,z1>)
4. Reconcile zones and reassign zones to nodes, producing the new zone file
[Table: the worked example records at each stage for the path graph; after one round the new zone file reads <v1,z1>, <v2,z1>, <v3,z2>, <v4,z3>]
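A simplified in-memory sketch of one round: here min-label propagation stands in for the full zone-file bookkeeping above, with each edge proposing the smaller of its endpoints’ zone ids to the other endpoint (names and the driver loop are illustrative):

```python
def zone_round(edges, zone):
    # One simulated Map-Reduce round: edges propose zone merges and
    # each node keeps the smallest zone id it has seen.
    proposal = dict(zone)
    for u, v in edges:
        lo = min(zone[u], zone[v])
        proposal[u] = min(proposal[u], lo)
        proposal[v] = min(proposal[v], lo)
    return proposal, proposal != zone

def label_zones(nodes, edges):
    zone = {n: n for n in nodes}      # every node starts as its own zone
    changed = True
    while changed:                    # one Map-Reduce job per iteration
        zone, changed = zone_round(edges, zone)
    return zone

print(label_zones([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)]))
# -> {1: 1, 2: 1, 3: 1, 4: 1} in O(diameter) rounds
```

This matches the analysis on the next slide: the smallest id spreads one hop per round, so the number of rounds tracks the graph diameter.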
Analysis
 Communication: O(M+N)
 Number of rounds: O(d), where d is the diameter of the graph. Most real graphs have small diameters
 Random graph: d = O(log N)
 The worst case is a “path graph”
 An algorithm with O(M+N) communication and O(log N) rounds exists for all graphs [Rastogi ’12]
 It uses an idea similar to MinHash
Intuit’s GraphEdge
 A (hopefully soon-to-be open-sourced) graph processing package for Hadoop, built on Cascading
 Efficient support for many core graph processing algorithms:
 State-of-the-art algorithms
 Industry-grade testing for scalability
 It will take a few more months to release
 We would love to gauge your interest
Intuit’s Commercial Graph
 Think of a graph in which a node is a business or a consumer
 An edge is a transaction between these entities
 The entities are either direct clients of Intuit’s many offerings, or business partners of Intuit’s clients
 We experiment with a “toy” version of this graph: about 200M nodes and 10B edges
References
 Cohen, Jonathan. “Graph twiddling in a MapReduce world.” Computing in Science & Engineering 11.4 (2009): 29–41.
 Suri, Siddharth, and Sergei Vassilvitskii. “Counting triangles and the curse of the last reducer.” Proceedings of the 20th International Conference on World Wide Web (WWW). ACM, 2011.
 Bahmani, Bahman, Kaushik Chakrabarti, and Dong Xin. “Fast personalized PageRank on MapReduce.” Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. 2011.
 Das Sarma, Atish, Sreenivas Gollapudi, and Rina Panigrahy. “Estimating PageRank on graph streams.” PODS 2008: 69–78.
 Afrati, Foto N., Dimitris Fotakis, and Jeffrey D. Ullman. “Enumerating subgraph instances using Map-Reduce.” http://arxiv.org/abs/1208.0615, 2012.
 Lattanzi, Silvio, et al. “Filtering: a method for solving graph problems in MapReduce.” SPAA 2011.
 Rastogi, Vibhor, et al. “Finding connected components in Map-Reduce in logarithmic rounds.” 2012.