Parallel Subgraph Listing in a
Large-Scale Graph
Yingxia Shao, Bin Cui, Lei Chen, Lin Ma, Junjie Yao, Ning Xu
School of EECS, Peking University
Hong Kong University of Science and Technology
Outline
• Subgraph listing operation
• Related work
• PSgL framework
• Evaluation
• Conclusion
Introduction
Motivation
• Motif Detection in Bioinformatics
• Triangle Counting in SN
• Cascades Counting in RN
Introduction
Problem Definition
Subgraph Listing Operation
o Input: a pattern graph and a data graph [both are undirected]
o Output: all the occurrences of the pattern graph in the data graph.
[Figure: a 4-vertex pattern graph and a 6-vertex data graph]
Goal of our work
o Efficiently listing subgraphs in a large-scale graph
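As a concrete illustration of the operation itself (a deliberately naive backtracking sketch, not the PSgL algorithm; all names are illustrative), subgraph listing maps every pattern vertex to a distinct data vertex so that every pattern edge lands on a data edge:

```python
def list_subgraphs(pattern_edges, data_adj):
    """Naive subgraph listing by backtracking: extend a partial mapping
    pattern vertex -> data vertex one vertex at a time, keeping it
    injective and edge-preserving."""
    pvs = sorted({v for e in pattern_edges for v in e})
    out = []

    def extend(mapping):
        if len(mapping) == len(pvs):
            out.append(tuple(mapping[p] for p in pvs))
            return
        p = pvs[len(mapping)]
        for d in data_adj:
            if d in mapping.values():
                continue  # keep the mapping injective
            # every pattern edge between p and an already-mapped vertex
            # must correspond to a data edge
            if all(mapping[q] in data_adj[d]
                   for (a, b) in pattern_edges
                   for (x, q) in ((a, b), (b, a))
                   if x == p and q in mapping):
                mapping[p] = d
                extend(mapping)
                del mapping[p]

    extend({})
    return out
```

Listing a triangle pattern on a triangle data graph yields the 3! = 6 ordered occurrences; removing such symmetric duplicates is exactly what the automorphism breaking discussed later addresses.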
Related Work
Centralized algorithms
 Enumerate one by one [Chiba ’85, Wernicke ’06, Grochow ’07]
Streaming algorithms
 Only count occurrences, and the counts are approximate [Buriol ’06, Bordino ’08, Zhao ’10]
MapReduce based Parallel algorithms
 Decompose pattern graph + explicit join operation [Afrati ’13]
 Fixed exploration plan + implicit join operation [Plantenga ’13]
 Other efficient algorithms for specific pattern graphs
 Triangle [Suri ’11, Chu ’11, Hu ’13]
Related Work
Drawbacks of existing parallel solutions
• MapReduce is not friendly to graph processing.
• The join operation is expensive.
• They do not balance the distribution of
• the data graph,
• the intermediate results.
The novel PSgL framework lists subgraphs via graph traversal over an in-memory native graph.
Contributions
• We propose an efficient parallel subgraph listing framework, PSgL.
• We introduce a cost model for the subgraph listing in PSgL.
• We propose a simple but effective workload-aware distribution
strategy, which facilitates PSgL to achieve good workload balance.
• We design three independent mechanisms to reduce the size of
intermediate results.
Preliminaries
Partial subgraph instance
• A data structure that records the mapping between the pattern graph and the data graph.
• Denoted by Gpsi.
• Assuming the vertices of the pattern graph Gp are numbered from 1 to |Vp|, we simply write Gpsi as {map(1), map(2), ..., map(|Vp|)}.
[Figure: a 4-vertex pattern graph, a 6-vertex data graph, and example partial instances {?,?,?,?}, {2,3,4,5}, {1,5,6,?}]
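The {map(1), ..., map(|Vp|)} notation can be sketched as a small data structure, with None standing in for the '?' slots; class and method names here are illustrative, not from the PSgL implementation:

```python
class PartialInstance:
    """A partial subgraph instance Gpsi: slot i-1 holds the data vertex
    mapped to pattern vertex i, or None while still unmapped ('?')."""

    def __init__(self, num_pattern_vertices):
        self.map = [None] * num_pattern_vertices  # {?,?,...,?}

    def bind(self, pattern_vertex, data_vertex):
        """Return a new instance with one more pattern vertex mapped
        (pattern vertices are 1-based, as on the slides)."""
        new = PartialInstance(len(self.map))
        new.map = list(self.map)
        new.map[pattern_vertex - 1] = data_vertex
        return new

    def is_complete(self):
        return None not in self.map

    def __repr__(self):
        return "{" + ",".join("?" if m is None else str(m)
                              for m in self.map) + "}"
```

Returning a fresh instance from bind (instead of mutating in place) mirrors the independence property on the next slide: children of a node can be expanded without touching each other.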
Preliminaries
Independence Property
• Gpsi Tree
• A node is a Gpsi.
• The children of a node are derived by expanding one mapped data vertex in the node.
[Figure: Gpsi tree, upper levels — root {?,?,?,?} with children {1,?,?,?}, {2,?,?,?}, {4,?,?,?}, {6,?,?,?}]
• Characteristics
• A Gpsi encodes a set of results.
• Gpsi's are independent from each other, except for those on their generation path.
[Figure: Gpsi tree, lower levels — leaves such as {2,1,?,5}, {2,1,?,3}, {2,3,?,5}, {6,1,?,5}, {6,5,?,1}]
PSgL
PSgL: Parallel Subgraph Listing Framework
• PSgL follows the popular graph processing paradigm:
• the vertex-centric model
• the BSP model
• PSgL iteratively generates Gpsi's in parallel.
• Each Gpsi is expanded by a data vertex.
[Figure: in each iteration, the Gpsi's are partitioned into P1, P2, P3 and processed by Worker-1, Worker-2, Worker-3]
PSgL
Vertex program
Algorithm of Expanding a Gpsi - I
• A Partial Pattern Graph (Gpp) encodes
• the pattern graph,
• the Gpsi,
• the progress state.
• Three types of vertices
• A BLACK vertex is one that has been expanded.
• A GRAY vertex has a mapped data vertex, but has not been expanded.
• A WHITE vertex is one that has not been mapped to any data vertex.
[Figure: a Gpp example — pattern vertex mappings <1,3>, <2,5>, <4,2>, <3,?>, i.e., Gpsi = { 3, 5, ?, 2 }]
PSgL
Vertex program
Algorithm of Expanding a Gpsi - II
• Main logic
• Change one GRAY vertex into BLACK;
• Validate the expanding vertex’s GRAY neighbors;
• Make the expanding vertex’s WHITE neighbors become GRAY.
• Two observations
• In each expansion, at least one pattern vertex is processed.
• All GRAY vertices are valid candidates for the next expansion.
[Figure: expanding vertex <4, 2> in the Gpp with mappings <1,3>, <2,5>, <4,2>, <3,?> (Gpsi = { 3, 5, ?, 2 })]
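The three steps of the main logic can be sketched as below. This is a simplified, single-process illustration of one expansion step, not the actual distributed vertex program; gpsi maps pattern vertex -> data vertex, and p is the GRAY pattern vertex being expanded:

```python
WHITE, GRAY, BLACK = 0, 1, 2

def expand(gpsi, colors, p, pattern_adj, data_adj):
    """One expansion step: validate GRAY neighbors, turn p BLACK, and
    map p's WHITE neighbors to distinct neighbors of p's data vertex.
    Returns the child (gpsi, colors) pairs, or [] if invalid."""
    d = gpsi[p]  # data vertex mapped to the expanding pattern vertex
    # 1. validate GRAY neighbors: their data vertices must neighbor d
    for q in pattern_adj[p]:
        if colors[q] == GRAY and gpsi[q] not in data_adj[d]:
            return []  # invalid partial instance, prune it
    colors = dict(colors)
    colors[p] = BLACK  # 2. the expanding vertex becomes BLACK
    whites = [q for q in pattern_adj[p] if colors[q] == WHITE]
    children = []

    # 3. WHITE neighbors become GRAY, mapped to distinct neighbors of d
    def assign(i, g, c):
        if i == len(whites):
            children.append((g, c))
            return
        q = whites[i]
        for cand in data_adj[d]:
            if cand not in g.values():  # keep the mapping injective
                g2 = dict(g); g2[q] = cand
                c2 = dict(c); c2[q] = GRAY
                assign(i + 1, g2, c2)

    assign(0, dict(gpsi), colors)
    return children
```

For a triangle pattern with only vertex 1 mapped, expanding vertex 1 produces one child per injective assignment of its two WHITE neighbors to neighbors of the mapped data vertex.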
PSgL
Analysis
Efficiency of PSgL
Total cost:

    C = Σ_{i=1}^{S} max_{1≤k≤K} Σ_{Gpsi ∈ W(k,i)} load(Gpsi)

where S is the number of iterations, K is the number of workers, W(k,i) is the set of Gpsi's processed by worker k in iteration i, and load(Gpsi) is the cost of processing a Gpsi.

Three metrics:
• The number of iterations.
• S is bounded by |MVC| ≤ S ≤ |Vp| - 1.
• Workload balance.
• Required by the max function.
• The number of Gpsi's.
*Refer to the paper for the details of estimating load(Gpsi).
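Under the cost annotations on this slide — per iteration, the runtime is bounded by the busiest worker — the total cost can be computed as follows (a sketch with an assumed input shape, not code from the paper):

```python
def total_cost(iterations):
    """Total cost = sum over iterations of the busiest worker's load,
    where a worker's load is the summed load(Gpsi) of the Gpsi's it
    processes. `iterations` is a list with one entry per iteration:
    {worker_id: [load(Gpsi) for each Gpsi on that worker]}."""
    return sum(
        max(sum(loads) for loads in per_worker.values())
        for per_worker in iterations
    )
```

This makes the three metrics explicit: fewer iterations shrink the outer sum, better balance shrinks the inner max, and fewer Gpsi's shrink each worker's summed load.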
Optimization
Workload balance - I
• Partial subgraph instance distribution problem
• There are N Gpsi's to be processed by K workers; the goal is to find a distribution strategy that minimizes the maximum worker load.
• NP-hard problem!
• Naive solutions
• Random distribution strategy
• Roulette wheel distribution strategy
• A Gpsi has a higher probability of being expanded by a data vertex with a smaller degree.
Optimization
Workload balance - II
• Workload-aware distribution strategy
• A general greedy-based heuristic rule.

α       | Description                                                          | Drawbacks
1       | Select the worker with the minimal overall workload                  | local optimum
0       | Select the worker where the Gpsi incurs the least increased workload | imbalance
0.5 (*) | Make a trade-off between the local optimum and imbalance             | -

All three strategies have the same worst-case bound, K·|OPT|.
But in practice, α = 0.5 performs best.
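The rule in the table can be sketched as one scoring function that blends the two extremes. The per-worker increase table is an assumed input for illustration (in PSgL it would come from the load() estimate), and all names are illustrative:

```python
def distribute(tasks, num_workers, alpha=0.5):
    """Greedy workload-aware distribution sketch: assign each Gpsi to
    the worker minimizing
        alpha * current_load(worker) + (1 - alpha) * increase(worker).
    alpha=1 always picks the currently least-loaded worker; alpha=0
    always picks the cheapest placement. tasks[i][k] is the extra load
    Gpsi i would put on worker k."""
    load = [0.0] * num_workers
    assignment = []
    for inc in tasks:
        best = min(range(num_workers),
                   key=lambda k: alpha * load[k] + (1 - alpha) * inc[k])
        load[best] += inc[best]
        assignment.append(best)
    return assignment, load
```

With identical per-worker increases, alpha=1 spreads the Gpsi's across workers while alpha=0 piles them onto one worker, reproducing the imbalance drawback from the table.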
Optimization
Comparison among various approaches
[Figure: workload distribution under the Random, Roulette, and workload-aware (α = 0, α = 1, ...) strategies]
Optimization
Partial subgraph instance reduction - I
• Pattern graph automorphism breaking
• Use DFS to find the equivalent vertex groups.
• Assign a partial order to each equivalent vertex group.
• Initial pattern vertex selection based on the cost model
• Introduce a cost model.
• General pattern graphs: enumerate all possible selections based on the cost model.
• Cycles and cliques: the vertex with the lowest rank is the best one.
[Figure: Automorphism Breaking → Cost Model → Best Initial Pattern Vertex]
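Partial-order symmetry breaking can be illustrated on a triangle pattern, whose three vertices form a single equivalent group, giving map(1) < map(2) < map(3). The filter below is a sketch (finding the groups themselves via DFS is not shown):

```python
def break_automorphisms(instances, order_constraints):
    """Keep only the instances that respect the partial order assigned
    to equivalent pattern vertices, so each subgraph occurrence is
    emitted exactly once. `instances` holds complete Gpsi tuples
    (1-based pattern vertex i -> instance[i-1]); `order_constraints`
    is a list of pairs (a, b) meaning map(a) < map(b)."""
    return [inst for inst in instances
            if all(inst[a - 1] < inst[b - 1]
                   for (a, b) in order_constraints)]
```

The six ordered occurrences of one triangle collapse to a single representative; in PSgL the same constraints are applied online during expansion rather than as a final filter.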
Optimization
Partial subgraph instance reduction - II
• Online pruning of invalid Gpsi's
• Filter by the partial order and degree restrictions.
• Prune with the help of a lightweight global edge index.
• Use a Bloom filter to index the two ends of an edge.
Data Graph  | PG          | # Gpsi w/ index | # Gpsi w/o index | Pruning Ratio
LiveJournal | PG1(v1)     | 2.86 x 10^8     | 6.81 x 10^8      | 58.01%
LiveJournal | PG4(v1)     | 9.93 x 10^9     | OOM              | unknown
LiveJournal | PG5(v1)     | 2.26 x 10^7     | 3.17 x 10^8      | 92.87%
UsPatent    | PG5(v3; v4) | 7.38 x 10^9     | 2.04 x 10^10     | 63.89%
[Figure: pattern graphs PG1, PG4, PG5]
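The light-weight global edge index can be sketched as a Bloom filter over undirected edges (the sizes and hashing scheme below are illustrative assumptions, not the paper's design). A negative answer is definite and prunes the Gpsi; a positive answer only means "cannot prune", since Bloom filters admit false positives:

```python
import hashlib

class EdgeBloomFilter:
    """Compact, lossy membership index over undirected edges, so any
    worker can ask 'does edge (u, v) possibly exist?' without holding
    the whole data graph."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.m, self.k = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, u, v):
        key = f"{min(u, v)}-{max(u, v)}".encode()  # normalize: undirected
        for i in range(self.k):
            h = hashlib.sha256(key + bytes([i])).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add_edge(self, u, v):
        for p in self._positions(u, v):
            self.bits[p // 8] |= 1 << (p % 8)

    def may_contain(self, u, v):
        # all k bits set -> edge may exist; any bit clear -> definitely not
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(u, v))
```

Because only absent edges are pruned, false positives cost extra work but never lose results, which matches the pruning ratios in the table being a pure reduction of intermediate Gpsi's.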
Evaluation
Evaluation - Comparing to MR solutions
 Afrati and SGIA-MR are the state-of-the-art MapReduce solutions.
 Ratios exceeding 100× are not visualized.
[Figure: speedup of PSgL over the MapReduce solutions on several pattern graphs; e.g., PSgL: 4302s vs. Afrati: 7291s]
Evaluation
Evaluation - Comparing to GraphLab
Data Graph  | Pattern Graph | Afrati | PowerGraph | PSgL
Twitter     | 1             | 432min | 2min       | 12.5min
Wikipedia   | 1             | 871s   | 36s        | 125s
WikiTalk    | 2             | 4402s  | 48s        | 318s
WikiTalk    | 3             | 13743s | 100s       | 494s
WikiTalk    | 3             | 13743s | OOM*       | 494s
WikiTalk    | 4             | 1785s  | 127s       | 38s
LiveJournal | 4             | 2749s  | OOM        | 1330s
* using a different traversal order.
[Figure: the pattern graphs 1-4 referenced in the table]
Conclusion
• Subgraph listing is a fundamental operation for massive graph
analysis.
• We propose an efficient parallel subgraph listing framework, PSgL.
• Various distribution strategies
• Cost model
• Light-weight global edge index
• The workload-aware distribution strategy can be extended to other load-balancing problems.
• A new execution engine is required for larger pattern graphs.
Thanks!
Backup Expr. – Scalability of PSgL
Performance vs. Worker Number
Backup Expr. – Initial pattern vertex
selection
LiveJournal
Random graph
Influences of the Initial Pattern Vertex on Various Data Graphs
