Report

Parallel Subgraph Listing in a Large-Scale Graph Yingxia Shao Bin Cui Lei Chen Lin Ma Junjie Yao Ning Xu School of EECS, Peking University Hong Kong University of Science and Technology 1 Outline • Subgraph listing operation • Related work • PSgL framework • Evaluation • Conclusion 2 Introduction Motivation Motif Detection in Bioinformatics Triangle Counting in SN Cascades Counting in RN 3 Introduction Problem Definition 1 2 4 3 Subgraph Listing Operation o Input: pattern graph, data graph [both are undirected] o Output: all the occurrences of pattern graph in the data graph. Pattern graph Goal of our work o Efficiently listing subgraph in a large-scale graph 4 6 5 1 3 2 Data graph 4 Related Work Related Work Centralized algorithms Enumerate one by one [Chiba ’85, Wernicke ’06, Grochow ’07] Streaming algorithms Only counting and results are inaccurate [Buriol ’06, Bordino ’08, Zhao ’10] MapReduce based Parallel algorithms Decompose pattern graph + explicit join operation [Afrati ’13] Fixed exploration plan + implicit join operation [Plantenga ’13] Other efficient algorithms for specific pattern graph Triangle [Suri ’11, Chu ’11, Hu ’13] 5 Related Work Drawbacks in existing parallel solutions • MapReduce is not friendly to process graphs. • Join operation is expensive. • Do not take care of the balance of data distribution. • Data graph • Intermediate results The novel PSgL framework lists subgraph via graph traversal on in-memory stored native graph. 6 Contributions • We propose an efficient parallel subgraph listing framework, PSgL. • We introduce a cost model for the subgraph listing in PSgL. • We propose a simple but effective workload-aware distribution strategy, which facilitates PSgL to achieve good workload balance. • We design three independent mechanisms to reduce the size of intermediate results. 7 Preliminaries Partial subgraph instance • A data structure that records the mapping between pattern graph and data graph. • Denoted by • Assume the vertices of are numbered from 1 to | |, we simply state as {map(1), map(2), ..., map(| |)}. 4 1 2 4 3 6 5 1 {?,?,?,?} 1 {2,3,4,5} 1 {1,5,6,?} 3 2 4 6 5 3 2 4 6 5 3 2 8 Preliminaries Independence Property • Tree • A node is a • The children of a node are derived from expanding one mapped data vertex in the node. Gpsi {?,?,?,?} Gpsi {1,?,?,?} Gpsi {2,?,?,?} Gpsi {4,?,?,?} Gpsi {6,?,?,?} • Characteristics • A encodes a set of results. • s are independent from each other except the ones in its generation path. Gpsi {2,1,?,5} Gpsi {2,1,?,3} Gpsi {2,3,?,5} Gpsi {6,1,?,5} Gpsi {6,5,?,1} Tree 9 PSgL PSgL: Parallel Subgraph Listing Framework • PSgL follows the popular graph processing paradigm Iteration • vertex centric model • BSP model • PSgL iteratively generates in parallel; • Each is expanded by a data vertex. Gpsi Gpsi Gpsi Gpsi Gpsi Gpsi Gpsi Gpsi Gpsi P1 P2 P3 Worker-1 Worker-2 Worker-3 10 PSgL Vertex program Algorithm of Expanding a - I • Partial Pattern Graph encodes • pattern graph, • , • progress state. • Three types of vertices • BLACK vertex is the one which has been expanded. • GRAY vertex has a mapped data vertex, but it has not been expanded. • WHITE vertex is the one which hasn’t been mapped to any data vertex. 1 4 2 Gp <1,3> <2,5> <4,2> <3,?> 3 + Gpsi = { 3, 5, ?, 2 } Gpp 11 PSgL Vertex program Algorithm of Expanding a Gpsi - II • Main logic • Changes one GRAY vertex into BLACK; • Validates the expanding vertex’s GRAY neighbors; • Makes the expanding vertex’s WHITE neighbor become GRAY. • Two observations • In each expansion, at least one pattern vertex is processed. • All GRAYs are the valid candidates for the next expansion. 1 4 2 Gp <1,3> <2,5> <4,2> <3,?> 3 + Gpsi = { 3, 5, ?, 2 } Gpp Example: expanding vertex <4, 2> 12 PSgL Analysis Efficiency of PSgL # of Gpsi processed by worker k # of iterations Total cost # of workers Three metrics: cost of processing a Gpsi • The number of iterations. • S is bounded by |MVC| ≤ S ≤ |Vp| - 1. • Workload balance. • Required by the max function. • The number of s *Refer to the paper for the details of estimating load( ). 13 Optimization Workload balance - I • Partial subgraph instance distribution problem • There are N s to be processed by K workers, the goal is to find out a distribution strategy to achieve • NP-hard problem! • Naive Solutions • Random distribution strategy • Roulette wheel distribution strategy • has a higher probability to be expanded by a data vertex with smaller degree. 14 Optimization Workload balance - II • Workload aware distribution strategy • A general greedy-based heuristic rule. α Description Drawbacks 1 Selecting worker for the ℎ which has minimal overall workload local optimal 0 Selecting worker where the incurs the least increased workload imbalance Making a trade-off between local optimal and imbalance - 0.5 (*) All three strategies have the same worst bound which is K*|OPT|. But in practice, α = 0.5 performs best. 15 Optimization Comparison among various approaches Random Roulette =0 =1 =0 16 Optimization Partial subgraph instance reduction - I • Pattern graph automorphism breaking • Using DFS to find the equivalent vertex group • Assign partial order for each equivalent vertex group • Initial pattern vertex selection • Introduce a cost model • General pattern graph 1 2 < 3 Automorphism Breaking Cost Model Best Initial Pattern Vertex Initial Pattern Vertex Section based on cost model • Enumerate all possible selections based on cost model • Cycle and clique • The vertex with lowest rank is the best one. 17 Optimization Partial subgraph instance reduction - II 1 • Online pruning invalid s 2 • Filter by the partial order and degree restriction • Prune with the help of a light weight global edge index 3 PG1 1 4 2 3 • Using bloom filter to index the ends of an edge Data Graph PG Gpsi # w/ index Gpsi # w/o index Pruning Ratio LiveJournal PG1(v1) 2.86 x 108 6.81 x 108 58.01% PG4(v1) 9.93 x 109 OOM unknown PG5(v1) 2.26 x 107 3.17 x 108 92.87% PG5(v3; v4) 7.38 x 109 2.04 x 1010 63.89% UsPatent PG4 1 2 5 3 4 PG5 18 Evaluation Evaluation - Comparing to MR solutions Afrati and SGIA-MR are the state-of-art MapReduce solutions. The ratios exceed 100 times are not visualized. PSgL: 4302s Afrati: 7291s 1 2 3 1 4 2 3 19 Evaluation Evaluation - Comparing to GraphLab Data Graph Pattern Graph Afrati PowerGraph PSgL Twitter 1 432min 2min 12.5min Wikipedia 1 871s 36s 125s WikiTalk 2 4402s 48s 318s WikiTalk 3 13743s 100s 494s WikiTalk 3 13743s OOM* 494s WikiTalk 4 1785s 127s 38s LiveJournal 4 2749s OOM 1330s * using a different traversal order. 1 2 3 1 1 4 1 4 1 4 2 3 2 3 2 3 2 3 4 20 Conclusion • Subgraph listing is a fundamental operation for massive graph analysis. • We propose an efficient parallel subgraph listing framework, PSgL. • Various distribution strategies • Cost model • Light-weight global edge index • The workload-aware distribution strategy can be extended to other balance problems. • A new execution engine is required for larger pattern graphs. 21 Thanks! 22 Backup Expr. – Scalability of PSgL Performance vs. Worker Number 23 Backup Expr. – Initial pattern vertex selection Livejournal Random graph Influences of the Initial Pattern Vertex on Various Data Graphs 24