
NETS 212: Scalable and Cloud Computing
Graph algorithms in MapReduce
October 15, 2013
© 2013 A. Haeberlen, Z. Ives, University of Pennsylvania

Announcements
- No class on October 22nd or 24th (Andreas at IMC in Barcelona)
- Special 'catch-up' class on October 30th, 4:30-6:00pm, location TBA
- Please work on HW2 and HW3 (will be released this week)
- Any questions about HW2? If you haven't started yet: please start early!

What we have seen so far
- In the first half of the semester, we saw how the map/reduce model can be used to filter, collect, and aggregate data values
- This is useful for data with limited structure: we can extract pieces of input data items and collect them to run various reduce operations, and we can "join" two different data sets on a common key
- But that's not enough...

Beyond average/sum/count
- Much of the world is a network of relationships and shared features
  - Members of a social network can be friends, and may have shared interests / memberships / etc.
  - Customers might view similar movies, and might even be clustered by interest groups
  - The Web consists of documents with links; documents are also related by topics, words, authors, etc.

Goal: Develop a toolbox
- We need a toolbox of algorithms useful for analyzing data that has both relationships and properties
- For the next ~2 lectures we'll start to build this toolbox
- Some of the problems are studied in courses you may not have taken yet: CIS 320 (algorithms), CIS 391/520 (AI), CIS 455 (Web Systems); so we'll see both the traditional solution and the MapReduce one

Plan for today
- Representing data in graphs  (NEXT)
- Graph algorithms in MapReduce
  - Computation model
  - Iterative MapReduce
  - A toolbox of algorithms: single-source shortest path (SSSP), k-means clustering, classification with Naïve Bayes

Thinking about related objects
[Figure: a labeled, directed graph connecting Alice, Sunita, Jose, and Mikhail via friend-of edges, and connecting them to Facebook and Magna Carta via fan-of edges. Images by Jojo Mendoza, Creative Commons licensed.]
- We can represent related objects as a labeled, directed graph
- Entities are typically represented as nodes; relationships are typically edges
- Nodes all have IDs, and possibly other properties
- Edges typically have values, and possibly IDs and other properties

Encoding the data in a graph
- Recall the basic definition of a graph: G = (V, E), where V is a set of vertices and E is a set of edges of the form (v1, v2) with v1, v2 ∈ V
- Assume we only care about connected vertices; then we can capture the graph simply as its set of edges...
- ...or as an adjacency list: vi goes to [vj, vj+1, ...]

Graph encodings: Set of edges

    (Alice, Facebook)      (Alice, Sunita)        (Jose, Magna Carta)
    (Jose, Sunita)         (Mikhail, Facebook)    (Mikhail, Magna Carta)
    (Sunita, Facebook)     (Sunita, Alice)        (Sunita, Jose)

Graph encodings: Adding edge types

    (Alice, fan-of, Facebook)        (Alice, friend-of, Sunita)
    (Jose, fan-of, Magna Carta)      (Jose, friend-of, Sunita)
    (Mikhail, fan-of, Facebook)      (Mikhail, fan-of, Magna Carta)
    (Sunita, fan-of, Facebook)       (Sunita, friend-of, Alice)
    (Sunita, friend-of, Jose)

Graph encodings: Adding weights

    (Alice, fan-of, 0.5, Facebook)       (Alice, friend-of, 0.9, Sunita)
    (Jose, fan-of, 0.5, Magna Carta)     (Jose, friend-of, 0.3, Sunita)
    (Mikhail, fan-of, 0.8, Facebook)     (Mikhail, fan-of, 0.7, Magna Carta)
    (Sunita, fan-of, 0.7, Facebook)      (Sunita, friend-of, 0.9, Alice)
    (Sunita, friend-of, 0.3, Jose)
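To make these encodings concrete, here is a small Python sketch (the edge data comes from the slides; the variable names and structure are illustrative) that stores the example graph as a set of typed, weighted edges and derives an adjacency list from it:

```python
from collections import defaultdict

# Single-edge representation: (source, relationship type, weight, destination)
edges = [
    ("Alice",   "fan-of",    0.5, "Facebook"),
    ("Alice",   "friend-of", 0.9, "Sunita"),
    ("Jose",    "fan-of",    0.5, "Magna Carta"),
    ("Jose",    "friend-of", 0.3, "Sunita"),
    ("Mikhail", "fan-of",    0.8, "Facebook"),
    ("Mikhail", "fan-of",    0.7, "Magna Carta"),
    ("Sunita",  "fan-of",    0.7, "Facebook"),
    ("Sunita",  "friend-of", 0.9, "Alice"),
    ("Sunita",  "friend-of", 0.3, "Jose"),
]

# Adjacency-list representation: vertex -> [(neighbor, type, weight), ...]
adjacency = defaultdict(list)
for src, rel, weight, dst in edges:
    adjacency[src].append((dst, rel, weight))

print(adjacency["Sunita"])
# [('Facebook', 'fan-of', 0.7), ('Alice', 'friend-of', 0.9), ('Jose', 'friend-of', 0.3)]
```

The edge list corresponds to the single-edge MapReduce input (one record per edge), while the adjacency list corresponds to the per-node input used later in the lecture.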
Recap: Related objects
- We can represent the relationships between related objects as a directed, labeled graph
  - Vertices represent the objects; edges represent relationships
- We can annotate this graph in various ways
  - Add labels to edges to distinguish different types; add weights to edges; ...
- We can encode the graph in various ways (examples: edge set, adjacency list)

Plan for today: next, graph algorithms in MapReduce (the computation model)

A computation model for graphs
[Figure: the weighted fan-of / friend-of graph from the previous slides]
- Once the data is encoded in this way, we can perform various computations on it
  - Simple example: Which users are their friends' best friend?
  - More complicated examples (later): PageRank, adsorption, ...
- This is often done by annotating the vertices with additional information, and propagating the information along the edges: "Think like a vertex"!

A computation model for graphs: Example
- Example: Am I my friends' best friend? (Slightly more technical: How many of my friends have me as their best friend?)
- Step #1: Discard irrelevant vertices and edges; for this query, only the friend-of edges remain (Alice and Sunita with weight 0.9, Sunita and Jose with weight 0.3; Mikhail has no friend edges)
- Step #2: Annotate each vertex with its list of friends:

    Alice:  alice→sunita: 0.9
    Sunita: sunita→alice: 0.9, sunita→jose: 0.3
    Jose:   jose→sunita: 0.3

- Step #3: Push the annotations along each edge, so that each vertex also learns its friends' friend lists:

    Alice:  alice→sunita: 0.9, plus sunita→alice: 0.9, sunita→jose: 0.3
    Sunita: sunita→alice: 0.9, sunita→jose: 0.3, plus alice→sunita: 0.9 and jose→sunita: 0.3
    Jose:   jose→sunita: 0.3, plus sunita→alice: 0.9, sunita→jose: 0.3

- Step #4: Determine the result at each vertex (for example, Alice can now see that her friend Sunita's highest-weight friend edge points back to Alice)
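Here is a minimal, single-machine Python sketch of steps #1-#4 (the friend edges and weights come from the slides; the propagation code itself is illustrative, not the MapReduce version, which we turn to next):

```python
# Illustrative sketch of steps 1-4: "Am I my friends' best friend?"
friend_edges = {  # Step 1: only the friend-of edges remain
    "Alice":  {"Sunita": 0.9},
    "Sunita": {"Alice": 0.9, "Jose": 0.3},
    "Jose":   {"Sunita": 0.3},
}

# Step 2: each vertex's own friend list is friend_edges[v].
# Step 3: push annotations along the edges: each friend u of v receives
# v's friend list.
received = {v: {} for v in friend_edges}
for v, friends in friend_edges.items():
    for u in friends:
        received[u][v] = friends  # u now knows v's friend list

# Step 4: determine the result at each vertex.
for v in friend_edges:
    count = sum(1 for u, flist in received[v].items()
                if max(flist, key=flist.get) == v)
    print(f"{v} is the best friend of {count} of their friends")
# Alice: 1 (Sunita's best friend is Alice), Sunita: 2, Jose: 0
```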
Can we do this in MapReduce?
- Using the adjacency-list representation:

    map(key: node, value: [<otherNode, relType, strength>]) {
      ...
    }
    reduce(key: ______, values: list of ______) {
      ...
    }

- Using the single-edge data representation:

    map(key: node, value: <otherNode, relType, strength>) {
      ...
    }
    reduce(key: ______, values: list of ______) {
      ...
    }

A real-world use case
- A variant that is actually used in social networks today: "Who are the friends of multiple of my friends?"
- Where have you seen this before? Friend recommendation: maybe these people should be my friends too!

Generalizing...
- Now suppose we want to go beyond direct friend relationships
  - Example: How many of my friends' friends (distance-2 neighbors) have me as their best friend's best friend?
  - What do we need to do? How about distance k > 2?
- To compute the answer, we need to run multiple iterations of MapReduce!

Iterative MapReduce
- The basic model (a runnable sketch follows below):

    copy files from input dir -> staging dir 1
    (optional: do some preprocessing)
    while (!terminating condition) {
      map from staging dir 1
      reduce into staging dir 2
      move files from staging dir 2 -> staging dir 1
    }
    (optional: postprocessing)
    move files from staging dir 2 -> output dir

- Note that the reduce output must be compatible with the map input!
  - What can happen if we filter out some information in the mapper or in the reducer?
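A runnable Python sketch of this driver loop, assuming a hypothetical run_mapreduce_job function that executes one map/reduce pass and reports whether anything changed (this is just the control flow, not a real Hadoop API):

```python
import os
import shutil

def run_mapreduce_job(in_dir: str, out_dir: str) -> bool:
    """Hypothetical placeholder: run one map/reduce pass over in_dir,
    write the reduce output (which must be readable as map input!) to
    out_dir, and return True iff anything changed."""
    raise NotImplementedError

def iterative_mapreduce(input_dir, staging1, staging2, output_dir):
    shutil.copytree(input_dir, staging1)   # copy files: input -> staging 1
    # (optional: do some preprocessing here)
    while True:
        changed = run_mapreduce_job(staging1, staging2)
        if not changed:                    # terminating condition reached
            break
        shutil.rmtree(staging1)            # move files: staging 2 -> staging 1
        os.rename(staging2, staging1)
    # (optional: postprocessing here)
    os.rename(staging2, output_dir)        # move files: staging 2 -> output
```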
Graph algorithms and MapReduce
- A centralized algorithm typically traverses a tree or a graph one item at a time (there's only one "cursor"); you've learned breadth-first and depth-first traversals
- Most graph algorithms in MapReduce instead use multiple map/reduce stages, processing one "wave" of vertices at a time
- Sometimes iterative MapReduce, other times chains of map/reduce

Recap: MapReduce on graphs
- Suppose we want to compute a function for each vertex in a graph, using data from vertices at most k hops away
- We can do this as follows: "think like a vertex", "push" information along the edges, and finally perform the computation at each vertex
- May need more than one MapReduce phase: iterative MapReduce, where the outputs of stage i become the inputs of stage i+1

Plan for today: next, a toolbox of algorithms, starting with single-source shortest path (SSSP)

Path-based algorithms
- Sometimes our goal is to compute information about the paths (sets of paths) between nodes
  - Edges may be annotated with cost, distance, or similarity
- Examples of such problems (see CIS 121 and CIS 320):
  - Shortest path from one node to another
  - Minimum spanning tree (minimal-cost tree connecting all vertices in a graph)
  - Steiner tree (minimal-cost tree connecting certain nodes)
  - Topological sort (each node in a DAG comes before all the nodes it points to)

Single-Source Shortest Path (SSSP)
- Given a directed graph G = (V, E) in which each edge e has a cost c(e): compute the cost of reaching each node from the source node s in the most efficient way (potentially after multiple 'hops')
[Figure: example graph with source s at distance 0, nodes a, b, c, d at unknown distances, and edge costs 10, 1, 2, 3, 9, 5, 7, 6, 4, 2]

SSSP: Intuition
- We can formulate the problem using induction: the shortest path follows the principle of optimality, i.e., the last step (u,v) makes use of the shortest path to u
- We can express this as follows:

    bestDistanceAndPath(v) {
      if (v == source) then {
        return <distance 0, path [v]>
      } else {
        find argmin_u (bestDistanceAndPath(u) + dist[u,v])
        return <bestDistanceAndPath(u) + dist[u,v], path(u) + v>
      }
    }

SSSP: CIS 320-style solution
- Traditional approach: Dijkstra's algorithm (note: the start node must be initialized to distance 0)

    // V: vertices, E: edges, S: start node
    foreach v in V {
      dist_S_To[v] := infinity      // initialize length and last step
      predecessor[v] := nil         // of path to default values
    }
    dist_S_To[S] := 0
    spSet := {}
    Q := V
    while (Q not empty) do {
      u := Q.removeNodeClosestTo(S)
      spSet := spSet + {u}
      foreach v in V where (u,v) in E {          // update length and path
        if (dist_S_To[v] > dist_S_To[u] + cost(u,v)) then {   // based on edges
          dist_S_To[v] := dist_S_To[u] + cost(u,v)            // radiating
          predecessor[v] := u                                 // from u
        }
      }
    }

SSSP: Dijkstra in action (example from CLRS, 2nd ed., p. 528)

    Round | u | Q           | spSet       | dist_S_To           | predecessor
    ------+---+-------------+-------------+---------------------+------------------------
    0     | - | {s,a,b,c,d} | {}          | a:∞  b:∞  c:∞  d:∞  | a:nil b:nil c:nil d:nil
    1     | s | {a,b,c,d}   | {s}         | a:10 b:∞  c:5  d:∞  | a:s   b:nil c:s   d:nil
    2     | c | {a,b,d}     | {c,s}       | a:8  b:14 c:5  d:7  | a:c   b:c   c:s   d:c
    3     | d | {a,b}       | {c,d,s}     | a:8  b:13 c:5  d:7  | a:c   b:d   c:s   d:c
    4     | a | {b}         | {a,c,d,s}   | a:8  b:9  c:5  d:7  | a:c   b:a   c:s   d:c
    5     | b | {}          | {a,b,c,d,s} | a:8  b:9  c:5  d:7  | a:c   b:a   c:s   d:c

SSSP: How to parallelize?
- Dijkstra traverses the graph along a single route at a time, prioritizing its traversal to the next step based on total path length (and avoiding cycles): no real parallelism to be had here!
- Intuitively, we want something that "radiates" from the origin, one "edge hop distance" at a time; each step outwards can be done in parallel, before another iteration occurs, or we are done
- Recall our earlier discussion: scalability depends on the algorithm, not (just) on the problem!

SSSP: Revisiting the inductive definition
- Dijkstra's algorithm carefully considered each u in a way that allowed us to prune certain points
- Instead, we can look at all potential u's for each v
- Compute iteratively, by keeping a "frontier set" of nodes u that are i edge-hops from the source

SSSP: MapReduce formulation
- init: for each node, emit nodeID -> <∞, -, {<succNodeID, edgeCost>}>  (the source gets distance 0)
- map: takes nodeID -> <dist, next, {<succNodeID, edgeCost>}>, meaning: the shortest path we have found so far from the source to nodeID has length dist, next is the next hop on that path, and here is the adjacency list for nodeID
  - For each succNodeID: emit succNodeID -> {<nodeID, dist + edgeCost>}  (this is a new path from the source to succNodeID that we just discovered; not necessarily the shortest)
  - Also emit nodeID -> <dist, {<succNodeID, edgeCost>}>  (why is this necessary? The adjacency list must survive into the next round)
- reduce: distance := minimum cost via any predecessor; next := that predecessor; emit nodeID -> <distance, next, {<succNodeID, edgeCost>}>
- Repeat until there are no more changes; postprocessing: remove the adjacency lists
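To make the formulation concrete, here is a self-contained Python sketch that simulates the map, shuffle, and reduce phases locally on the example graph (the record layout follows the slide; the function names, and the edge list reconstructed from the figure, are illustrative):

```python
import math
from itertools import groupby
from operator import itemgetter

# Node record: (node_id, (distance, next_hop, adjacency_list)).
# distance stays math.inf until some path from the source is found.
def sssp_map(node_id, record):
    dist, nxt, adj = record
    yield node_id, ("node", dist, nxt, adj)   # keep the adjacency list alive
    if dist < math.inf:
        for succ, cost in adj:
            # a newly discovered (not necessarily shortest) path to succ
            yield succ, ("path", dist + cost, node_id)

def sssp_reduce(node_id, values):
    adj, best, nxt = [], math.inf, None
    for v in values:
        if v[0] == "node":
            _, dist, hop, adj = v
            if dist < best: best, nxt = dist, hop
        else:
            _, dist, pred = v                 # min cost via any predecessor
            if dist < best: best, nxt = dist, pred
    return node_id, (best, nxt, adj)

def one_iteration(nodes):                     # nodes: node_id -> record
    pairs = [kv for n, r in nodes.items() for kv in sssp_map(n, r)]
    pairs.sort(key=itemgetter(0))             # "shuffle": group by key
    return dict(sssp_reduce(k, [v for _, v in g])
                for k, g in groupby(pairs, key=itemgetter(0)))

# Example graph (edge costs as reconstructed from the CLRS figure):
state = {
    "s": (0,        None, [("a", 10), ("c", 5)]),
    "a": (math.inf, None, [("b", 1), ("c", 2)]),
    "b": (math.inf, None, [("d", 4)]),
    "c": (math.inf, None, [("a", 3), ("b", 9), ("d", 2)]),
    "d": (math.inf, None, [("b", 6), ("s", 7)]),
}
while True:                                   # repeat until no changes
    new_state = one_iteration(state)
    if new_state == state: break
    state = new_state
print({n: d for n, (d, _, _) in state.items()})
# {'s': 0, 'a': 8, 'b': 9, 'c': 5, 'd': 7}
```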
SSSP iterations on the example graph

Iteration 0 (base case):
- mapper: emits (a,<s,10>), (c,<s,5>), plus the adjacency lists
- reducer: (a,<10, ...>), (c,<5, ...>); the "wave" has reached a and c

Iteration 1:
- mapper: (a,<s,10>), (c,<s,5>), (a,<c,8>), (c,<a,9>), (b,<a,11>), (b,<c,14>), (d,<c,7>), plus the adjacency lists
- reducer: (a,<8, ...>), (c,<5, ...>), (b,<11, ...>), (d,<7, ...>)

Iteration 2:
- mapper: as before, plus (b,<d,13>), (d,<b,15>)
- reducer: (a,<8>), (c,<5>), (b,<11>), (d,<7>)

Iteration 3:
- mapper and reducer outputs are the same as in iteration 2: no change, convergence!
- Question: If a vertex's path cost is the same in two consecutive rounds, can we be sure that this vertex has converged?

Summary: SSSP
- Path-based algorithms typically involve iterative map/reduce
- They are typically formulated in a way that traverses the graph in "waves" or "stages", like breadth-first search; this allows for parallelism
- They need a way to test for convergence
- Example: single-source shortest path (SSSP); the original Dijkstra formulation is hard to parallelize, but we can make it work with the "wave" approach

Plan for today: next, k-means clustering

Learning (clustering / classification)
- Sometimes our goal is to take a set of entities, possibly related, and group them
  - If the groups are based on similarity, we call this clustering
  - If the groups are based on putting the entities into semantically meaningful classes, we call this classification
- Both are instances of machine learning

The k-clustering problem
[Figure: items plotted in a two-dimensional feature space (expenses vs. age), grouped into clusters]
- Given: a set of items in an n-dimensional feature space (examples: data points from a survey, people in a social network)
- Goal: group the items into k "clusters"
- What would be a 'good' set of clusters?

Approach: k-Means
- Let $m_1, m_2, \dots, m_k$ be representative points for each of our k clusters; specifically, the centroid of each cluster
- Initialize $m_1, \dots, m_k$ to random values in the data
- For $t = 1, 2, \dots$:
  - Map each observation to the closest mean:
    $S_i^{(t)} = \{\, x_j : \|x_j - m_i^{(t)}\| \le \|x_j - m_{i^*}^{(t)}\| \text{ for all } i^* = 1, \dots, k \,\}$
  - Assign each $m_i$ to be the new centroid of its set:
    $m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j$

A simple example
- Six data points (expenses, age): (10,10), (15,12), (11,16), (18,20), (20,21), (30,21)
- First, randomly choose two initial centers among the data points
- After one iteration, the centroids move to (12.5, 11) and (19.75, 19.5)
- After another iteration, the centroids are (12, 12.67) and (22.67, 20.67); stable!

k-Means in MapReduce
- Map #1: input nodeID -> <position, centroidID, [centroid IDs and positions]>; compute the nearest centroid; emit centroidID -> <nodeID, position>
- Reduce #1: recompute the centroid position from the positions of the nodes in it; emit centroidID -> <node IDs, positions>, and, for all other centroid IDs, emit otherCentroidID -> centroid(centroidID, X, Y)  (each centroid will need to know where all the other centroids are)
- Map #2: pass the values through to Reducer #2
- Reduce #2: for each node in the current centroid, emit nodeID -> <position, centroidID, [centroid IDs and positions]>  (the input for the next map iteration); also emit <X, <centroidID, position>>  (this will be the 'result'; remember that we wanted the centroids!)
- Repeat until no change
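For comparison, here is a compact single-machine Python sketch of the same fixpoint computation that the two MapReduce passes implement (names are illustrative; the data is the example above):

```python
import random

def closest(point, centroids):
    """Index of the nearest centroid (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[i])))

def kmeans(points, k, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)      # random initial centers
    while True:
        # "Map": assign each observation to its closest mean
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[closest(p, centroids)].append(p)
        # "Reduce": recompute each centroid from its cluster
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)]
        if new_centroids == centroids:        # fixpoint: stable, done
            return centroids, clusters
        centroids = new_centroids

points = [(10, 10), (15, 12), (11, 16), (18, 20), (20, 21), (30, 21)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
# with a suitable initialization, this converges to the clusters from the
# slides: centroids (12, 12.67) and (22.67, 20.67), up to rounding
```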
Plan for today: next, classification with Naïve Bayes

Classification
- Suppose we want to learn what is spam (or interesting, or ...)
  - Predefine a set of classes with semantic meaning
  - Train an algorithm to look at the data and assign a class, based on giving it some examples of data in each class, and the sets of features they have
- Many probabilistic techniques exist
  - Each class has probabilistic relationships with others, e.g., p(spam | isSentLocally), p(isSentLocally | fromBob), ...
  - Typically represented as a graph(ical model)! See CIS 520.
- But we'll focus on a simple, "flat" model: Naïve Bayes

A simple example
- Suppose we just look at the keywords in the emails' titles:

    Message(1, "Won contract")
    Message(2, "Won award")
    Message(3, "Won the lottery")
    Message(4, "Unsubscribe")
    Message(5, "Millions of customers")
    Message(6, "Millions of dollars")

  (The slide labels three of the six messages as spam; the probabilities below imply that two of the spam messages contain "Won" and one contains "Millions".)
- What is the probability that the message "Won Millions" is spam? By Bayes' theorem:

    p(spam | containsWon, containsMillions)
      = p(spam) * p(containsWon, containsMillions | spam)
        / p(containsWon, containsMillions)

Classification using Naïve Bayes
- Basic assumption: the probabilities of the events are independent (this is why it is called 'naïve')
- Under this assumption:

    p(spam) * p(containsWon, containsMillions | spam) / p(containsWon, containsMillions)
      = p(spam) * p(containsWon | spam) * p(containsMillions | spam)
        / (p(containsWon) * p(containsMillions))
      = 0.5 * 0.67 * 0.33 / (0.5 * 0.33)
      = 0.67

- So how do we "train" a learner (compute the above probabilities) using MapReduce?

What do we need to train the learner?
- p(spam): count how many spam emails there are, and count the total number of emails  (easy)
- p(containsXYZ | spam): count how many spam emails contain XYZ  (map/reduce pass 1 below)
- p(containsXYZ): count how many emails contain XYZ overall (map/reduce pass 2 below), and count the total number of emails  (easy)

Training a Naïve Bayes learner
- Pass 1: count how many emails in each class contain the word (a modified WordCount)

    map 1:    takes messageId -> <class, {words}>; emits <word, class> -> 1
    reduce 1: emits <word, class> -> <count>

- Pass 2: count how many emails contain the word overall (plain WordCount)

    map 2:    takes messageId -> <class, {words}>; emits word -> 1
    reduce 2: emits word -> <totalCount>
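A local Python sketch of the two training passes and the resulting classifier (simulated map/reduce; the spam labels below are hypothetical, chosen only to be consistent with the probabilities used on the slides):

```python
from collections import Counter

# Example messages from the slides. The class labels are one assignment
# consistent with the slide's numbers (hypothetical; the actual labels
# were shown in the figure).
messages = {
    1: ("ham",  "Won contract"),
    2: ("spam", "Won award"),
    3: ("spam", "Won the lottery"),
    4: ("ham",  "Unsubscribe"),
    5: ("ham",  "Millions of customers"),
    6: ("spam", "Millions of dollars"),
}

# Pass 1: modified WordCount -> how many emails of each class contain a word
pass1 = Counter()
for cls, title in messages.values():
    for word in set(title.lower().split()):   # set(): per-email containment
        pass1[(word, cls)] += 1               # map 1 emits <word,class> -> 1

# Pass 2: plain WordCount -> how many emails contain the word overall
pass2 = Counter()
for _, title in messages.values():
    for word in set(title.lower().split()):
        pass2[word] += 1                      # map 2 emits word -> 1

n_total = len(messages)
n_spam = sum(1 for cls, _ in messages.values() if cls == "spam")

def p_spam_given(words):
    """Naive Bayes: p(spam) * prod p(w|spam) / prod p(w)."""
    p = n_spam / n_total
    for w in words:
        p *= (pass1[(w, "spam")] / n_spam) / (pass2[w] / n_total)
    return p

print(round(p_spam_given(["won", "millions"]), 2))   # -> 0.67, as above
```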
Summary: Learning and MapReduce
- Clustering algorithms typically have multiple aggregation stages or iterations
  - k-means clustering repeatedly computes centroids and maps items to them: a fixpoint computation
- Classification algorithms can be quite complex
  - In general, we need to capture conditional probabilities
  - Naïve Bayes assumes everything is independent
  - Training is a matter of computing probability distributions, which can be accomplished using two map/reduce passes

Stay tuned
- Next time you will learn about: PageRank and Adsorption