Graph algorithms in MapReduce

NETS 212: Scalable and Cloud Computing
October 15, 2013
© 2013 A. Haeberlen, Z. Ives, University of Pennsylvania

Announcements

- No class on October 22nd or 24th
  - Andreas at IMC in Barcelona
  - Please work on HW2 and HW3 (will be released this week)
- Special 'catch-up' class on October 30th
  - 4:30-6:00pm, Location TBA
- Any questions about HW2?
  - If you haven't started yet: Please start early!!!

What we have seen so far

- In the first half of the semester, we saw how the map/reduce model could be used to filter, collect, and aggregate data values
- This is useful for data with limited structure
  - We could extract pieces of input data items and collect them to run various reduce operations
  - We could "join" two different data sets on a common key
- But that's not enough…

Beyond average/sum/count

- Much of the world is a network of relationships and shared features
  - Members of a social network can be friends, and may have shared interests / memberships / etc.
  - Customers might view similar movies, and might even be clustered by interest groups
  - The Web consists of documents with links
  - Documents are also related by topics, words, authors, etc.

Goal: Develop a toolbox

- We need a toolbox of algorithms useful for analyzing data that has both relationships and properties
- For the next ~2 lectures we'll start to build this toolbox
  - Some of the problems are studied in courses you may not have taken yet: CIS 320 (algorithms), CIS 391/520 (AI), CIS 455 (Web Systems)
  - So we'll see both the traditional solution and the MapReduce one

Plan for today

- Representing data in graphs ← NEXT
- Graph algorithms in MapReduce
  - Computation model
  - Iterative MapReduce
  - A toolbox of algorithms
    - Single-source shortest path (SSSP)
    - k-means clustering
    - Classification with Naïve Bayes

Images by Jojo Mendoza, Creative Commons licensed

Thinking about related objects

[Figure: a small social graph — users Alice, Sunita, Jose, and Mikhail, with friend-of edges among the users and fan-of edges to Facebook and Magna Carta]

- We can represent related objects as a labeled, directed graph
- Entities are typically represented as nodes; relationships are typically edges
  - Nodes all have IDs, and possibly other properties
  - Edges typically have values, possibly IDs and other properties

Encoding the data in a graph

[Figure: the same social graph — Alice, Sunita, Jose, Mikhail, Facebook, Magna Carta]

- Recall the basic definition of a graph: G = (V, E), where V is a set of vertices and E is a set of edges of the form (v1, v2) with v1, v2 ∈ V
- Assume we only care about connected vertices
- Then we can capture a graph simply as its set of edges...
- ... or as an adjacency list: vi goes to [vj, vj+1, …]

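As a concrete illustration (our sketch, not from the slides; variable names are ours), here are both encodings of the example graph in Python:

    # Edge-set encoding: the graph is just a collection of (source, target) pairs
    edges = {
        ("Alice", "Facebook"), ("Alice", "Sunita"),
        ("Jose", "Magna Carta"), ("Jose", "Sunita"),
        ("Mikhail", "Facebook"), ("Mikhail", "Magna Carta"),
        ("Sunita", "Facebook"), ("Sunita", "Alice"), ("Sunita", "Jose"),
    }

    # Adjacency-list encoding: each vertex maps to the list of its successors
    from collections import defaultdict
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)

    # e.g. adjacency["Sunita"] == ["Facebook", "Alice", "Jose"] (in some order)
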
Graph encodings: Set of edges

[Figure: the same social graph, shown alongside its edge set]

(Alice, Facebook)
(Alice, Sunita)
(Jose, Magna Carta)
(Jose, Sunita)
(Mikhail, Facebook)
(Mikhail, Magna Carta)
(Sunita, Facebook)
(Sunita, Alice)
(Sunita, Jose)

Graph encodings: Adding edge types

[Figure: the same graph, with fan-of / friend-of labels on the edges]

(Alice, fan-of, Facebook)
(Alice, friend-of, Sunita)
(Jose, fan-of, Magna Carta)
(Jose, friend-of, Sunita)
(Mikhail, fan-of, Facebook)
(Mikhail, fan-of, Magna Carta)
(Sunita, fan-of, Facebook)
(Sunita, friend-of, Alice)
(Sunita, friend-of, Jose)

Graph encodings: Adding weights

[Figure: the same graph, with edge labels and weights]

(Alice, fan-of, 0.5, Facebook)
(Alice, friend-of, 0.9, Sunita)
(Jose, fan-of, 0.5, Magna Carta)
(Jose, friend-of, 0.3, Sunita)
(Mikhail, fan-of, 0.8, Facebook)
(Mikhail, fan-of, 0.7, Magna Carta)
(Sunita, fan-of, 0.7, Facebook)
(Sunita, friend-of, 0.9, Alice)
(Sunita, friend-of, 0.3, Jose)

Recap: Related objects

- We can represent the relationships between related objects as a directed, labeled graph
  - Vertices represent the objects
  - Edges represent relationships
- We can annotate this graph in various ways
  - Add labels to edges to distinguish different types
  - Add weights to edges
  - ...
- We can encode the graph in various ways
  - Examples: Edge set, adjacency list

Plan for today

- Representing data in graphs
- Graph algorithms in MapReduce
  - Computation model ← NEXT
  - Iterative MapReduce
  - A toolbox of algorithms
    - Single-source shortest path (SSSP)
    - k-means clustering
    - Classification with Naïve Bayes

A computation model for graphs

[Figure: the weighted social graph from before]

- Once the data is encoded in this way, we can perform various computations on it
  - Simple example: Which users are their friends' best friend?
  - More complicated examples (later): PageRank, adsorption, ...
- This is often done by
  - annotating the vertices with additional information, and
  - propagating the information along the edges
- "Think like a vertex"!

A computation model for graphs

[Figure: the weighted social graph from before]

- Example: Am I my friends' best friend?
  - Slightly more technical: How many of my friends have me as their best friend?
  - Step #1: Discard irrelevant vertices and edges

A computation model for graphs

[Figure: the reduced friend graph — Alice, Sunita, Jose, Mikhail; each vertex annotated with its own friend edges: Alice holds alice→sunita: 0.9; Sunita holds sunita→alice: 0.9 and sunita→jose: 0.3; Jose holds jose→sunita: 0.3]

- Example: Am I my friends' best friend?
  - Step #1: Discard irrelevant vertices and edges
  - Step #2: Annotate each vertex with list of friends
  - Step #3: Push annotations along each edge

A computation model for graphs

[Figure: the same graph after propagation — each vertex now also holds its neighbors' annotations; e.g. Alice holds sunita→alice: 0.9 and sunita→jose: 0.3 in addition to her own alice→sunita: 0.9, and Sunita holds alice→sunita: 0.9 and jose→sunita: 0.3]

- Example: Am I my friends' best friend?
  - Step #1: Discard irrelevant vertices and edges
  - Step #2: Annotate each vertex with list of friends
  - Step #3: Push annotations along each edge

A computation model for graphs

[Figure: the same graph; each vertex now inspects the annotations it has received]

- Example: Am I my friends' best friend?
  - Step #1: Discard irrelevant vertices and edges
  - Step #2: Annotate each vertex with list of friends
  - Step #3: Push annotations along each edge
  - Step #4: Determine result at each vertex

Can we do this in MapReduce?

- Using the adjacency list representation?

  map(key: node, value: [<otherNode, relType, strength>])
  {
  }

  reduce(key: ________, values: list of _________)
  {
  }

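One possible completion (a sketch of ours, not the slides' official solution; function and field names are assumptions): each mapper finds its node's best friend from the adjacency list and tells every friend the verdict; the reducer then counts, per vertex, how many friends ranked it first.

    def map_adjacency(node, edges):
        """edges: list of (other_node, rel_type, strength) for this node."""
        # Find this node's best friend (highest-strength friend-of edge)...
        friends = [(other, s) for (other, rel, s) in edges if rel == "friend-of"]
        if not friends:
            return
        best = max(friends, key=lambda f: f[1])[0]
        # ...and tell every friend whether they are it
        for other, _ in friends:
            yield other, (node, other == best)

    def reduce_best_friend(node, values):
        """values: list of (friend, am_i_their_best) pairs sent to this node."""
        count = sum(1 for (_, is_best) in values if is_best)
        yield node, count   # how many friends have `node` as their best friend
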
Can we do this in MapReduce?

- Using the single-edge data representation?

  map(key: node, value: <otherNode, relType, strength>)
  {
  }

  reduce(key: ________, values: list of _________)
  {
  }

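With single-edge records, the mapper only sees one edge at a time, so regrouping a node's friend edges already costs one full round; a second round (as in the adjacency-list version above) is then needed to deliver the verdicts. A sketch of the first round, under the same assumed record format and with names of our choosing:

    def map_single_edge(node, edge):
        other, rel_type, strength = edge
        if rel_type == "friend-of":
            yield node, (other, strength)   # regroup all of node's friend edges

    def reduce_find_best(node, friend_edges):
        friend_edges = list(friend_edges)   # all (other, strength) pairs for this node
        best = max(friend_edges, key=lambda f: f[1])[0]
        for other, _ in friend_edges:
            # input for round 2: tell each friend whether `node` picked them
            yield other, (node, other == best)
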
A real-world use case

- A variant that is actually used in social networks today: "Who are the friends of multiple of my friends?"
  - Where have you seen this before?
  - Friend recommendation!
    - Maybe these people should be my friends too!

Generalizing…

- Now suppose we want to go beyond direct friend relationships
  - Example: How many of my friends' friends (distance-2 neighbors) have me as their best friend's best friend?
  - What do we need to do?
  - How about distance k>2?
- To compute the answer, we need to run multiple iterations of MapReduce!

Iterative MapReduce

- The basic model:

  copy files from input dir → staging dir 1
  (optional: do some preprocessing)
  while (!terminating condition) {
      map from staging dir 1
      reduce into staging dir 2
      move files from staging dir 2 → staging dir 1
  }
  (optional: postprocessing)
  move files from staging dir 1 → output dir

- Note that the reduce output must be compatible with the map input!
  - What can happen if we filter out some information in the mapper or in the reducer?

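A minimal driver-loop sketch of this model in Python. The helpers run_mapreduce (one map + reduce round) and changed (the termination test) are hypothetical; the slides do not prescribe an implementation:

    import shutil

    def iterative_mapreduce(input_dir, output_dir, max_iters=50):
        shutil.copytree(input_dir, "staging1")           # input dir → staging dir 1
        for _ in range(max_iters):
            run_mapreduce(src="staging1", dst="staging2")  # assumed helper: one MR round
            done = not changed("staging1", "staging2")     # assumed helper: convergence test
            shutil.rmtree("staging1")
            shutil.move("staging2", "staging1")            # staging dir 2 → staging dir 1
            if done:
                break
        shutil.move("staging1", output_dir)              # final result → output dir
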
Graph algorithms and MapReduce

- A centralized algorithm typically traverses a tree or a graph one item at a time (there's only one "cursor")
  - You've learned breadth-first and depth-first traversals
- Most algorithms that are based on graphs make use of multiple map/reduce stages, processing one "wave" at a time
  - Sometimes iterative MapReduce, other times chains of map/reduce

Recap: MapReduce on graphs

- Suppose we want to:
  - compute a function for each vertex in a graph...
  - ... using data from vertices at most k hops away
- We can do this as follows:
  - "Think like a vertex"
  - "Push" information along the edges
  - Finally, perform the computation at each vertex
- May need more than one MapReduce phase
  - Iterative MapReduce: Outputs of stage i → inputs of stage i+1

Plan for today

- Representing data in graphs
- Graph algorithms in MapReduce
  - Computation model
  - Iterative MapReduce
  - A toolbox of algorithms
    - Single-source shortest path (SSSP) ← NEXT
    - k-means clustering
    - Classification with Naïve Bayes

Path-based algorithms

- Sometimes our goal is to compute information about the paths (sets of paths) between nodes
  - Edges may be annotated with cost, distance, or similarity
- Examples of such problems (see CIS 121+320):
  - Shortest path from one node to another
  - Minimum spanning tree (minimal-cost tree connecting all vertices in a graph)
  - Steiner tree (minimal-cost tree connecting certain nodes)
  - Topological sort (a node in a DAG comes before all nodes it points to)

Single-Source Shortest Path (SSSP)

- Given a directed graph G = (V, E) in which each edge e has a cost c(e):
  - Compute the cost of reaching each node from the source node s in the most efficient way (potentially after multiple 'hops')

[Figure: example graph — source s (distance 0) and nodes a, b, c, d with unknown (?) distances; edge costs 10 and 5 out of s, and 1, 2, 2, 3, 4, 6, 7, 9 among the remaining edges]

SSSP: Intuition

- We can formulate the problem using induction
  - The shortest path follows the principle of optimality: the last step (u,v) makes use of the shortest path to u
- We can express this as follows:

  bestDistanceAndPath(v) {
      if (v == source) then {
          return <distance 0, path [v]>
      } else {
          find argmin_u (bestDistanceAndPath[u] + dist[u,v])
          return <bestDistanceAndPath[u] + dist[u,v], path[u] + v>
      }
  }

SSSP: CIS 320-style solution

- Traditional approach: Dijkstra's algorithm

  // V: vertices, E: edges, S: start node
  foreach v in V                              // initialize length and last step
      dist_S_To[v] := infinity                // of path to default values
      predecessor[v] := nil
  spSet := {}
  Q := V
  while (Q not empty) do
      u := Q.removeNodeClosestTo(S)
      spSet := spSet + {u}
      foreach v in V where (u,v) in E         // update length and path based on
          if (dist_S_To[v] > dist_S_To[u] + cost(u,v)) then   // edges radiating from u
              dist_S_To[v] := dist_S_To[u] + cost(u,v)
              predecessor[v] := u

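For reference, a runnable Python rendering of the same algorithm (our sketch; it uses a priority queue in place of removeNodeClosestTo):

    import heapq

    def dijkstra(graph, source):
        """graph: dict mapping node -> list of (neighbor, cost) pairs."""
        dist = {source: 0}
        predecessor = {}
        pq = [(0, source)]                       # priority queue ordered by distance from source
        while pq:
            d, u = heapq.heappop(pq)             # u := node in Q closest to the source
            if d > dist.get(u, float("inf")):
                continue                         # stale queue entry; u was already finalized
            for v, cost in graph.get(u, []):
                if d + cost < dist.get(v, float("inf")):
                    dist[v] = d + cost           # found a shorter path to v through u
                    predecessor[v] = u
                    heapq.heappush(pq, (dist[v], v))
        return dist, predecessor
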
SSSP: Dijkstra in Action (example from CLR 2nd ed., p. 528)

[The following snapshots show the state after each iteration of the while loop; the node distances shown in the accompanying figures match dist_S_To.]

Q = {s,a,b,c,d}; spSet = {}
dist_S_To: {(a,∞), (b,∞), (c,∞), (d,∞)}
predecessor: {(a,nil), (b,nil), (c,nil), (d,nil)}

Q = {a,b,c,d}; spSet = {s}
dist_S_To: {(a,10), (b,∞), (c,5), (d,∞)}
predecessor: {(a,s), (b,nil), (c,s), (d,nil)}

Q = {a,b,d}; spSet = {c,s}
dist_S_To: {(a,8), (b,14), (c,5), (d,7)}
predecessor: {(a,c), (b,c), (c,s), (d,c)}

Q = {a,b}; spSet = {c,d,s}
dist_S_To: {(a,8), (b,13), (c,5), (d,7)}
predecessor: {(a,c), (b,d), (c,s), (d,c)}

Q = {b}; spSet = {a,c,d,s}
dist_S_To: {(a,8), (b,9), (c,5), (d,7)}
predecessor: {(a,c), (b,a), (c,s), (d,c)}

Q = {}; spSet = {a,b,c,d,s}
dist_S_To: {(a,8), (b,9), (c,5), (d,7)}
predecessor: {(a,c), (b,a), (c,s), (d,c)}

SSSP: How to parallelize?

- Dijkstra traverses the graph along a single route at a time, prioritizing its traversal to the next step based on total path length (and avoiding cycles)
  - No real parallelism to be had here!
- Intuitively, we want something that "radiates" from the origin, one "edge hop distance" at a time
  - Each step outwards can be done in parallel, before another iteration occurs - or we are done
  - Recall our earlier discussion: Scalability depends on the algorithm, not (just) on the problem!

SSSP: Revisiting the inductive definition

  bestDistanceAndPath(v) {
      if (v == source) then {
          return <distance 0, path [v]>
      } else {
          find argmin_u (bestDistanceAndPath[u] + dist[u,v])
          return <bestDistanceAndPath[u] + dist[u,v], path[u] + v>
      }
  }

- Dijkstra's algorithm carefully considered each u in a way that allowed us to prune certain points
- Instead we can look at all potential u's for each v
  - Compute iteratively, by keeping a "frontier set" of u nodes i edge-hops from the source (see the sketch below)

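A sequential sketch of the frontier idea (ours, for illustration): each pass relaxes every edge leaving the current frontier, which is exactly the work one MapReduce round will do.

    def frontier_sssp(graph, source):
        """graph: dict node -> list of (neighbor, cost); returns best-known distances."""
        dist = {source: 0}
        frontier = {source}                          # nodes whose distance changed last round
        while frontier:
            next_frontier = set()
            for u in frontier:
                for v, cost in graph.get(u, []):
                    if dist[u] + cost < dist.get(v, float("inf")):
                        dist[v] = dist[u] + cost     # shorter path to v discovered
                        next_frontier.add(v)         # v must be re-examined next round
            frontier = next_frontier                 # one "wave" per iteration
        return dist
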
SSSP: MapReduce formulation

- init:
  - For each node, node ID → <∞, -, {<succ-node-ID, edge-cost>}>
- map:
  - take node ID → <dist, next, {<succ-node-ID, edge-cost>}>
    (the shortest path we have found so far from the source to node ID has length dist; next is the next hop on that path; and {<succ-node-ID, edge-cost>} is the adjacency list for node ID)
  - For each succ-node-ID: emit succ-node-ID → {<node ID, dist+edge-cost>}
    (this is a new path from the source to succ-node-ID that we just discovered - not necessarily shortest)
  - emit node ID → dist, {<succ-node-ID, edge-cost>}
    (Why is this necessary?)
- reduce:
  - distance := min cost from a predecessor; next := that predecessor
  - emit node ID → <distance, next, {<succ-node-ID, edge-cost>}>
- Repeat until no changes
- Postprocessing: Remove adjacency lists

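A sketch of one round in Python (our rendering of the formulation above; the record layout follows the slide, except that we also re-emit the current next hop, which the slide elides):

    def sssp_map(node_id, record):
        dist, next_hop, adj = record                 # adj: list of (succ_id, edge_cost)
        for succ, cost in adj:
            # newly discovered path to succ (not necessarily shortest)
            yield succ, ("path", node_id, dist + cost)
        # re-emit own state so the adjacency list survives the round
        yield node_id, ("state", dist, next_hop, adj)

    def sssp_reduce(node_id, values):
        best_dist, best_next, adj = float("inf"), None, []
        for v in values:
            if v[0] == "state":
                _, d, nxt, adj = v                   # previous estimate + adjacency list
            else:
                _, pred, d = v
                nxt = pred                           # min-cost predecessor becomes 'next'
            if d < best_dist:
                best_dist, best_next = d, nxt
        yield node_id, (best_dist, best_next, adj)
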
Iteration 0: Base case

mapper: (a,<s,10>) (c,<s,5>), plus the edges (adjacency lists)
reducer: (a,<10, ...>) (c,<5, ...>)

[Figure: s=0, all other nodes still ∞; the "wave" is at s]

Iteration 1

mapper: (a,<s,10>) (c,<s,5>) (a,<c,8>) (c,<a,9>) (b,<a,11>) (b,<c,14>) (d,<c,7>), plus the edges
reducer: (a,<8, ...>) (c,<5, ...>) (b,<11, ...>) (d,<7, ...>)

[Figure: a=10, c=5 from the previous round, b and d still ∞; the "wave" now covers a and c]

Iteration 2

mapper: (a,<s,10>) (c,<s,5>) (a,<c,8>) (c,<a,9>) (b,<a,11>) (b,<c,14>) (d,<c,7>) (b,<d,13>) (d,<b,15>), plus the edges
reducer: (a,<8>) (c,<5>) (b,<11>) (d,<7>)

[Figure: a=8, b=11, c=5, d=7; the "wave" now covers b and d]

Iteration 3

mapper: (a,<s,10>) (c,<s,5>) (a,<c,8>) (c,<a,9>) (b,<a,11>) (b,<c,14>) (d,<c,7>) (b,<d,13>) (d,<b,15>), plus the edges
reducer: (a,<8>) (c,<5>) (b,<11>) (d,<7>)

No change! Convergence!

[Figure: estimates a=8, b=11, c=5, d=7]

Question: If a vertex's path cost is the same in two consecutive rounds, can we be sure that this vertex has converged?

Summary: SSSP

- Path-based algorithms typically involve iterative map/reduce
- They are typically formulated in a way that traverses in "waves" or "stages", like breadth-first search
  - This allows for parallelism
  - They need a way to test for convergence
- Example: Single-source shortest path (SSSP)
  - Original Dijkstra formulation is hard to parallelize
  - But we can make it work with the "wave" approach

Plan for today

- Representing data in graphs
- Graph algorithms in MapReduce
  - Computation model
  - Iterative MapReduce
  - A toolbox of algorithms
    - Single-source shortest path (SSSP)
    - k-means clustering ← NEXT
    - Classification with Naïve Bayes

Learning (clustering / classification)

- Sometimes our goal is to take a set of entities, possibly related, and group them
  - If the groups are based on similarity, we call this clustering
  - If the groups are based on putting them into a semantically meaningful class, we call this classification
- Both are instances of machine learning

The k-clustering Problem

[Figure: items in a 2-D feature space (Age vs. Expenses), grouped into clusters]

- Given: A set of items in an n-dimensional feature space
  - Example: data points from a survey, people in a social network
- Goal: Group the items into k "clusters"
  - What would be a 'good' set of clusters?

Approach: k-Means

- Let m_1, m_2, …, m_k be representative points for each of our k clusters
  - Specifically: the centroid of the cluster
- Initialize m_1, m_2, …, m_k to random values in the data
- For t = 1, 2, …:
  - Map each observation to the closest mean:

    $S_i^{(t)} = \{\, x_j : \|x_j - m_i^{(t)}\| \le \|x_j - m_{i^*}^{(t)}\| \;\; \forall\, i^* = 1, \dots, k \,\}$

  - Assign the m_i to be the new centroid of each set:

    $m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j$

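The same update rules as a compact NumPy sketch (ours; a plain sequential version before we turn to MapReduce; it assumes no cluster goes empty):

    import numpy as np

    def k_means(points, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        # initialize the centroids to k random points from the data
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(iters):
            # assignment step: map each observation to the closest mean
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # update step: each centroid becomes the mean of its assigned points
            new_centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])
            if np.allclose(new_centroids, centroids):
                break                        # stable: no centroid moved
            centroids = new_centroids
        return centroids, labels

    # e.g., on the example that follows:
    # k_means(np.array([[10,10],[11,16],[15,12],[18,20],[20,21],[30,21]]), k=2)
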
A simple example (1/4)

[Figure: six data points in the Age/Expenses plane: (10,10), (11,16), (15,12), (18,20), (20,21), (30,21)]

A simple example (2/4)

[Figure: the same points, with two randomly chosen initial centers]

A simple example (3/4)

[Figure: after the first update, the centroids move to (12.5,11) and (19.75,19.5)]

A simple example (4/4)

[Figure: after the second update, the centroids move to (12,12.67) and (22.67,20.67) - stable!]

k-Means in MapReduce

- Map #1:
  - Input: node ID → <position, centroid ID, [centroid IDs and positions]>
  - Compute nearest centroid; emit centroid ID → <node ID, position>
- Reduce #1:
  - Recompute centroid position from positions of nodes in it
  - Emit centroidID → <node IDs, positions>, and for all other centroid IDs, emit otherCentroidID → centroid(centroidID, X, Y)
    - Each centroid will need to know where all the other centroids are
- Map #2:
  - Pass through values to Reducer #2
- Reduce #2:
  - For each node in the current centroid, emit node ID → <position, centroid ID, [centroid IDs and positions]>
    - Input for the next map iteration
  - Also, emit <X, <centroid ID, position>>
    - This will be the 'result' (remember that we wanted the centroids!)
- Repeat until no change

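A sketch of the first pass (assignment plus centroid recomputation) in Python; the record shapes follow the slide, but the function names are ours:

    def kmeans_map1(node_id, record):
        position, _, centroids = record       # centroids: list of (centroid_id, centroid_pos)
        def dist2(p, q):
            return sum((a - b) ** 2 for a, b in zip(p, q))
        # compute the nearest centroid for this node and emit the assignment
        nearest = min(centroids, key=lambda c: dist2(position, c[1]))[0]
        yield nearest, (node_id, position)

    def kmeans_reduce1(centroid_id, members):
        members = list(members)               # all (node_id, position) assigned here
        dims = len(members[0][1])
        # recompute the centroid as the mean of its members' positions
        new_pos = tuple(sum(p[d] for _, p in members) / len(members) for d in range(dims))
        yield centroid_id, (new_pos, members)
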
Plan for today

- Representing data in graphs
- Graph algorithms in MapReduce
  - Computation model
  - Iterative MapReduce
  - A toolbox of algorithms
    - Single-source shortest path (SSSP)
    - k-means clustering
    - Classification with Naïve Bayes ← NEXT

Classification

- Suppose we want to learn what is spam (or interesting, or …)
  - Predefine a set of classes with semantic meaning
  - Train an algorithm to look at data and assign a class
    - Based on giving it some examples of data in each class
    - … and the sets of features they have
- Many probabilistic techniques exist
  - Each class has probabilistic relationships with others
    - e.g., p(spam | isSentLocally), p(isSentLocally | fromBob), …
    - Typically represented as a graph(ical model)! See CIS 520
  - But we'll focus on a simple, "flat" model: Naïve Bayes

A simple example

- Suppose we just look at the keywords in the email's title:

  Message(1, "Won contract")
  Message(2, "Won award")
  Message(3, "Won the lottery")
  Message(4, "Unsubscribe")
  Message(5, "Millions of customers")
  Message(6, "Millions of dollars")

- What is the probability that the message "Won Millions" is spam?
- By Bayes' Theorem:

  p(spam | containsWon, containsMillions)
      = p(spam) · p(containsWon, containsMillions | spam) / p(containsWon, containsMillions)

Classification using Naïve Bayes

- Basic assumption: Probabilities of events are independent
  - This is why it is called 'naïve'
- Under this assumption,

  p(spam) · p(containsWon, containsMillions | spam) / p(containsWon, containsMillions)
      = p(spam) · p(containsWon | spam) · p(containsMillions | spam) / (p(containsWon) · p(containsMillions))
      = 0.5 · 0.67 · 0.33 / (0.5 · 0.33) = 0.67

- So how do we "train" a learner (compute the above probabilities) using MapReduce?

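To make the arithmetic concrete, a small Python check (ours; the slides do not say which of the six messages are labeled spam, so the labeling below - messages 2, 3, and 6 - is an assumption chosen to be consistent with the probabilities used above):

    # Hypothetical labeling: messages 2, 3, 6 are spam (an assumption; see above)
    messages = {
        1: ("Won contract", False),
        2: ("Won award", True),
        3: ("Won the lottery", True),
        4: ("Unsubscribe", False),
        5: ("Millions of customers", False),
        6: ("Millions of dollars", True),
    }

    def p(pred):
        return sum(1 for m in messages.values() if pred(m)) / len(messages)

    p_spam = p(lambda m: m[1])
    p_won_spam = p(lambda m: "Won" in m[0] and m[1]) / p_spam            # p(containsWon | spam)
    p_millions_spam = p(lambda m: "Millions" in m[0] and m[1]) / p_spam  # p(containsMillions | spam)
    p_won = p(lambda m: "Won" in m[0])
    p_millions = p(lambda m: "Millions" in m[0])

    posterior = p_spam * p_won_spam * p_millions_spam / (p_won * p_millions)
    print(round(posterior, 2))   # 0.67, matching the slide
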
What do we need to train the learner?

- p(spam)
  - Count how many spam emails there are → Easy
  - Count total number of emails → Easy
- p(containsXYZ | spam)
  - Count how many spam emails contain XYZ → MapReduce pass (1)
  - Count how many emails contain XYZ overall → MapReduce pass (2)
- p(containsXYZ)
  - Count how many emails contain XYZ overall → MapReduce pass (2)
  - Count total number of emails → Easy

Training a Naïve Bayes Learner

- map 1:
  - takes messageId → <class, {words}>
  - emits <word, class> → 1
- reduce 1:
  - emits <word, class> → <count>
  - (counts how many emails in the class contain the word - a modified WordCount)
- map 2:
  - takes messageId → <class, {words}>
  - emits word → 1
- reduce 2:
  - emits word → <totalCount>
  - (counts how many emails contain the word overall - plain WordCount)

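A Python sketch of the two passes (ours; it assumes each email's word set is deduplicated, so that an email counts at most once per word):

    def nb_map1(message_id, record):
        cls, words = record
        for word in set(words):           # dedupe: count each email at most once per word
            yield (word, cls), 1

    def nb_reduce1(word_class, counts):
        yield word_class, sum(counts)     # emails in this class containing the word

    def nb_map2(message_id, record):
        _, words = record
        for word in set(words):
            yield word, 1

    def nb_reduce2(word, counts):
        yield word, sum(counts)           # emails containing the word overall
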
Summary: Learning and MapReduce

- Clustering algorithms typically have multiple aggregation stages or iterations
  - k-means clustering repeatedly computes centroids, maps items to them
  - Fixpoint computation
- Classification algorithms can be quite complex
  - In general: need to capture conditional probabilities
  - Naïve Bayes assumes everything is independent
- Training is a matter of computing probability distributions
  - Can be accomplished using two Map/Reduce passes

Stay tuned

- Next time you will learn about: PageRank and Adsorption