### Document

```Jeffrey D. Ullman
Stanford University



Communication cost for a MapReduce job = the
total number of key-value pairs generated by all
the mappers.
For many applications, communication cost is
the dominant cost, taking more time than the
computation performed by the mappers and
reducers.
Focus: minimizing communication cost for a join
of three or more relations.
2


Given: a collection of relations, each with
attributes labeling their columns.
Find: Those tuples over all the attributes such
that when restricted to the attributes of any
relation R, that tuple is in R.
3
A
0
1
B
1
2
The join:
A
1
2
B C
1 2
2 3
A
1
B
2
C
3
1
C
3
4

Suppose we want to compute R(A,B) JOIN
S(B,C), using k reducers.
 Remember: a “reducer” corresponds to a key of the
intermediate file, so we are really asking how many
keys we want to use.


R and S are each stored in a chunked file of a
distributed file system.
Subtle point: when joining two relations, each
tuple is sent to one reducer, so k can be as large
as we like; in a multiway join, large k implies
more communication.
5

Use a hash function h from B-values to k
buckets.
 Earlier, we used h = identity function.

Map tasks take chunks from R and S, and send:
 Tuple R(a,b) to reducer h(b).
 Tuple S(b,c) to reducer h(b).


If R(a,b) joins with S(b,c), then both tuples are
sent to reducer h(b).
Thus, their join (a,b,c) will be produced there
and shipped to the output file.
6




Consider a chain of three relations:
R(A, B) JOIN S(B, C) JOIN T(C,D)
Example: R, S, and T are “friends” relations.
We could join any two by the 2-way
MapReduce algorithm, then join the third with
the resulting relation.
But intermediate joins are large.
7



An alternative is to divide the work among k =
m2 reducers.
Hash both B and C to m values.
A reducer (key) corresponds to a hashed Bvalue and a hashed C-value.
8



Each S-tuple S(b,c) is sent to one reducer:
(h(b), h(c)).
But each tuple R(a,b) must be sent to m
reducers, those whose keys are of the form
(h(b), x).
And each tuple T(c,d) must be sent to m
reducers of the form (y, h(c)).
9
h(c) =
0
1
2
3
S(b, c)
h(b)=0
h(b)=1
R(a, b)
h(b)=2
h(b)=3
T(c, d)
10


Thus, any joining tuples R(a,b), S(b,c), and T(c,d)
will be joined at the reducer (h(b), h(c)).
Communication cost: s + mr + mt.
 Convention: Lower-case letter is the size of the
relation whose name is the corresponding uppercase letter.
 Example: r is the size of R.
11

Suppose for simplicity that:
 Relations R, S, and T have the same size r.
 The probability of two tuples joining is p.


The 3-way join has communication cost r(2m+1).
Two two-way joins have a communication cost:
 3r to read the relations, plus
 pr2 to read the join of the first two.
 Total = r(3+pr).
12


3-way beats 2-way if 2m + 1 < 3 + pr.
pr is the multiplicity of each join.
 Thus, the 3-way chain-join is useful when the
multiplicity is high.


Example: relations are “friends”; pr is about
300. m2 = k can be 20,000.
15. m2 = k can be 64.
13
Share Variables and Their Optimization
Special Case: Star Joins
Special Case: Chain Joins
Application: Skew Joins
14



When we discussed the 3-way chain-join
R(A, B) JOIN S(B, C) JOIN T(C,D), we used
attributes B and C for the map-key (attributes
that determined the reducers).
Why not include A and/or D?
Why use the same number of buckets for B
and C?
15

For the general problem, we use a share
variable for each attribute.
 The number of buckets into which values of that
attribute are hashed.

Convention: The share variable for an
attribute is the corresponding lower-case
letter.
 Example: the share variable for attribute A is
always a.
16


The product of all the share variables is k, the
number of reducers.
The communication cost of a multiway join is
the sum of the size of each relation times the
product of the share variables for the
attributes that do not appear in the schema
of that relation.
17




Consider the cyclic join
R(A, B) JOIN S(B, C) JOIN T(A, C)
Cost function is rc + sa + tb.
Construct the Lagrangean remembering abc = k:
rc + sa + tb – (abc – k)
Take the derivative with respect to each share
variable, then multiply by that variable.
 Result is 0 at minimum.
18






d/da of rc + sa + tb – (abc – k) is s – bc.
Multiply by a and set to 0: sa – abc = 0.
Note: abc = k : sa = k.
Similarly, d/db and d/dc give: sa = tb = rc = k.
Solution: a = (krt/s2)1/3; b = (krs/t2)1/3;
c = (kst/r2)1/3.
Cost = rc + sa + tb = 3(krst)1/3.
19



Certain attributes can’t be in the map-key.
A dominates B if every relation of the join
with B also has A.
Example:
R(A,B,C) JOIN S(A,B,D) JOIN T(A,E) JOIN U(C,E)
Every relation with B
Also has A
20
R(A,B,C) JOIN S(A,B,D) JOIN T(A,E) JOIN U(C,E)


Cost expression:
rde + sce + tbcd + uabd
Since b appears wherever a does, if there
were a minimum-cost solution with b > 1,
we could replace b by 1 and a by ab, and
the cost would lower.
21

This rule explains why, in the discussion of the
chain join
R(A, B) JOIN S(B, C) JOIN T(C,D)
we did not give dominated attributes A and D a
share.
22


Unfortunately, there are more complex cases
than dominated attributes, where the
equations derived from the Lagrangean imply a
positive sum of several terms = 0.
We can fix, generalizing dominated attributes,
but we have to branch on which attribute
needs to be eliminated from the map-key.
23

Solutions not in integers:
 Drop an attribute with a share < 1 from the map-key
and re-solve.
 Round other nonintegers, and treat k as a
suggestion, since the product of the integers may
not be k.
24

A star join combines a large fact table
F(A1,A2,…,An) with tiny dimension tables
D1(A1,B1), D2(A2,B2),…, Dn(An ,Bn ).
 There may be other attributes not shown, each
belonging to only one relation.

Example: Facts = sales; dimensions tell about
25
B1
B4
B2
A1
A2
A4
A3
B3
26

Map-key = the A’s.
 B’s are dominated.

Solution: di /ai = k for all i.
 That is, the shares are proportional to the
dimension-table sizes.
27


Fact/dimension tables are often used for
analytics.
Aster Data approach: partition fact table among
nodes permanently (shard the fact table);
replicate needed pieces of dimension tables.
Fact
Shard
1
Fact
Shard
2
...
Fact
Shard
k
28


Shard fact table by country.
Wins for distributing the Customer dimension
table.
 Each Customer tuple needed only at the shard for
the country of the customer.

Loses big for other dimension tables, e.g., Item.
 Pretty much every item has been bought by
someone in each country, so Dimension tuples are
replicated to every shard.
29


Our solution lets you partition the fact table to k
shards, one for each reducer.
Analogy: Think of sharding process as taking the
join of the fact table and all dimension tables.
 Shards = reducers.


Only dimension keys (the A’s) get shares, so each
Fact tuple goes to exactly one shard/reducer.
Total copies of dimension tuples =
communication cost, which is minimized.
30

A chain join has the form
R(A0, A1) JOIN R(A1, A2) JOIN … JOIN R(An -1, An)
 Other attributes may appear, but only in one
relation.

A0 and An are dominated; other attributes are
in the map-key.
A0
A1
A2
A3
...
An -1
An
31

Illustrates strange behavior.
 Even and odd n have very different distributions of
the share variables.

Even n : a2 = a4 = … = an -2 = 1;
a1 = a3 = … = an -1 = k2/n
32
33

Even a’s grow exponentially.
 That is, a4 = a22; a6 = a23; a8 = a24,…

The odd a’s form the inverse sequence.
 That is, a1 = an-1; a3 = an-3; a5 = an-5;…
34
35




Suppose we want to join R(A,B) with S(B,C)
using MapReduce, as in the introduction.
But half the tuples of R and S have value B=10.
No matter how many reducers we use, one
reducer gets half the tuples and dominates the
wall-clock time.
To deal with this problem, systems like PIG
handle heavy-hitter values like B=10 specially.
36


Pick one of the relations, say R.
Divide the tuples of R having B=10 into k
groups, with a reducer for each.
 Note: we’re pretending B=10 is the only heavy hitter.
 Works for any value and > 1 heavy hitter.

Send each tuple of S with B=10 to all k reducers.
 Other tuples of R and S are treated normally.
37



Let R have r tuples with B=10 and S have s
tuples with B=10.
Then communication cost = r + ks.
Can be minimum if s << r, but also might not be
best.
38




Let’s partition the tuples of R with B=10 into x
groups and partition the tuples of S with B=10
into y groups, where xy = k.
Use one of k reducers for each pair (i, j) of a
group for R and a group for S.
Send each tuple of R with B=10 to all y groups
of the form (i, ?), and send each tuple of S with
B=10 to all x reducers (?, j).
Communication = ry + sx.
39




We need to minimize ry + sx, under the
constraint xy = k.
Solution: x = kr/s; y = ks/r; communication =
2krs .
Note: 2krs is always < r + ks (the cost of the
implemented skew join), often much less.
Example: if r = s, then new skew join takes 2rk,
while old takes r(k+1).
40
1.
2.
3.
4.
5.
Multiway joins can be computed by replicating
tuples and distributing them to many compute
nodes.
Minimizing communication requires us to
solve a nonlinear optimization.
Multiway beats 2-way joins for star queries
and queries on high-fanout graphs.
Exact solution for chain and star queries.
Simple application to optimal skew joins.
41
1.
We really don’t know the complexity of
optimizing the multiway join in general.
 The algorithm we offered in 2010 EDBT is
exponential in the worst case.
 But is it NP-hard?
2.
The Skew-join technique borrows from the
multiway join, but is a 2-way join.
 Can you generalize the idea to all multiway joins
with skew?
42
```