### Related Presentation

```SCALING SGD TO
BIG DATA
& HUGE MODELS
Alex Beutel
Based on work done with Abhimanu Kumar,
Vagelis Papalexakis, Partha Talukdar, Qirong Ho,
Christos Faloutsos, and Eric Xing
2
Big Learning Challenges
Collaborative Filtering
Predict movie preferences
Dictionary Learning
Remove noise or missing pixels from
images
Tensor Decomposition
Find communities in temporal graphs
400 million tweets per day
Topic Modeling
What are the topics of webpages,
3
Big Data & Huge Model Challenge
• 2 Billion Tweets covering
300,000 words
• Break into 1000 Topics
• More than 2 Trillion
parameters to learn
• Over 7 Terabytes of model
400 million tweets per day
Topic Modeling
What are the topics of webpages,
4
Outline
1. Background
2. Optimization
• Partitioning
• Constraints & Projections
3. System Design
1. General algorithm
3. Distributed normalization
4. “Always-On SGD” – Dealing with stragglers
4. Experiments
5. Future questions
5
BACKGROUND
6
y = (x - 4)2 + (x - 5)2
z1 = 4
y = (x - 4)2
z2 = 5
y = (x - 5)2
7
y = (x - 4)2 + (x - 5)2
z1 = 4
y = (x - 4)2
z2 = 5
y = (x - 5)2
8
SGD for Matrix Factorization Movies
V
Users
U
Genres
≈
X
9
SGD for Matrix Factorization
V
U
≈
Independent!
X
10
The Rise of SGD
• Hogwild! (Niu et al, 2011)
• Noticed independence
• If matrix is sparse, there will be little contention
• Ignore locks
• DSGD (Gemulla et al, 2011)
• Noticed independence
• Broke matrix into blocks
11
DSGD for Matrix Factorization (Gemulla, 2011)
Independent Blocks
12
DSGD for Matrix Factorization (Gemulla, 2011)
Partition your data & model into d × d blocks
Results in d=3 strata
Process strata sequentially,
process blocks in each stratum in parallel
14
TENSOR
DECOMPOSITION
15
What is a tensor?
• Tensors are used for structured data > 2 dimensions
• Think of as a 3D-matrix
For example:
Derek Jeter plays baseball
Subject
Object
Verb
16
Tensor Decomposition
Derek Jeter plays baseball
V
Subject
U
≈
X
Object
Verb
17
Tensor Decomposition
V
U
≈
X
18
Tensor Decomposition
Independent
V
U
≈
Not Independent
X
19
Tensor Decomposition
Z1
Z2
Z3
20
For d=3 blocks per stratum, we require d2=9 strata
Z1
Z1
Z1
Z2
Z2
Z3
Z2
Z3
Z3
Z2
Z2
Z3
Z2
Z1
Z3
Z1
Z3
Z1
Z1
Z2
Z3
Z1
Z3
Z1
Z3
Z2
Z2
21
Coupled Matrix + Tensor Decomposition
Subject
Y
X
Object
Document
Verb
22
Coupled Matrix + Tensor Decomposition
A
V
U
≈
Y
X
23
Coupled Matrix + Tensor Decomposition
Z1
Z2
Z3
Z1
Z2
Z3
24
CONSTRAINTS &
PROJECTIONS
25
Example: Topic Modeling
Words
Topics
Documents
26
Constraints
• Sometimes we want to restrict response:
• Non-negative
• Sparsity
Ui,k ³ 0
min X -UV T
U,V
F
+ lu U 1 + lv V
1
• Simplex (so vectors become probabilities)
åU
k
• Keep inside unit ball
2
U
å i,k £1
k
i,k
=1
27
How to enforce? Projections
• Example: Non-negative
28
More projections
• Sparsity (soft thresholding):
• Simplex
x
P(x) =
x1
• Unit ball
29
Sparse Non-Negative
Tensor Factorization
Sparse encoding
Non-negativity:
More interpretable results
30
Dictionary Learning
• Learn a dictionary of concepts and a sparse
reconstruction
• Useful for fixing noise and missing pixels of images
Sparse encoding
Within unit ball
31
Mixed Membership Network Decomp.
• Used for modeling communities in graphs (e.g. a social
network)
Simplex
Non-negative
32
Proof Sketch of Convergence
[Details]
• Regenerative process – each point is used once/epoch
• Projections are not too big and don’t “wander off”
(Lipschitz continuous)
2
h
• Step sizes are bounded: å t < ¥
t
Noise from SGD
Projection
SGD Constraint error
33
SYSTEM DESIGN
34
High level algorithm
Z1
Z1
Z1
Z2
Z2
Z3
Stratum 1
Z2
Z3
Stratum 2
Z3
Stratum 3
…
for Epoch e = 1 … T do
for Subepoch s = 1 … d2 do
Let Z s = {Z1, Z2 … Zd } be the set of blocks in stratum s
for block b = 1 … d in parallel do
Run SGD on all points in block Z b Î Z s
end
end
end
35
Mappers
Reducers
Run
SGD on
Update:
Z1(1)
Run
SGD on
U2 V1 W3
U3 V2 W1
U1 V3 W2
Update:
Z 2(1)
Run
SGD on
Z3(1)
Update:
36
Mappers
Reducers
Run
SGD on
Update:
Z1(2)
Run
SGD on
U2 V1 W2
U3 V2 W3
U1 V3 W1
Update:
Z 2(2)
Run
SGD on
Z3(3)
Update:
37
• MapReduce is typically very bad for iterative algorithms
• T × d2 iterations
• Little flexibility
38
High Level Algorithm
V1
Z1
V1
V2
Z2
Z3
U1 V1 W1
Z1
V3
U2 V2 W2
V2
Z2
Z3
U3 V3 W3
V3
39
High Level Algorithm
Z1
V1
V2
Z2
Z3
U1 V1 W3
Z1
V1
Z2
V3
V3
Z3
U2 V2 W1
V2
U3 V3 W2
40
High Level Algorithm
Z1
V1
Z2
Z3
U1 V1 W2
Z3
V2
V3
U2 V2 W3
V1
Z2
V2
V3
Z1
U3 V3 W1
42
p = {i, j, k, v}
Mappers
Process
points:
Map each
point
{p, b, s}
Partition
&
Sort
to its block
with necessary
info to order
Reducers
43
Reducers
Mappers
Process
points:
…
Map each
point
Z1(4) Z1(3) Z1(2)
Z1(1)
Partition
&
Sort
to its block
Z 2(5) Z 2(3) Z 2(2)
Z 2(1)
Z3(4) Z3(3) Z3(2)
Z3(1)
with necessary
info to order
…
44
Z1
Reducers
Z2
Z3
Run
SGD on
Mappers
Process
points:
…
Map each
point
Z1(4) Z1(3) Z1(2)
Run
SGD on
&
Sort
Z 2(4) Z 2(3) Z 2(2)
to its block
U1 V1 W1
Z1(1)
Partition
Run
SGD on
…
Z3(4) Z3(3) Z3(2)
Z3(1)
Update:
U2 V2 W2
Z 2(1)
with necessary
info to order
Update:
Update:
U3 V3 W3
45
Z1
Reducers
Z2
Z3
Mappers
Process
points:
…
Map each
point
Z1(5) Z1(4) Z1(3)
Z1(2)
Partition
&
Sort
Z 2(5) Z 2(4) Z 2(3)
to its block
Z 2(2)
with necessary
info to order
…
Z3(5) Z3(4) Z3(3)
Z3(2)
Run
SGD on
Update:
Z1(1)
U1 V1 W1
Run
SGD on
Update:
Z 2(1)
U2 V2 W2
Run
SGD on
Update:
Z3(1)
U3 V3 W3
46
Reducers
Z2
Z3
Mappers
Process
points:
Z1
…
Z1(5) Z1(4) Z1(3)
Run
SGD on
Update:
Z1(2)
U1 V1 W1
HDFS
Map each
point
Partition
&
Sort
Z 2(5) Z 2(4) Z 2(3)
to its block
Run
SGD on
Update:
Z 2(2)
U2 V2 W2
HDFS
with necessary
info to order
…
Z3(5) Z3(4) Z3(3)
Run
SGD on
Update:
Z3(2)
U3 V3 W3
47
System Summary
• Limit storage and transfer of data and model
• Stock Hadoop can be used with HDFS for communication
• Hadoop makes the implementation highly portable
• Alternatively, could also implement on top of MPI or even
a parameter server
48
Distributed Normalization
Words
Topics
π1 β1
Documents
π2 β2
π3 β3
49
Distributed Normalization
Transfer σ(b) to all machines
Each machine calculates σ:
σ(b) is a k-dimensional vector,
summing the terms of βb
s (b)
k =
åb
d
s = ås (b)
b=1
π1 β1
j,k
jÎbb
σ(2)
σ(1)
σ(2) σ(2)
Normalize:
σ(3)
π3 β3
π2 β2
σ(1)
σ(1)
σ(3)
σ(3)
b j,k =
b j,k
sk
50
Barriers & Stragglers
Reducers
Mappers
Process
points:
…
Z1(5) Z1(4) Z1(3)
Z1(2)
Run
SGD on
Update:
Z1(1)
U1 V1 W1
HDFS
Map each
point
Wasting
time
Run
Update:
waiting!
SGD
on
Partition
&
Sort
Z 2(5) Z 2(4) Z 2(3)
to its block
Z 2(2)
Z 2(1)
U2 V2 W2
HDFS
with necessary
info to order
…
Z3(5) Z3(4) Z3(3)
Z3(2)
Run
SGD on
Update:
Z3(1)
U3 V3 W3
51
Solution: “Always-On SGD”
For each reducer:
Run SGD on all points in
current block Z
Shuffle points in Z and
decrease step size
Check if other reducers
Run SGD on points in Z
again
Sync parameters
and get new block Z
Wait
52
“Always-On SGD”
Reducers
Run SGD on old points again!
Process
points:
…
Z1(5) Z1(4) Z1(3)
Z1(2)
Run
SGD on
Update:
Z1(1)
U1 V1 W1
HDFS
Map each
point
Partition
&
Sort
Z 2(5) Z 2(4) Z 2(3)
to its block
Z 2(2)
Run
SGD on
Update:
Z 2(1)
U2 V2 W2
HDFS
with necessary
info to order
…
Z3(5) Z3(4) Z3(3)
Z3(2)
Run
SGD on
Update:
Z3(1)
U3 V3 W3
53
Proof Sketch
[Details]
• Martingale Difference Sequence: At the beginning of each
epoch, the expected number of times each point will be
processed is equal
Z1(2)
Z1(1)
Z 2(2)
Z 2(1)
54
Proof Sketch
[Details]
• Martingale Difference Sequence: At the beginning of each
epoch, the expected number of times each point will be
processed is equal
• Can use properties of SGD and MDS to show variance
decreases with more points used
55
“Always-On SGD”
Reducer 1
Reducer2
Reducer 3
Reducer 4
First SGD pass of block Z
Write Parameters to HDFS
56
EXPERIMENTS
57
FlexiFaCT (Tensor Decomposition)
Convergence
58
FlexiFaCT (Tensor Decomposition)
Scalability in Data Size
59
FlexiFaCT (Tensor Decomposition)
Scalability in Tensor Dimension
Handles up to 2 billion parameters!
60
FlexiFaCT (Tensor Decomposition)
Scalability in Rank of Decomposition
Handles up to 4 billion parameters!
61
Fugue (Using “Always-On SGD”)
Dictionary Learning: Convergence
62
Fugue (Using “Always-On SGD”)
Community Detection: Convergence
63
Fugue (Using “Always-On SGD”)
Topic Modeling: Convergence
64
Fugue (Using “Always-On SGD”)
Topic Modeling: Scalability in Data Size
GraphLab
cannot spill to
disk
65
Fugue (Using “Always-On SGD”)
Topic Modeling: Scalability in Rank
66
Fugue (Using “Always-On SGD”)
Topic Modeling: Scalability over Machines
67
Fugue (Using “Always-On SGD”)
Topic Modeling: Number of Machines
68
Fugue (Using “Always-On SGD”)
69
LOOKING FORWARD
70
Future Questions
• Do “extra updates” work on other techniques, e.g. Gibbs
sampling? Other iterative algorithms?
• What other problems can be partitioned well? (Model &
Data)
• Can we better choose certain data for extra updates?
• How can we store large models on disk for I/O efficient
71
Key Points
• Flexible method for tensors & ML models
• Partition both data and model together for efficiency and
scalability
• When waiting for slower machines, run extra updates on
old data again
• Algorithmic & systems challenges in scaling ML can be