### slides - Yisong Yue

Machine Learning & Data Mining
CS/CNS/EE 155
Lecture 3: Regularization, Sparsity & Lasso
Homework 1
• Check course website!
• Some coding required
• Some plotting required
– I recommend Matlab
• Has supplementary datasets
• Submit via Moodle (due Jan 20th @ 5pm)
Recap: Complete Pipeline

Training Data: S = {(x_i, y_i)}_{i=1}^N

Model Class(es): f(x | w, b) = w^T x − b

Loss Function: L(a, b) = (a − b)^2

argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b))

Cross Validation & Model Selection → Profit!
Different Model Classes?

• Option 1: SVMs vs ANNs vs LR vs LS
• Option 2: Regularization

argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b))

Cross Validation & Model Selection
Notation

• L0 Norm: number of non-zero entries
  ‖w‖_0 = Σ_d 1[w_d ≠ 0]
• L1 Norm: sum of absolute values
  |w| ≡ ‖w‖_1 = Σ_d |w_d|
• L2 Norm: square root of the sum of squares
  ‖w‖ ≡ ‖w‖_2 = sqrt(Σ_d w_d^2)
• Squared L2 Norm: sum of squares
  ‖w‖_2^2 = Σ_d w_d^2 ≡ w^T w
• L-infinity Norm: max absolute value
  ‖w‖_∞ = lim_{p→∞} (Σ_d |w_d|^p)^{1/p} = max_d |w_d|
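Each of the norms above is one line of NumPy; a quick sketch (the vector values are just an example):

```python
import numpy as np

w = np.array([0.5, 0.0, -2.0, 1.5])

l0 = np.count_nonzero(w)   # L0 "norm": number of non-zero entries
l1 = np.abs(w).sum()       # L1 norm: sum of absolute values
l2_sq = w @ w              # squared L2 norm: w^T w = sum of squares
l2 = np.sqrt(l2_sq)        # L2 norm: sqrt of the sum of squares
linf = np.abs(w).max()     # L-infinity norm: max absolute value
```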
Notation Part 2

• Minimizing Squared Loss
  – Regression
  – Least-Squares
    argmin_w Σ_i (y_i − w^T x_i + b)^2
  – (Unless Otherwise Stated)
• E.g., Logistic Regression = Log Loss
Ridge Regression

argmin_{w,b} λ w^T w + Σ_i (y_i − w^T x_i + b)^2
             [Regularization]  [Training Loss]

• aka L2-Regularized Regression
• Trades off model complexity vs training loss
• Each choice of λ is a “model class”
  – Will discuss this further later
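The ridge objective has a closed-form minimizer: setting the gradient to zero gives (λI + X^T X) w = X^T y. A minimal NumPy sketch (bias b omitted, matching the slides' own simplification; `ridge_fit` is my name):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize lam * w^T w + sum_i (y_i - w^T x_i)^2.
    Zero gradient: 2*lam*w - 2*X^T(y - Xw) = 0  =>  (lam*I + X^T X) w = X^T y."""
    D = X.shape[1]
    return np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)
```

As λ grows the solution shrinks toward 0, which is the trade-off the slide describes.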
Example:
x = [ 1[age>10] ; 1[gender=male] ],    y = 1 if height > 55", y = 0 if height ≤ 55"

argmin_{w,b} λ w^T w + Σ_i (y_i − w^T x_i + b)^2

| Person | Age>10 | Male? | Height>55" |
|--------|--------|-------|------------|
| Alice  | 1 | 0 | 1 |
| Bob    | 0 | 1 | 0 |
| Carol  | 0 | 0 | 0 |
| Dave   | 1 | 1 | 1 |
| Erin   | 1 | 0 | 1 |
| Frank  | 0 | 1 | 1 |
| Gena   | 0 | 0 | 0 |
| Harold | 1 | 1 | 1 |
| Irene  | 1 | 0 | 0 |
| John   | 0 | 1 | 1 |
| Kelly  | 1 | 0 | 1 |
| Larry  | 1 | 1 | 1 |

(Figure: learned w and b on train and test for increasing λ — the entries shrink in magnitude toward 0 as λ grows; the exact values are not cleanly recoverable from the extraction.)
Updated Pipeline

Training Data: S = {(x_i, y_i)}_{i=1}^N

Model Class: f(x | w, b) = w^T x − b

Loss Function: L(a, b) = (a − b)^2

argmin_{w,b} λ w^T w + Σ_{i=1}^N L(y_i, f(x_i | w, b))

Cross Validation & Model Selection → Profit!
Model score with increasing λ (left to right):

| Person | Age>10 | Male? | Height>55" | Scores (increasing λ) |
|--------|--------|-------|------------|-----------------------|
| Alice  | 1 | 0 | 1 | 0.91, 0.89, 0.83, 0.75, 0.67 |
| Bob    | 0 | 1 | 0 | 0.42, 0.45, 0.50, 0.58, 0.67 |
| Carol  | 0 | 0 | 0 | 0.17, 0.26, 0.42, 0.50, 0.67 |
| Dave   | 1 | 1 | 1 | 1.16, 1.06, 0.91, 0.83, 0.67 |
| Erin   | 1 | 0 | 1 | 0.91, 0.89, 0.83, 0.79, 0.67 |
| Frank  | 0 | 1 | 1 | 0.42, 0.45, 0.50, 0.54, 0.67 |
| Gena   | 0 | 0 | 0 | 0.17, 0.27, 0.42, 0.50, 0.67 |
| Harold | 1 | 1 | 1 | 1.16, 1.06, 0.91, 0.83, 0.67 |
| Irene  | 1 | 0 | 0 | 0.91, 0.89, 0.83, 0.79, 0.67 |
| John   | 0 | 1 | 1 | 0.42, 0.45, 0.50, 0.54, 0.67 |
| Kelly  | 1 | 0 | 1 | 0.91, 0.89, 0.83, 0.79, 0.67 |
| Larry  | 1 | 1 | 1 | 1.16, 1.06, 0.91, 0.83, 0.67 |

(The original slide splits these rows into Train and Test groups and marks the λ achieving the best test error; that grouping is not recoverable from the extraction.)
Choice of Lambda Depends on Training Size

(Figure: 25-dimensional space; randomly generated linear response function + noise.)
Recap: Ridge Regularization

• Ridge Regression:
  – L2 Regularized Least-Squares
    argmin_{w,b} λ w^T w + Σ_i (y_i − w^T x_i + b)^2
• Large λ → more stable predictions
  – Less likely to overfit to training data
  – Too large λ → underfit
• Works with other losses
  – Hinge Loss, Log Loss, etc.
Model Class Interpretation

argmin_{w,b} λ w^T w + Σ_{i=1}^N L(y_i, f(x_i | w, b))

• This is not a model class!
  – At least not what we’ve discussed...
• It is an optimization procedure
  – Is there a connection?
Norm Constrained Model Class

f(x | w, b) = w^T x − b,    s.t. w^T w ≤ c   (equivalently ‖w‖_2^2 ≤ c)

Seems to correspond to λ…

Visualization: argmin_{w,b} λ w^T w + Σ_{i=1}^N L(y_i, f(x_i | w, b))

(Figure: loss contours with norm-ball constraints c = 1, 2, 3.)
Lagrange Multipliers
(Omitting b & using 1 training data point for simplicity)

argmin_w L(y, w) ≡ (y − w^T x)^2    s.t. w^T w ≤ c

• Optimality Condition:
  ∃λ ≥ 0 : (∂_w L(y, w) + λ ∂_w w^T w = 0) ∧ (w^T w ≤ c)
  – The gradient condition binds on the constraint boundary

http://en.wikipedia.org/wiki/Lagrange_multiplier
Norm Constrained Model Class Training:
(Omitting b & using 1 training data point for simplicity)

argmin_w L(y, w) ≡ (y − w^T x)^2    s.t. w^T w ≤ c

Two conditions must be satisfied at optimality:
∃λ ≥ 0 : (∂_w L(y, w) + λ ∂_w w^T w = 0) ∧ (w^T w ≤ c)

Lagrangian:
argmin_{w,λ} L(w, λ) = (y − w^T x)^2 + λ (w^T w − c)

Claim: solving the Lagrangian solves the norm-constrained training problem.

Optimality implication of the Lagrangian — stationarity in w satisfies the first condition:
∂_w L(w, λ) = −2x (y − w^T x) + 2λw ≡ 0   ⇒   2x (y − w^T x) = 2λw

http://en.wikipedia.org/wiki/Lagrange_multiplier
Norm Constrained Model Class Training:
(Omitting b & using 1 training data point for simplicity)

argmin_w L(y, w) ≡ (y − w^T x)^2    s.t. w^T w ≤ c

Two conditions must be satisfied at optimality:
∃λ ≥ 0 : (∂_w L(y, w) + λ ∂_w w^T w = 0) ∧ (w^T w ≤ c)

Lagrangian:
argmin_{w,λ} L(w, λ) = (y − w^T x)^2 + λ (w^T w − c)

Claim: solving the Lagrangian solves the norm-constrained training problem.

Optimality implications of the Lagrangian:

• Stationarity in w satisfies the first condition:
  ∂_w L(w, λ) = −2x (y − w^T x) + 2λw ≡ 0   ⇒   2x (y − w^T x) = 2λw

• Optimizing over λ ≥ 0 satisfies the second condition:
  ∂_λ L(w, λ) = 0 if w^T w < c (the constraint is slack, so λ = 0),
  and ∂_λ L(w, λ) = w^T w − c if w^T w ≥ c (pushing w back to w^T w ≤ c)

http://en.wikipedia.org/wiki/Lagrange_multiplier
L2 Regularized Training:
argmin_w λ w^T w + (y − w^T x)^2

Norm Constrained Model Class Training:
argmin_w L(y, w) ≡ (y − w^T x)^2    s.t. w^T w ≤ c

Lagrangian:
argmin_{w,λ} L(w, λ) = (y − w^T x)^2 + λ (w^T w − c)

• Lagrangian = Norm Constrained Training:
  ∃λ ≥ 0 : (∂_w L(y, w) + λ ∂_w w^T w = 0) ∧ (w^T w ≤ c)
• Lagrangian = L2 Regularized Training:
  – Hold λ fixed
  – Then it is equivalent to solving the Norm Constrained problem, for some c

(Omitting b & using 1 training data point for simplicity)
http://en.wikipedia.org/wiki/Lagrange_multiplier
Recap #2: Ridge Regularization

• Ridge Regression:
  – L2 Regularized Least-Squares = Norm Constrained Model
    argmin_{w,b} λ w^T w + L(w)  ≡  argmin_{w,b} L(w)  s.t. w^T w ≤ c
• Large λ → more stable predictions
  – Less likely to overfit to training data
  – Too large λ → underfit
• Works with other losses
  – Hinge Loss, Log Loss, etc.
Hallucinating Data Points
(Omitting b for simplicity)

argmin_w λ w^T w + Σ_{i=1}^N (y_i − w^T x_i)^2

∂_w = 2λw − 2 Σ_{i=1}^N x_i (y_i − w^T x_i)

• Instead hallucinate D data points {(√λ e_d, 0)}_{d=1}^D ?
  (e_d = unit vector along the d-th dimension)

argmin_w Σ_{d=1}^D (0 − w^T √λ e_d)^2 + Σ_{i=1}^N (y_i − w^T x_i)^2

∂_w = 2 Σ_{d=1}^D (√λ e_d)(w^T √λ e_d) − 2 Σ_{i=1}^N x_i (y_i − w^T x_i)

and since 2 Σ_{d=1}^D (√λ e_d)(√λ w_d) = 2λ Σ_d e_d w_d = 2λw,
this is identical to regularization!
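The equivalence above can be verified numerically: append the D hallucinated points (√λ·e_d, 0) to the data, run plain unregularized least squares, and you recover exactly the ridge solution. A sketch on randomly generated data (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, lam = 20, 5, 3.0
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

# Ridge closed form: (lam*I + X^T X) w = X^T y
w_ridge = np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)

# Hallucinate D extra points (sqrt(lam) * e_d, 0), one per dimension,
# then solve ordinary (unregularized) least squares on the augmented data.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(D)])
y_aug = np.concatenate([y, np.zeros(D)])
w_ls, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
```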
• Two related tasks:
  – Spam filter for Alice
  – Spam filter for Bob
• Limited training data for both…
  – … but Alice is similar to Bob
• Two Training Sets (N relatively small):
  S^(1) = {(x_i^(1), y_i^(1))}_{i=1}^N,    S^(2) = {(x_i^(2), y_i^(2))}_{i=1}^N

• Option 1: Train Separately (omitting b for simplicity):

  argmin_w λ w^T w + Σ_{i=1}^N (y_i^(1) − w^T x_i^(1))^2
  argmin_v λ v^T v + Σ_{i=1}^N (y_i^(2) − v^T x_i^(2))^2

  Both models have high error.
• Two Training Sets (N relatively small):
  S^(1) = {(x_i^(1), y_i^(1))}_{i=1}^N,    S^(2) = {(x_i^(2), y_i^(2))}_{i=1}^N

• Option 2: Train Jointly (omitting b for simplicity):

  argmin_{w,v} λ w^T w + Σ_{i=1}^N (y_i^(1) − w^T x_i^(1))^2
             + λ v^T v + Σ_{i=1}^N (y_i^(2) − v^T x_i^(2))^2

  Doesn’t accomplish anything!
  (w & v don’t depend on each other)
• Option 3: Train Jointly with a coupling term (omitting b for simplicity):

  argmin_{w,v} λ w^T w + λ v^T v + γ (w − v)^T (w − v)
             + Σ_{i=1}^N (y_i^(1) − w^T x_i^(1))^2 + Σ_{i=1}^N (y_i^(2) − v^T x_i^(2))^2

  (first two terms: standard regularization; γ-term: “closeness” regularization; sums: training loss)

• Prefer w & v to be “close”
  – Controlled by γ
• Larger γ helps!
• But γ should not be too large
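The coupled objective is still a quadratic in (w, v), so it can be solved in closed form by stacking u = [w; v] into one block-linear system (stationarity in w gives (λI + γI + X1^T X1) w − γv = X1^T y1, and symmetrically for v). A sketch, with bias omitted and `joint_fit` being my name:

```python
import numpy as np

def joint_fit(X1, y1, X2, y2, lam, gamma):
    """Minimize lam*w'w + lam*v'v + gamma*(w-v)'(w-v)
               + ||y1 - X1 w||^2 + ||y2 - X2 v||^2
    via the stacked normal equations in u = [w; v]."""
    D = X1.shape[1]
    I = np.eye(D)
    A = np.block([
        [(lam + gamma) * I + X1.T @ X1, -gamma * I],
        [-gamma * I, (lam + gamma) * I + X2.T @ X2],
    ])
    b = np.concatenate([X1.T @ y1, X2.T @ y2])
    u = np.linalg.solve(A, b)
    return u[:D], u[D:]
```

With γ = 0 this decouples into the two separate ridge problems of Option 1; with very large γ it forces w ≈ v, matching the slide's warning that γ should not be too large.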
Lasso
L1-Regularized Least-Squares
L1 Regularized Least Squares
(Omitting b for simplicity)

L1: argmin_w λ |w| + Σ_{i=1}^N (y_i − w^T x_i)^2
L2: argmin_w λ ‖w‖_2^2 + Σ_{i=1}^N (y_i − w^T x_i)^2

• L2 penalty: shrinking w = 2 → w = 1 reduces the penalty by 3 (4 → 1),
  but shrinking w = 1 → w = 0 reduces it by only 1
• L1 penalty: shrinking w = 2 → w = 1 and w = 1 → w = 0 each reduce
  the penalty by exactly 1
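The different shrinkage incentives are easy to check numerically; a tiny sketch for a single weight:

```python
# Penalty saved when one weight shrinks 2 -> 1, versus 1 -> 0
l2 = lambda w: w ** 2    # squared L2 penalty on a single weight
l1 = lambda w: abs(w)    # L1 penalty on a single weight

# L2: the first unit of shrinkage saves 3x more than the last one,
# so the optimizer has little incentive to push small weights to exactly 0.
save_l2_first = l2(2) - l2(1)
save_l2_last = l2(1) - l2(0)

# L1: every unit of shrinkage saves the same amount,
# so driving a weight all the way to 0 stays worthwhile.
save_l1_first = l1(2) - l1(1)
save_l1_last = l1(1) - l1(0)
```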
Subgradient (generalizes the gradient to non-differentiable points):

∇_a R(a) = { c | ∀a′ : R(a′) − R(a) ≥ c (a′ − a) }

• Differentiable case: ∇_a R(a) = { ∂_a R(a) }
• L1:
  ∇_{w_d} |w| = −1 if w_d < 0;  +1 if w_d > 0;  [−1, +1] if w_d = 0
  – A continuous range of subgradients at w_d = 0!

(Omitting b for simplicity)
L1 Regularized Least Squares
(Omitting b for simplicity)

L1: argmin_w λ |w| + Σ_{i=1}^N (y_i − w^T x_i)^2
L2: argmin_w λ ‖w‖_2^2 + Σ_{i=1}^N (y_i − w^T x_i)^2

• L2: ∇_{w_d} ‖w‖_2^2 = 2 w_d
• L1: ∇_{w_d} |w| = −1 if w_d < 0;  +1 if w_d > 0;  [−1, +1] if w_d = 0
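The [−1, +1] subgradient range at 0 is what lets individual coordinates land exactly on 0. Solving the per-coordinate subgradient condition of the L1 objective above gives a soft-thresholding update; a minimal coordinate-descent sketch (bias omitted; `soft_threshold` and `lasso_cd` are my names):

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward 0 by t; anything in [-t, t] maps to exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, iters=100):
    """Coordinate descent for argmin_w lam*|w|_1 + sum_i (y_i - w^T x_i)^2.
    For coordinate d the optimality condition yields
    w_d = soft_threshold(x_d^T r, lam/2) / ||x_d||^2, with r the residual
    computed as if w_d were zero."""
    N, D = X.shape
    w = np.zeros(D)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(iters):
        for d in range(D):
            r = y - X @ w + X[:, d] * w[d]   # residual excluding coordinate d
            w[d] = soft_threshold(X[:, d] @ r, lam / 2.0) / col_sq[d]
    return w
```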
Lagrange Multipliers
(Omitting b & using 1 training data point for simplicity)

argmin_w L(y, w) ≡ (y − w^T x)^2    s.t. |w| ≤ c

∇_{w_d} |w| = −1 if w_d < 0;  +1 if w_d > 0;  [−1, +1] if w_d = 0

∃λ ≥ 0 : (0 ∈ ∂_w L(y, w) + λ ∇_w |w|) ∧ (|w| ≤ c)

Solutions tend to be at corners!

http://en.wikipedia.org/wiki/Lagrange_multiplier
Sparsity
(Omitting b for simplicity)

• w is sparse if it is mostly 0’s:
  ‖w‖_0 = Σ_d 1[w_d ≠ 0]
  – Small L0 Norm
• Why not L0 Regularization?
  argmin_w λ ‖w‖_0 + Σ_{i=1}^N (y_i − w^T x_i)^2
  – Not continuous!
• L1 induces sparsity:
  argmin_w λ |w| + Σ_{i=1}^N (y_i − w^T x_i)^2
  – And is continuous!
Why is Sparsity Important?

• Computational / Memory Efficiency
  – Store 1M numbers in an array, or
  – Store 2 numbers per non-zero entry
    • (Index, Value) pairs
    • E.g., [ (50, 1), (51, 1) ]
  – Dot product w^T x becomes more efficient
• Sometimes the true w is sparse
  – Want to recover the non-zero dimensions
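With the (Index, Value) representation above, the dot product costs time proportional to the number of non-zeros rather than the full dimension; a small sketch (storing the pairs in a dict is my choice):

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {index: value} dicts.
    Only non-zero entries are stored, so a 1M-dimensional vector with two
    non-zeros costs two lookups, not a million multiplications."""
    if len(a) > len(b):
        a, b = b, a   # iterate over the sparser vector
    return sum(v * b.get(i, 0.0) for i, v in a.items())

# The slide's example [(50, 1), (51, 1)] as a dict:
w = {50: 1.0, 51: 1.0}
x = {51: 2.0, 999_999: 5.0}
```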
Lasso Guarantee

argmin_w λ |w| + Σ_{i=1}^N (y_i − w^T x_i + b)^2

• Suppose the data are generated as: y_i ~ Normal(w*^T x_i, σ^2)
• Then if:
  λ > (2/κ) √(2 σ^2 log D / N)
• With high probability (increasing with N):
  – Supp(w) ⊆ Supp(w*)   [High Precision Parameter Recovery]
  – ∀d : |w*_d| ≥ λc  ⇒  Supp(w) = Supp(w*)   [Sometimes High Recall]
  where Supp(w*) = {d | w*_d ≠ 0}

http://www.eecs.berkeley.edu/~wainwrig/Papers/Wai_SparseInfo09.pdf
(The Person / Age>10 / Male? / Height>55" table from earlier repeats here; the accompanying slide content was lost in extraction.)
Recap: Lasso vs Ridge

• Model Assumptions
  – Lasso learns a sparse weight vector
• Predictive Accuracy
  – Lasso is often not as accurate
  – Remedy: re-run Least Squares on the dimensions selected by Lasso
• Ease of Inspection
  – Sparse w’s are easier to inspect
• Ease of Optimization
  – Lasso is somewhat trickier to optimize
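The "re-run Least Squares on the dimensions selected by Lasso" remedy can be sketched directly (names are mine; `w_lasso` is assumed to come from any Lasso solver):

```python
import numpy as np

def refit_on_support(X, y, w_lasso):
    """Keep only the dimensions Lasso selected (its non-zero entries),
    then re-run ordinary least squares restricted to that subset.
    This removes the L1 shrinkage bias from the surviving weights."""
    support = np.flatnonzero(w_lasso)
    w = np.zeros_like(w_lasso)
    if support.size:
        w[support], *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
    return w
```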
Recap: Regularization
(Omitting b for simplicity)

• L2:
  argmin_w λ ‖w‖_2^2 + Σ_{i=1}^N (y_i − w^T x_i)^2
• L1 (Lasso):
  argmin_w λ |w| + Σ_{i=1}^N (y_i − w^T x_i)^2
• Multi-task coupling:
  argmin_{w,v} λ w^T w + λ v^T v + γ (w − v)^T (w − v)
             + Σ_{i=1}^N (y_i^(1) − w^T x_i^(1))^2 + Σ_{i=1}^N (y_i^(2) − v^T x_i^(2))^2
• [Insert Yours Here!]
Next Lecture:
Recent Applications of Lasso
• Cancer Detection
• Personalization