Lecture6 - Temple University

```Made by: Maor Levy, Temple University 2012
Up until now: how to reason in a given model
Machine learning: how to acquire a model on the basis of data / experience
◦ Learning parameters (e.g. probabilities)
◦ Learning structure (e.g. BN graphs)
◦ Learning hidden concepts (e.g. clustering)
What? Parameters, Structure, Hidden concepts
What from? Supervised, Unsupervised, Reinforcement, Self-supervised
What for? Prediction, Diagnosis, Compression, Discovery
How? Passive, Active, Online, Offline
Output? Classification, Regression, Clustering
Details? Generative, Discriminative, Smoothing

Commonly attributed to William of Ockham
fifteen hundred years after Epicurus.
◦ In sharp contrast to the principle of multiple
explanations, it states: Entities should not be
multiplied beyond necessity.


Commonly explained as: when you have choices, choose the simplest theory.
Bertrand Russell: “It is vain to do with more
what can be done with fewer.”
[Figure: four panels, (a) to (d), each plotting f(x) against x]
Given a training set:
(x1, y1), (x2, y2), (x3, y3), … (xn, yn)
where each yi was generated by an unknown function y = f(x),
discover a function h that approximates the true function f.




Input: x = email
Output: y = “spam” or “ham”
Setup:
◦ Get a large collection of example emails, each labeled “spam” or “ham”
◦ Note: someone has to hand label all this data!
◦ Want to learn to predict labels of new, future emails
Features: The attributes used to make the ham / spam decision
◦ Words: FREE!
◦ Text Patterns: \$dd, CAPS
◦ Non-text: SenderInContacts
◦ …
Example emails:
Dear Sir.
First, I must solicit your confidence in this
transaction, this is by virture of its nature
as being utterly confidencial and top
secret. …
TO BE REMOVED FROM FUTURE
MESSAGE AND PUT "REMOVE" IN THE
SUBJECT.
FOR ONLY \$99
Ok, I know this is blatantly OT but I'm
beginning to go insane. Had an old Dell
Dimension XPS sitting in the corner and
decided to put it to use, I know it was
working pre being stuck in the corner, but
when I plugged it in, hit the power nothing
happened.
Naïve Bayes spam filter
Data:
◦ Collection of emails, labeled spam or ham
◦ Note: someone has to hand label all this data!
◦ Split into training, held-out, test sets
Classifiers
◦ Learn on the training set
◦ (Tune it on a held-out set)
◦ Test it on new emails
SPAM



OFFER IS SECRET
HAM






PLAY SPORTS TODAY
WENT PLAY SPORTS
SECRET SPORTS EVENT
SPORT IS TODAY
SPORT COSTS MONEY
Questions:
◦ Size of Vocabulary? 13 words
◦ P(SPAM) = 3/8
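Data: S S S H H H H H, encoded as 1 1 1 0 0 0 0 0
p(S) = π
◦ p(yi) = π if yi = S, and 1 − π if yi = H
◦ p(yi) = π^yi ∗ (1 − π)^(1 − yi)
◦ p(data) = Π(i=1..8) p(yi) = π^count(yi = 1) ∗ (1 − π)^count(yi = 0) = π^3 ∗ (1 − π)^5
◦ Maximizing log p(data): 3/π − 5/(1 − π) = 0, so π = 3/8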
Questions:
◦ P(“SECRET” | SPAM) = 1/3
◦ P(“SECRET” | HAM) = 1/15

Bag-of-Words Naïve Bayes:
◦ Predict unknown class label (spam vs. ham)
◦ Assume evidence features (e.g. the words) are independent

Generative model:
◦ P(C, W1 … Wn) = P(C) Πi P(Wi | C)
◦ Wi is the word at position i, not the ith word in the dictionary!

Tied distributions and bag-of-words
◦ Usually, each variable gets its own conditional probability distribution P(F|Y)
◦ In a bag-of-words model
▪ Each position is identically distributed
▪ All positions share the same conditional probs P(W|C)
▪ Why make this assumption?

General probabilistic model: P(Y, F1 … Fn)
◦ |Y| x |F|^n parameters

General naive Bayes model: P(Y, F1 … Fn) = P(Y) Πi P(Fi | Y)
◦ Graph: Y is the only parent of each of F1, F2, …, Fn
◦ P(Y): |Y| parameters
◦ P(Fi | Y): n x |F| x |Y| parameters
We only specify how each feature depends on the class
Total number of parameters is linear in n
Questions:
◦ MESSAGE M = “SPORTS”
◦ P(SPAM | M) = 3/18 (applying Bayes’ Rule)
Questions:
◦ MESSAGE M = “SECRET IS SECRET”
◦ P(SPAM | M) = 25/26 (applying Bayes’ Rule)
Questions:
◦ MESSAGE M = “TODAY IS SECRET”
◦ P(SPAM | M) = 0 (applying Bayes’ Rule: “TODAY” never occurs in a spam message, so P(M | SPAM) = 0)

Model: P(Y, W1 … Wn) = P(Y) Πi P(Wi | Y)

What are the parameters?

P(Y):
ham : 0.66
spam: 0.33

P(W | spam):
the : 0.0156
to  : 0.0153
and : 0.0115
of  : 0.0095
you : 0.0093
a   : 0.0086
with: 0.0080
from: 0.0075
...

P(W | ham):
the : 0.0210
to  : 0.0133
of  : 0.0119
2002: 0.0110
with: 0.0108
from: 0.0107
and : 0.0105
a   : 0.0100
...

Where do these tables come from?
Counts from examples!

Posteriors determined by relative probabilities (odds ratios):

P(W | ham) / P(W | spam):
south-west : inf
nation     : inf
morally    : inf
nicely     : inf
extent     : inf
seriously  : inf
...

P(W | spam) / P(W | ham):
screens    : inf
minute     : inf
guaranteed : inf
\$205.00    : inf
delivery   : inf
signature  : inf
...

What went wrong here?

Raw counts will overfit the training data!
◦ Unlikely that every occurrence of “minute” is 100% spam
◦ Unlikely that every occurrence of “seriously” is 100% ham
◦ What about all the words that don’t occur in the training set at all? 0/0?
◦ In general, we can’t go around giving unseen events zero probability

At the extreme, imagine using the entire email as the only feature
◦ Would get the training data perfect (if deterministic labeling)
◦ Would not generalize at all
◦ Just making the bag-of-words assumption gives us some generalization, but isn’t enough

To generalize better: we need to smooth or regularize the estimates

Maximum likelihood estimates:
◦ P_ML(x) = count(x) / total samples
◦ Example: the samples r, g, g give P_ML(r) = 1/3

Problems with maximum likelihood estimates:
◦ If I flip a coin once, and it’s heads, what’s the estimate for P(heads)?
◦ What if I flip 10 times with 8 heads?
◦ What if I flip 10M times with 8M heads?

Basic idea:
◦ We have some prior expectation about parameters
◦ Given little evidence, we should skew towards our prior
◦ Given a lot of evidence, we should listen to the data

Laplace’s estimate (extended):
◦ Pretend you saw every outcome k extra times
◦ P_LAP,k(x) = (c(x) + k) / (N + k|x|)
c(x) is the number of occurrences of this value of the variable x.
|x| is the number of values that the variable x can take on.
k is a smoothing parameter.
N is the total number of occurrences of x (the variable, not the value) in the sample.
◦ What’s Laplace with k = 0?
◦ k is the strength of the prior

Laplace for conditionals:
◦ Smooth each condition independently:
◦ P_LAP,k(x|y) = (c(x, y) + k) / (c(y) + k|x|)

In practice, Laplace often performs poorly for P(X|Y):
◦ When |X| is very large
◦ When |Y| is very large

Another option: linear interpolation
◦ Also get P(X) from the data
◦ Make sure the estimate of P(X|Y) isn’t too different from P(X)
◦ P_LIN(x|y) = α P_ML(x|y) + (1 − α) P_ML(x)
◦ What if α is 0? 1?


For real classification problems, smoothing is critical
New odds ratios:

P(W | ham) / P(W | spam):
helvetica : 11.4
seems     : 10.8
group     : 10.2
ago       : 8.4
areas     : 8.3
...

P(W | spam) / P(W | ham):
verdana : 28.8
Credit  : 28.4
ORDER   : 27.2
<FONT>  : 26.9
money   : 26.5
...

Do these make more sense?

Now we’ve got two kinds of unknowns
◦ Parameters: the probabilities P(X|Y), P(Y)
◦ Hyperparameters, like the amount of smoothing to do: k

How to learn?
◦ Learn parameters from training data
◦ Must tune hyperparameters on different data
▪ Why?
◦ For each value of the hyperparameters, train on the training data and test on the held-out (validation) data
◦ Choose the best value and do a final test on the test data

Data: labeled instances, e.g. emails marked spam/ham
◦ Training set
◦ Held-out (validation) set
◦ Test set

Features: attribute-value pairs which characterize each x

Experimentation cycle
◦ Learn parameters (e.g. model probabilities) on training set
◦ Tune hyperparameters on held-out set
◦ Compute accuracy on test set
◦ Very important: never “peek” at the test set!

Evaluation
◦ Accuracy: fraction of instances predicted correctly

Overfitting and generalization
◦ Want a classifier which does well on test data
◦ Overfitting: fitting the training data very closely, but not generalizing well to test data

Need more features – words aren’t enough!
◦ Have you emailed the sender before?
◦ Have 1K other people just gotten the same email?
◦ Is the sending information consistent?
◦ Is the email in ALL CAPS?
◦ Do inline URLs point where they say they point?
◦ Does the email address you by (your) name?

Can add these information sources as new variables in the Naïve Bayes model

Input: x = images (pixel grids)
Output: y = a digit 0-9
Setup:
◦ Get a large collection of example images, each labeled with a digit
◦ Note: someone has to hand label all this data!
◦ Want to learn to predict labels of new, future digit images
Features: The attributes used to make the digit decision
◦ Pixels: (6,8)=ON
◦ Shape Patterns: NumComponents, AspectRatio, NumLoops
◦ …

Simple version:
◦ One feature Fij for each grid position <i,j>
◦ Boolean features
◦ Each input maps to a feature vector, e.g.
◦ Here: lots of features, each is binary valued

Naïve Bayes model: P(Y | F) ∝ P(Y) Πi,j P(Fij | Y)
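Example tables, one row per digit Y. The first value column is the prior P(Y); the other two columns each give P(F = on | Y) for one pixel feature.

Y    P(Y)   P(F = on | Y)   P(F = on | Y)
1    0.1    0.01            0.05
2    0.1    0.05            0.01
3    0.1    0.05            0.90
4    0.1    0.30            0.80
5    0.1    0.80            0.90
6    0.1    0.90            0.90
7    0.1    0.05            0.25
8    0.1    0.60            0.85
9    0.1    0.50            0.60
0    0.1    0.80            0.80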
2 wins!!

Linear regression

What you learned in high school math
◦ From a new perspective

Linear model
◦ y=mx+b
◦ hw(x) = y = w1 x + w0

Find best values for parameters
◦ “maximize goodness of fit”
◦ “maximize probability” or “minimize loss”
◦ Assume true function f is given by
y = f(x) = m x + b + noise
where noise is normally distributed
◦ Then most probable values of parameters are found by minimizing squared-error loss:
Loss(hw) = Σj (yj – hw(xj))^2
[Figure: scatter plot of house price (in \$1000) against house size (in square feet)]
[Figure: house price (in \$1000) against house size (in square feet) with the fitted line y = w1 x + w0, next to a plot of Loss over w0 and w1]
Linear algebra gives an exact solution to the minimization problem
w1 = (M Σ xi yi − Σ xi Σ yi) / (M Σ xi^2 − (Σ xi)^2)
w0 = (1/M) Σ yi − (w1/M) Σ xi
w = any point
loop until convergence do:
    for each wi in w do:
        wi = wi – α ∂Loss(w)/∂wi
[Figure: Loss plotted over w0 and w1]

You learned this in math class too
◦ hw(x) = w ∙ x = w x^T = Σi wi xi

The most probable set of weights, w* (minimizing squared error):
◦ w* = (X^T X)^-1 X^T y



To avoid overfitting, don’t just minimize loss
Maximize probability, including prior over w
Can be stated as minimization:
◦ Cost(h) = EmpiricalLoss(h) + λ Complexity(h)

For linear models, consider
◦ Complexity(hw) = Lq(w) = Σi |wi|^q
◦ L1 regularization minimizes sum of absolute values
◦ L2 regularization minimizes sum of squares
[Figure: two panels over axes w1 and w2, each marking w*, one for L1 regularization and one for L2 regularization]
Cost(h) = EmpiricalLoss(h) + λ Complexity(h)
f(x) = 1 if w1 x + w0 ≥ 0
f(x) = 0 if w1 x + w0 < 0



Pick training example <x,y>
Update (α is learning rate)
◦ w1 ← w1 + α(y − f(x))x
◦ w0 ← w0 + α(y − f(x))

Converges to linear separator (if exists)
Picks “a” linear separator (a good one?)
Support Vector Machines
Maximizes the “margin”




Not linearly separable for x1, x2
What if we add a feature?
x3 = x1^2 + x2^2
See: “Kernel Trick”
[Figure: the data plotted over axes X1, X2 and X3]

If the process of learning good values for parameters is prone to overfitting, can we do without parameters?

Nearest neighbor for digits:
◦ Take new image
◦ Compare to all training images
◦ Assign based on closest example

Encoding: image is vector of intensities

What’s the similarity function?
◦ Dot product of two image vectors?
◦ Usually normalize vectors so ||x|| = 1
◦ min = 0 (when?), max = 1 (when?)
[Figure: scatter plot of the data over features x1 and x2]
Using logistic regression (similar to linear regression) to do linear classification
[Figure: scatter plot of the data over features x1 and x2]
Using nearest neighbors to do classification
[Figure: scatter plot of the data over features x1 and x2]
Even with no parameters, you still have hyperparameters!
[Figure: Average neighborhood size for 10-nearest neighbors, n dimensions, 1M uniform points. Vertical axis: edge length of neighborhood (0 to 1). Horizontal axis: number of dimensions (25 to 200).]
[Figure: Proportion of points that are within the outer shell, 1% of thickness of the hypercube. Vertical axis: proportion of points in exterior shell (0 to 1). Horizontal axis: number of dimensions (25 to 200).]

References:
◦ Peter Norvig and Sebastian Thrun, Artificial Intelligence, Stanford University
http://www.stanford.edu/class/cs221/notes/cs221-lecture5fall11.pdf
```
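
The spam/ham questions in the slides (P(SPAM | M) for “SPORTS”, “SECRET IS SECRET” and “TODAY IS SECRET”) can be checked with a short script. The following is a minimal sketch, not the lecture's own code. It rests on two assumptions: only “OFFER IS SECRET” survives in the spam column of the slide text, so the other two spam messages are assumed here, and “SPORT” is written as “SPORTS” so that the slide's answers come out. The function names and the `k` parameter (Laplace smoothing strength, with `k=0` giving maximum likelihood) are illustrative.

```python
from collections import Counter
from fractions import Fraction

# Toy corpus. ASSUMPTION: the 2nd and 3rd spam messages are not in the slide
# text, and SPORT is normalized to SPORTS in the last two ham messages.
SPAM = ["OFFER IS SECRET", "CLICK SECRET LINK", "SECRET SPORTS LINK"]
HAM = ["PLAY SPORTS TODAY", "WENT PLAY SPORTS", "SECRET SPORTS EVENT",
       "SPORTS IS TODAY", "SPORTS COSTS MONEY"]


def word_counts(messages):
    """Count every word occurrence across a list of messages."""
    return Counter(word for m in messages for word in m.split())


def posterior_spam(message, k=0):
    """P(SPAM | message) under bag-of-words naive Bayes.

    k is the Laplace smoothing parameter; k=0 is the maximum likelihood estimate.
    """
    counts = {"spam": word_counts(SPAM), "ham": word_counts(HAM)}
    n_msgs = {"spam": len(SPAM), "ham": len(HAM)}
    vocab = set(counts["spam"]) | set(counts["ham"])
    total_msgs = sum(n_msgs.values())

    joint = {}
    for label in ("spam", "ham"):
        # Class prior, smoothed over the 2 possible classes.
        p = Fraction(n_msgs[label] + k, total_msgs + 2 * k)
        n_words = sum(counts[label].values())
        for word in message.split():
            # Word likelihood, smoothed over the vocabulary.
            p *= Fraction(counts[label][word] + k, n_words + k * len(vocab))
        joint[label] = p
    # Bayes' rule: normalize the joint over both classes.
    return joint["spam"] / (joint["spam"] + joint["ham"])


if __name__ == "__main__":
    for m in ("SPORTS", "SECRET IS SECRET", "TODAY IS SECRET"):
        print(m, posterior_spam(m), float(posterior_spam(m, k=1)))
```

With `k=0` the posterior for “TODAY IS SECRET” is exactly zero because one unseen word wipes out the spam hypothesis; with `k=1` the same message gets a posterior strictly between 0 and 1, which is the point of the smoothing slides.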
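
The linear regression slides give both a closed-form solution for w1 and w0 and a gradient descent loop. As a rough illustration, the sketch below implements both for the single-feature model y = w1 x + w0 with the squared-error loss. The data, the learning rate `alpha`, the step count, and the function names are all made up for the example; the learning rate in particular has to be small enough for the data at hand.

```python
def fit_closed_form(xs, ys):
    """Exact least-squares solution for y = w1*x + w0."""
    m = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    w1 = (m * sxy - sx * sy) / (m * sxx - sx ** 2)
    w0 = sy / m - w1 * sx / m
    return w0, w1


def fit_gradient_descent(xs, ys, alpha=0.01, steps=5000):
    """Minimize Loss(w) = sum_j (y_j - (w1*x_j + w0))^2 by gradient descent."""
    w0 = w1 = 0.0  # "w = any point"
    for _ in range(steps):
        # Partial derivatives of the squared-error loss.
        g0 = sum(-2 * (y - (w1 * x + w0)) for x, y in zip(xs, ys))
        g1 = sum(-2 * (y - (w1 * x + w0)) * x for x, y in zip(xs, ys))
        # Update both weights from the same gradient evaluation.
        w0 -= alpha * g0
        w1 -= alpha * g1
    return w0, w1


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated by y = 2x + 1, no noise
    print(fit_closed_form(xs, ys))
    print(fit_gradient_descent(xs, ys))
```

On noise-free data both routes recover the generating line; on real data the closed form is exact while gradient descent approaches it over the iterations.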
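
The perceptron slide states a threshold function and an update rule for w1 and w0. A minimal sketch of that rule on one-dimensional inputs follows; the training examples, the learning rate, and the epoch limit are invented for illustration, and the loop stops early once a full pass makes no mistakes.

```python
def threshold(w0, w1, x):
    """f(x) = 1 if w1*x + w0 >= 0, else 0."""
    return 1 if w1 * x + w0 >= 0 else 0


def train_perceptron(examples, alpha=0.1, max_epochs=1000):
    """Apply w <- w + alpha * (y - f(x)) * x until every example is classified correctly."""
    w0 = w1 = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in examples:
            error = y - threshold(w0, w1, x)  # -1, 0 or +1
            if error != 0:
                w1 += alpha * error * x
                w0 += alpha * error
                mistakes += 1
        if mistakes == 0:  # a separator has been found
            break
    return w0, w1


if __name__ == "__main__":
    # Linearly separable toy data: label 0 for small x, label 1 for large x.
    examples = [(1.0, 0), (2.0, 0), (4.0, 1), (5.0, 1)]
    w0, w1 = train_perceptron(examples)
    print(w0, w1, [threshold(w0, w1, x) for x, _ in examples])
```

The separator it returns is just the first one the updates happen to reach, which is what motivates the margin-maximizing choice made by support vector machines.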
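
The slide on adding the feature x3 = x1^2 + x2^2 can be illustrated with a tiny example. In the sketch below, points near the origin and points far from it cannot be split by a line in (x1, x2), but a single threshold on x3 separates them. The sample points and the threshold value are assumptions chosen for the illustration.

```python
def lift(x1, x2):
    """Map (x1, x2) to (x1, x2, x3) with the added feature x3 = x1^2 + x2^2."""
    return (x1, x2, x1 ** 2 + x2 ** 2)


def classify(point, x3_threshold=1.0):
    """Linear rule in the lifted space: class 1 if x3 exceeds the threshold."""
    _, _, x3 = lift(*point)
    return 1 if x3 > x3_threshold else 0


if __name__ == "__main__":
    inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.3)]   # class 0, near the origin
    outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]   # class 1, surrounding ring
    print([classify(p) for p in inner], [classify(p) for p in outer])
```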
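
The nearest-neighbor slide describes classifying a new image by comparing it to every training image with a dot product of normalized intensity vectors. A minimal sketch of that procedure is below. The four-pixel "images", their labels, and the function names are hypothetical stand-ins for real digit data.

```python
import math


def normalize(v):
    """Scale a vector to unit length so that ||x|| = 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity(a, b):
    """Dot product of the two normalized vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def nearest_neighbor(training, query):
    """Return the label of the training example most similar to the query."""
    best_vector, best_label = max(training, key=lambda ex: similarity(ex[0], query))
    return best_label


if __name__ == "__main__":
    # Hypothetical 4-pixel intensity vectors with made-up labels.
    training = [([0.9, 0.8, 0.1, 0.0], "A"),
                ([0.0, 0.1, 0.9, 1.0], "B")]
    print(nearest_neighbor(training, [1.0, 0.7, 0.0, 0.2]))
```

For non-negative intensities the similarity reaches 1 when the two images are identical up to scale, and 0 when they share no lit pixels, which answers the "when?" questions on the slide.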
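
The last two plots concern nearest neighbors in high dimensions. Assuming they follow the usual uniform-hypercube calculation (an assumption, since the plotting code is not part of the slides), a cube that holds k of N uniformly spread points has volume k/N and so edge length (k/N)^(1/n), and the fraction of points within a shell of 1% thickness on every side is 1 − 0.98^n. The sketch below evaluates both quantities; the function names are illustrative.

```python
def neighborhood_edge_length(k, n_points, dims):
    """Edge of the hypercube expected to contain k of n_points uniform points."""
    return (k / n_points) ** (1.0 / dims)


def outer_shell_fraction(dims, thickness=0.01):
    """Fraction of the unit hypercube's volume within `thickness` of some face."""
    return 1.0 - (1.0 - 2.0 * thickness) ** dims


if __name__ == "__main__":
    for dims in (2, 25, 100, 200):
        edge = neighborhood_edge_length(10, 1_000_000, dims)
        shell = outer_shell_fraction(dims)
        print(dims, round(edge, 3), round(shell, 3))
```

Under these assumptions the 10-nearest-neighbor "neighborhood" spans most of each axis once there are a couple of hundred dimensions, and nearly every point sits close to the boundary.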