SVM

Summer Course: Data Mining

Support Vector Machines
and other penalization classifiers

Presenter: Georgi Nalbantov

August 2009
Contents

- Purpose
- Linear Support Vector Machines
- Nonlinear Support Vector Machines
- (Theoretical justifications of SVM)
- Marketing Examples
- Other penalization classification methods
- Conclusion and Q & A (some extensions)
Purpose

Task to be solved (the classification task):
Classify cases (customers) into “type 1” or “type 2” on the basis of some known attributes (characteristics).

Chosen tool to solve this task:
Support Vector Machines
The Classification Task

Given data on explanatory and explained variables, where the explained variable can take two values {±1}, find a function that gives the “best” separation between the “−1” cases and the “+1” cases:

    Given:  ( x1, y1 ), … , ( xm, ym ) ∈ ℝⁿ × {±1}
    Find:   f : ℝⁿ → {±1}

“Best function”: the expected error on unseen data ( xm+1, ym+1 ), … , ( xm+k, ym+k ) is minimal.

Existing techniques to solve the classification task:
- Linear and Quadratic Discriminant Analysis
- Logit choice models (Logistic Regression)
- Decision trees, Neural Networks, Least Squares SVM
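The “best function” criterion above is an expected 0-1 loss on unseen data; on a given sample one can at least measure the empirical 0-1 error. A minimal sketch (the threshold rule is made up for illustration; the four (x, y) pairs reuse the toy training table that appears later in the deck):

```python
# Empirical 0-1 loss of a candidate classifier f on labeled data.

def zero_one_loss(f, data):
    """Fraction of points where the predicted label differs from the true one."""
    errors = sum(1 for x, y in data if f(x) != y)
    return errors / len(data)

# 1-D training sample (x, y) with y in {-1, +1}, taken from the toy table
# used later in the deck.
train = [(0.2, -1), (0.5, 1), (0.7, 1), (-0.7, -1)]

# A simple (made-up) candidate rule: predict +1 when x exceeds a threshold.
f = lambda x: 1 if x > 0.3 else -1

print(zero_one_loss(f, train))  # 0.0 on this sample
```

Moving the threshold past one of the positive points immediately raises the empirical loss, which is exactly the quantity the classifier competition below is judged on.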
Support Vector Machines: Definition

- Support Vector Machines are a non-parametric tool for classification/regression
- Support Vector Machines are used for prediction rather than description purposes
- Support Vector Machines have been developed by Vapnik and co-workers
Linear Support Vector Machines

Example (Nissan Levin and Jacob Zahavi in Lattin, Carroll and Green, 2003):
A direct marketing company wants to sell a new book, “The Art History of Florence”.

Problem: How to identify buyers and nonbuyers using the two variables:
- Months since last purchase
- Number of art books purchased

[Scatter plot: number of art books purchased vs. months since last purchase; ∆ and ● mark the two groups of customers]
Linear SVM: Separable Case

Main idea of SVM: separate the groups by a line.

However: there are infinitely many lines that have zero training error… which line shall we choose?

[Scatter plot: several candidate separating lines between the ∆ and ● groups]
Linear SVM: Separable Case

SVM use the idea of a margin around the separating line:
- the thinner the margin, the more complex the model;
- the best line is the one with the largest margin.

[Scatter plot: separating line with a margin band between the ∆ and ● groups]
Linear SVM: Separable Case

The line having the largest margin is:

    w1x1 + w2x2 + b = 0

where
    x1 = months since last purchase
    x2 = number of art books purchased

Note:

    w1xi1 + w2xi2 + b ≥ +1   for i ∈ ∆
    w1xj1 + w2xj2 + b ≤ −1   for j ∈ ●

[Scatter plot: the maximal-margin line with its normal vector w]
Linear SVM: Separable Case

The width of the margin is given by:

    margin = [ 1 − (−1) ] / √( w1² + w2² ) = 2 / ||w||

Note:

    maximize the margin 2 / ||w||
    ⇔ minimize ||w||
    ⇔ minimize ||w||² / 2

[Scatter plot: margin of width 2/||w|| around the separating line]
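The margin formula above is easy to check numerically. A minimal sketch (the weight vector is an arbitrary illustrative choice, not from the slides):

```python
# Margin width 2/||w|| for a separating line w1*x1 + w2*x2 + b = 0.
import math

def margin_width(w):
    """Distance between the two margin boundaries w.x + b = +1 and w.x + b = -1."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))

print(margin_width([3.0, 4.0]))  # 2/5 = 0.4
```

Scaling w down widens the margin, which is why maximizing the margin and minimizing ||w||²/2 are the same problem.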
Linear SVM: Separable Case

The optimization problem for SVM is:

    minimize L(w) = ||w||² / 2

subject to:

    w1xi1 + w2xi2 + b ≥ +1   for i ∈ ∆
    w1xj1 + w2xj2 + b ≤ −1   for j ∈ ●

(Maximizing the margin 2/||w|| is equivalent to minimizing ||w||²/2.)
Linear SVM: Separable Case

“Support vectors” are those points that lie on the boundaries of the margin.

The decision surface (line) is determined only by the support vectors. All other points are irrelevant.

[Scatter plot: the support vectors highlighted on the two margin boundaries]
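Because only the support vectors matter, even a naive search recovers them. A minimal sketch with a made-up four-point dataset, using a coarse brute-force grid in place of a real quadratic-programming solver:

```python
# Brute-force illustration: search a coarse grid of lines w.x + b = 0 for
# the one with the largest margin that satisfies the separation
# constraints, then read off the support vectors. Data points are made up.
import itertools, math

pos = [(2.0, 2.0), (3.0, 2.0)]   # class ∆ (label +1), illustrative
neg = [(0.0, 0.0), (0.0, 1.0)]   # class ● (label -1), illustrative

best = None
steps = [i / 10 for i in range(-20, 21)]
for w1, w2, b in itertools.product(steps, steps, steps):
    norm = math.hypot(w1, w2)
    if norm == 0:
        continue
    ok = all(w1 * x + w2 * y + b >= 1 - 1e-9 for x, y in pos) and \
         all(w1 * x + w2 * y + b <= -1 + 1e-9 for x, y in neg)
    if ok and (best is None or 2 / norm > best[0]):
        best = (2 / norm, (w1, w2, b))

margin, (w1, w2, b) = best
# Support vectors are the points lying on the margin boundaries.
svs = [p for p in pos if abs(w1 * p[0] + w2 * p[1] + b - 1) < 1e-6] + \
      [p for p in neg if abs(w1 * p[0] + w2 * p[1] + b + 1) < 1e-6]
print(margin, svs)  # margin ≈ 2.236 (= √5); support vectors (2,2) and (0,1)
```

Note that (3, 2) and (0, 0) play no role: deleting them would leave the same maximal-margin line.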
Linear SVM: Nonseparable Case

Non-separable case: there is no line separating the two groups without error.

Training set: 1000 targeted customers.

Here, SVM minimize L(w, C):

    L(w, C) = ||w||²/2 + C Σi ξi

    ||w||²/2 maximizes the margin; C Σi ξi minimizes the training errors:
    L(w, C) = Complexity + Errors

subject to:

    w1xi1 + w2xi2 + b ≥ +1 − ξi   for i ∈ ∆
    w1xj1 + w2xj2 + b ≤ −1 + ξj   for j ∈ ●
    ξi, ξj ≥ 0

[Scatter plot: overlapping ∆ and ● groups, with slack variables ξ for the points inside or beyond the margin]
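The objective L(w, C) is straightforward to evaluate for any candidate line: each slack ξ is how far a point falls on the wrong side of its margin boundary. A minimal sketch (data, w and b below are illustrative, not from the slides):

```python
# L(w, C) = ||w||^2 / 2 + C * sum(xi_i): complexity plus training errors.

def soft_margin_objective(w, b, data, C):
    """data: list of ((x1, x2), y) with y in {-1, +1} (∆ = +1, ● = -1)."""
    complexity = sum(wi * wi for wi in w) / 2
    # Slack xi_i = max(0, 1 - y*(w.x + b)): zero for points on the correct
    # side of their margin boundary, positive otherwise.
    slacks = sum(max(0.0, 1 - y * (w[0] * x1 + w[1] * x2 + b))
                 for (x1, x2), y in data)
    return complexity + C * slacks

# Made-up data: the last point sits well inside the ● side despite y = +1.
data = [((2.0, 2.0), 1), ((0.0, 0.0), -1), ((0.5, 0.5), 1)]
w, b = [0.8, 0.4], -1.4
print(soft_margin_objective(w, b, data, C=1.0))  # 0.4 + 1.8 = 2.2
```

Doubling C doubles the error term only, which is how C trades complexity against training errors.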
Linear SVM: The Role of C

[Two scatter plots of the same data: a separating line with a thin margin (C = 5) and one with a wide margin (C = 1)]

- Bigger C: increased complexity (thinner margin), smaller number of errors (better fit on the data)
- Smaller C: decreased complexity (wider margin), bigger number of errors (worse fit on the data)

Vary both complexity and empirical error via C … by affecting the optimal w and the optimal number of training errors.
Bias – Variance trade-off

[Figure: bias-variance trade-off]
From Regression into Classification

We have a linear model, such as

    y = b * x + const

We have to estimate this relation using our training data set, having in mind the so-called “accuracy”, or “0-1”, loss function (our evaluation criterion).

The training data set we have consists of only a few observations, for instance:

    Training data:
    Output (y)    Input (x)
        -1           0.2
         1           0.5
         1           0.7
         …            …
        -1          -0.7
From Regression into Classification

The same linear model y = b * x + const and training data, shown graphically:

[Figure: the data points with y = +1 and y = −1 plotted against x; the fitted line crosses a “margin” around y = 0 whose boundaries pass through the support vectors]
From Regression into Classification: Support Vector Machines

- flatter line ⇔ greater penalization
- equivalently: smaller slope ⇔ bigger margin

[Figure: the line y = b * x + const with its “margin” between the values −1 and +1]
From Regression into Classification: Support Vector Machines

    y = b1 * x1 + b2 * x2 + const

- flatter plane ⇔ greater penalization
- equivalently: smaller slope ⇔ bigger margin

[Figure: two views of the plane y = b1 * x1 + b2 * x2 + const and its “margin” in (x1, x2) space]
Nonlinear SVM: Nonseparable Case

Mapping into a higher-dimensional space: each observation ( xi1, xi2 ), i = 1, …, l, is mapped to

    ( xi1, xi2 )  →  ( xi1², √2·xi1·xi2, xi2² )

[Scatter plot: ∆ and ● groups that are not linearly separable in the original (x1, x2) space]

Optimization task: minimize L(w, C)

    L(w, C) = ||w||²/2 + C Σi ξi

subject to:

    w1·xi1² + w2·√2·xi1·xi2 + w3·xi2² + b ≥ +1 − ξi   for i ∈ ∆
    w1·xj1² + w2·√2·xj1·xj2 + w3·xj2² + b ≤ −1 + ξj   for j ∈ ●
Nonlinear SVM: Nonseparable Case

Map the data into a higher-dimensional space (ℝ² → ℝ³):

    ( x1, x2 )   →  ( x1², √2·x1·x2, x2² )

    ( 1, 1 )     →  ( 1,  √2, 1 )
    ( −1, 1 )    →  ( 1, −√2, 1 )
    ( 1, −1 )    →  ( 1, −√2, 1 )
    ( −1, −1 )   →  ( 1,  √2, 1 )

[Figure: the XOR-type points (±1, ±1) in the (x1, x2) plane and their images in the (x1², √2·x1·x2, x2²) space]
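The map above is chosen so that inner products in the new space reduce to a simple kernel in the original space: φ(x)·φ(z) = (x·z)². A minimal check on the four XOR-type points from the slide:

```python
# Verify that the inner product after the map
# (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2) equals (x.z)^2 in the original space.
import math

def phi(x):
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

points = [(1, 1), (-1, 1), (1, -1), (-1, -1)]
for x in points:
    for z in points:
        assert abs(dot(phi(x), phi(z)) - dot(x, z) ** 2) < 1e-12

print([phi(p) for p in points])  # (1, ±√2, 1), as on the slide
```

This identity is what lets the dual formulation below avoid computing φ explicitly.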
Nonlinear SVM: Nonseparable Case

Find the optimal hyperplane in the transformed space.

[Figure: the same mapped points, now linearly separable in the (x1², √2·x1·x2, x2²) space, with the separating plane]
Nonlinear SVM: Nonseparable Case

Observe the decision surface in the original space (optional).

[Figure: the separating plane in the transformed space, mapped back to a nonlinear decision surface in the original (x1, x2) space]
Nonlinear SVM: Nonseparable Case

Dual formulation of the (primal) SVM minimization problem:

Primal:

    min (over w, b, ξ)   ||w||²/2 + C Σi ξi

    subject to:
        yi ( w·xi + b ) ≥ 1 − ξi
        ξi ≥ 0
        yi = ±1

Dual:

    max (over α)   Σi αi − ½ Σi Σj αi αj yi yj ( xi·xj )

    subject to:
        0 ≤ αi ≤ C
        Σi αi yi = 0
        yi = ±1
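The dual objective depends on the data only through the inner products xi·xj, which is what makes the kernel substitution on the next slides possible. A minimal sketch that evaluates it for two made-up points (the alphas chosen here happen to be the maximizer for this pair, and recover w = Σi αi yi xi = (0.8, 0.4)):

```python
# Evaluate sum(a_i) - 1/2 * sum_ij a_i a_j y_i y_j (x_i . x_j)
# and check the dual constraints, for illustrative alphas on toy data.

def dual_objective(alphas, ys, xs):
    dot = lambda a, b: sum(u * v for u, v in zip(a, b))
    lin = sum(alphas)
    quad = sum(alphas[i] * alphas[j] * ys[i] * ys[j] * dot(xs[i], xs[j])
               for i in range(len(xs)) for j in range(len(xs)))
    return lin - 0.5 * quad

xs = [(2.0, 2.0), (0.0, 1.0)]   # one ∆ point, one ● point (made up)
ys = [1, -1]
C = 10.0
alphas = [0.4, 0.4]             # satisfies 0 <= a_i <= C and sum(a_i y_i) = 0

assert abs(sum(a * y for a, y in zip(alphas, ys))) < 1e-12
assert all(0 <= a <= C for a in alphas)
print(dual_objective(alphas, ys, xs))  # 0.8 - 0.4 = 0.4
```

Nudging either alpha up or down lowers the objective, consistent with α = (0.4, 0.4) being optimal for this toy pair.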
Nonlinear SVM: Nonseparable Case

Dual formulation of the (primal) SVM minimization problem: after the mapping, the inner product in the transformed space can be computed in the original space,

    φ(xi)·φ(xj) = ( xi1², √2·xi1·xi2, xi2² ) · ( xj1², √2·xj1·xj2, xj2² )
                = ( xi1·xj1 + xi2·xj2 )²
                = ( xi·xj )²

    K(xi, xj) = φ(xi)·φ(xj)    (kernel function)

Dual:

    max (over α)   Σi αi − ½ Σi Σj αi αj yi yj ( xi·xj )

    subject to:
        0 ≤ αi ≤ C
        Σi αi yi = 0
        yi = ±1
Nonlinear SVM: Nonseparable Case

Dual formulation in the transformed space: replace the inner product with the kernel,

    max (over α)   Σi αi − ½ Σi Σj αi αj yi yj φ(xi)·φ(xj)
                 = Σi αi − ½ Σi Σj αi αj yi yj K(xi, xj)

    K(xi, xj) = φ(xi)·φ(xj)    (kernel function)

    subject to:
        0 ≤ αi ≤ C
        Σi αi yi = 0
        yi = ±1
Strengths and Weaknesses of SVM

Strengths of SVM:
- Training is relatively easy
- No local minima
- It scales relatively well to high-dimensional data
- The trade-off between classifier complexity and error can be controlled explicitly via C
- Robustness of the results
- The “curse of dimensionality” is avoided

Weaknesses of SVM:
- What is the best trade-off parameter C?
- Need a good transformation of the original space
The Ketchup Marketing Problem

Two types of ketchup: Heinz and Hunts

Seven attributes:
- Feature Heinz
- Feature Hunts
- Display Heinz
- Display Hunts
- Feature & Display Heinz
- Feature & Display Hunts
- Log price difference between Heinz and Hunts

Training data: 2498 cases (89.11% Heinz is chosen)
Test data: 300 cases (88.33% Heinz is chosen)
The Ketchup Marketing Problem

Choose a kernel mapping:

    Linear kernel:      K(xi, xj) = xi·xj
    Polynomial kernel:  K(xi, xj) = ( xi·xj + 1 )^d
    RBF kernel:         K(xi, xj) = exp( −||xi − xj||² / 2σ² )

Do a (5-fold) cross-validation procedure to find the best combination of the manually adjustable parameters (here: C and σ), scanning each parameter over a grid from its min to its max and comparing the cross-validation mean squared errors (SVM with RBF kernel).
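The grid search itself is simple to sketch. In the code below, `cv_error` is a stand-in for the 5-fold cross-validation mean squared error of an RBF-kernel SVM: a placeholder surface with a known minimum, purely to make the skeleton runnable.

```python
# Grid-search skeleton for (C, sigma): evaluate a cross-validation error at
# every grid point and keep the minimizer.
import itertools, math

def cv_error(C, sigma):
    # Placeholder for the 5-fold CV mean squared error of an RBF-kernel SVM;
    # this made-up surface has its minimum at C = 10, sigma = 1.
    return (math.log10(C) - 1) ** 2 + math.log10(sigma) ** 2

C_grid = [0.1, 1, 10, 100]      # from min to max
sigma_grid = [0.1, 1, 10]       # from min to max

best = min(itertools.product(C_grid, sigma_grid),
           key=lambda p: cv_error(*p))
print(best)  # (10, 1) for this placeholder surface
```

In a real run, each `cv_error` call would train five SVMs (one per fold), so the grid is usually kept coarse and refined around the best cell.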
The Ketchup Marketing Problem – Training Set

Model: Linear Discriminant Analysis

                         Predicted Group Membership
    Original             Hunts      Heinz      Total
    Count    Hunts          68        204        272
             Heinz          58       2168       2226
    %        Hunts      25.00%     75.00%    100.00%
             Heinz       2.61%     97.39%    100.00%

    Hit Rate: 89.51%
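The hit rate in these tables is just the share of correctly classified cases, i.e. the diagonal counts over the total. A minimal check against the table above and the SVM and majority-voting tables that follow:

```python
# Hit rate = share of correctly classified cases, from the confusion counts
# in the training-set tables (2498 cases in total).

def hit_rate(correct_hunts, hunts_total, correct_heinz, heinz_total):
    return 100 * (correct_hunts + correct_heinz) / (hunts_total + heinz_total)

lda = hit_rate(68, 272, 2168, 2226)
print(round(lda, 2))  # 89.51, as in the table
```

The same function reproduces the 99.08% (SVM) and 89.11% (majority voting) training-set figures from their confusion counts.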
The Ketchup Marketing Problem – Training Set

Model: Logit Choice Model

                         Predicted Group Membership
    Original             Hunts      Heinz      Total
    Count    Hunts         214         58        272
             Heinz         497       1729       2226
    %        Hunts      78.68%     21.32%    100.00%
             Heinz      22.33%     77.67%    100.00%

    Hit Rate: 77.79%
The Ketchup Marketing Problem – Training Set

Model: Support Vector Machines

                         Predicted Group Membership
    Original             Hunts      Heinz      Total
    Count    Hunts         255         17        272
             Heinz           6       2220       2226
    %        Hunts      93.75%      6.25%    100.00%
             Heinz       0.27%     99.73%    100.00%

    Hit Rate: 99.08%
The Ketchup Marketing Problem – Training Set

Model: Majority Voting

                         Predicted Group Membership
    Original             Hunts      Heinz      Total
    Count    Hunts           0        272        272
             Heinz           0       2226       2226
    %        Hunts          0%       100%    100.00%
             Heinz          0%       100%    100.00%

    Hit Rate: 89.11%
The Ketchup Marketing Problem – Test Set

Model: Linear Discriminant Analysis

                         Predicted Group Membership
    Original             Hunts      Heinz      Total
    Count    Hunts           3         32         35
             Heinz           3        262        265
    %        Hunts       8.57%     91.43%    100.00%
             Heinz       1.13%     98.87%    100.00%

    Hit Rate: 88.33%
The Ketchup Marketing Problem – Test Set

Model: Logit Choice Model

                         Predicted Group Membership
    Original             Hunts      Heinz      Total
    Count    Hunts          29          6         35
             Heinz          63        202        265
    %        Hunts      82.86%     17.14%    100.00%
             Heinz      23.77%     76.23%    100.00%

    Hit Rate: 77.00%
The Ketchup Marketing Problem – Test Set

Model: Support Vector Machines

                         Predicted Group Membership
    Original             Hunts      Heinz      Total
    Count    Hunts          25         10         35
             Heinz           3        262        265
    %        Hunts      71.43%     28.57%    100.00%
             Heinz       1.13%     98.87%    100.00%

    Hit Rate: 95.67%
Part II: Penalized classification and regression methods

- Support Hyperplanes
- Nearest Convex Hull classifier
- Soft Nearest Neighbor
- Application: an example Support Vector Regression financial study
- Conclusion
Classification: Support Hyperplanes

Consider a (separable) binary classification case: training data (+, −) and a test point x.

There are infinitely many hyperplanes that are semi-consistent (= commit no error) with the training data.

[Figure: training points of the two classes, a test point x, and several semi-consistent hyperplanes]
Classification: Support Hyperplanes

For the classification of the test point x, use the farthest-away hyperplane that is semi-consistent with the training data.

The SH decision surface: each point on it has 2 support hyperplanes.

[Figure: the support hyperplane of x and the SH decision surface]
Classification: Support Hyperplanes

Toy problem experiment with Support Hyperplanes and Support Vector Machines:

[Six panels comparing SH and SVM decision boundaries: linear kernel, RBF kernel with σ = 5, and RBF kernel with σ = 35]
Classification: Support Vector Machines and Support Hyperplanes
[Two panels: the same data classified by Support Vector Machines and by Support Hyperplanes]

Classification: Support Vector Machines and Nearest Convex Hull classification
[Two panels: the same data classified by Support Vector Machines and by Nearest Convex Hull classification]

Classification: Support Vector Machines and Soft Nearest Neighbor
[Two panels: the same data classified by Support Vector Machines and by Soft Nearest Neighbor]

Classification: Support Hyperplanes
[Two panels: Support Hyperplanes with smaller and bigger penalization]

Classification: Nearest Convex Hull classification
[Two panels: Nearest Convex Hull classification with smaller and bigger penalization]

Classification: Soft Nearest Neighbor
[Two panels: Soft Nearest Neighbor with smaller and bigger penalization]

Classification: Support Vector Machines, Nonseparable Case
[Figure: Support Vector Machines on nonseparable data]

Classification: Support Hyperplanes, Nonseparable Case
[Figure: Support Hyperplanes on nonseparable data]

Classification: Nearest Convex Hull classification, Nonseparable Case
[Figure: Nearest Convex Hull classification on nonseparable data]

Classification: Soft Nearest Neighbor, Nonseparable Case
[Figure: Soft Nearest Neighbor on nonseparable data]
•51/36
Summary: Penalization Techniques for Classification
•Penalization methods for classification: Support Vector Machines (SVM), Support Hyperplanes (SH), Nearest
Convex Hull classification (NCH), and Soft Nearest Neighbour (SNN). In all cases, the classificarion of test point x is
dete4rmined using the hyperplane h. Equivalently, x is labelled +1 (-1) if it is farther away from set S_ (S+).
Conclusion

- Support Vector Machines (SVM) can be applied to binary and multi-class classification problems
- SVM behave robustly in multivariate problems
- Further research in various Marketing areas is needed to justify or refute the applicability of SVM
- Support Vector Regressions (SVR) can also be applied

http://www.kernel-machines.org
Email: [email protected]