### svm

#### Support Vector Machine
Rong Jin (slides adapted from Andrew W. Moore)
#### Linear Classifiers
• How do we find a linear decision boundary that separates the data points of the two classes (+1 and -1)?
• Any of several separating boundaries would be fine... but which is best?
#### Classifier Margin
• Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a data point.
#### Maximum Margin
• The maximum margin linear classifier is the linear classifier with the maximum margin.
• This is the simplest kind of SVM (called a linear SVM).
#### Why Maximum Margin?
1. If we've made a small error in the location of the boundary (it has been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
2. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
3. Empirically it works very, very well.
#### Estimate the Margin
• What is the distance from a point x to the line w·x + b = 0?
• What is the classification margin for the classifier w·x + b = 0?
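The answers appeared on the slides only as images; the standard expressions they correspond to (a reconstruction, not verbatim) are:

$$ d(\mathbf{x}) = \frac{|\mathbf{w}\cdot\mathbf{x} + b|}{\|\mathbf{w}\|} \qquad\text{(distance from a point } \mathbf{x} \text{ to the plane } \mathbf{w}\cdot\mathbf{x} + b = 0\text{)} $$

$$ \gamma = \min_{i}\; \frac{y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b)}{\|\mathbf{w}\|} \qquad\text{(classification margin: the smallest such distance over the training set)} $$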
#### Maximize the Classification Margin
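The objective on these slides was shown only as an image; a standard hard-margin formulation consistent with them (a reconstruction, not verbatim) is:

$$ \min_{\mathbf{w},\,b}\;\; \tfrac{1}{2}\|\mathbf{w}\|^{2} \qquad\text{subject to}\qquad y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \;\ge\; 1, \quad i = 1, \dots, n. $$

The margin width is 2/‖w‖, so maximizing the margin is equivalent to minimizing ‖w‖²/2 under these constraints.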
#### Maximum Margin as a Quadratic Program
• The optimization has a quadratic objective with linear equality and inequality constraints.
• This is a well-studied problem in OR (quadratic programming).

Find

$$ \arg\min_{\mathbf{u}}\;\; c + \mathbf{d}^{T}\mathbf{u} + \frac{\mathbf{u}^{T} R\,\mathbf{u}}{2} $$

subject to n inequality constraints

$$ a_{11}u_1 + a_{12}u_2 + \dots + a_{1m}u_m \le b_1, \;\;\dots,\;\; a_{n1}u_1 + a_{n2}u_2 + \dots + a_{nm}u_m \le b_n, $$

and subject to e additional equality constraints

$$ a_{(n+1)1}u_1 + a_{(n+1)2}u_2 + \dots + a_{(n+1)m}u_m = b_{n+1}, \;\;\dots,\;\; a_{(n+e)1}u_1 + a_{(n+e)2}u_2 + \dots + a_{(n+e)m}u_m = b_{n+e}. $$
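To connect the SVM to this generic quadratic-programming template (my own mapping, not spelled out on the slide): take

$$ \mathbf{u} = (\mathbf{w}, b), \qquad R = \begin{pmatrix} I & 0 \\ 0 & 0 \end{pmatrix}, \qquad \mathbf{d} = \mathbf{0}, \qquad c = 0, $$

with one inequality constraint per training point, $-y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \le -1$, and no equality constraints.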
#### Linearly Inseparable Case
• This is going to be a problem! What should we do?
• Relax the constraints.
• Penalize the relaxation.
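A hedged reconstruction of the soft-margin program these bullets lead to (the slide's own equations were images; the points labeled 1, 2, 3 in its figure are the margin violators): introduce a slack variable ξ_i per point and penalize the total slack with a constant C:

$$ \min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\;\; \tfrac{1}{2}\|\mathbf{w}\|^{2} + C\sum_{i=1}^{n}\xi_i \qquad\text{s.t.}\qquad y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0. $$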
#### Linearly Inseparable Case: SVM vs. Regularized Logistic Regression
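The two objectives were compared side by side as images; the standard forms they correspond to (a reconstruction) differ only in the loss applied to each training point:

$$ \text{SVM (hinge loss):}\quad \min_{\mathbf{w},\,b}\; C\sum_{i}\max\bigl(0,\; 1 - y_i(\mathbf{w}\cdot\mathbf{x}_i + b)\bigr) + \tfrac{1}{2}\|\mathbf{w}\|^{2} $$

$$ \text{Regularized logistic regression:}\quad \min_{\mathbf{w},\,b}\; C\sum_{i}\log\bigl(1 + e^{-y_i(\mathbf{w}\cdot\mathbf{x}_i + b)}\bigr) + \tfrac{1}{2}\|\mathbf{w}\|^{2} $$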
#### Dual Form of SVM
• How do we decide b?
• (Figure: the margin planes w·x + b = +1 and w·x + b = −1 on either side of the boundary, with the normal vector w.)
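The dual program itself was an image; its standard form (a reconstruction) and the usual way b is recovered are:

$$ \max_{\boldsymbol{\alpha}}\;\; \sum_{i=1}^{n}\alpha_i \;-\; \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\, y_i y_j\, \mathbf{x}_i\cdot\mathbf{x}_j \qquad\text{s.t.}\qquad \sum_{i}\alpha_i y_i = 0, \quad 0 \le \alpha_i \le C. $$

$$ \mathbf{w} = \sum_{i}\alpha_i y_i\,\mathbf{x}_i, \qquad b = y_k - \mathbf{w}\cdot\mathbf{x}_k \;\;\text{ for any support vector } k \text{ with } 0 < \alpha_k < C. $$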
#### Suppose We're in 1 Dimension
• What would SVMs do with this data (points along a line, with x = 0 marked)?
• Not a big surprise: a single threshold, with a positive "plane" on one side and a negative "plane" on the other.
#### Harder 1-Dimensional Dataset
• What if the classes are arranged along the axis so that no single threshold on x separates them?
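The resolution shown graphically on the following slides is to lift each point into a higher-dimensional feature space; a standard choice for this 1-d example (an assumption here, since the slide's own mapping was an image) is

$$ z_k = (x_k,\; x_k^{2}), $$

after which the two classes become linearly separable and a maximum-margin plane exists.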
#### Common SVM Basis Functions
• Polynomial terms of x of degree 1 to q. For q = 2 the explicit feature map is:

$$
\Phi(\mathbf{x}) = \Bigl(\;1,\;\;
\sqrt{2}\,x_1,\;\sqrt{2}\,x_2,\;\dots,\;\sqrt{2}\,x_m,\;\;
x_1^{2},\;x_2^{2},\;\dots,\;x_m^{2},\;\;
\sqrt{2}\,x_1 x_2,\;\sqrt{2}\,x_1 x_3,\;\dots,\;\sqrt{2}\,x_1 x_m,\;\;
\sqrt{2}\,x_2 x_3,\;\dots,\;\sqrt{2}\,x_{m-1} x_m\;\Bigr)^{T}
$$

• The entries are: the constant term, the linear terms √2·x_i, the pure quadratic terms x_i², and the quadratic cross-terms √2·x_i x_j (i < j).
• Number of terms (assuming m input dimensions) = (m+2)-choose-2 = (m+2)(m+1)/2 ≈ m²/2.
#### Dual Form of SVM (with Basis Functions)
• In the dual, the data enter only through dot products, so the basis expansion appears only as Φ(a)·Φ(b).
#### Quadratic Dot Products
With the quadratic basis functions above, the dot product of two mapped points expands as:

$$
\Phi(\mathbf{a})\cdot\Phi(\mathbf{b})
= 1
+ 2\sum_{i=1}^{m} a_i b_i
+ \sum_{i=1}^{m} a_i^{2} b_i^{2}
+ \sum_{i=1}^{m}\sum_{j=i+1}^{m} 2\, a_i a_j\, b_i b_j
$$
#### Quadratic Dot Products (continued)
Just out of casual, innocent interest, let's look at another function of a and b:

$$
(\mathbf{a}\cdot\mathbf{b} + 1)^{2}
= (\mathbf{a}\cdot\mathbf{b})^{2} + 2\,\mathbf{a}\cdot\mathbf{b} + 1
= \Bigl(\sum_{i=1}^{m} a_i b_i\Bigr)^{2} + 2\sum_{i=1}^{m} a_i b_i + 1
$$

$$
= \sum_{i=1}^{m}\sum_{j=1}^{m} a_i b_i\, a_j b_j + 2\sum_{i=1}^{m} a_i b_i + 1
= \sum_{i=1}^{m} a_i^{2} b_i^{2} + 2\sum_{i=1}^{m}\sum_{j=i+1}^{m} a_i a_j\, b_i b_j + 2\sum_{i=1}^{m} a_i b_i + 1
= \Phi(\mathbf{a})\cdot\Phi(\mathbf{b}).
$$
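A quick numeric sanity check of this identity (a minimal sketch, not from the slides; it assumes NumPy is available and uses an arbitrary 5-dimensional example):

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map from the slides:
    constant, sqrt(2)*linear, squared, and sqrt(2)*cross terms."""
    m = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(m) for j in range(i + 1, m)]
    return np.concatenate(([1.0], np.sqrt(2) * x, x ** 2, cross))

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)

lhs = phi(a) @ phi(b)        # explicit expansion: O(m^2) features
rhs = (a @ b + 1.0) ** 2     # kernel shortcut: one O(m) dot product
print(np.isclose(lhs, rhs))  # True
```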
#### Kernel Trick
• Replace every dot product in the dual SVM with a kernel function k(x, y) = Φ(x)·Φ(y).
#### SVM Kernel Functions
• Polynomial kernel function
• Radial basis kernel function (a universal kernel)
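The kernel formulas themselves were images on the slide; the standard forms (a reconstruction, with q and σ as the usual hyperparameters) are:

$$ K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}\cdot\mathbf{y} + 1)^{q} \qquad\text{(polynomial kernel of degree } q\text{)} $$

$$ K(\mathbf{x}, \mathbf{y}) = \exp\!\Bigl(-\frac{\|\mathbf{x}-\mathbf{y}\|^{2}}{2\sigma^{2}}\Bigr) \qquad\text{(radial basis function kernel)} $$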
#### Kernel Tricks
• Replacing the dot product with a kernel function.
• Not all functions are kernel functions.
• Are they kernel functions?
#### Kernel Tricks: Mercer's Condition
To expand a kernel function k(x, y) into a dot product, i.e. k(x, y) = Φ(x)·Φ(y), k(x, y) has to be a positive semi-definite function; that is, for any function f(x) whose ∫ f²(x) dx is finite, the following inequality holds:

$$ \iint f(\mathbf{x})\, k(\mathbf{x}, \mathbf{y})\, f(\mathbf{y})\; d\mathbf{x}\, d\mathbf{y} \;\ge\; 0. $$
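One practical way to spot-check a candidate kernel is to verify that the Gram matrices it produces are positive semi-definite; a minimal sketch for the RBF kernel (assuming NumPy; the data are arbitrary):

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

X = np.random.default_rng(1).normal(size=(50, 3))
K = rbf_gram(X)

# A valid (Mercer) kernel must give a symmetric PSD Gram matrix:
print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() >= -1e-10)  # True True
```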
#### Kernel Tricks: Benefits
• Introducing nonlinearity into the model.
• Computationally efficient.
#### Nonlinear Kernel (I) and (II)
(Worked examples shown only as figures on the original slides.)
#### Reproducing Kernel Hilbert Space (RKHS)
• Reproducing kernel Hilbert space H
• Eigen-decomposition of the kernel
• Elements of the space H
• Reproducing property
• Representer theorem
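The equations attached to each bullet above were images; the standard statements they usually refer to (a hedged reconstruction) are:

$$ k(\mathbf{x}, \mathbf{y}) = \sum_{i}\lambda_i\,\phi_i(\mathbf{x})\,\phi_i(\mathbf{y}) \qquad\text{(eigen-decomposition of the kernel)} $$

$$ f(\cdot) = \sum_{i}\gamma_i\,\phi_i(\cdot), \qquad \|f\|_{\mathcal{H}}^{2} = \sum_{i}\frac{\gamma_i^{2}}{\lambda_i} < \infty \qquad\text{(elements of } \mathcal{H}\text{)} $$

$$ \langle f,\, k(\mathbf{x}, \cdot)\rangle_{\mathcal{H}} = f(\mathbf{x}) \qquad\text{(reproducing property)} $$

$$ \min_{f\in\mathcal{H}}\;\sum_{i=1}^{n}\ell\bigl(y_i, f(\mathbf{x}_i)\bigr) + \lambda\|f\|_{\mathcal{H}}^{2} \;\;\Longrightarrow\;\; f^{*}(\mathbf{x}) = \sum_{i=1}^{n}\alpha_i\,k(\mathbf{x}_i, \mathbf{x}) \qquad\text{(representer theorem)} $$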
#### Kernelize Logistic Regression
• How can we introduce nonlinearity into the logistic regression model?
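A hedged sketch of the standard answer (the slide's own derivation was an image): by the representer theorem, replace the linear score w·x + b with a kernel expansion over the training points and learn the coefficients α_i instead of w:

$$ p(y \mid \mathbf{x}) = \frac{1}{1 + \exp\bigl(-y\, f(\mathbf{x})\bigr)}, \qquad f(\mathbf{x}) = \sum_{i=1}^{n}\alpha_i\, k(\mathbf{x}_i, \mathbf{x}) + b, $$

typically with a regularizer such as λ·αᵀKα added to the negative log-likelihood.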
#### Diffusion Kernel
• A kernel function describes the correlation or similarity between two data points.
• Suppose we are given a similarity function s(x, y) that is non-negative and symmetric but does not obey Mercer's condition.
• How can we generate a kernel function based on this similarity function? A graph-theory approach follows.
#### Diffusion Kernel: Graph Construction
• Create a graph over all the training examples: each vertex corresponds to a data point, and the weight of each edge is the similarity s(x, y).
• Graph Laplacian: negative semi-definite.
#### Diffusion Kernel: From Laplacian to Kernel
• Consider a simple Laplacian L; then consider L², L³, ... What do these matrices represent? Each power propagates similarity one more step through the graph.
• Combining the powers gives a diffusion kernel (see the reconstruction below).
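A hedged reconstruction of the definitions behind these bullets (the slide's matrices were images). Note that the Laplacian here is the negative of the more common convention, which is why the slides call it negative semi-definite:

$$ L_{ij} = \begin{cases} s(\mathbf{x}_i, \mathbf{x}_j) & i \ne j \\[2pt] -\sum_{k\ne i} s(\mathbf{x}_i, \mathbf{x}_k) & i = j \end{cases} \qquad\qquad K = e^{\beta L} = \sum_{k=0}^{\infty}\frac{\beta^{k}L^{k}}{k!} \quad\text{(diffusion kernel)} $$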
#### Diffusion Kernel: Properties
• Positive definite.
• How do we compute the diffusion kernel? (One way is sketched below.)
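A minimal sketch of one way to compute K = exp(βL), assuming NumPy and a symmetric, non-negative similarity matrix S (the toy data below are hypothetical):

```python
import numpy as np

def diffusion_kernel(S, beta=0.5):
    """Diffusion kernel K = exp(beta * L) built from a symmetric,
    non-negative similarity matrix S with zero diagonal."""
    # Laplacian in the slides' sign convention: off-diagonal = similarity,
    # diagonal = -(row sum), which makes L negative semi-definite.
    L = S - np.diag(S.sum(axis=1))
    # Matrix exponential via eigen-decomposition (L is symmetric).
    eigvals, eigvecs = np.linalg.eigh(L)
    return eigvecs @ np.diag(np.exp(beta * eigvals)) @ eigvecs.T

# Tiny hypothetical similarity graph over 3 points:
S = np.array([[0.0, 0.8, 0.1],
              [0.8, 0.0, 0.5],
              [0.1, 0.5, 0.0]])
K = diffusion_kernel(S)
print(np.linalg.eigvalsh(K).min() > 0)  # True: K is positive definite
```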
#### Doing Multi-class Classification
• SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2). What can be done?
• Answer: with output arity N, learn N SVMs:
  SVM 1 learns "Output == 1" vs "Output != 1"
  SVM 2 learns "Output == 2" vs "Output != 2"
  :
  SVM N learns "Output == N" vs "Output != N"
• Then, to predict the output for a new input, predict with each SVM and choose the one that puts the prediction furthest into the positive region (a sketch follows this list).
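A minimal sketch of this one-vs-rest scheme, assuming scikit-learn is available (the dataset, classes, and parameters below are placeholders, not from the slides):

```python
import numpy as np
from sklearn.svm import SVC

def train_one_vs_rest(X, y, classes, **svm_params):
    """Train one binary SVM per class: class k vs. everything else."""
    return {k: SVC(**svm_params).fit(X, np.where(y == k, 1, -1)) for k in classes}

def predict_one_vs_rest(models, X):
    """Pick the class whose SVM pushes the point furthest into its positive region."""
    classes = list(models)
    scores = np.column_stack([models[k].decision_function(X) for k in classes])
    return np.array([classes[i] for i in scores.argmax(axis=1)])

# Hypothetical toy data with three well-separated classes:
rng = np.random.default_rng(2)
X = rng.normal(size=(90, 2)) + np.repeat([[0, 0], [4, 0], [0, 4]], 30, axis=0)
y = np.repeat([0, 1, 2], 30)

models = train_one_vs_rest(X, y, classes=[0, 1, 2], kernel="rbf", C=1.0)
print((predict_one_vs_rest(models, X) == y).mean())  # training accuracy, close to 1.0
```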
#### Kernel Learning
• What if we have multiple kernel functions? Which one is the best? How can we combine multiple kernels?
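One common answer (a hedged aside, not taken from the slide) is multiple kernel learning: restrict attention to a convex combination of base kernels and optimize the weights jointly with the SVM:

$$ K(\mathbf{x}, \mathbf{y}) = \sum_{t=1}^{T}\mu_t\, K_t(\mathbf{x}, \mathbf{y}), \qquad \mu_t \ge 0, \quad \sum_{t}\mu_t = 1. $$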
#### References
• An excellent tutorial on VC-dimension and support vector machines: C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):955-974, 1998. http://citeseer.nj.nec.com/burges98tutorial.html
• The VC/SRM/SVM bible (not for beginners, including myself): Vladimir Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
• Software: SVM-light