Neural Networks and SVM

Stat 600
Neural Networks
• History: started in the 50s and peaked in the 90s
• Idea: learning the way the brain does.
• Numerous applications
– Handwriting, face, speech recognition
– Vehicles that drive themselves
– Models of reading, sentence production, dreaming
Non-linear Regression
• At the end, this is a non-linear regression problem.
Let us consider our usual data set:
Y (response, numerical or categorical)
X1, …, Xp (the predictors)
• In the linear model we model Y as:
Y = Xβ + e
• In a neural network we instead say that Y is a function of derived features H:
Y = g(Hβ) + e, where H = f(Xα)
• So essentially the model is non-linear because of the functions g and f.
• The function g is generally chosen as the logistic (inverse-logit) transform, g(z) = [1 + e^{-z}]^{-1}.
Model Form

1
Y i  1  exp(  H i b )]
m 1
1
 [1  exp(  b 0   b j [1  exp(  X ia j )] ]]
'
j 1
1
 ei
If g(Z)  Z then (linear function)
E ( Y i )  b 0  b 1 H i 1  ...  b m  1 H im  1
H ij  a
j0
a
j1 X i1
 ...  a
m
E ( Yi )  ( b 0   b ja
j 1

j0

j1 X i1
 ...  
jp  1 X ip  1
m
j 0 )  (  b ja
j 1
jp  1 X ip  1
m
j 1 ) X i 1  ...  (  b j a
j 1
jp  1 )
X ip  1
Parameter Estimation
• In order to control the level of overfitting we use penalized least squares,
which penalizes overfitting with a ridge-regression-style squared-error penalty.
• The penalty is imposed NOT on the number of parameters but on the
MAGNITUDE of the parameters. The criterion is given by:
$$
Q = \sum_{i=1}^{n}\big[Y_i - f(X_i,\beta,\alpha_1,\ldots,\alpha_{m-1})\big]^2
\;+\; \lambda\Big[\sum_{i=0}^{m-1}\beta_i^2 \;+\; \sum_{i=0}^{m-1}\sum_{j=1}^{p-1}\alpha_{ij}^2\Big]
$$

where λ ≥ 0 controls the amount of shrinkage (the decay argument in nnet).
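A minimal sketch of evaluating this criterion, written in Python for concreteness (logistic activations, made-up data and weights; λ plays the role of nnet's decay):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def penalized_Q(Y, X, beta, alpha, lam):
    """Squared-error fit plus a ridge-style penalty on the magnitude
    (not the number) of all network weights."""
    H = sigmoid(alpha[:, 0] + X @ alpha[:, 1:].T)   # hidden-unit outputs
    fitted = sigmoid(beta[0] + H @ beta[1:])        # network output f(X_i, beta, alpha)
    sse = np.sum((Y - fitted) ** 2)
    penalty = lam * (np.sum(beta ** 2) + np.sum(alpha ** 2))
    return sse + penalty

# Tiny made-up example: 3 observations, 2 inputs, 2 hidden units
X = np.array([[0.1, 0.2], [0.4, -0.3], [-0.5, 0.6]])
Y = np.array([0.2, 0.8, 0.5])
beta = np.array([0.1, 0.5, -0.4])
alpha = np.array([[0.2, 0.3, -0.1], [-0.3, 0.1, 0.4]])

# lam = 0 gives the plain SSE; larger lam penalizes large weights harder
assert penalized_Q(Y, X, beta, alpha, 0.2) > penalized_Q(Y, X, beta, alpha, 0.0)
```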
R code for nnet
# neural networks
library(nnet)
nnetmodel = nnet(class ~ ., data = train.all, size = 8, decay = 0.2,
                 linout = FALSE, entropy = TRUE)
nnetmodel
nnetpred1 = predict(nnetmodel, newdata = train.all, type = "class")
nnetpred2 = predict(nnetmodel, newdata = test.all, type = "raw")
table(nnetpred1, train.all$class)   # confusion table on the training data
# plot the fitted network
library(devtools)
source_url('https://gist.github.com/fawda123/7471137/raw/c720af2cea5f312717f020a09946800d55b8f45b/nnet_plot_update.r')
plot.nnet(nnetmodel)
Example: Apple data
[Figure: fitted network for the apple data. Input nodes I1 to I12 (temperature max/min, RH, DP, Stemp, Latitude, Longitude, Elevation, gdd, year, date), bias nodes B1 and B2, hidden nodes H1 to H8, and a single output node O1 (class).]
Fitting Neural Networks
• Generally the gradient descent method is used to fit the models where:
( r  1)
b pm
a
( r  1)
jm
n
Qi
i 1
 b pm
n
Qi
 b pm   r 
(r)
a
(r)
jm
r
i 1
(r)
a
(r)
jm
n
 b pm   r   ki z mi
(r)
i 1
a
(r)
jm
n
  r  s mi x ji
i 1
• The r is the learning rate taken as a constant and can be optimized by a
line search that minimizes error function at each update.
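A minimal sketch of the constant-learning-rate update, shown in Python for a single logistic output node on made-up data (a full network would also backpropagate through the hidden layer; the variable names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up data from a known single-node model: E(y) = sigmoid(1 + 2*x1 - x2)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = sigmoid(1.0 + X @ np.array([2.0, -1.0])) + rng.normal(scale=0.01, size=50)

b = np.zeros(3)      # weights started near zero, as the slides recommend
gamma = 0.02         # constant learning rate gamma_r

for _ in range(5000):
    p = sigmoid(b[0] + X @ b[1:])
    grad_z = -2.0 * (y - p) * p * (1.0 - p)   # dQ_i/d(linear predictor), Q = SSE
    b[0] -= gamma * grad_z.sum()              # updates summed over i, as on the slide
    b[1:] -= gamma * (X.T @ grad_z)

sse = np.sum((y - sigmoid(b[0] + X @ b[1:])) ** 2)
```

With a learning rate this small the iteration is stable; a line search would instead pick γ_r fresh at each step.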
Issues
• Starting values: pick weights close to zero to start the process.
• Overfitting: ridge or other penalties are used.
• Scaling inputs: it is a good idea to standardize the inputs before fitting.
• Number of hidden units: better to have too many than too few.
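The input-scaling advice is commonly implemented by standardizing each input to mean 0 and standard deviation 1, e.g. (Python sketch with made-up data):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))   # made-up raw inputs

# Standardize each column to mean 0, standard deviation 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

assert np.allclose(X_std.mean(axis=0), 0.0)
assert np.allclose(X_std.std(axis=0), 1.0)
```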
Support Vector Machines
• Highly flexible, powerful modeling methods
• Remember that in linear regression we seek parameter estimates that
minimize the SSE; a drawback is that outliers affect this minimization.
• In robust regression we use Huber weights to minimize the effect of
influential observations.
• SVM for regression uses a function similar to Huber's, but with a difference:
– In SVM, given a threshold set by the researcher, data points with residuals within the
threshold DO NOT contribute to the regression fit, while data points with absolute
residuals greater than the threshold contribute a linear-scale amount.
– Samples that fit the model well have NO effect on the regression.
– If the threshold is set high, ONLY the outliers affect the regression.
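The behavior these bullets describe is the ε-insensitive loss; a short Python sketch makes it concrete (the function and parameter names are mine):

```python
def eps_insensitive(r, eps):
    """Zero for residuals within the threshold, linear in |r| beyond it."""
    return max(0.0, abs(r) - eps)

assert eps_insensitive(0.3, eps=0.5) == 0.0   # well-fit point: no contribution
assert eps_insensitive(2.0, eps=0.5) == 1.5   # large residual: linear-scale contribution
```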
SVM Estimation
• To estimate the model parameters, SVM uses a user-specified loss function L_ε but also adds a penalty.
• The SVM coefficients minimize:

$$
\text{Cost}\sum_{i=1}^{n} L_\epsilon\big(y_i-\hat{y}_i\big) \;+\; \sum_{j=1}^{p}\beta_j^2
$$

• The cost penalty is specified by the user and penalizes LARGE residuals
(the opposite of ridge regression and nnet, which put the penalty on
large β's).
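A Python sketch of evaluating this criterion on made-up numbers (function names are mine):

```python
def eps_loss(r, eps):
    # epsilon-insensitive loss L_eps
    return max(0.0, abs(r) - eps)

def svm_objective(y, yhat, beta, cost, eps):
    """Cost times the summed eps-insensitive losses, plus the sum of
    squared coefficients -- the criterion on the slide."""
    return cost * sum(eps_loss(yi - fi, eps) for yi, fi in zip(y, yhat)) \
        + sum(b * b for b in beta)

y = [1.0, 2.0, 3.0]
yhat = [1.2, 1.0, 3.1]   # residuals: -0.2, 1.0, -0.1
beta = [0.5, -0.5]

# Only the residual of 1.0 exceeds eps = 0.5, contributing 0.5 to the loss
obj = svm_objective(y, yhat, beta, cost=2.0, eps=0.5)   # 2.0*0.5 + 0.5 = 1.5
```

Note how a larger cost makes residuals relatively more expensive than large coefficients, the reverse of ridge shrinkage.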
[Figure: SVM classification plot for the protein data using predictors X1 to X7; points are marked "x" and "o".]