### Regression Analysis

```
Classification and Prediction: Regression Analysis
DePaul University
What Is Numerical Prediction?
(a.k.a. Estimation, Forecasting)
- (Numerical) prediction is similar to classification
  - construct a model
  - use the model to predict a continuous or ordered value for a given input
- Prediction is different from classification
  - Classification refers to predicting a categorical class label
  - Prediction models continuous-valued functions
- Major method for prediction: regression
  - models the relationship between one or more independent (predictor) variables and a dependent (response) variable
- Regression analysis
  - Linear and multiple regression
  - Non-linear regression
  - Other regression methods: generalized linear models, Poisson regression, log-linear models, regression trees
Linear Regression
- Linear regression: involves a response variable y and a single predictor variable x: y = w_0 + w_1 x
[Figure: plot with axes x and y]
- Goal: using the data, estimate the weights (parameters) w_0 and w_1 of the line such that the prediction error is minimized
Linear Regression
y = w_0 + w_1 x
[Figure: regression line on axes x and y. At a point x_i, the observed value of y and the predicted value of y differ by the error e_i for that x value. Slope = w_1, intercept = w_0.]
Linear Regression
- Linear regression: involves a response variable y and a single predictor variable x: y = w_0 + w_1 x
- The weights w_0 (y-intercept) and w_1 (slope) are the regression coefficients
- Method of least squares: estimates the best-fitting straight line
  - w_0 and w_1 are obtained by minimizing the sum of the squared errors (a.k.a. residuals):
    SSE = \sum_i e_i^2 = \sum_i (y_i - \hat{y}_i)^2 = \sum_i (y_i - (w_0 + w_1 x_i))^2
  - w_1 can be obtained by setting the partial derivative of the SSE to 0 and solving for w_1, ultimately resulting in:
    w_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}
    w_0 = \bar{y} - w_1 \bar{x}
Multiple Linear Regression
- Multiple linear regression: involves more than one predictor variable
- Training data is of the form (X_1, y_1), (X_2, y_2), ..., (X_|D|, y_|D|)
- Ex. for 2-D data, we may have: y = w_0 + w_1 x_1 + w_2 x_2
- Solvable by an extension of the least squares method
- Many nonlinear functions can be transformed into the above
[Figure: plot with axes x_1, x_2, and y]
Least Squares Generalization
- Simple least squares:
  - Determine linear coefficients \alpha, \beta that minimize the sum of squared errors (SSE).
  - Use standard (multivariate) differential calculus:
    - differentiate SSE with respect to \alpha, \beta
    - find the zeros of each partial derivative
    - solve for \alpha, \beta
- One dimension:
  SSE = \sum_{j=1}^{N} (y_j - (\alpha + \beta x_j))^2      (N = number of samples)
  \beta = \frac{cov[x, y]}{var[x]}
  \alpha = \bar{y} - \beta \bar{x}      (\bar{x}, \bar{y} = means of training x, y)
  \hat{y}_t = \alpha + \beta x_t      (for test sample x_t)
Least Squares Generalization
- Multiple dimensions
  - To simplify notation and derivation, change \alpha to \beta_0, and add a new feature x_0 = 1 to the feature vector x:
    \hat{y} = \beta_0 \cdot 1 + \sum_{i=1}^{d} \beta_i x_i = \beta^T x
[Table: training data with columns x_0 (all 1s), x_1, x_2, and y]
Least Squares Generalization
- Multiple dimensions
  \hat{y} = \beta_0 \cdot 1 + \sum_{i=1}^{d} \beta_i x_i = \beta^T x
  - Calculate SSE and determine \beta:
    SSE = \sum_{j=1}^{N} (y_j - \sum_{i=0}^{d} \beta_i x_{i,j})^2 = (y - X\beta)^T (y - X\beta)
    y = vector of all training responses y_j
    X = matrix of all training samples x_j
    \beta = (X^T X)^{-1} X^T y
    \hat{y}_t = \beta^T x_t      (for test sample x_t)
Extending Application of Linear Regression
- The inputs X for linear regression can be:
  - Original quantitative inputs
  - Transformations of quantitative inputs, e.g. log, exp, square root, square, etc.
  - Polynomial transformations
    - example: y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3
  - Dummy coding of categorical inputs
  - Interactions between variables
    - example: x_3 = x_1 \cdot x_2
- This allows linear regression techniques to be used to fit much more complicated non-linear datasets.
Example of fitting polynomial curve with linear model
Regularization
- Complex models (lots of parameters) are often prone to overfitting
- Overfitting can be reduced by imposing a constraint on the overall magnitude of the parameters (i.e., by including the size of the coefficients as a penalty term in the optimization objective)
- Two common types of regularization in linear regression:
  - L2 regularization (a.k.a. ridge regression). Find \beta which minimizes:
    \sum_{j=1}^{N} (y_j - \sum_{i=0}^{d} \beta_i x_{i,j})^2 + \lambda \sum_{i=1}^{d} \beta_i^2
    - \lambda is the regularization parameter: bigger \lambda imposes more constraint
  - L1 regularization (a.k.a. lasso). Find \beta which minimizes:
    \sum_{j=1}^{N} (y_j - \sum_{i=0}^{d} \beta_i x_{i,j})^2 + \lambda \sum_{i=1}^{d} |\beta_i|
Example: Web Traffic Data
1D Poly Fit
Example of too much "bias" → underfitting
Example: 1D and 2D Poly Fit
Example: 1D Poly Fit
Example of too much "variance" → overfitting
- Possible ways of dealing with high bias
  - Use a more complex model (e.g., adding polynomial terms such as x_1^2, x_2^2, x_1 x_2, etc.)
  - Use a smaller regularization coefficient \lambda
  - Note: getting more training data won't necessarily help in this case
- Possible ways of dealing with high variance
  - Use more training instances
  - Reduce the number of features
  - Use simpler models
  - Use a larger regularization coefficient \lambda
Other Regression-Based Models
- Generalized linear models
  - Foundation on which linear regression can be applied to modeling categorical response variables
  - Variance of y is a function of the mean value of y, not a constant
  - Logistic regression: models the probability of some event occurring as a function of a linear combination of predictor variables
  - Poisson regression: models data that exhibit a Poisson distribution
- Log-linear models (for categorical data)
  - Approximate discrete multidimensional probability distributions
  - Also useful for data compression and smoothing
- Regression trees and model trees
  - Trees to predict continuous values rather than class labels
Regression Trees and Model Trees
- Regression tree: proposed in the CART system (Breiman et al. 1984)
  - CART: Classification And Regression Trees
  - Each leaf stores a continuous-valued prediction
  - It is the average value of the predicted attribute for the training instances that reach the leaf
- Model tree: proposed by Quinlan (1992)
  - Each leaf holds a regression model: a multivariate linear equation for the predicted attribute
  - A more general case than the regression tree
- Regression and model trees tend to be more accurate than linear regression when instances are not represented well by simple linear models
Evaluating Numeric Prediction
- Prediction accuracy
  - Difference between predicted scores and the actual results (from the evaluation set)
  - Typically the accuracy of the model is measured in terms of the average of the squared differences
- Common metrics (p_i = predicted target value for test instance i, a_i = actual target value for instance i)
  - Mean Absolute Error: average absolute loss over the test set
    MAE = \frac{|p_1 - a_1| + \dots + |p_n - a_n|}{n}
  - Root Mean Squared Error: the square root of the average squared difference between predicted and actual values
    RMSE = \sqrt{\frac{(p_1 - a_1)^2 + \dots + (p_n - a_n)^2}{n}}
```
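
The closed-form least squares solution for a single predictor (slope from the centered cross-products, intercept from the means) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names `fit_simple_linear` and `predict_simple` and the toy data are assumptions, not part of the slides.

```python
from statistics import fmean


def fit_simple_linear(xs, ys):
    """Return (w0, w1) minimizing the sum of squared errors for y = w0 + w1 * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    x_bar, y_bar = fmean(xs), fmean(ys)
    # Numerator: sum of (x_i - x_bar) * (y_i - y_bar).
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: sum of (x_i - x_bar)^2.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    w1 = sxy / sxx
    w0 = y_bar - w1 * x_bar
    return w0, w1


def predict_simple(w0, w1, x):
    """Predicted response for a new input x."""
    return w0 + w1 * x


# Hypothetical toy data generated from y = 1 + 2x, so the fit should
# recover an intercept close to 1 and a slope close to 2.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w0, w1 = fit_simple_linear(xs, ys)
print(w0, w1, predict_simple(w0, w1, 4.0))
```

Dividing both sums by the number of samples gives the equivalent form slope = cov[x, y] / var[x] used on the "Least Squares Generalization" slide.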
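
For several predictors, the normal equations and the L2 (ridge) penalty can be sketched together: adding the regularization parameter to the diagonal of X^T X, except for the intercept entry, yields the ridge solution, and setting it to 0 yields ordinary least squares. The sketch below uses only the standard library; `solve_linear_system`, `fit_ridge`, `predict_ridge`, and the toy data are hypothetical names and values chosen for illustration, and in practice a numerical library would normally do the linear algebra.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ridge(rows, ys, lam=0.0):
    """Return beta minimizing SSE + lam * sum(beta_i^2 for i >= 1)."""
    # Prepend the constant feature x0 = 1 so beta[0] is the intercept.
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(X, ys)) for i in range(k)]
    # Penalize every coefficient except the intercept.
    for i in range(1, k):
        xtx[i][i] += lam
    return solve_linear_system(xtx, xty)


def predict_ridge(beta, row):
    """Predicted response for one test sample (without the x0 = 1 entry)."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], row))


# Hypothetical toy data generated from y = 1 + 2*x1 + 3*x2.
rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
ys = [1.0, 3.0, 4.0, 6.0, 8.0]
beta_ols = fit_ridge(rows, ys, lam=0.0)
beta_ridge = fit_ridge(rows, ys, lam=10.0)
print(beta_ols, beta_ridge)
```

With `lam=0.0` the fit should recover coefficients close to (1, 2, 3); with a larger `lam` the non-intercept coefficients shrink toward zero, which is the "more constraint" effect described on the Regularization slide.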
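
The two evaluation metrics from the last slide, MAE and RMSE, are short enough to compute directly. The following is a minimal sketch; the function names and the sample predictions are assumptions made for illustration.

```python
import math


def mean_absolute_error(predicted, actual):
    """Average of |p_i - a_i| over the test set."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def root_mean_squared_error(predicted, actual):
    """Square root of the average of (p_i - a_i)^2 over the test set."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    return math.sqrt(mse)


# Hypothetical predictions and actual values; the errors are 1, 0, -1, -4.
predicted = [2.0, 4.0, 6.0, 8.0]
actual = [1.0, 4.0, 7.0, 12.0]
mae = mean_absolute_error(predicted, actual)
rmse = root_mean_squared_error(predicted, actual)
print(mae, rmse)
```

Because RMSE squares each error before averaging, the single large miss in this sample pulls RMSE above MAE; RMSE is never smaller than MAE on the same data.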