### Chapter 3 Multiple Linear Regression

```Chapter 3 Multiple Linear Regression
Ray-Bing Chen
Institute of Statistics
National University of Kaohsiung
3.1 Multiple Regression Models
• Multiple regression model: a regression model that involves more than one
regressor variable.
• Example: The yield in pounds of conversion
depends on temperature and the catalyst
concentration.
• E(y) = 50 + 10 x1 + 7 x2
• The response y may be related to k regressor or
predictor variables (the multiple linear regression
model):
y = β0 + β1 x1 + β2 x2 + … + βk xk + ε
• The parameter βj represents the expected change
in the response y per unit change in xj when all of
the remaining regressor variables xi (i ≠ j) are held
constant.
• Multiple linear regression models are often used
as empirical models or approximating
functions. (The true model is unknown.)
• The cubic model:
y = β0 + β1 x + β2 x² + β3 x³ + ε
• The model with interaction effects:
y = β0 + β1 x1 + β2 x2 + β12 x1 x2 + ε
• Any regression model that is linear in the
parameters is a linear regression model, regardless
of the shape of the surface that it generates.
• The second-order model with interaction:
y = β0 + β1 x1 + β2 x2 + β11 x1² + β22 x2² + β12 x1 x2 + ε
3.2 Estimation of the Model Parameters
3.2.1 Least-squares Estimation of the Regression Coefficients
• n observations (n > k)
• Assume
– The error term ε has E(ε) = 0 and Var(ε) = σ².
– The errors are uncorrelated.
– The regressor variables x1,…, xk are fixed.
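• The sample regression model:
yi = β0 + β1 xi1 + β2 xi2 + … + βk xik + εi, i = 1,…,n
• The least-squares function:
S(β0, β1,…, βk) = Σi εi² = Σi (yi − β0 − Σj βj xij)²
• The normal equations: set ∂S/∂βj = 0 for j = 0, 1,…, k, which gives
p = k + 1 equations in the p unknown coefficients.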
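• Matrix notation: y = Xβ + ε, where y is the n × 1 vector of observations,
X is the n × p matrix of regressor levels (first column all ones),
β is the p × 1 vector of coefficients, and ε is the n × 1 vector of errors (p = k + 1).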
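• The least-squares function:
S(β) = (y − Xβ)'(y − Xβ)
• The normal equations X'Xβ̂ = X'y give the least-squares estimator
β̂ = (X'X)⁻¹X'y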
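• The fitted model corresponding to the levels of the
regressor variables, x' = (1, x1,…, xk):
ŷ = x'β̂
• The vector of fitted values: ŷ = Xβ̂ = X(X'X)⁻¹X'y = Hy, where H = X(X'X)⁻¹X' is the hat matrix.
• The hat matrix H is idempotent and
symmetric, i.e. H² = H and H' = H.
• H is an orthogonal projection matrix.
• Residuals: e = y − ŷ = (I − H)y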
• Example 3.1 The Delivery Time Data
– y: the delivery time,
– x1: the number of cases of product stocked,
– x2: the distance walked by the route driver
– Consider y = β0 + β1 x1 + β2 x2 + ε
3.2.2 A Geometrical Interpretation of Least Squares
• y = (y1,…,yn) is the vector of observations.
• X contains p (p = k+1) column vectors (n ×1), i.e.
X = (1,x1,…,xk)
• The column space of X is called the estimation
space.
• Any point in the estimation space is of the form Xβ.
• Minimize the squared distance
S(β) = (y − Xβ)'(y − Xβ)
• Normal equation: X'(y − Xβ̂) = 0
3.2.3 Properties of the Least-Squares Estimators
• Unbiased estimator:
E(β̂) = E((X'X)⁻¹X'y) = E((X'X)⁻¹X'(Xβ + ε)) = β
• Covariance matrix:
Cov(β̂) = σ²(X'X)⁻¹
• Let C = (X'X)⁻¹; then Var(β̂j) = σ²Cjj and Cov(β̂i, β̂j) = σ²Cij.
• The LSE is the best linear unbiased estimator (Gauss-Markov theorem).
• LSE = MLE under the normality assumption.
3.2.4 Estimation of σ²
• Residual sum of squares:
SSRes = e'e
= (y − Xβ̂)'(y − Xβ̂)
= y'y − 2β̂'X'y + β̂'(X'X)β̂
= y'y − β̂'X'y (using X'Xβ̂ = X'y)
• Degrees of freedom: n − p
• The unbiased estimator of σ² is the residual mean
square:
MSRes = SSRes / (n − p)
• Example 3.2 The Delivery Time Data
• Both estimates of σ² are in a sense correct, but they
depend heavily on the choice of model.
• The model with the smaller residual mean square would generally be preferred.
3.2.5 Inadequacy of Scatter Diagrams in Multiple Regression
• For the simple linear regression, the scatter
diagram is an important tool in analyzing the
relationship between y and x.
• However, it may not be useful in multiple
regression.
– y = 8 − 5 x1 + 12 x2
– The y vs. x1 plot does not exhibit any apparent
relationship between y and x1.
– The y vs. x2 plot indicates a linear
relationship with slope ≈ 8.
• In this case, constructing scatter diagrams of y vs.
xj (j = 1,2,…,k) can be misleading.
• If there is only one (or a few) dominant regressor,
or if the regressors operate nearly independently,
the scatterplot matrix is most useful.
3.2.6 Maximum-Likelihood Estimation
• The model is y = Xβ + ε
• ε ~ N(0, σ²I)
• The likelihood function and log-likelihood
function:
L(β, σ²) = (2πσ²)^(−n/2) exp(−(y − Xβ)'(y − Xβ) / (2σ²))
l(β, σ²) = −(n/2)(ln(2π) + ln(σ²)) − (1/(2σ²)) (y − Xβ)'(y − Xβ)
• The MLE of σ²:
σ̃² = (y − Xβ̂)'(y − Xβ̂) / n
3.3 Hypothesis Testing in Multiple Linear Regression
• Questions:
– What is the overall adequacy of the model?
– Which specific regressors seem important?
• Assume the errors are independent and follow a
normal distribution with mean 0 and variance σ².
3.3.1 Test for Significance of Regression
• Determine if there is a linear relationship between
y and xj, j = 1,2,…,k.
• The hypotheses are
H0: β1 = β2 =…= βk = 0
H1: βj ≠ 0 for at least one j
• ANOVA
• SST = SSR + SSRes
• Under H0, SSR/σ² ~ χ²(k); SSRes/σ² ~ χ²(n − k − 1); and SSR and SSRes
are independent.
• The test statistic:
F0 = (SSR / k) / (SSRes / (n − k − 1)) = MSR / MSRes ~ F(k, n − k − 1) under H0
• E(MSRes) = σ²
• E(MSR) = σ² + β*'Xc'Xcβ* / k, where
β* = (β1,…, βk)'
and Xc is the n × k matrix of centered regressors, whose (i, j) element is xij − x̄j.
• Under H1, F0 follows a noncentral F distribution with k and n − k − 1
degrees of freedom and a noncentrality parameter of
λ = β*'Xc'Xcβ* / σ²
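• ANOVA table
Source       Sum of squares   Degrees of freedom   Mean square   F0
Regression   SSR              k                    MSR           MSR / MSRes
Residual     SSRes            n − k − 1            MSRes
Total        SST              n − 1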
• Example 3.3 The Delivery Time Data
– R² always increases when a regressor is added to
the model, regardless of the value of the
contribution of that variable.
– Adjusted R²:
R²adj = 1 − [SSRes / (n − p)] / [SST / (n − 1)]
– The adjusted R² increases on adding a
variable to the model only if the addition of the
variable reduces the residual mean square.
3.3.2 Tests on Individual Regression Coefficients
• For the individual regression coefficient:
– H0: βj = 0 vs. H1: βj ≠ 0
– Let Cjj be the diagonal element of (X'X)⁻¹ corresponding to β̂j.
The test statistic, with σ̂² = MSRes:
t0 = β̂j / √(σ̂² Cjj) = β̂j / se(β̂j) ~ t(n − k − 1) under H0
– This is a partial or marginal test because the
estimate β̂j depends
on all of the other regressor variables in the model.
– This test is a test of the contribution of xj given the
other regressors in the model.
• Example 3.4 The Delivery Time Data
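• The subset of regressors: partition y = Xβ + ε = X1β1 + X2β2 + ε, where β1 is
(p − r) × 1 and β2 is r × 1, and test H0: β2 = 0 vs. H1: β2 ≠ 0.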
• For the full model, the regression sum of squares
SSR(β) = β̂'X'y (p degrees of freedom)
• Under the null hypothesis, the regression sum of
squares for the reduced model y = X1β1 + ε
SSR(β1) = β̂1'X1'y, where β̂1 = (X1'X1)⁻¹X1'y
• The degrees of freedom are p − r for the reduced model.
• The regression sum of squares due to β2 given β1
SSR(β2 | β1) = SSR(β) − SSR(β1)
• This is called the extra sum of squares due to β2,
and the degrees of freedom are p − (p − r) = r.
• The test statistic
F0 = [SSR(β2 | β1) / r] / MSRes ~ F(r, n − p) under H0
• If β2 ≠ 0, F0 follows a noncentral F distribution
with noncentrality parameter
λ = (1/σ²) β2'X2'[I − X1(X1'X1)⁻¹X1']X2β2
• Multicollinearity: when X2 is nearly linearly dependent on X1, λ is close to 0 and this test has almost no power.
• This test has maximal power when X1 and X2 are
orthogonal to one another.
• Partial F test: given the regressors in X1, measure
the contribution of the regressors in X2.
• Consider y = β0 + β1 x1 + β2 x2 + β3 x3 + ε.
SSR(β1 | β0, β2, β3), SSR(β2 | β0, β1, β3)
and SSR(β3 | β0, β1, β2) are single-degree-of-freedom
sums of squares.
• SSR(βj | β0,…, βj−1, βj+1,…, βk): the
contribution of xj as if it were the last variable added to the model
• This F test is equivalent to the t test.
• SST = SSR(β1, β2, β3 | β0) + SSRes
• SSR(β1, β2, β3 | β0) = SSR(β1 | β0) +
SSR(β2 | β1, β0) + SSR(β3 | β1, β2, β0)
• Example 3.5 Delivery Time Data
3.3.3 Special Case of Orthogonal Columns in X
• Model: y = Xβ + ε = X1β1 + X2β2 + ε
• Orthogonal: X1'X2 = 0
• The normal equations (X'X)β̂ = X'y then become block diagonal:
[ X1'X1    0    ] [ β̂1 ]   [ X1'y ]
[   0    X2'X2  ] [ β̂2 ] = [ X2'y ]
• β̂1 = (X1'X1)⁻¹X1'y and β̂2 = (X2'X2)⁻¹X2'y
3.3.4 Testing the General Linear Hypothesis
• Let T be an m × p matrix of constants with rank(T) = r.
• Full model: y = Xβ + ε
SSRes(FM) = y'y − β̂'X'y (n − p degrees of freedom)
• Reduced model: y = Zγ + ε, where Z is an n × (p − r)
matrix and γ is a (p − r) × 1 vector. Then
γ̂ = (Z'Z)⁻¹Z'y
SSRes(RM) = y'y − γ̂'Z'y (n − p + r degrees of freedom)
• The difference: SSH = SSRes(RM) − SSRes(FM)
with r degrees of freedom. SSH is called the sum of
squares due to the hypothesis H0: Tβ = 0.
• The test statistic:
F = (SSH / r) / (SSRes(FM) / (n − p)) ~ F(r, n − p) under H0
• Another form:
F = {β̂'T'[T(X'X)⁻¹T']⁻¹Tβ̂ / r} / {SSRes(FM) / (n − p)}
• H0: Tβ = c vs. H1: Tβ ≠ c. Then
F = {(Tβ̂ − c)'[T(X'X)⁻¹T']⁻¹(Tβ̂ − c) / r} / {SSRes(FM) / (n − p)} ~ F(r, n − p) under H0
```
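
The estimation and testing formulas from Sections 3.2 and 3.3 can be sketched numerically. The following is a minimal illustration in Python with NumPy, not code from the slides. The data are synthetic: they are simulated from the mean function y = 8 − 5 x1 + 12 x2 of Section 3.2.5, with an assumed error standard deviation of 1 and an assumed sample size of 30. All variable names, and the helper `ss_res_of`, are hypothetical.

```python
import numpy as np

# Synthetic data (an assumption for illustration; NOT the delivery time data).
# The mean function reuses the slide example E(y) = 8 - 5*x1 + 12*x2.
rng = np.random.default_rng(0)
n = 30
x1 = rng.uniform(0.0, 10.0, n)
x2 = rng.uniform(0.0, 10.0, n)
y = 8.0 - 5.0 * x1 + 12.0 * x2 + rng.normal(0.0, 1.0, n)

X = np.column_stack([np.ones(n), x1, x2])  # n x p design matrix, p = k + 1
p = X.shape[1]
k = p - 1

# Least-squares estimate: beta_hat = (X'X)^-1 X'y
C = np.linalg.inv(X.T @ X)                 # C = (X'X)^-1
beta_hat = C @ X.T @ y

# Hat matrix, fitted values, residuals
H = X @ C @ X.T
y_hat = H @ y
e = y - y_hat

# Residual sum of squares and the unbiased estimate of sigma^2
ss_res = e @ e
ms_res = ss_res / (n - p)

# ANOVA decomposition and the test for significance of regression
ss_t = np.sum((y - y.mean()) ** 2)
ss_r = ss_t - ss_res
f0 = (ss_r / k) / ms_res                   # compare with F(k, n - k - 1)

# R^2 and adjusted R^2
r2 = 1.0 - ss_res / ss_t
r2_adj = 1.0 - (ss_res / (n - p)) / (ss_t / (n - 1))

# t statistics for the individual coefficients
se = np.sqrt(ms_res * np.diag(C))
t0 = beta_hat / se                         # compare with t(n - k - 1)


def ss_res_of(Xmat, yvec):
    """Residual sum of squares from regressing yvec on the columns of Xmat."""
    coef, *_ = np.linalg.lstsq(Xmat, yvec, rcond=None)
    resid = yvec - Xmat @ coef
    return resid @ resid


# Partial F test for x2 given the intercept and x1 (extra sum of squares, r = 1)
X_reduced = X[:, :2]
ss_extra = ss_res_of(X_reduced, y) - ss_res  # SSR(beta2 | beta0, beta1)
f_partial = (ss_extra / 1) / ms_res

# General linear hypothesis H0: T beta = c, here T = [0, 0, 1] and c = 0
T = np.array([[0.0, 0.0, 1.0]])
c = np.array([0.0])
r_T = np.linalg.matrix_rank(T)
d = T @ beta_hat - c
f_glh = (d @ np.linalg.solve(T @ C @ T.T, d) / r_T) / ms_res

print("beta_hat:", beta_hat)
print("MS_Res:", ms_res, "F0:", f0, "R2:", r2, "adjusted R2:", r2_adj)
print("t0:", t0, "partial F for x2:", f_partial, "GLH F:", f_glh)
```

The partial F statistic for x2 and the general-linear-hypothesis statistic with T = [0, 0, 1] coincide with the square of the t statistic for β2, which illustrates the equivalence stated in Section 3.3.2. The explicit inverse (X'X)⁻¹ is formed here only to mirror the formulas on the slides; for ill-conditioned X a solver such as `np.linalg.lstsq` is usually the safer choice.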