### 17.4 Error Variable: Required Conditions

*Simple Linear Regression and Correlation (continued)*

Reference: Chapter 17 of *Statistics for Management and Economics*, 7th Edition, Gerald Keller.
- The error $\varepsilon$ is a critical part of the regression model.
- Four requirements involving the distribution of $\varepsilon$ must be satisfied:
  - The probability distribution of $\varepsilon$ is normal.
  - The mean of $\varepsilon$ is zero: $E(\varepsilon) = 0$.
  - The standard deviation of $\varepsilon$ is $\sigma_\varepsilon$ for all values of $x$.
  - The errors associated with different values of $y$ are all independent.
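In practice these conditions are examined through the residuals of a fitted line, which estimate the unobservable errors. The following is a minimal sketch with hypothetical data (all numbers are illustrative, not from the textbook example):

```python
import statistics

# Hypothetical (x, y) data, purely illustrative
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

n = len(x)
x_bar, y_bar = statistics.mean(x), statistics.mean(y)

# Least squares slope and intercept
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

# Residuals e_i = y_i - (b0 + b1 * x_i) estimate the errors
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# With an intercept in the model, least squares forces the residuals
# to sum to zero; normality and constant spread are usually checked
# graphically (histogram of residuals, residual-versus-x plot).
print(abs(sum(residuals)) < 1e-9)  # prints True
```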
### Observational and Experimental Data

| | Observational | Experimental |
|---|---|---|
| Variables | $Y$, $X$: both random variables | $Y$: random variable; $X$: controlled |
| Example | $Y$ = return, $X$ = inflation | $Y$ = blood pressure, $X$ = medicine dose |
| Models | Regression; correlation (bivariate normal) | Regression |
### 17.5 Assessing the Model

- For our assumed model, the least squares method will produce a regression line whether or not there is a linear relationship between $x$ and $y$.
(Two scatter plots of $Y$ against $X$: least squares produces a fitted line in both, whether or not a linear relationship is present.)
- Consequently, since we have assumed a linear model, it is important to assess how well it fits the data.
- Several methods are used to assess the model; all are based on the sum of squares for errors, SSE.
### Sum of Squares for Errors

- This is the sum of squared differences between the observed points and the points on the regression line.
- It can serve as a measure of how well the line fits the data. SSE is defined by

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
### Standard Error of Estimate, $s_\varepsilon$

- The standard deviation of the error variable, $\sigma_\varepsilon$, shows the dispersion around the true line for a given $x$.
- If $\sigma_\varepsilon$ is large, there is large dispersion around the true line; if $\sigma_\varepsilon$ is small, the observations tend to be close to the line, and the model fits the data well.
- Therefore, we can use $\sigma_\varepsilon$ as a measure of the suitability of a linear model.
- $\sigma_\varepsilon$ is not known, so we use an estimate of it: the standard error of estimate,

$$s_\varepsilon = \sqrt{\frac{SSE}{n-2}}, \qquad \hat{\sigma}_\varepsilon = s_\varepsilon$$
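Using the observed and fitted SALES values from the food-company table worked out later in this section, SSE and $s_\varepsilon$ can be computed directly (a sketch; the fitted values are the $\hat{y}_i$ column of that table):

```python
import math

# Observed sales y_i and fitted values y_hat_i from the worked example
y_obs = [115.0, 135.6, 153.6, 117.6, 106.8, 150.0, 164.4, 190.8, 136.8]
y_hat = [118.0087, 136.8165, 148.2648, 143.3584, 122.0974,
         126.1860, 171.1613, 180.1563, 124.5506]

n = len(y_obs)
SSE = sum((yo - yh) ** 2 for yo, yh in zip(y_obs, y_hat))  # about 1812.66
s_e = math.sqrt(SSE / (n - 2))                             # about 16.09
print(round(SSE, 2), round(s_e, 2))
```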
### The food company example

(Scatter plot of SALES, roughly 100 to 200, against advertising expenditure, roughly 200 to 1200.)
### Review: hypothesis test for a mean (when $\sigma^2$ is unknown)

Model: $X \sim N(\mu, \sigma^2)$

Hypotheses: $H_0: \mu = 50$ versus $H_1: \mu \neq 50$

Test statistic, if $H_0$ is true:

$$t = \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1}$$

Level of significance: $\alpha$

Rejection region: reject $H_0$ if $t_{obs} > t_{crit}$ or $t_{obs} < -t_{crit}$.

Observation: $t_{obs}$

Conclusion:
- If $t_{obs} > t_{crit}$ or $t_{obs} < -t_{crit}$: reject $H_0$.
- If $-t_{crit} < t_{obs} < t_{crit}$: do not reject $H_0$.

Interpretation: we have empirical support for (or against) the hypothesis.
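A minimal numeric sketch of this test, with a hypothetical sample (the data below are illustrative; the critical value $t_{0.025,7} = 2.365$ comes from a t table):

```python
import math
import statistics

# Hypothetical sample; test H0: mu = 50 against H1: mu != 50
sample = [52.1, 48.3, 55.0, 51.2, 49.8, 53.6, 50.9, 54.2]
n = len(sample)

x_bar = statistics.mean(sample)
s = statistics.stdev(sample)                 # sample standard deviation
t_obs = (x_bar - 50) / (s / math.sqrt(n))

# Reject H0 at level alpha = 0.05 if |t_obs| > t_crit, where
# t_crit = t_{0.025, n-1} = 2.365 for n - 1 = 7 degrees of freedom
t_crit = 2.365
print(round(t_obs, 3), abs(t_obs) > t_crit)
```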
### Testing the Slope

We test whether there is a slope. Formally:

$$H_0: \beta_1 = 0 \qquad H_1: \beta_1 \neq 0$$

Test statistic, under $H_0$:

$$t = \frac{b_1 - \beta_1}{s_{b_1}} \sim t_{n-2}, \qquad \text{where } s_{b_1} = \frac{s_\varepsilon}{\sqrt{(n-1)s_x^2}}$$

Confidence interval:

$$b_1 \pm t_{\alpha/2,\,n-2}\, s_{b_1}$$
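Applying the confidence-interval formula to the food-company estimates computed later in this section ($b_1 = 0.06814$, $s_{b_1} = 0.01757183$, $df = 7$), with the critical value $t_{0.025,7} = 2.365$ taken from a t table:

```python
# Slope estimate and its standard error from the food-company example
b1, s_b1 = 0.06814, 0.01757183
t_crit = 2.365            # t_{0.025, 7} for a 95% interval

lower = b1 - t_crit * s_b1
upper = b1 + t_crit * s_b1
print(round(lower, 4), round(upper, 4))  # prints 0.0266 0.1097
```

The interval excludes 0, consistent with rejecting $H_0: \beta_1 = 0$ at the 5% level.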
### Coefficient of determination

To measure the strength of the linear relationship we use the coefficient of determination. It is a measure of how much of the variation in $Y$ is explained by the variation in $X$ (i.e., what percentage of the variation in $Y$ can be explained by the model).
### Food company example: let $X$ = ADVER and $Y$ = SALES

| $i$ | $x_i$ | $y_i$ | $y_i - \bar{y}$ | $(y_i - \bar{y})^2$ | $\hat{y}_i$ | $(y_i - \hat{y}_i)^2$ |
|---|---|---|---|---|---|---|
| 1 | 276 | 115.0 | -26.177778 | 685.27605 | 118.0087 | 9.052434 |
| 2 | 552 | 135.6 | -5.577778 | 31.11160 | 136.8165 | 1.479981 |
| 3 | 720 | 153.6 | 12.422222 | 154.31160 | 148.2648 | 28.464554 |
| 4 | 648 | 117.6 | -23.577778 | 555.91160 | 143.3584 | 663.494881 |
| 5 | 336 | 106.8 | -34.377778 | 1181.83160 | 122.0974 | 234.009911 |
| 6 | 396 | 150.0 | 8.822222 | 77.83160 | 126.1860 | 567.104757 |
| 7 | 1056 | 164.4 | 23.222222 | 539.27160 | 171.1613 | 45.714584 |
| 8 | 1188 | 190.8 | 49.622222 | 2462.36494 | 180.1563 | 113.288358 |
| 9 | 372 | 136.8 | -4.377778 | 19.16494 | 124.5506 | 150.048384 |
| Total | 5544 | 1270.6 | | 5707.076 | | 1812.658 |
| Mean | 616 | 141.178 | | | | |

$$SST = \sum (y_i - \bar{y})^2 \qquad SSE = \sum (y_i - \hat{y}_i)^2 \qquad SSR = \sum (\hat{y}_i - \bar{y})^2$$

$$\sum (y_i - \bar{y})^2 = \sum (y_i - \hat{y}_i)^2 + \sum (\hat{y}_i - \bar{y})^2$$

$$SST = SSE + SSR$$
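The decomposition $SST = SSE + SSR$ can be verified numerically from the table's raw data (a sketch; the coefficients are recomputed by least squares rather than read from the rounded $\hat{y}_i$ column):

```python
adver = [276, 552, 720, 648, 336, 396, 1056, 1188, 372]
sales = [115.0, 135.6, 153.6, 117.6, 106.8, 150.0, 164.4, 190.8, 136.8]

n = len(adver)
x_bar = sum(adver) / n          # 616
y_bar = sum(sales) / n          # 141.178

sxx = sum((x - x_bar) ** 2 for x in adver)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(adver, sales))
b1 = sxy / sxx                  # least squares slope, about 0.06814
b0 = y_bar - b1 * x_bar

y_hat = [b0 + b1 * x for x in adver]
SST = sum((y - y_bar) ** 2 for y in sales)
SSE = sum((y - yh) ** 2 for y, yh in zip(sales, y_hat))
SSR = sum((yh - y_bar) ** 2 for yh in y_hat)

print(round(SST, 3), round(SSE + SSR, 3))  # the two totals agree
```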
### Standard error of the estimate

$$s_\varepsilon = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{1812.658}{9-2}} = 16.09196$$

Standard deviation of $b_1$:

$$s_{b_1} = \frac{s_\varepsilon}{\sqrt{(n-1)s_x^2}} = \frac{16.09196}{\sqrt{(9-1) \cdot 104832}} = 0.01757183$$

Testing $H_0: \beta_1 = 0$ against $H_1: \beta_1 \neq 0$:

$$t = \frac{b_1}{s_{b_1}} = \frac{0.06814}{0.01757183} = 3.877798$$

Degrees of freedom $= 9 - 2 = 7$.

$$R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{1812.658}{5707.076} = 0.6823841$$
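These computations can be reproduced step by step from the quantities given above:

```python
import math

n = 9
SSE, SST = 1812.658, 5707.076
s_x2 = 104832          # sample variance of ADVER
b1 = 0.06814           # least squares slope

s_e = math.sqrt(SSE / (n - 2))              # standard error of estimate
s_b1 = s_e / math.sqrt((n - 1) * s_x2)      # standard deviation of b1
t = b1 / s_b1                               # test statistic, df = 7
R2 = 1 - SSE / SST                          # coefficient of determination

print(round(s_e, 5), round(s_b1, 8), round(t, 4), round(R2, 5))
```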
### Coefficient of determination

Variation in the dependent variable $Y$ = variation explained by the independent variable (the regression model) + unexplained variation (the error):

$$SST = SSR + SSE$$

The greater the explained variation, the better the model. The coefficient of determination is a measure of the explanatory power of the model.
### SPSS output: Correlations

| | | ADVER | SALES |
|---|---|---|---|
| ADVER | Pearson Correlation | 1 | .826** |
| | Sig. (2-tailed) | | .006 |
| | N | 9 | 9 |
| SALES | Pearson Correlation | .826** | 1 |
| | Sig. (2-tailed) | .006 | |
| | N | 9 | 9 |

\*\*. Correlation is significant at the 0.01 level (2-tailed).
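The correlation reported by SPSS can be reproduced from the raw data, using the standard formula $r = S_{xy}/\sqrt{S_{xx} S_{yy}}$:

```python
import math

adver = [276, 552, 720, 648, 336, 396, 1056, 1188, 372]
sales = [115.0, 135.6, 153.6, 117.6, 106.8, 150.0, 164.4, 190.8, 136.8]

n = len(adver)
x_bar = sum(adver) / n
y_bar = sum(sales) / n

sxx = sum((x - x_bar) ** 2 for x in adver)
syy = sum((y - y_bar) ** 2 for y in sales)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(adver, sales))

r = sxy / math.sqrt(sxx * syy)
print(round(r, 3))  # matches the .826 in the SPSS table
```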
### SPSS output: Model Summary

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .826 | .682 | .637 | 16.0867 |
### SPSS output: ANOVA

| Model | | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 3885.156 | 1 | 3885.156 | 15.013 | .006 |
| | Residual | 1811.484 | 7 | 258.783 | | |
| | Total | 5696.640 | 8 | | | |

Dependent Variable: SALES
### SPSS output: Coefficients

| Model | | Unstandardized B | Std. Error | Standardized Beta | t | Sig. |
|---|---|---|---|---|---|---|
| 1 | (Constant) | 99.273 | 12.077 | | 8.220 | .000 |
| | ADVER | 6.806E-02 | .018 | .826 | 3.875 | .006 |

Dependent Variable: SALES
### Cause and effect

Note that conclusions about cause and effect ($X \rightarrow Y$) are based on knowledge of the subject. Experimental studies can decide the question; it is hard to do so with observational studies. For example, is smoking $\rightarrow$ lung cancer true? The regression model only shows the linear relationship: we will make the same inferences even if we switch the variables!
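That last point can be checked directly: the t statistic for the slope is the same whichever variable is treated as the response (a sketch using the food-company data):

```python
import math

adver = [276, 552, 720, 648, 336, 396, 1056, 1188, 372]
sales = [115.0, 135.6, 153.6, 117.6, 106.8, 150.0, 164.4, 190.8, 136.8]

def slope_t(x, y):
    """t statistic for H0: beta1 = 0 in the regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    sse = syy - b1 * sxy              # SSE of the least squares fit
    s_e = math.sqrt(sse / (n - 2))
    s_b1 = s_e / math.sqrt(sxx)
    return b1 / s_b1

t_xy = slope_t(adver, sales)   # SALES regressed on ADVER
t_yx = slope_t(sales, adver)   # ADVER regressed on SALES
print(round(t_xy, 3), round(t_yx, 3))  # the two t statistics are equal
```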
### SPSS output: Coefficients (regression of ADVER on SALES)

| Model | | Unstandardized B | Std. Error | Standardized Beta | t | Sig. |
|---|---|---|---|---|---|---|
| 1 | (Constant) | -798.855 | 370.905 | | -2.154 | .068 |
| | SALES | 10.020 | 2.586 | .826 | 3.875 | .006 |