### Regression

```REGRESSION
Jennifer Kensler
Laboratory for Interdisciplinary Statistical Analysis
TOPICS

- Simple Linear Regression
- Multiple Linear Regression
- Regression with Categorical Variables
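TYPES OF STATISTICAL ANALYSES

| Response Variable | Explanatory: Categorical                 | Explanatory: Continuous | Explanatory: Categorical & Continuous            |
|-------------------|------------------------------------------|-------------------------|--------------------------------------------------|
| Categorical       | Contingency Table or Logistic Regression | Logistic Regression     | Logistic Regression                              |
| Continuous        | ANOVA                                    | Regression              | ANCOVA or Regression with categorical variables  |
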
SIMPLE LINEAR REGRESSION

Simple Linear Regression (SLR) is used to model
the relationship between two continuous
variables.
Sullivan (pg. 193)

Scatterplots are used to graphically examine the
relationship between two quantitative variables.
TYPES OF RELATIONSHIPS BETWEEN TWO
CONTINUOUS VARIABLES

Positive and negative linear relationships
Curved Relationship

No Relationship
CORRELATION

The Pearson Correlation Coefficient measures
the strength of a linear relationship between two
quantitative variables. The sample correlation
coefficient is

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$

where $\bar{x}$ and $\bar{y}$ are the sample means of the x and y
variables respectively, and $s_x$ and $s_y$ are the sample
standard deviations of the x and y variables
respectively.
PROPERTIES OF THE CORRELATION
COEFFICIENT
- $-1 \le r \le 1$
- Positive values of r indicate a positive linear relationship.
- Negative values of r indicate a negative linear relationship.
- Values close to +1 or -1 indicate a strong linear relationship.
- Values close to 0 indicate little or no linear relationship between the variables.
- We only use r to discuss linear relationships between two variables.
- Note: Correlation does not imply causation.

SIMPLE LINEAR REGRESSION
Can we describe the relationship between the two variables with a linear equation?
- The variable on the x-axis is often called the explanatory or predictor variable.
- The variable on the y-axis is called the response variable.

Objectives of Simple Linear Regression

- Determine the significance of the predictor variable
  in explaining variability in the response variable
  (e.g., is per capita GDP useful in explaining the variability in
  life expectancy?).
- Predict values of the response variable for given
  values of the explanatory variable
  (e.g., if we know the per capita GDP, can we predict life
  expectancy?).

Note: The predictor variable does not necessarily
cause the response.
SIMPLE LINEAR REGRESSION MODEL

The Simple Linear Regression model is given by

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

where
- $y_i$ is the response of the ith observation
- $\beta_0$ is the y-intercept
- $\beta_1$ is the slope
- $x_i$ is the value of the predictor variable for the ith observation
- $\varepsilon_i \sim \text{iid Normal}(0, \sigma^2)$ is the random error
- $i = 1, \dots, n$
SLR ESTIMATION OF PARAMETERS
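The equation for the least-squares regression line
is given by

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

where $\hat{y}$ is the predicted value of the response for a
given value of x, and

$$\hat{\beta}_1 = r \frac{s_y}{s_x}$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
$$\hat{\sigma}^2 = s^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2}$$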
THE RESIDUAL


The residual is the observed value of y minus the
predicted value of y.
The residual for observation i is given by

$$e_i = y_i - \hat{y}_i$$
SIMPLE LINEAR REGRESSION
ASSUMPTIONS

- Linearity
- Observations are independent
  - Based on how the data are collected.
  - Check by plotting residuals in the order in which the data were collected.
- Constant variance
  - Check using a residual plot (plot residuals vs. $\hat{y}$).
- The error terms $\varepsilon_i$ are normally distributed.
  - Check by making a histogram or normal quantile plot of the residuals.
DIAGNOSTICS: RESIDUAL PLOT


A residual plot is used to check the assumption of
constant variance and to check model fit (whether a line
is a good fit).
Good residual plot: no pattern
DIAGNOSTICS
- Left: Residuals show non-constant variance.
- Right: Residuals show non-linear pattern.
DIAGNOSTICS: NORMAL QUANTILE PLOT
- Left: Residuals are not normal.
- Right: Normality assumption appropriate.
ANOVA TABLE FOR SIMPLE LINEAR
REGRESSION
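| Source     | SS                                               | df  | MS                | F               | P-value              |
|------------|--------------------------------------------------|-----|-------------------|-----------------|----------------------|
| Regression | $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$   | 1   | $MSR = SSR/1$     | $F^* = MSR/MSE$ | $P(F_{1,n-2} > F^*)$ |
| Error      | $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$       | n-2 | $MSE = SSE/(n-2)$ |                 |                      |
| Total      | $SSTO = \sum_{i=1}^{n} (y_i - \bar{y})^2$        | n-1 |                   |                 |                      |

The F-test tests whether there is a linear relationship between
the two variables.
Null Hypothesis $H_0: \beta_1 = 0$
Alternative Hypothesis $H_a: \beta_1 \ne 0$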
TEST FOR PARAMETERS

- Test whether the true y-intercept is different from 0.
  $H_0: \beta_0 = 0$
  $H_a: \beta_0 \ne 0$
- Test whether the true slope is different from 0.
  $H_0: \beta_1 = 0$
  $H_a: \beta_1 \ne 0$
- Note: For simple linear regression this test is
  equivalent to the overall F-test.
COEFFICIENT OF DETERMINATION

The coefficient of determination, $R^2$, is the
proportion of variation in the response variable
explained by the least squares regression line.

$$R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}$$

- Note: $0 \le R^2 \le 1$
- We also have $R^2 = r^2$
MUSCLE MASS EXAMPLE

A nutritionist randomly selected 15 women from
each ten-year age group beginning with age 40
and ending with age 79. The nutritionist recorded
the age and muscle mass of each woman. The
nutritionist would like to fit a model to explore
the relationship between age and muscle mass.
(Kutner et al. pg. 36)
JMP: MAKING A SCATTERPLOT

To analyze the data click Analyze and then
select Fit Y by X.
Assign the variables as follows:
- Y, Response: Muscle Mass
- X, Factor: Age
JMP: SCATTERPLOT

This results in a scatter plot.
JMP: SIMPLE LINEAR REGRESSION

To perform the simple linear regression click on
the Red Arrow and then select Fit Line.
SIMPLE LINEAR REGRESSION RESULTS
[Figure: JMP output for the simple linear regression fit.]
JMP: DIAGNOSTICS
Click on the Red Arrow next to Linear Fit and
select Plot Residuals.
DIAGNOSTIC PLOTS
[Figure: JMP residual diagnostic plots.]
MULTIPLE LINEAR REGRESSION

Similar to simple linear regression, except now
there is more than one explanatory variable.
Body fat can be difficult to measure. A researcher
would like to come up with a model that uses the
more easily obtained measurements of triceps
skinfold thickness, thigh circumference and
midarm circumference to predict body fat.
(Kutner et al. pg. 256)
FIRST ORDER MULTIPLE LINEAR
REGRESSION MODEL

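The multiple linear regression model with p-1
independent variables is given by

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_{p-1} x_{i,p-1} + \varepsilon_i$$

where
- $\beta_0, \beta_1, \dots, \beta_{p-1}$ are parameters
- $x_{i1}, x_{i2}, \dots, x_{i,p-1}$ are known constants
- $\varepsilon_i \sim \text{iid } N(0, \sigma^2)$
- $i = 1, \dots, n$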
MULTIPLE LINEAR REGRESSION ANOVA
TABLE
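| Source     | SS                                               | df  | MS                | F               | P-value                |
|------------|--------------------------------------------------|-----|-------------------|-----------------|------------------------|
| Regression | $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$   | p-1 | $MSR = SSR/(p-1)$ | $F^* = MSR/MSE$ | $P(F_{p-1,n-p} > F^*)$ |
| Error      | $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$       | n-p | $MSE = SSE/(n-p)$ |                 |                        |
| Total      | $SSTO = \sum_{i=1}^{n} (y_i - \bar{y})^2$        | n-1 |                   |                 |                        |

The ANOVA F-test tests
$H_0: \beta_1 = \beta_2 = \dots = \beta_{p-1} = 0$
$H_a$: Not all of the $\beta_k$'s are 0
Tests can also be performed for individual parameters
(i.e. $H_0: \beta_k = 0$ vs. $H_a: \beta_k \ne 0$).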
COEFFICIENT OF MULTIPLE
DETERMINATION
The coefficient of multiple determination, $R^2$, is
the proportion of variation in the response y
explained by the set of $x_1, \dots, x_{p-1}$ explanatory
variables.

$$R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}$$

- $0 \le R^2 \le 1$
- The adjusted coefficient of determination, $R_a^2$,
  introduces a penalty for more explanatory
  variables.

$$R_a^2 = 1 - \frac{SSE/(n-p)}{SSTO/(n-1)}$$
ASSUMPTIONS OF MULTIPLE LINEAR
REGRESSION

- Observations are independent
  - Based on how the data are collected (plot residuals in the
    order in which the data were collected).
- Constant variance
  - Check using a residual plot (plot residuals vs. $\hat{y}$, plot
    residuals vs. each predictor variable).
- The error terms $\varepsilon_i$ are normally distributed.
  - Check by making a histogram or normal quantile plot
    of the residuals.
COMMERCIAL RENTAL RATES

A real estate company would like to build a
model to help clients make decisions about
properties. The company has information about
rental rate (Y), age (X1), operating expenses and
taxes (X2), vacancy rates (X3), and total square
footage (X4). The information is regarding luxury
real estate in a specific location. (Kutner et al. pg.
251)
JMP: COMMERCIAL RENTAL RATES

First, examine the data. Click Analyze, then
Multivariate Methods, then Multivariate.
JMP: SCATTERPLOT MATRIX

For Y, Columns enter Y, X1, X2, X3 and X4.
Then click OK.
JMP: CORRELATIONS AND SCATTERPLOT
MATRIX
JMP: FITTING THE REGRESSION MODEL

Click Analyze and then select Fit Model.
Y: Y. Highlight X1, X2, X3 and X4 and click Add.
Then click Run.
FITTING THE MODEL


- Examining the parameter estimates, we see that
  X3 is not significant.
- Fit a new model, this time omitting X3.
SOME JMP OUTPUT
JMP: CHECKING ASSUMPTIONS

Included output
Need residuals:
- Click the red arrow next to Y Response → Save
  Columns → Residuals
JMP: CHECK NORMALITY ASSUMPTION
- Analyze → Distribution → Y, Columns: Residual Y
- Click the red arrow next to Distribution Residual
  Y and select Normal Quantile Plot.
JMP: CHECKING RESIDUALS VS.
INDEPENDENT VARIABLES
Analyze → Fit Y by X →
- Y, Columns: Residual Y
- X, Factor: X1, X2, X4
OTHER MULTIPLE LINEAR REGRESSION
ISSUES
- Outliers
- Higher Order Terms
- Interaction Terms
- Multicollinearity
- Model Selection
REGRESSION WITH
CATEGORICAL VARIABLES

Sometimes there are categorical explanatory
variables that we would like to incorporate into
our model.
Suppose we would like to model the profit or loss
of banks last year based on bank size and type of
bank (commercial, mutual savings, or savings
and loan). (Kutner et al. pg. 340)
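REGRESSION MODEL WITH CATEGORICAL
VARIABLES

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i$$

where $x_{i1}$ is the size of bank i,

$$x_{i2} = \begin{cases} 1 & \text{if bank i is commercial} \\ 0 & \text{if mutual savings} \\ -1 & \text{if savings and loan} \end{cases}$$

$$x_{i3} = \begin{cases} 0 & \text{if bank i is commercial} \\ 1 & \text{if mutual savings} \\ -1 & \text{if savings and loan} \end{cases}$$

and $\varepsilon_i \sim \text{iid } N(0, \sigma^2)$.

Note: There are other ways the categorical variables
could have been coded, but this is how JMP codes
them.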
REGRESSION WITH CATEGORICAL
VARIABLES

A school district would like to determine if a new
reading program improves students' scores. The
school district is also interested in the effect of
days absent.
Approximately half the students are assigned to
the treatment group (new reading program) and
half to the control group (traditional method).
The students are tested at the beginning and end
of the school year and the change in their score is
recorded.
JMP INSTRUCTIONS

- Analyze → Fit Model
- Y: Score Change
- Add: Treatment, Days Absent
- Run Model
- Response Score Change → Estimates → Show
  Prediction Expression
JMP OUTPUT
Treatment and days absent had significant
effects on improvement.
DIAGNOSTICS: CONSTANT VARIANCE

The Residual by Predicted plot is produced
automatically.

Residual by Factor Plots
- First Save Residuals: Response Score Change → Save
  Columns → Residuals
- Produce Plots: Analyze → Fit Y by X → Y, Response:
  Residual Score Change; X, Factor: Treatment, Days
  Absent
DIAGNOSTICS: NORMALITY

Analyze → Distribution → Y, Columns:
Residual Score Change
CONCLUSIONS



- Simple linear regression allows us to find the
  best fit line between a continuous explanatory
  variable and a continuous response variable.
- Multiple linear regression allows us to explore
  the relationship between a continuous response
  variable and multiple explanatory variables.
  (It also allows higher order terms to be
  introduced.)
- Regression with categorical variables allows us to
  incorporate categorical predictor variables into
  the model.
SAS, SPSS AND R

For information about using SAS, SPSS and R to
do regression:
- http://www.ats.ucla.edu/stat/sas/topics/regression.htm
- http://www.ats.ucla.edu/stat/spss/topics/regression.htm
- http://www.ats.ucla.edu/stat/r/sk/books_pra.htm
REFERENCES


- Michael Sullivan III. Statistics: Informed
  Decisions Using Data. Upper Saddle River, New
  Jersey: Pearson Education, 2004.
- Michael H. Kutner, Christopher J. Nachtsheim,
  John Neter and William Li. Applied Linear
  Statistical Models. New York: McGraw-Hill
  Irwin, 2005.
```
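
The simple linear regression quantities described in the slides (the correlation coefficient, the least-squares slope and intercept, the residuals, the ANOVA sums of squares, the F statistic and the coefficient of determination) can be worked through by hand on a small data set. The sketch below is one way to do that in plain Python, assuming the slope is computed as the correlation times the ratio of the sample standard deviations. The function name `simple_linear_regression`, the dictionary keys, and the five data points are illustrative assumptions; the data are made up and are not taken from the muscle mass example or from JMP.

```python
from math import sqrt


def simple_linear_regression(x, y):
    """Least-squares fit of y = b0 + b1 * x with basic ANOVA quantities.

    Hypothetical helper for illustration only; it needs at least three
    observations and some variation in x.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Sample standard deviations (divisor n - 1).
    s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

    # Pearson correlation: average product of standardized deviations.
    r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
            for xi, yi in zip(x, y)) / (n - 1)

    # Least-squares estimates of the slope and intercept.
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar

    # Fitted values and residuals e_i = y_i - y_hat_i.
    fitted = [b0 + b1 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]

    # ANOVA decomposition: SSTO = SSR + SSE.
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)
    sse = sum(e ** 2 for e in residuals)
    ssto = sum((yi - y_bar) ** 2 for yi in y)

    mse = sse / (n - 2)          # estimate of the error variance
    f_stat = (ssr / 1) / mse     # F statistic with 1 and n - 2 df
    r_squared = 1 - sse / ssto   # coefficient of determination

    return {
        "r": r, "b0": b0, "b1": b1, "residuals": residuals,
        "ssr": ssr, "sse": sse, "ssto": ssto,
        "mse": mse, "f_stat": f_stat, "r_squared": r_squared,
    }


# Made-up illustrative data (not from the cited textbooks).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

fit = simple_linear_regression(x, y)
print("intercept:", round(fit["b0"], 4))
print("slope:", round(fit["b1"], 4))
print("R^2:", round(fit["r_squared"], 4))
print("F statistic:", round(fit["f_stat"], 2))
```

Comparing `fit["r"] ** 2` with `fit["r_squared"]` is a quick way to see the identity between the squared correlation and the coefficient of determination for simple linear regression. The p-value is left out because it requires the F distribution, which the Python standard library does not provide; a statistics package would normally be used for that step.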