Simple Linear Regression

Regression
Shibin Liu
SAS Beijing R&D
Agenda
• 0. Lesson overview
• 1. Exploratory Data Analysis
• 2. Simple Linear Regression
• 3. Multiple Regression
• 4. Model Building and Interpretation
• 5. Summary
Lesson overview
[Diagram: response variable and predictor variable; ANOVA]
[Diagram: continuous response and continuous predictor; correlation analysis and linear regression]
Lesson overview
Continuous response, continuous predictor: correlation analysis
• Measure linear association
• Examine the relationship
• Screen for outliers
• Interpret the correlation
Lesson overview
Continuous response, continuous predictor: linear regression
• Define the linear association
• Determine the equation for the line
• Explain or predict variability
Lesson overview
What do you want to examine?
• The location, spread, and shape of the data's distribution (descriptive statistics). Summary statistics or graphics?
    • Summary statistics: SUMMARY STATISTICS task (descriptive statistics)
    • Both: DISTRIBUTION ANALYSIS task (descriptive statistics, histogram, normal probability plots)
• The difference between groups on one or more variables (inferential statistics). How many groups?
    • Two: TTEST task
    • Two or more: LINEAR MODELS task (analysis of variance)
• The relationship between variables (inferential statistics). Which kind of variables?
    • Continuous only: CORRELATIONS task, LINEAR REGRESSION task
    • Categorical response variable: ONE-WAY FREQUENCIES & TABLE ANALYSIS task (frequency tables, chi-square test), LOGISTIC REGRESSION task
Covered in: Lesson 1, Lesson 2, Lesson 3 & 4, Lesson 5
Exploratory Data Analysis: Introduction
[Diagram: Height and Weight, two continuous variables; exploratory data analysis (scatter plot, correlation analysis) and linear regression]
Exploratory Data Analysis: Objective
• Examine the relationship between continuous variables using a scatter plot
• Quantify the degree of association between two continuous variables using correlation statistics
• Avoid potential misuses of the correlation coefficient
• Obtain Pearson correlation coefficients
Exploratory Data Analysis:
Using Scatter Plots to Describe Relationships between Continuous Variables
Exploratory data analysis: scatter plot, correlation analysis
Use a scatter plot to examine:
• Relationship
• Trend
• Range
• Outliers
and to communicate analysis results.
X: predictor variable
Y: response variable
Coordinates: values of X and Y
Exploratory Data Analysis:
Using Scatter Plots to
Describe Relationships between Continuous Variables
Model terms: squared (quadratic)
Exploratory Data Analysis:
Using Correlation to
Measure Relationships between Continuous Variables
Exploratory data analysis: scatter plot, correlation analysis
Linear association: negative, zero, positive
Exploratory Data Analysis:
Using Correlation to
Measure Relationships between Continuous Variables
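Pearson correlation coefficient:
For population: $\rho = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}$
For sample: $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$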
Exploratory Data Analysis:
Using Correlation to
Measure Relationships between Continuous Variables
Pearson correlation coefficient r:
• r = -1: strong negative linear relationship
• r = 0: no linear relationship
• r = +1: strong positive linear relationship
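As a rough illustration of how the sample coefficient r is computed, here is a minimal Python sketch (Python rather than the SAS Correlation task used in the course). The function name `pearson_r` and the data values are made up for illustration; they are not the course's Fitness data.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of the deviations from the means
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Made-up illustrative values (NOT the course's Fitness data)
run_time = [8.6, 9.4, 10.1, 11.0, 12.3]
oxygen = [59.6, 54.3, 50.4, 46.7, 40.8]

r = pearson_r(run_time, oxygen)
print(round(r, 3))  # close to -1: a strong negative linear relationship
```

A value near -1 or +1 indicates a strong linear relationship; a value near 0 indicates no linear relationship.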
Exploratory Data Analysis:
Hypothesis testing for a Correlation
Correlation Coefficient Test
H0: $\rho = 0$
Ha: $\rho \neq 0$
Correlation: the population parameter is $\rho$; the sample statistic is r.
• A p-value does not measure the magnitude of the association.
• Sample size affects the p-value.
Rejecting the null hypothesis only means that you can be confident that the true population correlation is not 0. A small p-value can occur (as with many statistics) because of a very large sample size. Even a correlation coefficient of 0.01 can be statistically significant with a large enough sample size. Therefore, it is important to also look at the value of r itself to see whether it is meaningfully large.
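The effect of sample size on the p-value can be sketched as follows. The test statistic for H0: rho = 0 is t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom. The p-value below uses a large-sample normal approximation, which is an assumption of this sketch (SAS uses the exact t distribution); the function names are hypothetical.

```python
import math

def corr_t_statistic(r, n):
    """t statistic for H0: rho = 0 (n - 2 degrees of freedom)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def approx_two_sided_p(t):
    """Two-sided p-value from a normal approximation.
    Assumption: reasonable only for large n; use the t distribution otherwise."""
    return math.erfc(abs(t) / math.sqrt(2))

# The same tiny correlation, r = 0.01, at increasing sample sizes
for n in (100, 10_000, 1_000_000):
    t = corr_t_statistic(0.01, n)
    print(n, round(t, 3), approx_two_sided_p(t))
```

With r fixed at 0.01, the p-value shrinks as n grows, even though the association stays negligible in size.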
Exploratory Data Analysis:
Hypothesis testing for a Correlation
[Figure: correlation scale from -1 through 0 to +1, with r = 0.72 and r = 0.81 marked]
Exploratory Data Analysis:
Avoiding Common Errors in Interpreting Correlations
Cause and Effect
Correlation does not imply causation
Besides causality, could
other reasons account for
strong correlation between
two variables?
Exploratory Data Analysis:
Avoiding Common Errors in Interpreting Correlations
Cause and Effect
Correlation does not imply causation
Example: Weight and Height
A strong correlation between two variables does not mean that a change in one variable causes the other variable to change, or vice versa.
Exploratory Data Analysis:
Avoiding Common Errors in Interpreting Correlations
Cause and Effect
SAT score: tied to college entrance or not
X: the percentage of students in a state who take the SAT exam
Y: SAT scores
Exploratory Data Analysis:
Avoiding Common Errors: Types of Relationships
Pearson correlation coefficient: r → 0 for curvilinear (parabolic, quadratic) relationships
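A small sketch of this pitfall, using made-up data: y is completely determined by x through a quadratic relationship, yet the Pearson coefficient is 0 because the relationship is not linear. The helper `pearson_r` is defined here only for illustration.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    s_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    s_xx = sum((a - mx) ** 2 for a in x)
    s_yy = sum((b - my) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]  # perfect quadratic (parabolic) relationship

r = pearson_r(x, y)
print(r)  # no LINEAR association, despite a perfect curvilinear one
```

This is why a scatter plot should accompany the coefficient.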
Exploratory Data Analysis:
Avoiding Common Errors: outliers
Data one: r = 0.02
Data two: r = 0.82
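The sensitivity of r to a single outlier can be sketched with made-up data (these are not the two data sets shown on the slide). The helper name and all values are assumptions for illustration.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    s_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    s_xx = sum((a - mx) ** 2 for a in x)
    s_yy = sum((b - my) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Five made-up points with no linear association
x = [1, 2, 3, 4, 5]
y = [2, 1, 3, 1, 2]
r_without = pearson_r(x, y)

# The same points plus one extreme observation
r_with = pearson_r(x + [20], y + [20])

print(round(r_without, 3), round(r_with, 3))
```

One extreme point is enough to move r from about 0 to nearly 1, so it is worth computing the coefficient both with and without the suspect observation.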
Exploratory Data Analysis:
Avoiding Common Errors: outliers
What to do with an outlier?
Why is it an outlier?
• Valid: collect data, replicate data; compute two correlation coefficients and report both coefficients
• Error
Exploratory Data Analysis:
Scenario: Exploring Data Using Correlation and Scatter Plots
Fitness: which variables are related to oxygen consumption?
Exploratory Data Analysis:
Exploring Data with Correlations and Scatter Plots
What's the Pearson correlation coefficient of Oxygen_Consumption with Run_Time?
What's the p-value for the correlation of Oxygen_Consumption with Performance?
Exploratory Data Analysis:
Examining Correlations between Predictor Variables
What are the two highest Pearson correlation coefficients?
Exploratory Data Analysis
Question 1.
The correlation between tuition and rate of graduation at U.S. colleges is 0.55. What does this mean?
a) The way to increase graduation rates at your college is to raise tuition.
b) Increasing graduation rates is expensive, causing tuition to rise.
c) Students who are richer tend to graduate more often than poorer students.
d) None of the above.
Answer: d
Simple Linear Regression: Introduction
[Figure: linear relationships for Variables A, B, C, and D on the correlation scale from -1 through 0 to +1]
Simple Linear Regression: Introduction
[Figure: relationships with the same r and with different r]
Simple Linear Regression: Introduction
Simple Linear Regression
Y: variable of primary interest
X: explains variability in Y
Regression line
Simple Linear Regression: Objective
• Explain the concepts of Simple Linear Regression
• Fit a Simple Linear Regression using the Linear Regression task
• Produce predicted values and confidence intervals
Simple Linear Regression:
Scenario: Performing Simple Linear Regression
Fitness data: simple linear regression of Oxygen_Consumption (response) on Run_Time (predictor)
Simple Linear Regression:
The Simple Linear Regression Model
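The model equation on this slide did not survive extraction. For reference, the simple linear regression model is usually written as below; the symbols are the conventional ones and are assumed here rather than copied from the slide.

```latex
Y = \beta_0 + \beta_1 X + \varepsilon
% \beta_0     : intercept parameter
% \beta_1     : slope parameter
% \varepsilon : error term, the variation of Y around the line
```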
Question 2.
What does epsilon represent?
a) The intercept parameter
b) The predictor variable
c) The variation of X around the line
d) The variation of Y around the line
Answer: d
Simple Linear Regression:
How SAS Performs Linear Regression
Method of least squares: minimize the sum of the squared differences between the observed and predicted values of Y
Best Linear Unbiased Estimators:
• Are unbiased estimators
• Have minimum variance
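The least-squares estimates have a closed form in the one-predictor case: the slope is S_xy / S_xx and the intercept is mean(Y) - slope * mean(X). The Python sketch below illustrates this; it is not what the SAS Linear Regression task runs internally, and the function name and data are made up.

```python
def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up illustrative values (NOT the course's Fitness data)
x = [9, 10, 11, 12, 13]
y = [52.0, 49.5, 45.0, 43.5, 40.0]

b0, b1 = fit_simple_linear_regression(x, y)
print(b0, b1)

# Predicted value at a new X
print(b0 + b1 * 11.5)
```

Note that if the slope were 0, the intercept would equal the mean of Y, so every predicted value would be the mean of Y.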
Simple Linear Regression:
Measuring How Well a Model Fits the Data
Regression model vs. baseline model
Simple Linear Regression:
Comparing the Regression Model to a Baseline Model
Baseline model: $\bar{Y}$
Better model: explains more variability
Type of variability and equation:
• Explained (SSM): $\sum (\hat{Y}_i - \bar{Y})^2$
• Unexplained (SSE): $\sum (Y_i - \hat{Y}_i)^2$
• Total: $\sum (Y_i - \bar{Y})^2$
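The three sums of squares can be checked numerically. The sketch below fits a least-squares line to made-up data, then computes the explained (SSM), unexplained (SSE), and total sums of squares, along with R-square = SSM / Total and the F statistic for a one-predictor model. Variable names and data are assumptions for illustration.

```python
# Made-up illustrative values (NOT the course's Fitness data)
x = [9, 10, 11, 12, 13]
y = [52.0, 49.5, 45.0, 43.5, 40.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n  # the baseline model predicts mean_y for every observation

# Least-squares slope and intercept
slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
         / sum((a - mean_x) ** 2 for a in x))
intercept = mean_y - slope * mean_x
y_hat = [intercept + slope * a for a in x]

ssm = sum((p - mean_y) ** 2 for p in y_hat)        # explained
sse = sum((b - p) ** 2 for b, p in zip(y, y_hat))  # unexplained
sst = sum((b - mean_y) ** 2 for b in y)            # total

r_square = ssm / sst
f_value = (ssm / 1) / (sse / (n - 2))  # 1 model df, n - 2 error df

print(ssm, sse, sst, round(r_square, 4), f_value)
```

The total variability splits exactly into the explained and unexplained parts, which is what makes the comparison with the baseline model possible.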
Simple Linear Regression:
Hypothesis Testing for Linear Regression
Linear regression
H0: $\beta_1 = 0$
Ha: $\beta_1 \neq 0$
Simple Linear Regression:
Assumptions of Simple Linear Regression
Linear regression
Assumptions:
1. The mean of Y is linearly related to X.
2. Errors are normally distributed.
3. Errors have equal variances.
4. Errors are independent.
Simple Linear Regression:
Performing Simple Linear Regression
Task > Regression > Linear Regression
Simple Linear Regression:
Performing Simple Linear Regression
Question 3.
In the model Y=X, if the parameter estimate (slope) of X is 0, then which of the following is the best guess (predicted value) for Y when X equals 13?
a) 13
b) The mean of Y
c) A random number
d) The mean of X
e) 0
Answer: b
Simple Linear Regression:
Confidence and Prediction Intervals
Question 4.
Suppose you have a 95% confidence interval around the mean.
How do you interpret it?
a) The probability is .95 that the true population mean of Y for a
particular X is within the interval.
b) You are 95% confident that a newly sampled value of Y for a
particular X is within the interval.
c) You are 95% confident that your interval contains the true
population mean of Y for a particular X.
Answer: c
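For reference, the two intervals are commonly written as follows for a simple linear regression evaluated at a chosen value $x_0$. The notation ($x_0$, $s$, the t critical value) is assumed here and is not taken from the slides.

```latex
% Confidence interval for the mean of Y at X = x_0
\hat{y}_0 \pm t_{\alpha/2,\,n-2}\; s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}

% Prediction interval for a new individual value of Y at X = x_0
\hat{y}_0 \pm t_{\alpha/2,\,n-2}\; s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}

% where s = \sqrt{SSE / (n - 2)} is the root mean square error
```

The extra 1 under the square root is why a prediction interval is always wider than the confidence interval for the mean at the same $x_0$.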
Simple Linear Regression:
Producing Predicted Values of the Response Variable
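/* Data set holding the Runtime values for which predictions are needed */
data Need_Predictions;
input Runtime @@;   /* @@ reads several values from the same data line */
datalines;
9 10 11 12 13
;
run;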
Multiple Regression: Introduction
Simple Linear Regression: one response variable, one predictor variable
Multiple Linear Regression: one response variable, more than one predictor variable
$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k + \varepsilon$
When k = 2:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$
Multiple Regression: Objective
• Explain the mathematical model for multiple regression
• Describe the main advantage of multiple regression versus simple linear regression
• Explain the standard output from the Linear Regression task
• Describe common pitfalls of multiple linear regression
Multiple Regression
Advantages and Disadvantages of Multiple Regression
Multiple Linear Regression
Advantages
Disadvantages
• 127 possible models
• Complex to interpret
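The count of 127 can be reproduced under the assumption of seven candidate predictors (the deck later mentions "7 vars"): there are 2^7 - 1 = 127 models that contain at least one predictor. Adding the intercept-only model gives 128, which is one plausible reading of the "128 possible models" figure that appears later. The predictor names below are placeholders.

```python
from itertools import combinations

# Placeholder names for seven candidate predictors (an assumption)
predictors = ["X1", "X2", "X3", "X4", "X5", "X6", "X7"]

# Every non-empty subset of predictors defines one candidate model
models_with_predictors = sum(
    1
    for size in range(1, len(predictors) + 1)
    for _ in combinations(predictors, size)
)

# Counting the intercept-only model as well
models_including_empty = models_with_predictors + 1

print(models_with_predictors, models_including_empty)
```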
Multiple Regression
Picturing the Model for Multiple Regression
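Multiple Linear Regression
$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k + \varepsilon$
• $\beta_1 = \beta_2 = 0$: $Y = \beta_0$
• $\beta_1 \neq 0$, $\beta_2 \neq 0$: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2$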
Multiple Regression
Common applications
Multiple Linear Regression is a powerful tool for the following tasks:
1. Prediction, which is used to develop a model to predict future values of a response variable (Y) based on its relationships with other predictor variables (Xs).
2. Analytical or Explanatory Analysis, which is used to develop an understanding of the relationships between the response variable and predictor variables.
Multiple Regression
Analysis versus Prediction in Multiple Regression
Prediction
1. The terms in the model, the values of their
coefficients, and their statistical significance are
of secondary importance.
2. The focus is on producing a model that is the best at predicting future values of Y as a function of the Xs.
$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \dots + \hat{\beta}_k X_k$
Multiple Regression
Analysis versus Prediction in Multiple Regression
Analytical or Explanatory Analysis
1. The focus is understanding the relationship
between the dependent variable and independent
variables.
2. Consequently, the statistical significance of the coefficients is important, as well as their magnitudes and signs.
$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \dots + \hat{\beta}_k X_k$
Multiple Regression
Hypothesis Testing for Multiple Regression
Multiple Linear Regression
$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k + \varepsilon$
H0: $\beta_1 = \beta_2 = \dots = \beta_k = 0$
Ha: at least one $\beta_i \neq 0$
H0: The regression model does not fit the data better than the baseline model.
Ha: The regression model does fit the data better than the baseline model.
Multiple Regression
Hypothesis Testing for Multiple Regression
Question 4.
Match the items on the left with those on the right.
1. At least one slope of the regression in the population is not 0 and at least one predictor variable explains a significant amount of variability in the response variable.
2. No predictor variable explains a significant amount of variability in the response variable.
3. The estimated linear regression model does not fit the data better than the baseline model.
a) Reject the null hypothesis
b) Fail to reject the null hypothesis
Answer: 1-a, 2-b, 3-b
Multiple Regression
Assumptions for Multiple Regression
Linear regression model
Assumptions:
1. The mean of Y is linearly related to X.
2. Errors are normally distributed.
3. Errors have equal variances.
4. Errors are independent.
Multiple Regression: Scenario: Using Multiple
Regression to Explain Oxygen Consumption
Age
Performance
Multiple Regression:
Adjusted R2
$R^2_{ADJ} = 1 - \frac{(n - i)(1 - R^2)}{n - p}$
i = 1 if there is an intercept and 0 otherwise
n = the number of observations used to fit the model
p = the number of parameters in the model
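The adjusted R-square formula above can be sketched directly in code. The function name and the example numbers are made up for illustration.

```python
def adjusted_r_square(r_square, n, p, intercept=True):
    """Adjusted R-square = 1 - (n - i)(1 - R^2) / (n - p).

    n: number of observations used to fit the model
    p: number of parameters in the model (including the intercept, if any)
    i: 1 if there is an intercept, 0 otherwise
    """
    i = 1 if intercept else 0
    return 1 - (n - i) * (1 - r_square) / (n - p)

# Made-up example: R^2 = 0.5 from 31 observations and 3 parameters
print(round(adjusted_r_square(0.5, n=31, p=3), 4))

# Same R^2 with more parameters: the adjusted value drops
print(round(adjusted_r_square(0.5, n=31, p=8), 4))
```

Unlike R-square, the adjusted value is penalized as parameters are added, so it can decrease when a term contributes little.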
Multiple Regression:
Performing Multiple Linear Regression
What's the p-value of the overall model?
Should we reject the null hypothesis or not?
Based on our evidence, do we reject the null hypothesis that the parameter estimate is 0?
[Diagram: relationships among Oxygen_Consumption, Performance, and RunTime]
Collinearity between Performance and RunTime: -0.82049
Model Building and Interpretation :
Introduction
Age
Performance
Stepwise selection methods: Forward, Backward, Stepwise
All-possible regressions rank criteria: R2, Adjusted R2, Mallows' Cp
'No selection' is the default.
Model Building and Interpretation: Objectives
• Explain the Linear Regression task options for model selection
• Describe model selection options and interpret output to evaluate the fit of several models
Model Building and Interpretation :
Approaches to Selecting Models: Manual
Full Model
Model Building and Interpretation :
SAS and Automated Approaches to Modeling
Stepwise selection methods: Forward, Backward, Stepwise
All-possible regressions rank criteria: R2, Adjusted R2, Mallows' Cp
'No selection' is the default.
Approach:
• Run all methods
• Look for commonalities
• Narrow down models
Model Building and Interpretation :
The All-Possible Regressions Approach to Model Building
Fitness data, predictor variables: 128 possible models
Model Building and Interpretation :
Evaluating Models Using Mallows' Cp Statistic
Mallows' Cp statistic (Cp)
Model bias: under-fitting and over-fitting
Model Building and Interpretation :
Evaluating Models Using Mallows' Cp Statistic
Mallows' Cp statistic (Cp): model bias
Criteria:
• For prediction: Mallows' criterion, Cp <= p
• For parameter estimation: Hocking's criterion, Cp <= 2p - pfull + 1
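The two criteria can be expressed as simple checks. In this sketch, p is the number of parameters in the candidate model (including the intercept) and p_full is the number of parameters in the full model; the function names and the Cp values are made up for illustration.

```python
def meets_mallows(cp, p):
    """Mallows' criterion (for prediction): Cp <= p."""
    return cp <= p

def meets_hocking(cp, p, p_full):
    """Hocking's criterion (for parameter estimation): Cp <= 2p - p_full + 1."""
    return cp <= 2 * p - p_full + 1

p_full = 8  # e.g. 7 variables + 1 intercept

# Hypothetical (p, Cp) pairs for a few candidate models
candidates = [(5, 4.8), (6, 5.5), (6, 4.9), (7, 6.3)]

for p, cp in candidates:
    print(p, cp, meets_mallows(cp, p), meets_hocking(cp, p, p_full))
```

For p = 6 and p_full = 8 the Hocking threshold is 12 - 8 + 1 = 5, so it is the stricter of the two criteria here.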
Model Building and Interpretation :
Viewing Mallows' Cp Statistic
Linear Regression task, partial output: number of variables in the model + 1 = p; Cp
Model Building and Interpretation :
Viewing Mallows' Cp Statistic
Mallows' criterion: Cp <= p
Partial output
Which of these models has the fewest parameters?
In this output, how many models have a value for Cp that is less than or equal to p?
Model Building and Interpretation :
Viewing Mallows' Cp Statistic
Hocking's criterion: Cp <= 2p - pfull + 1
Partial output
First of all, what is the p for the full model?
pfull = 8 (7 variables + 1 intercept)
Cp <= 12 - 8 + 1
How many models meet Hocking's criterion for Cp for parameter estimation?
Model Building and Interpretation :
Viewing Mallows' Cp Statistic
Question 5.
What happens when you use the all-possible regressions method? Select all that apply.
1. You compare the R-square, adjusted R-square, and Cp statistics to evaluate the models.
2. SAS computes all possible models.
3. You choose a selection method (stepwise, forward, or backward).
4. SAS ranks the results.
5. You cannot reduce the number of models in the output.
6. You can produce a plot to help identify models that satisfy criteria for the Cp statistic.
Answer: 1, 2, 4, 6
Model Building and Interpretation :
Viewing Mallows' Cp Statistic
Question 6.
Match the items on the left with those on the right.
1. Preferred over R-square for evaluating multiple linear regression models (takes into account the number of terms in the model)
2. Useful for parameter estimation
3. Useful for prediction
a. Mallows' criterion for Cp
b. Hocking's criterion for Cp
c. Adjusted R-square
Answer: 1-c, 2-b, 3-a
Model Building and Interpretation :
Using Automatic Model Selection
Model Building and Interpretation :
Estimating and Testing Coefficients for Selected Models –
Prediction model
Model Building and Interpretation :
Estimating and Testing Coefficients for Selected Models –
Explanatory Model
Model Building and Interpretation :
The Stepwise Selection Approach to Model Building
Stepwise selection methods
Forward
Backward
Stepwise
Model Building and Interpretation :
The Stepwise Selection Approach to Model Building
Forward
Forward selection starts with no variables in the model, then adds the most significant variable at each step until no remaining variable is significant. A variable that has been added is not removed, even if it becomes insignificant later.
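The control flow of forward selection can be sketched as below. This is only an outline of the logic, not the SAS implementation: the p-value of a candidate given the current model is supplied by a caller-provided function (`entry_p_value`, a hypothetical name), and the demo uses a fixed lookup table of made-up p-values as a stand-in for refitting the regression at each step. The default entry level of 0.50 follows the default mentioned in the homework.

```python
def forward_select(candidates, entry_p_value, sle=0.50):
    """Forward selection sketch.

    candidates: list of variable names
    entry_p_value(selected, var): p-value var would have if added to the
        model that already contains `selected` (supplied by the caller)
    sle: significance level for entry
    """
    selected = []
    remaining = list(candidates)
    while remaining:
        # p-value of each remaining variable, given the current model
        p_values = {var: entry_p_value(selected, var) for var in remaining}
        best = min(p_values, key=p_values.get)
        if p_values[best] > sle:
            break  # no remaining variable is significant enough to enter
        selected.append(best)   # once added, a variable is never removed
        remaining.remove(best)
    return selected

# Stand-in p-values (made up); a real run would refit the model each step
fake_p = {"A": 0.0001, "B": 0.02, "C": 0.30, "D": 0.80}
lookup = lambda selected, var: fake_p[var]

print(forward_select(list(fake_p), lookup))            # default entry level
print(forward_select(list(fake_p), lookup, sle=0.05))  # stricter entry level
```

A stricter entry level admits fewer variables, which is the effect explored in the homework exercise on significance criteria.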
Model Building and Interpretation :
The Stepwise Selection Approach to Model Building
Backward
Backward selection starts with all variables in the model, then removes the least significant variable at each step until all remaining variables are significant. Once a variable is removed, it cannot re-enter.
Model Building and Interpretation :
The Stepwise Selection Approach to Model Building
Stepwise
Stepwise selection combines the ideas of Forward and Backward selection. It starts with no variables and, like Forward selection, adds the most significant variable at each step. However, like Backward selection, it can also drop insignificant variables, one at a time.
The Stepwise method stops when all terms in the model are significant and all terms outside the model are not significant.
Model Building and Interpretation :
The Stepwise Selection Approach to Model Building
Application
Stepwise selection methods (Forward, Backward, Stepwise):
• Identify candidate models
• Use expertise to choose
Model Building and Interpretation :
Performing Stepwise Regression: Forward selection
Model Building and Interpretation :
Performing Stepwise Regression: Backward selection
Model Building and Interpretation :
Performing Stepwise Regression: Stepwise selection
Model Building and Interpretation :
Using Alternative Significance Criteria for Stepwise Models
Stepwise Regression Models
• With default significance levels
• Using 0.05 significance levels
Model Building and Interpretation :
Comparison of Selection Methods
Stepwise selection methods: use fewer computer resources.
All-possible regression: generates more candidate models that might have nearly equal R2 and Cp statistics.
Home Work: Exercise 1
1.1 Describing the Relationship between Continuous Variables
Percentage of body fat, age, weight, height, and 10 body circumference measurements (for example, abdomen) were recorded for 252 men. The data are stored in the BodyFat2 data set. Body fat, one measure of health, was accurately estimated by an underwater weighing technique. There are two measures of percentage body fat in this data set.
Case: case number
PctBodyFat1: percent body fat using Brozek's equation, 457/Density - 414.2
PctBodyFat2: percent body fat using Siri's equation, 495/Density - 450
Density: density (gm/cm^3)
Age: age (yrs)
Weight: weight (lbs)
Height: height (inches)
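The two body fat equations listed above can be written out as follows. The function names and the example density are made up for illustration; the density value is not taken from the BodyFat2 data set.

```python
def pct_body_fat_brozek(density):
    """Percent body fat using Brozek's equation: 457/Density - 414.2."""
    return 457 / density - 414.2

def pct_body_fat_siri(density):
    """Percent body fat using Siri's equation: 495/Density - 450."""
    return 495 / density - 450

density = 1.07  # gm/cm^3, a made-up example value
print(round(pct_body_fat_brozek(density), 2))
print(round(pct_body_fat_siri(density), 2))
```

Both measures are deterministic functions of Density, which explains why PctBodyFat1 and PctBodyFat2 are so closely related in the data.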
a. Generate scatter plots and correlations for the variables Age, Weight, Height, and the circumference measures versus the variable PctBodyFat2.
Important! The Correlation task limits you to 10 variables at a time for scatter plot matrices, so for this exercise, look at the relationships with Age, Weight, and Height separately from the circumference variables (Neck, Chest, abdomen, Hip, thigh, Knee, Ankle, Biceps, Forearm, and Wrist).
Note: Correlation tables can be created using more than 10 VAR variables at a time.
b. What variable has the highest correlation with PctBodyFat2?
c. What is the value for the coefficient?
d. Is the correlation statistically significant at the 0.05 level?
e. Can straight lines adequately describe the relationships?
f. Are there any outliers that you should investigate?
g. Generate correlations among the variables Age, Weight, and Height, and among the circumference measures. Are there any notable relationships?
Home Work: Exercise 2
2.1 Fitting a Simple Linear Regression Model
Use the BodyFat2 data set for this exercise:
a. Perform a simple linear regression model with PctBodyFat2 as the response
variable and Weight as the predictor.
b. What is the value of the F statistic and the associated p-value? How would you
interpret this with regard to the null hypothesis?
c. Write the predicted regression equation.
d. What is the value of the R2 statistic? How would you interpret this?
e. Produce predicted values for PctBodyFat2 when Weight is 125, 150, 175, 200, and 225 (see the SAS code in the comments below).
f. What are the predicted values?
g. What's the value of PctBodyFat2 when Weight is 150?
Home Work: Exercise 3
3.1 Performing Multiple Regression
a. Using the BodyFat2 data set, run a regression of PctBodyFat2 on the variables Age, Weight, Height, Neck, Chest, Abdomen, Hip, thigh, Knee, Ankle, Biceps, Forearm, and Wrist. Compare the ANOVA table with that from the model with only Weight in the previous exercise. What is different?
b. How do the R2 and the adjusted R2 compare with these statistics for the Weight regression demonstration?
c. Did the estimate for the intercept change? Did the estimate for the coefficient of Weight change?
Home Work: Exercise 3
3.2 Simplifying the model
a. Rerun the model in the previous exercise, but eliminate the variable with the
highest p-value. Compare the result with the previous model.
b. Did the p-value for the model change notably?
c. Did the R2 and adjusted R2 change notably?
d. Did the parameter estimates and their p-values change notably?
3.3 More simplifying of the model
a. Rerun the model in the previous exercise, but eliminate the variable with the
highest p-value.
b. How did the output change from the previous model?
c. Did the number of parameters with a p-value less than 0.05 change?
Home Work: Exercise 4
4.1 Using Model Building Techniques
Use the BodyFat2 data set to identify a set of “best” models.
a. Using the Mallows' Cp option, use an all-possible regression technique to identify a set of candidate models that predict PctBodyFat2 as a function of the variables Age, Weight, Height, Neck, Chest, abdomen, Hip, thigh, Knee, Ankle, Biceps, Forearm, and Wrist.
Hint: select the best 60 models based on Cp to compare.
b. Use a stepwise regression method to select a candidate model. Try Forward selection, Backward selection, and Stepwise selection.
c. How many variables would result from a model using Forward selection and a significance level for entry of 0.05, instead of the default of 0.50?
Thank you!
