REGRESSION
Jennifer Kensler
Laboratory for Interdisciplinary Statistical Analysis

LISA helps VT researchers benefit from the use of Statistics
Experimental Design • Data Analysis • Interpreting Results • Grant Proposals • Software (R, SAS, JMP, SPSS...)
Collaboration: From our website request a meeting for personalized statistical advice. Great advice right now: Meet with LISA before collecting your data.
Walk-In Consulting: Monday-Friday 12-2PM for questions requiring under 30 minutes.
Short Courses: Designed to help graduate students apply statistics in their research.
All services are FREE for VT researchers. We assist with research, not class projects or homework.
www.lisa.stat.vt.edu

TOPICS
Simple Linear Regression
Multiple Linear Regression
Regression with Categorical Variables

TYPES OF STATISTICAL ANALYSES
                     Explanatory Variable(s)
Response Variable    Categorical                                 Continuous             Categorical & Continuous
Categorical          Contingency Table or Logistic Regression    Logistic Regression    Logistic Regression
Continuous           ANOVA                                       Regression             ANCOVA or Regression with categorical variables

SIMPLE LINEAR REGRESSION
Simple Linear Regression (SLR) is used to model the relationship between two continuous variables.
(Sullivan pg. 193)
Scatterplots are used to graphically examine the relationship between two quantitative variables.

TYPES OF RELATIONSHIPS BETWEEN TWO CONTINUOUS VARIABLES
[Figures: scatterplots showing positive and negative linear relationships, a curved relationship, and no relationship]

CORRELATION
The Pearson Correlation Coefficient measures the strength of a linear relationship between two quantitative variables. The sample correlation coefficient is
$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$
where $\bar{x}$ and $\bar{y}$ are the sample means of the x and y variables respectively, and $s_x$ and $s_y$ are the sample standard deviations of the x and y variables respectively.

PROPERTIES OF THE CORRELATION COEFFICIENT
$-1 \le r \le 1$
Positive values of r indicate a positive linear relationship.
Negative values of r indicate a negative linear relationship.
Values close to +1 or -1 indicate a strong linear relationship.
Values close to 0 indicate little or no linear relationship between the variables.
We only use r to discuss linear relationships between two variables.
Note: Correlation does not imply causation.

SIMPLE LINEAR REGRESSION
Can we describe the behavior between the two variables with a linear equation?
The variable on the x-axis is often called the explanatory or predictor variable.
The variable on the y-axis is called the response variable.

SIMPLE LINEAR REGRESSION
Objectives of Simple Linear Regression
Determine the significance of the predictor variable in explaining variability in the response variable. (e.g., Is per capita GDP useful in explaining the variability in life expectancy?)
Predict values of the response variable for given values of the explanatory variable. (e.g., If we know the per capita GDP, can we predict life expectancy?)
Note: The predictor variable does not necessarily cause the response.

SIMPLE LINEAR REGRESSION MODEL
The Simple Linear Regression model is given by
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
where
$y_i$ is the response of the ith observation
$\beta_0$ is the y-intercept
$\beta_1$ is the slope
$x_i$ is the value of the predictor variable for the ith observation
$\varepsilon_i \overset{iid}{\sim} \text{Normal}(0, \sigma^2)$ is the random error
$i = 1, \ldots, n$

SLR ESTIMATION OF PARAMETERS
The equation for the least-squares regression line is given by
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
where $\hat{y}$ is the predicted value of the response for a given value of x, and
$\hat{\beta}_1 = r \frac{s_y}{s_x}$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
$\hat{\sigma}^2 = s^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2}$

THE RESIDUAL
The residual is the observed value of y minus the predicted value of y. The residual for observation i is given by
$e_i = y_i - \hat{y}_i$

SIMPLE LINEAR REGRESSION ASSUMPTIONS
Linearity.
Observations are independent. Based on how the data are collected. Check by plotting residuals in the order in which the data were collected.
Constant variance. Check using a residual plot (plot residuals vs. $\hat{y}$).
The error terms are normally distributed.
Check by making a histogram or normal quantile plot of the residuals.

DIAGNOSTICS: RESIDUAL PLOT
A residual plot is used to check the assumption of constant variance and to check model fit (is a line a good fit?).
Good residual plot: no pattern.

DIAGNOSTICS
[Figure] Left: Residuals show non-constant variance. Right: Residuals show non-linear pattern.

DIAGNOSTICS: NORMAL QUANTILE PLOT
[Figure] Left: Residuals are not normal. Right: Normality assumption appropriate.

ANOVA TABLE FOR SIMPLE LINEAR REGRESSION
Source        SS                                                   df     MS                           F                        P-value
Regression    $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$       1      $MSR = \frac{SSR}{1}$        $F = \frac{MSR}{MSE}$    $P(F_{1,n-2} > F)$
Error         $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$           n-2    $MSE = \frac{SSE}{n-2}$
Total         $SSTO = \sum_{i=1}^{n} (y_i - \bar{y})^2$            n-1
Here $F_{1,n-2}$ denotes a random variable with an F distribution on 1 and n-2 degrees of freedom, so the p-value is the probability of exceeding the observed F statistic when the null hypothesis is true.
The F-test tests whether there is a linear relationship between the two variables.
Null Hypothesis $H_0: \beta_1 = 0$
Alternative Hypothesis $H_a: \beta_1 \ne 0$

TEST FOR PARAMETERS
Test whether the true y-intercept is different from 0.
$H_0: \beta_0 = 0$ vs. $H_a: \beta_0 \ne 0$
Test whether the true slope is different from 0.
$H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \ne 0$
Note: For simple linear regression this test is equivalent to the overall F-test.

COEFFICIENT OF DETERMINATION
The coefficient of determination, $R^2$, is the proportion of variation in the response variable explained by the least squares regression line.
$R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}$
Note: $0 \le R^2 \le 1$
We also have $R^2 = r^2$.

MUSCLE MASS EXAMPLE
A nutritionist randomly selected 15 women from each ten-year age group beginning with age 40 and ending with age 79. The nutritionist recorded the age and muscle mass of each woman. The nutritionist would like to fit a model to explore the relationship between age and muscle mass. (Kutner et al. pg. 36)

JMP: MAKING A SCATTERPLOT
To analyze the data click Analyze and then select Fit Y by X.
As shown below:
Y, Response: Muscle Mass
X, Factor: Age

JMP: SCATTERPLOT
This results in a scatter plot.

JMP: SIMPLE LINEAR REGRESSION
To perform the simple linear regression click on the Red Arrow and then select Fit Line.

SIMPLE LINEAR REGRESSION RESULTS
The results on the right are displayed.
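
For readers working outside JMP, the simple linear regression quantities defined in this part of the deck (r, the least-squares slope and intercept, the residuals, the ANOVA sums of squares, F and R-squared) can be computed directly. The following is a minimal sketch in Python with NumPy, not the method used in the slides. The function name `simple_linear_regression` and the small data set are hypothetical and chosen only for illustration; they are not the Kutner muscle mass data.

```python
import numpy as np

def simple_linear_regression(x, y):
    """Least-squares fit of y = b0 + b1*x using the textbook SLR formulas."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size

    # Sample standard deviations (n - 1 in the denominator)
    sx, sy = x.std(ddof=1), y.std(ddof=1)

    # Pearson correlation: average product of standardized values
    r = np.sum(((x - x.mean()) / sx) * ((y - y.mean()) / sy)) / (n - 1)

    # Slope and intercept of the least-squares line
    b1 = r * sy / sx
    b0 = y.mean() - b1 * x.mean()

    # Predicted values and residuals e_i = y_i - y_hat_i
    y_hat = b0 + b1 * x
    resid = y - y_hat

    # ANOVA decomposition: SSTO = SSR + SSE
    ssr = np.sum((y_hat - y.mean()) ** 2)
    sse = np.sum(resid ** 2)
    ssto = np.sum((y - y.mean()) ** 2)

    mse = sse / (n - 2)   # estimate of the error variance
    f_stat = ssr / mse    # MSR / MSE, with MSR = SSR / 1

    return {"r": r, "b0": b0, "b1": b1, "resid": resid,
            "ssr": ssr, "sse": sse, "ssto": ssto,
            "mse": mse, "f": f_stat, "r2": ssr / ssto}

# Hypothetical illustrative data (NOT the data set from the slides)
age = np.array([41, 46, 52, 57, 63, 68, 72, 78])
mass = np.array([112, 105, 101, 96, 88, 84, 80, 73])

fit = simple_linear_regression(age, mass)
print("intercept:", fit["b0"], "slope:", fit["b1"])
print("R^2:", fit["r2"], "F:", fit["f"])
```

Plotting `fit["resid"]` against the fitted values, or against the collection order, gives the residual plots described in the assumptions and diagnostics slides.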

JMP: DIAGNOSTICS
Click on the Red Arrow next to Linear Fit and select Plot Residuals.

DIAGNOSTIC PLOTS
The plots to the right are then added to the JMP output.

MULTIPLE LINEAR REGRESSION
Similar to simple linear regression, except now there is more than one explanatory variable.
Body fat can be difficult to measure. A researcher would like to come up with a model that uses the more easily obtained measurements of triceps skinfold thickness, thigh circumference and midarm circumference to predict body fat. (Kutner et al. pg. 256)

FIRST ORDER MULTIPLE LINEAR REGRESSION MODEL
The multiple linear regression model with p-1 independent variables is given by
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{p-1} x_{i,p-1} + \varepsilon_i$
where
$\beta_0, \beta_1, \ldots, \beta_{p-1}$ are parameters
$x_{i1}, x_{i2}, \ldots, x_{i,p-1}$ are known constants
$\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$
$i = 1, \ldots, n$

MULTIPLE LINEAR REGRESSION ANOVA TABLE
Source        SS                                                   df     MS                           F                        P-value
Regression    $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$       p-1    $MSR = \frac{SSR}{p-1}$      $F = \frac{MSR}{MSE}$    $P(F_{p-1,n-p} > F)$
Error         $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$           n-p    $MSE = \frac{SSE}{n-p}$
Total         $SSTO = \sum_{i=1}^{n} (y_i - \bar{y})^2$            n-1
Here $F_{p-1,n-p}$ denotes a random variable with an F distribution on p-1 and n-p degrees of freedom.
The ANOVA F-test tests
$H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0$
$H_a:$ Not all of the $\beta_k$'s are 0
Tests can also be performed for individual parameters (e.g., $H_0: \beta_k = 0$ vs. $H_a: \beta_k \ne 0$).

COEFFICIENT OF MULTIPLE DETERMINATION
The coefficient of multiple determination, $R^2$, is the proportion of variation in the response y explained by the set of explanatory variables $x_1, \ldots, x_{p-1}$.
$R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}$
$0 \le R^2 \le 1$
The adjusted coefficient of determination, $R_a^2$, introduces a penalty for more explanatory variables.
$R_a^2 = 1 - \left( \frac{n-1}{n-p} \right) \frac{SSE}{SSTO}$

ASSUMPTIONS OF MULTIPLE LINEAR REGRESSION
Observations are independent. Based on how the data are collected (plot residuals in the order in which the data were collected).
Constant variance. Check using a residual plot (plot residuals vs. $\hat{y}$, plot residuals vs. each predictor variable).
The error terms are normally distributed. Check by making a histogram or normal quantile plot of the residuals.

COMMERCIAL RENTAL RATES
A real estate company would like to build a model to help clients make decisions about properties.
The company has information about rental rate (Y), age (X1), operating expenses and taxes (X2), vacancy rates (X3), and total square footage (X4). The information is regarding luxury real estate in a specific location. (Kutner et al. pg. 251)

JMP: COMMERCIAL RENTAL RATES
First, examine the data.
Click Analyze, then Multivariate Methods, then Multivariate.

JMP: SCATTERPLOT MATRIX
For Y, Columns enter Y, X1, X2, X3 and X4. Then click OK.

JMP: CORRELATIONS AND SCATTERPLOT MATRIX

JMP: FITTING THE REGRESSION MODEL
Click Analyze and then select Fit Model.
Y: Y. Highlight X1, X2, X3 and X4 and click Add. Then click Run.

FITTING THE MODEL
Examining the parameter estimates we see that X3 is not significant.
Fit a new model, this time omitting X3.

SOME JMP OUTPUT

JMP: CHECKING ASSUMPTIONS
Included output
Need residuals: Click the red arrow next to Y Response → Save Columns → Residuals

JMP: CHECK NORMALITY ASSUMPTION
Analyze → Distribution → Y, Columns: Residual Y
Click the red arrow next to Distribution Residual Y and select Normal Quantile Plot.

JMP: CHECKING RESIDUALS VS. INDEPENDENT VARIABLES
Analyze → Fit Y by X → Y, Columns: Residual Y; X, Factor: X1, X2, X4

OTHER MULTIPLE LINEAR REGRESSION ISSUES
Outliers
Higher Order Terms
Interaction Terms
Multicollinearity
Model Selection

REGRESSION WITH CATEGORICAL VARIABLES
Sometimes there are categorical explanatory variables that we would like to incorporate into our model.
Suppose we would like to model the profit or loss of banks last year based on bank size and type of bank (commercial, mutual savings, or savings and loan). (Kutner et al. pg.
340)

REGRESSION MODEL WITH CATEGORICAL VARIABLES
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i$
where
$x_{i1}$ is the size of bank i
$x_{i2} = \begin{cases} 1 & \text{if bank } i \text{ is commercial} \\ 0 & \text{if mutual savings} \\ -1 & \text{if savings and loan} \end{cases}$
$x_{i3} = \begin{cases} 0 & \text{if bank } i \text{ is commercial} \\ 1 & \text{if mutual savings} \\ -1 & \text{if savings and loan} \end{cases}$
$\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$
Note: There are other ways the categorical variables could have been coded, but this is how JMP codes them.

REGRESSION WITH CATEGORICAL VARIABLES
A school district would like to determine if a new reading program improves student reading. The school district is also interested in the effect of days absent on reading improvement.
Approximately half the students are assigned to the treatment group (new reading program) and half to the control group (traditional method). The students are tested at the beginning and end of the school year and the change in their score is recorded.

JMP INSTRUCTIONS
Analyze → Fit Model
Y: Score Change
Add: Treatment, Days Absent
Run Model
Response Score Change → Estimates → Show Prediction Expression

JMP OUTPUT
Treatment and days absent had significant effects on improvement.

DIAGNOSTICS: CONSTANT VARIANCE
Residual by Predicted plot produced automatically.

DIAGNOSTICS: CONSTANT VARIANCE
Residual by Factor Plots
First Save Residuals: Response Score Change → Save Columns → Residuals
Produce Plots: Analyze → Fit Y by X → Y, Response: Residuals Score Change; X, Factor: Treatment, Days Absent

DIAGNOSTICS: NORMALITY
Analyze → Distribution → Y, Columns: Residual Score Change

CONCLUSIONS
Simple linear regression allows us to find the best fit line between a continuous explanatory variable and a continuous response variable.
Multiple linear regression allows us to explore the relationship between a continuous response variable and multiple explanatory variables. (It also allows for higher order terms to be introduced.)
Regression with categorical variables allows us to incorporate categorical predictor variables into the model.
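
As a complement to the JMP steps, the multiple regression with a categorical predictor summarized above can be sketched in Python with NumPy. This is an illustrative sketch under stated assumptions, not the deck's own procedure. The helper `effect_code` is a hypothetical name, the coding follows the 1 / 0 / -1 scheme shown on the bank slide, and the nine data points are invented for illustration only (they are not the Kutner bank data).

```python
import numpy as np

def effect_code(labels, levels):
    """Code a categorical variable with 1 / 0 / -1 columns.

    One column per level except the last; rows belonging to the last
    level receive -1 in every column.
    """
    cols = np.zeros((len(labels), len(levels) - 1))
    for row, label in enumerate(labels):
        k = levels.index(label)
        if k == len(levels) - 1:
            cols[row, :] = -1.0
        else:
            cols[row, k] = 1.0
    return cols

# Hypothetical illustrative data (NOT the data set from the slides)
levels = ["commercial", "mutual savings", "savings and loan"]
kind = levels * 3
size = np.array([1.2, 3.4, 2.2, 5.1, 4.0, 2.9, 6.3, 1.8, 3.7])
profit = np.array([0.8, 1.9, 0.4, 3.2, 2.3, 0.9, 3.9, 1.4, 1.1])

# Design matrix: intercept, size (x1), and the two coded columns (x2, x3)
X = np.column_stack([np.ones(size.size), size, effect_code(kind, levels)])

# Least-squares estimates of beta_0, ..., beta_3
beta, *_ = np.linalg.lstsq(X, profit, rcond=None)

n, p = X.shape                      # p counts the intercept, so p - 1 predictors
y_hat = X @ beta
sse = np.sum((profit - y_hat) ** 2)
ssto = np.sum((profit - profit.mean()) ** 2)

r2 = 1 - sse / ssto                             # coefficient of multiple determination
r2_adj = 1 - (n - 1) / (n - p) * sse / ssto     # adjusted for the number of parameters

print("estimates:", beta)
print("R^2:", r2, "adjusted R^2:", r2_adj)
```

With this coding, the coefficients on the two coded columns are the commercial and mutual savings effects, and the savings and loan effect is the negative of their sum, which is why only two columns are needed for three bank types.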

SAS, SPSS AND R
For information about using SAS, SPSS and R to do regression:
http://www.ats.ucla.edu/stat/sas/topics/regression.htm
http://www.ats.ucla.edu/stat/spss/topics/regression.htm
http://www.ats.ucla.edu/stat/r/sk/books_pra.htm

REFERENCES
Michael Sullivan III. Statistics: Informed Decisions Using Data. Upper Saddle River, New Jersey: Pearson Education, 2004.
Michael H. Kutner, Christopher J. Nachtsheim, John Neter and William Li. Applied Linear Statistical Models. New York: McGraw-Hill Irwin, 2005.