Regression Methods
OLAWALE AWE
LISA SHORT COURSE, DEPARTMENT OF STATISTICS, VIRGINIA TECH. NOVEMBER 21, 2013.

About
What? Laboratory for Interdisciplinary Statistical Analysis
Why? Mission: to provide statistical advice, analysis, and education to Virginia Tech researchers
How? Collaboration requests, Walk-in Consulting, Short Courses
Where? Walk-in Consulting in GLC and various other locations (www.lisa.stat.vt.edu/?q=walk_in). Collaboration meetings are typically held in Sandy 312
Statistical Collaborators? Graduate students and faculty members in the VT statistics department

Requesting a LISA Meeting
Go to www.lisa.stat.vt.edu
Click the link for “Collaboration Request Form”
Sign into the website using your VT PID and password
Enter your information (email, college, etc.)
Describe your project (project title, research goals, specific research questions, whether you have already collected data, special requests, etc.)
Contact your assigned LISA collaborators as soon as possible to schedule a meeting

Laboratory for Interdisciplinary Statistical Analysis
LISA helps VT researchers benefit from the use of Statistics.
Collaboration: Visit our website to request personalized statistical advice and assistance with: Experimental Design • Data Analysis • Interpreting Results • Grant Proposals • Software (R, SAS, JMP, SPSS...)
LISA statistical collaborators aim to explain concepts in ways useful for your research.
Great advice right now: Meet with LISA before collecting your data.
LISA also offers:
Educational Short Courses: Designed to help graduate students apply statistics in their research
Walk-In Consulting: M-F 1-3 PM in the GLC Video Conference Room, for questions requiring <30 mins. Also 11 AM-1 PM at Port (Library/Torg Bridge) and 9:30-11:30 AM at ICTAS Café X
All services are FREE for VT researchers. We assist with research, not class projects or homework.
www.lisa.stat.vt.edu

Outline
Introduction to Regression Analysis
Simple Linear Regression
Multiple Linear Regression
Regression Model Assumptions
Residual Analysis
Assessing Multicollinearity: Correlation and VIF
Model Selection Procedures
Illustrative Example (Brief Demo with SPSS/PASW)
Model Diagnostic and Interpretation

Introduction
Regression is a statistical technique for investigating, describing, and predicting the relationship between two or more variables. Regression has been regarded as the most widely used technique in statistics. It is as basic to statistics as the Pythagorean theorem is to geometry (Montgomery et al., 2006).

Regression: Intro
Regression analysis has applications in almost every field of human endeavor. It is one of the most popular statistical techniques used by researchers, and it is widely used in engineering, physical and chemical sciences, economics, management, social sciences, life and biological sciences, etc. It is easy to understand and interpret. Simply put, regression analysis is used to find equations that fit data.

When do we use Regression Technique?
Response Variable | Explanatory Variable(s): Categorical | Continuous | Categorical & Continuous
Categorical | Contingency Table or Logistic Regression | Logistic Regression | Logistic Regression
Continuous | ANOVA | Regression | ANCOVA or Regression with categorical variables

SIMPLE LINEAR REGRESSION

Simple Linear Regression
Simple Linear Regression (SLR) is a statistical method for modeling the relationship between ONLY two continuous variables. A researcher may be interested in modeling the relationship between Life Expectancy and Per Capita GDP of seven countries as follows. Scatterplots are first used to graphically examine the relationship between the two variables.

Types of Relationships Between Two Continuous Variables
A scatter plot is a visual representation of the relationship between two variables.
Positive and negative linear relationship

Other Types of Relationships…
Curvilinear Relationships
No Relationship

Simple Linear Regression
Can we describe the behavior between the two variables with a linear equation? The variable on the x-axis is often called the explanatory or predictor variable (X). The variable on the y-axis is called the response variable (Y).

Simple Linear Regression Model
The Simple Linear Regression model is given by
$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
where
$y_i$ is the response of the ith observation
$\beta_0$ is the y-intercept
$\beta_1$ is the slope
$x_i$ is the value of the predictor variable for the ith observation
$\epsilon_i \sim \text{iid Normal}(0, \sigma^2)$ is the random error
$i = 1, \dots, n$

Interpretation of Slope and Intercept Parameter
β1 is the difference in the predicted value of Y for a one-unit difference in X.
β0 is the mean response when the predictor variable is zero (it often has no practical meaning but should be included).
If β1 > 0, there is a positive relationship: as X increases, Y also increases.
If β1 < 0, there is a negative relationship: as X increases, Y decreases.
If β1 = 0, there is no linear relationship between the two variables (see graphs below).

Graphs of Relationships Between Two Continuous Variables
β1>0   β1<0   β1=0

Line of Best Fit
A line of best fit is a straight line that best represents your data on a scatter plot. It has the same form as the equation of a straight line from elementary math class: y = mx + b, where m = slope and b = y-intercept.
The residual is $r = y - \hat{y}$, with $E(r) = 0$ (more on residuals later), where y = observed response and $\hat{y}$ = predicted response.

Regression Assumptions
Linearity between the dependent and independent variable(s).
Observations are independent. This depends on how the data are collected. Check by plotting residuals vs. the order in which the data were collected.
Constant variance of error terms. Check using a residual plot (plot residuals vs. $\hat{y}$).
The error terms are normally distributed. Check by making a histogram or normal quantile plot of the residuals.
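
The simple linear regression model and the residual definition above can be sketched numerically. This is a minimal illustration, assuming NumPy is available; the data are invented for the example (they are not the course data), and `fit_slr` is a hypothetical helper name. It uses the standard least-squares formulas for the slope and intercept and then confirms that the residuals average to zero.

```python
import numpy as np

# Hypothetical data, invented for illustration only (not the course data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

def fit_slr(x, y):
    """Return least-squares estimates (b0, b1) for y = b0 + b1*x + error."""
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Intercept: forces the fitted line through the point (x_bar, y_bar).
    b0 = y_bar - b1 * x_bar
    return b0, b1

b0, b1 = fit_slr(x, y)
y_hat = b0 + b1 * x        # predicted responses
residuals = y - y_hat      # r = y - y_hat

print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
print(f"mean residual = {residuals.mean():.2e}")
```

Plotting `residuals` against `y_hat`, or against the order of collection, gives the residual plots that the assumption checks call for.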
Example 1
Consider data on 15 American women collected by a researcher as follows:
We can fit a model of the form Weight = β0 + β1 Age + ϵ to the data.

Scatter Plot of Weight vs Age
Line of best fit

Model Estimation and Result
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
$\hat{\beta}_1 = r \cdot \frac{s_y}{s_x}$
where r is the sample correlation between x and y, and $s_x$ and $s_y$ are their sample standard deviations.
The estimated regression line is Weight = -87.52 + 3.45 Age
Can you interpret these results?

Description/Interpretation
The above results can be interpreted as follows:
- A Sig. (p-value) reported as 0.000 (that is, p < 0.001) indicates that the model is statistically significant. It means Age makes a significant contribution to explaining the variability in the weights of the women.
- The value of β1 (slope = 3.45) indicates a positive relationship between weight and age. The slope coefficient indicates that for every additional unit increase in age, we can expect weight to increase by an average of 3.45 kilograms.
- R indicates that there is a high association between the DV and the predictor variable. An R-Squared value of 0.991 means that 99% of the variability in the weight of the women is explained by the model.

Prediction
Using the regression model above, we can predict the weight of a woman who is 75 years old:
Weight = -87.52 + 3.45(75)
Weight = 171 (approximately)
Exercise:
- Using the SLR model above, predict the weight of a woman whose age is 82. Ans: 195kg

MULTIPLE REGRESSION
Frequently there are many predictors that we want to use simultaneously.
Multiple linear regression model:
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \epsilon_i$
This is similar to simple linear regression, except that now there is more than one explanatory variable.
In this situation each $\beta_j$ represents the partial slope of the predictor $x_j$, $j = 1, \dots, k$.
It can be interpreted as “the mean change in the response variable for one unit of change in the predictor variable while holding the other predictors in the model constant.”
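
The "partial slope" interpretation of the multiple regression coefficients can be checked with a small sketch. This is only an illustration under stated assumptions: it uses NumPy's `numpy.linalg.lstsq`, the data are hypothetical and noise-free (generated from y = 1 + 2*x1 - 3*x2, so the fit should recover those values), and `predict` is a hypothetical helper, not part of the course material.

```python
import numpy as np

# Hypothetical, noise-free data generated from y = 1 + 2*x1 - 3*x2.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = 1.0 + 2.0 * x1 - 3.0 * x2

# Design matrix: a column of ones for the intercept, then one column per predictor.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares solution of X @ beta ~= y.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = beta

def predict(x1_new, x2_new):
    """Fitted mean response at the given predictor values."""
    return b0 + b1 * x1_new + b2 * x2_new

# Partial slope: raise x1 by one unit while holding x2 fixed at 2.0.
delta = predict(4.0, 2.0) - predict(3.0, 2.0)

print("estimated coefficients:", np.round(beta, 4))
print("change in prediction for +1 in x1, x2 held fixed:", round(delta, 4))
```

The change in the prediction equals the coefficient on x1 regardless of where x2 is held, which is exactly the "holding other predictors constant" reading.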
Example 2
Suppose the researcher in Example 1 above is interested in knowing whether height also contributes to change in weight:

Step 1: Scatterplots

Model Estimation with SPSS

Multiple Regression
The new model is therefore written as
Weight = β0 + β1 Age + β2 Height + Error
So, the fit is:
Weight = -81.53 + 3.46 Age - 1.11 Height

Model Interpretation
The result of the model estimation above shows that height does not contribute to the variability in the weight of the women. The high p-value shows that it is not statistically significant (changes in height are not associated with changes in weight). No statistically significant linear dependence of the mean of weight on height was detected.
Note that the values of R-Squared and Adjusted R-Squared did not decrease as we added the additional independent variable.
For every one unit increase in age, average weight increases by 3.46 units, while holding height constant.

Model Diagnostic and Residual Analysis
A residual is a measure of the variability in the response variable not explained by the regression model.
Analysis of the residuals is an effective way to discover several types of violations of the model assumptions.
Plotting residuals is a very effective way to investigate how well the regression model fits the data.
A residual plot is used to check the assumption of constant variance and to check model fit (can the model be trusted?).

Diagnostics: Residual Plot
The residuals should fall in a symmetrical pattern and have a constant spread throughout their range.
Good residual plot: no pattern.

We Can Plot:
Residual vs Independent Variable(s)
Residual vs Predicted values
Residual vs Order of the data
Residual Lag Plot
Histogram of Residual
Standardized Residual vs Standardized Predicted Value
etc.

Residuals
Column 3 in the table below shows the residuals of the regression model: Weight = -87.52 + 3.45 Age
A residual is the deviation between the data and the fit.
(Actual Y - Predicted Y)

Residual Diagnostics: Very Important!
Left: Residuals show non-constant variance.
Right: Residuals show a non-linear pattern.

Look at the Figures Below, What Do You Think?

Residual Plot

Residual Plots

What if the Assumptions Are Not Met?
Linearity: Transform the dependent variable (see next slide)
Normality: Transform the data (also when an outlier is present), or use robust regression, where normality is not required. Increase the sample size, if possible
Homogeneity: Try transforming the data

Some Tips on Transformation
Log Y
- Used if Y is positively skewed and has positive values.
√Y
- If Y has a Poisson distribution (is count data)
1/Y
- If the variance of Y is proportional to the 4th power of E(Y)
sin⁻¹(√Y)
- Used if Y is a proportion or rate

Multicollinearity
A common problem in multiple regression that develops when one or more of the independent variables is highly correlated with one or more of the other independent variables.
How the explanatory variables relate to each other is fundamental to understanding their relationship with the response variable.
Usually, when you see estimated beta weights larger than 1 in any regression analysis, consider the possibility of multicollinearity.
Multicollinearity can be mild or severe (depending on how high the correlations are, or whether VIFs are above 10).

Effects on P-Values
You will get different p-values for the same variables in different regressions as you add/remove other explanatory variables.
A variable can be significantly related to Y by itself, but not be significantly related to Y after accounting for several other variables. In that case, the variable is viewed as redundant.
If all the X variables are correlated, it is possible that ALL the variables may be insignificant, even if each is significantly related to Y by itself.

Multicollinearity Effect on Coefficients
Similarly, coefficients of individual explanatory variables can change depending on what other explanatory variables are present.
They may change signs sporadically.
They may be excessively large when there is multicollinearity.

Multicollinearity Isn’t Tragic
In most practical datasets there will be some degree of multicollinearity. If the degree of multicollinearity isn’t too bad (more on its assessment in the next slides), then it can be safely ignored.
If you have serious multicollinearity, then your goals must be considered and there are various options.
In what follows, we first focus on how to assess multicollinearity, then on what to do about it should it be found to be a problem.

Assessing Multicollinearity: Two Methods
There is typically some degree of multicollinearity in most experiments. We discuss two methods for assessing multicollinearity in this course:
(1) Correlation matrix
(2) Variance Inflation Factor (VIF)

Correlation Matrices
A correlation matrix is simply a table indicating the correlations between each pair of explanatory variables.
If you haven’t seen it before, the correlation between two variables is simply the square root of R², combined with a sign indicating a positive or negative association.
If you see values close to 1 or -1, that indicates the variables are strongly associated with each other and you may have multicollinearity problems.
If you see many correlations all greater in absolute value than 0.7, you may also have problems with your model.

Correlation Matrix
A cursory look at the correlation matrix of the independent variables shows whether there is multicollinearity in our experiment.

Correlation Matrix Involving the DV
Can help to form a preliminary idea of the bivariate association of the dependent variable with the independent variables.

       GDP    FER    MS     CE     TR     ER
GDP    1      0.99   0.95   0.61   0.14   0.87
FER    0.99   1      0.92   0.55   0.08   0.85
MS     0.95   0.92   1      0.76   0.09   0.76
CE     0.61   0.55   0.76   1      0.05   0.46
TR     0.14   0.09   0.09   0.05   1      0.36
ER     0.87   0.85   0.76   0.46   0.36   1

Disadvantages of Using Correlation Matrices
Correlation matrices only work with two variables at a time.
Thus, we can only see pairwise relationships. If a more complicated relationship exists, the correlation matrix won’t find it.
Multicollinearity is not a bivariate problem.
Use VIFs!

Variance Inflation Factors (VIFs)
Variance inflation factors measure the relationship of all the variables simultaneously, thus they avoid the “two at a time” disadvantage of correlation matrices.
They are harder to explain.
There is a VIF for each variable.
Loosely, the VIF is based on regressing each variable on the remaining variables. If the remaining variables can explain the variable of interest, then that variable has a high VIF.

Using VIFs
The use of the variance inflation factor is the most reliable way to examine multicollinearity.
$\mathrm{VIF}_j = \frac{1}{\mathrm{Tolerance}_j} = \frac{1}{1 - R_j^2}$
where $R_j^2$ is the R² from regressing the jth independent variable on the other independent variables.
Tolerance is the proportion of variance in the independent variable not explained by its relationship with the other independent variables.
In practice, all VIFs are greater than 1.
VIFs are considered “bad or severe” if they exceed 10.

So Multicollinearity is an Issue – What Do You Do About It?
Remember, if multicollinearity is present but not excessive (no high correlations, no VIFs above 10), you can ignore it.
If multicollinearity is a big issue in your dataset, your goal becomes extremely important.

Variance Inflation Factor
VIFs of 16.545 and 17.149 are ‘severe’.

If Your Goal is Prediction…
With severe multicollinearity everything fails, except if your goal is just prediction.
If your main goal is prediction (using the available explanatory variables to predict the response), then you can safely ignore the multicollinearity.

If Interest Centers on the Real Relationships Between the Variables…
When you have serious multicollinearity, the variables are sufficiently redundant that you cannot reliably distinguish their effects.
There is no single solution for this problem.
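
The VIF definition given earlier (regress each explanatory variable on the remaining ones and take 1/(1 - R²)) can be sketched as follows. This is a minimal illustration, assuming NumPy is available; the helper names `r_squared` and `vif` and the simulated predictors are hypothetical and are not the course's SPSS output. Here `x2` is constructed as a near copy of `x1`, so those two should show severe VIFs while the unrelated `x3` should not.

```python
import numpy as np

def r_squared(y, X):
    """R^2 from regressing y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    sse = np.sum(resid ** 2)              # unexplained variation
    sst = np.sum((y - y.mean()) ** 2)     # total variation
    return 1.0 - sse / sst

def vif(X):
    """VIF_j = 1 / (1 - R_j^2) for each column j of the predictor matrix X."""
    values = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)  # all predictors except column j
        values.append(1.0 / (1.0 - r_squared(X[:, j], others)))
    return np.array(values)

# Hypothetical predictors: x2 is nearly a copy of x1, x3 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 0.05 * rng.normal(size=50)
x3 = rng.normal(size=50)
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
for name, value in zip(["x1", "x2", "x3"], vifs):
    flag = "severe" if value > 10 else "ok"
    print(f"{name}: VIF = {value:.2f} ({flag})")
```

The threshold of 10 in the printout follows the rule of thumb stated in the slides; other cutoffs are sometimes used.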
Some Tips:
Drop one of the ‘offending’ variables from the regression equation, although often the variables are so intertwined that you cannot distinguish them.
Combine the collinear variables. For example, if in a sociological study you find that the variables “father’s education level” and “mother’s education level” are strongly related, it may be sufficient to simply use one variable, “parent’s education level”, which is some function of the two parents.
Sometimes, you may not be able to disentangle your explanatory variables.

Dealing with Multicollinearity
In many situations you get to select some of the explanatory variables (in engineering studies you often get to select almost all of them; in medical studies you can select drug dosage).
You can use centered independent variables.
Use Ridge Regression or PCA (see Montgomery et al., 2006).
Use one of the analytic procedures like LISREL (see Adelodun and Awe, 2013).

Note…
Most importantly, make sure you set up your experiments in a way that you do not “install” multicollinearity.
Since multicollinearity diagnostics are so easy to obtain (through statistical packages), no researcher should ever report results of regressions with obvious multicollinearity problems!

SHORT QUIZ:
Consider the Regression Model Below:
$y_i = \alpha + \beta_1 x_i + \epsilon_i$
Which values are known/unknown?
Which are data, which are parameters?
Which term is the slope? Intercept?
What are the common assumptions about the error structure (fill in the blanks): $\epsilon_i \sim$ ___(___, ____)
What is the difference between $\beta_1$ and $\hat{\beta}_1$?

Let’s have a break for a few minutes!

A Brief Review…
You have several explanatory variables and a single response Y.
You run the multiple regression first and check the residuals and collinearity diagnostic measures.
If the residuals look bad, deal with those first (you may need a transformation or to fit a polynomial).
Now suppose you have decent residuals…

With Decent Residuals…
Check the collinearity measures.
If these are problematic (any VIF above 10 or high correlations), then you must start removing or combining variables before you can trust the output. This tends to be a substantive, not a statistical, task.
The variables with the highest VIFs can be targeted first for deletion. They are the “most redundant”.
After you do anything, remember to check the residuals again.
Now suppose you have decent residuals and collinearity measures…
Look at the p-values and R². If these are significant, stop. Otherwise continue to the next slide…

Model Selection Procedures
“Model selection” refers to determining which of the explanatory variables should be placed in a final model.
Usually, we want a parsimonious model, that is, a model which describes the response well with as few predictors as possible and is as free from multicollinearity as possible.
"All models are wrong, but some are useful." So said the statistician George Box.

Variable Subset Selection Uses Statistical Criteria to Identify a Set of Predictors for Our Model
Variable subset selection: Among a set of potential predictors, choose a subset to include in the model based on some statistical criterion, e.g. p-values.
Forward selection: Add variables one at a time, starting with the x most strongly associated with y. Stop when no other ‘significant’ variables are identified.
Drawback: Variables added to the model cannot be taken out again.

Variable Subset Selection Continued
Backwards elimination: Start with every candidate predictor in the model. Remove variables one at a time until all remaining variables are “significantly” associated with the response.
Drawback: Variables taken out of the model cannot be added back.
Stepwise selection: As forward selection, but at each iteration remove variables which are made obsolete by new additions. It is a combination of the forward and backward methods.

A Recap on Meaning and Interpretation of Regression Results
Let us review and familiarize ourselves with the meaning of each entity that appears in the regression results.
Note that the regression procedures in SPSS and JMP are similar. All the estimates and analyses can be done easily using statistical packages.

Coefficient of Multiple Determination
The coefficient of multiple determination, $R^2$, is the percent of variation in the response y explained by the set of explanatory variables $x_1, \dots, x_{p-1}$.
$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$, with $0 \le R^2 \le 1$ (closer to 1 is a better model)
The adjusted coefficient of determination, $R^2_{adj}$, introduces a penalty for more explanatory variables (it takes the sample size into account, and so is more reliable).
$R^2_{adj} = 1 - \frac{SSE/(n-p)}{SST/(n-1)}$
Here SSR, SSE and SST are the regression, error and total sums of squares shown in the ANOVA table below, n is the sample size, and p is the number of regression parameters including the intercept.

Interpretation of Terms…
P-value: The p-value in a regression provides a test of whether that variable is significantly related to Y, after accounting for everything else.
ANOVA is used to evaluate the overall model significance.
Standard error: a measure of the uncertainty of an estimate; it measures the variability of the actual Y values from the predicted Y.
R is the correlation, which measures how the variables move in association with each other.

ANOVA Table for Simple Linear Regression
Source | SS | df | MS | F | P-value
Regression | $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ | 1 | $MSR = SSR/1$ | $F = MSR/MSE$ | $P(F > F_{1-\alpha;1,n-2})$
Error | $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ | n-2 | $MSE = SSE/(n-2)$ | |
Total | $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$ | n-1 | | |
The F-test tests whether there is a linear relationship between the two variables (used in determining if the model is significant).
Null Hypothesis $H_0: \beta_1 = 0$
Alternative Hypothesis $H_a: \beta_1 \neq 0$

Illustrative Example: Brief SPSS Demo
Please wait patiently for a brief SPSS Demo involving Example 3.

Example 3: Practical
Suppose a researcher is interested in measuring the effect of several economic indicators on the GDP of a particular country in Africa (say Nigeria). He may specify the Multiple Linear Regression model as follows:
$GDP_i = \beta_0 + \beta_1 FER_i + \beta_2 MS_i + \beta_3 CE_i + \beta_4 TR_i + \beta_5 ER_i + \epsilon_i$
where i = 1, …, 50. See the data and estimation of this model in the demo section soon.
Where…
GDP = Gross Domestic Product (Y)
FER = Foreign Exchange Reserve (X1)
MS = Money Supply (X2)
CE = Capital Expenditure (X3)
TR = Treasury Bill Rate (X4)
ER = Exchange Rate (X5)
ε = Stochastic Error
After inputting the data into SPSS/PASW, click on Analyze-Regression-Linear…

MODEL DIAGNOSTIC AND INTERPRETATION
Look at the Following (SPSS) Regression Output: Can You Diagnose and Interpret these Results?

Residuals Plots for final model
Histogram of Residuals.

Some Lessons…
A high R² value does not always indicate a good model!
Always check your residuals after each analysis.
If you notice non-random patterns in your residuals, it means that your model is missing something. Possibilities include:
- A missing variable.
- A missing higher-order term of a variable in the model to explain the curvature.
- A missing interaction between terms already in the model.
- etc.
While trying to fit a parsimonious model, these possibilities can be explored further and figured out by the researcher.

Some References
Sullivan, Michael III (2004). Statistics: Informed Decisions Using Data. Upper Saddle River, New Jersey: Pearson Education.
Kutner, Michael H., Nachtsheim, Christopher J., Neter, John and Li, William (2005). Applied Linear Statistical Models. New York: McGraw-Hill Irwin.
Gordon, Robert A. (1968). Issues in multiple regression. American Journal of Sociology, Vol. 73, pp. 592-616.
Schroeder, Mary Ann (1990). Diagnosing and Dealing with Multicollinearity. Western Journal of Nursing Research, 12(2), 175-187.
“Multicollinearity”. Dr. Bunty Ethington, EDPR 7/8542. University of Memphis.
Montgomery et al. (2006). Introduction to Linear Regression Analysis. 3rd Ed. Wiley Series.
Awe et al. (2013). Regression Model Diagnostic, Test and Robustification in the Presence of Multicollinear Covariates. International Journal of Electronic and Computer Research (India), Vol. 2(2).
Adelodun, A. A. and Awe, O. O. (2013). Using LISREL for Empirical Research. Transnational Journal of Science and Technology (Macedonia), Vol. 3(8), pp. 1-14.
www.google.com/search?q=prediction&newwindow.
www.lisa.stat.vt.edu/

Acknowledgement
Thanks to the following:
Dr. Eric Vance
Dr. Chris Franck
Tonya Pruitt