Report

Simple Linear Regression and Correlation (continued)

Reference: Chapter 17 of Statistics for Management and Economics, 7th Edition, Gerald Keller.

17.4 Error Variable: Required Conditions

• The error ε is a critical part of the regression model.
• Four requirements involving the distribution of ε must be satisfied:
  – The probability distribution of ε is normal.
  – The mean of ε is zero: E(ε) = 0.
  – The standard deviation of ε is σ_ε for all values of x.
  – The errors associated with different values of y are all independent.

Observational and Experimental Data

• Observational data: Y and X are both random variables.
  – Example: Y = return, X = inflation.
  – Models: regression; correlation (bivariate normal).
• Experimental data: Y is a random variable, X is controlled.
  – Example: Y = blood pressure, X = medicine dose.
  – Models: regression.

17.5 Assessing the Model

• For our assumed model, the least squares method produces a regression line whether or not there is a linear relationship between x and y.

[Figure: two scatter plots of Y (0–40) against X (0–12); a least squares line can be fitted in either case.]

• Consequently, since we have assumed a linear model, it is important to assess how well it fits the data.
• Several methods are used to assess the model. All are based on the sum of squares for errors, SSE.

Sum of Squares for Errors

• SSE is the sum of squared differences between the observed points and the points on the regression line:

    SSE = Σ_{i=1}^{n} (y_i − ŷ_i)²

• It can serve as a measure of how well the line fits the data.

Standard Error of Estimate, σ_ε

• The standard deviation of the error variable, σ_ε, describes the dispersion around the true line for a given x.
• If σ_ε is large, there is wide dispersion around the true line; if σ_ε is small, the observations tend to be close to the line and the model fits the data well.
• Therefore, we can use σ_ε as a measure of the suitability of a linear model.
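As a concrete illustration of the SSE formula above, the following sketch fits a least squares line to a small set of made-up points and computes SSE. The data values are illustrative only, not from the text; NumPy is assumed to be available.

```python
import numpy as np

# Illustrative toy data (not from the text)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least squares fit: polyfit(deg=1) returns the slope b1 and intercept b0
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x          # points on the regression line

# SSE: sum of squared differences between observed and fitted values
sse = np.sum((y - y_hat) ** 2)
```

SSE by itself is scale-dependent, which is why the handout goes on to rescale it into the standard error of estimate.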
• σ_ε is not known, so we use an estimate of it: the standard error of estimate, s_ε, given by

    s_ε = √( SSE / (n − 2) )

The Food Company Example

[Figure: scatter plot of SALES (about 100–200) against ADVER (about 200–1200).]

Earlier: Inference About μ (when σ² is unknown)

• Model: X ~ N(μ, σ²)
• Hypotheses: H0: μ = 50 against H1: μ ≠ 50
• Test statistic (if H0 is true): t = (X̄ − μ) / (s/√n) ~ t_{n−1}
• Level of significance: α
• Rejection region: reject H0 if t_obs > t_crit or t_obs < −t_crit
• Observation: t_obs
• Conclusion:
  – if t_obs > t_crit or t_obs < −t_crit: reject H0
  – if −t_crit < t_obs < t_crit: do not reject H0
• Interpretation: whether we have empirical support for the hypothesis.

Testing the Slope

• We test whether there is a slope; formally:

    H0: β1 = 0
    H1: β1 ≠ 0

• Test statistic:

    t = (b1 − β1) / s_b1 ~ t_{n−2},  where  s_b1 = s_ε / √( (n − 1) s_x² )

  Under H0 this becomes t = b1 / s_b1.

• Confidence interval: b1 ± t_{α/2, n−2} · s_b1

Coefficient of Determination

• To measure the strength of the linear relationship we use the coefficient of determination. It is a measure of how much of the variation in Y is explained by the variation in X.
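The slope test just described can be sketched numerically. This minimal sketch assumes NumPy and uses the food-company ADVER/SALES data tabulated in this document; the hard-coded critical value 2.365 is t at α/2 = 0.025 with 7 degrees of freedom, taken from a t table.

```python
import numpy as np

# Food-company data (ADVER = x, SALES = y), as tabulated in this document
x = np.array([276, 552, 720, 648, 336, 396, 1056, 1188, 372], dtype=float)
y = np.array([115.0, 135.6, 153.6, 117.6, 106.8, 150.0, 164.4, 190.8, 136.8])

n = len(x)
b1, b0 = np.polyfit(x, y, 1)                 # least squares slope and intercept
y_hat = b0 + b1 * x
sse = np.sum((y - y_hat) ** 2)

s_eps = np.sqrt(sse / (n - 2))               # standard error of estimate
# (n - 1) * s_x^2 equals the sum of squared deviations of x
s_b1 = s_eps / np.sqrt(np.sum((x - x.mean()) ** 2))  # std. error of the slope

t = b1 / s_b1                                # test statistic under H0: beta1 = 0

t_crit = 2.365                               # t_{0.025, 7}, two-sided, alpha = 0.05
reject_h0 = abs(t) > t_crit                  # True: slope significantly non-zero
ci = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)  # 95% confidence interval for beta1
```

Small differences from the hand calculation in the handout come only from rounding of the intermediate values there.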
(How many % of the variation in Y can be explained by the model) 13 Food Company example: Call X=ADVER and Y=SALES i xi yi y yi 1 2 3 4 5 6 7 8 9 276 552 720 648 336 396 1056 1188 372 115.0 135.6 153.6 117.6 106.8 150.0 164.4 190.8 136.8 Total 5544 1270.6 Mean 616 yi y 2 -26.177778 685.27605 -5.577778 31.11160 12.422222 154.31160 -23.577778 555.91160 -34.377778 1181.83160 8.822222 77.83160 23.222222 539.27160 49.622222 2462.36494 -4.377778 19.16494 yˆ i 118.0087 136.8165 148.2648 143.3584 122.0974 126.1860 171.1613 180.1563 124.5506 5707.076 yi yˆ i 2 9.052434 1.479981 28.464554 663.494881 234.009911 567.104757 45.714584 113.288358 150.048384 1812.658 141.178 SST yi y 2 2 SSE yi yˆ i 2 SSR yˆ i yˆ 2 2 2 ˆ ˆ y y y y y y i i i i SST SSE SSR 14 Standard error of the estimate se Standard deviation of SSE 1812.658 16.09196 n2 9-2 b1 sb1 se (n 1) s 2 x 16.09196 (9 - 1) 104832 0.01757183 H 0 : 1 0 H 1 : 1 0 b1 0.06814 T st at ist ic t 3.877798 s b1 0.01757183 Degree of freedom 9 2 7 R2 1 SSE 1812.658 1 0.6823841 SST 5707.076 15 Coefficient of determination The regression model Overall variability in y The error Variation in the dependent variable Y = variation explained by the independent variable + unexplained variation SST=SSR+SSE The greater the explained variable, better the model Co-efficient of determination is a measure of explanatory power 16 of the model 17 Correlations ADVER ADVER SALES Pearson Correlation Sig . (2-tailed) N Pearson Correlation Sig . (2-tailed) N 1 , 9 ,826** ,006 9 SALES ,826** ,006 9 1 , 9 **. Correlation is sig nificant at the 0.01 level (2-tailed). 18 Model Summary Model 1 R ,826a R Sq uare ,682 Adjusted R Sq uare ,637 Std. Error of the Estimate 16,0867 a. Predictors: (Constant), ADVER 19 ANOVAb Model 1 Reg ression Residual Total Sum of Squares 3885,156 1811,484 5696,640 df 1 7 8 Mean Square 3885,156 258,783 F 15,013 Sig . ,006a a. Predictors: (Constant), ADVER b. 
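The sum-of-squares decomposition above can be checked numerically. A minimal sketch (assuming NumPy) using the data from the table:

```python
import numpy as np

# ADVER (x) and SALES (y) from the table above
x = np.array([276, 552, 720, 648, 336, 396, 1056, 1188, 372], dtype=float)
y = np.array([115.0, 135.6, 153.6, 117.6, 106.8, 150.0, 164.4, 190.8, 136.8])

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total variation in y
ssr = np.sum((y_hat - y.mean()) ** 2)  # variation explained by the model
sse = np.sum((y - y_hat) ** 2)         # unexplained variation

r2 = 1 - sse / sst                     # coefficient of determination
```

The identity SST = SSR + SSE holds exactly (up to floating point rounding), and R² reproduces the 0.682 reported in the output.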
Dependent Variable: SALES

Coefficientsa
            Unstandardized Coefficients   Standardized Coefficients
Model 1     B           Std. Error        Beta       t        Sig.
(Constant)  99.273      12.077                       8.220    .000
ADVER       6.806E-02   .018              .826       3.875    .006
a. Dependent Variable: SALES

Cause and Effect

• Note that conclusions about cause and effect, X → Y, are based on knowledge of the subject.
• Experimental studies can decide the question; with observational studies it is hard to say.
  – Example: is smoking → lung cancer true?
• The regression model only shows the linear relationship; we will make the same inferences even if we switch the variables!

Coefficientsa
            Unstandardized Coefficients   Standardized Coefficients
Model 1     B           Std. Error        Beta       t        Sig.
(Constant)  −798.855    370.905                      −2.154   .068
SALES       10.020      2.586             .826       3.875    .006
a. Dependent Variable: ADVER

Do the analysis in the right order: first the theory of the subject, then the statistical model!
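The point that switching the variables leads to the same inference can be illustrated with a short sketch (NumPy assumed; the helper function slope_t_and_r2 is hypothetical, introduced only for this example):

```python
import numpy as np

x = np.array([276, 552, 720, 648, 336, 396, 1056, 1188, 372], dtype=float)   # ADVER
y = np.array([115.0, 135.6, 153.6, 117.6, 106.8, 150.0, 164.4, 190.8, 136.8])  # SALES

def slope_t_and_r2(x, y):
    """Regress y on x; return the slope, its t statistic, and R^2."""
    n = len(x)
    b1, b0 = np.polyfit(x, y, 1)
    resid = y - (b0 + b1 * x)
    sse = np.sum(resid ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    s_b1 = np.sqrt(sse / (n - 2)) / np.sqrt(np.sum((x - x.mean()) ** 2))
    return b1, b1 / s_b1, 1 - sse / sst

b_yx, t_yx, r2_yx = slope_t_and_r2(x, y)  # SALES regressed on ADVER
b_xy, t_xy, r2_xy = slope_t_and_r2(y, x)  # ADVER regressed on SALES

# The slopes differ, but the t statistic and R^2 are identical either way,
# so the significance test says nothing about the direction of causation.
```

A useful side note: the product of the two slopes equals R², since b_yx · b_xy = S_xy² / (S_xx · S_yy).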