CHAPTER 8: LINEAR REGRESSION
By Dara Lee and Michelle Smith, Period 1

LINEAR MODEL
The linear model is just the equation of a straight line through the data.
The points in a scatterplot don’t all line up, but a straight line can summarize the general pattern.
The model can help us understand how the variables are associated.

RESIDUALS
An estimate from a model is called the predicted value (ŷ).
The difference between the observed value (y) and the predicted value (ŷ) is called the residual (e).
Residual = Observed - Predicted (e = y - ŷ)
A negative residual means the predicted value is too big.
A positive residual means the predicted value is too small.
To see if a linear model is appropriate, check the residuals plot. It should be scattered with no interesting features: no direction, no shape, no bends, and no outliers.

LINE OF BEST FIT
The line of best fit is the line for which the sum of the squared residuals is the smallest. It is also known as the “least squares line.”
Squaring the residuals makes them all positive so they can be summed. It also emphasizes the largest residuals.
The smaller the sum, the better the fit.
Equation of the line: ŷ = b0 + b1x
Slope: b1 = r(sy/sx)
Intercept: b0 = ȳ - b1x̄

CORRELATION AND THE LINE
The equation for a line that passes through the origin can be written with just a slope and no intercept: y = mx
When both variables are standardized, the coordinates of the points aren’t written as (x, y); their coordinates are z-scores: (zx, zy).
In z-scores, the line of best fit passes through the origin with slope r: predicted zy = r × zx
For every horizontal change of sx there is a vertical change of r × sy.
Moving one standard deviation away from the mean in “x” moves our estimate “r” standard deviations away from the mean in “y.”
In general, moving any number of standard deviations in “x” moves our estimate “r” times that number of standard deviations in “y.”

HOW BIG CAN PREDICTED VALUES GET?
Because “r” is never larger than 1 in absolute value, each predicted “y” tends to be closer to its mean (in standard deviations) than its corresponding “x” was.
This property of the linear model is called regression to the mean; the line is called the regression line.
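The slope and intercept formulas, the residual definition, and regression to the mean can be tried out on a few numbers. The Python sketch below uses a small made-up data set (it is not from the chapter), and the function names are illustrative choices only.

```python
from statistics import mean, stdev

# Hypothetical data, invented purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]


def correlation(xs, ys):
    """Correlation r: sum of the products of z-scores divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = [((a - mx) / sx) * ((b - my) / sy) for a, b in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


def least_squares_line(xs, ys):
    """Return (b0, b1) using b1 = r(sy/sx) and b0 = y-bar - b1 * x-bar."""
    r = correlation(xs, ys)
    b1 = r * stdev(ys) / stdev(xs)
    b0 = mean(ys) - b1 * mean(xs)
    return b0, b1


r = correlation(x, y)
b0, b1 = least_squares_line(x, y)

# Residual = Observed - Predicted
predicted = [b0 + b1 * a for a in x]
residuals = [obs - pred for obs, pred in zip(y, predicted)]

# Regression to the mean: the z-score of each predicted y is r times the
# z-score of its x, so it is never farther from the mean than x was.
zx = [(a - mean(x)) / stdev(x) for a in x]
z_pred = [(p - mean(y)) / stdev(y) for p in predicted]

print(f"r = {r:.4f}, b0 = {b0:.4f}, b1 = {b1:.4f}")
print("residuals:", [round(e, 4) for e in residuals])
```

For a least squares line the residuals add up to zero (up to rounding), which is one quick check that the slope and intercept were computed correctly.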
R²: THE VARIATION ACCOUNTED FOR
“r” is the correlation between two variables. The greater the absolute value of the correlation, the stronger the linear association.
The squared correlation, R², gives the fraction of the data’s variation accounted for by the model, and 1 - R² is the fraction of the original variation left in the residuals.
An R² of 0 means that none of the variance in the data is in the model; all of it is still in the residuals.
Squaring the residuals ensures that all are positive so that they can be added to find the line of best fit. The smaller the sum, the better the fit.

ASSUMPTIONS AND CONDITIONS
Quantitative Variables Condition: Both variables must be quantitative, not categorical.
Straight Enough Condition: The scatterplot must look reasonably straight. Linearity can be checked again after the regression, when the residuals can be examined.
Outlier Condition: No point should stand apart from the rest. To spot outliers, check the residuals; outliers may have large residuals. Outliers can dramatically change a regression model.

CHAPTER 8 PROBLEM #33
Classified ads in the Ithaca Journal offered several used Toyota Corollas for sale. Listed below are the ages of the cars and the advertised prices.

Age (yr)   Price Advertised ($)
1          12995, 10950
2          10495
3          10995, 10995
4          6995, 7990
5          8700, 6995
6          5990, 4995
9          3200, 2250, 3995
11         2900, 2995
13         1750

a) Find the equation of the regression line.
b) Explain the meaning of the slope of the line.
c) Explain the meaning of the intercept of the line.
d) If you want to sell a 7-year-old Corolla, what price seems appropriate?
e) You have a chance to buy one of two cars. They are about the same age and appear to be in equally good condition. Would you rather buy the one with a positive residual or a negative residual? Explain.
f) You see a “For Sale” sign on a 10-year-old Corolla stating the asking price as $1500. What is the residual?
g) Would this regression model be useful in establishing a fair price for a 20-year-old car? Explain.
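One way to work parts (a), (d), and (f) numerically, and to see R² as the fraction of variation accounted for, is sketched below in Python. The data come from the table in the problem; the names `fit_line`, `predict`, and `r_squared` are illustrative helpers, not anything defined in the chapter.

```python
# Ages (years) and advertised prices ($) copied from the Problem #33 table.
ads = {
    1: [12995, 10950],
    2: [10495],
    3: [10995, 10995],
    4: [6995, 7990],
    5: [8700, 6995],
    6: [5990, 4995],
    9: [3200, 2250, 3995],
    11: [2900, 2995],
    13: [1750],
}
# One (age, price) pair per advertised car.
ages = [age for age, listed in ads.items() for _ in listed]
prices = [price for listed in ads.values() for price in listed]


def fit_line(xs, ys):
    """Least squares (intercept, slope) for predicting ys from xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


b0, b1 = fit_line(ages, prices)


def predict(age):
    """Predicted price for a car of the given age."""
    return b0 + b1 * age


def r_squared(xs, ys):
    """1 - (sum of squared residuals) / (total variation in ys)."""
    y_bar = sum(ys) / len(ys)
    sse = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst


price_at_7 = predict(7)                # part (d)
residual_at_10 = 1500 - predict(10)    # part (f): observed - predicted
fit_quality = r_squared(ages, prices)

print(f"predicted price = {b0:.1f} + ({b1:.1f}) * age")
print(f"7-year-old: {price_at_7:.2f}")
print(f"residual for $1500 at 10 years: {residual_at_10:.2f}")
print(f"R^2 = {fit_quality:.3f}")
```

Because the code keeps the unrounded slope and intercept, its results can differ by a cent or so from answers worked with rounded coefficients.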
CHAPTER 8 PROBLEM #33
a) Predicted price = 12319.6 - 924 × age (years)
b) Each additional year of age is associated with a decrease of about $924 in average advertised price.
c) The model predicts that a new (0-year-old) Corolla costs about $12,319.60.
d) $5851.60
e) Negative residual. Its price is below the predicted value for its age.
f) -$1579.60
g) No. A 20-year-old car is well beyond the ages in the data, and after about age 13 the model predicts negative prices, so the relationship can no longer be linear that far out.

CHAPTER 8 PROBLEM #37
Here are the data used when the association between the amounts of fat and calories in hamburgers was examined.

Fat (g)    19   31   34   35   39   39   43
Calories   410  580  590  570  640  680  660

When a scatterplot was made, the equation of the regression line was calculated to be:
Predicted calories = 211 + 11.06 × fat (g)
a) Explain why you cannot use that model to estimate the fat content of a burger with 600 calories.
b) Using an appropriate model, estimate the fat content of a burger with 600 calories.

CHAPTER 8 PROBLEM #37
a) The regression was for predicting calories from fat, not the other way around.
b) Predicted fat (g) = -15 + 0.083 × calories
For 600 calories: -15 + 0.083 × 600 = 34.8 grams of fat.
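The point of Problem #37 is that the regression of calories on fat and the regression of fat on calories are two different lines. A minimal Python sketch of that idea, using the burger data from the problem, is below; `fit_line` is an illustrative helper name, not something from the chapter.

```python
# Burger data copied from Problem #37.
fat = [19, 31, 34, 35, 39, 39, 43]
calories = [410, 580, 590, 570, 640, 680, 660]


def fit_line(xs, ys):
    """Least squares (intercept, slope) for predicting ys from xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Model 1: predict calories from fat (the line given in the problem).
cal_b0, cal_b1 = fit_line(fat, calories)

# Model 2: predict fat from calories (the model part (b) calls for).
fat_b0, fat_b1 = fit_line(calories, fat)

# Estimate fat for a 600-calorie burger with the appropriate model.
fat_at_600 = fat_b0 + fat_b1 * 600

# Solving Model 1 backwards would use slope 1 / cal_b1, which is not
# the least squares slope for predicting fat.
inverted_slope = 1 / cal_b1

print(f"calories = {cal_b0:.2f} + {cal_b1:.2f} * fat")
print(f"fat = {fat_b0:.2f} + {fat_b1:.4f} * calories")
print(f"fat at 600 calories: {fat_at_600:.1f} g")
print(f"slope from inverting Model 1: {inverted_slope:.4f}")
```

With unrounded coefficients the estimate comes out near 35 grams, slightly above the 34.8 grams obtained from the rounded equation in the answer.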