Chapter 2 Lecture Slides
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Chapter 2: Summarizing Bivariate Data

Introduction
• Often, scientists and engineers collect data in order to determine the nature of the relationship between two quantities.
• An example is the heights and forearm lengths of men.
• Data that consist of ordered pairs are called bivariate data.
• In many cases, ordered pairs tend to cluster around a straight line when plotted.
• The summary statistic most often used to measure the closeness of association between two variables is the correlation coefficient.
• When two variables are closely related to each other, it is often of interest to predict the value of one of them when given the value of the other. This is done with the equation of a line known as the least-squares line.

Section 2.1: The Correlation Coefficient
• Data in which each item consists of a pair of values are called bivariate.
• The graphical summary for bivariate data is a scatterplot.
• Display of a scatterplot:
[Figure: scatterplot of y versus x]

Looking at Scatterplots
• If the dots on the scatterplot are spread out randomly, then the two variables are not well related to each other.
• If the dots on the scatterplot are spread around a straight line, then one variable can be used to help predict the value of the other variable.

Example
• This is a plot of height vs. forearm length for men.
• We say that there is a positive association between height and forearm length. This is because the plot indicates that taller men tend to have longer forearms.
• The slope is roughly constant throughout the plot, indicating that the points are clustered around a straight line.
• The line superimposed on the plot is a special line known as the least-squares line.

Correlation Coefficient
• The degree to which the points in a scatterplot tend to cluster around a line reflects the strength of the linear relationship between x and y.
• The correlation coefficient is a numerical measure of the strength of the linear relationship between two variables.
• The correlation coefficient is usually denoted by the letter r.

Computing r
• Let $(x_1, y_1), \ldots, (x_n, y_n)$ represent n points on a scatterplot.
• Compute the means and the standard deviations of the x's and y's.
• Then convert each x and y to standard units. That is, compute the z-scores: $(x_i - \bar{x})/s_x$ and $(y_i - \bar{y})/s_y$.
• The correlation coefficient is the average of the products of the z-scores, except that we divide by n - 1 instead of n:
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

How the Correlation Coefficient Works
• The origin of the scatterplot is placed at the point of averages $(\bar{x}, \bar{y})$.
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

Computational Formula
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
• This formula is easier for calculations by hand:
$$r = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} \, \sqrt{\sum_{i=1}^{n} y_i^2 - n\bar{y}^2}}$$

Properties of r
• It is a fact that r is always between -1 and 1.
• Positive values of r indicate that the least-squares line has a positive slope. Greater values of one variable are associated with greater values of the other.
• Negative values of r indicate that the least-squares line has a negative slope. Greater values of one variable are associated with lesser values of the other.

More Comments
• Values of r close to -1 or 1 indicate a strong linear relationship.
• Values of r close to 0 indicate a weak linear relationship.
• When r is equal to -1 or 1, all the points on the scatterplot lie exactly on a straight line.
• If the points lie exactly on a horizontal or vertical line, then r is undefined.
• If r ≠ 0, then x and y are said to be correlated. If r = 0, then x and y are uncorrelated.

More Properties of r
• An important feature of r is that it is unitless. It is a pure number that can be compared between different samples.
• r remains unchanged under each of the following operations:
  – Multiplying each value of a variable by a positive constant.
  – Adding a constant to each value of a variable.
  – Interchanging the values of x and y.
• If r = 0, this does not imply that there is no relationship between x and y. It indicates only that there is no linear relationship.
• Outliers can greatly distort r, especially in small data sets, and present a serious problem for data analysts.
• Correlation is not causation. For example, vocabulary size is strongly correlated with shoe size, but this is because both increase with age. Learning more words does not cause feet to grow, or vice versa. Age is confounding the results.

Example 1
An environmental scientist is studying the rate of absorption of a certain chemical into skin. She places differing volumes of the chemical on different pieces of skin and allows the skin to remain in contact with the chemical for varying lengths of time. She then measures the volume of chemical absorbed into each piece of skin. The scientist plots the percent absorbed against both volume and time. She calculates the correlation between volume and absorption and obtains r = 0.988. She concludes that increasing the volume of the chemical causes the percentage absorbed to increase. She then calculates the correlation between time and absorption, obtaining r = 0.987. She concludes that increasing the time that the skin is in contact with the chemical causes the percentage absorbed to increase as well. Are these conclusions justified?
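One way to probe the question in Example 1 is to compute the correlations directly from the z-score definition of r. The sketch below is illustrative rather than part of the original slides: it assumes the nine measurements listed in the Example 1 table of these slides, and the function name `correlation` and the variable names are made up for this example. It also adds one comparison the slides do not compute, the correlation between volume and time themselves, first for the Example 1 design and then for a design in which every volume is paired with every time.

```python
import math

def correlation(xs, ys):
    """Sample correlation: sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divide by n - 1, as in the slides).
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

# Example 1 table: volume and time were increased together.
volume = [0.05, 0.05, 0.05, 2.00, 2.00, 2.00, 5.00, 5.00, 5.00]
time_h = [2, 2, 2, 10, 10, 10, 24, 24, 24]
absorbed = [48.3, 51.0, 54.7, 63.2, 67.8, 66.2, 83.6, 85.1, 87.8]

r_volume = correlation(volume, absorbed)    # volume vs. percent absorbed
r_time = correlation(time_h, absorbed)      # time vs. percent absorbed
r_confounded = correlation(volume, time_h)  # the two factors vs. each other

# Alternative design: every volume is paired with every time.
time_controlled = [2, 10, 24, 2, 10, 24, 2, 10, 24]
r_controlled = correlation(volume, time_controlled)

print(round(r_volume, 3), round(r_time, 3))
print(round(r_confounded, 3), round(r_controlled, 3))
```

The first two values reproduce the 0.988 and 0.987 quoted in the example. The third shows that volume and time are themselves almost perfectly correlated (about 0.999) in this design, so the data cannot separate their effects. In the alternative design the factors are uncorrelated.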

Example 1

Volume (mL)   Time (h)   Percent Absorbed
0.05          2          48.3
0.05          2          51.0
0.05          2          54.7
2.00          10         63.2
2.00          10         67.8
2.00          10         66.2
5.00          24         83.6
5.00          24         85.1
5.00          24         87.8

Example 2

Volume (mL)   Time (h)   Percent Absorbed
0.05          2          49.2
0.05          10         51.0
0.05          24         84.3
2.00          2          54.1
2.00          10         68.7
2.00          24         87.2
5.00          2          47.7
5.00          10         65.1
5.00          24         88.4

Controlled Experiments and Confounding
• In a controlled experiment, the experimenter can choose the values of the factors to reduce confounding.
• In controlled experiments, confounding can often be avoided by choosing values for the factors in such a way that the factors are uncorrelated.

HW 2.1: 1, 7, 8

Section 2.2: The Least-Squares Line
Can we predict the strength for a nitrogen content not in the table? Is there any relationship between nitrogen content and strength?
• The line that we are trying to fit is $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$.
• The variable $y_i$ is the dependent variable, $x_i$ is the independent variable, $\beta_0$ and $\beta_1$ are called the regression coefficients, and $\varepsilon_i$ is called the error. We know only the values of x and y; we must estimate $\beta_0$ and $\beta_1$.
• This is what we call simple linear regression.

Using the Data
• We write the equation of the least-squares line as $y = \hat{\beta}_0 + \hat{\beta}_1 x$.
• The quantities $\hat{\beta}_0$ and $\hat{\beta}_1$ are called the least-squares coefficients.
• The least-squares line is the line that fits the data "best".
• To find the least-squares line, we must determine the estimates of the intercept $\beta_0$ and the slope $\beta_1$ that minimize the sum of the squared residuals $e_i = y_i - \hat{y}_i$:
$$S = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$$

Finding the Equation of the Line
• These quantities are
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$
Note: The true values of $\beta_0$ and $\beta_1$ are unknown.
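The formulas for the least-squares coefficients can be sketched in a few lines of code. This is a minimal illustration, not the course's own implementation: the function names `least_squares` and `sse_at` are invented here, and the five data points are hypothetical (the weld data of Table 2.1 are not reproduced in these slides).

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def sse_at(b0, b1, xs, ys):
    """Sum of squared residuals S for a candidate line y = b0 + b1 * x."""
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))

# Hypothetical illustrative data, NOT the weld data from Table 2.1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
best_sse = sse_at(b0, b1, xs, ys)

print(b0, b1)
# Nudging either coefficient away from the estimates increases S.
print(best_sse < sse_at(b0 + 0.1, b1, xs, ys))
print(best_sse < sse_at(b0, b1 + 0.05, xs, ys))
```

For these made-up points the slope is 1.99 and the intercept is 0.05, up to floating-point rounding. The residuals of the fitted line sum to zero, and moving either coefficient away from its least-squares value makes S larger, which is what "fits the data best" means here.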

Some Shortcut Formulas
The expressions on the right are equivalent to those on the left, and are often easier to compute:
$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$$
$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2$$
$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}$$

Example 2
Using the weld data in Table 2.1 in the text, compute the least-squares estimates $\hat{\beta}_0$ and $\hat{\beta}_1$. Write the equation of the least-squares line.
$$\bar{x} = 0.0319 \qquad \bar{y} = 63.79$$
$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 = 0.01002$$
$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} = 3.32$$
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{3.32}{0.01002} \approx 331.62$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 63.79 - (331.62)(0.0319) \approx 53.197$$
$$y = 53.197 + 331.62x$$
The coefficients shown are computed from unrounded sums and means, so they differ slightly from what the rounded values above give.

Example 3
Using the equation of the least-squares line for the weld data in Table 2.1 in the text, predict the yield strength for a weld whose nitrogen content is 0.02%.
$$y = 53.197 + 331.62x$$
$$y = 53.197 + 331.62(0.02) = 59.83 \text{ ksi}$$
In the table, the yield strength of that weld was 57.67 ksi, so should we use the table or the equation to predict the strength of another weld?

HW 2.2: 4, 12

Section 2.3: Features and Limitations of the Least-Squares Line
• Do not extrapolate a fitted line (such as the least-squares line) outside the range of the data. The linear relationship may not hold there.
• We learned that we should not use the correlation coefficient when the relationship between x and y is not linear. The same holds for the least-squares line. When the scatterplot follows a curved pattern, it does not make sense to summarize it with a straight line.
• If the relationship is curved, then we would want to fit a regression curve that contains squared terms.
• For example, the relationship between the height of a free-falling object and the time in free fall is not linear.

Outliers and Influential Points
If we define outliers as points that have unusual residuals, then a point that is far from the bulk of the data, yet near the least-squares line, is not an outlier.
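The prediction in Example 3 and the warning against extrapolation in Section 2.3 can be combined in a small sketch. The coefficients are the ones reported in these slides. The function name `predict_strength` and the range limits `X_MIN` and `X_MAX` are assumptions made for illustration: the limits are placeholders, not values taken from Table 2.1, and should be replaced by the smallest and largest nitrogen contents actually observed.

```python
B0 = 53.197  # intercept from the slides, in ksi
B1 = 331.62  # slope from the slides, in ksi per percent nitrogen

def predict_strength(nitrogen_pct, x_min, x_max):
    """Predict yield strength (ksi), refusing to extrapolate.

    x_min and x_max are the smallest and largest nitrogen contents
    in the data used to fit the line.
    """
    if not (x_min <= nitrogen_pct <= x_max):
        raise ValueError(
            f"nitrogen content {nitrogen_pct} is outside the fitted range "
            f"[{x_min}, {x_max}]; the linear relationship may not hold there"
        )
    return B0 + B1 * nitrogen_pct

# Placeholder range for illustration only (hypothetical, not from Table 2.1).
X_MIN, X_MAX = 0.0, 0.07

print(round(predict_strength(0.02, X_MIN, X_MAX), 2))  # Example 3 prediction

try:
    predict_strength(0.5, X_MIN, X_MAX)
except ValueError as err:
    print("refused:", err)
```

Passing the range in explicitly keeps the guard tied to whatever data the line was fitted on, rather than hard-coding limits that may not match the table.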

Measures of Goodness of Fit
• A goodness-of-fit statistic is a quantity that measures how well a model explains a given set of data.
• The quantity $r^2$ is the square of the correlation coefficient, and we call it the coefficient of determination:
$$r^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
• $r^2$ is interpreted as the proportion of the variance in y explained by regression.
• $r^2$ gives the percentage of the variability in y that can be explained by x. Where is x in the above equation?

Sums of Squares
• $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the error sum of squares and measures the overall spread of the points around the least-squares line.
• $\sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares and measures the overall spread of the points around the line $y = \bar{y}$.
• The difference $\sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is called the regression sum of squares and measures the reduction in the spread of points obtained by using the least-squares line rather than $y = \bar{y}$.
• The coefficient of determination $r^2$ expresses the reduction as a proportion of the spread around $y = \bar{y}$.
• Clearly, the following relationship holds:
  Total sum of squares = regression sum of squares + error sum of squares

Interpreting Computer Output

Regression Analysis: Strength versus Nitrogen Concentration

The regression equation is
Strength = 53.197 + 331.62 Nitrogen

Predictor   Coef       SE Coef    T        P
Constant    53.19715   1.02044    52.131   0.000
Nitrogen    331.6186   27.4872    12.064   0.000

S = 2.75151   R-Sq = 84.8%   R-Sq(adj) = 84.3%

1. The regression equation: This is the equation of the least-squares line.
2. Coef: The least-squares coefficients.
3. R-Sq: This is $r^2$, the square of the correlation coefficient r, also called the coefficient of determination.

HW 2.3: 7, 10
Supplementary: 6, 7

Summary
• Bivariate Data
• Correlation Coefficient
• Least-Squares Line
• Features and Limitations of the Least-Squares Line
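The sums-of-squares decomposition from the Measures of Goodness of Fit and Sums of Squares slides can be checked numerically. The sketch below is an illustration under stated assumptions: the data are hypothetical (not the weld data), and the function names `fit_line` and `sums_of_squares` are invented for this example. It also shows where x enters the formula for $r^2$: through the fitted values $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept and slope."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

def sums_of_squares(xs, ys):
    """Return (total, regression, error) sums of squares."""
    b0, b1 = fit_line(xs, ys)
    y_bar = sum(ys) / len(ys)
    fitted = [b0 + b1 * x for x in xs]  # x enters through the fitted values
    sst = sum((y - y_bar) ** 2 for y in ys)
    ssr = sum((f - y_bar) ** 2 for f in fitted)
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return sst, ssr, sse

# Hypothetical illustrative data, NOT the weld data from Table 2.1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

sst, ssr, sse = sums_of_squares(xs, ys)
r_squared = (sst - sse) / sst

# Correlation coefficient from the computational formula, for comparison.
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
r = (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / (
    math.sqrt(sum(x * x for x in xs) - n * x_bar ** 2)
    * math.sqrt(sum(y * y for y in ys) - n * y_bar ** 2)
)

print(sst, ssr + sse)    # total = regression + error
print(r_squared, r * r)  # coefficient of determination equals r squared
```

Here the regression sum of squares is computed independently, as the spread of the fitted values around $\bar{y}$, so the agreement between the total and the sum of the other two is a genuine check rather than a restatement of the definition.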