Objectives

2.1 Scatterplots

• Scatterplots
• Explanatory and response variables
• Interpreting scatterplots
• Outliers

Adapted from authors’ slides © 2012 W.H. Freeman and Company

Relationship of two numerical variables

Most statistical studies involve more than one variable, and the primary questions are about their relationships. Questions one can ask:

• Which variable(s) are explanatory and which are responses?
• How is the relationship best described?
• Do we want to know how one variable affects the value of another? Or do we simply want to measure their association?
• Is the association positive or negative?
• How can we predict one variable from the value of the other(s)?
• Can a straight line be used effectively or is the relationship more complex?
• How well (close) do the data fit the relationship we describe?
• How strong (or weak) is the relationship?
• Is the relationship “significant”? (Can we reject H0: no association?)
• How do the data deviate from the overall pattern?

Examples: variables of interest

Here are two data sets which may interest you:

• The weight of a calf (at a certain week) and its girth. Does the weight of the calf influence the girth? What sort of relationship is there? Can we reliably predict the girth given the weight? How does the relationship change over time?
• Your midterm scores. Is there a relationship between the scores in midterms 1 and 2 and midterm 3? Is this relationship strong or weak? If the relationship is strong, then your final grade is pretty much clear. However, if the relationship is weak, then those who did well still need to work hard and those who did poorly can still change their grade by working hard.

These data sets are available on my website.

Our objective in the next few lectures is to plot these data in a meaningful way, to look at the plot for a relationship, and to describe that relationship (this is descriptive statistics). Then we will describe how to measure the strength of the relationship and how to do prediction.
Explanatory and response variables

A response variable measures or records an outcome of a study. An explanatory variable explains changes in the response variable.

Typically, the explanatory variable is plotted on the x axis, and the response variable is plotted on the y axis.

Two numerical variables for each of 16 students. We are interested in the relationship between the two variables: How is one affected by changes in the other one?

[Figure: Blood Alcohol as a function of Number of Beers. y axis: Blood Alcohol Level (mg/ml), the response variable (blood alcohol content). x axis: Number of Beers, the explanatory variable (number of beers).]

Looking at relationships: Scatterplots

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph. We look for an overall pattern and for deviations from the pattern.

Student   Beers   BAC
1         5       0.1
2         2       0.03
3         9       0.19
4         8       0.12
5         3       0.04
6         7       0.095
7         3       0.07
8         5       0.06
9         3       0.02
10        5       0.05
11        4       0.07
12        6       0.1
13        5       0.085
14        7       0.09
15        1       0.01
16        4       0.05

Interpreting scatterplots

After plotting two variables on a scatterplot, we describe the relationship by examining the direction, form, and strength of the association.

We look for an overall pattern …

• Direction: positive, negative, no direction.
• Form: straight line, curved, clusters, no pattern.
• Strength: how closely the points fit the “form”.

… and for deviations from that pattern.

• Do the points fit more closely for one part of the form than they do for another?
• Are there outliers?
• Would it be appropriate to extrapolate the relationship we see?

Form and direction of an association

[Figure: example scatterplots. Panels: straight-line relationship (negative, positive); no relationship; curved relationship (positive; neither positive nor negative?).]

Positive association: High values of the response variable tend to occur together with high values of the explanatory variable.
Negative association: High values of the response variable tend to occur together with low values of the explanatory variable.
Flat (no) association: The values of the response variable are similarly distributed for all values of the other variable. There is no information about the response variable that can be predicted from the explanatory variable.
Complex association: For some values of the explanatory variable the variables appear to be positively associated, but for other values of that variable they appear to be negatively associated (curvature). Or information other than the general (average) level of the response variable can be predicted from the explanatory variable.

Strength of the association

The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form.

This is a weak positive relationship. For a particular median household income (X), you cannot predict the state per capita income (Y) very well. Y varies widely for a given X.

This is a very strong positive relationship. The daily amount of gas consumed can be predicted quite accurately for a given temperature value. Y varies very little for a given X.

How to scale a scatterplot

Same data in all four plots. There is a negative relationship between swim time and pulse rate.

Using an inappropriate scale for a scatterplot will give an incorrect impression and interpretation of the data. Both variables should be given a similar amount of space:

• The plot is roughly square.
• Space cannot be reduced without removing some points.

Outliers

An outlier is a data point that is exceptionally unusual or unexpected. It falls outside of the overall pattern of the relationship.

[Figure annotations. First point: “This point is unusual in its values but it is not an outlier of the relationship.” Second point: “This point is not in line with the others. It is an outlier of the relationship.”]
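The direction of the beers and blood alcohol association can also be checked numerically. The sketch below is not from the slides: it splits the 16 students at a cutoff of 5 beers (an arbitrary choice made for illustration) and compares the mean BAC in the two groups. A higher mean in the group that drank more is what a positive association would lead us to expect. The names `students`, `mean_bac_by_group`, and `cutoff` are hypothetical.

```python
from statistics import mean

# (beers, BAC) pairs for the 16 students, listed in order of student number.
students = [
    (5, 0.10), (2, 0.03), (9, 0.19), (8, 0.12),
    (3, 0.04), (7, 0.095), (3, 0.07), (5, 0.06),
    (3, 0.02), (5, 0.05), (4, 0.07), (6, 0.10),
    (5, 0.085), (7, 0.09), (1, 0.01), (4, 0.05),
]


def mean_bac_by_group(data, cutoff):
    """Return (mean BAC below the cutoff, mean BAC at or above the cutoff)."""
    low = [bac for beers, bac in data if beers < cutoff]
    high = [bac for beers, bac in data if beers >= cutoff]
    return mean(low), mean(high)


# The cutoff of 5 beers is an illustrative assumption, not part of the data set.
low_mean, high_mean = mean_bac_by_group(students, cutoff=5)
direction = "positive" if high_mean > low_mean else "negative or flat"

print(f"mean BAC, fewer than 5 beers: {low_mean:.3f}")
print(f"mean BAC, 5 or more beers:    {high_mean:.3f}")
print(f"suggested direction: {direction}")
```

With these data the two group means come out near 0.04 and 0.10, which agrees with a positive direction. This comparison is only a rough check: it says nothing about form or strength, which still have to be judged from the scatterplot.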
Objectives

2.2 Correlation

• The correlation coefficient r
• Properties of the correlation coefficient

Measuring relationship: correlation

The correlation coefficient is a measure of the direction and strength of a linear relationship. It is calculated using the standardized values (z-scores) of both the x and y variables.

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$

In each term, the first factor is the z-score for x and the second factor is the z-score for y. Compute this with your calculator or software!

• r is positive if the relationship is positive and negative if the relationship is negative.
• r is always between −1 and 1. The closer it is to −1 or 1, the stronger the relationship. But an r close to 0 does not necessarily mean there is no relationship.
• r has no units of measurement and does not depend on the units for x and y.

The correlation coefficient r

Time to swim: $\bar{x} = 35$, $s_x = 0.70$
Pulse rate: $\bar{y} = 140$, $s_y = 9.5$
Correlation: r = −0.75

This indicates a moderately strong negative relationship.

The value of r would be the same if, for example, “Time to Swim” was measured in seconds and “Pulse Rate” was measured in beats per hour.

“Time to Swim” is the explanatory variable here, and belongs on the x axis. However, the value of r is the same regardless of how we label or plot the variables.

r ranges from −1 to +1

The correlation coefficient r quantifies the strength and direction of a linear relationship between two quantitative variables.

• Strength: how closely the points follow a straight line.
• Direction: r is positive when individuals with higher X values tend to have higher values of Y, and negative when individuals with higher X values tend to have lower values of Y.

Direction? Form? Strength?

Automobiles in Albuquerque were randomly selected (at a shopping center) in 1974 and given an emissions test. Total hydrocarbon emissions level and model year were observed.

Direction: negative. Form: straight line? Strength: weak. r = −.483

Direction? Form? Strength?

Pollutants were observed over a 28 day period.
The carbon pollutants and the ozone level are to be related.

Direction: positive. Form: straight line. Strength: moderate. r = .687

Direction? Form? Strength?

The efficiency of an industrial biofilter is tested at different temperature levels.

Direction: positive. Form: straight line. Strength: moderate to strong. r = .891

Direction? Form? Strength?

The nickel-to-iron ratio was measured in oat plants and the plant age (in days after emergence) was also recorded.

Direction: complex (positive until 50 days, then negative). Form: curved. Strength: strong (if the curve is taken into account). r = .479

The correlation measures the degree to which the points fit a straight line, not a curve.

Example: correlations between midterm scores

            Midterm 1   Midterm 2   Midterm 3
Midterm 1
Midterm 2   0.256
Midterm 3   0.435       0.306

We can see from the correlations above that, as expected, the correlation between the midterm scores is positive (all of the correlation coefficients are greater than zero). However, none of the correlation coefficients is large, so the association is not strong: a midterm score cannot be predicted well from the previous midterm scores. This is good news: it appears that you can improve! The correlation is strongest between midterm 1 and midterm 3, which I did not expect!
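The definition of r given earlier in this section (the sum of products of z-scores, divided by n − 1) can be sketched in a few lines of Python. This is a minimal illustration, not the course's own software: it reuses the beers and BAC data for the 16 students from section 2.1, and the function name `correlation` and the rescaling factor of 1000 are assumptions chosen for the example.

```python
from statistics import mean, stdev

# Beers and BAC for the 16 students, listed in order of student number.
beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]


def correlation(x, y):
    """Correlation r: sum of products of z-scores, divided by n - 1."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (n - 1 divisor)
    z_products = [((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                  for xi, yi in zip(x, y)]
    return sum(z_products) / (len(x) - 1)


r = correlation(beers, bac)
print(f"r = {r:.3f}")

# r has no units: rescaling BAC (here, multiplying by 1000) leaves r unchanged.
r_rescaled = correlation(beers, [1000 * b for b in bac])
# r does not depend on which variable is treated as x and which as y.
r_swapped = correlation(bac, beers)
print(f"rescaled: {r_rescaled:.3f}, swapped: {r_swapped:.3f}")
```

For these data r comes out at about 0.89, and the rescaled and swapped versions give the same value, which illustrates the two properties stated above: r does not depend on the units, and it is the same regardless of how the variables are labeled.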