By Gina M. Salvati

STATISTICS BLITZ!!!

*~DISCLAIMER~*
Some examples are lifted from Fundamentals of Statistics, Third Edition. All rights reserved, yada, yada, yada. I don’t own those examples, and they are noted throughout the presentation. The only people benefitting from this presentation (hopefully) are peer tutors and the students. ENJOY!

First, and foremost… DON’T PANIC
• Unlike Algebra, Statistics is not an exact math – there will always be some kind of variation among answers. So if your answer doesn’t exactly match what is in the text or (especially) on the answer key on an exam, don’t panic. As long as you are in the same ballpark as the given answer(s), you should be just fine.
• Even with that being said, however, always, always, always check your work. It is always important to double-check yourself, no matter which course you take.
• A good test-taking tip when it comes to multiple choice is to go with the answer closest to what you come up with.

THE CHAMBER OF DOOM (I Mean, Table of Contents)
• Types of Curves
• Measures of Central Tendency vs. Measures of Dispersion
• How to Calculate on TI-83/84
• Z-scores
• The Five-Number Summary, Boxplots, and Outliers
  • To Construct a Boxplot
  • The Interquartile Range
  • To Determine Outliers
  • Example
• Linear Regression and What All Goes With It
  • r and r²
• A Quick Look at Probability
• The Empirical Rule
• What to Use and When to Use it {Pt. 1}: normalcdf vs. invNorm
• The Sampling Distribution of $\bar{x}$ vs. The Sampling Distribution of $\hat{p}$
• What to Use and When to Use it {Pt. 2}: Confidence Intervals
• What to Use and When to Use it {Pt. 3}: The Insanity that is Hypothesis Testing
  • Feat. Type I and Type II Errors
• What to Use and When to Use it {The Finale}: Matched-Pair Data vs. $\chi^2$ GOF Testing

THE TYPES OF CURVES
[Image: three curves, left-skewed, symmetrical, and right-skewed]
• Left-skew: Mean < Median < Mode
• Symmetrical (AKA Bell Curve): Mean = Median = Mode
• Right-skew: Mean > Median > Mode
In the image above, the relationships of mean, median, and mode are illustrated.
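For anyone who would rather see numbers than curves, here is a minimal Python sketch of the same orderings. The two data sets are made up for illustration (they are not from the text), and Python is just a stand-in for the calculator.

```python
from statistics import mean, median, mode

# Made-up data with a long tail to the right (a few large values).
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]
# Mirror image of the same data, so the long tail is on the left.
left_skewed = [15 - x for x in right_skewed]

for name, data in [("right-skew", right_skewed), ("left-skew", left_skewed)]:
    print(name, "mean:", mean(data), "median:", median(data), "mode:", mode(data))
```

In the right-skewed set the few large values pull the mean above the median, which in turn sits above the mode; the mirrored set reverses the order.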
MEASURES OF CENTRAL TENDENCY VS. MEASURES OF DISPERSION
A Measure of Central Tendency basically means the average or typical data value. For this class, it mostly refers to where our distribution is centered. There are 3 Measures of Central Tendency:
• Mean
  • The measure we mostly refer to when we say “the average.”
  • Two types: population ($\mu$) and sample ($\bar{x}$)
• Median
  • The value that lies in the center of our data when it is all put in ascending order.
  • The most resistant measure of central tendency – i.e., it’s not affected by extreme values (meaning it doesn’t get yanked around by outliers and crazy numbers as much as, say, the mean)
• Mode
  • The number(s) in our data that occur(s) most frequently.

MEASURES OF CENTRAL TENDENCY VS. MEASURES OF DISPERSION (CONT.)
A Measure of Dispersion refers to what we use to describe the spread of our distribution. Like the Measures of Central Tendency, there are three of these:
• Range
  • Highest number – lowest number
  • The Interquartile Range (IQR) falls under this
• Variance
  • (standard deviation)²
  • Two types: population ($\sigma^2$) and sample ($s^2$)
• Standard Deviation
  • The most common form of dispersion and the one that we use in this course
  • Two types: population ($\sigma$) and sample ($s$)

HOW TO CALCULATE ON TI-83/84
Flight Time: The following data represent the flight time (in minutes) of a random sample of seven flights from Las Vegas, Nevada, to Newark, New Jersey, on Continental Airlines:
• 282, 270, 260, 266, 257, 260, 267
• Calculate the mean, median, and mode flight time.
• Enter data in L1 (Stat, 1: Edit…)
• Stat, Calc, 1: 1-Var Stats
• ENTER
• The list that appears gives you all the information you will be looking for. $\bar{x}$ provides the mean, Sx provides the sample standard deviation (s.d.), and σx provides the population s.d.
• Scroll down until you see Med. This number is your median.
• The only thing it doesn’t calculate is the mode (the most frequently occurring number), but that’s pretty easy to find on its own. ;)
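If you want to double-check the calculator, the same flight-time numbers can be run through Python's standard `statistics` module. This is only a sketch for checking work, not part of the TI-83/84 procedure; the variable names are my own.

```python
from statistics import mean, median, mode, stdev, pstdev

# Flight times (minutes) from the Las Vegas to Newark example.
flight_times = [282, 270, 260, 266, 257, 260, 267]

x_bar = mean(flight_times)         # what 1-Var Stats reports as x-bar
med = median(flight_times)         # what 1-Var Stats reports as Med
most_common = mode(flight_times)   # the one thing the calculator does not report
s_x = stdev(flight_times)          # sample s.d. (divides by n - 1)
sigma_x = pstdev(flight_times)     # population s.d. (divides by n)

print("mean:", x_bar, "median:", med, "mode:", most_common)
print("sample s.d.:", round(s_x, 3), "population s.d.:", round(sigma_x, 3))
```

The mean and median both come out to 266 minutes and the mode is 260; the two standard deviations differ only in whether the sum of squared deviations is divided by n − 1 or by n.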
From Fundamentals of Statistics, Third Edition text, pg. 125.

Z-SCORES
The z-score basically represents the distance that a data value is from the mean in terms of the number of standard deviations; i.e., it’s a converted number. It sounds a lot more complicated than it is. All you need to know is one simple equation:
$z = \frac{x - \mu}{\sigma}$
Let’s see an example…

Z-SCORES
Example: Men vs. Women
The average 20- to 29-year-old man is 69.6 inches tall, with a standard deviation of 3.0 inches, while the average 20- to 29-year-old woman is 64.1 inches tall, with a standard deviation of 3.8 inches. Who is relatively taller, a 67-inch man or a 62-inch woman?
• Break the information into two (2) equations and then compare:
• Men: $z = \frac{67 - 69.6}{3.0} = \frac{-2.6}{3.0} \approx -0.867$
• Women: $z = \frac{62 - 64.1}{3.8} = \frac{-2.1}{3.8} \approx -0.553$
• The relatively taller person is…
• The 62-inch woman. Both heights are below average, but her z-score is the larger one (−0.553 > −0.867), so she sits fewer standard deviations below her group’s mean.
From Fundamentals of Statistics, Third Edition text, pg. 161.

THE FIVE-NUMBER SUMMARY, BOXPLOTS, AND OUTLIERS
• The Five-Number Summary basically divvies up your data into quartiles. It also can be used to construct a boxplot. The Five-Number Summary consists of the following:
  • Smallest number in the data set (Minimum)
  • First quartile ($Q_1$)
  • Median ($Q_2$)
  • Third quartile ($Q_3$)
  • Largest number in the data set (Maximum)
• To get it on your calculator, enter your data in L1 under Stat, 1: Edit…, and then proceed as if you were looking for the mean, median, and mode (Stat, Calc, 1: 1-Var Stats).
• Scroll down to the bottom of the list, and you will find the Five-Number Summary. ;)

TO CONSTRUCT A BOXPLOT
• After entering your data in L1, hit 2nd, Y=. This will bring you to the Stat Plots screen.
• Hit ENTER or 1. This will bring you to another screen, where you can actually set up the boxplot.
• Turn the plot ON.
• Select the Boxplot with Outliers type. This is the first image in the second row in the Type category.
• Zoom, 9: ZoomStat
• Ta-da! :D
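The two ideas in this stretch, z-scores and the five-number summary, can be sketched in a few lines of Python. The function names are hypothetical, and the quartile rule used here (median of the lower half and median of the upper half, leaving out the middle value when n is odd) is an assumption about what the calculator does; it does reproduce the quartiles reported by 1-Var Stats in this deck's worked example.

```python
from statistics import median

def z_score(x, mean, sd):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - mean) / sd

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max); needs at least two data values."""
    values = sorted(data)
    half = len(values) // 2
    lower = values[:half]     # everything below the median position
    upper = values[-half:]    # everything above the median position
    return (values[0], median(lower), median(values), median(upper), values[-1])

# Men vs. Women: the larger z-score belongs to the relatively taller person.
man = z_score(67, 69.6, 3.0)
woman = z_score(62, 64.1, 3.8)
print("man:", round(man, 3), "woman:", round(woman, 3))

# Five-number summary of the flight-time data used earlier.
flights = [282, 270, 260, 266, 257, 260, 267]
print(five_number_summary(flights))
```

Other software may use a slightly different quartile rule, so small differences in Q1 and Q3 are nothing to panic about.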
THE INTERQUARTILE RANGE
• The Interquartile Range (IQR) is the range of the middle 50% of observations in the data set. Again, another simple little equation:
$\text{IQR} = Q_3 - Q_1$

TO DETERMINE OUTLIERS
Outliers are basically extreme values – the numbers that are waaay out there on either side of the data set when all the data are put in ascending order. To determine if there are any outliers in a data set, we use the following two equations:
• Lower Fence – any numbers below this number are considered outliers
  $\text{LF} = Q_1 - 1.5(\text{IQR})$
• Upper Fence – any numbers above this number are considered outliers
  $\text{UF} = Q_3 + 1.5(\text{IQR})$

THE FIVE-NUMBER SUMMARY, BOXPLOTS, AND OUTLIERS
April Showers: The following data represent the number of inches of rain in Chicago, Illinois, during the month of April for 20 randomly selected years.
0.97  2.47  3.94  4.11  5.79
1.14  2.78  3.97  4.77  6.14
1.85  3.41  4.00  5.22  6.28
2.34  2.34  4.02  5.50  7.69
[a] Determine the five-number summary
[b] Compute the Interquartile Range (IQR)
[c] Determine the Upper and Lower Fences. Are there any outliers?
Fundamentals of Statistics, Third Edition, pg. 162

THE FIVE-NUMBER SUMMARY, BOXPLOTS, AND OUTLIERS – ANSWERS
• [a] Determine the quartiles
  • Record all data in L1
  • Stat, Calc, 1: 1-Var Stats
  • Min: 0.97
  • $Q_1$: 2.405
  • $Q_2$/Med: 3.985
  • $Q_3$: 5.36
  • Max: 7.69
• [b] Compute the IQR
  • IQR = Q3 – Q1
  • 5.36 – 2.405 = 2.955
• [c] Determine the Upper and Lower Fences. Are there any outliers?
  • LF = Q1 – 1.5(IQR)
  • LF = 2.405 – 1.5(2.955)
  • LF = -2.0275
  • UF = Q3 + 1.5(IQR)
  • UF = 5.36 + 1.5(2.955)
  • UF = 9.7925
  • Because there are no numbers in the data set that are less than the lower fence or greater than the upper fence, there are no outliers.

LINEAR REGRESSION AND WHAT ALL GOES WITH IT
• Linear Regression is basically a method of modeling the relationship between the independent/explanatory variable (x) and the dependent variable (y). The simplest way to look at it is to think in terms of equations of lines.
The least-squares regression line equation is pretty much slope-intercept form – no joke.
• To solve on TI-83/84:
  • Enter data in Stat, 1: Edit…
  • X values in L1, Y values in L2
  • Stat, Calc, 4: LinReg(ax+b)
  • [Image: LinReg output screen] As shown in the image on the right, this screen will appear. Look at the equation setup there, $y = ax + b$. It should look familiar. (Hint: substitute m for a)
  • If you do not see r and r², you will need to set it up:
    • 2nd, 0 (Catalog), scroll down to DiagnosticOn (or hit the $x^{-1}$ key to save time), ENTER
    • Hit ENTER again; a message saying DONE will appear. ;)
What are r and r²? Well, let’s take a look…

R AND R²
• The linear correlation coefficient (r) measures how closely the data follow the linear regression line.
• The coefficient of determination (r²) holds explaining power: “the linear regression explains __% of the variation in the data.”
Note: It is important to remember that correlation doesn’t necessarily imply causation – just because there is a correlation between two variables, it doesn’t always mean that one is causing the other.

A QUICK LOOK AT PROBABILITY
• Probability measures the likelihood of something happening (and, in some cases, not happening). Like everything else in Statistics, there are a few rules:
  • The probability of any event (E) must be greater than or equal to 0 and less than or equal to 1, AKA: $0 \le P(E) \le 1$
  • The probabilities of all possible outcomes must add up to 1
  • If an event is impossible, its probability is 0. Conversely, if an event is certain, its probability is 1.
  • An unusual event has a less than 5% (.05) chance of occurring. This is an important rule to remember, as it holds true in several other sections in this course.

THE EMPIRICAL RULE
• The Empirical Rule is used to give us an approximation of how much of a bell-shaped distribution lies within 1, 2, or 3 standard deviations of the mean.
• 68% of the distribution lies within 1 standard deviation of the mean
• 95% of the distribution lies within 2 standard deviations of the mean
• 99.7% of the distribution lies within 3 standard deviations of the mean

THE EMPIRICAL RULE: A SLIGHTLY BETTER EXAMPLE
[Image: bell curve with $\mu = 100$ and $\sigma = 5$]
• 68%: from 100 – 5 = 95 to 100 + 5 = 105
• 95%: from 95 – 5 = 90 to 105 + 5 = 110
• 99.7%: from 90 – 5 = 85 to 110 + 5 = 115

WHAT TO USE AND WHEN TO USE IT {PT. 1}: NORMALCDF VS. INVNORM
• Two simple calculator functions that can very easily get mixed up. These are the key things to remember about normalcdf and invNorm:
• normalcdf:
  • When we’re given z-scores or data points* and asked to find the probability/proportion/area under the curve.
• invNorm:
  • When we’re given a percent/probability/area and asked to find the z-score/data point.
• A trick is to look for key words. One example is the word percentile. If a question asks you to find a percentile, you will use the invNorm function on your calculator.
* Z-scores are what we use when dealing with the standard normal curve ($\mu = 0$, $\sigma = 1$); when working with any other normal distribution (such as in the example in slides 25-26), we refer to the numbers as data points.

NORMALCDF ON THE CALCULATOR
• When using the normalcdf function on your calculator, the order of input is the lower bound, followed by the upper bound, then the mean, and finally the standard deviation. Most calculators will have you enter the information manually; others are kind enough to help you out a little. [Image: normalcdf input screen]
• -E99 and E99 are very useful in circumstances when you are not given an upper or lower bound. In some instances, -100 and 100 can be used, but sometimes a much smaller or larger number may be called for. To put this in the calculator, press 2nd and then the comma key to get E. Simply type the negative symbol first if doing –E99.
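For checking calculator answers away from the calculator, here is a rough Python stand-in for the two functions, built on the standard library's `statistics.NormalDist`. The names `normalcdf`, `inv_norm`, and `E99` are my own labels chosen to mirror the calculator; the argument order follows the slides (lower bound, upper bound, mean, standard deviation for normalcdf; area to the left, mean, standard deviation for invNorm).

```python
from statistics import NormalDist

E99 = 1e99  # stand-in for the calculator's E99 ("no bound on this side")

def normalcdf(lower, upper, mean=0.0, sd=1.0):
    """Area under the normal curve between lower and upper."""
    dist = NormalDist(mean, sd)
    return dist.cdf(upper) - dist.cdf(lower)

def inv_norm(area_left, mean=0.0, sd=1.0):
    """Data point (or z-score) that has area_left of the curve to its left."""
    return NormalDist(mean, sd).inv_cdf(area_left)

# Empirical Rule check with mean 100 and s.d. 5: about 68% lies in 95..105.
print(round(normalcdf(95, 105, 100, 5), 4))   # about 0.6827

# Standard normal curve: area above z = 1.50, with no upper bound.
print(round(normalcdf(1.5, E99), 4))          # about 0.0668

# The 50th percentile of the same distribution is its mean.
print(inv_norm(0.50, 100, 5))
```

Notice that the 68% from the Empirical Rule is itself a rounded figure; normalcdf gives the slightly longer decimal.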
NORMALCDF EXAMPLE
• If I were to ask what the area under the curve was above 1.50, my graph would look something like this: [Image: curve shaded to the right of 1.50]
• If I were to ask what the area under the curve was below 1.50, my graph would look like this: [Image: curve shaded to the left of 1.50]
• And if I wanted to know the area between, say, -1.75 and 1.5, my graph would look like this: [Image: curve shaded between -1.75 and 1.5]
NOTE: In this example, we are working with the standard normal curve. This means that μ = 0 and σ = 1.

INVNORM ON THE CALCULATOR
• When using the invNorm function on your calculator, the order of input is the area to the left of the z-score we’re seeking, followed by the mean, and finally the standard deviation. Most calculators will have you enter the information manually; others are kind enough to help you out a little. [Image: invNorm input screen]

INVNORM EXAMPLE 1.0
• invNorm problems can get a little confusing at times. The key, as with virtually everything in Statistics, is in the wording.
• Let’s say I am in charge of a chocolate chip cookie factory. The factory churns out an average of 3500 cookies a day, with a standard deviation of 45. Inventory is just around the corner, and we’re expected to make above the 60th percentile. Over how many cookies are we supposed to make to meet expectations?
• The wording here isn’t exactly the greatest, but we are given the information necessary to perform the task. We know that we have an average (mean, μ) of 3500 cookies and a standard deviation (σ) of 45. We also know that we need to make above the 60th percentile. This means we need to make more than the bottom 60% of the distribution in order to meet the quota. How do we solve for this?

INVNORM EXAMPLE 1.0 (CONT.)
• The easiest way to solve is to start out by drawing a picture: [Image: normal curve with the leftmost 60% shaded]
• The 60th percentile is illustrated as the first 60% of our distribution.
Using some more of our newly discovered mad calculator skills, we can find the number of cookies (our data point) by following these steps:
  • 2nd, vars, 3: invNorm
  • Enter, in order, the area to the left side of the data point we are seeking (here, .60), the mean, and the standard deviation
  • ENTER
• The number you get will be the data point we’re looking for. The answer here comes out to be 3511.40062. Because we are talking about cookies and not random numbers, it is best to round. A good practice is to round up, to account for potential extra.
ANSWER: In order to meet expectations, the factory needs to make more than 3511 cookies, i.e., at least 3512.

THE SAMPLING DISTRIBUTION OF $\bar{x}$ VS. THE SAMPLING DISTRIBUTION OF $\hat{p}$
• A simple process gone wrong, or at least this is what it sometimes feels like with sampling distributions. Actually, sampling distributions are really nifty little paths that help us when dealing with samples and proportions.
• There are two that we deal with: the Sampling Distribution of $\bar{x}$ and the Sampling Distribution of $\hat{p}$. The difference? Let’s find out…

THE SAMPLING DISTRIBUTION OF THE SAMPLE MEAN
• The sampling distribution of the sample mean (AKA: x-bar, $\bar{x}$) is used when we are given information about a sample and asked to find the probability that the sample mean falls above, below, or between certain values.
• The sampling distribution goes a little something like this:
  • The distribution is approximately normal
  • $\mu_{\bar{x}} = \mu$
  • $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
This cute little equation is called the standard error of the mean, or just standard error. Good little vocab hint to remember. ;)
Let’s see an example…

EXAMPLE PROBLEM #1: SAMPLING DISTRIBUTION OF $\bar{x}$
A simple random sample of size n = 49 is obtained from a population with µ = 80 and σ = 14.
• Describe the sampling distribution of $\bar{x}$.
• What is $P(\bar{x} > 83)$?
• What is $P(\bar{x} \le 75.8)$?
• What is $P(59.8 < \bar{x} < 65.9)$?
From Fundamentals of Statistics, Third Edition, pg. 389

EXAMPLE PROBLEM #1 SOLUTION
• The sampling distribution of x-bar is approximately normal.
This is pretty much always the case, at least for this course. If the sampling distribution is not approximately normal, we cannot proceed. To determine whether it is, the general rule of thumb is to look at your sample size (n). If the sample size is 30 or greater (n ≥ 30), you can go ahead; if the sample size is smaller than 30, the information must specify that the population is normal.
• The rest of the sampling distribution is as follows:
  • $\mu_{\bar{x}} = \mu = 80$
  • $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{14}{\sqrt{49}} = \frac{14}{7} = 2$

EXAMPLE PROBLEM #1 SOLUTION (CONT.)
• The sampling distribution is approximately normal, with $\mu_{\bar{x}} = 80$ and $\sigma_{\bar{x}} = 2$.
• To solve, it’s best to start out by drawing a “pretty picture” before plugging numbers in the calculator: [Images: one shaded normal curve per part]
• For each part, select 2nd, vars, 2: normalcdf. This operation will give you the area under the curve.
  • First curve: 2nd, vars, 2: (83, E99, 80, 2) ≈ .0668
  • Second curve: 2nd, vars, 2: (-E99, 75.8, 80, 2) ≈ .0179
  • Third curve: 2nd, vars, 2: (
Please note: images aren’t exactly to scale. ^^;;

THE SAMPLING DISTRIBUTION OF THE SAMPLE PROPORTION
• The sampling distribution of the sample proportion (AKA p-hat or $\hat{p}$) is used when we are given information about a sample and a proportion/percent and asked to find the probability that a certain number (or proportion) of individuals in the sample possess the characteristic we are looking for.
• This sampling distribution looks like this:
  • The distribution is approximately normal
  • $\mu_{\hat{p}} = p$
  • $\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}$
  • $\hat{p} = \frac{x}{n}$
This last equation is for a point estimate. We use this to adjust our numbers in order to fit the distribution. If you’re dealing with decimals, a whole number isn’t going to fall in your distribution. ;)

EXAMPLE PROBLEM: SAMPLING DISTRIBUTION OF $\hat{p}$
Smith owns a shipyard. He knows that 5% of all welding done that afternoon will wind up being defective. Out of the 7000 welds in the yard, he examines 300. What’s the probability that between 10 and 20 welding jobs will be defective?

SMITH AND HIS SHIPYARD
• This kind of a question is a classic nuisance.
First of all, how do we determine what numbers represent our sample (n) and the number(s) out of it (x)?
• Look at the information. The problem states that out of 7000 welds in the yard, Smith examines 300 welds. 300 is our sample size; the 7000 total welds is the population in the shipyard. Out of the 300 welds he’s examined, he wants to know the probability that between 10 and 20 welding jobs are defective. 10 and 20 represent our x values.
• Knowing this, we can now move forward with our information gathering:
  • p = 5% = .05
  • n = 300
  • $x_1 = 10$
  • $x_2 = 20$

SMITH AND HIS SHIPYARD (CONT.)
• We can now go ahead with our sampling distribution:
  • $\mu_{\hat{p}} = p = .05$
  • $\sigma_{\hat{p}} = \sqrt{\frac{.05(1 - .05)}{300}} \approx .0126$
  • $\hat{p}_1 = 10/300 \approx .033$
  • $\hat{p}_2 = 20/300 \approx .067$
• Finally, using our mad calculator skills, we can determine the area between our two boundaries (aka our two $\hat{p}$ values):
  • 2nd, vars, 2
  • (LB, UB, $\mu_{\hat{p}}$, $\sigma_{\hat{p}}$)
  • (.033, .067, .05, .0126)
TA DA!!! The answer is approximately .8227

WHAT TO USE AND WHEN TO USE IT {PT. 2}: CONFIDENCE INTERVALS
A Quick Run-Down
• Stat, Tests,…
  • 7: ZInterval – when given σ (population standard deviation)
  • 8: TInterval – when given s (sample standard deviation) or no standard deviation when given a data set
  • A: 1-PropZInt – when given x and n values (a sample, n, and a number out of the sample, x) and a percent or proportion
  • B: 2-PropZInt – when given two x and n values (two samples and numbers out of those samples: x1 and n1, and x2 and n2)

WHAT TO USE AND WHEN TO USE IT {PT. 3}: THE INSANITY THAT IS HYPOTHESIS TESTING
A Quick Run-Down
• Stat, Tests,…
  • 1: Z-Test – when given σ (population standard deviation)
  • 2: T-Test – when given s (sample standard deviation) or no standard deviation when given a data set
  • 5: 1-PropZTest – when given x and n values (a sample, n, and a number out of the sample, x) and a percent or proportion
  • 6: 2-PropZTest – when given two x and n values (two samples and numbers out of those samples: x1 and n1, and x2 and n2)

FEAT. TYPE I AND TYPE II ERRORS
The best way to describe these two is to think about hypothesis testing as a court case. The old motto of “innocent until proven guilty” provides us with our null and alternative hypotheses: the null is “innocent,” and the alternative is “guilty.” A Type I error is rejecting a null hypothesis that is actually true (convicting an innocent person); a Type II error is failing to reject a null hypothesis that is actually false (letting a guilty person go free). The following is a mini-cartoon illustrating Type I and Type II errors. [Image: mini-cartoon]

WHAT TO USE AND WHEN TO USE IT {THE FINALE}: MATCHED-PAIR DATA VS. CHI-SQUARE GOF TESTING
Now, this is where things are confusing. Because these two types of testing appear so alike when entering data into the calculator, it’s easy to get confused. How can we tell the difference between these two kinds of problems???

WHAT TO LOOK FOR, HOW TO SOLVE
The problems from these two sections appear to be very similar, but they are very different. It’s important to remember when to use either method of solving. Let’s take a look at a couple of examples…

MATCHED-PAIR DATA
• On one final exam study guide, there is a problem that talks about a football coach claiming that players can increase their strength by taking a certain supplement. He decides to test this theory by randomly selecting 9 athletes and giving them a strength test on the bench press. 30 days later, after regular training and taking this supplement, he tests them again.
• First of all, we need to identify what kind of problem this is. I just identified it as a matched-pair data problem. How can you tell if it’s this and not chi-square goodness-of-fit? Look at the data table:
Athlete   1    2    3    4    5    6    7    8    9
Before    215  240  188  212  275  260  225  200  185
After     225  245  188  210  282  275  230  195  190
• We want to test if the coach’s claim holds – that the supplement is effective in increasing the athletes’ strength.

MATCHED-PAIR DATA (CONT.)
• The “Before” values are the first set of data we gathered. These will go into List 1 (Stat, 1: Edit…). The values in the second data set we gathered (“After”) will go into List 2.
• Okay, now we need to see how the difference between these two data sets levels out in our test against our level of significance, α = 0.05. To do this, we first need to get the differences, and we do that by highlighting the List 3 heading, entering the equation L1 – L2, and pressing the ENTER key. This will give you a list of numbers – the difference between List 1 and List 2.
• Exit the screen by pressing 2nd, mode. Then, go to Stat, Tests, 2 (T-Test), and use the DATA option for your input. Change the List selection from L1 to L3, and proceed accordingly with your hypothesis testing.
• In short:
  • L1 – “before” values
  • L2 – “after” values
  • L3 heading – L1 – L2
  • 2nd, mode (to get out)
  • Stat, Tests, 2 (T-Test)
  • Data
  • Look at L3
**Note**: When writing out your null and alternate hypotheses, remember the H0 will always be μd = 0 (the mean of the differences), to indicate that there is no difference in values. To determine H1, look at the alternative claim.
Remember: Problems from Matched-Pair Data and Chi-Square may look somewhat similar, but they are very different. The confusion lies in 1.) identifying the problem and 2.) remembering which equation to put in the L3 heading.*
*If you have a TI-84 calculator, there is a trick to avoid this issue. The trick will be covered in another slide.

CHI-SQUARE
• Again, problems from this section and from matched-pair data may look somewhat similar, but they are very different.
• Here is an example: The information provided in one problem from a former exam review gives us the number of wins for track hurdlers and asks us to test the claim that “the probabilities of winning are the same in different positions.”
• Again, we have to identify whether this is actually chi-square and not matched-pair data. Look at the chart, which I’ve added to here as well:
Starting Position   1    2    3    4    5    6
Number of Wins      45   50   36   44   32   33
• Unlike the previous question, which gave us “before” and “after” values, this gives us a table of observed outcomes.
This is where we can tell what the difference is: matched-pair data are generally given as “before” and “after,” whereas chi-square gives us an expectation and what we actually observed happening.
• Now that we know this much, what is our expectation for this problem? Like any other hypothesis test, we have to identify a null hypothesis and an alternative hypothesis. In this case, our null is what we expected. For a problem like this, we would expect that, if the runners in each lane had equal chances of winning, each lane would get a 1/6 chance.

CHI-SQUARE (CONT.)
• How does this work on a calculator? Similar to matched-pair data problems, we start out by entering data (Stat, 1: Edit…). In L1, we put our observed data – what actually happened. This would be the number of wins represented in the table above. In L2, we take the total number of wins (here, 240) and multiply it by the chance we expected (1/6), which gives 40 expected wins per lane. And in the L3 heading, we use the equation (L1 – L2)²/L2.
• In short:
  • L1 – observed outcomes (what really happened); number of wins in the table
  • L2 – expected outcomes (what we thought was going to happen); total wins * chance of winning (240 * 1/6)
  • L3 heading – (L1 – L2)²/L2
  • 2nd, mode (to get out)
  • 2nd, Stat, left arrow to Math, 5 (sum)
  • Find the sum of L3
  • This is your test statistic
  • Proceed accordingly with hypothesis testing

CHI-SQUARE (CONT.)
• Another way to find the chi-square test statistic is as follows:
  • L1 – observed outcomes (what really happened)
  • L2 – expected outcomes (what we thought was going to happen)
  • 2nd, Mode (to get out)
  • Stat, Tests, D (alpha, $x^{-1}$ key): $\chi^2$GOF-Test
  • You will see the following on your screen: [Image: $\chi^2$GOF-Test input screen]
  • Enter your degrees of freedom (df) and hit Calculate
  • This will give you not only the test statistic, but it will give you the p-value as well.
(PLEASE NOTE: this trick only works on TI-84 calculators; TI-83 calculators do not have this function!)

YOU AIN’T NEVER HAD A FRIEND LIKE...
The Ultimate Wrap-Up for What to Use and When to Use it! :D
The numbers and letters below are the options under Stat, Tests.
• Given σ (population standard deviation): Hypothesis Testing – 1; Confidence Intervals – 7
• Given s (sample standard deviation) or no standard deviation when given a data set: Hypothesis Testing – 2; Confidence Intervals – 8
• When given x and n values (a sample, n, and a number out of the sample, x): Hypothesis Testing – 5; Confidence Intervals – A
• When given x and n values for two samples (x1 and n1, x2 and n2): Hypothesis Testing – 6; Confidence Intervals – B

THE END!
This concludes the Statistics Blitz PowerPoint Presentation.
Created by Gina M. Salvati, CRLA certified tutor and former peer tutor for North Florida Community College and Florida Gateway College
Version updated: 2014.9.15