Statistics Blitz - North Florida Community College

By Gina M. Salvati
STATISTICS
BLITZ!!!
*~DISCLAIMER~*
Some examples are lifted from Fundamentals of Statistics Third Edition.
All rights reserved, yada, yada, yada. I don’t own those examples, and they are
noted throughout the presentation.
The only people benefitting from this presentation (hopefully) are peer tutors and the
students.
ENJOY!
First, and foremost…
DON’T PANIC
• Unlike Algebra, Statistics is not an exact math – there will always be some
kind of variation among answers. So if your answer doesn’t exactly match
what is in the text or (especially) on the answer key on an exam, don’t
panic. As long as you are in the same ballpark as the given answer(s), you
should be just fine. 
• Even with that being said, however, always, always, always check your
work. It is always important to double-check yourself, no matter which
course you take.
• A good test-taking tip when it comes to multiple choice is to go with the
answer closest to what you come up with.
THE CHAMBER OF DOOM
(I Mean, Table of Contents)
• Types of Curves
• Measures of Central Tendency vs. Measures of Dispersion
• How to Calculate on TI-83/84
• Z-scores
• The Five-Number Summary, Boxplots, and Outliers
• To Construct a Boxplot
• The Interquartile Range
• To Determine Outliers
• Example
• Linear Regression and What All Goes With It
• r and r²
• A Quick Look at Probability
• The Empirical Rule
• What to Use and When to Use it {Pt. 1}: normalcdf vs. invNorm
• The Sampling Distribution of x̄ vs. The Sampling Distribution of p̂
• What to Use and When to Use it {Pt. 2}: Confidence Intervals
• What to Use and When to Use it {Pt. 3}: The Insanity that is Hypothesis Testing
• Feat. Type I and Type II Errors
• What to Use and When to Use it {The Finale}: Matched-Pair Data vs. χ² GOF Testing
THE TYPES OF CURVES
Left-skew: Mean < Median < Mode
Symmetrical (AKA Bell Curve): Mean = Median = Mode
Right-skew: Mean > Median > Mode
The slide image illustrates these relationships of mean, median, and mode for each curve.
MEASURES OF CENTRAL TENDENCY VS.
MEASURES OF DISPERSION
The Measure of Central Tendency basically means the average or typical
data value. For this class, it mostly refers to where our distribution is centered
at. There are 3 Measures of Central Tendency:
• Mean
• The measure we mostly refer to when we say “the average.”
• Two types: population (μ) and sample (x̄)
• Median
• The value that lies in the center of our data when it is all put in ascending order.
• The most resistant measure of central tendency – i.e., it’s not affected by
extreme values (meaning it doesn’t get yanked around by outliers and crazy
numbers as much as, say, the mean)
• Mode
• The number(s) in our data that occur(s) most frequently.
MEASURES OF CENTRAL TENDENCY VS.
MEASURES OF DISPERSION
The Measure of Dispersion refers to what we use to describe the spread of our
distribution. Like the Measures of Central Tendency, there are three of these:
• Range
• Highest number – lowest number
• The Interquartile Range (IQR) falls under this
• Variance
• (standard deviation)²
• Two types: population (σ²) and sample (s²)
• Standard Deviation
• The most common form of dispersion and the one that we use in this course
• Two types: population (σ) and sample (s)
HOW TO CALCULATE ON TI-83/84
Flight Time The following data represent the mean flight time (in minutes) of a random
sample of seven flights from Las Vegas, Nevada, to Newark, New Jersey, on Continental
Airlines:
• 282, 270, 260, 266, 257, 260, 267
• Calculate the mean, median, and mode flight time.
• Enter data in L1 (Stat, 1:Edit…)
• Stat, Calc, 1: 1-Var Stats
• ENTER
• The list that appears gives you all the information you will be looking for. x̄ provides the
mean, Sx provides the sample standard deviation (s.d.), and σx provides the population s.d.
• Scroll down until you see Med. This number is your median.
• The only thing it doesn’t calculate is the mode (the most reoccurring number), but that’s
pretty easy to find on its own. ;)
From Fundamentals of Statistics Third Edition text, pg. 125.
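If you want to double-check the calculator, here is a minimal Python sketch of the same 1-Var Stats numbers for the flight-time data. It assumes nothing beyond Python's standard library; the variable names are made up for illustration.

```python
import statistics

# Flight times (minutes) from the example above
flight_times = [282, 270, 260, 266, 257, 260, 267]

mean = statistics.mean(flight_times)              # x-bar on the calculator
median = statistics.median(flight_times)          # Med
mode = statistics.mode(flight_times)              # not reported by 1-Var Stats
sample_sd = statistics.stdev(flight_times)        # Sx (divides by n - 1)
population_sd = statistics.pstdev(flight_times)   # sigma-x (divides by n)

print("mean:", mean, "median:", median, "mode:", mode)
print("Sx:", round(sample_sd, 3), "sigma-x:", round(population_sd, 3))
```

Notice that the sample s.d. comes out a little bigger than the population s.d., because it divides by n − 1 instead of n.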
Z-SCORES
The z-score basically represents the distance that a data value is from the
mean in terms of the number of standard deviations; i.e., it’s a converted
number. Sounds a lot more complicated than it looks. All you need to know is
one simple equation:
$z = \frac{x - \mu}{\sigma}$
Let’s see an example…
Z-SCORES
Example:
Men vs. Women The average 20- to 29-year-old man is 69.6 inches tall, with a standard deviation of 3.0
inches, while the average 20- to 29-year-old woman is 64.1 inches tall, with a standard deviation of 3.8
inches. Who is relatively taller, a 67-inch man or a 62-inch woman?
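• Break the information into two (2) equations and then compare:
• Men:
• $z = \frac{67 - 69.6}{3.0} = \frac{-2.6}{3.0} \approx -0.867$
• Women:
• $z = \frac{62 - 64.1}{3.8} = \frac{-2.1}{3.8} \approx -0.553$
• The relatively taller person is…
• The 62-inch woman. Both are below average for their group, but her z-score (−0.553) is closer to the mean than his (−0.867), so she is fewer standard deviations below it.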
From Fundamentals of Statistics Third Edition text, pg. 161.
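The comparison can be sketched in a few lines of Python. The helper name z_score is just an illustration, not a built-in; the larger z-score marks the relatively taller person.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

z_man = z_score(67, 69.6, 3.0)     # 67-inch man
z_woman = z_score(62, 64.1, 3.8)   # 62-inch woman

# Whoever has the larger z-score is relatively taller
relatively_taller = "woman" if z_woman > z_man else "man"
print(round(z_man, 3), round(z_woman, 3), relatively_taller)
```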
THE FIVE-NUMBER SUMMARY,
BOXPLOTS, AND OUTLIERS
• The Five-Number Summary basically divvies up your data into quartiles. It also can be
used to construct a boxplot. The Five-Number Summary consists of the following:
• Smallest number in the data set (Minimum)
• First quartile (Q1)
• Median (Q2)
• Third quartile (Q3)
• Largest number in the data set (Maximum)
• To get it on your calculator, enter your data in L1 under Stat, 1: Edit…, and then
proceed as if you were looking for the mean, median, and mode (Stat, Calc, 1: 1-Var
Stats).
• Scroll down to the bottom of the list, and you will find the Five-Number Summary. ;)
TO CONSTRUCT A BOXPLOT
• After entering your data in L1, hit 2nd,
Y=. This will bring you to the Stat Plots
screen.
• Hit ENTER or 1. This will bring you to
another screen, where you can
actually set up the boxplot.
• Turn the plots ON.
• Select the Boxplot with Outliers type.
This is the first image in the second
row in the Types category.
• Zoom, 9: ZoomStat
• Ta-da! :D
THE INTERQUARTILE RANGE
• The Interquartile Range (IQR) is the range of the middle 50% of observations in
the data set. Again, another simple little equation:
IQR = Q3 − Q1
TO DETERMINE OUTLIERS
Outliers are basically extreme values – the numbers that are waaay out there
on either side of the data set when all the data are put in ascending order. To
determine if there are any outliers in a data set, we use the following two
equations:
• Lower Fence – any numbers below this number are considered outliers
• LF = Q1 − 1.5(IQR)
• Upper Fence – any numbers above this number are considered outliers
• UF = Q3 + 1.5(IQR)
THE FIVE-NUMBER SUMMARY,
BOXPLOTS, AND OUTLIERS
April Showers The following data represent the number of inches of rain in
Chicago, Illinois, during the month of April for 20 randomly selected years.
0.97  1.14  1.85  2.34
2.47  2.78  3.41  2.34
3.94  3.97  4.00  4.02
4.11  4.77  5.22  5.50
5.79  6.14  6.28  7.69
[a] Determine the five-number summary
[b] Compute the Interquartile Range (IQR)
[c] Determine the Upper and Lower Fences. Are there any outliers?
Fundamentals of Statistics Third Edition, pg. 162
THE FIVE-NUMBER SUMMARY,
BOXPLOTS, AND OUTLIERS – ANSWERS
• [a] Determine the five-number summary
• Record all data in L1
• Stat, Calc, 1: 1-Var Stats
• Min: 0.97
• Q1: 2.405
• Q2/Med: 3.985
• Q3: 5.36
• Max: 7.69
• [b] Compute the IQR
• Q3 – Q1 = IQR
• 5.36 – 2.405 = 2.955
• [c] Determine the Upper and Lower Fences. Are there any outliers?
• LF = Q1 – 1.5(IQR)
• LF = 2.405 – 1.5(2.955)
• LF = -2.0275
• UF = Q3 + 1.5(IQR)
• UF = 5.36 + 1.5(2.955)
• UF = 9.7925
• Because there are no numbers in the data set that are less than the lower fence or greater than the upper
fence, there are no outliers.
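Here is a rough Python sketch of the same answers. It assumes the quartiles are found the way the calculator output above suggests: Q1 and Q3 are the medians of the lower and upper halves of the sorted data. Other software may use a slightly different quartile rule, so small differences are nothing to panic about. The function name is made up for illustration.

```python
import statistics

# April rainfall (inches) for the 20 selected years
rain = [0.97, 2.47, 3.94, 4.11, 5.79, 1.14, 2.78, 3.97, 4.77, 6.14,
        1.85, 3.41, 4.00, 5.22, 6.28, 2.34, 2.34, 4.02, 5.50, 7.69]

def five_number_summary(data):
    """Min, Q1, median, Q3, max, with quartiles as medians of each half."""
    values = sorted(data)
    half = len(values) // 2
    lower = values[:half]      # bottom half (skips the middle value if n is odd)
    upper = values[-half:]     # top half
    return (values[0], statistics.median(lower), statistics.median(values),
            statistics.median(upper), values[-1])

minimum, q1, med, q3, maximum = five_number_summary(rain)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in rain if x < lower_fence or x > upper_fence]

print(minimum, q1, med, q3, maximum)
print("IQR:", round(iqr, 4), "fences:", round(lower_fence, 4), round(upper_fence, 4))
print("outliers:", outliers)
```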
LINEAR REGRESSION AND WHAT ALL GOES WITH IT
• Linear Regression is basically a method of modeling the relationship between the
independent/explanatory variable (x) and the dependent variable (y). The simplest way
to look at it is to think in terms of equations of lines. The least-squares regression line
equation is pretty much slope-intercept form – no joke.
• To solve on TI-83/84:
• Enter data in Stat, 1: Edit…
• X values in L1, Y values in L2
• Stat, Calc, 4: LinReg(ax+b)
• The results screen (shown in the slide image) will appear. If you do not see r and r²,
you will need to set it up:
• 2nd, 0 (Catalog), scroll down to DiagnosticOn (or hit the x⁻¹ key to save time), ENTER
• Hit ENTER again; a message saying DONE will appear. ;)
Look at the equation setup on that screen (y = ax + b). It should look familiar.
(Hint: substitute m for a)
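What are r and r²? Well, let’s take a look…
R AND R²
The linear correlation coefficient (r) measures the strength and direction of the
linear relationship between x and y. It always falls between −1 and 1; the closer
it is to either end, the more closely the data follow the regression line.
The coefficient of determination (r²) holds explaining power: “the linear regression
explains __ much of the variation in y.”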
Note: It is important to remember that correlation doesn’t necessarily imply causation – just because there is a correlation between two variables, it doesn’t always mean that one is causing the
other.
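To see where a, b, r, and r² come from, here is a minimal Python sketch of the least-squares formulas. The data points are hypothetical, invented only for illustration; on the calculator they would sit in L1 and L2.

```python
import math

# Hypothetical (x, y) data, invented purely for illustration
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

a = sxy / sxx                     # slope (the "m" in y = mx + b)
b = y_bar - a * x_bar             # y-intercept
r = sxy / math.sqrt(sxx * syy)    # linear correlation coefficient
r_squared = r ** 2                # coefficient of determination

print(f"y = {a:.3f}x + {b:.3f}, r = {r:.4f}, r^2 = {r_squared:.4f}")
```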
A QUICK LOOK AT PROBABILITY
• Probability measures the likelihood of something happening (and, in some
cases, not happening). Like everything else in Statistics, there are a few rules:
• The probability of any event (E) must be greater than or equal to 0 and less than
or equal to 1, AKA: 0 ≤ P(E) ≤ 1
• The sum of the probabilities of all possible outcomes must equal 1
• If an event is impossible, its probability is 0. Conversely, if an event is certain,
its probability is 1.
• An unusual event has a less than 5% (.05) chance of occurring. This is an
important rule to remember, as it holds true in several other sections in this
course.
THE EMPIRICAL RULE
• The Empirical Rule gives us an approximation of the percentage of a bell-shaped
distribution that lies within 1, 2, or 3 standard deviations of the mean.
• 68% of the distribution lies within 1 standard deviation of the mean
• 95% of the distribution lies within 2 standard deviations of the mean
• 99.7% of the distribution lies within 3 standard deviations of the mean
THE EMPIRICAL RULE:
A SLIGHTLY BETTER EXAMPLE
μ = 100, σ = 5
68%: from 100 – 5 = 95 to 100 + 5 = 105
95%: from 95 – 5 = 90 to 105 + 5 = 110
99.7%: from 90 – 5 = 85 to 110 + 5 = 115
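The same example can be sketched in Python. This assumes a perfectly normal distribution with μ = 100 and σ = 5, and it shows that 68, 95, and 99.7 are rounded versions of the exact areas.

```python
from statistics import NormalDist

mu, sigma = 100, 5
dist = NormalDist(mu, sigma)

intervals = {}
areas = {}
for k in (1, 2, 3):
    low, high = mu - k * sigma, mu + k * sigma
    intervals[k] = (low, high)
    areas[k] = dist.cdf(high) - dist.cdf(low)   # area between low and high
    print(f"within {k} s.d.: {low} to {high}, area = {areas[k]:.4f}")
```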
WHAT TO USE AND WHEN TO USE IT
{PT. 1}: NORMALCDF VS. INVNORM
• Two simple calculator functions that can very easily get mixed up. These are
the key things to remember about normalcdf and invNorm:
• normalcdf:
• When we’re given z-scores or data points* and asked to find the
probability/proportion/area under the curve.
• invNorm:
• When we’re given a percent/probability/area and asked to find the z-score/data
point.
• A trick is to look for key words. One example is the word percentile. If a question asks you to
find a percentile, you will use the invNorm function on your calculator. 
* Z-scores are what we use when dealing with the standard normal curve (μ = 0, σ = 1); when working with a normal distribution (such as in the example in slides 25-26), we refer to the numbers as
data points.
NORMALCDF
ON THE CALCULATOR
• When using the normalcdf function on your calculator, the order of input is the lower
bound, followed by the upper bound, then the mean, and finally the standard
deviation. Most calculators will have you enter the information manually; others are
kind enough to help you out a little:
-E99 and E99 are very useful in circumstances when you are not given an upper or lower
bound. In some instances, -100 and 100 can be used, but sometimes a much smaller or
larger number may be called for. To put this in the calculator, press 2nd and then the
comma key to get E. Simply type the negative symbol before it if doing –E99.
NORMALCDF EXAMPLE
• If I were to ask what the area
under the curve was above 1.50,
my graph would look something
like this:
• If I were to ask what the area under
the curve was below 1.50, my
graph would look like this:
• And if I wanted to know the area
between, say, -1.75 and 1.5, my
graph would look like this:
NOTE:
In this
example, we
are working
with the
standard
normal curve.
This means
that μ = 0 and
σ = 1.
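For the three areas described in this example, here is a minimal Python stand-in for the calculator's normalcdf. The function below is a hypothetical helper with the same input order (lower bound, upper bound, mean, standard deviation); it is not the calculator's own code.

```python
from statistics import NormalDist

E99 = 1e99   # same "practically infinity" trick as on the calculator

def normalcdf(lower, upper, mean=0.0, sd=1.0):
    """Area under the normal curve between lower and upper."""
    dist = NormalDist(mean, sd)
    return dist.cdf(upper) - dist.cdf(lower)

area_above = normalcdf(1.50, E99)       # area above z = 1.50
area_below = normalcdf(-E99, 1.50)      # area below z = 1.50
area_between = normalcdf(-1.75, 1.50)   # area between z = -1.75 and z = 1.50

print(round(area_above, 4), round(area_below, 4), round(area_between, 4))
```

The first two areas add up to 1, which is a handy sanity check.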
INVNORM
ON THE CALCULATOR
• When using the invNorm function on your calculator, the order of input is the area to
the left of the z-score we’re seeking, followed by the mean, and finally the standard
deviation. Most calculators will have you enter the information manually; others are
kind enough to help you out a little:
INVNORM
EXAMPLE 1.0
• invNorm problems can get a little confusing at times. The key, as with virtually
everything in Statistics, is in the wording.
• Let’s say I am in charge of a chocolate chip cookie factory. The factory churns out an
average of 3500 cookies a day, with a standard deviation of 45. Inventory is just
around the corner, and we’re expected to make above the 60th percentile. How many
cookies do we need to make to meet expectations?
• The wording here isn’t exactly the greatest, but we are given the information necessary
to perform the task. We know that we have an average (mean, μ) of 3500 cookies
and a standard deviation (σ) of 45. We also know that we need to make above the
60th percentile. This means our output has to land above the bottom 60% of the
distribution in order to meet the quota. How do we solve for this?
INVNORM
EXAMPLE 1.0 (CONT.)
• The easiest way to solve is to start out by drawing a picture:
• The 60th percentile is illustrated as the first 60% of our distribution. Using some more of our newly
discovered mad calculator skills, we can find the number of cookies (our data point) by
following these steps:
• 2nd, vars, 3: invNorm
• Enter, in order, the area to the left side of the data point we are seeking (here, .60), the mean,
and the standard deviation
• ENTER
• The number you get will be the data point we’re looking for. The answer here comes out to be
3511.40062. Because we are talking about cookies and not random numbers, it is best to round.
A good practice is to round up, to allow for potential extra.
ANSWER: In order to meet expectations, the factory needs to make more than 3511 cookies, i.e., at least 3512.
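The cookie problem can be checked with a short Python sketch. The helper name inv_norm is made up to mirror the calculator's input order (area to the left, mean, standard deviation).

```python
import math
from statistics import NormalDist

def inv_norm(area_left, mean=0.0, sd=1.0):
    """Data point with the given area to its left under the normal curve."""
    return NormalDist(mean, sd).inv_cdf(area_left)

cutoff = inv_norm(0.60, 3500, 45)     # the 60th percentile
cookies_needed = math.ceil(cutoff)    # round up to a whole cookie

print(round(cutoff, 5), cookies_needed)
```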
THE SAMPLING DISTRIBUTION OF x̄ VS.
THE SAMPLING DISTRIBUTION OF p̂
• A simple process gone wrong, or at least this is what it sometimes feels like
with sampling distributions. Actually, sampling distributions are really nifty little
paths that help us when dealing with samples and proportions.
• There are two that we deal with: The Sampling Distribution of x̄ and the
Sampling Distribution of p̂. The difference? Let’s find out…
THE SAMPLING DISTRIBUTION OF
THE SAMPLE MEAN
• The sampling distribution of the sample mean (AKA: x-bar, x̄) is used when
we are given information about a sample and asked to find the probability
that the sample mean falls in a certain range.
• The sampling distribution goes a little something like this:
The distribution is approximately normal
$\mu_{\bar{x}} = \mu$
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
This cute little equation is called the standard
error of the mean, or just standard error. Good
little vocab hint to remember. ;)
Let’s see an example…
EXAMPLE PROBLEM #1:
SAMPLING DISTRIBUTION OF x̄
A simple random sample of size n = 49 is obtained from a population with µ = 80
and σ = 14.
• Describe the sampling distribution of x̄.
• What is P(x̄ > 83)?
• What is P(x̄ ≤ 75.8)?
• What is P(59.8 < x̄ < 65.9)?
From Fundamentals of Statistics, Third Edition, pg. 389
EXAMPLE PROBLEM #1
SOLUTION
• The sampling distribution of x-bar is approximately normal. This is pretty much always
the case, at least for this course. If the sampling distribution is not approximately normal,
we cannot proceed with it. To determine whether it is, the general rule of thumb
is to look at your sample size (n). If the sample size is 30 or greater (n ≥ 30), you can go
ahead; if the sample size is smaller than 30, the information must specify that the
population is normal.
• The rest of the sampling distribution is as follows:
$\mu_{\bar{x}} = \mu = 80$
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{14}{\sqrt{49}} = \frac{14}{7} = 2$
EXAMPLE PROBLEM #1
SOLUTION (CONT.)
• The sampling distribution is approximately normal, with $\mu_{\bar{x}} = 80$ and $\sigma_{\bar{x}} = 2$.
• To solve, it’s best to start out by drawing a “pretty picture” before plugging numbers in
the calculator:
• For each part, select 2nd, vars, 2: normalcdf. This operation will give you the area under
the curve. 
• First curve: 2nd, vars, 2; (83, E99, 80,2) ≈ .0668
• Second curve: 2nd, vars, 2; (-E99, 75.8, 80, 2) ≈ .0179
• Third curve: 2nd, vars, 2: (
Please note: images aren’t exactly to scale. ^^;;
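As a cross-check on the first two curves, here is a minimal Python sketch. It builds the sampling distribution from μ = 80, σ = 14, and n = 49 and then takes the same areas normalcdf would.

```python
import math
from statistics import NormalDist

mu, sigma, n = 80, 14, 49

standard_error = sigma / math.sqrt(n)        # sigma of x-bar
x_bar_dist = NormalDist(mu, standard_error)  # sampling distribution of x-bar

p_above_83 = 1 - x_bar_dist.cdf(83)          # P(x-bar > 83)
p_at_most_75_8 = x_bar_dist.cdf(75.8)        # P(x-bar <= 75.8)

print(standard_error, round(p_above_83, 4), round(p_at_most_75_8, 4))
```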
THE SAMPLING DISTRIBUTION OF
THE SAMPLE PROPORTION
• The sampling distribution of the sample proportion (AKA p-hat or p̂) is used when we are
given information about a sample and a proportion/percent and asked to find the probability
of selecting individuals possessing a certain characteristic we are looking for.
• This sampling distribution looks like this:
The distribution is approximately normal
$\mu_{\hat{p}} = p$
$\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}$
$\hat{p} = \frac{x}{n}$
The last equation, p̂ = x/n, is for a point estimate. We use this to
adjust our numbers in order to fit the distribution. If
you’re dealing with decimals, a whole number isn’t
going to fall in your distribution. ;)
EXAMPLE PROBLEM: SAMPLING
DISTRIBUTION OF p̂
Smith owns a shipyard. He knows that 5% of all welding done that afternoon
will wind up being defective. Out of the 7000 welds in the yard, he examines
300. What’s the probability that between 10 and 20 welding jobs will be
defective?
SMITH AND HIS SHIPYARD
• This kind of a question is a classic nuisance. First of all, how do we determine what
numbers represent our sample (n) and the number(s) out of it (x)?
• Look at the information. The problem states that out of 7000 welds in the yard, Smith
examines 300 welds. 300 is our sample size; the 7000 total welds is the population in the
shipyard. Out of the 300 welds he’s examined, he wants to know the probability
that between 10 and 20 of them are defective. 10 and 20 represent our x
values.
• Knowing this, we can now move forward with our information gathering:
• p = 5% = .05
• n = 300
• x1 = 10
• x2 = 20
SMITH AND HIS SHIPYARD
(CONT.)
• We can now go ahead with our sampling distribution:
• $\mu_{\hat{p}} = p = .05$
• $\sigma_{\hat{p}} = \sqrt{\frac{.05(1 - .05)}{300}} \approx .0126$
• $\hat{p}_1 = 10/300 \approx .033$
• $\hat{p}_2 = 20/300 \approx .067$
• Finally, using our mad calculator skills, we can determine the area between our two
boundaries (aka data points, since this is not the standard normal curve):
• 2nd, vars, 2
• (LB, UB, μ_p̂, σ_p̂)
• (.033, .067, .05, .0126)
TA DA!!! The answer is approximately .8227
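Smith's problem can be sketched in Python as well. The first calculation uses the rounded inputs typed into the calculator above; the second keeps every decimal, which is an extra step added here for comparison. The small gap between the two is the "same ballpark" variation you get from rounding.

```python
import math
from statistics import NormalDist

p, n = 0.05, 300
x1, x2 = 10, 20

sigma_p_hat = math.sqrt(p * (1 - p) / n)   # standard deviation of p-hat
p_hat_1, p_hat_2 = x1 / n, x2 / n          # point estimates for 10 and 20 welds

# Same rounded inputs as the calculator entry (.033, .067, .05, .0126)
rounded = NormalDist(0.05, 0.0126)
area_rounded = rounded.cdf(0.067) - rounded.cdf(0.033)

# Same calculation with no intermediate rounding
unrounded = NormalDist(p, sigma_p_hat)
area_unrounded = unrounded.cdf(p_hat_2) - unrounded.cdf(p_hat_1)

print(round(sigma_p_hat, 4), round(area_rounded, 4), round(area_unrounded, 4))
```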
WHAT TO USE AND WHEN TO USE IT
{PT. 2}: CONFIDENCE INTERVALS
A Quick Run-Down
• Stat, Tests,…
• 7: ZInterval – when given σ (population standard deviation)
• 8: TInterval – when given s (sample standard deviation) or no standard deviation
when given a data set
• A: 1-PropZInt – when given x and n values (a sample, n, and a number out of the
sample, x) and a percent or proportion
• B: 2-PropZInt – when given two x and n values (two samples and numbers out of
those samples: x1 and n1, and x2 and n2)
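To show what two of these menu items compute, here is a rough Python sketch of the usual formulas behind a z-interval for a mean and a one-proportion z-interval. All of the input numbers are hypothetical, and the function names are made up; this is an illustration of the textbook formulas, not the calculator's own code.

```python
import math
from statistics import NormalDist

def z_star(confidence):
    """Critical z-value for a two-sided confidence level such as 0.95."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def z_interval(x_bar, sigma, n, confidence=0.95):
    """Interval for a mean when sigma is known: x-bar +/- z* sigma/sqrt(n)."""
    margin = z_star(confidence) * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin

def one_prop_z_interval(x, n, confidence=0.95):
    """Interval for a proportion: p-hat +/- z* sqrt(p-hat(1 - p-hat)/n)."""
    p_hat = x / n
    margin = z_star(confidence) * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Hypothetical inputs, invented purely for illustration
mean_low, mean_high = z_interval(x_bar=50, sigma=10, n=25)
prop_low, prop_high = one_prop_z_interval(x=15, n=300)

print(round(mean_low, 2), round(mean_high, 2))
print(round(prop_low, 4), round(prop_high, 4))
```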
WHAT TO USE AND WHEN TO USE IT
{PT. 3}: THE INSANITY THAT IS
HYPOTHESIS TESTING
A Quick Run-Down
• Stat, Tests,…
• 1: Z-Test – when given σ (population standard deviation)
• 2: T-Test – when given s (sample standard deviation) or no standard deviation
when given a data set
• 5: 1-PropZTest – when given x and n values (a sample, n, and a number out of
the sample, x) and a percent or proportion
• 6: 2-PropZTest – when given two x and n values (two samples and numbers out of
those samples: x1 and n1, and x2 and n2)
FEAT. TYPE I AND TYPE II ERRORS
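The best way to describe these two is to think about hypothesis testing as a
court case. The old motto of “innocent until proven guilty” provides us with our
null and alternative hypotheses: the null hypothesis is “innocent,” and the alternative is “guilty.”
• A Type I error is rejecting a null hypothesis that is actually true (convicting an innocent person).
• A Type II error is failing to reject a null hypothesis that is actually false (letting a guilty person go free).
The slide’s mini-cartoon illustrates Type I and Type II errors.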
WHAT TO USE AND WHEN TO USE IT
{THE FINALE}: MATCHED-PAIR DATA VS.
CHI-SQUARE GOF TESTING
Now, this is where things are confusing. Because these two types of testing appear so
alike when entering data into the calculator, it’s easy to get confused. How can we tell
the difference between these two kinds of problems???
WHAT TO LOOK FOR, HOW TO SOLVE
The problems from these two sections appear to be very similar, but they are very
different. It’s important to remember when to use either method of solving.
Let’s take a look at a couple of examples…
MATCHED-PAIR DATA
• On one final exam study guide, there is a problem that talks about a football coach
claiming that players can increase their strength by taking a certain supplement. He
decides to test this theory by randomly selecting 9 athletes and gives them a strength
test on the bench press. 30 days later, after regular training and taking this supplement,
he tests them again.
• First of all, we need to identify what kind of problem this is. I just identified it as a
matched-pair data problem. How can you tell if it’s this and not chi-square goodness-of-fit? Look at the data table:
Athlete | Before | After
1       | 215    | 225
2       | 240    | 245
3       | 188    | 188
4       | 212    | 210
5       | 275    | 282
6       | 260    | 275
7       | 225    | 230
8       | 200    | 195
9       | 185    | 190
• We want to test if the coach’s claim holds – that the supplement is effective in
increasing the athletes’ strength.
MATCHED-PAIR DATA
• The “Before” values are the first set of data we gathered. These will go into List 1 (Stat,
Edit, 1). The values in the second data set we gathered (“After”) will go into List 2.
• Okay, now we need to see how the difference between these two data sets levels out
in our test against our level of significance, α = 0.05. To do this, we first need to get the
differences, and we do that by highlighting the List 3 heading and entering the
equation L1 – L2 and pressing the ENTER key. This will give you a list of numbers – the
difference between List 1 and List 2.
• Exit the screen by pressing 2nd, mode. Then, go to Stat, Tests, 2 (T-Test), and use the
DATA option for your input. Change the List selection from L1 to L3, and proceed
accordingly with your hypothesis testing.
• In short:
• L1 – “before” values
• L2 – “after” values
• L3 heading – L1 – L2
• 2nd, mode (to get out)
• Stat, Tests, 2 (T-Test)
• Data
• Look at L3
**Note**: When writing out your null and alternate hypotheses, remember the H0 will always
be μ = 0, to indicate that there is no difference in values. To determine H1, look at
the alternative claim.
Remember: Problems from Matched-Pair Data and Chi-Square may look
somewhat similar, but they are very different.
The confusion lies in
1.) identifying the problem
and
2.) remembering which equation to put in the L3 heading. *
*If you have a TI-84 calculator, there is a trick to avoid this issue. The trick will be covered in another slide.
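Here is a minimal Python sketch of the matched-pair steps for the strength data, using the same L1 – L2 differences. Two things here are assumptions rather than part of the original problem write-up: the alternative hypothesis is taken to be μ < 0 (before minus after is negative when strength goes up), and the critical value 1.860 is read from a standard t-table for df = 8 with α = 0.05 in one tail.

```python
import math
import statistics

before = [215, 240, 188, 212, 275, 260, 225, 200, 185]   # L1
after = [225, 245, 188, 210, 282, 275, 230, 195, 190]    # L2

differences = [b - a for b, a in zip(before, after)]     # L3 = L1 - L2
n = len(differences)
d_bar = statistics.mean(differences)     # mean of the differences
s_d = statistics.stdev(differences)      # sample s.d. of the differences

# Test statistic for H0: mu = 0, computed on the differences
t_stat = d_bar / (s_d / math.sqrt(n))

T_CRITICAL = 1.860                       # assumed: t-table, df = 8, one tail, alpha = 0.05
reject_null = t_stat < -T_CRITICAL       # left-tailed because H1 is mu < 0

print(differences)
print(round(d_bar, 3), round(s_d, 3), round(t_stat, 3), reject_null)
```

If the differences were set up as L2 – L1 instead, every sign would flip and the test would be right-tailed, but the conclusion would be the same.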
CHI-SQUARE
• Again, problems from this section and from matched-pair data may look somewhat similar, but
they are very different.
• Here is an example: The information provided in one problem from a former exam review gives
us the number of wins for track hurdlers and asks us to test the claim that “the probabilities of
winning are the same in different positions.”
• Again, we have to identify whether this is actually chi-square and not matched-pair data. Look
at the chart, which I’ve added to here as well:
Starting Position | Number of Wins
1                 | 45
2                 | 50
3                 | 36
4                 | 44
5                 | 32
6                 | 33
• Unlike the previous question, which gave us “before” and “after” values, this gives us a table of
observed outcomes. This is where we can tell what the difference is: matched-pair data are
generally given as “before” and “after,” whereas chi-square gives us an expectation and what
we actually observed happening.
• Now that we know this much, what is our expectation for this problem? Like any other
hypothesis test, we have to identify a null hypothesis and an alternative hypothesis. In this case,
our null is what we expected. For a problem like this, we would expect that, if the runner in each
lane had an equal chance of winning, each lane would get a 1/6 chance.
CHI-SQUARE (CONT)
• How does this work on a calculator? Similar to matched-pair data problems, we
start out by entering data (Stat, Edit, 1). In L1, we put our observed data – what
actually happened. This would be the number of wins represented in the table
above. In L2, we take the total number of wins (here, 240) and multiply it by the
chance we expected (1/6), which gives 40 expected wins per lane. And in the L3
heading, we use the equation (L1 – L2)²/L2.
• In short:
• L1 – observed outcomes (what really happened); number of wins in the table
• L2 – expected outcomes (what we thought was going to happen); total wins *
chance of winning (240 * 1/6 = 40)
• L3 heading – (L1 – L2)²/L2
• 2nd, mode (to get out)
• 2nd, Stat, < , Math, 5 (sum)
• Find the sum of L3
• This is your test statistic
• Proceed accordingly with hypothesis testing
CHI-SQUARE (CONT.)
• Another way to find the chi-square test statistic is as follows:
• L1 – observed outcomes (what really happened)
• L2 – expected outcomes (what we thought was going to happen)
• 2nd, Mode (to get out)
• Stat, Tests, D (alpha, x⁻¹ key): χ² GOF-Test
• You will see the following on your screen:
• Enter your degrees of freedom (df) and hit Calculate
• This will give you not only the test statistic, but it will give you the p-value as well.

(PLEASE NOTE: this trick only works on TI-84 calculators; TI-83 calculators do not have this function!)
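The L1/L2/L3 method can be sketched in Python for the hurdler data. The significance level α = 0.05 is an assumption here (the problem as quoted does not state one), and the critical value 11.070 is read from a standard chi-square table for df = 5.

```python
observed = [45, 50, 36, 44, 32, 33]          # L1: wins by starting position
total_wins = sum(observed)

# L2: expected wins if every lane has the same 1/6 chance
expected = [total_wins / len(observed)] * len(observed)

# L3: (L1 - L2)^2 / L2, then sum(L3) is the test statistic
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                       # degrees of freedom

CHI_SQUARE_CRITICAL = 11.070                 # assumed: table value, df = 5, alpha = 0.05
reject_null = chi_square > CHI_SQUARE_CRITICAL

print(total_wins, expected[0], round(chi_square, 2), df, reject_null)
```

Under those assumptions the test statistic falls below the critical value, so the claim of equal chances would not be rejected.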
YOU AIN’T NEVER HAD A FRIEND
LIKE...
The Ultimate Wrap-Up for What to Use and When to Use it! :D
Given | Hypothesis Testing | Confidence Intervals
σ (population standard deviation) | 1 | 7
s (sample standard deviation) or no standard deviation when given a data set | 2 | 8
When given x and n values (a sample, n, and a number out of the sample, x) | 5 | A
When given x and n values for two samples (x1 and n1, x2 and n2) | 6 | B
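If it helps to see the wrap-up as a lookup, here is a tiny Python sketch. The key names are hypothetical labels made up for this illustration; the menu items are the ones listed in the run-downs earlier in the presentation.

```python
# (hypothesis test item, confidence interval item) under Stat, Tests
TI_MENU = {
    "sigma_given": ("1", "7"),        # Z-Test / Z interval
    "s_or_data_given": ("2", "8"),    # T-Test / T interval
    "one_proportion": ("5", "A"),     # 1-PropZTest / 1-PropZInt
    "two_proportions": ("6", "B"),    # 2-PropZTest / 2-PropZInt
}

def ti_menu_choice(given):
    """Look up which Stat, Tests items go with the information given."""
    return TI_MENU[given]

test_item, interval_item = ti_menu_choice("s_or_data_given")
print(test_item, interval_item)
```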
THE END!
This concludes the Statistics Blitz PowerPoint Presentation 
Created by Gina M. Salvati, CRLA certified tutor and
former peer tutor for North Florida Community College
and Florida Gateway College
Version updated: 2014.9.15
