### Chapter 1 Slides

```Unit 1 Overview
Significance – How strong is the
evidence of an effect? (Chapter 1)
 Estimation – How large is the effect?
(Chapter 2)
 Generalization – How broadly do the
conclusions apply? (Chapter 3)
 Causation – Can we say what caused the
observed difference? (Chapter 4)

Chapter 1
Significance:
How Strong is the Evidence?
Section 1.1:
Introduction to Chance Models

Organ Donation Study


78.6% in neutral group agreed
41.8% in the opt-in group agreed
The researchers found these results to be
statistically significant.
 This means that if the recruitment method
made no difference in the proportion that
would agree, results as different as we
found would be unlikely to arise by
random chance.

Dolphin Communication
Can dolphins communicate abstract ideas?
 In an experiment done in the 1960s, Doris was
instructed which of two buttons to push. She
then had to communicate this to Buzz (who
could not see Doris). If he picked the correct
button, both dolphins would get a reward.
 What are the observational units and
variables in this study?
Dolphin Communication
In one set of trials, Buzz chose the correct
button 15 out of 16 times.
 Based on these results, do you think Buzz
knew which button to push, or was he just
guessing?
 How might we justify an answer?
 How might we model this situation?
Modeling Buzz and Doris
Flip coins
 One Proportion Applet
Simulation vs. Real Study
coin flip                  =  guess by Buzz
heads                      =  correct guess
tails                      =  wrong guess
chance of heads = 1/2      =  probability of correct button when Buzz is just guessing
one set of 16 coin flips   =  one set of 16 attempts by Buzz
Three S Strategy



Statistic: Compute the statistic from the
observed data.
Simulate: Identify a model that represents a
chance explanation. Repeatedly simulate
values of the statistic that could have
happened when the chance model is true and
form a distribution.
Strength of evidence: Consider whether the
value of the observed statistic is unlikely to
occur when the chance model is true.
Buzz and Doris Redo
Instead of a canvas curtain, Dr. Bastian
constructed a wooden barrier between
Buzz and Doris.
 When tested, Buzz pushed the correct
button only 16 out of 28 times.
 Are these results statistically significant?
 Let’s go to the applet to check this out.

Exploration 1.1: Can Dogs Understand
Human Cues? (pg. 1-12)
Dogs were positioned 2.5 m from experimenter.
 On each side of the experimenter were two cups.
 The experimenter would perform some human cue
(pointing, bowing or looking) towards one of the
cups. (Non-human cues were also done.)
 We will look at Harley’s results.

Section 1.2:
Measuring Strength of Evidence
In the previous section we performed
tests of significance.
 In this section we will make things slightly
more complicated, formalize the process,
and define new terminology.

We could take a look at
Rock-Paper-Scissors-Lizard-Spock










Scissors cut paper
Paper covers rock
Rock crushes lizard
Lizard poisons Spock
Spock smashes scissors
Scissors decapitate lizard
Lizard eats paper
Paper disproves Spock
Spock vaporizes rock
(and as it always has) Rock crushes scissors
RPS
Rock-Paper-Scissors
Rock smashes scissors
 Paper covers rock
 Scissors cut paper
 Are these choices used
in equal proportions
(1/3 each)?
 One study suggests
that scissors are
chosen less than 1/3
of the time.

Rock-Paper-Scissors
Suppose we are going to test this with 12
players each playing once against a
computer.
 What are the observational units?
 What is the variable?
 Even though there are three outcomes, we
are focusing on whether the player
chooses scissors or not. This is called a
binary variable since we are focusing on 2
outcomes (not necessarily equally
likely).

Terminology: Hypotheses

When conducting a test of significance,
one of the first things we do is give the
null and alternative hypotheses.
The null hypothesis is the chance
explanation.
 Typically the alternative hypothesis is
what the researchers think is true.

Hypotheses from Buzz and Doris
Null Hypothesis: Buzz will randomly pick
a button. (He chooses the correct button
50% of the time, in the long run.)
 Alternative Hypothesis: Buzz
understands what Doris is communicating
to him. (He chooses the correct button
more than 50% of the time, in the long
run.)

These hypotheses are statements about the parameter
(long-run behavior), not the statistic (the
observed results).
Hypotheses for R-P-S in words
Null Hypothesis: People playing Rock-Paper-Scissors
will choose equally among the three options. (In particular,
they will choose scissors one-third of the
time, in the long run.)
 Alternative Hypothesis: People playing
Rock-Paper-Scissors will choose scissors
less than one-third of the time, in the long
run.

Note the differences (and similarities)
between these hypotheses and those for
Buzz and Doris.
Hypotheses for R-P-S using symbols

H0: π = 1/3

Ha: π < 1/3
where π is players’ true probability of
throwing scissors
Setting up a Chance Model

Because the Buzz and Doris example had a
50% chance outcome, we could use a coin
to model the outcome from one trial. What
could we do in the case of Rock-Paper-Scissors?
Same Three S Strategy as Before



Statistic: Compute the statistic from the
observed data. [In a class of 12 students, 2
picked scissors. This sample proportion can be
described using the symbol p̂ (p-hat); here p̂ = 2/12 ≈ 0.167].
Simulate: Identify a model that represents a
chance explanation. Repeatedly simulate values
of the statistic that could have happened when
the chance model is true and form a distribution.
Strength of evidence: Consider whether the
value of the observed statistic is unlikely to occur
when the chance model is true.
Applet
We will use the One Proportion Applet for
our test.
 This is the same applet we used last time
except now we will change the proportion
under the null hypothesis.
 Let’s go to the applet and run the test.
(Notice the use of symbols in the applet.)

Null Distribution
P-value
The p-value is the proportion of the
simulated statistics in the null distribution
that are at least as extreme (in the
direction of the alternative hypothesis) as
the value of the statistic actually observed
in the research study.
 We should have seen something similar to
this in the applet
Proportion of samples: 938/5000 = 0.1876

What can we conclude?
Do we have strong evidence that scissors
is thrown less than 1/3 of the time?
 How small of a p-value would you say
gives strong evidence?


Remember the smaller the p-value,
the stronger the evidence against the
null.
Guidelines for evaluating strength of
evidence from p-values
p-value > 0.10, not much evidence against
the null hypothesis
 0.05 < p-value ≤ 0.10, moderate evidence
against the null hypothesis
 0.01 < p-value ≤ 0.05, strong evidence
against the null hypothesis
 p-value ≤ 0.01, very strong evidence
against the null hypothesis

What can we conclude?





So we do not have strong evidence that
scissors is thrown less than 1/3 of the time.
Does this mean we can conclude that scissors
is thrown 1/3 of the time?
Is it plausible that scissors is thrown 1/3 of
the time?
Are other values plausible? Which ones?
What could we do to have a better chance of
getting strong evidence for our alternative
hypothesis?
Summary
The null hypothesis (H0) is the chance
explanation. (=)
 The alternative hypothesis (Ha) is what you
are trying to show is true. (< or >)
 A null distribution is the distribution of
simulated statistics that represent the
chance outcome.
 The p-value is the proportion of the
simulated statistics in the null distribution
that are at least as extreme as the value of
the observed statistic.

Summary
The smaller the p-value, the stronger the
evidence against the null.
 P-values less than 0.05 provide strong
evidence against the null.
 π is the population parameter
 p̂ is the sample proportion

Exploration 1.2 (pg 1-25)

Can people tell the difference between
bottled and tap water?
Section 1.3:
Alternative Measure of
Strength of Evidence
Criminal Justice System vs.
Significance Tests
Innocent until proven guilty. We assume
a defendant is innocent and the
prosecution has to collect evidence to try
to prove the defendant is guilty.
 Likewise, we assume our chance model
(or null hypothesis) is true and we collect
data and calculate a sample proportion.
We then show how unlikely our proportion
is if the chance model is true.

Criminal Justice System vs.
Significance Tests
If the prosecution shows lots of evidence
that goes against this assumption of
innocence (DNA, witnesses, motive,
contradictory story, etc.) then the jury
might conclude that the innocence
assumption is wrong.
 If, after we collect data, we find that the
likelihood (p-value) of such a proportion is
so small that it would rarely occur by
chance if the null hypothesis is true, then
we conclude our chance model is wrong.

Review
In the water tasting exploration, you could have
obtained a null distribution similar to the one
shown here. (H0: π = 0.25, Ha: π < 0.25, and p̂ = 3/27 = 0.1111)
• What does a single dot represent?
• What does the whole distribution represent?
• What is the p-value for this simulation?
• What does this p-value mean?
More Review
The null hypothesis is the chance
explanation.
 Typically the alternative hypothesis is
what the researchers think is true.
 The p-value is the proportion of outcomes
in the null distribution that are at least as
extreme as the value of the statistic
actually observed in the study.
 Small p-values are evidence against the
null.

Strength of Evidence
P-values are one measure for the strength
of evidence and they are, by far, the most
frequently used.
 P-values essentially are measures of how
far the sample statistic is away from the
parameter under the null hypothesis.
 Another measure for this distance we will
look at today is called the standardized
statistic.

Heart Transplant Operations
Example 1.3
Heart Transplants




The British Medical Journal (2004) reported
that heart transplants at St. George’s Hospital
in London had been suspended after a spike
in the mortality rate
Of the last 10 heart transplants, 80% had
resulted in deaths within 30 days
This mortality rate was over five times the
national average.
The researchers used 15% as a reasonable
value for comparison.
Heart Transplants
Does a heart transplant patient at St.
George’s have a higher probability of
dying than the national rate of 0.15?
 Observational units
   The last 10 heart transplantations
 Variable
   Whether the patient died or not
 Parameter
   The actual long-run probability of a death after
   a heart transplant operation at St. George’s
Heart Transplants
Null hypothesis: Death rate at St.
George’s is the same as the national rate
(0.15).
 Alternative hypothesis: Death rate at
St. George’s is higher than the national
rate.


H0: π = 0.15
Ha: π > 0.15

Our statistic is 8 out of 10 or 0.80
Heart Transplants
Simulation
 Null distribution of 1000 repetitions of
drawing samples of 10 “patients” where
the probability of death is equal to 0.15.
What is the p-value?
Heart Transplants
Strength of Evidence
 Our p-value is 0, so we have very strong
evidence against the null hypothesis.
 Even with this strong evidence, it would
be nice to have more data.
 Researchers examined the previous 361
heart transplantations at St. George’s and
found that 71 died within 30 days.
 Our new statistic is 71/361 ≈ 0.197
Heart Transplants

Here is a null distribution and p-value
based on the new statistic.
Heart Transplants
 We still have very strong evidence against
the null hypothesis, but not quite as
strong as the first case


Another way to measure strength of
evidence is to standardize the observed
statistic
The Standardized Statistic

The standardized statistic is the
number of standard deviations our sample
statistic is above (or below) the mean of the null
distribution.
z = (statistic − mean of null distribution) / (standard deviation of null distribution)

For a single proportion, we will use the
symbol z for standardized statistic.
The standardized statistic

Here are the standardized statistics for our two
studies.
z = (0.80 − 0.15) / 0.114 = 5.70

z = (0.197 − 0.15) / 0.019 = 2.47
In the first, our observed statistic was 5.70
standard deviations above the mean.
In the second, our observed statistic was 2.47
standard deviations above the mean.
Both of these are very strong, but we have
stronger evidence against the null in the first.
Guidelines for strength of evidence

If a standardized statistic is below -2 or
above 2, we have strong evidence against
the null.
Standardized Statistic     Evidence Against Null
between -1.5 and 1.5       not much
below -1.5 or above 1.5    moderate
below -2 or above 2        strong
below -3 or above 3        very strong
Which is Bob and which is Tim?
Do People Use Facial
Prototyping?
Exploration 1.3
Section 1.4:
Impacting Strength of
Evidence
Introduction
We’ve now looked at tests of significance
and have seen how p-values and
standardized statistics give information
about the strength of evidence against the
null hypothesis.
 Today we’ll explore factors that affect
strength of evidence.

Bob or Tim?
When the statistic is farther away from the
proportion in the null, there is stronger
evidence against the null.
p̂ = 0.82
p̂ = 0.65
Predicting Elections
from Faces
Example 1.4
Predicting Elections
Do voters make judgments about
candidates based on facial appearances?
 More specifically, can you predict an
election by choosing the candidate whose
face is more competent-looking?
 Participants were shown two candidates
and asked who has the more competent-looking face.

Who has the more competent looking face?

2004 Senate Candidates from Wisconsin
Winner
Loser
Bonus: One is named Tim and the other is
Russ. Which name is the one on the left?

2004 Senate Candidates from Wisconsin
Russ
Tim
Predicting Elections
They determined which face was the more
competent for the 32 Senate races in
2004.
 What are the observational units?



The 32 Senate races
What is the variable measured?

If the method predicted the winner correctly
Predicting Elections
Null hypothesis: The probability this
method predicts the winner equals 0.5.
(H0: π = 0.5)
 Alternative hypothesis: The probability
this method predicts the winner is greater
than 0.5. (Ha: π > 0.5)

This method correctly predicted 23 of 32 races,
hence p̂ = 23/32 ≈ 0.719, or 71.9%.
Predicting Elections
1000 simulated sets of 32 races using the
One Proportion applet.
Predicting Elections
With a p-value of 0.009 we have strong
evidence against the null hypothesis.
 When we calculate the standardized
statistic we again show strong evidence
against the null.

z = (0.7188 − 0.501) / 0.09 = 2.42

What do the p-value and standardized
statistic mean?
What affects the strength of evidence?
1. The difference between the observed
   statistic (p̂) and the null hypothesis
   parameter (π).
2. Sample size.
3. Whether we do a one-sided or two-sided test.
Difference between p̂ and π

What if researchers had correctly predicted 26 elections?

26/32 = 0.8125 never occurs just by chance in the simulation,
hence the p-value is 0.
Difference between p̂ and π

The farther away the observed statistic is
from the average value of the null
distribution (or π), the more evidence
there is against the null hypothesis.
Sample Size
Suppose the sample proportion stays the
same. Do you think increasing sample size
will increase, decrease, or have no impact
on the strength of evidence against the null
hypothesis?
Sample Size
The null distribution changes as we
increase the sample size from 32 senate
races to 128 races to 256 races.
 As the sample size increases, the
variability (standard deviation) decreases.

Sample Size
What does decreasing variability mean for
statistical significance (with same sample
proportion)?
 32 elections
   p-value = 0.009 and z = 2.42
 128 elections
   p-value = 0 and z = 5.07
 256 elections
   Even stronger evidence
   p-value = 0 and z = 9.52
Sample Size
As the sample size increases, the
variability decreases.
 Therefore, as the sample size increases,
the evidence against the null hypothesis
increases (as long as the sample
proportion stays the same and is in the
direction of the alternative hypothesis).

Two-Sided Tests

What if researchers were wrong? What if, instead of the
person with the more competent face being
elected more frequently, that person was actually
elected less frequently?
H0: π = 0.5
Ha: π > 0.5

With this alternative, if we got a sample
proportion less than 0.5, we would get a very
large p-value.
This is a one-sided test.
Often a one-sided test is too narrow.
In fact, most research uses two-sided tests.
Two-Sided Tests

In a two-sided test the alternative can be
concluded when sample proportions are in
either tail of the null distribution.
Null hypothesis: The probability this
method predicts the winner equals 0.50.
(H0: π = 0.50)
Alternative hypothesis: The probability this
method predicts the winner is not 0.50.
(Ha: π ≠ 0.50)
Two-Sided Tests
The change to the alternative hypothesis
also affects how we compute the p-value.
 Remember that the p-value is the
probability (assuming the null hypothesis
is true) of obtaining a proportion that is
equal to or more extreme than the
observed statistic
 In a two-sided test, more extreme goes
in both directions.

Two-Sided Tests

Since our sample proportion was 0.7188 and
0.7188 is 0.2188 above 0.5, we also need to
look at 0.2188 below 0.5. (This gets a bit more
complicated when the distribution is not symmetric, but
the applet will do all the work for you.)

Hence the p-value will include all simulated
proportions 0.7188 and above as well as those
0.2812 and below.
Two-Sided Tests
0.7188 or greater was obtained 9 times
 0.2812 or less was obtained 8 times
 The p-value is (8 + 9)/1000 = 17/1000 = 0.017.
 Two-sided tests increase the p-value (it
about doubles) and hence decrease the
strength of evidence.
 Two-sided tests are said to be more
conservative. More evidence is needed to
conclude the alternative.
 Let’s check this out using our applet.

Predicting House Elections




Researchers also predicted the 279 races for
the House of Representatives in 2004.
They correctly predicted the winner in 189/279
≈ 0.677, or 67.7% of the races.
The House’s sample percentage (67.7%) is a
bit smaller than the Senate’s (71.9%), but
the sample size is larger (279) than for the
Senate races (32).
Do you expect the strength of evidence to be
stronger, weaker, or essentially the same for
the House compared to the Senate?
Predicting House Elections

Distance of the observed statistic to the
null hypothesis value



The statistic in the House is 0.677 compared to
0.719 in the Senate
Slight decrease in the strength of evidence
Sample size


The sample size is almost 10 times as large
(279 vs. 32)
This will increase the strength of evidence
Predicting House Elections
Null distribution of 279 sample House races
Simulated statistics ≥0.677 didn’t occur hence
the p-value is 0
Predicting House Elections




For the Senate, the standardized statistic was 2.49.
For the House, it is 5.90.
The larger sample size for the House
trumped its smaller proportion, so we
have stronger evidence against the null
using the data from the House.
Uniform Colors?
Exploration 1.4
In four contact sports (boxing, tae kwon
do, Greco–Roman wrestling and freestyle
wrestling) in the 2004 Olympics,
participants were randomly assigned a red
or blue uniform.
 Researchers then analyzed the results.

Section 1.5:
Normal Approximation
(Theory-Based Test)
Simulation-Based vs. Theory-Based
We will now look at the more traditional
method of determining a p-value through
theory-based techniques.
 When we used simulation-based methods,
we all got slightly different p-values. The
more repetitions we do, the closer
our p-values will be to each other.
 In theory-based methods, we will use a
theoretical distribution to model our null
distribution and we will all get the same p-value.

Theory-Based Techniques
Hopefully, you’ve noticed that the shapes of
most of our simulated null distributions
were quite predictable.
 We can predict this shape using normal
distributions.
 When we do a test of significance using
theory-based methods, only how our p-values
are found will change. Everything
else will stay the same.

The Null Distributions

Our null distributions:



Were typically bell shaped
Centered at the proportion under the null
Their width was dependent mostly on the
sample size.
The Normal Distribution

Both of these are centered at 0.5.



The one on the left represents samples of size 30.
The one on the right represents samples of size 300.
Both could be predicted using normal distributions.
Examples from this chapter

Which ones will normal distributions fit?
When can I use a theory-based test that
uses the normal distribution?





The shape of the randomized null distribution is
affected by the sample size and the proportion
under which you are testing.
The larger the sample size the better.
The closer the null proportion is to 0.5 the better.
A simple rule of thumb to follow is:
 You should have at least 10 successes and 10
failures in your sample to be fairly confident that
a normal distribution will fit the simulated null
distribution nicely.
We will call guidelines like the one above validity
conditions.
Theory-Based Tests







No need to set up some randomization method
Fast and Easy
Can be done with a wide variety of software
We all get the same p-value.
Determining confidence intervals (we will do
this next time) is much easier.

They will all come with some validity
conditions (like the number of successes and
failures we have for a single proportion test).
Example 1.5: Halloween Treats
Researchers investigated whether children
show a preference for toys or candy.
 Test households in five Connecticut
neighborhoods offered children two plates:




One with candy
One with small, inexpensive toys
The researchers observed the selections of
283 trick-or-treaters between ages 3 and
14.
Halloween Treats
Null: The proportion of trick-or-treaters
who choose candy is 0.5.
 Alternative: The proportion of trick-or-treaters
who choose candy is not 0.5.

H0: π = 0.5
 Ha: π ≠ 0.5

Notice we are focusing on candy, but could
have easily done this focusing on the toy.
Halloween Treats

283 children were observed


148 (52.3%) chose candy
135 (47.7%) chose toys
Let’s first run this test using the One
Proportion applet we have been using.
 When doing this, notice the shape,
center, and standard deviation of the null
distribution.

Predicting Standard Deviation
Could you have predicted the center and
shape of the null distribution?
 What about the standard deviation?
 This is a bit harder, but can easily be done
with the formula √(π(1 − π)/n), where π is the
proportion under the null and n is the
sample size.

√(0.5(1 − 0.5)/283) = 0.0297.
Theory-Based Inference
These predictions work if we have a large
enough sample size.
 We have 148 successes and 135 failures.
Is the sample size large enough to use the
theory-based method?

Use the One Proportion applet to find the
theory-based (normal approximation)
p-value.
Halloween Treats
If half of the population of trick-or-treaters
preferred candy, then there’s a
43.9% chance that a random sample of
283 trick-or-treaters would have 148 or
more, or 135 or fewer, choose the candy.
 Since it’s not a small p-value, we don’t
have strong (or even moderate) evidence
that trick-or-treaters prefer one type of
treat over the other.

Standardized Statistic
Notice that the standardized statistic in the
applet is 0.77 (that is, the sample proportion is 0.77
standard deviations above the mean).
 Remember that a standardized statistic of
more than 2 indicates that the sample
result is far enough from the hypothesized
value to be unlikely if the null were true.
 We had a standardized statistic that was
not more than 2 (or even 1) so we don’t
really have any evidence against the null.

What happens when validity
condition is not met?
Suppose we’re testing 12 repetitions of
the Rock-Paper-Scissors test and 1 of the