CI for a Proportion Slides

```Confidence Interval (CI)
for a Proportion
http://www.rossmanchance.com/iscam2/applets/BinomDist/BinomDist.html
Critical Values (95% confidence)
Look up 0.975 (area to
the left) to get the
positive z score; 0.025
for the negative.
[Figure: standard Normal curve with area 0.95 in the middle and 0.025 in each tail.]
Notation:
C: confidence (level)
α: error rate (α = 1 – C)
α/2: error rate (probability) in each tail, low and high
z_{α/2}: critical value; the instructor's notation is usually z*

Mean / SD of a Sample Proportion
The sample proportion (a statistic) is the count X
divided by the sample size n.
The sample proportion $\hat{p} = X/n$
Has mean $\mu_{\hat{p}} = p$
Has standard deviation $\sigma_{\hat{p}} = \sqrt{p(1-p)/n}$
If the expected counts of Successes and Failures are
both at least 10 (np and n(1 – p) both ≥ 10), has an
approximately Normal distribution.
C% Confidence Interval
With approximately C% = (1 – α)100% probability,
$\hat{p}$ is within $Z_{\alpha/2}\sqrt{p(1-p)/n}$ of p.
Given results from an appropriately obtained sample…
With approximately C% confidence,
p is within $Z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$ of $\hat{p}$.
Confidence Interval for a
(Population) Proportion p
Point Estimate: $\hat{p} = X/n$
This is the sample proportion.
Error Margin: $E = Z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$
The Z value comes from the confidence C%.
C% Confidence Interval for p
Conditions for use of this method…
• Random sample from a categorical population – 2
categories (S / F)
• If sample w/o replacement: Population at least 20 times
the sample size
• At least 10 Successes and Failures in the sample (this
ensures that the Normal is appropriate)
When we collect data from 1 random sample and compute the
sample proportion, the interval of values
$\hat{p} - E < p < \hat{p} + E$
is a C% confidence interval (CI) for p.*
* which, in this type of application, is unknown.
Example
What proportion of students smoke?
p = ??     N = ??
p = probability a student smokes
  = proportion of all students who smoke
A simple random sample of 368 students is surveyed.
n = 368
X = # of sampled students who smoke
varies depending on the sample
The survey is conducted: 79 of the 368 are smokers.
X = 79 is observed for one sample. (Other samples
would yield (somewhat) different values.)
n – X = 368 – 79 = 289
79/368 = 0.215 = $\hat{p}$
Not p (it is nearly impossible that it is exactly p).
$\hat{p}$ = 0.215 is the statistic estimating the parameter p:
the sample (observed) proportion,
the point estimate of p.
A simple random sample of 368 students finds that 79 smoke.
Obtain a 95% confidence interval for the proportion of all
students who smoke.
The book would say…true proportion of students who smoke.
Each person in the population is a Success (smoker) or Failure
(nonsmoker); the sample is random
The population is huge (much bigger than 20(368))
There are 79 smokers; 289 nonsmokers. Both are well above 5.
The confidence interval based on the Normal distribution can be
used.
$\hat{p} = 79/368 = 0.215$
$E = Z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n} = 1.96\sqrt{(0.215)(0.785)/368} = 1.96(0.0214) = 0.042$
0.215 ± 0.042 (within 0.042 of 0.215)
0.215 – 0.042 = 0.173
0.215 + 0.042 = 0.257
0.173 < p < 0.257
Between 0.173 and 0.257.
Interpretation
We are approximately 95% confident that the
proportion of all* students who smoke is
between 0.173 and 0.257.
*It’s important to say “all” or
“population proportion”
Confidence Interval Example
Proper formatting of CIs:
0.215 ± 0.042
(0.173, 0.257)
0.173 to 0.257
0.173 < p < 0.257
For the last three: The low value is written first.
All CIs should be accompanied by a statement
interpreting them, including the confidence level (here
95%) and an indication that you are making a statement
about an unknown parameter p = population proportion.
Confidence Interval for p
To use this formula we need
• Random sample from a categorical population – 2
categories (S / F)
• If sampling w/o replacement: Population at least 20
times the sample size
• At least 5 Successes and Failures in the sample (this
ensures that the Normal is appropriate; 10 is better)
If one of these is violated: The confidence is not really the value
of C used in the formula.
If the sample is not random the confidence associated with this
method could be anything – but is likely to be much lower than
C.
Example 2
A marketer works for an electronics store. He wishes to
estimate the percent of coupons that will be redeemed at
the stores. 927 customers are randomly sampled and
sent coupons; 27 of them redeem their coupon.
Obtain a 90% CI for the proportion of all customers that
redeem this coupon.
Check conditions:
Of the 927 coupons, 27 are redeemed…
n = 927
x = 27
$\hat{p} = 27/927 = 0.0291$
C = 0.90 → z* = 1.645
TABLE: 1.64 or 1.65; either is fine.
Don’t overround. Keep at least
3 significant figures in
intermediate results.
Error margin…
$E = z^*\sqrt{\hat{p}(1-\hat{p})/n} = 1.645\sqrt{(0.0291)(0.9709)/927} = 1.645(0.00552) = 0.0091$
0.9709 = (1 – 0.0291) is the proportion not redeemed.
Be careful. These can be small. Keep at least 3
significant figures.
$\hat{p} = 0.0291$, E = 0.0091
0.0291 ± 0.0091 (within 0.91% of 2.91%)
0.0200 < p < 0.0382 (between 2.00% and 3.82%)
We are (approximately) 90% confident that between
2.00% and 3.82% of all coupons will be redeemed.
Example 3
What proportion of voters currently approve of the
President’s handling of the economic situation?
p = _____________________________
p = probability a voter approves
  = proportion of all voters who approve
A random sample of 1000 likely voters is taken, using
random digit dialing.
n = 1000
X = # of sampled voters who approve
p = ??
n = 1000
X = # of sampled voters who
approve. Varies.
The survey is conducted: 557 of the sampled voters
approve.
Compute X/n = 0.557.
Which of the following is correct?
$\hat{p} = 0.557$
$p = 0.557$
Fill in the blanks with the appropriate terms…
$\hat{p}$ = 0.557 is the statistic estimating the parameter p.
Obtain a 95% confidence interval for the proportion of
all voters who approve.
1st: Check Conditions
• Random sample from a categorical population – 2
categories (S / F)
• Population at least 20 times the sample size
• At least 5 Successes and Failures in the sample (this
ensures that the Normal is appropriate; 10 is better)
The numbers of Successes and Failures are 557 and 443,
respectively, both well above 5. The confidence interval
based on the Normal distribution can be used.
Summary of Information
x = 557
n = 1000
$\hat{p} = 557/1000 = 0.557$
C = 95%
z* = 1.96
$E = z^*\sqrt{\hat{p}(1-\hat{p})/n} = 1.96\sqrt{(0.557)(0.443)/1000} = 1.96(0.01571) = 0.031$
Final Numbers
Within 0.031 of 0.557…
0.557 – 0.031 = 0.526
0.557 + 0.031 = 0.588
Between 0.526 and 0.588. Any of these suffices…
0.557 ± 0.031
(0.526, 0.588)
0.526 to 0.588
0.526 < p < 0.588
Assessing the Error Margin
The error margin covers random sampling errors.
It does not account for errors due to improper
sampling, or inaccurate data collection.
Is the sample drawn from a collection of units that
may not be representative of the entire population?
If so, perhaps the interval is appropriate for the
population defined by that collection.
That is: Define a new (reduced) population.
Is any judgment required in categorizing the units as
Success and Failure?
What Confidence Means
Imagine a population for which 45% of the population
approves of the state’s governor.
Consider all samples of size n = 1000 from this
population. For each sample a 90% CI is obtained.
Before any sampling is done…before any data is
collected:
The probability of a randomly chosen sample giving
a CI that “covers” the parameter of p = 0.45 is 0.90.
[Figure: 90% confidence intervals from samples 1, 2, 3, …, Last (each of size 1000), drawn as black intervals on an axis from 0.40 to 0.54, with a blue line at p = 0.45.]
90% of black intervals cover the blue line at p.
90% of all 90% CIs “cover” the estimated parameter.
What Confidence Means
A histogram of the black dots would be Normal, with
mean 0.45. Approximately 10% of the time, the black
dot would be far enough from 0.45 so that the interval
(roughly ± 0.026) would not cover 0.45.
What Confidence Means
“Do you favor or oppose abolishing the penny?”
0.59 × 2136 = 1260.24
1260/2136 = 0.5899 = 0.590 to 3 significant digits.
$E = 1.96\sqrt{(0.59)(0.41)/2136} = 0.021$
59.0% ± 2.1% is a 95% confidence interval.
What Confidence Means
We are 95% confident that between 56.9% and
61.1% of all Americans oppose abolishing the
penny. (p represents this unknown proportion.)
In a real study: Exactly one random sample is
chosen. Once the data is recorded there is nothing
random (certainly p is not random).
The location of the “blue line” is unknown. It
exists: We just don’t know where. We don’t know
whether or not p is covered (the probability is
either 0 or 1).
We use the word confidence after the random
sample is drawn. We don’t use the word
probability (unless we are explaining what
confidence is).
Quiz
A sample of jokes from The Daily Show found that 83
of 252 were of a political nature. Assume this was a
random sample from all jokes. Then a 95% confidence
interval is (0.271, 0.387).
95% of jokes are political in nature.
FALSE. 95% is the confidence we have in the result;
it has nothing to do with the prevalence (in the
sample or for the entire population) of political jokes
on The Daily Show.
We are 95% confident that between 27.1% and 38.7%
of the sampled jokes were political in nature.
FALSE. 83 / 252 = 0.329. The probability is 100%
that the sample proportion lies within the bounds of
the interval – it centers the interval and always falls
within the bounds.
The confidence is 0.95 that another random sample of
jokes would have between 0.271 and 0.387 of the jokes
political in nature.
FALSE. Confidence intervals are not intended to
predict what will happen with other random samples.
They estimate a parameter (in this case, p).
The probability is 0.95 that between 0.271 and 0.387 of
all jokes on The Daily Show are political in nature.
FALSE. The probability is either 0 or 1 – we just
don’t know what p is. Probability refers to an
outcome that has uncertainty due to randomness. The
uncertainty here is due to ignorance.
95% of all samples of The Daily Show jokes give an
interval that covers p = the proportion of all jokes that
are political in nature. Our 1 sample, randomly drawn,
gives (0.271, 0.387). We don’t know if p is in there or
not, but we are 95% confident it is.
That’s it! TRUE!
Polls apart: Why polls vary on presidential race
The groups pollsters randomly choose to interview are bound to
differ from each other, and sometimes do significantly.
Every poll has a margin of sampling error, usually around 3
percentage points for 1,000 people.* That means the results of a
poll of 1,000 people should fall within 3 points of the results you
population of the U.S. But — and this is important — the results
are expected to be that accurate only 95 percent of the time. That
means that one time in 20, pollsters expect to interview a group
whose views are not that close** to the overall population's
views.
* Using $\hat{p}$ = 0.5 at 95% confidence, a 3-point error margin gives n ≈ 1068
** not within the error margin
Example
Suppose we randomly sample people for a telephone
poll on the issue of Presidential approval.
We’ll sample 1000 people, using 95% confidence.
People of different political leanings have
systematically different behaviors.
Refusing telephone surveys is one such behavior.
Example
Suppose (to oversimplify) that in the population 88
million people approve of the President and 72 million
disapprove. So the President’s approval rating is p =
88/160 = 0.55.
But…
The people that approve of the President are crankier
than those that do not. They are less likely to put up
with an intruding phone call. In fact, 40% of the
approvers will not respond (that’s 35.2 million people).
The disapprovers are more willing to take the call: only
10% of them will refuse (that’s 7.2 million people).
Example
(millions of people)
             Respond   Refuse   Total
Approve         52.8     35.2    88.0
Disapprove      64.8      7.2    72.0
Total          117.6     42.4   160.0
Among everyone the approval rate is 55%.
Among responders, the approval rate is
52.8 / 117.6 = 45%
The CI formed from the data estimates 45% (not 55%).
How Confident Are We?
The probability of a random sample giving a CI that
“covers” the parameter of p = 0.55 is essentially 0 (and
certainly not even close to 0.95).
The sample proportion is a biased (to the low side)
estimate of the population proportion p = 0.55.
Statistical bias is procedural, not “individual.”
You may (but it’s probably not likely) use the wrong
method and get the right answer. This is a biased
method.
If you use the right method and get the wrong
answer (which happens only 5% of the time) your
method is not biased.
Our Confidence is Shot
We’d have 0% confidence in such a procedure.
CIs handle only “errors” due to randomization.
It should not be (but IS!) called “margin of error.” It
should be called “margin of variability (at 95%
confidence).”
If other errors exist and aren’t accounted for, the
confidence you have should be (probably much)
lower than the stated confidence.
Many other types of errors are very difficult to
account for in a scientific way.
Polls apart: Why polls vary on presidential race
report them?
Bureau data… But some pollsters make these adjustments
differently than others.
* Not really. They adjust the percentages. The individual
Polls apart: Why polls vary on presidential race
…in a country where barely more than half of eligible voters
usually show up for presidential elections, pollsters want their
polls to reflect the views of those likeliest to vote.
Q: Is that hard to do?
A: Quite hard…
…nobody is 100 percent sure how to do this properly. And the
challenge is being compounded this year because many think
Obama's candidacy could spark higher turnout than usual from
certain voters, including young voters and minorities. The
question pollsters face is whether, and how, to adjust their tests
for likely voters to reflect this.
Polls apart: Why polls vary on presidential race
Q: Are people always willing to tell pollsters who they're
supporting for president?
A: No, and that's another possible source of discrepancies. Some
polling organizations gently prod people who initially say they're
undecided for a presidential preference, others do it more
vigorously. The AP's poll, for example, found 9 percent of likely
voters were undecided, while the ABC-Post survey had 2 percent.
Love, Sex and the Changing Landscape of Infidelity
…surveys appearing in sources like women’s magazines may
overstate the adultery rate, because they suffer from what
pollsters call selection bias: the respondents select themselves
and may be more likely to report infidelity.
```
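
The error-margin arithmetic in the three worked examples, and the "90% of all 90% CIs cover p" claim, can be checked with a short script. This is only a sketch under stated assumptions: the function names `proportion_ci` and `coverage` are made up for illustration and are not part of the slides, the critical value comes from Python's standard-library `statistics.NormalDist` rather than a printed table, and the simulation settings (number of samples, seed) are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist


def proportion_ci(x, n, confidence=0.95):
    """Normal-based CI for a population proportion p (hypothetical helper).

    Returns (p_hat, error_margin, low, high).
    """
    p_hat = x / n                      # point estimate: sample proportion
    alpha = 1 - confidence             # error rate
    # Critical value: area 1 - alpha/2 to the left (0.975 for 95%).
    z_star = NormalDist().inv_cdf(1 - alpha / 2)
    error_margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, error_margin, p_hat - error_margin, p_hat + error_margin


def coverage(p, n, confidence, n_samples=1000, seed=1):
    """Fraction of simulated random samples whose CI covers the true p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        # One random sample of size n: count the Successes.
        x = sum(rng.random() < p for _ in range(n))
        _, _, low, high = proportion_ci(x, n, confidence)
        hits += low <= p <= high
    return hits / n_samples


# The three worked examples from the slides: (label, X, n, confidence).
examples = [
    ("students who smoke", 79, 368, 0.95),
    ("coupons redeemed", 27, 927, 0.90),
    ("voters who approve", 557, 1000, 0.95),
]
for label, x, n, conf in examples:
    p_hat, e, low, high = proportion_ci(x, n, conf)
    print(f"{label}: p_hat={p_hat:.4f}, E={e:.4f}, CI=({low:.4f}, {high:.4f})")

# "What Confidence Means": p = 0.45, n = 1000, 90% confidence.
print("simulated coverage:", coverage(0.45, 1000, 0.90))
```

To rounding, the printed intervals should agree with the slide values: 0.173 to 0.257, 0.0200 to 0.0382, and 0.526 to 0.588. The simulated coverage should land near 0.90 but will not equal it exactly, since it is itself based on a finite number of random samples.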