Displaying and Describing Categorical Data
CHAPTER 3

Objectives
• Frequency Table
• Relative Frequency Table
• Distribution
• Area Principle
• Bar Chart
• Pie Chart
• Contingency Table
• Marginal Distribution
• Conditional Distribution
• Independence
• Segmented Bar Chart
• Simpson’s Paradox

The Three Rules of Data Analysis
• The three rules of data analysis won’t be difficult to remember:
1. Make a picture: things may be revealed that are not obvious in the raw data. These will be things to think about.
2. Make a picture: important features of and patterns in the data will show up. You may also see things that you did not expect.
3. Make a picture: the best way to tell others about your data is with a well-chosen picture.

Frequency Table
• What is a frequency table? A frequency table is an organization of raw data in tabular form, using classes (or intervals) and frequencies.
• What is a frequency count? The frequency, or frequency count, for a data value is the number of times the value occurs in the data set.

Categorical Frequency Tables
• NOTE: Later we will consider quantitative frequency tables.
• What is a categorical frequency table? A categorical frequency table represents data that can be placed in specific categories, such as gender, hair color, political affiliation, etc.

Frequency Table
A frequency table is a tabular summary of data showing the frequency (or number) of items in each of several non-overlapping categories. The objective is to provide insights about the data that cannot be quickly obtained by looking only at the original data.

Frequency Tables:
• We can “organize” the data by counting the number of data values in each category of interest.
• We can organize these counts into a frequency table, which records the totals and the category names.

Categorical Frequency Table
• Example: The blood types of 25 blood donors are given below. Summarize the data using a frequency distribution.
AB  O  B  A  A  B  B  O  O  B  A  O  B  AB  AB  O  A  B  AB  O  B  O  B  O  A

Categorical Frequency Table for the Blood Types
Note: The classes for the distribution are the blood types.

Your Turn
• Guests staying at Marada Inn were asked to rate the quality of their accommodations as being excellent (E), above average (AA), average (A), below average (BA), or poor (P). The ratings provided by a sample of 20 guests:
• BA, AA, AA, A, AA, A, AA, A, AA, BA, P, E, AA, A, AA, AA, BA, P, AA, A.
• Make a frequency table.

Categorical Frequency Distribution

Rating                  Counts
Poor (P)                     2
Below Average (BA)           3
Average (A)                  5
Above Average (AA)           9
Excellent (E)                1
Total                       20

Relative Frequency Tables:
• A relative frequency table is similar, but gives the relative frequency, a decimal or percentage (instead of counts), for each category.

Relative Frequency Table
The relative frequency of a class is the fraction or proportion of the total number of data items belonging to the class. A relative frequency table is a tabular summary of a set of data showing the relative frequency for each class.

Percent Frequency Table
The percent frequency of a class is the relative frequency multiplied by 100. A percent frequency table is a tabular summary of a set of data showing the percent frequency for each class.

Relative & Percent Categorical Frequency Table
• Using the frequency table below, from the Marada Inn problem, create a relative and percent frequency table.
• Add two additional columns labeled relative frequency and percent frequency.

Rating                  Counts
Poor (P)                     2
Below Average (BA)           3
Average (A)                  5
Above Average (AA)           9
Excellent (E)                1
Total                       20

Relative & Percent Categorical Frequency Table

Rating        Frequency    Rel. Freq.     % Freq.
Poor                  2    2/20 = .10        10%
Below avg.            3    3/20 = .15        15%
Avg.                  5    5/20 = .25        25%
Above avg.            9    9/20 = .45        45%
Excellent             1    1/20 = .05         5%
Total                20    1                100%

Frequency Tables:
• All three types of tables show how cases are distributed across the categories.
• They describe the distribution of a categorical variable because they name the possible categories and tell how frequently each occurs.

There are three kinds of lies: lies, damned lies, and statistics.
Benjamin Disraeli (1804 - 1881)

Misleading Statistics
• Now that we have the frequency table, we are ready to make a picture or a graph of the data.
• Misleading graphs
  • Scale
  • Pictographs

Misleading Statistics - Scale
• Adjusting the scale of a graph is a common way to mislead (or lie) with statistics.
• Example: Misleading Scale

Misleading Statistics
• The best data displays observe a fundamental principle of graphing data called the area principle.
• The area principle says that the area occupied by a part of the graph should correspond to the magnitude of the value it represents.
• Violations of the area principle are another common way of misleading with statistics.

What’s Wrong With This Picture?
• You might think that a good way to show the Titanic data is with this display:

What’s Wrong With This Picture?
• The ship display makes it look like most of the people on the Titanic were crew members, with a few passengers along for the ride.
• When we look at each ship, we see the area taken up by the ship, instead of the length of the ship.
• The ship display violates the area principle:
  • The area occupied by a part of the graph should correspond to the magnitude of the value it represents.

Area Principle - Pictographs
Double the length, width, and height of a cube, and the volume increases by a factor of eight.

GRAPHING CATEGORICAL OR QUALITATIVE DATA

Ways to Graph Categorical Data
Because the variable is categorical, the data in the graph can be ordered any way we want (alphabetical, by increasing value, by year, by personal preference, etc.).
1. Bar Charts - Each category is represented by a bar.
2. Pie Charts - The slices must represent the parts of one whole.
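
The cube remark under the area principle can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper names `scale_factors` and `honest_linear_scale` are made up for this example and do not come from the slides. It shows how stretching every dimension of a pictograph by a factor k inflates its area by k squared and its volume by k cubed, and what linear stretch would keep the drawn area (or volume) proportional to the value.

```python
def scale_factors(k):
    """Multipliers for (length, area, volume) when every linear dimension is scaled by k."""
    return k, k ** 2, k ** 3


def honest_linear_scale(value_ratio, dims=2):
    """Linear stretch that makes a dims-dimensional picture grow by exactly value_ratio.

    dims=2 for a flat pictograph (area), dims=3 for a drawn solid (volume).
    """
    return value_ratio ** (1 / dims)


# Doubling length, width, and height: area x4, volume x8.
print(scale_factors(2))          # (2, 4, 8)

# To show a value that is 4 times larger with a flat picture,
# each side should only be doubled, not quadrupled.
print(honest_linear_scale(4))    # 2.0
```

A bar chart avoids the problem altogether because only one dimension (the bar height) changes, so area and value stay proportional.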
Bar Charts
• A bar chart displays the distribution of a categorical variable, showing the counts for each category next to each other for easy comparison.
• A bar chart stays true to the area principle.
• Thus, a better display for the ship data is:

Bar Charts (cont.)
• A relative frequency bar chart displays the relative proportion of counts for each category.
• A relative frequency bar chart also stays true to the area principle.
• Replacing counts with percentages in the ship data:

Bar Charts
A bar chart is a graphical device for depicting qualitative data. On one axis (usually the horizontal axis), we specify the labels that are used for each of the categories. A frequency, relative frequency, or percent frequency scale can be used for the other axis (usually the vertical axis). Using a bar of fixed width drawn above each class label, we extend the height appropriately. The bars are separated to emphasize the fact that each class is a separate category.

Bar Charts
• Either counts (frequency bar chart) or proportions (relative frequency bar chart) may be shown on the y-axis. This will not change the shape or relationships of the graph.
• Make sure all graphs have a descriptive title and that the axes are labeled (this is true for all graphs in AP Stats).

Pie Charts
• When you are interested in parts of the whole, a pie chart might be your display of choice.
• Pie charts show the whole group of cases as a circle.
• They slice the circle into pieces whose size is proportional to the fraction of the whole in each category.

Pie Chart
The pie chart is a commonly used graphical device for presenting relative frequency distributions for qualitative data. First draw a circle; then subdivide the circle into sectors that correspond in area to the relative frequency for each category. Since there are 360 degrees in a circle, a category with a relative frequency of .25 would consume .25(360) = 90 degrees of the circle.
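
The frequency, relative frequency, percent frequency, and pie-sector calculations described so far can be sketched in a few lines of Python. This is a minimal illustration using the Marada Inn ratings given earlier; the names `ratings`, `order`, and `summarize` are assumptions made for this example, not part of the slides.

```python
from collections import Counter

# The 20 Marada Inn ratings from the "Your Turn" slide.
ratings = "BA AA AA A AA A AA A AA BA P E AA A AA AA BA P AA A".split()

# Display order from worst to best (a presentation choice, not a requirement).
order = ["P", "BA", "A", "AA", "E"]

counts = Counter(ratings)  # frequency of each category


def summarize(counts, order):
    """Return rows of (category, count, relative freq, percent freq, pie degrees)."""
    total = sum(counts[c] for c in order)
    rows = []
    for c in order:
        rel = counts[c] / total
        rows.append((c, counts[c], rel, 100 * rel, 360 * rel))
    return rows


table = summarize(counts, order)
for cat, n, rel, pct, deg in table:
    print(f"{cat:>2}  {n:2d}  {rel:.2f}  {pct:5.1f}%  {deg:6.1f} deg")
```

The row for Average (A) has a relative frequency of .25, so its sector is .25(360) = 90 degrees, matching the rule stated on the pie chart slide.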
Relations between Two Categorical Variables
• Examples:
  • Is gender or race related to political preference?
  • What type of music can make people relax?
  • Will different packaging of the same product attract people with different socioeconomic backgrounds?
• A contingency table or two-way table is a way to display the data from two categorical variables. It is a sort of Venn diagram that shows how a population splits according to two factors.

Contingency Tables
• A contingency table allows us to look at two categorical variables together.
• It shows how individuals are distributed along each variable, contingent on the value of the other variable.
• Example: we can examine the class of ticket and whether a person survived the Titanic:

Contingency Tables (cont.)
• The margins of the table, both on the right and on the bottom, give totals and the frequency distributions for each of the variables.
• Each frequency distribution is called a marginal distribution of its respective variable.
• The marginal distribution of Survival is:

Contingency Tables (cont.)
• Each cell of the table gives the count for a combination of values of the two variables.
• For example, the second cell in the crew column tells us that 673 crew members died when the Titanic sank.

Conditional Distributions
• A conditional distribution shows the distribution of one variable for just the individuals who satisfy some condition on another variable.
• The following is the conditional distribution of ticket Class, conditional on having survived:

Conditional Distributions (cont.)
• The following is the conditional distribution of ticket Class, conditional on having perished:

Conditional Distributions (cont.)
• The conditional distributions tell us that there is a difference in class for those who survived and those who perished.
• This is better shown with pie charts of the two distributions:

Conditional Distributions (cont.)
• We see that the distribution of Class for the survivors is different from that of the nonsurvivors.
• This leads us to believe that Class and Survival are associated, that they are not independent.
• The variables would be considered independent when the distribution of one variable in a contingency table is the same for all categories of the other variable.

Segmented Bar Charts
• A segmented bar chart displays the same information as a pie chart, but in the form of bars instead of circles.
• Each bar is treated as the “whole” and is divided proportionally into segments corresponding to the percentage in each group.
• Here is the segmented bar chart for ticket Class by Survival status:

Example: Income level vs. Job Satisfaction

                    Job Satisfaction
Income           1      2      3      4    Row Total
< 30K           20     24     80     82       206
30K-50K         22     38    104    125       289
50K-80K         13     28     81    113       235
> 80K            7     18     54     92       171
C. Total        62    108    319    412       901

Callouts on the table: Conditional distribution, Marginal distribution, Table total.

• This is a contingency table with Income Level as the row variable and Job Satisfaction as the column variable.
• The distributions of income to job satisfaction or job satisfaction to income are called conditional distributions.
• The distributions of income alone and job satisfaction alone are called marginal distributions.
• Relationships between categorical variables are described by calculating appropriate percents from the counts given in each cell.

Example:
• A Statistics class reports the following data on sex and eye color for students in the class:

                     Eye Color
Sex        Blue   Brown   Green/Hazel/Other   Total
Males         6      20                   6      32
Females       4      16                  12      32
Total        10      36                  18      64

1. What percent of females are brown-eyed?
2. What percent of brown-eyed students are female?
3. What percent of students are brown-eyed females?
4. What’s the distribution of eye color?
5. What’s the conditional distribution of eye color for the males?
6. Compare the percent who are female among the blue-eyed students to the percent of all students who are female.
7. Does it seem that eye color and sex are independent? Explain.

Solution:

                     Eye Color
Sex        Blue   Brown   Green/Hazel/Other   Total
Males         6      20                   6      32
Females       4      16                  12      32
Total        10      36                  18      64

1. What percent of females are brown-eyed? 16/32 = .5 or 50%
2. What percent of brown-eyed students are female? 16/36 = .444 or 44.4%
3. What percent of students are brown-eyed females? 16/64 = .25 or 25%
4. What’s the distribution of eye color? 10/64 = .156 or 15.6% Blue, 36/64 = .563 or 56.3% Brown, 18/64 = .281 or 28.1% Green/Hazel/Other
5. What’s the conditional distribution of eye color for the males? 6/32 = .188 or 18.8% Blue, 20/32 = .625 or 62.5% Brown, 6/32 = .188 or 18.8% Green/Hazel/Other
6. Compare the percent who are female among the blue-eyed students to the percent of all students who are female. 4/10 = .4 or 40% of the blue-eyed students are female, while 32/64 = .5 or 50% of all students are female.
7. Does it seem that eye color and sex are independent? Explain. Since blue-eyed students appear less likely to be female, it seems that Sex and Eye Color may not be independent. (But the numbers are small.)

SIMPSON’S PARADOX

Simpson’s Paradox
• Discovered by E. H. Simpson in 1951.
• Occurs when averaging different samples of different sizes.
• Two groups from one sample are compared to two similar groups from another sample.
• One sample’s success rate is higher than the other sample’s success rate in both groups.
[Picture caption: Not E. H. Simpson]

Simpson’s Paradox
• However, when the groups’ respective results are combined, the sample with the lower success rate in each group ends up with the better overall proportion of successes. Thus, the paradox.
• One sample usually has a considerably smaller number of members in one of its groups than the other sample does.
• Simpson’s Paradox generally does not occur when the groups being combined are of similar size.

What is Simpson’s Paradox?
• Simpson’s Paradox occurs when an association between two variables is reversed upon observing a third variable.

Simpson’s Paradox
• In Simpson’s paradox, a lurking variable creates a reversal in the direction of an association (“confounding”).
• To uncover Simpson’s Paradox, divide data into subgroups based on the lurking variable.

Recent Cleveland Indians season records
2003: 68-94, 42.0% winning percentage
2004: 80-82, 49.4% winning percentage
Two-season record: 148-176, 45.7% winning percentage

Recent Minnesota Twins season records
2003: 90-72, 55.6% winning percentage
2004: 92-70, 56.8% winning percentage
Two-season record: 182-142, 56.2% winning percentage

Notice that the Twins had a higher percentage in both 2003 and 2004, as well as in the two-year period. Not Simpson’s Paradox.

Simpson’s Paradox at work

Ronnie Belliard
2002: 61/289, .211 of his at-bats were hits
2003: 124/447, .277 of his at-bats were hits
Two-season average: 185/736, hits .2514 of the time

Casey Blake
2002: 4/20, .200 of his at-bats were hits
2003: 143/557, .257 of his at-bats were hits
Two-season average: 147/577, hits .2548 of the time

The two-season batting average for Belliard was lower than Blake’s, but divided into separate seasons, Belliard had the higher batting average in both seasons. This is Simpson’s Paradox.

Discrimination? (Simpson’s Paradox)
Consider college acceptance rates by sex:

          Accepted   Not accepted   Total
Men            198            162     360
Women           88            112     200
Total          286            274     560

198 of 360 (55%) of men accepted
88 of 200 (44%) of women accepted
Is there a sex bias?

Discrimination? (Simpson’s Paradox)
• Or is there a lurking variable that explains the association?
• To evaluate this, split applications according to the lurking variable “major applied to”:
  • Business School (240 applicants)
  • Art School (320 applicants)

Discrimination? (Simpson’s Paradox)
BUSINESS SCHOOL

          Accepted   Not accepted   Total
Men             18            102     120
Women           24             96     120
Total           42            198     240

18 of 120 (15%) of men were accepted to B-school
24 of 120 (20%) of women were accepted to B-school
A higher percentage of women were accepted.

Discrimination? (Simpson’s Paradox)
ART SCHOOL

          Accepted   Not accepted   Total
Men            180             60     240
Women           64             16      80
Total          244             76     320

180 of 240 (75%) of men were accepted
64 of 80 (80%) of women were accepted
A higher percentage of women were accepted.

Discrimination? (Simpson’s Paradox)
• Within each school, a higher percentage of women were accepted than men.
  • No discrimination against women
  • Possible discrimination against men
• This is an example of Simpson’s Paradox.
  • When the lurking variable (School applied to) was ignored, the data suggested discrimination against women.
  • When the School applied to was considered, the association was reversed.

Colin R. Blyth’s example of Simpson’s Paradox
• A doctor was planning to try a new treatment on patients, mostly local (C) and a few in Chicago (C’). A statistician advised him to use a table of random numbers and, as each C patient became available, assign him to the new treatment with probability .91 and leave him to the standard treatment with probability .09; and the same for each C’ patient with probabilities .01 and .99 respectively. When the doctor returned with the data, the statistician told him that the new treatment was obviously a very bad one, and criticized him for having continued trying it on so many patients.
Treatment     Standard      New
Dead              5950     9005
Alive             5050     1095
Alive (%)        (46%)    (11%)

• The doctor replied that he continued because the new treatment was obviously a very good one, having nearly doubled the recovery rate in both cities.

                C patients only        C’ patients only
Treatment     Standard      New      Standard      New
Dead               950     9000          5000        5
Alive               50     1000          5000       95
Alive (%)           5%      10%           50%      95%

Smokers’ Example
• In England a study was conducted to examine the survival rates of smokers and non-smokers. The result implied a significant positive correlation between smoking and survival because only 24% of smokers died, compared to 31% of non-smokers. When the data were broken down by age group in a contingency table, it was found that there were more older people in the non-smoker group. Thus age played a very significant role in the outcome, but since it was overlooked the researchers were left with misleading results. (Appleton & French, 1996)

The Paradox
What’s true for the parts isn’t true for the whole.

CONCLUSION!!!!
Simpson’s paradox is a rare phenomenon! It does not occur often! Statisticians must therefore be trained well enough, academically and ethically, to detect and correct it when it does occur. This is where practice, critical thinking skills, and repetition come into play!

What Can Go Wrong?
• Don’t violate the area principle.
• While some people might like the pie chart on the left better, it is harder to compare fractions of the whole, which a well-done pie chart does.

What Can Go Wrong? (cont.)
• Keep it honest: make sure your display shows what it says it shows.
• This plot of the percentage of high-school students who engage in specified dangerous behaviors has a problem. Can you see it?

What Can Go Wrong? (cont.)
• Don’t confuse similar-sounding percentages: pay particular attention to the wording of the context.
• Don’t forget to look at the variables separately too: examine the marginal distributions, since it is important to know how many cases are in each category.
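
The warning about similar-sounding percentages can be made concrete with the sex and eye color table from the earlier example. The sketch below is an illustration only; the dictionary layout and variable names are assumptions made for this example, and "Other" abbreviates the Green/Hazel/Other column. The same cell count (16 brown-eyed females) gives three different percentages depending on which total it is divided by.

```python
# Counts from the sex-by-eye-color contingency table.
table = {
    "Males":   {"Blue": 6, "Brown": 20, "Other": 6},
    "Females": {"Blue": 4, "Brown": 16, "Other": 12},
}

cell = table["Females"]["Brown"]                           # brown-eyed females
females = sum(table["Females"].values())                   # row total
brown = sum(row["Brown"] for row in table.values())        # column total
grand = sum(sum(row.values()) for row in table.values())   # table total

pct_of_females_brown_eyed = 100 * cell / females       # conditional on sex
pct_of_brown_eyed_female = 100 * cell / brown          # conditional on eye color
pct_of_all_brown_eyed_females = 100 * cell / grand     # share of the whole table

print(f"{pct_of_females_brown_eyed:.1f}%")      # 50.0%
print(f"{pct_of_brown_eyed_female:.1f}%")       # 44.4%
print(f"{pct_of_all_brown_eyed_females:.1f}%")  # 25.0%

# Marginal distribution of eye color: column totals over the table total.
eye_marginal = {
    color: sum(row[color] for row in table.values()) / grand
    for color in ("Blue", "Brown", "Other")
}
print({color: round(p, 3) for color, p in eye_marginal.items()})
```

Reading the wording carefully ("of females", "of brown-eyed students", "of students") tells you which denominator applies.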
What Can Go Wrong? (cont.)
• Be sure to use enough individuals!
• Do not make a report like “We found that 66.67% of the rats improved their performance with training. The other rat died.”

What Can Go Wrong? (cont.)
• Don’t overstate your case: don’t claim something you can’t.
• Don’t use unfair or silly averages: this could lead to Simpson’s Paradox, so be careful when you average one variable across different levels of a second variable.

What have we learned?
• We can summarize categorical data by counting the number of cases in each category (expressing these as counts or percents).
• We can display the distribution in a bar chart or pie chart.
• And we can examine two-way tables called contingency tables, examining marginal and/or conditional distributions of the variables.

Assignment
• Exercises pg. 37-43: #5, 7, 11, 16, 21, 23, 27-31 odd, 37
• Read Ch. 4, pg. 44-71
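
Supplementary sketch: the caution about averaging across levels of a second variable can be checked against the college admissions counts from the Simpson’s Paradox slides. The code below is a rough illustration, not a general-purpose test; the function names `rate`, `pooled_rate`, and `is_reversal` are invented for this example.

```python
# (accepted, applied) for each sex within each school, from the admissions slides.
admissions = {
    "Business": {"Men": (18, 120), "Women": (24, 120)},
    "Art":      {"Men": (180, 240), "Women": (64, 80)},
}


def rate(accepted, applied):
    """Acceptance rate for one group."""
    return accepted / applied


def pooled_rate(data, group):
    """Acceptance rate for a group after combining all schools."""
    accepted = sum(school[group][0] for school in data.values())
    applied = sum(school[group][1] for school in data.values())
    return accepted / applied


def is_reversal(data, a, b):
    """True if group a beats group b within every school but loses once the schools are pooled."""
    wins_within = all(rate(*s[a]) > rate(*s[b]) for s in data.values())
    return wins_within and pooled_rate(data, a) < pooled_rate(data, b)


for school, groups in admissions.items():
    print(school, {g: round(rate(*counts), 2) for g, counts in groups.items()})

print("Pooled:", {g: round(pooled_rate(admissions, g), 2) for g in ("Men", "Women")})
print("Reversal:", is_reversal(admissions, "Women", "Men"))  # True
```

Women have the higher rate within each school (20% vs. 15%, 80% vs. 75%), yet the pooled rates are 44% for women and 55% for men, because most women applied to the school with the lower acceptance rate.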