Week 10 (Nov 3-7): Two Mini-Lectures
QMM 510, Fall 2014

ML 10.1

Chapter Contents
15.1 Chi-Square Test for Independence
15.2 Chi-Square Tests for Goodness-of-Fit
15.3 Uniform Goodness-of-Fit Test
15.4 Poisson Goodness-of-Fit Test
15.5 Normal Chi-Square Goodness-of-Fit Test
15.6 ECDF Tests (Optional)
So many topics, so little time …

Chapter 15 Chi-Square Tests

Chapter 15 Chi-Square Test for Independence

Contingency Tables
• A contingency table is a cross-tabulation of n paired observations into categories.
• Each cell shows the count of observations that fall into the category defined by its row (r) and column (c) heading.
• For example: [Figure: example contingency table]

Chi-Square Test
• In a test of independence for an r × c contingency table, the hypotheses are
  H0: Variable A is independent of variable B
  H1: Variable A is not independent of variable B
• Use the chi-square test for independence to test these hypotheses.
• This nonparametric test is based on frequencies.
• The n data pairs are classified into c columns and r rows, and then the observed frequency fjk is compared with the expected frequency ejk.

Chi-Square Distribution
• The critical value comes from the chi-square probability distribution with d.f. degrees of freedom, where
  d.f. = degrees of freedom = (r – 1)(c – 1)
  r = number of rows in the table
  c = number of columns in the table
• Appendix E contains critical values for right-tail areas of the chi-square distribution, or use Excel’s =CHISQ.INV.RT(α,d.f.). (Excel’s =CHISQ.DIST.RT goes the other way: it returns the right-tail area for a given value of the statistic.)
• The mean of a chi-square distribution is d.f., and its variance is 2 × d.f.
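
As a rough sketch of where such a critical value comes from, the Python below computes right-tail chi-square areas and inverts them by bisection, using only the standard library. The function names chi2_right_tail and chi2_critical and the 3 × 4 table size are illustrative assumptions, not part of the lecture.

```python
import math

def chi2_right_tail(x, df):
    """Right-tail area P(X > x) for a chi-square variable with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, half = df / 2.0, x / 2.0
    # Series expansion of the regularized lower incomplete gamma function P(a, x/2).
    term = 1.0 / a
    total = term
    k = 1
    while term > 1e-16 * total:
        term *= half / (a + k)
        total += term
        k += 1
    lower_area = total * math.exp(a * math.log(half) - half - math.lgamma(a))
    return 1.0 - lower_area

def chi2_critical(alpha, df):
    """Right-tail critical value: the x with P(X > x) = alpha, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_right_tail(hi, df) > alpha:  # widen the bracket until the tail area is below alpha
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_right_tail(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

rows, cols = 3, 4                 # hypothetical table size
df = (rows - 1) * (cols - 1)      # d.f. = (r - 1)(c - 1) = 6
critical = chi2_critical(0.05, df)
print(f"d.f. = {df}, critical value = {critical:.2f}")  # prints: d.f. = 6, critical value = 12.59
```

The series is adequate for the moderate d.f. and α values used in class; a statistics package would normally be preferred for extreme tail areas.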
Chi-Square Distribution
• Consider the shape of the chi-square distribution: [Figure: chi-square distribution]

Expected Frequencies
• Assuming that H0 is true, the expected frequency of row j and column k is
  ejk = RjCk/n
  where
  Rj = total for row j (j = 1, 2, …, r)
  Ck = total for column k (k = 1, 2, …, c)
  n = sample size

Steps in Testing the Hypotheses
• Step 1: State the Hypotheses
  H0: Variable A is independent of variable B
  H1: Variable A is not independent of variable B
• Step 2: Specify the Decision Rule
  Calculate d.f. = (r – 1)(c – 1).
  For a given α, look up the right-tail critical value χ²R from Appendix E or by using Excel’s =CHISQ.INV.RT(α,d.f.).
  Reject H0 if the test statistic exceeds χ²R.
  For example, for d.f. = 6 and α = .05, χ².05 = 12.59.
  Here is the rejection region: [Figure: right-tail rejection region]
• Step 3: Calculate the Expected Frequencies
  ejk = RjCk/n
  For example: [Figure: table of expected frequencies]
• Step 4: Calculate the Test Statistic
  The chi-square test statistic is
  χ²calc = Σj Σk (fjk – ejk)² / ejk, summed over rows j = 1, …, r and columns k = 1, …, c.
• Step 5: Make the Decision
  Reject H0 if χ²calc > χ²R or if the p-value ≤ α.

Example: MegaStat
• All cells have ejk ≥ 5, so Cochran’s Rule is met.
• Caution: Don’t highlight row or column totals.
• p-value = 0.2154 is not small enough to reject the hypothesis of independence at α = .05.

Test of Two Proportions
• For a 2 × 2 contingency table, the chi-square test is equivalent to a two-tailed z test for two proportions.
• The hypotheses are: [Figure 14.6]

Small Expected Frequencies
• The chi-square test is unreliable if the expected frequencies are too small.
• Rules of thumb:
  • Cochran’s Rule requires that ejk ≥ 5 for all cells.
  • Up to 20% of the cells may have ejk < 5.
  • Most agree that a chi-square test is infeasible if ejk < 1 in any cell.
• If this happens, try combining adjacent rows or columns to enlarge the expected frequencies.

Cross-Tabulating Raw Data
• Chi-square tests for independence can also be used to analyze quantitative variables by coding them into categories.
• For example, the variables Infant Deaths per 1,000 and Doctors per 100,000 can each be coded into various categories: [Figure: coded categories]

Why Do a Chi-Square Test on Numerical Data?
• The researcher may believe there’s a relationship between X and Y, but doesn’t want to use regression.
• There are outliers or anomalies that prevent us from assuming that the data came from a normal population.
• The researcher has numerical data for one variable but not the other.

3-Way Tables and Higher
• More than two variables can be compared using contingency tables.
• However, it is difficult to visualize a higher-order table.
• For example, you could visualize a cube as a stack of tiled 2-way contingency tables.
• Major computer packages permit three-way tables.

Purpose of the Test
• The goodness-of-fit (GOF) test helps you decide whether your sample resembles a particular kind of population.
• The chi-square test is versatile and easy to understand.

Hypotheses for GOF Tests
• The hypotheses are:
  H0: The population follows a _____ distribution
  H1: The population does not follow a _____ distribution
• The blank may contain the name of any theoretical distribution (e.g., uniform, Poisson, normal).
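
To recap the independence test from this mini-lecture, here is a minimal Python sketch of the calculation: expected frequencies ejk = RjCk/n, the chi-square statistic, d.f., a Cochran’s Rule check, and the 2 × 2 link to the z test for two proportions. Both tables are hypothetical counts invented for illustration, and the function name chi_square_independence is an assumption. The p-value uses the closed form exp(–x/2), which holds only because this example has d.f. = 2.

```python
import math

def chi_square_independence(observed):
    """Return (chi-square statistic, d.f., expected frequencies) for an r x c table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected frequency under H0: e_jk = R_j * C_k / n
    expected = [[rj * ck / n for ck in col_totals] for rj in row_totals]
    stat = sum(
        (f - e) ** 2 / e
        for f_row, e_row in zip(observed, expected)
        for f, e in zip(f_row, e_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected

# Hypothetical 2 x 3 table of counts (not data from the lecture).
table = [[20, 30, 50],
         [30, 30, 40]]
chi2_calc, df, expected = chi_square_independence(table)

cochran_ok = all(e >= 5 for row in expected for e in row)
# With d.f. = 2 the right-tail area has the closed form exp(-x / 2).
p_value = math.exp(-chi2_calc / 2)
reject_h0 = p_value <= 0.05

print(f"chi-square = {chi2_calc:.3f}, d.f. = {df}, p-value = {p_value:.4f}")
print("Cochran's Rule met:", cochran_ok, "| reject H0:", reject_h0)

# 2 x 2 case: the squared z statistic for two proportions equals chi-square.
# Hypothetical counts: rows are groups, first column counts "successes".
two_by_two = [[30, 70],
              [45, 55]]
(x1, _), (x2, _) = two_by_two
n1, n2 = sum(two_by_two[0]), sum(two_by_two[1])
p_pooled = (x1 + x2) / (n1 + n2)
z = (x1 / n1 - x2 / n2) / math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
chi2_2x2, _, _ = chi_square_independence(two_by_two)
print(f"z^2 = {z**2:.4f}, chi-square = {chi2_2x2:.4f}")
```

In the first table the statistic is 28/9 ≈ 3.11, below the 5.99 critical value for d.f. = 2 at α = .05, so independence is not rejected. The 2 × 2 comparison uses the pooled-proportion z statistic with no continuity correction, which is the version whose square matches the chi-square statistic.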
ML 10.2

Chapter 15 Chi-Square Tests for Goodness-of-Fit

Test Statistic and Degrees of Freedom for GOF
• Assuming n observations, the observations are grouped into c classes and then the chi-square test statistic is found using
  χ²calc = Σj (fj – ej)² / ej, summed over classes j = 1, …, c
  where
  fj = the observed frequency of observations in class j
  ej = the expected frequency in class j if the sample came from the hypothesized population
• If the proposed distribution gives a good fit to the sample, the test statistic will be near zero.
• The test statistic follows the chi-square distribution with degrees of freedom d.f. = c – m – 1, where c is the number of classes used in the test and m is the number of parameters estimated.

Chapter 15 Normal Chi-Square GOF Test

Is the Sample from a Normal Population?
• Many statistical tests assume a normal population, so this is the most common GOF test.
• Two parameters, the mean μ and the standard deviation σ, fully describe a normal distribution.
• Unless μ and σ are known a priori, they must be estimated from a sample in order to perform a GOF test for normality.

Method 1: Standardize the Data
• Transform sample observations x1, x2, …, xn into standardized z-values.
• Count the sample observations within each interval on the z-scale and compare them with expected normal frequencies ej.
• Problem: Frequencies will be small in the end bins yet large in the middle bins (this may violate Cochran’s Rule and seems inefficient).

Method 2: Equal Bin Widths
• Step 1: Divide the exact data range into c groups of equal width, and count the sample observations in each bin to get observed bin frequencies fj.
• Step 2: Convert the bin limits into standardized z-values: z = (x – x̄)/s, where x is a bin limit and x̄ and s are the sample mean and standard deviation.
• Step 3: Find the normal area within each bin assuming a normal distribution.
• Step 4: Find expected frequencies ej by multiplying each normal area by the sample size n.
• Problem: Frequencies will be small in the end bins yet large in the middle bins (this may violate Cochran’s Rule and seems inefficient).

Method 3: Equal Expected Frequencies
• Define histogram bins in such a way that an equal number of observations would be expected under the hypothesis of a normal population, i.e., so that ej = n/c.
• A normal area of 1/c is expected in each bin.
• The first and last classes must be open-ended, so to define c bins we need c – 1 cut points.
• Count the observations fj within each bin.
• Compare the fj with the expected frequencies ej = n/c.
• Advantage: Makes efficient use of the sample.
• Disadvantage: Cut points on the z-scale may seem strange.
• Standard normal cut points for equal area bins: [Table 15.16]

Critical Values for Normal GOF Test
• Two parameters, μ and σ, are estimated from the sample (m = 2), so the degrees of freedom are d.f. = c – m – 1 = c – 3.
• We need at least four bins to ensure at least one degree of freedom.

Small Expected Frequencies
• Cochran’s Rule suggests ej ≥ 5 in each bin (e.g., with 4 bins we would want n ≥ 20, and so on).

Visual Tests
• The fitted normal superimposed on a histogram gives visual clues as to the likely outcome of the GOF test.
• A simple “eyeball” inspection of the histogram may suffice to rule out a normal population by revealing outliers or other nonnormality issues.

ML 10.3

ECDF Tests for Normality
• There are alternatives to the chi-square test for normality based on the empirical cumulative distribution function (ECDF).
• ECDF tests are done by computer. Details are omitted here.
• A small p-value casts doubt on normality of the population.
• The Kolmogorov-Smirnov (K-S) test uses the largest absolute difference between the actual and expected cumulative relative frequency of the n data values.
• The Anderson-Darling (A-D) test is based on a probability plot. When the data fit the hypothesized distribution closely, the probability plot will be close to a straight line. The A-D test is widely used because of its power and attractive visual.

Chapter 15 ECDF Tests

Example: Minitab’s Anderson-Darling Test for Normality
• Data: weights of 80 babies (in ounces)
• Near-linear probability plot suggests good fit to normal distribution.
• p-value = 0.122 is not small enough to reject normal population at α = .05.

Example: MegaStat’s Normality Tests
• Data: weights of 80 babies (in ounces)
• p-value = 0.2487 is not small enough to reject normal population at α = .05 in this chi-square test.
• Near-linear probability plot suggests good fit to normal distribution.
• Note: MegaStat’s chi-square test is not as powerful as the A-D test, so we would prefer the A-D test if software is available. The MegaStat probability plot is good, but shows no p-value.
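
To make the two kinds of normality test concrete, the sketch below computes both an equal-expected-frequency chi-square statistic (Method 3) and the K-S largest-difference statistic on the same sample, using only the Python standard library. The sample is simulated (80 draws from an assumed normal with mean 120 and standard deviation 15); it is not the baby-weight data from the examples, and the choice of c = 8 bins is also an assumption. No K-S p-value is computed, because the usual K-S critical values assume the mean and standard deviation were not estimated from the same sample.

```python
import random
from bisect import bisect_right
from statistics import NormalDist, mean, stdev

# Simulated sample, for illustration only (not the data from the lecture).
rng = random.Random(510)
sample = [rng.gauss(120.0, 15.0) for _ in range(80)]

n = len(sample)
fitted = NormalDist(mean(sample), stdev(sample))   # normal fitted with estimated mu and sigma

# Method 3: c bins with equal normal area 1/c need c - 1 cut points.
c = 8
cuts = [fitted.inv_cdf(j / c) for j in range(1, c)]

observed = [0] * c
for x in sample:
    observed[bisect_right(cuts, x)] += 1           # index of the bin containing x

expected = n / c                                   # e_j = n / c in every bin
chi2_calc = sum((f - expected) ** 2 / expected for f in observed)
df = c - 2 - 1                                     # d.f. = c - m - 1 with m = 2 estimated parameters

# K-S statistic: largest gap between the ECDF and the fitted normal CDF,
# checked just before and just after each sorted observation.
xs = sorted(sample)
d_stat = max(
    max((i + 1) / n - fitted.cdf(x), fitted.cdf(x) - i / n)
    for i, x in enumerate(xs)
)

print("observed bin counts:", observed)
print(f"expected per bin = {expected}, chi-square = {chi2_calc:.3f}, d.f. = {df}")
print(f"K-S largest difference D = {d_stat:.4f}")
```

With n = 80 and 8 bins, each expected frequency is 10, so Cochran’s Rule is satisfied. The chi-square statistic would be compared with the right-tail critical value for d.f. = 5.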