
LOGISTIC REGRESSION

OUTLINE
• Basic Concepts of Logistic Regression
• Finding Logistic Regression Coefficients using Excel's Solver
• Significance Testing of the Logistic Regression Coefficients
• Testing the Fit of the Logistic Regression Model
• Finding Logistic Regression Coefficients using Newton's Method
• Comparing Logistic Regression Models
• Hosmer-Lemeshow Test

BASIC CONCEPTS OF LOGISTIC REGRESSION

• The basic approach is to use the following regression model, employing the notation from Definition 3 of Method of Least Squares for Multiple Regression:

ln Odds(E) = b0 + b1x1 + b2x2 + ⋯ + bkxk

where the odds function is as given in the following definition.

Definition 1: Odds(E) is the odds that event E occurs, namely

Odds(E) = P(E) / P(E′) = P(E) / (1 − P(E))

Where p has a value 0 ≤ p ≤ 1 (i.e. p is a probability value), we can define the odds function as

Odds(p) = p / (1 − p)

For our purposes, the odds function has the advantage of transforming the probability function, which has values from 0 to 1, into an equivalent function with values from 0 to ∞. When we take the natural log of the odds function, we get a range of values from −∞ to ∞.

Definition 2: The logit function is the log of the odds function, namely logit(p) = ln Odds(p), or

logit(p) = ln [p / (1 − p)] = ln p − ln(1 − p)

Definition 3: Based on the logistic model as described above, we have

logit(p) = b0 + b1x1 + b2x2 + ⋯ + bkxk

where p = P(E). It now follows that

p / (1 − p) = e^(b0 + b1x1 + ⋯ + bkxk)

and so

p = e^(b0 + b1x1 + ⋯ + bkxk) / (1 + e^(b0 + b1x1 + ⋯ + bkxk)) = 1 / (1 + e^−(b0 + b1x1 + ⋯ + bkxk))

• For our purposes we take the event E to be that the dependent variable y has value 1. If y takes only the values 0 or 1, we can think of E as success and the complement E′ of E as failure, just as for the trials in a binomial distribution.

• Just as for the regression model studied in Regression and Multiple Regression, a sample consists of n data elements of the form (yi, xi1, xi2, …, xik), but for logistic regression each yi only takes the value 0 or 1. Now let Ei = the event that yi = 1 and pi = P(Ei).
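The odds, logit and inverse-logit transformations of Definitions 1–3 can be sketched in a few lines of Python (the function names here are mine, not from the original):

```python
import math

def odds(p):
    """Odds(p) = p / (1 - p) for a probability 0 <= p < 1."""
    return p / (1 - p)

def logit(p):
    """logit(p) = ln(p / (1 - p)), mapping (0, 1) onto (-inf, inf)."""
    return math.log(odds(p))

def logistic(z):
    """Inverse of the logit: p = 1 / (1 + e^-z), mapping any real z back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))
```

For instance, odds(0.75) returns 3.0, and logistic(logit(p)) recovers p for any p strictly between 0 and 1.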
Just as the regression line studied previously provides a way to predict the value of the dependent variable y from the values of the independent variables x1, …, xk, for logistic regression we have

P(y = 1) = 1 / (1 + e^−(b0 + b1x1 + ⋯ + bkxk))

logit(P(y = 1)) = ln [P(y = 1) / (1 − P(y = 1))] = b0 + b1x1 + ⋯ + bkxk

• Note that since the yi have a proportion distribution, by Property 2 of Proportion Distribution, var(yi) = pi (1 − pi).

• In the case where k = 1, we have

p = 1 / (1 + e^−(b0 + b1x))

Such a curve has a sigmoid shape. The values of b0 and b1 determine the location, direction and spread of the curve. The curve is symmetric about the point where x = −b0/b1; in fact, the value of p is 0.5 for this value of x.

Figure 1 – Sigmoid curve for p

• Logistic regression is used instead of ordinary multiple regression because the assumptions required for ordinary regression are not met. In particular:

1. The assumption of the linear regression model that the values of y are normally distributed cannot be met since y only takes the values 0 and 1.

2. The assumption of the linear regression model that the variance of y is constant across values of x (homogeneity of variances) also cannot be met with a binary variable. Since the variance is p(1 − p), when 50 percent of the sample consists of 1s the variance is .25, its maximum value. As we move to more extreme values, the variance decreases: when p = .10 or .90 the variance is (.1)(.9) = .09, and as p approaches 0 or 1 the variance approaches 0.

3. Using the linear regression model, the predicted values will become greater than one or less than zero if you move far enough along the x-axis. Such values are theoretically inadmissible for probabilities.

• For the logistic model, the least squares approach to calculating the values of the coefficients bi cannot be used; instead the maximum likelihood techniques described below are employed to find these values.
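The symmetry claim above is easy to verify numerically, here with illustrative coefficient values of my own choosing (not taken from the example):

```python
import math

def p_of_x(x, b0, b1):
    """Single-predictor logistic model: p = 1 / (1 + e^-(b0 + b1*x))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

b0, b1 = 2.0, -0.5        # hypothetical coefficients
x_mid = -b0 / b1          # the curve passes through p = 0.5 here (x = 4)
```

p_of_x(x_mid, b0, b1) evaluates to exactly 0.5, and the variance p(1 − p) attains its maximum of .25 at this same point.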
Definition 4: The odds ratio between two data elements in the sample is defined as follows:

OR = Odds(x1, …, xk) / Odds(x′1, …, x′k)

Using the notation px = P(x), the log odds ratio of the estimates is defined as ln [Odds(px+1) / Odds(px)].

• In the case where k = 1,

ln Odds(px+1) − ln Odds(px) = [b0 + b1(x + 1)] − [b0 + b1x] = b1

Thus,

Odds(px+1) / Odds(px) = e^b1

Furthermore, for any value of d,

Odds(px+d) / Odds(px) = e^(b1·d)

Note that when x is a dichotomous variable,

e^b1 = Odds(p1) / Odds(p0)

E.g. when x = 0 for male and x = 1 for female, e^b1 represents the odds ratio between females and males. If for example b1 = 2, and we are measuring the probability of getting cancer under certain conditions, then e^b1 = e^2 ≈ 7.4, which would mean that the odds of females getting cancer would be 7.4 times greater than the odds for males under the same conditions.

• The model we will use is based on the binomial distribution, namely the probability that the sample data occurs as it does is given by

∏ (i = 1 to n) pi^yi (1 − pi)^(1−yi)

Taking the natural log of both sides and simplifying we get the following definition.

Definition 5: The log-likelihood statistic is defined as follows:

LL = Σ (i = 1 to n) [yi ln pi + (1 − yi) ln(1 − pi)]

where the yi are the observed values while the pi are the corresponding theoretical values.

• Example 1: A sample of 760 people who received doses of radiation between 0 and 1000 rems was made following a recent nuclear accident. Of these, 302 died, as shown in the table in Figure 2. Each row in the table represents the midpoint of an interval of 100 rems (i.e. 0–100, 100–200, etc.).

Figure 2 – Survival data for Example 1

<Solution> Let Ei = the event that a person in the ith interval survived. The table also shows the probability P(Ei) and odds Odds(Ei) of survival for a person in each interval. Note that P(Ei) = the percentage of people in interval i who survived. In Figure 3 we plot the values of P(Ei) vs. i and ln Odds(Ei) vs. i. We see that the second of these plots is reasonably linear.
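Definition 5 translates directly into code; a minimal sketch (the function name is mine):

```python
import math

def log_likelihood(ys, ps):
    """LL = sum over i of y_i ln(p_i) + (1 - y_i) ln(1 - p_i)."""
    return sum(y * math.log(p) + (1 - y) * math.log(1 - p)
               for y, p in zip(ys, ps))
```

With observed values [1, 0] and predicted probabilities [0.5, 0.5] this gives 2 ln(0.5) ≈ −1.386. LL is always ≤ 0 and is larger (closer to 0) the better the pi match the yi.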
Given that there is only one independent variable (namely x = # of rems), we can use the following model:

logit(p) = a + bx

Here we use coefficients a and b instead of b0 and b1 just to keep the notation simple.

• We show two different methods for finding the values of the coefficients a and b: the first uses Excel's Solver tool and the second uses Newton's method. Before proceeding it might be worthwhile to click on Goal Seeking and Solver to review how to use Excel's Solver tool, and on Newton's Method to review how to apply Newton's method. We will use both methods to maximize the value of the log-likelihood statistic as defined in Definition 5.

FINDING LOGISTIC REGRESSION COEFFICIENTS USING EXCEL'S SOLVER

• We now show how to find the coefficients for the logistic regression model using Excel's Solver capability (see also Goal Seeking and Solver). We start with Example 1 from Basic Concepts of Logistic Regression.

Example 1 (continued): From Definition 1 of Basic Concepts of Logistic Regression, the predicted value pi for the probability of survival for each interval i is given by

pi = 1 / (1 + e^−(a + b·xi))

where xi represents the number of rems for interval i. The log-likelihood statistic as defined in Definition 5 of Basic Concepts of Logistic Regression is given by

LL = Σ [yi ln pi + (1 − yi) ln(1 − pi)]

where yi is the observed probability of survival in the ith interval. Since we are aggregating the sample elements into intervals, we use the modified version of the formula

LL = Σ (i = 1 to r) ni [yi ln pi + (1 − yi) ln(1 − pi)]

where yi is the observed probability of survival in the ith of r intervals and ni is the number of observations in that interval.

We capture this information in the worksheet in Figure 1 (based on the data in Figure 2 of Basic Concepts of Logistic Regression). In Figure 1, column I contains the rem values for each interval (a copy of columns A and E). Column J contains the observed probability of survival for each interval (a copy of column F). Column K contains the values of each pi. E.g.
cell K4 contains the formula =1/(1+EXP(-O5-O6*I4)) and initially has value 0.5 based on the initial guess of the coefficients a and b given in cells O5 and O6 (which we arbitrarily set to zero). Cell L14 contains the value of LL using the formula =SUM(L4:L13), where L4 contains the formula =(B4+C4)*(J4*LN(K4)+(1-J4)*LN(1-K4)), and similarly for the other cells in column L.

We now use Excel's Solver tool by selecting Data > Analysis|Solver and filling in the dialog box that appears as described in Figure 2 (see Goal Seeking and Solver for more details). Our objective is to maximize the value of LL (in cell L14) by changing the coefficients (in cells O5 and O6). It is important, however, to make sure that the Make Unconstrained Variables Non-Negative checkbox is not checked.

When we click on the Solve button we get a message that Solver has successfully found a solution, i.e. it has found values for a and b which maximize LL. We elect to keep the solution found and Solver automatically updates the worksheet from Figure 1 based on the values it found for a and b. The resulting worksheet is shown in Figure 3.

We see that a = 4.476711 and b = -0.00721. Thus the logistic regression model is given by the formula

p = 1 / (1 + e^−(4.476711 − 0.00721x))

For example, the predicted probability of survival when exposed to 380 rems of radiation is given by

p = 1 / (1 + e^−(4.476711 − 0.00721·380)) ≈ .85

Note that

Odds(180) / Odds(200) = e^(−0.00721·(180 − 200)) = e^0.1442 ≈ 1.155

Thus, the odds that a person exposed to 180 rems survives are 15.5% greater than for a person exposed to 200 rems.

• Real Statistics Data Analysis Tool: The Real Statistics Resource Pack provides the Logistic Regression supplemental data analysis tool. This tool takes as input a range which lists the sample data followed by the number of occurrences of success and failure (this is considered to be the summary form). E.g. for Example 1 this is the data in range A3:C13 of Figure 1. For this problem there was only one independent variable (number of rems). If additional independent variables are used then the input will contain additional columns, one for each independent variable.
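Using the Solver coefficients reported above (a = 4.476711, b = -0.00721), these calculations can be reproduced in a few lines of Python:

```python
import math

a, b = 4.476711, -0.00721   # coefficients found by Solver

def p_survive(rems):
    """Predicted probability of survival: p = 1 / (1 + e^-(a + b*rems))."""
    return 1.0 / (1.0 + math.exp(-(a + b * rems)))

# odds ratio for 180 rems vs. 200 rems: e^(b * (180 - 200))
odds_ratio = math.exp(b * (180 - 200))
```

Here p_survive(380) comes to about .85, and odds_ratio to about 1.155, matching the 15.5% figure above.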
We show how to use this tool to create a spreadsheet similar to the one in Figure 3. First press Ctrl-m to bring up the menu of Real Statistics supplemental data analysis tools and choose the Logistic Regression option. This brings up the dialog box shown in Figure 4. Now select A3:C13 as the Input Range (see Figure 5) and, since this data is in summary form with column headings, select the Summary data option for the Input Format and check Headings included with data. Next select Solver as the Analysis Type and keep the default Alpha and Classification Cutoff values of .05 and .5 respectively. Finally press the OK button to obtain the output displayed in Figure 5 (the input data from range A3:C13 of Figure 1 is repeated in Figure 5 in the same cells).

Note that the coefficients (range Q7:Q8) are set initially to zero and LL (cell M16) is calculated to be -526.792 (exactly as in Figure 1). The output from the Logistic Regression data analysis tool also contains many fields which will be explained later.

As described in Figure 2, we can now use Excel's Solver tool to find the logistic regression coefficients. The result is shown in Figure 6. We obtain the same values for the regression coefficients as we obtained previously in Figure 3, and all the other cells are updated with the correct values as well.

SIGNIFICANCE TESTING OF THE LOGISTIC REGRESSION COEFFICIENTS

• Definition 1: For any coefficient b the Wald statistic is given by the formula

Wald = b / s.e.(b)

• For ordinary regression we can calculate a statistic t ~ T(dfRes) which can be used to test the hypothesis that a coefficient b = 0.
The Wald statistic is approximately normal and so it can be used to test whether the coefficient b = 0 in logistic regression.

• Since the Wald statistic is approximately normal, by Theorem 1 of Chi-Square Distribution, Wald² is approximately chi-square, and, in fact, Wald² ~ χ²(df) where df = k − k0, k = the number of parameters (i.e. the number of coefficients) in the model (the full model) and k0 = the number of parameters in a reduced model (esp. the baseline model which doesn't use any of the variables, only the intercept).

• Property 1: The covariance matrix S for the coefficient matrix B is given by the matrix formula

S = (X^T V X)^−1

where X is the r × (k+1) design matrix (as described in Definition 3 of Least Squares Method for Multiple Regression) and V = [vij] is the r × r diagonal matrix whose diagonal elements are vii = ni pi (1 − pi), where ni = the number of observations in group i and pi = the probability of success predicted by the model for elements in group i. Groups correspond to the rows of matrix X and consist of the various combinations of values of the independent variables. Note that S = (X^T W)^−1 where W is X with each element in the ith row of X multiplied by vii.

Observation: The standard errors of the logistic regression coefficients are the square roots of the entries on the diagonal of the covariance matrix in Property 1.

• Example 1 (Coefficients): We now turn our attention to the coefficient table given in range E18:L20 of Figure 6 of Finding Logistic Regression Coefficients using Solver (repeated in Figure 1 below).

Figure 1 – Output from Logistic Regression tool

Using Property 1 we calculate the covariance matrix S (range V6:W7) for the coefficient matrix B via the formula S = (X^T V X)^−1. Actually, for computational reasons it is better to use the equivalent array formula S = (X^T W)^−1. The formulas used to calculate the values for the Rems coefficient (row 20) are given in Figure 2.
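For the single-predictor case, Property 1 and the Observation above can be sketched in pure Python, since a 2×2 inversion is all that is needed (function names are mine):

```python
import math

def coeff_covariance(xs, ns, ps):
    """S = (X^T V X)^-1 for a model with one predictor plus an intercept.

    xs: predictor value for each group, ns: group sizes,
    ps: model-predicted success probability for each group.
    X has rows (1, x_i); V is diagonal with v_ii = n_i p_i (1 - p_i).
    """
    v = [n * p * (1 - p) for n, p in zip(ns, ps)]
    s00 = sum(v)
    s01 = sum(vi * x for vi, x in zip(v, xs))
    s11 = sum(vi * x * x for vi, x in zip(v, xs))
    det = s00 * s11 - s01 * s01
    # invert the 2x2 matrix X^T V X
    return [[s11 / det, -s01 / det], [-s01 / det, s00 / det]]

def standard_errors(cov):
    """Square roots of the diagonal entries of the covariance matrix."""
    return [math.sqrt(cov[i][i]) for i in range(len(cov))]
```

Dividing each coefficient by its standard error then gives the Wald statistic of Definition 1.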
Note that Wald represents the Wald² statistic and that lower and upper represent the 100(1 − α)% confidence interval of exp(b). Since 1 = exp(0) is not in the confidence interval (.991743, .993871), the Rems coefficient b is significantly different from 0 and should therefore be retained in the model.

Observation: The % Correct statistic (cell N16 of Figure 1) is another way to gauge the fit of the model to the observed data. The statistic says that 76.8% of the observed cases are predicted accurately by the model. This statistic is calculated as follows: for any observed values of the independent variables, when the predicted value of p is greater than or equal to .5 (viewed as predicting success), the % correct is equal to the observed number of successes divided by the total number of observations (for those values of the independent variables). When p < .5 (viewed as predicting failure), the % correct is equal to the observed number of failures divided by the total number of observations. These values are weighted by the number of observations of that type and then summed to provide the % correct statistic for all the data.

For example, for the case where Rem = 450, p-Pred = .774 (cell J10), which predicts success (i.e. survived). Thus the % Correct for Rem = 450 is 85/108 = 78.7% (cell N10). The weighted sum (found in cell N16) of all these cells is then calculated by the formula =SUMPRODUCT(N6:N15,H6:H15)/H16.

TESTING THE FIT OF THE LOGISTIC REGRESSION MODEL

• For larger values of b, the standard error and the Wald statistic become inflated, which increases the probability that b is viewed as not making a significant contribution to the model even when it does (i.e. a type II error).
To overcome this problem it is better to test on the basis of the log-likelihood statistic, since

χ² = 2(LL1 − LL0) ~ χ²(df)

where df = k − k0, and where LL1 refers to the full log-likelihood model and LL0 refers to a model with fewer coefficients (especially the model with only the intercept b0 and no other coefficients). This is equivalent to

χ² = −2 ln(L0 / L1)

Observation: For ordinary regression the coefficient of determination is

R² = SSReg / SSTot

Thus R² measures the percentage of variance explained by the regression model. We need a similar statistic for logistic regression. We define the following three pseudo-R² statistics for logistic regression.

Definition 1: The log-linear ratio R² is defined as follows:

R²L = (LL0 − LL1) / LL0 = 1 − LL1 / LL0

where LL1 refers to the full log-likelihood model and LL0 refers to a model with fewer coefficients (especially the model with only the intercept b0 and no other coefficients).

Cox and Snell's R² is defined as

R²CS = 1 − (L0 / L1)^(2/n)

where n = the sample size.

Nagelkerke's R² is defined as

R²N = R²CS / (1 − L0^(2/n))

Observation I: Since Cox and Snell's R² cannot achieve a value of 1, Nagelkerke's R² was developed to have properties more similar to the R² statistic used in ordinary regression.

Observation II: The initial value LL0 of LL, i.e. where we only include the intercept value b0, is given by

LL0 = n0 ln(n0/n) + n1 ln(n1/n)

where n0 = number of observations with value 0, n1 = number of observations with value 1 and n = n0 + n1.

As described above, the likelihood-ratio test statistic equals

χ² = −2 ln(L0 / L1) = 2(LL1 − LL0)

where L1 is the maximized value of the likelihood function for the full model, while L0 is the maximized value of the likelihood function for the reduced model. The test statistic has a chi-square distribution with df = k1 − k0, i.e. the number of parameters in the full model minus the number of parameters in the reduced model.

• Example 1: Determine whether there is a significant difference in survival rate between the different values of rem in Example 1 of Basic Concepts of Logistic Regression. Also calculate the various pseudo-R² statistics.
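Given LL0, LL1 and the sample size, the three pseudo-R² statistics in Definition 1 can be computed as follows (a sketch; the function name is mine):

```python
import math

def pseudo_r2(ll0, ll1, n):
    """Log-linear ratio (1 - LL1/LL0), Cox & Snell's and Nagelkerke's R^2.

    ll0: log-likelihood of the intercept-only model,
    ll1: log-likelihood of the full model, n: sample size.
    """
    log_linear = 1 - ll1 / ll0
    cox_snell = 1 - math.exp(2 * (ll0 - ll1) / n)     # 1 - (L0/L1)^(2/n)
    nagelkerke = cox_snell / (1 - math.exp(2 * ll0 / n))
    return log_linear, cox_snell, nagelkerke
```

When the full model adds nothing (LL1 = LL0) all three statistics are 0; as LL1 approaches 0 the log-linear ratio approaches 1.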
We are essentially comparing the logistic regression model with coefficient b to the model without coefficient b. We begin by calculating LL1 (for the full model with b) and LL0 (for the reduced model without b). Here LL1 is found in cell M16 or T6 of Figure 6 of Finding Logistic Coefficients using Solver. We now use the following test:

χ² = 2(LL1 − LL0) = 280.246

where df = 1. Since p-value = CHIDIST(280.246,1) = 6.7E-63 < .05 = α, we conclude that differences in rems yield a significant difference in survival. The pseudo-R² statistics can now be calculated from LL0 and LL1 as described above. All these values are reported by the Logistic Regression data analysis tool (see range S5:T16 of Figure 6 of Finding Logistic Coefficients using Solver).

FINDING LOGISTIC REGRESSION COEFFICIENTS USING NEWTON'S METHOD

• Property 1: The maximum of the log-likelihood statistic (from Definition 5 of Basic Concepts of Logistic Regression) occurs when

Σ (i = 1 to n) (yi − pi) xij = 0   for each j = 0, 1, …, k

Observation: Thus, to find the values of the coefficients bi we need to solve these k + 1 equations. We can do this iteratively using Newton's method (see Definition 2 of Newton's Method and Property 2 of Newton's Method), as described in Property 2.

• Property 2: Let B = [bj] be the (k+1) × 1 column vector of logistic regression coefficients, let Y = [yi] be the n × 1 column vector of observed outcomes of the dependent variable, let X be the n × (k+1) design matrix (see Definition 3 of Least Squares Method for Multiple Regression), let P = [pi] be the n × 1 column vector of predicted values of success and let V = [vi] be the n × n diagonal matrix where vi = pi (1 − pi). Then if B0 is an initial guess of B and for all m we define the iteration

Bm+1 = Bm + (X^T Vm X)^−1 X^T (Y − Pm)

then for m sufficiently large B ≈ Bm, and so Bm is a reasonable estimate of the coefficient vector.

Observation: If we group the data as we did in Example 1 of Basic Concepts of Logistic Regression (i.e.
summary data), then Property 2 still holds, where Y = [yi] is the r × 1 column vector of summarized observed outcomes of the dependent variable, X is the corresponding r × (k+1) design matrix, P = [pi] is the r × 1 column vector of predicted values of success and V = [vi] is the r × r diagonal matrix where vi = ni pi (1 − pi).

Example 1 (using Newton's Method): We now return to the problem of finding the coefficients a and b for Example 1 of Basic Concepts of Logistic Regression using Newton's method. We apply Newton's method to find the coefficients as described in Figure 1. The method converges in only 4 iterations with the values a = 4.47665 and b = -0.0072. The regression equation is therefore logit(p) = 4.47665 − 0.0072x.

Example 2: A study was made as to whether environmental temperature or immersion in water of the hatching egg had an effect on the gender of a particular type of small reptile. The table in Figure 2 shows the temperature (in degrees Celsius) and immersion in water (0 = no and 1 = yes) of the 49 eggs which resulted in a live birth, as well as the sex of the reptile that hatched. Determine the odds that a female will be born if the temperature is 23 degrees with the egg immersed in water vs. not immersed in water.

We use the Logistic Regression supplemental data analysis tool, selecting the Raw data and Newton Method options as shown in Figure 3. After pressing the OK button we obtain the output displayed in Figure 4. Here we only show the first 19 elements in the sample, although the full sample is contained in range A4:C52. Note that in the raw data option the Input Range (range A4:C52) consists of one column for each independent variable (Temp and Water for this example) and a final column only containing the values 0 or 1, where 1 indicates "success" (Male in this case) and 0 indicates "failure" (Female in this case).
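The iteration in Property 2 can be sketched in pure Python for the one-predictor case on raw 0/1 data; the 2×2 system is solved by hand, so no matrix library is needed (this is my own illustration, not the Real Statistics implementation):

```python
import math

def fit_logistic(xs, ys, iterations=25):
    """Newton's method for logit(p) = b0 + b1*x on raw 0/1 data.

    Implements B <- B + (X^T V X)^-1 X^T (Y - P) with design rows (1, x_i)
    and V diagonal with entries p_i (1 - p_i).
    """
    b0 = b1 = 0.0
    for _ in range(iterations):
        ps = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]
        v = [p * (1 - p) for p in ps]
        # entries of X^T V X
        s00 = sum(v)
        s01 = sum(vi * x for vi, x in zip(v, xs))
        s11 = sum(vi * x * x for vi, x in zip(v, xs))
        # entries of the gradient X^T (Y - P)
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
        # apply the inverse of the 2x2 matrix X^T V X to the gradient
        det = s00 * s11 - s01 * s01
        b0 += (s11 * g0 - s01 * g1) / det
        b1 += (-s01 * g0 + s00 * g1) / det
    return b0, b1
```

At the maximum, the score equations of Property 1 (X^T(Y − P) = 0) hold, which gives a simple convergence check.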
Please don’t read any gender discrimination into these choices: we would get the same result if we chose Female to be success and Male to be failure. The model indicates that to predict the probability that a reptile will be male you can use the following formula: We can now obtain the desired results as shown in Figure 5 by copying any formula for p -Pred from Figure 4 and making a minor modification. Here we copied the formula from cell K6 into cells G29 and G30. The formula that now appears in cell G29 will be =1/(1+EXP(-$R$7MMULT(A29:B29,$R$8:$R$9))). You just need to change the part A29:B29 to E29:F29 (where the values of Temp and Water actually appear). The resulting formula 1/(1+EXP(-$R$7-MMULT(E29:F29,$R$8:$R$9))) will give the result shown in Figure 5. COMPARING LOGISTIC REGRESSION MODELS • Example 1: Repeat the study from Example 3 of Finding Logistic Regression Coefficients using Newton’s Method based on the summary data shown in Figure 1. • Using the Logistic Regression supplemental data analysis tool, selecting the Newton Method option, we obtain the output displayed in Figure 2. • Example 2: Do the Temp and Water variables make a significant difference in the model of Example 1? We first create summary tables for the Temp-only and Water-only models and then use the Logistic Regression data analysis tool (with Newton option) to build the two models. Also see below for a simpler approach for creating the Temp-only summary table. The summary table for the Temp model is shown in range B28:D34 of Figure 3 The values of the C and D columns can be calculated from the summary table of the base model (as shown in Figure 2) using SUMIF. For example, the number of samples where Temp = 20 and the reptile was born Male (cell C29) is given by the formula =SUMIF($A$4:$A$15,$B29,C$4:C$15) By filling right (Ctrl-R) and down (Ctrl-D), you can copy this formula into the other cells in the range C29:D34. 
You now use the Logistic Regression tool to obtain the output shown in Figure 3. We observe that the Temp variable makes a significant contribution (cell U35) over the constant-only model. Here we are comparing LL1 (for the Temp model) with LL0 (for the constant-only model).

We can also compare the Temp model with the base model (Temp + Water) by copying the range T28:U35 to another location in the worksheet, using the LL1 value from the base model and substituting the LL1 value from the Temp model for LL0. Also we need to change df to 1, since the difference between the number of coefficients of the two models is 2 − 1 = 1. This is shown in Figure 4. We see that there is not a significant difference between the models (cell X44). This confirms the conclusion we reached previously that the Water variable is not making a significant contribution, and in fact it can be dropped.

We create the Water-only model in a similar way to obtain the output shown in Figure 5. This time we see that there is no significant difference between the Water model and the constant model. If we repeat the analysis of Figure 4, we would see that there is a significant difference between the Water model and the base model.

Finally, we can look at further refinements of the model, such as the full interaction model, where we include the interaction between Temp and Water. We show this analysis in Figure 6. If we compare this model with the base model using the approach described above (as in Figure 4), we get the output shown in Figure 7. This shows that there is a significant difference between the full interaction model and the base model, with the interaction model providing a better fit.

Observation: As mentioned above, there is a simpler way to create the Temp-only and Water-only summary data tables.
To create the Temp-only table, press Ctrl-m, select the Logistic Regression data analysis tool and then enter the following information into the dialog box that appears. Here we have entered the Water independent variable into the List of variables to exclude field. This produces the output in Figure 3.

Observation: The List of variables to exclude field can be used whenever the Input Format is set to Summary data and the Headings included with data field is checked, in order to create a reduced model. The variables to exclude are entered into this field separated by commas. E.g. if we have a summary data table with Nationality, Age, Education, Gender and Occupation as independent variables and want to create a reduced model with only Nationality, Education and Occupation, we would simply enter Age, Gender into the List of variables to exclude field.

HOSMER-LEMESHOW TEST

• The Hosmer-Lemeshow test is used to determine the goodness of fit of the logistic regression model. Essentially it is a chi-square goodness of fit test (as described in Goodness of Fit) for grouped data, usually where the data is divided into 10 equal subgroups. The version of the test we present here uses the groupings that we have used elsewhere rather than 10 equal subgroups.

• Since this is a chi-square goodness of fit test, we need to calculate the HL statistic

HL = Σ (Obs − Exp)² / Exp

where the sum is taken over both the success and failure cells of each of the g groups. The test used is chi-square with g − 2 degrees of freedom. A significant test indicates that the model is not a good fit, while a non-significant test indicates a good fit.

• Example 1: Use the Hosmer-Lemeshow test to determine whether the logistic regression model is a good fit for the data in Example 1 of Comparing Logistic Regression Models. In our example the sum is taken over the 12 Male groups and the 12 Female groups. The observed values are given in columns H and I (duplicates of the input data columns C and D), while the expected values are given in columns L and M. E.g.
cell L4 contains the formula =K4*J4 and cell M4 contains the formula =J4-L4, or equivalently =(1-K4)*J4. The HL statistic is calculated in cell N16 via the formula =SUM(N4:N15), e.g. cell N4 contains the formula =(H4-L4)^2/L4+(I4-M4)^2/M4. The Hosmer-Lemeshow test results are shown in range Q12:Q16. The HL statistic is 24.40567 (as calculated in cell N16), df = g − 2 = 12 − 2 = 10 and p-value = CHIDIST(24.40567, 10) = .006593 < .05 = α, and so the test is significant, which indicates that the model is not a good fit.

Observation: The Hosmer-Lemeshow test needs to be used with caution. It tends to be highly dependent on the groupings chosen, i.e. one selection of groups can give a negative result while another gives a positive result. Also, when there are too few groups (5 or fewer) the test will usually show a model fit. As a chi-square goodness of fit test, the expected values used should generally be at least 5. In Example 1 the cells L9, L15, M4 and M10 all have values less than 5, with cells M4 and M10 especially troubling with values less than 1.

We now address the problems of cells M4 and M10. We can eliminate the first of these by combining the first two rows, as shown in Figure 2. Here p-Pred for the first row (cell K23) is calculated as a weighted average of the first two values from Figure 1 using the formula =(J4*K4+J5*K5)/(J4+J5). In a similar manner we combine the 7th and 8th rows from Figure 1. The revised version shows a non-significant result, indicating that the model is a good fit.

Observation: The Real Statistics Logistic Regression data analysis tool automatically performs the Hosmer-Lemeshow test. For Example 1 of Finding Logistic Regression Coefficients using Solver, we can see from Figure 5 of Finding Logistic Regression Coefficients using Solver that the logistic regression model is a good fit.
For Example 1, Figure 2 of Comparing Logistic Regression Models shows that the model is not a good fit, at least until we combine rows as we did above.
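The HL statistic described above can be sketched as follows (the function name is mine; the chi-square p-value would still be obtained separately, e.g. with CHIDIST):

```python
def hosmer_lemeshow(obs_success, obs_failure, p_pred):
    """HL = sum of (Obs - Exp)^2 / Exp over the success and failure
    cells of each group; expected counts come from the model's p-Pred."""
    hl = 0.0
    for o1, o0, p in zip(obs_success, obs_failure, p_pred):
        n = o1 + o0                      # group size
        e1, e0 = n * p, n * (1 - p)      # expected successes / failures
        hl += (o1 - e1) ** 2 / e1 + (o0 - e0) ** 2 / e0
    return hl
```

A perfect fit gives HL = 0; larger values mean a worse fit, judged against the chi-square distribution with g − 2 degrees of freedom.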