Exact Logistic Regression

Larry Cook
Outline
• Review the logistic regression model
• Explore an example where model
assumptions fail
– Brief algebraic interlude
• Explore an example with a different issue
where logistic regression fails
• Computational considerations
• Example SAS code
Logistic Regression
• Model a binary outcome, Y, with one or
more predictors
– Success/failure
– Disease/not disease
• Model outcome in terms of the log odds
of a success
• log(odds of Yi = 1) = a + b*xi
Why Log Odds?
• Canonical link function
• Makes a binary outcome continuous
• Solves this problem
– Probability is constrained to [0,1]
– Odds are constrained to [0, ∞)
• Log odds are in (-∞, ∞)
• Exponentiating coefficients gives us
estimates of odds ratios
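A minimal Python sketch of the three scales listed above. The function names `odds`, `logit`, and `inv_logit` are illustrative and do not come from the slides.

```python
import math

def odds(p):
    """Convert a probability in [0, 1) to odds in [0, inf)."""
    return p / (1.0 - p)

def logit(p):
    """Log odds: maps a probability in (0, 1) onto the whole real line."""
    return math.log(odds(p))

def inv_logit(z):
    """Inverse of logit: maps any real number back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Probabilities near 0 or 1 become large negative or positive log odds.
for p in (0.01, 0.5, 0.99):
    print(p, odds(p), logit(p))
```

Because the log odds are unbounded, a linear predictor a + b*x can take any value without producing an impossible probability.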
Example: Motor Vehicle
Crash Fatalities
• What are the odds of being hospitalized or
killed in a motor vehicle crash for drivers
using safety restraints vs. those who are
not?
– Outcome: Hospitalized/killed or not
– Covariate: safety belt use
Hospital/Killed * Restraint Use
OR = 0.22, p-value < 0.001
Example: Motor Vehicle
Crash Fatalities
• What are the odds of being hospitalized or
killed in a motor vehicle crash for drivers
using safety restraints vs. those who are
not?
– Outcome: Hospitalized/killed or not
– Covariates: safety belt use,
gender, age, alcohol, rural area
Logistic Regression Output
Parameter        Estimate   Odds Ratio   P-value
Intercept        -0.261                  < 0.001
Male             -0.576     0.56         < 0.001
Restraint Use    -1.430     0.24         < 0.001
Alcohol           1.065     2.90         < 0.001
Night             0.194     1.21           0.011
Rural             0.135     1.14         < 0.001
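As a quick check of how the two columns relate, the sketch below exponentiates the estimates from the output table. The dictionary layout is just an illustrative way to hold the numbers.

```python
import math

# Estimates copied from the output table (intercept omitted).
estimates = {
    "Male": -0.576,
    "Restraint Use": -1.430,
    "Alcohol": 1.065,
    "Night": 0.194,
    "Rural": 0.135,
}

# The odds ratio for a one-unit change in a predictor is exp(coefficient).
odds_ratios = {name: round(math.exp(b), 2) for name, b in estimates.items()}

for name, value in odds_ratios.items():
    print(f"{name:15s} OR = {value:.2f}")
```

A negative estimate gives an odds ratio below 1 (restraint use lowers the odds), and a positive estimate gives an odds ratio above 1.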
Assumptions
• Conditional probabilities follow a logistic
function of the independent variables
• Observations are independent
• Asymptotics
– Sample size is large enough
– Minimum of 50 to 100 observations
– 10 successes/failures per variable
Corneal Graft Rejections
• What if studying a rare disease?
• Data for eight kids in young age group
and eight in the older age group
• Hypothesis is that rejection is more likely
in older children
Graft Rejections
                       Young (< 4 y.o.)   Older (> 4 y.o.)   Total
                       (X = 0)            (X = 1)
No Rejection (Y = 0)   7                  2                  9
Rejection (Y = 1)      1                  6                  7
Total                  8                  8                  16
OR = 21, p-value = 0.012
100% of cells have expected counts < 5!
Fisher’s Exact Test p-value (2-sided) = 0.0406; (1-sided) = 0.0203
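The summary numbers for the graft table can be reproduced with the standard library alone. This is a sketch under two assumptions: the quoted p-value of 0.012 is an uncorrected Pearson chi-square test, and the two-sided Fisher p-value sums every table that is no more probable than the observed one. Variable names are illustrative.

```python
import math

# Observed 2x2 table: rows are no rejection / rejection, columns young / older.
no_rej_young, no_rej_old = 7, 2
rej_young, rej_old = 1, 6

n_young = no_rej_young + rej_young      # 8
n_old = no_rej_old + rej_old            # 8
n_rej = rej_young + rej_old             # 7
n = n_young + n_old                     # 16

# Sample odds ratio of rejection, older vs. young.
odds_ratio = (rej_old * no_rej_young) / (rej_young * no_rej_old)

# Expected counts under independence: row total * column total / n.
row_totals = (n - n_rej, n_rej)
col_totals = (n_young, n_old)
expected = [[r * c / n for c in col_totals] for r in row_totals]
observed = [[no_rej_young, no_rej_old], [rej_young, rej_old]]

# Pearson chi-square without continuity correction; with 1 degree of
# freedom the upper tail probability is erfc(sqrt(x / 2)).
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))
p_chi2 = math.erfc(math.sqrt(chi2 / 2))

def hypergeom(k):
    """P[k rejections among the older children | all margins fixed]."""
    return math.comb(n_old, k) * math.comb(n_young, n_rej - k) / math.comb(n, n_rej)

# One-sided: tail at or beyond the observed count.
# Two-sided: every table no more probable than the observed one.
p_obs = hypergeom(rej_old)
p_one_sided = sum(hypergeom(k) for k in range(rej_old, n_rej + 1))
p_two_sided = sum(hypergeom(k) for k in range(0, n_rej + 1)
                  if hypergeom(k) <= p_obs * (1 + 1e-9))

print(odds_ratio, round(p_chi2, 3), round(p_one_sided, 4), round(p_two_sided, 4))
```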
Let’s Tackle the Graft
Rejection Example as
Logistic Regression
Graft Rejections
               Young (< 4 y.o.)   Older (> 4 y.o.)   Total
No Rejection   7                  2                  9
Rejection      1                  6                  7
Total          8                  8                  16
Sample size << 50!
Don’t have 10 successes or 10 failures!
Exact (Conditional)
Logistic Regression
• Rather than using the unconditional
logistic regression, we will condition on
nuisance parameters
• Use conditional maximum likelihood for
estimation and inference
Warning Algebra Ahead
Proceed with Caution
Logistic Model
Likelihood of a Sample
Sufficient Statistics
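For reference, the textbook forms of the three quantities named in these slide titles, written in the deck's notation (intercept a, slope b, a single covariate x_i). This is the standard derivation, assumed here rather than taken from the slides.

```latex
% Logistic model for a binary outcome y_i with one covariate x_i
P(Y_i = 1 \mid x_i) = \frac{\exp(a + b x_i)}{1 + \exp(a + b x_i)}

% Likelihood of an independent sample y_1, \dots, y_n
L(a, b) = \prod_{i=1}^{n} \frac{\exp\{y_i (a + b x_i)\}}{1 + \exp(a + b x_i)}
        = \frac{\exp(a\, t_0 + b\, t_1)}{\prod_{i=1}^{n} \{1 + \exp(a + b x_i)\}}

% Sufficient statistics: the outcomes enter the likelihood only through
t_0 = \sum_{i=1}^{n} y_i, \qquad t_1 = \sum_{i=1}^{n} x_i y_i
```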
Conditioning
• If we are only trying to describe the
relationship between rejection and age, do
we care about the value of the intercept?
• Remove the intercept, a, from the
likelihood by conditioning on its sufficient
statistic, t0 = Σyi
• Let S(t0) = set of all tables with Σyi = t0 and
the observed sample sizes
Conditional Likelihood
Estimation
Inference
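A sketch of the standard results behind these three slide titles, in the deck's notation, where t0 = Σyi, t1 = Σxiyi, and c(t) is the number of tables in S(t0) with Σxiyi = t. The intercept a cancels from the ratio. The one-sided p-value shown is one common choice of exact test and is an assumption about what the slides presented.

```latex
% Conditional likelihood: distribution of T_1 given T_0 = t_0
P(T_1 = t_1 \mid T_0 = t_0) = \frac{c(t_1)\, e^{b t_1}}{\sum_{u} c(u)\, e^{b u}}

% Estimation: the conditional maximum likelihood estimate (CMLE)
\hat{b} = \arg\max_{b} \frac{c(t_1)\, e^{b t_1}}{\sum_{u} c(u)\, e^{b u}}

% Inference: exact one-sided p-value for H_0 : b = 0
p = \sum_{u \ge t_1} \frac{c(u)}{\sum_{v} c(v)}
```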
End of Algebra
Back to Example
Graft Rejections
                       Young (< 4 y.o.)   Older (> 4 y.o.)   Total
                       (X = 0)            (X = 1)
No Rejection (Y = 0)   7                  2                  9
Rejection (Y = 1)      1                  6                  7
Total                  8                  8                  16
Sufficient Statistics
t0 = Σyi = # of rejections = 7
t1 = Σxiyi = 0*(# of rejections in young) + 1*(# of rejections in old)
= 0*1 + 1*6 = 6
Conditional Distribution
for Graft Rejection
• Need to calculate all possible tables that
have exactly 7 rejections
• Calculate how often each of the tables
occurs
• Calculate CMLE
• Calculate how rare our table is to obtain
the p-value
Reference Set
Yng_NR   Yng_R   Old_NR   Old_R   t0   t1   Count    P[Table]
1        7       8        0       7    0    8        0.0007
2        6       7        1       7    1    224      0.0196
3        5       6        2       7    2    1,568    0.1371
4        4       5        3       7    3    3,920    0.3427
5        3       4        4       7    4    3,920    0.3427
6        2       3        5       7    5    1,568    0.1371
7        1       2        6       7    6    224      0.0196
8        0       1        7       7    7    8        0.0007
Total                                       11,440   1.000
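The reference set is small enough to enumerate directly. The sketch below rebuilds it from the group sizes (8 and 8) and t0 = 7; the tuple layout mirrors the table columns and is otherwise an illustrative choice.

```python
import math

n_young, n_old = 8, 8   # observed group sizes
t0 = 7                  # total number of rejections (conditioned on)

rows = []
for old_r in range(0, min(n_old, t0) + 1):
    yng_r = t0 - old_r
    if yng_r > n_young:
        continue  # not enough young children for this split
    # Number of ways to choose which children reject in each group.
    count = math.comb(n_young, yng_r) * math.comb(n_old, old_r)
    # Columns: Yng_NR, Yng_R, Old_NR, Old_R, t0, t1, Count (t1 equals Old_R).
    rows.append((n_young - yng_r, yng_r, n_old - old_r, old_r, t0, old_r, count))

total = sum(r[-1] for r in rows)
for r in rows:
    print(*r, f"{r[-1] / total:.4f}")
print("total", total)
```

Each count is a product of two binomial coefficients, and the total equals the number of ways to choose 7 rejections among 16 children.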
Estimate b and Find a p-value
t1   Count   P[Table]
0    8       0.0007
1    224     0.0196
2    1,568   0.1371
3    3,920   0.3427
4    3,920   0.3427
5    1,568   0.1371
6    224     0.0196
7    8       0.0007
Estimate and p-value
t1   Count   P[Table]
0    8       0.0007
1    224     0.0196
2    1,568   0.1371
3    3,920   0.3427
4    3,920   0.3427
5    1,568   0.1371
6    224     0.0196
7    8       0.0007
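A sketch of how the estimate and p-values can be computed from the counts in this table, with the observed t1 = 6. It assumes the standard conditional likelihood c(t) e^(bt) / Σ c(u) e^(bu), whose maximum satisfies E[T1 | b] = t1. The bisection solver and function names are illustrative choices, not the method used by any particular package.

```python
import math

# Counts c(t1) of tables in the reference set, from the table.
counts = {0: 8, 1: 224, 2: 1568, 3: 3920, 4: 3920, 5: 1568, 6: 224, 7: 8}
t_obs = 6

def cond_prob(t, b):
    """P[T1 = t | t0] at slope b: c(t) e^(b t) / sum_u c(u) e^(b u)."""
    denom = sum(c * math.exp(b * u) for u, c in counts.items())
    return counts[t] * math.exp(b * t) / denom

def cond_mean(b):
    """E[T1 | t0] at slope b."""
    return sum(u * cond_prob(u, b) for u in counts)

def cmle(t, lo=-20.0, hi=20.0, tol=1e-10):
    """Solve E[T1 | t0; b] = t by bisection (the mean increases with b)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cond_mean(mid) < t:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

b_hat = cmle(t_obs)

# Exact p-values under b = 0, where P[Table] = c(t) / total.
p_one_sided = sum(cond_prob(u, 0.0) for u in counts if u >= t_obs)
p_point = cond_prob(t_obs, 0.0)
p_two_sided = sum(cond_prob(u, 0.0) for u in counts
                  if cond_prob(u, 0.0) <= p_point * (1 + 1e-9))

print(b_hat, math.exp(b_hat), p_one_sided, p_two_sided)
```

The p-values agree with the Fisher's exact test results quoted earlier (0.0203 one-sided, 0.0406 two-sided), and the conditional estimate of b is smaller than the log of the sample odds ratio, log(21).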
Confidence Interval
• Lower bound, b-
– If t1 = t1,min: b- = -∞
– Otherwise: b- is the value of b that produces an
upper p-value of α/2
• Upper bound, b+
– If t1 = t1,max: b+ = ∞
– Otherwise: b+ is the value of b that produces a lower
p-value of α/2
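The two rules above can be sketched as a pair of root-finding problems. Here α is the significance level (0.05 for a 95% interval), the "upper p-value" is P[T1 ≥ t1 | b], and the "lower p-value" is P[T1 ≤ t1 | b]. The helper names `tail_probs`, `solve`, and `exact_ci` and the bisection bracket of ±30 are assumptions made for illustration.

```python
import math

def tail_probs(counts, t_obs, b):
    """Return (P[T <= t_obs], P[T >= t_obs]) under slope b."""
    # Shift exponents by t_obs so the weights stay in floating-point range.
    w = {t: c * math.exp(b * (t - t_obs)) for t, c in counts.items()}
    total = sum(w.values())
    lower = sum(v for t, v in w.items() if t <= t_obs) / total
    upper = sum(v for t, v in w.items() if t >= t_obs) / total
    return lower, upper

def solve(f, lo=-30.0, hi=30.0, tol=1e-10):
    """Bisection for a root of f on [lo, hi], assuming f(lo) < 0 < f(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_ci(counts, t_obs, alpha=0.05):
    """Exact interval for b from the two tail-probability equations."""
    if t_obs == min(counts):
        b_lo = -math.inf
    else:
        # The upper tail increases with b; find where it equals alpha/2.
        b_lo = solve(lambda b: tail_probs(counts, t_obs, b)[1] - alpha / 2)
    if t_obs == max(counts):
        b_hi = math.inf
    else:
        # The lower tail decreases with b; find where it equals alpha/2.
        b_hi = solve(lambda b: alpha / 2 - tail_probs(counts, t_obs, b)[0])
    return b_lo, b_hi

# Graft rejection reference set, observed t1 = 6.
graft = {0: 8, 1: 224, 2: 1568, 3: 3920, 4: 3920, 5: 1568, 6: 224, 7: 8}
print(exact_ci(graft, 6))
```

When the observed statistic sits at the edge of its range, one tail probability equals 1 for every b, so the corresponding bound is infinite.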
Final Stats for Graft Rejection
Example 2
PECARN C-Spine Study
Case Control Study
              Control   Case   Total
Not Present   1,057     540    1,597
Present       2         0      2
Total         1,059     540    1,599
Any problems estimating the odds ratio?
Could exact logistic regression help?
What sufficient statistics
are needed?
                  Not Present   Present   Total
                  (Y = 0)       (Y = 1)
Control (X = 0)   1,057         2         1,059
Case (X = 1)      540           0         540
Total             1,597         2         1,599
• Σy = 2
• Σxy = 0
Conditional Density
Case P   Case NP   Ctrl P   Ctrl NP   t0   t1   Count       P[Table]
0        540       2        1,057     2    0    560,211     0.438
1        539       1        1,058     2    1    571,860     0.448
2        538       0        1,059     2    2    145,530     0.114
Total                                           1,277,601   1.000
One-sided p-value = 0.438
Two-sided p-value = 2*0.438 = 0.876
95% confidence interval (-∞, 2.345)
Point estimate?
Median Unbiased Estimate
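A sketch of a median unbiased estimate for this example. It assumes one common convention that is not spelled out in the slides: when the observed t1 is at its minimum, the estimate is the value of b at which P[T1 ≤ t1 | b] = 0.5. The counts come from the conditional density table; the names `lower_tail` and `b_mue` are illustrative.

```python
import math

# Counts c(t1) for the PECARN reference set, from the conditional density table.
counts = {0: 560211, 1: 571860, 2: 145530}
t_obs = 0

def lower_tail(b):
    """P[T1 <= t_obs | t0] at slope b."""
    w = {t: c * math.exp(b * t) for t, c in counts.items()}
    return sum(v for t, v in w.items() if t <= t_obs) / sum(w.values())

# The sample odds ratio is 0 because one cell is empty, so the ordinary
# maximum likelihood estimate of b is -infinity.
sample_or = (0 * 1057) / (2 * 540)

# Bisection: the lower tail decreases from 1 to 0 as b increases.
lo, hi = -30.0, 30.0
while hi - lo > 1e-10:
    mid = (lo + hi) / 2
    if lower_tail(mid) > 0.5:
        lo = mid
    else:
        hi = mid
b_mue = (lo + hi) / 2

print(sample_or, b_mue, math.exp(b_mue))
```

Unlike the maximum likelihood estimate, this estimate is finite even though no cases had the factor present.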
One More Example
Dose Response
Toxicology Experiment
• 400 mice randomized to one of four levels of a drug
• Drug administered to each animal
• Outcome is the number of deaths in each dose
level
Dose    0     1     2     3     Total
Lived   99    97    95    90    381
Died    1     3     5     10    19
Total   100   100   100   100   400
Σy = 19
Σxy = 3 + 10 + 30 = 43
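For grouped data, the two sufficient statistics reduce to sums over the groups. A minimal sketch; the function name `sufficient_stats` is illustrative.

```python
def sufficient_stats(x_values, events):
    """Return (t0, t1) = (sum of y, sum of x*y) for grouped binary data.

    x_values[j] is the covariate value shared by group j, and events[j]
    is the number of subjects in that group with y = 1.
    """
    t0 = sum(events)
    t1 = sum(x * d for x, d in zip(x_values, events))
    return t0, t1

# Toxicology experiment: dose levels 0 to 3 and deaths per dose level.
print(sufficient_stats([0, 1, 2, 3], [1, 3, 5, 10]))   # (19, 43)

# Graft rejections: young (x = 0) and older (x = 1).
print(sufficient_stats([0, 1], [1, 6]))                # (7, 6)
```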
Exact vs. Unconditional
           Exact          Unconditional
Estimate   0.710          0.712
SE         0.246          0.246
OR         2.03           2.04
CI         (1.26, 3.52)   (1.26, 3.30)
p-value    0.002          0.004
Computational Issues
Counting All the Tables
• One of the main hurdles for conditional
logistic regression is counting all the tables
in the sample space
– Graft rejections – 11,440 possibilities
– PECARN C-Spine – 1,277,601
– Toxicology – 2.79 x 10^33
• Obviously don’t want to generate tables
one at a time
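The first two sizes quoted above can be checked as binomial coefficients: the number of ways to place t0 events among n subjects. A short sketch using only the standard library.

```python
import math

# Graft rejections: choose which 7 of the 16 children reject the graft.
graft_tables = math.comb(16, 7)

# PECARN C-Spine: choose which 2 of the 1,599 subjects make up t0 = 2.
pecarn_tables = math.comb(1599, 2)

print(graft_tables, pecarn_tables)
```

The count grows combinatorially with the sample size and the number of events, which is why listing tables individually stops being practical.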
Network Algorithm
• Graphical representation of the sample
space
• Nodes represent a partial sum of the
sufficient statistic
• Arcs have combinatorial weighting value
• One path through the graph represents a
table in the sample space
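A simplified sketch of the idea in the bullets above, not the published network algorithm itself. It processes one covariate group per stage, labels each node with the partial sums of y and of x*y, and weights each arc by a binomial coefficient, so that table counts are accumulated without listing tables one at a time. The function name and the node labeling are assumptions made for illustration.

```python
import math
from collections import defaultdict

def conditional_counts(x_values, group_sizes, t0):
    """Count tables by t1 = sum(x*y) among all tables with sum(y) = t0.

    A node is a pair (partial sum of y, partial sum of x*y). The arc for
    placing d events in a group of size n carries the weight C(n, d).
    """
    nodes = {(0, 0): 1}
    remaining = sum(group_sizes)
    for x, n in zip(x_values, group_sizes):
        remaining -= n  # subjects in the groups still to be processed
        nxt = defaultdict(int)
        for (s0, s1), ways in nodes.items():
            for d in range(0, min(n, t0 - s0) + 1):
                if s0 + d + remaining < t0:
                    continue  # later groups cannot supply enough events
                nxt[(s0 + d, s1 + x * d)] += ways * math.comb(n, d)
        nodes = nxt
    # After the last stage every surviving node has partial sum of y = t0.
    return {s1: ways for (s0, s1), ways in sorted(nodes.items())}

# Four groups of size 3 at x = 1, 2, 3, 4 with 4 events in total.
example = conditional_counts([1, 2, 3, 4], [3, 3, 3, 3], 4)
print(example)

# Graft rejections: two groups of 8 at x = 0 and x = 1 with 7 events.
print(conditional_counts([0, 1], [8, 8], 7))
```

The work grows with the number of distinct nodes rather than the number of tables, because many tables share the same partial sums.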
Example
        X=1   X=2   X=3   X=4   Total
Y=0     3     2     2     1     8
Y=1     0     1     1     2     4
Total   3     3     3     3     12
Sufficient Statistics
t0 = Σyi = 4
t1 = Σxiyi = 1*0 + 2*1 + 3*1 + 4*2 = 13
[Figure: network with nodes (0,0); (1,0), (1,1), (1,2), (1,3); (2,0), (2,1), (2,2), (2,3), (2,4); (3,1), (3,2), (3,3), (3,4); (4,4)]
[Figure: the same network, shown with the table below]
        X=1   X=2   X=3   X=4   Total
Y=0     1     3     1     3     8
Y=1     2     0     2     0     4
[Figure: the same network, shown with the table below]
        X=1   X=2   X=3   X=4   Total
Y=0     3     2     2     1     8
Y=1     0     1     1     2     4
Network Representation
of the Sample Space
[Figure: the full network of nodes listed above]
What About
Multiple Covariates?
More Conditioning!
Osteogenic Sarcoma
LogXact Manual
• 46 patients surgically treated for
osteogenic sarcoma and then observed
for disease recurrence within 3 years
• Covariates
– Sex: Male = 1, Female = 0
– Any Osteoid Pathology (AOP)
• Present = 1, not = 0
• Interested in the effect of AOP
Osteogenic Sarcoma
Covariate   No Recurrence   Recurrence   Group Size   Sex    AOP
Group       (y = 0)         (y = 1)      (ni)         (x1)   (x2)
1           8               0            8            0      0
2           5               2            7            0      1
3           9               4            13           1      0
4           7               11           18           1      1
Total       29              17           46
Estimating the Effect of AOP
• Statistics to condition on
– Group sizes
– Sufficient statistic for intercept, Σy = 17
– Sufficient statistic for coefficient for sex, Σx1y = 15
• Calculate the conditional distribution of Σx2y
– Sufficient statistic for coefficient for AOP
– Number of patients with AOP in recurrence (= 13)
– Given exactly 17 with recurrence,
15 of whom are males
Network Algorithm
• The network algorithm uses two passes
– First pass conditions on the intercept
• All tables with exactly 17 cases in recurrence
– Second pass removes arcs that don’t
produce the sufficient statistic for sex
• Removes all tables that don’t have 15 males in recurrence
• Proceed with estimation & inference as
before
P[Σx2y = t2 | 17 in recurrence
and 15 males]
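A sketch of this conditional distribution for the osteogenic sarcoma data. It is a simplification of the two-pass description: it tracks all three partial sums in a single pass and filters on the intercept and sex statistics at the end. Group sizes and covariate values come from the data table; the variable names are illustrative.

```python
import math
from collections import defaultdict

# Covariate groups from the data table: (sex, AOP, group size).
groups = [(0, 0, 8), (0, 1, 7), (1, 0, 13), (1, 1, 18)]
t0, t_sex = 17, 15          # conditioned on: recurrences, male recurrences
t_aop_obs = 13              # observed recurrences with AOP

# Stage-by-stage count of tables, keyed by the partial sums
# (sum y, sum sex*y, sum AOP*y); arcs are weighted by C(n, d).
nodes = {(0, 0, 0): 1}
for sex, aop, n in groups:
    nxt = defaultdict(int)
    for (s0, s1, s2), ways in nodes.items():
        for d in range(0, min(n, t0 - s0) + 1):
            nxt[(s0 + d, s1 + sex * d, s2 + aop * d)] += ways * math.comb(n, d)
    nodes = nxt

# Keep only tables with exactly 17 recurrences, 15 of them male.
dist = {s2: ways for (s0, s1, s2), ways in nodes.items()
        if s0 == t0 and s1 == t_sex}
total = sum(dist.values())
probs = {t2: ways / total for t2, ways in sorted(dist.items())}

# Upper tail at b2 = 0: how rare is it to see 13 or more AOP recurrences?
p_upper = sum(p for t2, p in probs.items() if t2 >= t_aop_obs)

print(total, probs[t_aop_obs], p_upper)
```

The resulting distribution plays the same role as the single-covariate reference set, so estimation and inference for the AOP coefficient proceed as before.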
Results
LR Test for Both Variables
• To test that the coefficients for sex and AOP
are both zero simultaneously, we need the joint conditional
density
– All possible combinations of males and
patients with AOP in recurrence given
exactly 17 patients in recurrence
– Determine how rare it is to have 15 recurrent
males AND 13 recurrent AOP patients
SAS Examples
Conclusion
• Exact (conditional) logistic regression
– Useful method when asymptotic assumptions
are not met or with separation
– Utilizes conditioning to remove nuisance
parameters from the likelihood
– Very computationally intensive method
– Network algorithm speeds up calculations
Questions?
