High-dimensional propensity score versus lasso outcome

Comparing high-dimensional propensity score versus lasso variable selection for confounding adjustment in a novel simulation framework
Jessica Franklin
Instructor in Medicine
Division of Pharmacoepidemiology & Pharmacoeconomics
Brigham and Women’s Hospital and Harvard Medical School
QMC, Department of Quantitative Health Sciences
University of Massachusetts Medical School
April 15, 2014
• Administrative healthcare claims data are a popular
data source for nonrandomized studies of
• Because treatments are not randomized, addressing
confounding is the primary methodological challenge.
Claims Data
• Comprehensive claims databases contain
information on patient insurance enrollment and
demographics, as well as every healthcare encounter,
including medications dispensed.
• Dates of encounters provide a complete longitudinal
record of patients’ healthcare interactions.
New user design
• Potential confounders are measured prior to
initiation of exposure.
• Active treatment comparator group reduces biases
associated with non-user comparators.
Principles of variable selection
• Brookhart et al. (2006) showed that the best PS
model is the model that includes all predictors of
outcome (regardless of whether they are associated
with exposure).
• Pearl (2010) and Myers et al. (2011) further noted
that including instrumental variables (IVs) can
increase bias from unmeasured confounding.
• IVs are associated with exposure, but not associated
with outcome except through exposure.
hd-PS variable selection
• The high-dimensional propensity score (hd-PS)
algorithm screens thousands of diagnoses,
medications, and procedure codes and ranks
variables according to likelihood of confounding.
• Relies on the idea that a large number of “proxy”
variables can reduce bias from unmeasured
confounding.
• Empirical evidence has shown a reduction in bias.
Shrinkage methods
• Greenland (2008) suggested regularization methods
as preferable to variable selection.
• Shrinking coefficients allows for efficient estimation,
even in models with many degrees of freedom.
• Lasso regression provides both shrinkage and
principled variable selection.
• Shrinkage allows for direct modeling of the outcome
even with many potential confounders
• Some coefficients are shrunk all the way to 0.
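To make the contrast between the two kinds of shrinkage concrete, the sketch below uses the textbook special case of an orthonormal design, with the penalties scaled so that ridge divides each least-squares coefficient by 1 + lam and lasso soft-thresholds it at lam. The function names, the toy coefficients, and the value of `lam` are illustrative assumptions, not part of the analysis presented here.

```python
def ridge_shrink(beta_ols, lam):
    """Ridge in an orthonormal design: proportional shrinkage, never exactly 0."""
    return [b / (1.0 + lam) for b in beta_ols]


def lasso_shrink(beta_ols, lam):
    """Lasso in an orthonormal design: soft-thresholding, small effects become 0."""
    out = []
    for b in beta_ols:
        magnitude = max(abs(b) - lam, 0.0)
        out.append(magnitude if b >= 0 else -magnitude)
    return out


# Hypothetical least-squares coefficients and penalty
beta_ols = [1.5, -0.8, 0.2, -0.1]
lam = 0.5

ridge = ridge_shrink(beta_ols, lam)  # every coefficient shrinks but stays nonzero
lasso = lasso_shrink(beta_ols, lam)  # the two smallest coefficients are set to 0
print("ridge:", ridge)
print("lasso:", lasso)
```

In this toy case lasso drops the two weakest covariates entirely, which is the sense in which it performs variable selection, while ridge keeps all four.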
• To compare the performance of:
• hd-PS variable selection
• Ridge regression of the outcome on all potential confounders
• Lasso regression of the outcome on all potential confounders
• The goal is maximum reduction in confounding bias.
• How can we answer this question?
• Empirical studies are useful when we “know” the
true treatment effect, but even then we can’t
determine the contributions of bias and variance to
overall error.
• Ordinary simulation techniques with completely
synthetic data cannot capture the complex
correlation structure among covariates in claims data.
Plasmode simulation
• We start with a real empirical cohort study:
49,653 patients
Exposed to either ns-NSAIDs or Cox-2 inhibitors (X)
Followed for gastrointestinal events (Y)
Pre-defined covariates include age, sex, race, and 16
diagnosis/medication/procedure variables (C1)
• To get reasonable values for associations between covariates
and outcome, we estimated a model with:
• Y ~ X + all pre-defined covariates + interactions between age
and binary covariates
$$\operatorname{logit}\{\Pr(Y = 1)\} = \hat{f}(C_1 \mid \hat{b}) + \hat{a} X$$
Simulation setup
• True outcome generation model:
$$\operatorname{logit}\{\Pr(Y = 1)\} = \hat{f}(C_1 \mid \hat{b}) + 0 \cdot X$$
• Estimated coefficient values from the observed outcome model
• Except for the coefficient on exposure: a = 0.
• To create simulated datasets:
• Sample with replacement rows from (X, C)
• Calculate $p_i = \operatorname{expit}\{\hat{f}(C_{1i} \mid \hat{b})\}$ for each patient in the sample.
• Simulate outcome $Y_i^* \sim \text{Bernoulli}(p_i)$
• We created 500 datasets, each of size 30,000, outcome
prevalence set to 5%, exposure prevalence set to 40%.
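The data-generation steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the toy cohort, the covariate names, the intercept, and the coefficients are all hypothetical stand-ins for the real cohort and the fitted outcome model, and fixing exposure prevalence by sampling within exposure group is one plausible way to reach a target such as 40%, not necessarily the exact procedure used.

```python
import math
import random


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear_predictor(row, intercept, coefs):
    """Stand-in for f-hat(C1 | b-hat): intercept plus a weighted sum of covariates."""
    return intercept + sum(coefs[name] * row[name] for name in coefs)


def simulate_dataset(cohort, intercept, coefs, n, exposure_prev, rng):
    """One plasmode dataset: resample real (X, C) rows, then simulate a new outcome."""
    exposed = [r for r in cohort if r["X"] == 1]
    unexposed = [r for r in cohort if r["X"] == 0]
    n_exposed = round(n * exposure_prev)
    # Sampling with replacement within exposure group fixes exposure prevalence
    sample = rng.choices(exposed, k=n_exposed) + rng.choices(unexposed, k=n - n_exposed)
    simulated = []
    for row in sample:
        # Null treatment effect: X does not enter the outcome probability
        p = expit(linear_predictor(row, intercept, coefs))
        y_star = 1 if rng.random() < p else 0
        simulated.append({**row, "Y": y_star})
    return simulated


# Hypothetical toy cohort and coefficients, for illustration only
rng = random.Random(2014)
toy_cohort = [
    {"X": rng.randint(0, 1), "age": rng.randint(65, 90), "prior_ulcer": rng.randint(0, 1)}
    for _ in range(200)
]
toy_coefs = {"age": 0.03, "prior_ulcer": 0.7}

dataset = simulate_dataset(toy_cohort, intercept=-5.0, coefs=toy_coefs,
                           n=1000, exposure_prev=0.4, rng=rng)
print(len(dataset), sum(r["X"] for r in dataset), sum(r["Y"] for r in dataset))
```

Because whole rows are resampled, the joint distribution of exposure and covariates comes from the real data; only the outcome is synthetic.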
True causal diagram
• C1 = true confounders, a subset of C = all measured covariates.
• Any variables associated with exposure remain associated with exposure.
• Any correlations among covariates and true confounders remain intact.
• Associations with outcome are determined by the chosen simulation model.
Outcome generation
Table of true ORs in the outcome generation model, with rows for: Black race, male gender, congestive heart failure, coronary disease, prior bleeding, prior ulcer, recent hospitalization, recent nursing home admission, and gastrointestinal drugs.
The mechanics of hd-PS
• For each diagnosis, procedure, and medication code, hd-PS creates 3 potential variables:
• Code observed ≥ 1 time during baseline period
• Code observed ≥ median number of times
• Code observed ≥ 75th percentile number of times
• There are 2 potential ranking methods:
• Exposure-based: A simple RR association measure
between exposure and each variable.
• Bias-based: Bross’s bias formula, which considers the
association of each variable with exposure and
outcome.
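The two steps above, building three indicators per code and then ranking them, can be sketched as follows. Several details are assumptions made for illustration: the median and 75th percentile are computed among patients with at least one occurrence of the code, the exposure-based score is taken as the absolute log ratio of covariate prevalence in exposed versus unexposed patients, and the candidate names and numbers are hypothetical. The bias-based score uses Bross's bias multiplier, (P_C1 (RR_CD - 1) + 1) / (P_C0 (RR_CD - 1) + 1).

```python
import math
import statistics


def code_indicators(counts):
    """Turn per-patient counts of one code into three binary indicators.

    Assumption: thresholds are computed among patients with the code >= 1 time.
    """
    positive = [c for c in counts if c >= 1]
    median = statistics.median(positive)
    q75 = statistics.quantiles(positive, n=4, method="inclusive")[2]
    return {
        "once": [int(c >= 1) for c in counts],
        "sporadic": [int(c >= median) for c in counts],
        "frequent": [int(c >= q75) for c in counts],
    }


def exposure_score(p_c1, p_c0):
    """Exposure-based ranking: |log RR| between exposure and the covariate."""
    return abs(math.log(p_c1 / p_c0))


def bross_score(p_c1, p_c0, rr_cd):
    """Bias-based ranking: |log| of Bross's bias multiplier."""
    multiplier = (p_c1 * (rr_cd - 1.0) + 1.0) / (p_c0 * (rr_cd - 1.0) + 1.0)
    return abs(math.log(multiplier))


indicators = code_indicators([0, 0, 1, 1, 2, 3, 5, 8])

# Hypothetical candidates:
# (prevalence in exposed, prevalence in unexposed, covariate-outcome RR)
candidates = {
    "code_A": (0.30, 0.10, 2.0),  # associated with exposure and outcome
    "code_B": (0.50, 0.10, 1.0),  # associated with exposure only (IV-like)
    "code_C": (0.20, 0.18, 3.0),  # mostly associated with outcome
}

by_exposure = sorted(candidates, key=lambda k: exposure_score(*candidates[k][:2]), reverse=True)
by_bias = sorted(candidates, key=lambda k: bross_score(*candidates[k]), reverse=True)
print("exposure-based ranking:", by_exposure)
print("bias-based ranking:", by_bias)
```

In this toy example the IV-like `code_B` tops the exposure-based ranking but falls to the bottom of the bias-based ranking, because a covariate with no outcome association has a bias multiplier of 1.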
hd-PS Analyses
• PSs were constructed using:
The top 500 exposure-ranked variables + demographics
The top 500 bias-ranked variables + demographics
The top 30 exposure-ranked variables + demographics
The top 30 bias-ranked variables + demographics
• Logistic regression on exposure + deciles of each PS
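Adjusting for deciles of the PS means replacing each estimated PS with its decile stratum and entering the strata as indicator covariates alongside exposure. The sketch below shows only that bookkeeping step, assuming the PS values have already been estimated; the function names are hypothetical, and ties are broken by input order.

```python
def decile_strata(ps_values):
    """Assign each patient to a PS decile (0 = lowest tenth, 9 = highest tenth) by rank."""
    n = len(ps_values)
    order = sorted(range(n), key=lambda i: ps_values[i])
    strata = [0] * n
    for rank, i in enumerate(order):
        strata[i] = min(rank * 10 // n, 9)
    return strata


def indicator_columns(strata, n_levels=10):
    """Dummy-code the strata, leaving out stratum 0 as the reference level."""
    return [[1 if s == level else 0 for level in range(1, n_levels)] for s in strata]


# Hypothetical PS values for 100 patients, listed from highest to lowest
ps = [0.05 + 0.009 * i for i in range(100)][::-1]
strata = decile_strata(ps)
design = indicator_columns(strata)  # 9 indicator columns per patient
print(strata[:3], strata[-3:])
```

The resulting nine columns, together with the exposure indicator, would form the design matrix for the outcome logistic regression.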
Shrinkage analyses
• Regression of the outcome on all hd-PS-screened
variables (4,800, minus those that never occur) + exposure
+ demographics
• Ridge regression
• Lasso regression
• We apply no shrinkage to the coefficient on
exposure.
• Calculate the crude estimate for comparison
Combination approaches
• Using the variables selected by the lasso regression:
• Include them in a PS analysis
• Include them in an ordinary logistic regression
outcome model
• Using the 500 variables chosen by bias-based hd-PS:
• Include them in an ordinary logistic regression
outcome model
• Include them in a lasso outcome model
• Include them in a ridge outcome model
Results – Variable selection
• Lasso selected 103 variables on average.
• 66% were also selected by at least one hd-PS algorithm
• IQR: 62-70%
• Age was selected in 100% of simulations.
• Race was selected in
Results – Bias
• Crude confounding bias of 0.19.
• Ridge and lasso regression with all variables reduce bias by 41% and 63%, respectively.
• Ridge and lasso do better when they start with prescreened variables: bias is reduced by 70% and 83%, respectively.
• Ordinary regression and PS approaches performed better. hd-PS with 500 variables completely eliminated bias.
• Bias-based hd-PS variable selection also performed well, with 93% and 91% bias reduction in the PS and ordinary regression models, respectively.
• PS and ordinary regression models also performed well using lasso variable selection (95% and 96% bias reduction).
• When restricting variables to a very small set, bias-based hdPS was much
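As a worked reading of the percentages reported above, the sketch below converts each reported bias reduction into the bias that remains, assuming each percentage is relative to the crude confounding bias of 0.19. The dictionary labels are shorthand introduced here for illustration.

```python
crude_bias = 0.19  # crude confounding bias reported above

# Reported bias reductions, as fractions of the crude bias
reductions = {
    "ridge, all variables": 0.41,
    "lasso, all variables": 0.63,
    "ridge, prescreened variables": 0.70,
    "lasso, prescreened variables": 0.83,
    "PS, bias-based hd-PS variables": 0.93,
    "ordinary regression, bias-based hd-PS variables": 0.91,
    "PS, lasso-selected variables": 0.95,
    "ordinary regression, lasso-selected variables": 0.96,
}


def residual_bias(crude, reduction):
    """Bias left over after removing the given fraction of the crude bias."""
    return crude * (1.0 - reduction)


for method, reduction in reductions.items():
    print(f"{method}: {residual_bias(crude_bias, reduction):.3f}")
```

Under this reading, ridge regression on all variables leaves a residual bias of about 0.112, while the PS and ordinary regression approaches leave less than 0.02.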
• The variable selection method had relatively little impact.
• The estimation method mattered much more.
• Shrinkage of coefficient estimates led to insufficient
bias control.
• Focus on including a large number of potential
confounders or confounder proxies.
• There are many “instruments” in the current simulation.
• Variables associated with exposure that are not
included in the outcome simulation model are
essentially IVs, which is unrealistic.
• There is no unmeasured confounding in these data.
• Variable selection is an easier task when all important
confounders are measured.
Future work
Enrich the outcome model
• Non-linear associations, more interactions, more true confounders
Vary the true treatment effect
• Modify the coefficient on treatment in the outcome generation model.
Vary exposure prevalence
• Can be accomplished by sampling within exposure group.
Vary outcome prevalence
• Modify the intercept in the outcome generation model.
Unmeasured confounding
• Set aside one or more true confounders and don’t allow methods to utilize them.
Other base datasets
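Of the variations listed above, changing outcome prevalence by modifying the intercept lends itself to a short sketch. Mean predicted risk increases monotonically with the intercept, so a simple bisection can find the intercept that gives a target prevalence. The function names, search bounds, and toy linear predictors are assumptions for illustration; a nonzero treatment effect could be handled the same way by adding a coefficient times X to each patient's linear predictor.

```python
import math


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_risk(intercept, linear_parts):
    """Average outcome probability across patients for a given intercept."""
    return sum(expit(intercept + lp) for lp in linear_parts) / len(linear_parts)


def calibrate_intercept(linear_parts, target_prev, lo=-20.0, hi=20.0, tol=1e-10):
    """Bisection: mean risk is increasing in the intercept, so the root is unique."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if mean_risk(mid, linear_parts) < target_prev:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical covariate contributions to the linear predictor (intercept excluded)
linear_parts = [0.0, 0.5, 1.0, 1.5, -0.5]

intercept = calibrate_intercept(linear_parts, target_prev=0.05)
print(round(intercept, 4), round(mean_risk(intercept, linear_parts), 4))
```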
• Co-authors:
Wesley Eddings
Jeremy A Rassen
Robert J Glynn
Sebastian Schneeweiss
• Contact:
• www.drugepi.org/faculty-staff-trainees/faculty/jessicafranklin/
