Report

PROPENSITY SCORES MAKING SENSE OF NON-RANDOMIZED OBSERVATIONAL DATA Atul Sharma MD, MSc, FRCPC(ret) Biostatistical Consulting Unit April 2014 Propensity score 1996 - 2013 RCT – the gold standard R.A. Fisher: The Design of Experiments, 1925 • Introduced the concept of randomization Neyman-Ruben Counterfactual Framework • Ruben and Rosenbaum, 1983 • Formalizes conditions for valid RCT Goal: estimate treatment effect in ith individual ti = Yi1 – Yi0, where • Yi1 = outcome of ith patient with treatment, Yi0 = outcome with control • Usually, we observe only one outcome i.e. “counterfactual” is hypothetical • This is the “fundamental problem of causal inference “ -> randomization Neyman-Ruben Theory Counterfactual Framework • With randomization, average treatment effect estimated by difference in means or proportions between treatment & control groups i.e. • Randomization ensures groups balanced for all important covariates -> unbiased estimate of average treatment effect. Treatment assignment is ‘strongly ignorable’ • 1:1 randomization allows direct comparison of groups with simple statistics e.g t-tests for means; RR, OR, NNT for proportions • With more complex study designs e.g. 1:k randomization, weighted averages still permit unbiased estimates of average treatment effect • All such estimates rely implicitly on knowing the probability of assignment to treatment group = propensity score (e.g. ½ ) Neyman-Ruben Theory Counterfactual Framework Sometimes randomization fails: • Selection biases arise when there is an imbalance between groups for observed or unobserved covariate, and probability of treatment now depends on covariates e.g. older or sicker patients more likely to receive treatment • Confounding occurs there is an imbalance between treatment groups on observed or unobserved covariates that influence outcome e.g. older or sicker patients more likely to receive treatment and more likely to die • Potential confounders: Covariates correlated with treatment and outcomes Neyman-Ruben Theory Counterfactual Framework • Imbalance may be due to failure of randomization, non-compliance, unblinded treatment assignment. • Imbalance also arises as a result of non-randomized treatment assignment chosen by patient/physician or institutional/social policies i.e. • ‘natural experiments’ or ‘quasi-experimental designs’ • A non-randomized study can yield unbiased estimates of treatment effects Conventional methods for adjusting for imbalance: • Stratification (limited to a few key covariates e.g. age, gender). For example, matching on 10 binary factors e.g disease history -> 1024 strata • Matching case-control (also limited to a few key covariates). Given covariates, treatment assignment is ‘strongly ignorable’ • Regression adjustment : Models effects of confounders on outcome directly • Can adjust for multiple confounders, but effectiveness depends on correctly specifying the model e.g. non-linear terms, interactions, etc. Misspecification may actually worsen bias • Overfitting may improve fit with current data (‘internal validity’), but limit generalizability to larger population (‘external validity’). Principle of parsimony limits model complexity • Almost impossible to assess effectiveness of ‘bias control’ (unlike PS) Balancing Property of the Propensity Score: • PS is probability of assignment to treatment given confounders, summarizes all observed confounders in a single number (scalar) • The ‘balancing property’ tells us that conditioning on the PS by stratifying, matching or regression adjustment will improve balance between groups • Alternatively, • The ‘fundamental theorem’: as sample size increases, treatment is assignment independent of observed covariates after conditioning on propensity score (Ruben and Rosenbaum, 1983) • Given PS, treatment assignment is said to be ‘strongly ignorable’ Propensity scores in the real world: • Advantages: • PS estimated by familiar methods (e.g. logistic regression) • Balances multiple covariates simultaneously • Model of treatment assignment applies only to current data, no external validity • Similarly, less need for parsimony. Over-fitting is not a problem if balancing works • After conditioning on PS, balance can be assessed easily and directly • Caveats: • Unlike randomization, no assurance that unobserved covariates will balance • You can only balance observed/ measured covariates • Works best with large N, as imbalance may be unavoidable with small N • Take care to avoid covariates that are really treatment outcomes (mediators). Adjusting for covariates that are effected by treatment may worsen bias A.F. Connors et al. The Effectiveness of Right Heart Catheterization in the Initial Care of Critically Ill Patients. JAMA 1996, Sep 18; 276(11): 889-97 Objective Examine impact of RHC use during 1st 24h of ICU care & outcome i.e. survival, length of stay, intensity of care, cost of care Design Prospective cohort study Setting 5 US teaching hospitals, 1989 – 1994. Subjects 5735 critically ill adult patients in ICU for 1 of 9 prespecified disease categories: acute respiratory failure, COPD, CHF, cirrhosis, coma, metastatic colon cancer, non-small cell lung cancer, multi-organ failure with malignancy or sepsis Treatment RHC at discretion of physician, possibly confounded with patient factors related to the outcome e.g. BP Outcome(s) Survival, costs, intensity of care, hospital stay Dataset and codebook available at http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets A.F. Connors et al. The Effectiveness of Right Heart Catheterization in the Initial Care of Critically Ill Patients. JAMA 1996, Sep 18; 276(11): 889-97 • RCT is often impractical for ethical, logistic, financial reasons e.g. “ the popularity of this procedure and the widespread belief that it is beneficial makes the performance of an RCT difficult” • Treatment selection at discretion of MD, likely to be confounded with patient characteristics related to the outcome e.g. patients with low BP more likely to receive RHC and more likely to die • 7 critical care specialists identified 50 variable likely to influence treatment Dependent variable is RHC in first 24h (2184 cases vs 3551 controls) Independent variables are age, sex, race, education, income, medical insurance, 9 primary disease categories, 10 admission diagnoses, 12 comorbid conditions, pre-admission ADL and DASI, day 1 DNR, cancer, 2 mo. survival probability, APACHE III APS score, Glasgow Coma Score, weight, temperature, BP, respiratory rate, heart rate, PaO2/FiO2, PaCO2, pH, WBC, hematocrit, sodium, potassium, creatinine, bilirubin, albumin, urine output Great care taken to identify all confounders that might bias outcomes through literature review, expert panels A.F. Connors et al. The Effectiveness of Right Heart Catheterization in the Initial Care of Critically Ill Patients. JAMA 1996, Sep 18; 276(11): 889-97 Pre-treatment balance: 8 select covariates Covariate Age (yrs) Gender % male APACHE III APS Weight (kg) Mean BP mmHg Respiratory Rate min-1 WBC Creatinine RHC+ 60.8 58.5 60.7 72.4 68.2 26.7 16.3 218 RHC61.8 53.9 50.9 65.0 84.9 29.0 15.3 169 p-value = 0.026 < 0.001 < 0.001 < 0.001 < 0.001 < 0.001 = 0.002 < 0.001 How do we compare groups that are fundamentally different? A.F. Connors et al. The Effectiveness of Right Heart Catheterization in the Initial Care of Critically Ill Patients. JAMA 1996, Sep 18; 276(11): 889-97 100 Kaplan Meier Survival, death by day 30 swang1 80 70 60 % Surviving 90 No RHC RHC 0 5 10 15 20 25 Survival Time in Days RHC RHC No RHC N 2184 3551 Events 830 1088 % death 38.0 30.6 Log-rank p< 0.0001 30 A.F. Connors et al. The Effectiveness of Right Heart Catheterization in the Initial Care of Critically Ill Patients. JAMA 1996, Sep 18; 276(11): 889-97 Cox proportional hazards regression • lt = hazard function = probability of death at time t given survival until t • log(lt) = l0 + b1 x RHC • e b1 = hazard ratio for RHC Unadjusted Cox PH model: • coxph(survival~swang1, data=rhc)) Cox Model Hazard Ratio 95% CI Log-rank Unadjusted 1.30 1.19-1.43 p<0.0001 Given selection biases, patients with RHC sicker (lower BP, higher creatinine, higher APACHE scores), HR likely overestimate of true risk of treatment 0.8 0.6 0.4 0.2 0.0 p = Prob{ RHC } 1.0 Estimating propensity scores • Logistic regression for binary outcome e.g. RCH vs No RHC (swang1) • LHS p = probability of treatment assignment = propensity score • RHS covariates may be numeric or categorical (0/1 dummy variables) −6 −4 −2 0 2 4 linear predictor • b’s or OR = eb measure impact of each predictor on probability of RHC • For each subject, fitted value of p or linear predictor may be used as PS Estimating propensity scores Logistic regression: RHC+/RHC- adjusted for 50 risk factors model1 = glm(swang1~age + sex + race + edu + income + ninsclas + cat1 + das2d3pc + dnr1 + ca + surv2md1 + aps1 + scoma1 + wtkilo1 + temp1 + meanbp1 + resp1 + hrt1 + pafi1 + paco21 + ph1 +wblc1 + hema1 + sod1 + pot1 + crea1 + bili1 + alb1 + resp + card + neuro + gastr + renal + meta + hema + seps + trauma + ortho + cardiohx + chfhx + dementhx + psychhx + chrpulhx + renalhx + liverhx + gibledhx + malighx + immunhx + transhx + amihx, data=rhc, family=binomial) prop_score = fitted.values(model1) • 30 of 50 were significant independent predictors of treatment (p < 0.05) • OR = eb measures impact of predictors on probability of RHC Covariates Predictor CHF Day 1 DNR Status 10 mmHg drop BP 0.1U drop in arterial pH 10 meq/l drop sodium 100mmol/l increase in creatinine OR p value 2.19 <0.0001 0.51 1.07 <0.0001 <0.0001 1.18 1.13 <0.0001 0.016 1.06 0.014 • in general, sicker patients more likely to undergo RHC Estimating propensity scores Does LR model for PS adequately discriminate treatment groups? Check ROC Sensitivity 0.2 0.6 1.0 • Data: 3551 controls < 2184 cases • Area under the curve: 0.80 • 95% CI: 0.79-0.81 −0.2 Optimal Model Threshold Specificity Sensitivity 0.41 0.73 0.71 1.0 0.6 0.2 Specificity If discrimination inadequate, refine model: • Add covariates, nonlinear terms, interactions • Alternate PS estimation procedure • Probit regression • Discriminant analysis Estimating propensity scores: • Compare frequency distributions • PS should discriminate, but overlap is essential! Frequency Histogram No RHC Propensity 11 33 109 180 287 360 467 558 839 900 600 300 N 67 0.95 0.85 0.75 0.65 0.55 0.45 0.35 0.25 0.15 0.05 707 0 RHC 225 305 344 322 308 243 189 139 42 0 300 N 600 900 Using Propensity Scores to balance covariates in observational data: 1. Stratify by quintile (Ruben and Rosenbaum, 1983): • Each quintile analyzed separately as if from RCT (~meta-analysis) • With perfect stratification, average treatment effect across strata is unbiased estimate of true treatment effect • 5 groups eliminate 90% of bias from observed confounders 2. Covariate (regression) adjustment: outcome vs both treatment and PS quintile • Continuous numeric outcomes: ordinary linear regression • Binary categorical outcomes: logistic regression • Survival analysis: Cox proportional hazards regression 3. Match cases and controls by PS • Compare groups as if from RCT • Almost always reduces sample size (unlike 1 ,2) Stratify by Propensity Score: • As few as 5 strata can eliminate 90% bias (Ruben and Rosenbaum 1983) centiles= quantile(prop_score, probs=(0,0.2,0.4,0.6,0.8,1)) Probability 0 0.2 0.4 0.6 0.8 1 Percentile 0.0027 0.1299 0.2734 0.4353 0.6235 0.9781 ps_quintiles = cut(prop_score, centiles, include.lowest=TRUE); Quintile 1 2 3 4 5 Sample N 1147 1147 1147 1147 1147 No RHC 1073 908 737 551 282 RHC 74 239 410 596 865 Balancing property of propensity score: • Like randomization, treated and control patients in each PS quintile have similar distribution of measured covariates • Unlike randomization, does not mean they will be similar for unobserved covariates • Does not imply that pairs matched by PS will be similar on covariates Assess balance by quintile: 250 Creatinine 0 50 100 150 200 No RHC RHC Q1 Q2 Q3 Q4 Q5 ps_quintiles Quintile Mean (mmol/l) 1 129.4 2 157.3 3 182.1 4 211.3 5 258.5 Assess balance by quintile: 100 Mean BP 0 20 40 60 80 No RHC RHC Q1 Q2 Q3 Q4 Q5 ps_quintiles Quintile Mean (mmHg) 1 102.8 2 87.4 3 4 77.2 67.7 5 57.4 Assess balance by quintile: PO2/FiO2 0 50 100 200 No RHC RHC Q1 Q2 Q3 Q4 Q5 ps_quintiles Quintile Mean 1 287.7 2 246.7 3 4 220.2 203.0 5 153.8 Are covariates related to RHC after PS adjustment? • Covariate balance after stratification must be assessed formally • Simple t-tests affected by sample-size differences (N=5735 vs 1147) • Differences between RHC+/RHC- should be tested using methods unaffected by sample size • Recommended : Regress each covariate on RHC ± PS quintile • covariate ~ treatment ± ps_quintile • If covariate unrelated to RHC after adjustment for PS, strata balanced Are covariates related to RHC after PS adjustment? Linear regression: age vs RHC status ± PS quintile (categorical, 5 levels): • • lm(age~swang1, data=rhc) lm(age~swang1 + ps_quintiles, data=rhc) Logistic regression: sex vs RHC status ± PS quintile: • • glm(sex~swang1, family = binomial, data=rhc) glm(sex~swang1 + ps_quintiles, family = binomial, data=rhc) PS-adjusted p-values Covariate Before Age (yrs) 0.026 Gender % male <0.001 APACHE III APS <0.001 Weight (kg) <0.001 Mean BP mmHg <0.001 Respiratory Rate min-1 <0.001 WBC 0.002 Creatinine <0.001 After 0.947 0.730 0.100 0.530 0.260 0.529 0.610 0.470 After adjusting for PS quintile, cases and controls are well balanced. • May need to refine treatment model by adding covariates, non-linear terms, interactions - iteratively until balanced Stratify by Quintile : Analyze each separately • Unadjusted Cox PH model applied to each stratum Quintile 1 2 3 4 5 Hazard Ratio 1.21 1.30 1.24 1.21 1.22 Sometimes meaningful to report strata individually (distinct profiles, outcomes) Average across strata (HR=1.24) is an ‘unbiased estimate of true treatment effect’ For most analyses, formulae exist for calculating pooled significance tests adjusted for sample sizes and variances in each stratum • coxph(survival ~ swang1+strata(ps_quintiles), data=rhc) Cox Model Hazard Ratio 95% CI Log-rank Unadjusted 1.30 1.19-1.43 p<0.0001 Pooled 1.24 1.11-1.37 P<0.0001 Stratify by Quintile : Analyze each separately • Unadjusted Cox model applied to each stratum • Individual strata can be reported separately (sometimes meaningful) • coxph(survival~ swang1, data=rhc3) PS Quintile 3 RHC No RHC RHC N 737 410 Deaths 222 148 % deaths 30.1 36.1 HR 95% CI Log-rank 1.24 1.01-1.53 p= 0.03 Using Propensity Scores to balance covariates in observational data: 1. Stratify by quintile (Ruben and Rosenbaum, 1983): • Each quintile analyzed separately as if from RCT (~meta-analysis) • With perfect stratification, average treatment effect across strata is unbiased estimate of true treatment effect • 5 groups eliminate 90% of bias from observed confounders 2. Covariate (regression) adjustment: outcome vs both treatment and PS quintile • Continuous numeric outcomes: ordinary linear regression • Binary categorical outcomes: logistic regression • Survival analysis: Cox proportional hazards regression 3. Match cases and controls by PS • Compare groups as if from RCT • Almost always reduces sample size Regression Adjustment using propensity scores • Regression of outcome on treatment with adjustment for PS quintile Cox proportional hazards models: Unadjusted : coxph(survival~swang1, data=rhc) Quintile PS : coxph(survival~swang1+ps_quintile, data=rhc) Linear PS : coxph(survival~swang1+prop_score, data=rhc) Cox Model Unadjusted PS quintiles PS linear Hazard Ratio 1.30 1.24 1.23 95% CI 1.19-1.43 1.12-1.37 1.04-1.58 Log-rank p<0.0001 p<0.0001 P<0.0001 Regression Adjustment using propensity scores In weighted regression, underrepresented subjects given greater weight so that weighted sample more representative • Inverse-Probability-Treatment Weights (IPTW) • treatment effect in population whose risk factor distribution equals all study subjects • regression weights : 1/PS for controls and 1/(1-PS) for treated • iptw=ifelse(swang1==TRUE,1/(1-prop_score), 1/prop_score)) • Standardized Mortality Estimator (SMR) Weights • treatment effect in population whose risk factor distribution equals treated subjects* • regression weights : PS/(1-PS) for controls and 1 for treated • smr=ifelse(swang1==TRUE,1, prop_score/(1-prop_score)) • Weighted regression • coxph(survival ~ swang1, weights=smr, data=rhc) • Results disappointing: • Freedman, Berk 2008: better just to fit the model without weights Regression Adjustment without propensity scores Adjusted Cox PH model: model effect of confounders on outcome directly coxph(survival~ swang1 + age + sex + race + edu + income + ninsclas + cat1 + das2d3pc + dnr1 + ca + surv2md1 + aps1 + scoma1 + wtkilo1 + temp1 + + meanbp1 + resp1 + hrt1 + pafi1 + paco21 + ph1 + wblc1 + hema1 + sod1 + pot1 + crea1 + bili1 + alb1 + resp + card + neuro + gastr + renal + meta + hema + seps + trauma + ortho + cardiohx + chfhx + dementhx + psychhx + chrpulhx + renalhx + liverhx + gibledhx + malighx + immunhx + transhx + amihx, data=rhc) Cox Model Hazard Ratio 95% CI Log-rank Unadjusted 1.30 1.19-1.43 p<0.0001 Adjusted 1.24 1.12-1.38 P<0.0001 • Misspecification and over-fitting may limit generalizability (external validity) • Difficult to assess adequacy of balance when modeling outcome directly • May provide reassuring confirmation of bias adjustment by PS Using Propensity Scores to balance covariates in observational data: 1. Stratify by quintile (Ruben and Rosenbaum, 1983): • Each quintile analyzed separately as if from RCT (~meta-analysis) • With perfect stratification, average treatment effect across strata is unbiased estimate of true treatment effect • 5 groups eliminate 90% of bias from observed confounders 2. Covariate (regression) adjustment: outcome vs both treatment and PS quintile • Continuous numeric outcomes: ordinary linear regression • Binary categorical outcomes: logistic regression • Survival analysis: Cox proportional hazards regression 3. Match cases and controls by PS • Compare groups as if from RCT • Almost always reduces sample size Propensity score matching: • Create matched case-control pairs by matching on PS • Confirm balance before and after matching by comparing treatment groups with procedure independent of sample size (e.g. standardized differences). If inadequate, refine PS model or matching procedure iteratively until it works • Compare groups as if from randomized study • Caveats: • ‘Balancing property’ of PS does not ensure matched pairs will match on covariates, only groups on average • Matching constraints usually reduce sample size (stratification and regression adjustment use all cases and controls) • Requires specialized software Propensity score matching: • Matching on PS: • Nearest neighbour: matches each case to closest control • Caliper matching: only if PS within a pre-specified caliper distance, typically 0.2 sd’s • Matching on original covariates • Minimize sum of distances between cases and controls i. Euclidean distance: ii. Mahalanobis distance weights the contributions by inverse covariance, to avoid over-counting correlated measures, generally preferable i. Exact matching for categorical variables e.g. gender, disease status Propensity score matching: Options include: • matching ± replacement (re-use controls; less bias, more analysis) • one:many, many:one or variable ratio matching • greedy (closest) vs optimal matching (minimizes sum of distances) • almost exact matching, fine balance (without matching) Suggested approach: • start with nearest neighbor matching without replacement • ± caliper ≤ 0.20 SD’s • ± exact matching for priority covariates (gender, disease) • check balance using standardized differences • consider optimal matching (effective, but intensive) • experiment with other options until cases and controls balanced • may need to use specialized packages (Stata, R) Propensity score matching in R: Matching http://sekhon.berkeley.edu/matching Sekhon, J. S. (2011). Multivariate and propensity score matching software with automated balance optimization: The Matching package for R. Journal of Statistical Software 42(7). • Uses automated procedure to select matches, based on univariate and multivariate balance diagnostics. Primarily 1:M matching (where M is a positive integer), allows matching with or without replacement, caliper, exact MatchIt http://gking.harvard.edu/matchit Ho, D.E., Imai, K., King, G., and Stuart, E.A. (2011). MatchIt: Nonparametric preprocessing for parameteric causal inference. Journal of Statistical Software 42(8) • Two-step process: does matching, then user does outcome analysis. Wide array of estimation procedures and matching methods available: nearest neighbor, Mahalanobis, caliper, exact, full, optimal, subclassification optmatch http://cran.r-project.org/web/packages/optmatch/index.html Hansen, B.B., and Fredrickson, M. (2009). • optmatch: Functions for optimal matching. Variable ratio, optimal, and full matching. Propensity score matching in Stata: psmatch2 http://ideas.repec.org/c/boc/bocode/s432001.html Leuven, E. and Sianesi, B. (2003). Stata module to perform full Mahalanobis and propensity score matching, common support graphing, and covariate imbalance testing. • Allows k:1 matching, kernel weighting, Mahalanobis matching pscore http://www.lrz-muenchen.de/~sobecker/pscore.html Becker, S.O. and Ichino, A. (2002). Estimation of average treatment effects based on propensity scores (2002) The Stata Journal 2(4): 358-377. • k:1 matching, radius (caliper) matching, and stratification (subclassification) match http://www.economics.harvard.edu/faculty/imbens/software_imbens Abadie, A., Drukker, D., Herr, J. L., and Imbens, G. W. (2004). Implementing matching estimators for average treatment effects in Stata. The Stata Journal 4(3): 290-311. • Primarily k:1 matching (with replacement) cem http://gking.harvard.edu/cem/ Iacus, S.M., King, G., and Porro, G. (2008). Matching for Causal Inference Without Balance Checking. • Implements coarsened exact matching Propensity score matching in SAS: Local and global optimal propensity score matching Coca-Perraillon, M. (2007). Local and global optimal propensity score matching. In SAS Global Forum 2007. Paper 185-2007. • Variety of matching methods. No built in diagnostics. Greedy matching (1:1 nearest neighbor) Parsons, L. S. (2001). Reducing bias in a propensity score matched-pair sample using greedy matching techniques. In SAS SUGI 26, Paper 214-26. Parsons, L.S. (2005). Using SAS software to perform a case-control match on propensity score in an observational study. In SAS SUGI 30, Paper 225-25. Kosanke, J., and Bergstralh, E. (2004): Match 1 or more controls to cases using the GREEDY algorithm 1:1 Mahalanbois matching within propensity score calipers Feng, W.W., Jun, Y., and Xu, R. (2005). A method/macro based on propensity score and Mahalanobis distance to reduce bias in treatment comparison in observational study. www.lexjansen.com/pharmasug/2006/publichealthresearch/pr05.pdf Propensity Score Matching Using R Match() function to implement • Nearest neighbour match on PS with caliper = 0.1 (PS within 0.025) • Exact matching on gender, admission diagnosis (9 disease categories) Original number of observations.............. Original number of treated obs............... Matched number of observations............... Matched number of observations (unweighted). 5735 2184 1461 1461 Number of obs dropped by 'exact' or 'caliper' 2090 controls 723 cases, Balance checks before and after matching are mandatory • Test for balance must be independent of sample size • Given numeric covariate x, standardized mean difference = • For proportions p, • Most authors equate balance with a standardized mean difference < 10% • Full range pre- and post- match test statistics output by MatchBalance() Propensity Score Matching Output of MatchBalance() function includes tests for equality, treatment vs controls • means, standardized mean differences (< 10%) • variances (treatment vs controls) (0.5-2.0, Rubin 2001) • empiric quantiles (quantile-quantile plot) • empiric cumulative distribution function ***** (V5) meanbp1 ***** Before Matching Matching mean treatment........ 68.198 mean control.......... 84.869 std mean diff......... -48.685 After 73.523 73.673 -0.42029 mean raw eQQ diff..... med raw eQQ diff..... max raw eQQ diff..... 16.673 11 49 1.6547 1 74 mean eCDF diff........ med eCDF diff........ max eCDF diff........ 0.092555 0.06949 0.21172 0.0092787 0.0058179 0.039014 var ratio (Tr/Co)..... 0.77589 T-test p-value........ < 2.22e-16 KS Naive p-value...... < 2.22e-16 KS Statistic.......... 0.21172 1.0215 0.90308 0.21612 0.039014 Propensity Score Matching: Assessing matching balance dot chart effective for display of many simultaneous covariates Treatment and control groups balance when Standardized Mean Differences < 10% Propensity Score Matching Output of MatchBalance() function includes tests for equality, treatment vs controls • means, standardized mean differences (< 10%) • variances (treatment vs controls) (0.5-2.0, Rubin 2001) • empiric quantiles (quantile-quantile plot) • empiric cumulative distribution function ***** (V5) meanbp1 ***** Before Matching Matching mean treatment........ 68.198 mean control.......... 84.869 std mean diff......... -48.685 After 73.523 73.673 -0.42029 mean raw eQQ diff..... med raw eQQ diff..... max raw eQQ diff..... 16.673 11 49 1.6547 1 74 mean eCDF diff........ med eCDF diff........ max eCDF diff........ 0.092555 0.06949 0.21172 0.0092787 0.0058179 0.039014 var ratio (Tr/Co)..... 0.77589 T-test p-value........ < 2.22e-16 KS Naive p-value...... < 2.22e-16 KS Statistic.......... 0.21172 1.0215 0.90308 0.21612 0.039014 Propensity Score Matching: Assessing matching balance Equality of distributions may be assessed by QQ plots, eCDF, KS statistic Quantile – quantile plot: Creatinine (mmol/l) Empiric CDF: Creatinine (mmol/l) Formally compared via KS statistic Propensity Score Matching RHC RHC No RHC Cox PH regression: 1461 matched pairs coxph(survival ~ swang1, data = rhcm) Cox Model N Hazard Ratio Unadjusted 5735 1.30 PS quintiles 5735 1.24 PS matched pairs 2922 1.23 95% CI Log-rank 1.19-1.43 p<0.0001 1.12-1.37 p<0.0001 1.09-1.41† p=0.001 N 1461 1461 Deaths 536 454 % Deaths 36.7 31.1 Log-rank p = 0.001 Testing the fundamental assumption: “Estimated treatment effects are unbiased if there are no unobserved confounders i.e. all relevant covariates have been included” Sensitivity Analysis seeks to test robustness of conclusions to hidden biases from unobserved confounders Connors. 1996: • Asked 13 practicing clinicians to identify 10 variables with the greatest influence on their decision to use RHC in practice. None missing in study • From logistic regression, identified 4 physiologic variables with the largest effect on the decision to treat: PaO2/FiO2, mean BP, pulse, and resp. rate. Individually omitted from PS model -> hazard ratio ± 0.01 • Formal ‘Rosenbaum bounds’ to estimate how much hidden bias would need to be present to invalidate conclusions, implemented in rbounds libraries for R, Stata. Rosenbaum Sensitivity Test for matched pairs: • Matched patients have same PS and same probability of treatment (RHC) if no unmeasured confounders • Assume true probabilities differ due to an unmeasured confounder that correlates perfectly with outcome (worst-case scenario). Let G = odds ratio for its effect on treatment assignment (in logistic regression for PS) • How large does G need to be to invalidate the conclusion RCH increased mortality? • For binary outcome: 2 x 2 table for survival in 1461 matched pairs • Control dies Control lives RHC dies 651 356 RHC lives 274 180 Discordant outcomes in 274+356 = 730 of 1461 matched pairs McNemar test for matched pairs: Under Ho , how likely is it that two discordant cells (274 vs 356 ) arose from same binomial distribution by chance? Rosenbaum Sensitivity Test for matched pairs: • For binary outcome: 2 x 2 table for survival in 1461 matched pairs Control dies Control lives RHC dies 651 356 RHC lives 274 180 ← McNemar p= 0.001 binarysens(274,356,Gamma=1.2,GammaInc=0.02) ← varies G -> effect on McNemar p-value Rosenbaum Sensitivity Test p-value of estimate .... 0.0006 Gamma Lower bound Upper bound 1.00 0.00062 0.00062 ← no confounding by unobserved covariate 1.02 0.00025 0.00143 1.04 0.00010 0.00308 1.06 0.00004 0.00619 1.08 0.00001 0.01170 1.10 0.00000 0.02082 1.12 0.00000 0.03503 1.14 0.00000 0.05590 ← McNemar p > 0.05 Unmeasured confounder needs to increase odds of RHC by 1.14 to nullify the increase in mortality with treatment Sensitivity analysis: Interpreting Rosenbaum bounds • Becker et al, 2007: “These are worst case scenarios…a critical value of G = 1.15 does not mean there is no effect of treatment on the outcome. However, the results are sensitive to deviations from the unconfoundedness assumption, and we advise some caution when interpreting” • Connors et al used a ‘less worst-case’ scenario to estimate G needed to ‘invert’ HR i.e. reduce it from 1.2 to 0.8 and cited a 3-fold increase in odds of RHC due to unobserved confounders. • rbounds now the standard measure of sensitivity to hidden biases in matched studies • psens() performs similar analysis for numeric outcomes based on Wilcoxon rank-sum test (≠ McNemar) • On Friday, we’ll demonstrate both binarysens() and psens() to perform sensitivity analysis for different outcomes Devices to increase robustness • Multiple control groups: • consistency between control groups that differ in unobserved confounders • e.g. different ways to escape treatment e.g. declined vs denied treatment • Coherence among outcomes: • e.g. length or cost of hospitalization rather than survival, looking for consistency • subgroup analysis: looking for consistency or plausible variations e.g. elderly, women, whites, shock/sepsis/ post-op comparable • Dose response relationships: looking for a consistent effect • e.g. SVC pressure line vs. full RHC • Known effects • e.g. outcomes known to be unaffected by treatment A brief review Quasi-experimental designs require as much care and attention as an RCT • Carefully identify all potential confounders by • Clinical judgment • Literature review • Expert panels • Prioritize confounders to identify those which must be balanced (a priori) • Avoid controlling for post-treatment covariates After estimating PS, make sure treatment model discriminates. Add covariates, nonlinear terms, or interactions until it does. Make sure that group scores overlap After conditioning on PS by stratification, regression, or matching, make sure that groups are balanced using measures independent of sample size. If needed, revise the treatment model or tighten up matching criteria Assess robustness of conclusions to hidden bias due to unobserved confounders: • Have external panel of experts review choice of predictors • Re-analyze after omitting important confounders • Estimate Rosenbaum bounds i.e. G = how much hidden bias needed to invalidate conclusions Useful References: • Rosenbaum, P.R., & Rubin, D.B. The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70(1), 41-55 • A.F. Connors et al. The Effectiveness of Right Heart Catheterization in the Initial Care of Critically Ill Patients. JAMA 1996; 276(11): 889-97 • Rosenbaum, P.R. (2010). Design of Observational Studies. New York: Springer • Pattanayaka, C.W. et al. Propensity Score Methods for Creating Covariate Balance in Observational Studies. Rev Esp Cardiol. 2011; 64(10):897–903 • Stuart, E. A. Matching methods for causal inference: A review and a look forward. Statistical Science 20. 2010; 25: 1-21. • Stukel, T.A. et al. Analysis of Observational Studies in the Presence of Treatment Selection Bias. JAMA 2007; 297(3) 278-85 • J.S. Sekhon. Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching Package for R. J Stat Software. 2011; 42(7): 1-52 Thank-you for your patience: QUESTIONS?