Propensity Score - Hankamer School of Business

```Propensity Score
• Overview:
– What do we use a propensity score for?
– How do we construct the propensity score?
– How do we implement propensity score
estimation in STATA?
Joke (kind of…)
• Two heart surgeons (Jack and Jill) walk into a bar.
– Jack: “I just finished my 100th heart surgery!”
– Jill: “I finished my 100th heart surgery last week. Which
probably means I’m a better heart surgeon. How many of your
patients died within 3 months of surgery? I’ve only had 10 die.”
– Jack: “Five. So I’m probably the better surgeon.”
– Jill: “Or maybe mine are older and have a higher risk than your
patients.”
• There may be differences in the patients’ characteristics
between Jack and Jill
– We want to show the difference due to treatment (Jill)
– We want to compare apples to apples – not apples to oranges
Purpose of propensity scores
• It can produce apples-to-apples comparisons
when treatment is non-random (non-ignorable
treatment assignment)
• Provides a way to summarize covariate
information about treatment selection into a
single number (scalar)
• Can be used to adjust for differences via study
design, or matching, or during estimation of the
treatment effect (e.g., subclassification or
regression)
Propensity score estimation
• Some caveats
– This is only relevant for selection on observables
– If you cannot write down a conditioning strategy
such that conditioning on X will satisfy the
backdoor criterion, then this is not the research
design you choose
– You need to identify the confounders, X, that will
block all back doors – based on economic theory –
and you will need data on them
Better example: a case in which the propensity
score is useful for causal inference
• Suppose that we are interested in whether a
scholarship program caused children in this program
to spend more years in high school (grades 9-12).
• You have data on every child, including test
scores, family income, age, gender, etc.
• Scholarships are awarded based on some
combination of test scores, family income,
gender, etc., but you don’t know the exact
formula.
Motivation (cont.)
• Ignorable treatment assignment: Scholarships are assigned to
students randomly, independent of how a student is expected to
perform in high school
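– Calculate ATE by estimating simple difference in mean outcomes:
$\frac{1}{N_1}\sum (Y \mid D=1) - \frac{1}{N_0}\sum (Y \mid D=0)$
where $N_1$ and $N_0$ are the numbers of students with and without the scholarship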
• But what if ignorability is violated?
– For instance, assume you know that children with higher test scores are
more likely to get the scholarship (positive selection), but you don’t know
how important this and other factors are. You only know that the decision
is based on information you have (X) and some randomness.
– What can you do with this information?
Motivation (cont.)
• In principle, you could estimate the effect of the scholarship using OLS,
regressing the outcome on the treatment dummy and controlling for X,
where X is a matrix of covariates that you think affect the probability
of receiving a scholarship.
• OLS consistently estimates the conditional mean, but if probability of
getting a scholarship is not a linear function of X, this conditional mean
estimate may not be informative.
• Usually, we won’t know how the selection depended on X, only that it did.
– For instance, they may use discrete cutoffs rather than a linear function
Motivation (cont.)
• Suppose your variables are not continuous but categorical
(with somewhat arbitrary cutoffs).
– E.g. family income above or below \$50 per week, scores above or
below the mean, sex, age, etc.
• Now, you could put in dummy variables for each category and
interaction between all dummies. This would distinguish
every group formed by the categories.
• Or you could run separate regressions for each group
– This is more flexible since it allows the effect of the scholarship to
differ by group.
• These methods are in principle correct, but they are only
feasible if you have a lot of data and few categories.
Constructing the Propensity Score
• Estimation of average treatment effects based on
propensity score estimation can handle
sparseness and ignorance about the functional
form associated with treatment assignment.
• You will first need to have a selection into the
treatment (in our case the scholarship) that is
based on observables, or “selection on
observables”.
• The following gives a brief overview of how the
propensity score is constructed. Stata has a canned
command that will do all of this for you.
Definition and General Idea
• Definition: The propensity score is the conditional probability
of being assigned to the treatment group (e.g., 9-12 grade
scholarship), conditional on the particular covariates (X).
– Pr(D=1|X) is a single number between 0 and 1 (e.g., 55%)
• The idea is to compare units who, based solely on their
observables, had very similar probabilities of being placed
into treatment
– If conditional on X, two units have a similar probability of treatment,
then we say they have similar propensity scores
• We then think that all the difference in the outcome
variable is due to the treatment.
– If we compare a unit in the treatment group to a control group unit with a
similar propensity score, then conditional on the propensity score, all remaining
variation between these two is random, provided selection on observables holds
First stage
• Estimation using this method is a two-stage procedure
– First stage: estimates the propensity score
– Second stage: calculate the average causal effect of interest by
averaging differences in outcomes over units with similar propensity
scores
• First stage: estimate the propensity score:
– First, estimate the following equation with binary treatment (D) on
the LHS, and covariates (X) that determine selection into treatment on
RHS using logit or probit model:
– Second, using estimated coefficients, calculate the predicted LHS
– The propensity score is just the predicted conditional probability of
treatment (using estimated coefficients on X) for each unit
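• A minimal Stata sketch of this first stage, assuming hypothetical variable
names (treat, x1, x2, x3, pscore_hat) that are not from the slides:
* Logit of the treatment dummy on the covariates that drive selection
logit treat x1 x2 x3
* Predicted probability of treatment = estimated propensity score
predict pscore_hat, pr
• Replacing logit with probit gives the probit version; the predict line
stays the same.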
Algorithm
1) Sort your data by the propensity score and divide it into
blocks (groups) of observations with similar propensity scores.
2) Within each block, test (using a t-test), whether the means of
the covariates are equal in the treatment and control group.
If so, stop: you’re done with the first stage
3) If a particular block has one or more unbalanced covariates,
divide that block into finer blocks and re-evaluate
4) If a particular covariate is unbalanced for multiple blocks,
modify the initial logit or probit equation by including higher
order terms and/or interactions with that covariate and start
again.
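• A rough Stata sketch of steps 1 and 2, assuming a propensity score variable
pscore_hat already exists and starting from five equal-sized blocks (the
variable names and the choice of five blocks are illustrative assumptions):
* Step 1: sort on the score and cut it into 5 blocks
sort pscore_hat
xtile block = pscore_hat, nq(5)
* Step 2: within each block, t-test the covariate mean by treatment status
* (ttest stops with an error if a block has no treated or no control units)
forvalues b = 1/5 {
    ttest x1 if block == `b', by(treat)
}
• Repeat the loop for each covariate; blocks where the test rejects are the
ones to split in step 3.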
Second Stage
• In the second stage, we look at the effect of treatment on the outcome
(in our example of getting the scholarship on years of schooling), using the
propensity score.
• Once you have determined your propensity score with the procedure
above, there are several ways to use it. I’ll present two of them (canned
version in Stata for both):
• Stratifying on the propensity score
– Divide the data into blocks based on the propensity score (blocks are determined with
the algorithm). Run the second stage regression within each block. Calculate the
weighted mean of the within-block estimates to get the average treatment effect.
• Matching on the propensity score
– Match each treatment observation with one or more control observations, based on
similar propensity scores. You then include a dummy for each matched group, which
controls for everything that is common within that group.
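• The weighted mean used in stratification can be written as a worked formula.
A sketch, with notation that is assumed here rather than taken from the slides:
B blocks, $N_b$ units in block b out of N in total, and $\bar{Y}_{1b}$, $\bar{Y}_{0b}$
the mean outcomes of treated and control units in block b:
$\hat{\tau} = \sum_{b=1}^{B} \frac{N_b}{N} \left( \bar{Y}_{1b} - \bar{Y}_{0b} \right)$
• Weighting by block size $N_b / N$ targets the average treatment effect; weighting
by each block's share of treated units instead targets the effect on the treated.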
Balancing within blocks
1. Sort the data by the propensity score
2. Divide the data into groups called “blocks” that have similar
propensity scores (e.g., 0.001 to 0.10, 0.10 to 0.20, etc.)
3. For each block, test whether the means of the covariates are
equal for treatment and control using a t-test
a. If they are, you are done with the first stage
4. If a particular block has one or more unbalanced covariates (X),
divide that block into finer blocks and re-evaluate
5. If a particular covariate is unbalanced for multiple blocks, modify
the initial logit or probit equation by including higher order terms
and/or interactions with that covariate and start again
Implementation in STATA
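• Multiple user-written commands estimate the propensity score
– psmatch2: ssc install psmatch2, replace
– pscore, attr and atts (used below) belong to the separate
pscore package by Becker and Ichino, not to psmatch2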
• First stage:
pscore treat X1 X2 X3…, pscore(scorename)
• Second stage: attr (for matching) or atts (for
stratifying):
attr outcome treat, pscore(scorename)
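• The slide shows only the matching call. A hedged sketch of the stratification
route, assuming the blockid() and comsup options of the pscore package and
the hypothetical variable names x1, x2, x3, myscore and myblock:
* First stage: store the score and the block identifier from the balancing algorithm
pscore treat x1 x2 x3, pscore(myscore) blockid(myblock) comsup
* Second stage: stratification estimate using those blocks
atts outcome treat, pscore(myscore) blockid(myblock)
• Option names are given as recalled from the package documentation; confirm
them with help pscore and help atts in your installation.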
General Remarks
• The propensity score approach becomes more appropriate
the more we have randomness determining who gets
treatment (closer to randomized experiment).
• The propensity score doesn’t work very well if almost
everyone with a high propensity score gets treatment and
almost everyone with a low score doesn’t:
– we need to be able to compare people with similar propensities who
did and did not get treatment.
• The propensity score approach doesn’t correct for
unobservable variables that affect whether observations
receive treatment
NSW example
• Comparison of propensity score matching with
experimental results
NSW program
• During the mid-1970s, Manpower Demonstration Research
Corporation (MDRC) operated the National Supported Work
Demonstration (NSW)
• NSW was a temporary employment program designed to help
disadvantaged workers lacking basic job skills move into the labor
market by giving them work experience and counseling in a
sheltered environment
• Unlike other federally sponsored employment and training
programs, though, the NSW program assigned qualified applicants
to training positions randomly
– Treatment group: received all the benefits of the NSW program
– Control group: left to fend for themselves
• NSW admitted into the program AFDC women, ex-drug addicts, ex-criminal offenders, and high school dropouts of both sexes
NSW Program
• Treatment group members were:
– guaranteed a job for 9-18 months depending on the target group and site
– divided into crews of 3-5 participants who worked together and met
frequently with an NSW counselor to discuss grievances and performance
– paid for their work
• Wage schedule offered the trainees lower wage rates than they would’ve
received on a regular job, but allowed their earnings to increase for
satisfactory performance and attendance
• After their term expired, they were forced to find regular employment
• The type of work varied within sites – gas station attendant, working at a
printer shop – and males and females were frequently performing
different kinds of work
– This was why the program costs varied across sites and target groups
– The program cost \$9,100 per AFDC participant and approximately \$6,800 for
other target groups’ trainees in 1982 dollars (US)
NSW Program
• MDRC collected earnings and demographic
information from both treatment and control
at baseline and every 9 months thereafter
• Conducted up to 4 post-baseline interviews
LaLonde (1986) study
• LaLonde, Robert J. (1986). “Evaluating the
Econometric Evaluations of Training Programs
with Experimental Data”. American Economic
Review. 76(4): 604-620.
• LaLonde’s ideas:
– Outcome variable: Annual earnings in 1978
– Get unbiased estimate of the job training program’s
effects using randomized control group
– Compare that with what you get by selecting a control
group from the entire population that looks like the
treatment group using various causal inference
methods
Need for a control group
• The fundamental problem of causal inference
is that causality is defined as the difference
between two potential outcomes, but
for each individual we only observe one of
them.
• We are missing data on each trainee’s
counterfactual – what they would’ve earned
had they not been in the NSW experiment
Choice of a control group
• Best option: Randomize so that independence is satisfied
– Control group and treatment group are different only by random
chance
– Eliminates bias due to baseline differences between the two groups
and the heterogeneous treatment effects bias
• Oftentimes these kinds of randomized controls aren’t available so
labor economists would instead sample from various datasets to
create (non-experimental) control groups
• So LaLonde sampled a non-experimental control group from two
surveys: the Current Population Survey (CPS) and the Panel Study of
Income Dynamics (PSID)
– Sampled the entire working population
– Sampled those not working in 1976
– Sampled those not working in 1975 or 1976
Similarity of treatment and control
groups
• Treatment and control groups need to be similar.
But in what way should they be similar?
• Most importantly, they need to be similar with
regard to pre-treatment income, since income is
what we’ll be examining post-treatment.
• So what did LaLonde find?
– First column is treatment group earnings in 1978
– Second column is randomized control group
– The remaining columns are the non-random control groups
Lessons
• What were the take-aways?
– Fairly pessimistic findings – observational data and causal inference
methods available at that time performed poorly when trying to
reproduce the known ATE from the randomization
• What did he do?
– Linear regression, fixed effects, latent variable selection modeling
– His estimated treatment effect for women tended to overestimate the
impact of the program – “positive self-selection”
– But it tended to underestimate the impact of the program for men –
“negative self-selection”
• Why should you care?
– Even though the control group might seem like a good guess for the
Dehejia and Wahba (1999; 2002)
Dehejia, Rajeev H. and Sadek Wahba (1999). “Causal Effects
in Nonexperimental Studies: Reevaluating the
Evaluation of Training Programs”. Journal of the
American Statistical Association, vol. 94(448): 1053-1062
Dehejia, Rajeev H. and Sadek Wahba (2002). “Propensity
Score-Matching Methods for Nonexperimental Causal
Studies”. The Review of Economics and Statistics.
February, 84(1): 151-161.
These two studies introduce propensity score matching
methods to economists and perform a kind of replication of
LaLonde’s study
Dehejia and Wahba (1999)
• DW (1999) re-analyze the data using propensity
score matching and stratification
• These were new at the time to economists,
although the method was first established in
Rosenbaum and Rubin (1983)
• Identifying assumptions:
– $(Y_0, Y_1) \perp D \mid p(X)$ – p(X) is the “propensity score”
– $0 < \Pr(D=1 \mid X) < 1$ – “Common support”
– Stable unit treatment value assumption (SUTVA)
• The response of subject i to the treatment D doesn’t depend
on the treatment given to anyone else except i
Assumptions
• e(X) = Pr(D=1|X), which is the conditional probability of
treatment.
– Also called the “propensity score”
– This is a scalar summary of all observed covariates, X
• Key result is that the propensity score is a balancing
score
– $X \perp D \mid e(X)$
– $\Pr[D=1 \mid X, e(X)] = \Pr[D=1 \mid e(X)]$
• ATE at e(X) is the average difference between the
observed responses in each treatment group at e(X)
– $E[Y_1 - Y_0 \mid e(X)] = E[Y \mid e(X), D=1] - E[Y \mid e(X), D=0]$
Interpretation
• The overall estimated ATE from this method is
the treatment effect at each value of e(X), averaged
over the distribution of e(X)
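• Written out, using the conditional means from the Assumptions slide (the
outer expectation over e(X) is the averaging step; this is a sketch of the
idea, not a formula quoted from DW):
$ATE = E_{e(X)}\left[ E[Y \mid e(X), D=1] - E[Y \mid e(X), D=0] \right]$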
Analytical use of propensity score
• Matching – subsets consisting of both
treatment and control subjects with the same
propensity score are matched
• Stratification – Data are divided into several
“strata” (or “blocks”) based on the propensity
score, then regular analysis is carried out
within each stratum
Implementation
• Include as many observed pretreatment variables
(“covariates”) as possible
– The statistical significance of individual terms isn’t
important
• Functional form of covariates
– Consider higher order polynomials as well as
interaction terms. Why?
– BALANCE BETWEEN TREATMENT AND CONTROL
• Selection of the model
– Probit or logit
Matching algorithm
• Nearest neighbor algorithm
– Iteratively find the pair of subjects with the
shortest “distance”
– Easy to understand and implement; offers good
results in practice; fast running time; rarely offers
the best matching results compared to some
optimal matching procedure
Implementation
• Choices of distance
– Exact match not possible because propensity score is
a continuous variable and the probability of having
the same value of a continuous score is zero
– Use one distance measure to summarize the
information
• Mahalanobis distance
• Propensity score
• Mahalanobis distance with propensity score caliper
• Any distance with the requirement of exact match on a
specific variable
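• For reference, the usual textbook form of the Mahalanobis distance between
units i and j, with S the sample covariance matrix of the covariates
(notation assumed here, not from the slides):
$d(i,j) = \sqrt{(X_i - X_j)' S^{-1} (X_i - X_j)}$
• With a propensity score caliper c, a pair is only eligible for matching if
$|e(X_i) - e(X_j)| \le c$.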
Software
• R functions by Ben Hansen
– http://www.stat.lsa.umich.edu/~bbh#
• STATA functions
– STATA 13 has new “treatment effects” methods
built into it which includes nearest neighbor
matching as well as propensity score matching
methods
– Pre-STATA 13: psmatch2; pscore; nnmatch (user-written commands)
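• A minimal sketch of the built-in route in Stata 13 or later, with
hypothetical variable names (outcome, treat, x1, x2, x3):
* Propensity score matching; the treatment model defaults to logit
* atet requests the effect on the treated (omit it for the ATE)
teffects psmatch (outcome) (treat x1 x2 x3), atet nneighbor(1)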
Procedures for PSM
• Identify the propensity score model (e.g., logit or probit;
covariates)
• Estimate the propensity score with all the data
• Compute the distance between any two subjects
• Create matched pairs/groups using a specific matching
algorithm
• Check covariate balance between the treatment and
control group among matched subjects; if not good
enough, go back to improve the propensity score model
• Contrast between treated and control subjects within each
pair/group
• Obtain the ATE by averaging over all pairs/groups
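• These steps map roughly onto psmatch2 as follows. A sketch only: the
variable names are hypothetical and the caliper of 0.05 is an arbitrary
illustrative value:
* Model, score, distance and matching in one call:
* logit score, 1 nearest neighbor within a caliper, common support imposed
psmatch2 treat x1 x2 x3, outcome(outcome) logit neighbor(1) caliper(0.05) common
* Covariate balance between treated and matched control units
pstest x1 x2 x3
• psmatch2 reports the effect on the treated by default; adding the ate
option also reports the ATE.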
Why are we doing this?
• Remember the goal of DW:
– The goal is to investigate the credibility of the
conventional analytical results from nonexperimental data
– So the authors compared the results from the
experimental data to the results from the nonexperimental data by combining the treatment
group with a comparable control dataset
Checking the balance after matching
Comparison of the analytical results
Observations
• The results after propensity score
matching/stratification were much closer to the
truth (if we assume the randomized experiment is
the correct benchmark)
• The variances seem to be larger due to the loss of
the data
• The results aren’t very sensitive to the functional
form of the chosen covariates in the propensity
score model; however they are sensitive to the
selection of covariates included in the propensity
score model