Multilevel multiple imputation using the latent normal

Latent normal models for missing data
Harvey Goldstein
Centre for Multilevel Modelling
University of Bristol
The (multilevel) binary probit model
Suppose that we have a variance-components 2-level model for an underlying continuous variable, written as

$$y_{ij} = (X\beta)_{ij} + u_j + e_{ij}, \qquad u_j \sim N(0, \sigma_u^2), \qquad e_{ij} \sim N(0, 1)$$

and suppose a positive value is observed for a variable $Z$ when $y > 0$. We then have

$$\Pr(z_{ij} = 1) = \Pr(y_{ij} > 0) = \Pr\big(e_{ij} > -(X\beta)_{ij} - u_j\big) = \int_{-(X\beta)_{ij} - u_j}^{\infty} \phi(t)\,dt = \int_{-\infty}^{(X\beta)_{ij} + u_j} \phi(t)\,dt,$$

which is in fact just the probit link function.
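As a quick numerical illustration of the probit link, the probability that the binary response equals 1 is the standard normal CDF evaluated at the linear predictor plus the level-2 random effect. The sketch below uses only the standard library; the numeric values of the predictor and random effect are made up for illustration.

```python
import math

def Phi(x):
    # Standard normal CDF, computed via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Probit link: Pr(z_ij = 1) = Phi((X*beta)_ij + u_j).
# The values below are illustrative, not taken from the slides.
xb = 0.4    # linear predictor (X*beta)_ij for one record
u_j = -0.1  # level-2 random effect for cluster j

p = Phi(xb + u_j)  # probability that z_ij = 1
```

Because the level-1 residual variance is fixed at 1, no scale parameter appears in the link.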
Ordered and unordered latent normal models
• Ordered categories, $1, \ldots, p$: define

$$\Pr(z_{ij} = k) = \int_{\alpha_{k-1} - (X\beta)_{ij} - u_j}^{\alpha_k - (X\beta)_{ij} - u_j} \phi(t)\,dt, \qquad \alpha_0 = -\infty, \; \alpha_1 = 0,$$

where we additionally need to estimate the thresholds $\{\alpha_k\}$.
• For unordered categories we can map onto an underlying $(p-1)$-variate normal distribution with identity covariance matrix.
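The ordered-category probabilities above are differences of normal CDF values at successive thresholds, so they are straightforward to compute. A minimal sketch, with an illustrative second threshold (only $\alpha_1 = 0$ is fixed by the model; the other values are placeholders):

```python
import math

def Phi(x):
    # Standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ordered_probs(eta, alphas):
    """Category probabilities for an ordered probit model.
    eta    -- linear predictor (X*beta)_ij + u_j
    alphas -- interior thresholds [alpha_1, ..., alpha_{p-1}];
              alpha_0 = -inf and alpha_p = +inf are implicit.
    Pr(z = k) = Phi(alpha_k - eta) - Phi(alpha_{k-1} - eta)."""
    cuts = [-math.inf] + list(alphas) + [math.inf]
    return [Phi(cuts[k] - eta) - Phi(cuts[k - 1] - eta)
            for k in range(1, len(cuts))]

# Illustrative: p = 3 categories, alpha_1 = 0 (as in the model), alpha_2 = 1.0 (made up)
probs = ordered_probs(eta=0.2, alphas=[0.0, 1.0])
```

The probabilities necessarily sum to one, since the intervals partition the real line.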
Multivariate - mixed type - responses
• Consider a bivariate (multilevel) response with a normal and a binary response.
• We can map onto a latent bivariate normal.
• We can use ML or MCMC for parameter estimation.
• MCMC provides a chain of random draws from the latent normal distribution for the binary response. Each draw is conditioned on the observed (correlated) normal response.
• Where a response is (randomly) missing we draw an imputed value from its (estimated) conditional distribution, which is easy for an MVN, and this is done at every MCMC iteration. For the binary response this can be mapped back onto (0, 1).
• We typically use uninformative priors.
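The draw from the conditional distribution is simple because, for a bivariate normal, the latent value given the observed response is itself normal. A minimal sketch of one such imputation draw, with all parameter values illustrative placeholders rather than estimates from the slides:

```python
import math
import random

# Impute a missing binary response from its conditional latent-normal
# distribution, given the observed correlated normal response.
mu1, mu2 = 0.0, 0.0  # latent means (mu1 stands in for (X*beta)_ij + u_j)
sigma2 = 1.5         # SD of the observed normal response
rho = 0.6            # latent correlation (latent binary variance fixed at 1)

y2_obs = 2.0         # observed normal response for this record

# Bivariate normal: y1 | y2 ~ N(mu1 + rho*(1/sigma2)*(y2 - mu2), 1 - rho^2)
cond_mean = mu1 + rho * (1.0 / sigma2) * (y2_obs - mu2)
cond_sd = math.sqrt(1.0 - rho ** 2)

random.seed(1)
y1_draw = random.gauss(cond_mean, cond_sd)  # latent imputation draw
z_imputed = 1 if y1_draw > 0 else 0         # map back onto (0, 1)
```

In the actual MCMC sampler the conditioning parameters are themselves updated at each iteration; here they are held fixed to keep the sketch short.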
Multiple imputations
• Every n-th iteration (say n ≈ 500, to ensure negligible autocorrelation) we can choose a 'completed' bivariate dataset. This yields p imputed datasets to combine using Rubin's rules.
• We can extend to include ordered, unordered, Poisson etc. responses, all mapped onto a latent MVN, with missing data mapped back to the original scales.
• We can also include responses defined at higher levels or classifications, correlated with higher-level random effects for lower-level responses.
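Combining the p imputed-data analyses with Rubin's rules means averaging the estimates and adding the between-imputation variance, inflated by (1 + 1/p), to the average within-imputation variance. A minimal sketch with made-up numbers:

```python
import math

def rubin_combine(estimates, variances):
    """Combine p imputed-data estimates of one parameter via Rubin's rules.
    estimates -- the p point estimates
    variances -- the p squared standard errors
    Returns (pooled estimate, pooled standard error)."""
    p = len(estimates)
    qbar = sum(estimates) / p                                # pooled estimate
    w = sum(variances) / p                                   # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (p - 1)    # between-imputation variance
    t = w + (1.0 + 1.0 / p) * b                              # total variance
    return qbar, math.sqrt(t)

# Illustrative: five imputed-data estimates of one coefficient and their
# squared standard errors (numbers are made up, not from Table 1)
est, se = rubin_combine([0.39, 0.41, 0.40, 0.38, 0.42], [0.02] * 5)
```

The pooled standard error always exceeds the average within-imputation one whenever the estimates disagree across imputations, which is how the uncertainty due to missingness is carried forward.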
MI for multilevel GLMs
• Every variable is treated as a response, possibly with fully observed variables as covariates in the MVN model. For multilevel models this may include variables measured at higher levels.
• Imputation is carried out for the MVN model and mapped back to the original scales; the model of interest (MOI) is fitted to the multiple datasets and the results combined.
• Note that for non-normal continuous variables we may be able to use e.g. a Box-Cox transform within the same model framework.
• Note that for general discrete distributions we may be able to approximate by a set of ordered categories.
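For the Box-Cox step, a skewed continuous variable is transformed towards normality before entering the latent-MVN imputation model, and imputed values are mapped back with the inverse transform. A minimal sketch of the transform pair, with an illustrative lambda (in practice lambda would be estimated):

```python
import math

def boxcox(y, lam):
    """Box-Cox transform of a positive value y:
    (y**lam - 1)/lam for lam != 0, log(y) for lam == 0."""
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam

def boxcox_inverse(t, lam):
    # Map an imputed value back to the original scale
    if lam == 0:
        return math.exp(t)
    return (lam * t + 1.0) ** (1.0 / lam)

# Round-trip check with an illustrative lambda = 0.5
lam = 0.5
y = 4.0
t = boxcox(y, lam)              # (sqrt(4) - 1) / 0.5 = 2.0
back = boxcox_inverse(t, lam)   # recovers 4.0
```

The same pattern (transform, impute on the latent scale, back-transform) applies to the ordered-category approximation of general discrete distributions.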
A simulation: ~30% of records with randomly missing data
The response is exam score at age 16
Table 1. Simulation study model (PART). Parameter estimates with standard errors in brackets.
One hundred simulated data sets. MCMC estimation used a burn-in of 2000, with five imputed
data sets taken at iterations 1, 500, 1000, 1500 and 2000. Estimates are computed using
restricted maximum likelihood.
| Parameter | Complete data set | Imputation | Relative bias (%) | Imputation standard error*** |
|---|---|---|---|---|
| Intercept | 0.260 (0.056) | 0.263 | 1.2 | 0.0021 |
| Reading test score | 0.391 (0.017) | 0.391 | 0.0 | 0.0007 |
| Verbal reasoning band 2* | -0.417 (0.032) | -0.414 | -0.7 | 0.0014 |
| Verbal reasoning band 3* | -0.765 (0.054) | -0.768 | 0.4 | 0.0024 |
| Level 2 variance | 0.079 (0.016) | 0.080 | 1.3 | 0.0005 |
| Level 1 variance | 0.536 (0.012) | 0.536 | 0.0 | 0.0004 |
* Verbal reasoning band has 3 categories: category 1 (the reference category) is the top 25% of
original verbal reasoning scores, category 2 is the middle 50% of verbal reasoning scores and category
3 is the bottom 25% of verbal reasoning scores. ***The imputation standard error is the standard error
for each parameter over the 100 simulations.
See Goldstein, H., Carpenter, J., Kenward, M. G. and Levin, K. (2009). Statistical Modelling, 9, 173-197.
