IAOS 2014 Conference – Meeting the Demands of a Changing World
Da Nang, Vietnam, 8-10 October 2014
Diagnosing the Imputation of Missing Values in Official Economic Statistics via Multiple Imputation: Unveiling the Invisible Missing Values
National Statistics Center (Japan)
Masayoshi Takahashi
Note: The views and opinions expressed in this presentation are the author's own and not necessarily those of the institution.
Outline
1. Problems of Missing Values and Imputation
2. Theory of MI and the EMB Algorithm
3. Mechanism Behind the Diagnostic Algorithm
4. Data and Missing Mechanism
5. Assessment of the Diagnostic Algorithm
6. Conclusions and Future Work
1. Problems of Missing Values and Imputation
Problems of Missing Values
• Prevalence of missing values
• Effects of missing values
  - Reduction in efficiency
  - Introduction of bias
• Assumptions and solution
  - Missing At Random (MAR)
  - Imputation

Problematic Nature of Single Imputation (SI)
Deterministic SI
$\hat{Y}_{ij} = Y_{i,-j}\hat{\beta}$
Stochastic SI
$\hat{Y}_{ij} = Y_{i,-j}\hat{\beta} + \hat{\varepsilon}_i$
• ^ = OLS estimate
• There is only one set of regression coefficients.
• $\hat{\varepsilon}_i$ = random noise
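The two single-imputation formulas above can be illustrated with a minimal sketch, assuming one explanatory variable, toy data, and Gaussian noise whose standard deviation is the OLS residual standard deviation. The function names `ols_fit` and `single_impute`, the toy coefficients, and the choice to blank out every fifth value are hypothetical and only serve the illustration.

```python
import random
import statistics


def ols_fit(x, y):
    """Return (intercept, slope, residual_sd) of a simple OLS fit of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    resid_sd = (sum(r * r for r in resid) / (len(x) - 2)) ** 0.5
    return intercept, slope, resid_sd


def single_impute(x, y, stochastic, rng):
    """Fill None entries of y using ONE set of OLS coefficients."""
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    a, b, s = ols_fit([p[0] for p in obs], [p[1] for p in obs])
    out = []
    for xi, yi in zip(x, y):
        if yi is None:
            yi = a + b * xi              # deterministic SI: fitted value
            if stochastic:
                yi += rng.gauss(0.0, s)  # stochastic SI: add random noise
        out.append(yi)
    return out


# Toy data (hypothetical): y = 1 + 2x + noise, every fifth y set to missing.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y_full = [1.0 + 2.0 * xi + rng.gauss(0, 0.5) for xi in x]
y = [None if i % 5 == 0 else yi for i, yi in enumerate(y_full)]

det = single_impute(x, y, stochastic=False, rng=rng)
sto = single_impute(x, y, stochastic=True, rng=rng)
```

Both versions rely on the same single coefficient estimate; they differ only in whether noise is added to the fitted value.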
2. Theory of Multiple Imputation and the EMB Algorithm
Multiple Imputation (MI) Comes to the Rescue
$\tilde{Y}_{ij} = Y_{i,-j}\tilde{\beta} + \tilde{\varepsilon}_i$
• Multiple sets of regression coefficients
• ~ = random sampling from a posterior distribution
Need multiple values of
&
Likelihood of Observed Data
$L(\mu, \Sigma \mid Y_{obs}) \propto \prod_{i=1}^{n} N(Y_{i,obs} \mid \mu_{i,obs}, \Sigma_{i,obs})$
• Random sampling from the observed-data likelihood
  - Not easy!!
• Solution
  - Various computational algorithms
Computational Algorithms
• EMB algorithm
  - Expectation-Maximization
  - Bootstrapping
  - Most computationally efficient
• Other MI algorithms
  - MCMC (Markov chain Monte Carlo)
  - FCS (fully conditional specification)

Graphical Presentation of the EMB Algorithm
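As a rough companion to the EMB idea (bootstrap the incomplete data, run EM on each bootstrap sample, then impute the original missing cells from each set of estimates), here is a minimal sketch for the bivariate normal case with one fully observed variable. It is not the Amelia II implementation; the names `em_bivariate` and `emb_impute`, the toy data, and the fixed number of EM iterations are assumptions made for illustration.

```python
import random


def em_bivariate(x, y, iters=100):
    """EM estimates (mx, my, sxx, sxy, syy) for bivariate normal data.

    x is fully observed; y may contain None for missing values.
    """
    n = len(x)
    obs_y = [v for v in y if v is not None]
    mx = sum(x) / n
    sxx = sum((v - mx) ** 2 for v in x) / n
    my = sum(obs_y) / len(obs_y)
    syy = sum((v - my) ** 2 for v in obs_y) / len(obs_y)
    sxy = 0.0
    for _ in range(iters):
        beta = sxy / sxx
        resid_var = syy - beta * sxy
        # E-step: expected y and y^2 for the rows where y is missing.
        ey = [v if v is not None else my + beta * (xi - mx)
              for xi, v in zip(x, y)]
        ey2 = [e * e + (0.0 if v is not None else resid_var)
               for e, v in zip(ey, y)]
        # M-step: update the moments from the expected sufficient statistics.
        my = sum(ey) / n
        syy = sum(ey2) / n - my * my
        sxy = sum(xi * e for xi, e in zip(x, ey)) / n - mx * my
    return mx, my, sxx, sxy, syy


def emb_impute(x, y, m=5, seed=0):
    """Return m dicts {row index: imputed y}, one per bootstrap replicate."""
    rng = random.Random(seed)
    n = len(x)
    missing = [i for i, v in enumerate(y) if v is None]
    draws = []
    for _ in range(m):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap resample
        mx, my, sxx, sxy, syy = em_bivariate([x[i] for i in idx],
                                             [y[i] for i in idx])
        beta = sxy / sxx
        sd = max(syy - beta * sxy, 0.0) ** 0.5
        # Impute the ORIGINAL missing cells from this replicate's estimates.
        draws.append({i: my + beta * (x[i] - mx) + rng.gauss(0.0, sd)
                      for i in missing})
    return draws


# Toy data (hypothetical): y = 1 + 2x + noise, every fifth y missing.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [None if i % 5 == 0 else 1.0 + 2.0 * xi + rng.gauss(0, 0.5)
     for i, xi in enumerate(x)]
missing = [i for i, v in enumerate(y) if v is None]
draws = emb_impute(x, y, m=5)
```

Each replicate yields a different set of parameter estimates, which is what produces multiple sets of regression coefficients without sampling directly from the posterior.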
3. Mechanism Behind the Diagnostic Algorithm
Paradox in Imputation
• Imputed values
  - Estimates, not true values
• Diagnosis
  - True values
  - Always missing
  - Cannot compare the imputed values with the truth
• How do we go about imputation diagnostics?

Solution to the Paradox
• Indirect diagnostics of imputation
  - Abayomi, Gelman, and Levy (2008)
  - Honaker and King (2010)
• MI
  - Within-imputation variance
  - Between-imputation variance
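For reference, the within- and between-imputation variances are usually defined as in Rubin (1987). The notation below is an assumption, not taken from the slides: $\hat{Q}_m$ is the estimate from the m-th imputed dataset, $U_m$ is its estimated variance, and $M$ is the number of imputed datasets.

```latex
\bar{Q} = \frac{1}{M}\sum_{m=1}^{M}\hat{Q}_m, \qquad
W = \frac{1}{M}\sum_{m=1}^{M} U_m, \qquad
B = \frac{1}{M-1}\sum_{m=1}^{M}\left(\hat{Q}_m - \bar{Q}\right)^2, \qquad
T = W + \left(1 + \frac{1}{M}\right) B
```

Here $W$ is the within-imputation variance, $B$ is the between-imputation variance, and $T$ is the total variance of the combined estimate $\bar{Q}$.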

Disadvantages of Multiple Imputation
• Dozens of imputed datasets
  - Computational burden
• Multiple values for one cell
  - Unrealistic to use directly in official statistics

Proposal in this Research

• Two-step procedure
  - Imputation step: Stochastic SI
  - Diagnostic step: MI (New!!)
• Advantages
  - Only one imputed value per cell (advantage of SI)
  - Known confidence in each imputed value (advantage of MI)
Multiple Imputation as a Diagnostic Tool
• Variation among M imputed datasets
  - Estimation uncertainty in imputation
• Our diagnostic algorithm
  - Utilizes this variability
  - Can examine the stability of, and confidence in, imputation models
• What does this mean?
  - See the next slide for illustration

Illustration: Two Cases of Variation in Imputations
Mathematical Representation
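Imputation Step: Stochastic SI
$\hat{Y}_{ij} = Y_{i,-j}\hat{\beta} + \hat{\varepsilon}_i$
Diagnostic Step: MI
$\tilde{Y}_{ij} = Y_{i,-j}\tilde{\beta} + \tilde{\varepsilon}_i$
• If $\tilde{\beta} = \hat{\beta}$, then no uncertainties
• What we actually check is whether $sd(\tilde{Y}_{ij}) \approx 0$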
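The check on the standard deviation of the imputed values can be sketched as follows, assuming the M imputed values for each missing cell are already available. The function name `imputation_sd` and the numbers are made up for illustration; they are not results from the presentation.

```python
import statistics


def imputation_sd(draws):
    """Standard deviation of the M imputed values for each missing cell.

    draws: list of M dicts mapping a cell index to its imputed value.
    """
    cells = draws[0].keys()
    return {c: statistics.stdev([d[c] for d in draws]) for c in cells}


# Hypothetical example with M = 3 imputed datasets and two missing cells.
draws = [
    {3: 10.0, 7: 52.0},
    {3: 10.2, 7: 31.0},
    {3: 9.9, 7: 78.0},
]
sds = imputation_sd(draws)
```

In this toy input, cell 3 has a small spread across the imputed datasets while cell 7 has a large one, so cell 7 would be the imputation to examine more closely.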
4. Data and Missing Mechanism
Data
• Multivariate log-normal distribution
  - Mean vector & variance-covariance matrix
• Simulated dataset
  - Manufacturing Sector
  - 2012 Japanese Economic Census
• Number of observations
  - 1,000
• Variables
  - turnover, capital, worker

Missing Mechanism
• Target variable
  - turnover
• Missing rate
  - 20%
• Missing mechanism
  - MAR
  - A logistic regression to estimate the probability of missingness according to the values of explanatory variables (capital and worker)
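A setup of this kind could be simulated along the following lines. This is a minimal sketch, not the author's code: the mean vector, the covariance matrix, and the logistic slopes are hypothetical placeholders (the presentation does not report the values derived from the Economic Census), and the intercept is found by bisection so that the expected missing rate is 20%.

```python
import math
import random


def cholesky(a):
    """Lower-triangular L with L L^T = a (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def simulate_lognormal(n, mean, cov, rng):
    """Draw n rows whose logs are multivariate normal with (mean, cov)."""
    low = cholesky(cov)
    rows = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in mean]
        logs = [mean[i] + sum(low[i][k] * z[k] for k in range(i + 1))
                for i in range(len(mean))]
        rows.append([math.exp(v) for v in logs])
    return rows


def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))


def add_mar_missingness(rows, rate, b_capital, b_worker, rng):
    """Blank out turnover (column 0) with a probability that depends only on
    the observed columns capital (1) and worker (2), i.e. MAR."""
    scores = [b_capital * math.log(r[1]) + b_worker * math.log(r[2])
              for r in rows]
    lo, hi = -50.0, 50.0
    for _ in range(60):  # bisection on the intercept to hit the target rate
        mid = (lo + hi) / 2.0
        if sum(logistic(mid + s) for s in scores) / len(scores) < rate:
            lo = mid
        else:
            hi = mid
    b0 = (lo + hi) / 2.0
    return [[None if rng.random() < logistic(b0 + s) else r[0], r[1], r[2]]
            for r, s in zip(rows, scores)]


# Hypothetical log-scale parameters for (turnover, capital, worker).
mean = [6.0, 5.0, 3.0]
cov = [[1.0, 0.6, 0.5],
       [0.6, 1.0, 0.4],
       [0.5, 0.4, 0.8]]
rng = random.Random(2014)
rows = simulate_lognormal(1000, mean, cov, rng)
data = add_mar_missingness(rows, 0.20, 0.8, 0.8, rng)
missing_rate = sum(r[0] is None for r in data) / len(data)
```

With positive slopes, larger establishments are more likely to have missing turnover; the sign and size of these slopes are illustrative assumptions only.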

5. Assessment of the Diagnostic Algorithm
R Function diagimpute
• New function developed in R
  - Graphical detection of problematic imputations as outliers
  - Graphical presentation of the stability of imputation via control chart
• Not yet publicly available
  - A work in progress
  - Once finalized, planning to make it publicly available
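Since diagimpute is not public, its internals are unknown. Purely as an illustration of the control-chart idea, the sketch below applies a generic individuals chart (center line plus three sigma, with sigma estimated from the average moving range divided by 1.128) to per-cell standard deviations of imputed values. The function name `individuals_chart_flags` and the input numbers are hypothetical, and this should not be read as the diagimpute algorithm.

```python
def individuals_chart_flags(values):
    """Flag points above the upper control limit of an individuals chart.

    Sigma is estimated as (average moving range) / 1.128, the usual
    constant for moving ranges of size two.
    """
    center = sum(values) / len(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    sigma = (sum(moving_ranges) / len(moving_ranges)) / 1.128
    ucl = center + 3.0 * sigma
    flagged = [i for i, v in enumerate(values) if v > ucl]
    return center, ucl, flagged


# Hypothetical per-cell standard deviations across M imputed datasets.
sds = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 1.0,
       10.0,
       1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9]
center, ucl, flagged = individuals_chart_flags(sds)
```

Only the upper limit is checked here, because an unusually large spread across imputations is the sign of an unstable imputation.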

Preliminary Result 1
Preliminary Result 2
6. Conclusions and Future Work
Conclusions
• MI as a diagnostic tool
  - A novel way
• Diagnostic algorithm
  - Still a work in progress
  - A preliminary assessment given
  - Useful to detect problematic imputations
  - Helps us strengthen the validity of official economic statistics

Future Work
• Intend to further refine the algorithm
  - Test it against a variety of real datasets
  - Use several imputation models

References
1. Abayomi, Kobi, Andrew Gelman, and Marc Levy. (2008). “Diagnostics for Multivariate Imputations,” Applied Statistics vol.57, no.3, pp.273-291.
2. Allison, Paul D. (2002). Missing Data. CA: Sage Publications.
3. Congdon, Peter. (2006). Bayesian Statistical Modelling, Second Edition. West Sussex: John Wiley & Sons Ltd.
4. de Waal, Ton, Jeroen Pannekoek, and Sander Scholtus. (2011). Handbook of Statistical Data Editing and Imputation. Hoboken, NJ: John Wiley & Sons.
5. Honaker, James and Gary King. (2010). “What to do About Missing Values in Time Series Cross-Section Data,” American Journal of Political Science vol.54, no.2, pp.561-581.
6. Honaker, James, Gary King, and Matthew Blackwell. (2011). “Amelia II: A Program for Missing Data,” Journal of Statistical Software vol.45, no.7.
7. King, Gary, James Honaker, Anne Joseph, and Kenneth Scheve. (2001). “Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation,” American Political Science Review vol.95, no.1, pp.49-69.
8. Little, Roderick J. A. and Donald B. Rubin. (2002). Statistical Analysis with Missing Data, Second Edition. New Jersey: John Wiley & Sons.
9. Oakland, John S. and Roy F. Followell. (1990). Statistical Process Control: A Practical Guide. Oxford: Heinemann Newnes.
10. Rubin, Donald B. (1978). “Multiple Imputations in Sample Surveys — A Phenomenological Bayesian Approach to Nonresponse,” Proceedings of the Survey Research Methods Section, American Statistical Association, pp.20-34.
11. Rubin, Donald B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.
12. Schafer, Joseph L. (1997). Analysis of Incomplete Multivariate Data. London: Chapman & Hall/CRC.
13. Scrucca, Luca. (2014). “Package qcc: Quality Control Charts,” http://cran.r-project.org/web/packages/qcc/qcc.pdf.
14. Statistics Bureau of Japan. (2012). “Economic Census for Business Activity,” http://www.stat.go.jp/english/data/e-census/2012/index.htm.
15. Takahashi, Masayoshi and Takayuki Ito. (2012). “Multiple Imputation of Turnover in EDINET Data: Toward the Improvement of Imputation for the Economic Census,” Work Session on Statistical Data Editing, UNECE, Oslo, Norway, September 24-26, 2012.
16. Takahashi, Masayoshi and Takayuki Ito. (2013). “Multiple Imputation of Missing Values in Economic Surveys: Comparison of Competing Algorithms,” Proceedings of the 59th World Statistics Congress of the International Statistical Institute, Hong Kong, China, August 25-30, 2013, pp.3240-3245.
17. Takahashi, Masayoshi. (2014a). “An Assessment of Automatic Editing via the Contamination Model and Multiple Imputation,” Work Session on Statistical Data Editing, United Nations Economic Commission for Europe, Paris, France, April 28-30, 2014.
18. Takahashi, Masayoshi. (2014b). “Keiryouchi Data no Kanrizu (Control Chart for Continuous Data),” Excel de Hajimeru Keizai Toukei Data no Bunseki (Statistical Data Analysis for Economists Using Excel), 3rd edition. Tokyo: Zaidan Houjin Nihon Toukei Kyoukai.
19. van Buuren, Stef. (2012). Flexible Imputation of Missing Data. London: Chapman & Hall/CRC.
Thank you