Screening Data Prior to Primary Analysis

Screening the Data
Tedious but essential!
Missing Data
• Missing Not at Random (MNAR)
• Missing at Random (MAR)
• Missing Completely at Random (MCAR)
Missing Not at Random (MNAR)
• Some cases are missing data on Y
• Missingness is related to the value of Y
• Faculty salaries – those with high salaries
may be reluctant to reveal them
• Estimates of mean Y will be biased if we use
just the available data
Missing at Random (MAR)
• Missingness on Y not related to value of Y
• Or is related but through other variables
on which we have data.
• Faculty salary related to rank.
• Higher rank = higher salary
• If missingness is random within each rank,
within-rank estimates will be unbiased.
• Overall mean = weighted sum of within-rank estimates
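The weighted sum in the last bullet can be written out as below. The symbols are introduced here for illustration only: K is the number of ranks, N_k is the number of faculty at rank k (whether or not they reported a salary), and the bar-Y term is the mean of the salaries actually observed within rank k.

```latex
\hat{\mu}_Y \;=\; \sum_{k=1}^{K} \frac{N_k}{N}\,\bar{Y}_k^{\text{obs}},
\qquad N \;=\; \sum_{k=1}^{K} N_k
```

The weights use all faculty at each rank, not just the respondents, which is what keeps ranks with many non-respondents from being under-represented.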
Missing Completely at Random (MCAR)
• There is no variable, observed or not, that
is related to missingness of Y.
• Ideal, not likely ever absolutely true.
Finding Patterns of Missingness
• There is specialized software. You do not
have it.
• Can use SAS.
• Can use SPSS with home license code.
• Create missingness dummy variable
• 0 = not missing, 1 = missing
• Relate missingness to other variables.
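A minimal SAS sketch of the dummy-variable approach above. The data set name survey and the variables Y, X1 (numeric) and G (categorical) are hypothetical placeholders, not names from the course data.

```sas
/* Hypothetical names: data set survey, outcome Y, numeric X1, categorical G */
data screened;
    set survey;
    Y_miss = missing(Y);   /* 1 if Y is missing, 0 if not */
run;

/* Do cases with and without Y differ in mean X1? */
proc ttest data=screened;
    class Y_miss;
    var X1;
run;

/* Is missingness on Y related to group membership? */
proc freq data=screened;
    tables G*Y_miss / chisq;
run;
```

If missingness is related to X1 or G, the data are unlikely to be MCAR.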
Dealing with MCAR Data
• Delete Cases: Will create no bias, but will
lower power and precision.
• Mean Substitution: For each missing
value, substitute the group mean on that
value. No bias for means, but will reduce
standard deviations.
• Regression: For each missing score,
develop a multiple regression to predict
score from other variables. Impute that
predicted score. Regression towards
mean will reduce variability.
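One way to try group-mean substitution in SAS is PROC STDIZE with the REPONLY option, which replaces missing values without standardizing the others. This is a sketch under assumed names (data set survey, grouping variable Group, outcome Y), not the course's own program.

```sas
/* Hypothetical names: data set survey, grouping variable Group, outcome Y */
proc sort data=survey out=sorted;
    by Group;
run;

/* REPONLY: only replace missing values; METHOD=MEAN: use the (within-BY-group) mean */
proc stdize data=sorted out=imputed reponly method=mean;
    by Group;
    var Y;
run;

/* Compare before and after: the mean is unchanged, the standard deviation shrinks */
proc means data=sorted n nmiss mean std;
    var Y;
run;
proc means data=imputed n nmiss mean std;
    var Y;
run;
```

Comparing the two PROC MEANS outputs shows the reduction in standard deviation that the slide warns about.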
Dealing with MAR Data
• Deletion of Variables: If another variable
can serve as a proxy.
• Multiple Imputation – specialized
software, may eliminate bias
– Involves resampling techniques to generate
several sets of predictions of missing scores
– Analyze each set and then average the
results across sets.
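In SAS, the impute, analyze, and pool steps can be sketched with PROC MI and PROC MIANALYZE. The data set survey, the variables Y, X1, X2, the seed, and the choice of five imputations are all assumptions made for illustration.

```sas
/* Hypothetical names: data set survey, outcome Y, predictors X1 and X2 */
/* Step 1: generate several completed data sets */
proc mi data=survey out=mi_out nimpute=5 seed=20240101;
    var Y X1 X2;
run;

/* Step 2: run the same analysis on each completed data set */
proc reg data=mi_out outest=est covout noprint;
    model Y = X1 X2;
    by _Imputation_;
run;
quit;

/* Step 3: combine the estimates across the imputed data sets */
proc mianalyze data=est;
    modeleffects Intercept X1 X2;
run;
```

The pooled standard errors from the last step reflect both within-imputation and between-imputation variability, so they are larger than those from any single imputed data set treated as complete.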
Dealing with MNAR Data
• Sophisticated methods may reduce, but
not eliminate, bias.
• Pairwise Correlation Matrix – use as
input to multivariate procedures. Different
correlations will be based on different
subsets of the data. Can produce very
strange results, not recommended.
Missing Item Data Within Unidimensional Scale
• Assume each item measures the same construct.
• For each subject, compute the mean of
the items which do have data.
• Set to missing the scale scores for
subjects who have answered fewer than a
threshold number of items.
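A short sketch of this scoring rule, using the three mental items and the delmel data set that appear later in these slides. The threshold of two answered items is an assumption chosen for illustration.

```sas
data scored;
    set delmel;
    /* N() counts the nonmissing items; the threshold of 2 is an assumed value */
    if n(of Q62 Q65 Q67) >= 2 then Mental = mean(of Q62 Q65 Q67);
    else Mental = .;   /* too few items answered: scale score is missing */
run;
```

MEAN() ignores missing arguments, so each subject's score is the mean of the items actually answered.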
Identifying Outliers
• Univariate: Box and whiskers plots
• Multivariate: Compute Mahalanobis
Distance or Leverage. Investigate cases
with high values. Use outlier dummy
variable to compare outliers with inliers.
• Regression Diagnostics:
o Leverage: Cases with unusual values on the
predictor variables
o Standardized Residuals: Cases whose
actual Y is far from predicted Y.
o Cook’s D: Cases with values that make them
have great influence on the regression
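The three diagnostics above can be saved with the OUTPUT statement of PROC REG. The data set survey, the variables Y, X1 to X3 and ID, and the cutoffs in the WHERE statement are assumptions for illustration, not recommended values.

```sas
/* Hypothetical names: data set survey, outcome Y, predictors X1-X3, case identifier ID */
proc reg data=survey;
    model Y = X1 X2 X3;
    /* H= leverage, STUDENT= residual divided by its standard error, COOKD= Cook's D */
    output out=diag h=leverage student=std_resid cookd=cooks_d;
run;
quit;

/* List the cases that stand out; the cutoffs are rough rules of thumb (assumption) */
proc print data=diag;
    where abs(std_resid) > 3 or cooks_d > 1;
    var ID leverage std_resid cooks_d;
run;
```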
Dealing with Outliers
• Investigate: May be bad data. May be
able to correct the data, may not. May
represent cases not properly considered
part of the population of interest.
• Out-of-Range Values: Even if not
outliers, these are bad data that need
to be corrected.
• Set to Missing: If all else fails.
• Delete the Case: For example, if
convinced the respondent was not even
reading the questions.
– “I frequently visit planets outside of our solar system.”
– “I make all of my own clothes.”
• Delete the Variable: Last resort when it
has many cases with missing data.
• Transform the Variable: If outliers are
valid but contributing to skewness.
• Change the Score: For example, reduce
very high score to value a small bit higher
than the remaining highest score. See
Howell’s discussion of “Winsorizing.”
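Changing an extreme score can be done in a DATA step, as in the sketch below. Every specific here is hypothetical: the data set survey, the variable Income, the judgment that scores above 250 are the extreme ones, and 180 as the highest remaining score.

```sas
/* Hypothetical: Income scores above 250 are the extreme ones;
   the highest remaining score is assumed to be 180 */
data adjusted;
    set survey;
    Income_adj = Income;                     /* keep the original variable intact */
    if Income > 250 then Income_adj = 181;   /* a small bit above the highest remaining score */
run;
```

Keeping both Income and Income_adj makes it easy to run the analysis both ways and report whether the conclusions change.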
Assumptions of the Analysis
• Check Outliers First: Dealing with
outliers may resolve the problems below.
• Normality: Look at plots and measures of
skewness and kurtosis. Ignore tests of
significance, like Kolmogorov-Smirnov.
May need to use different analysis.
• Homogeneity of Variance: Does the
variance differ considerably across
groups? May need to transform or use
different analysis.
• Homoscedasticity: Carefully inspect the
residuals. May need to transform data or
use a different analysis.
• Homogeneity of Variance/Covariance
Matrices (across groups): Box’s M.
• Sphericity: For univariate-approach
related samples ANOVA. Check with
Mauchly’s Test. Correct the df or use a
multivariate approach instead.
• Homogeneity of Regression: In
ANCOVA, we assume the relationship
between Y and the predictors is constant
across groups. Test the Groups x
Predictor(s) interactions.
• Linear Relationships: Look at plots. If
necessary, transform variables or use
curvilinear techniques.
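The homogeneity-of-regression check described above amounts to adding the Groups x Predictor interaction to the model. A minimal PROC GLM sketch, with hypothetical names (data set survey, outcome Y, grouping variable Group, one predictor X):

```sas
/* Hypothetical names: data set survey, outcome Y, classification variable Group, covariate X */
proc glm data=survey;
    class Group;
    model Y = Group X Group*X;   /* Group*X tests whether the slope differs across groups */
run;
quit;
```

A clearly significant Group*X term suggests the slopes are not homogeneous. If it is negligible, the usual ANCOVA model drops the interaction.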
Multicollinearity
• One predictor is nearly perfectly correlated
with the other predictors.
• Makes the regression coefficients unstable
across random samples from the same
population.
• Complicates the interpretation of
unique effects.
Detecting Multicollinearity
• For each predictor, compute the R² from
regressing it on the other predictors. If very
high (.9 or more), there is a problem.
• SAS will compute tolerance
= (1 – that R²). If very low, there is a
problem.
• If R² = 1, the correlation matrix is singular,
cannot be inverted, the analysis crashes.
– Predictors = Verbal SAT, Math SAT, Total
SAT
Variance Inflation Factor
• VIF = 1/tolerance. If high, there is a
problem.
• How High?
• Some say 10, some say 5, a few say 2.5.
• If R² = .9, tolerance = .1, VIF = 10.
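Tolerance and VIF can both be requested as options on the MODEL statement of PROC REG. The data set and variable names below are hypothetical.

```sas
/* Hypothetical names: data set survey, outcome Y, predictors X1-X3 */
proc reg data=survey;
    model Y = X1 X2 X3 / tol vif;   /* adds tolerance and VIF columns to the parameter estimates */
run;
quit;
```

Each predictor's tolerance is 1 minus the R² from regressing that predictor on the others, so a predictor with R² = .9 shows tolerance .1 and VIF 10.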
Dealing with Multicollinearity
• Drop a Predictor – may resolve the
problem.
• Combine Predictors – into a composite
variable.
• Principal Components Analysis –
conduct the analysis on the resulting
weighted linear combinations of the
variables. Can then transform the results
back to the original variables.
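A sketch of the principal components option, assuming hypothetical names (data set survey, outcome Y, predictors X1 to X3). Retaining two components is an arbitrary choice for illustration, and the step of transforming results back to the original variables is not shown.

```sas
/* Hypothetical names: data set survey, outcome Y, correlated predictors X1-X3 */
proc princomp data=survey out=pcscores;
    var X1 X2 X3;          /* component scores are saved as Prin1, Prin2, ... */
run;

/* Use the uncorrelated component scores as predictors */
proc reg data=pcscores;
    model Y = Prin1 Prin2;
run;
quit;
```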
• Look at the command lines in the SAS
program.
• Always give every case a unique ID
number, so you can locate it later.
• Label variables if their SAS name is not
descriptive.
• input ID 1-3 @5 (Q1-Q138) (1.);
label Q1='Sex' Q3='Age';
• Recode values that represent missing
data.
• On several variables, such as “number of
biological brothers,” response 5 was “do
not know.”
• if Q15 = 5 then Q15 = . ;
if Q16 = 5 then Q16 = . ;
SAS 3 & 4
• Transform variable to reduce positive
skewness.
• age_sr = sqrt(Q3);
age_log = log10(Q3);
age_inv = -1/(Q3);
• Dichotomize variable – transformation of
last resort.
• if q3 = 1 then age_di = 1; else
if q3 > 1 then age_di = 2;
SAS 5 & 6
• Create composite variable
• SIBS = Q15 + Q16;
• Transform to reduce positive skewness
• sibs_sr = sqrt(sibs);
sibs_log = log10(sibs);
sibs_in = -1/sibs;
• Create mental variable and associated
missingness variable.
• MENTAL = Q62 + Q65 + Q67;
MentalMiss = 0;
if Mental = . then MentalMiss = 1;
• Transform to reduce negative skewness
• Mental2 = Mental*Mental;
Mental3 = Mental**3;
Ment_exp = EXP(Mental);
R_Ment = 13 - Mental;
R_Ment_sr = sqrt(R_Ment);
R_Ment_log = log10(R_Ment);
• Dichotomize Mental
• if 0 LE Mental LE 9 then
else if Mental > 9 then
• Be careful – in comparisons, SAS treats a
missing value as lower than any number.
SAS 10
• Check for missing data and out-of-range
values.
• proc means min max n nmiss;
var q1-q10 q50-q70; run;
SAS 11
• Check for skewness & kurtosis
• proc means min max n nmiss
skewness kurtosis;
var Q3 age_sr -- Mental Mental2
-- R_Ment_log; run;
SAS 12
• Check distributions of variables with few
values.
• proc freq;
tables q3 age_di sibs mental
ment_di; run;
SAS 13
• Locate cases with bad data
• data duh; set delmel;
if q9 > 3;
proc print; var q9; id id; run;
• Case 159 has an out-of-range value on item Q9.
SAS 14
• Check correlates of missingness.
• proc corr nosimple data=delmel;
var MentalMiss;
with Q1 Q3 Q5 Q6 sibs; run;
• MentalMiss negatively correlated with sibs.
• Duh, some subjects have missing data on
number of brothers or number of sisters.
• Instead of Mental = Q62+Q65+Q67, use
Mental = Mean(of Q62 Q65 Q67);
SAS 15
• Identify multivariate outliers
• proc reg data=delmel;
model id = Q1 Q3 Q6 mental;
output out=hat H=Leverage; run;
data outliers; set hat;
if leverage > .052;
• proc print; var id Q1 Q3 Q6
mental leverage; run;
proc means mean;
var Q1 Q3 Q6 mental; run;
• As a group, the outliers are older than the
overall sample.
• All three students aged 25 or older are
included among the outliers.
Survey Scoundrels
• These sloths do not even read the
questions, they just answer randomly to
get whatever incentive is available for
completing the survey.
• My daughter’s shock upon discovering
• Monitor how long it takes respondents to
complete the survey.
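If completion times are recorded, unusually fast respondents can be flagged in a DATA step. The data set survey, the variable Duration (seconds to complete the survey), and the 120-second cutoff are all hypothetical.

```sas
/* Hypothetical: Duration = seconds taken to complete the survey; 120 is an arbitrary cutoff */
data timing;
    set survey;
    /* Guard against missing Duration, which SAS would otherwise treat as less than 120 */
    TooFast = (not missing(Duration) and Duration < 120);
run;

proc freq data=timing;
    tables TooFast;
run;
```

Flagged cases can then be inspected alongside their answers to the detection items before deciding whether to delete them.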
Items to Help Detect Scoundrels
• Repeat same item, compare responses
• “I frequently visit with aliens from other
planets.”
• “I make all of my own clothes.”
