
What’s New in the I/O Testing and
Assessment Literature That’s
Important for Practitioners?
Paul R. Sackett
New Developments in the
Assessment of Personality
Topic 1: A faking-resistant approach to
personality measurement
• Tailored Adaptive Personality Assessment System (TAPAS)
• Developed for the Army Research Institute by Drasgow
Consulting Group
• Multidimensional Pairwise Preference format combined
with an applicable Item Response Theory model
• Items are created by pairing statements from different
dimensions that are similar in desirability and trait “location”
• Example item: “Which is more like you?”
• __1a) People come to me when they want fresh ideas.
• __1b) Most people would say that I’m a “good listener”.
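The pairing logic can be sketched as follows; the statements, desirability ratings, and trait locations below are hypothetical illustrations, not actual TAPAS content or parameters:

```python
# Sketch of multidimensional pairwise preference item construction:
# statements from *different* dimensions are paired only when their
# social-desirability ratings and trait "locations" are close.
# All statement data here are hypothetical, not actual TAPAS content.
from itertools import combinations

statements = [
    # (dimension, text, desirability, location) -- hypothetical values
    ("Openness",      "People come to me when they want fresh ideas.", 4.1, 0.6),
    ("Agreeableness", "Most people would say I am a good listener.",   4.0, 0.5),
    ("Order",         "My desk is always tidy.",                       3.2, 0.1),
    ("Dominance",     "I take charge in group settings.",              3.3, 0.2),
]

def admissible_pairs(stmts, max_gap=0.25):
    """Pair statements from different dimensions whose desirability
    and location each differ by at most max_gap."""
    pairs = []
    for a, b in combinations(stmts, 2):
        if (a[0] != b[0]
                and abs(a[2] - b[2]) <= max_gap
                and abs(a[3] - b[3]) <= max_gap):
            pairs.append((a[1], b[1]))
    return pairs

for first, second in admissible_pairs(statements):
    print("Which is more like you?")
    print("  a)", first)
    print("  b)", second)
```

Because both options look equally desirable, choosing one over the other reveals trait standing rather than impression management.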
A faking-resistant approach to personality
measurement (continued)
• Extensive work shows it is faking-resistant
• A non-operational Army field study shows useful prediction
of attrition, disciplinary incidents, completion of basic
training, and adjustment to Army life, among other criteria
• Now in operational use on a trial basis
Drasgow, F., Stark, S., Chernyshenko, O. S., Nye, C. D., & Hulin, C. L. (2012).
Development of the Tailored Adaptive Personality Assessment System (TAPAS) to
support Army selection and classification decisions (Technical Report 1311).
Army Research Institute.
Topic 2: The Value of Contextualized
Personality Items
• A new meta-analysis documents the higher predictive power
obtained by “contextualizing” items (e.g., asking about
behavior at work, rather than behavior in general)
• Mean r with supervisory ratings for work context vs. general:
Conscientiousness: .30 vs .22
Emotional Stability: .17 vs. .12
Extraversion: .25 vs. .08
Agreeableness: .24 vs. .10
Openness: .19 vs. .02
Shaffer, J.A., & Postlethwaite, B. E. (2012). A matter of context: A meta-analytic
investigation of the relative validity of contextualized and noncontextualized
personality measures. Personnel Psychology, 65, 445-494.
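Tabulating the slide’s meta-analytic means makes the contextualization gain explicit:

```python
# The meta-analytic mean validities above, tabulated to show the gain
# from contextualized ("at work") items over general items.
# Values are the ones reported on the slide (Shaffer & Postlethwaite, 2012).
mean_r = {
    # trait: (work-context r, general r) with supervisory ratings
    "Conscientiousness":   (.30, .22),
    "Emotional Stability": (.17, .12),
    "Extraversion":        (.25, .08),
    "Agreeableness":       (.24, .10),
    "Openness":            (.19, .02),
}

for trait, (work, general) in mean_r.items():
    print(f"{trait:20s} gain = {work - general:+.2f}")
```

The gain is largest for the traits that predict poorly in generic form (Extraversion, Openness), which is the meta-analysis’s central point.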
Topic 3: Moving from the Big 5 to Narrower Dimensions
DeYoung, Quilty and Peterson (2007) suggested the following:
– Neuroticism:
Volatility - irritability, anger, and difficulty controlling emotional impulses
Withdrawal - susceptibility to anxiety, worry, depression, and sadness
– Agreeableness:
Compassion - empathetic emotional affiliation
Politeness - consideration and respect for others’ needs and desires
– Conscientiousness:
Industriousness - working hard and avoiding distraction
Orderliness - organization and methodicalness
– Extraversion:
Enthusiasm - positive emotion and sociability
Assertiveness - drive and dominance
– Openness to Experience:
Intellect - ingenuity, quickness, and intellectual engagement
Openness - imagination, fantasy, and artistic and aesthetic interests
DeYoung, C. G., Quilty, L. C., & Peterson, J. B. (2007). Between facets and domains: 10 aspects of the
Big Five. Journal of Personality and Social Psychology, 93, 880-896.
Moving from the Big 5 to Narrower Dimensions
• Dudley et al. (2006) show the value of this perspective
– Four conscientiousness facets: achievement, dependability,
order, and cautiousness
– Validity was driven largely by the achievement and/or
dependability facets, with relatively little contribution from
cautiousness and order
– Achievement receives the dominant weight in predicting task
performance, while dependability receives the dominant weight in
predicting counterproductive work behavior
Dudley, N. M., Orvis, K. A., Lebiecki, J. E., & Cortina, J. M. (2006). A meta-analytic investigation of
conscientiousness in the prediction of job performance: Examining the intercorrelations and the
incremental validity of narrow traits. Journal of Applied Psychology, 91, 40-57.
Topic 4: The Use of Faking Warnings
• Landers et al. (2011) administered a warning after 1/3 of the
items to managerial candidates exhibiting what they called
“blatant extreme responding”.
• Rate of extreme responding was halved after the warning
Landers, R. N., Sackett, P. R., & Tuzinski, K. A. (2011). Retesting after initial
failure, coaching rumors, and warnings against faking in online personality
measures for selection. Journal of Applied Psychology, 96(1), 202.
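Landers et al.’s exact operationalization of “blatant extreme responding” is not reproduced on the slide; a minimal sketch of one plausible flagging rule (the threshold and response scale are assumptions, not the study’s actual rule):

```python
# Hypothetical flagging rule: flag a candidate once the share of items
# answered at the socially desirable extreme of a 1-5 scale exceeds a
# threshold. Threshold and scale are illustrative assumptions.
def flag_extreme_responding(responses, desirable_end=5, threshold=0.90):
    """Return True if the proportion of endpoint responses exceeds threshold."""
    if not responses:
        return False
    extreme = sum(1 for r in responses if r == desirable_end)
    return extreme / len(responses) > threshold

print(flag_extreme_responding([5] * 19 + [4]))   # 95% at the endpoint
print(flag_extreme_responding([5, 4, 3, 5, 4]))  # 40% at the endpoint
```

A rule of this shape can be evaluated mid-test, which is what makes a within-administration warning possible.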
More on the Use of Faking Warnings
• Nathan Kuncel suggests three potentially relevant
goals when individuals take a personality test:
– be impressive
– be credible
– be true to oneself
More on the Use of Faking Warnings
• Jenson and Sackett (2013) suggested that
“priming” concern for being credible could
reduce faking.
• Test-takers who scheduled a follow-up interview
just before taking the personality test obtained
lower scores than those who did not
Jenson, C. E., and Sackett, P. R. (2013). Examining ability to fake and
test-taker goals in personality assessments. SIOP presentation.
New Developments in the
Assessment of Cognitive Ability
A cognitive test with reduced adverse impact
• In 2011, SIOP awarded its M. Scott Myers Award for applied
research to Yusko, Goldstein, Scherbaum, and Hanges for
the development of the Siena Reasoning Test
• This is a nonverbal reasoning test using unfamiliar item
content, such as made-up words (if a GATH is larger than a
SHET…) and figures
• The concept is that adverse impact will be reduced by
eliminating content with which groups have differential familiarity
Validity and subgroup d for Siena Test
• Black-White d commonly in the .3-.4 range
• Sizable number of validity studies, with validities in
the range commonly seen for cognitive tests.
• In one independent study, HumRRO researchers
included the Siena along with another cognitive test;
corrected validity was .45 for the other test (d = 1.0) and .35 for the
Siena (d = .38) (SIOP 2010: Paullin, Putka, and …)
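The subgroup d values quoted throughout this section are standardized mean differences (Cohen’s d). As a reminder of the computation, a minimal sketch using the pooled-SD formula; the group statistics in the example are hypothetical:

```python
# Cohen's d from two groups' summary statistics (pooled-SD version).
# Example numbers are hypothetical, not from the Siena studies.
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

# Hypothetical: group means 105 vs. 100, both SDs 10, equal n
print(round(cohens_d(105, 10, 200, 100, 10, 200), 2))  # 0.5
```

So a drop from d = 1.0 to d = .38 means the mean Black-White gap shrinks from a full pooled SD to about a third of one.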
Why the reduced d?
• Somewhat of a puzzle. There is a history of using nonverbal reasoning tests
– Raven’s Progressive Matrices
– Large sample military studies in Project A
• But these do not show the reduced d that is seen with the
Siena Test
• Things to look into: does d vary with item difficulty, and how
does Siena compare with other tests?
• (Note: Nothing published to date that I am aware of. Some powerpoint
decks from SIOP presentations can be found online: search for “Siena
Reasoning Test”)
New Developments in Situational
Judgment Testing
Sample SJT item
• You find yourself in an argument with several co-workers
about who should do a very disagreeable, but routine
task. Which of the following would be the most effective
way to resolve this situation?
• (a) Have your supervisor decide, because this would
avoid any personal bias.
• (b) Arrange for a rotating schedule so everyone shares
the chore.
• (c) Let the workers who show up earliest choose on a
first-come, first-served basis.
• (d) Randomly assign a person to do the task and don't
change it.
Key findings
• Extensive validity evidence
• Can measure different constructs (problem
solving, communication skills, integrity, etc.)
• Incremental validity over ability and personality
• Small subgroup differences, except for
cognitively-oriented SJTs
• Items can be presented in written form or by
video; recent move to animation rather than
recording live actors
Lievens, Sackett, and Buyse (2009):
comparing response instructions
• Ongoing debate re “would do” vs. “should do”
• Lievens et al. randomly assigned Belgian
medical school applicants to “would do” or
“should do” in operational interpersonal skills
SJT; did the same with a student sample
Lievens, Sackett, and Buyse (2009):
comparing response instructions (continued)
• In the operational setting, all in effect gave “should do” responses
– So: we’d like to know “would do” but, in effect,
can only get “should do”
Arthur et al. (2014): comparing response formats
• Compared 3 options:
– Rate effectiveness of each response
– Rank the responses
– Choose best and worst response
• 20-item integrity-oriented SJT
• Administered to over 30,000 retail/hospitality
job applicants
• On-line admin; each format used for one week
• “Rate each response” emerges as superior:
– Higher reliability
– Lower correlation with cognitive ability
– Smaller gender mean difference
– Higher correlation with conceptually relevant
personality dimensions (conscientiousness,
agreeableness, emotional stability)
• Follow-up study with student sample:
– Higher retest reliability
– More favorable reactions
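The three formats imply different scoring rules. A minimal sketch with a hypothetical SME effectiveness key; operational scoring in studies like Arthur et al. is more elaborate:

```python
# Hypothetical SME effectiveness key for the four options (a-d) of one item.
sme_ratings = [5, 2, 4, 1]

def score_ratings(candidate_ratings, key=sme_ratings):
    """'Rate each response' format: higher score = smaller total
    distance between candidate ratings and the SME key."""
    return -sum(abs(c - k) for c, k in zip(candidate_ratings, key))

def score_best_worst(best, worst, key=sme_ratings):
    """'Choose best and worst' format: +1 for picking the keyed best
    option, +1 for picking the keyed worst option."""
    keyed_best = key.index(max(key))
    keyed_worst = key.index(min(key))
    return int(best == keyed_best) + int(worst == keyed_worst)

print(score_ratings([5, 1, 4, 2]))        # distance-based score
print(score_best_worst(best=0, worst=3))  # agreement with the key
```

Note that the rating format uses every option’s judgment, which is one intuition for its higher reliability.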
Krumm et al. (in press)
• Question: how “situational” is situational judgment?
• Some suggest SJTs really just measure
general knowledge about appropriate social behavior
• So Krumm et al. conducted a clever
experiment: they “decapitated” SJT items
– Removed the stem and just presented the response options
• 559 airline pilots completed 10 items each from:
– An airline pilot knowledge SJT
– An integrity SJT
– A teamwork SJT
• Overall, mean scores are 1 SD higher with the stem present
• But for more than half the items, there is no
difference with and without the stem
• So the stem matters overall, but is irrelevant for lots of SJT items
• Whether it matters depends on the specificity of the stem content
• “You are flying an ‘angel flight’ with a nurse and a noncritical child
patient, to meet an ambulance at a downtown regional airport. You
filed visual flight rules; it is 11:00 p.m. on a clear night when, at 60
nm out, you notice the ammeter indicating a battery discharge and
correctly deduce the alternator has failed. Your best guess is that
you have from 15 to 30 min of battery power remaining. You decide to:
• (a) Declare an emergency, turn off all electrical systems except for
1 NAVCOM and transponder, and continue to the regional airport as
planned.
• (b) Declare an emergency and divert to the Planter’s County Airport,
which is clearly visible at 2 o’clock, at 7 nm.
• (c) Declare an emergency, turn off all electrical systems except for
1 NAVCOM, instrument panel lights, intercom, and transponder, and
divert to the Southside Business Airport, which is 40 nm straight
ahead.
• (d) Declare an emergency, turn off all electrical systems except for
1 NAVCOM, instrument panel lights, intercom, and transponder, and
divert to Draper Air Force Base, which is at 10 o’clock, at 32 nm.”
• Arthur, W., Jr., Glaze, R. M., Jarrett, S. M., White, C. D., Schurig,
I., & Taylor, J. E. (2014). Comparative evaluation of three
situational judgment test response formats in terms of
construct-related validity, subgroup differences, and susceptibility
to response distortion. Journal of Applied Psychology, 99(3), 535-545.
• Krumm, S., Lievens, F., Huffmeier, J., Lipnevich, A., Bendels, H.,
& Hertel, G. (in press). How “situational” is judgment in
situational judgment tests? Journal of Applied Psychology.
• Lievens, F., Sackett, P. R., & Buyse, T. (2009). The effects of
response instructions on situational judgment test performance
and validity in a high-stakes context. Journal of Applied
Psychology, 94, 1095-1101.
New Developments in Integrity
Two meta-analyses with differing conclusions
• Ones, Viswesvaran, and Schmidt (1993) is the “classic” analysis of
integrity test validity.
– Found 662 studies, including many where only raw data were provided
(i.e., no write-up); many publishers shared information
• In 2012, Van Iddekinge et al. conducted an updated meta-analysis
– Applied strict inclusion rules as to which studies to include (e.g.,
reporting of study detail)
– 104 studies (including 132 samples) met the inclusion criteria
– 30 publishers were contacted; only 2 shared information
• Both based bottom-line conclusions on studies using a predictive design
and a non-self-report criterion.
Predicting Counterproductive Work Behavior
• Ones et al. – overt tests: k = 10, N = 5,598
• Ones et al. – personality-based tests: k = 62, N = 93,092
• Van Iddekinge et al.: k = 10, N = 5,056
• Mean validities: substantially higher in Ones et al. than in Van Iddekinge et al.
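Meta-analytic mean validities like those contrasted here are, at their core, sample-size-weighted averages of study correlations (before any corrections). A bare-bones sketch; the per-study correlations below are hypothetical, not values from either meta-analysis:

```python
# N-weighted mean correlation across studies -- the starting point of a
# meta-analysis, before artifact corrections. Study r values are
# hypothetical; only the weighting logic is illustrated.
def weighted_mean_r(studies):
    """Sample-size-weighted mean correlation over (r, n) study pairs."""
    total_n = sum(n for _, n in studies)
    return sum(r * n for r, n in studies) / total_n

hypothetical_studies = [(.30, 2000), (.20, 2000), (.10, 1000)]
print(round(weighted_mean_r(hypothetical_studies), 3))  # 0.22
```

With the weighting made explicit, it is easy to see how different inclusion rules (which (r, n) pairs enter the list) can move the bottom-line estimate.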
Why the difference?
• Not clear. A number of factors do not seem to be the cause:
– Differences in types of studies examined (e.g., both
excluded studies with polygraph as criteria)
– Differences in corrections (e.g., unreliability)
• Several factors may contribute, though this is speculation
– Some counterproductive behaviors may be more
predictable than others, but all are lumped together in
these analyses
• Given the reliance in both on studies not readily available to
public scrutiny, this won’t be resolved until further work is done
Broader questions
• This raises broader issues about data
openness policies
– Publisher obligations?
– Researcher obligations?
– Journal publication standards?
• Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (1993). Comprehensive
meta-analysis of integrity test validities: Findings and implications for
personnel selection and theories of job performance. Journal of Applied
Psychology, 78, 679-703.
• Van Iddekinge, C. H., Roth, P. L., Raymark, P. H., & Odle-Dusseau, H.
N. (2012). The criterion-related validity of integrity tests: An updated
meta-analysis. Journal of Applied Psychology, 97, 499-530.
New Developments in Using
Vocational Interest Measures
• Since Hunter and Hunter (1984), interest in
using interest measures for selection has
diminished greatly
• They report a meta-analytic estimate of
validity for predicting performance as .10
• BUT: how many studies are in this meta-analysis?
– 3!!!
• New meta-analysis by Van Iddekinge et al.
• Lots of studies (80)
• Mean validity for a single interest dimension
relevant to the job in question: .23
• Other studies suggest incremental validity
over ability and personality
• The “catch”: studies use data collected for
research purposes
• Concern that candidates can “fake” a job-relevant interest profile
• I expect interest to turn to developing faking-resistant interest measures
• Van Iddekinge, C. H., Roth, P. L., Putka, D. J., & Lanivich,
S. E. (2011). Are you interested? A meta-analysis of
relations between vocational interests and employee
performance and turnover. Journal of Applied
Psychology, 96(6), 1167.
• Nye, C. D., Su, R., Rounds, J., & Drasgow, F. (2012).
Vocational interests and performance: A quantitative
summary of over 60 years of research. Perspectives on
Psychological Science, 7(4), 384-403.
New Developments in Using
Social Media
Van Iddekinge et al. (in press)
• Students about to graduate made Facebook info available
• Recruiters rated profiles on 10 dimensions
• Supervisors rated performance a year later
• Facebook ratings did not predict performance
• Higher ratings for women than men
• Lower ratings for Blacks and Hispanics than Whites
• Van Iddekinge, C. H., Lanivich, S. E., Roth, P. L., & Junco,
E. (in press). Social media for selection? Validity and
adverse impact potential of a Facebook-based
assessment. Journal of Management.
Distribution of Performance
Is performance normally distributed?
• We’ve implicitly assumed this for years
– Data analysis strategies assume normality
– Evaluations of selection system utility assume normality
• O’Boyle and Aguinis (2012) offer hundreds of
data sets, all consistently showing that a
“power law” distribution fits better
– This is a distribution with the largest number of
observations at the very bottom, with the number
of observations then dropping rapidly
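The shape claim can be illustrated by simulation: in a power-law distribution, most observations sit near the bottom and a long right tail holds the “stars,” so the mean far exceeds the median; a normal distribution shows no such gap. The data below are simulated, not O’Boyle and Aguinis’s datasets:

```python
# Compare a heavy-tailed power-law (Pareto) sample to a normal sample.
# Simulated illustration only -- not the O'Boyle and Aguinis data.
import random
import statistics

random.seed(1)

# Pareto(alpha = 1.5) via inverse-transform sampling: heavy right tail
pareto = [(1 - random.random()) ** (-1 / 1.5) for _ in range(10_000)]
normal = [random.gauss(mu=3.0, sigma=1.0) for _ in range(10_000)]

# Mean/median ratio: far above 1 for the power law, near 1 for the normal
print("pareto mean/median:", statistics.mean(pareto) / statistics.median(pareto))
print("normal mean/median:", statistics.mean(normal) / statistics.median(normal))
```

This mean-versus-median gap is one quick diagnostic for whether a performance metric looks more “star performer” than bell curve.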
The O’Boyle and Aguinis data
• They argue against looking at ratings data,
as ratings may be “forced” to fit a normal distribution
• Thus they focus on objective data
– Tallies of publications in journals
– Sports performance (e.g., golf tournaments won,
points scored in the NBA)
– Awards in arts and letters (e.g., number of
Academy Award nominations)
– Political elections (number of terms to which one
has been elected)
An alternate view
• “Job performance is defined as the total
expected value of the discrete behavioral
episodes an individual carries out over a
standard period of time” (Motowidlo and Kell)
• Beck, Beatty, and Sackett (2014) note several measurement
features that affect the observed distribution:
– Aggregating individual behaviors affects the distribution
– Including all performers affects the distribution
– Equalizing opportunity to perform affects the distribution
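The aggregation point can be illustrated by simulation: even when single behavioral episodes are heavily skewed, performance aggregated over many episodes is far closer to normal (the central limit theorem at work). Simulated data, not Beck et al.’s analysis:

```python
# Skewness of single skewed "episodes" vs. performance aggregated over
# many episodes. Simulated illustration of the aggregation argument.
import random
import statistics

random.seed(7)

def skewness(xs):
    """Sample skewness (third standardized moment)."""
    m = statistics.mean(xs)
    s = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * s ** 3)

def episode():
    """One behavioral episode with a strongly right-skewed value."""
    return random.expovariate(1.0)

single = [episode() for _ in range(5_000)]
aggregated = [sum(episode() for _ in range(50)) for _ in range(5_000)]

# Exponential episodes have skewness near 2; 50-episode sums are far
# closer to symmetric (theoretical skewness 2 / sqrt(50), about 0.28)
print("skewness, single episode:", round(skewness(single), 2))
print("skewness, 50-episode aggregate:", round(skewness(aggregated), 2))
```

So whether performance “is” normal depends partly on the level of aggregation at which it is measured, which is central to the reply to O’Boyle and Aguinis.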
• O’Boyle, E., Jr., & Aguinis, H. (2012). The best
and the rest: Revisiting the norm of normality
of individual performance. Personnel
Psychology, 65(1), 79.
• Beck, J., Beatty, A. S., & Sackett, P. R.
(2014). On the distribution of performance: A
reply to O’Boyle and Aguinis. Personnel
Psychology, 67, 531-566.
