A Bifactor Approach - Maryland Assessment Research Center (MARC)

```
Exploring Value-Added Across Multiple Dimensions: A Bifactor Approach
Derek Briggs
Ben Domingue
Maryland Assessment Conference
October 18, 2012
1
Outline
• Motivation:
– Coming to Decisions in a High Stakes
Accountability Context
• Longitudinal Item Response Data
• A Bifactor Analysis
2
A Brief Aside on Vertical Scales
• The original title of this talk was “Multidimensionality, Vertical
• There is a simple bottom line on this: vertical scales are not
needed for value-added modeling. It’s a non-issue.
• Even for models that focus on repeated measures and growth
trajectories, the approach taken to create a vertical scale will
rarely have an impact on teacher or school rankings.
– For details, see working paper by Briggs & Domingue, “The Gains from
Vertical Scaling”
• Vertical scales can play an important role in supporting
inferences about student growth in absolute magnitudes.
– For a critique of current practice, see Briggs
“Measuring Growth with Vertical Scales” (in press) JEM
3
Motivation
4
“Don’t measure yourself by what you
have accomplished, but by what you
should have accomplished with your
ability.”
5
Our Theories and Intuition Tell Us
Multidimensional
According to Common Core State Standards, students who
are college and career ready in
• reading, writing, speaking and listening:
– Demonstrate independence, build strong content knowledge,
comprehend as well as critique, value evidence, use technology
and digital media, understand other perspectives and cultures.
• mathematics:
– Make sense of problems and persevere in solving them, reason
abstractly, construct viable arguments and critique the
reasoning of others, model with mathematics, attend to
precision, etc.
6
Previous Empirical Evidence
The variability in VA by outcome measure is greater
than the variability by model specification.
– Lockwood et al., JEM (2007)
– MET Study “Learning about Teaching” (2010)
– Papay, AERJ (2011)
These studies focused on correlations between VA
based on different tests within the same content
7
Data Source | Unit of Analysis | Sample Size | Model |
Hawaii      | Schools  | 272   | (MGPs) | 0.74
Wyoming     | Schools  | 214   | (MGPs) | 0.53
Denver PSD  | Teachers | 180   | (MGPs) | 0.58
LAUSD       | Teachers | 10794 | Fixed Effects Regression, student, demographics, no classroom or school covariates (“LAVAM”) | 0.60
LAUSD       | Teachers | 3306  | Fixed Effects Regression, with classroom and school covariate (“altVAM”) | 0.46
8
Teachers/Schools
Categorical Outcomes (K):
4 = Highly Effective
3 = Effective
2 = Partially Effective
1 = Ineffective
in Student Outcomes
Direct Observations of
Practice, Other Sources of
Evidence
9
• Compensatory
– Take simple or weighted average of value-added indicator
across test outcomes
– Classify teachers/schools on basis of quantiles of
distribution or confidence intervals.
• Conjunctive
– Classify teachers/schools in i categories on basis of j
outcomes.
– Make rules that simplify ij decision matrix to k.
– Ensure that no teacher/school is ineffective on a given
outcome.
10
What Are Tests Designed to “Measure?”
11
“Tests Measure Student Achievement”
• An achievement score is a function of the content
sampled from an instructional domain
• Teachers/schools may vary in their ability to teach
different subject matter.
• Agnostic about underlying latent variable.
• Observed achievement is an estimate of a true score or
universe score (G Theory)
– Each achievement domain has a different hypothetical universe
score.
• Consistent with compensatory and conjunctive
approaches?
12
“Tests Measure Student Ability (θ)”
• This is a latent variable perspective.
• But math and reading “abilities” are poorly defined
latent variables.
• What is distinct and what is the same about these
variables?
• What if reading and math items are really just
measuring the same unidimensional latent variable?
• Spearman’s g?
• Should this be the focus of value-added inferences?
13
A Novel Application of a
Bifactor IRT Model
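[Path diagram: a Common Factor (“g”?) is linked to every item. A “Math
Knowledge & Skills?” factor is linked to the Items from a Math Test
(Item 1, Item 2, …, Item 45). A second “K & S?” factor is linked to a
second set of items (Item 1, Item 2, …, Item 54).]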
14
Research Questions
1. Is “achievement” distinct from “ability”?
• If we remove the influence that is common to both
math and reading test performance, what is left?
• Are the subject-specific variables substantively
• How do the three “theta” variables from the bifactor
model compare to the “theta” variables from
successive unidimensional IRT models?
2. What insights does a bifactor model give us
estimates of value-added across test outcomes?
15
Exploratory Strategy
• Leverage longitudinal item response data to estimate six
“theta” variables:
UNIDIMENSIONAL
1. Math (2PL IRT)
2. Reading (2PL IRT)
3. Math + Reading (unidimensional 2PL IRT)
MULTIDIMENSIONAL
1. Bifactor math (Bifactor 2PL)
2. Bifactor reading (Bifactor 2PL)
3. Bifactor g (Bifactor 2PL)
• Examine the characteristics of each as a “measure”
• Compare the use of these different variables as the
outcome in a (simple) value-added model
16
Data & Methods
17
Bifactor Model
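$P(y \mid \Theta) = \prod_{i=1}^{I} P\left(y_i^{(j)} \mid \theta_g, \theta_j\right)$
where $\Theta = (\theta_g, \theta_1, \ldots, \theta_j, \ldots, \theta_J)$
Let $p_j = P\left(y_i^{(j)} \mid \theta_g, \theta_j\right)$
i = items, j = item-specific factors, g = general factor
Logit link function: $\operatorname{logit}(p_j) = a_{ig}\theta_g + a_{ij}\theta_j + b_j$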
Technical Details
Software: IRTPRO 2.1 (Cai, Thissen, Du Toit)
Estimation Method: Bock-Aitkin, 49 quadrature points
References: Cai, 2010; Cai, Yang & Hansen, 2011; Rijmen, 2009;
Rijmen et al., 2008.
18
CSAP Tests in Math and Reading
Content Standards
1. Number & Operation Sense
2. Algebra, Patterns and Functions
3. Statistics and Probability
4. Geometry
5. Measurement
6. Computational Techniques
• All 6 standards emphasize application of content for
problem solving and communication.
• Mix of MC and CR items
Content Standards
2. Thinking Skills
3. Use of Literary Information
4. Literature
• Subcontent: Fiction, Nonfiction, Vocabulary, Poetry
• Mix of MC and CR items
19
Longitudinal Item Response
Structure: Students nested in Schools
Source: Denver Public School District
20
Student & School Characteristics
price lunch services (FRL).
• Between 10-20% are English Language Learners (ELL)
Across DPS Schools:
Variable | Mean | SD
FRL      | 65%  | 28%
SPED     | 11%  | 8%
ELL      | 14%  | 13%
21
Students per School
Min | 1st Qu | Median | Mean  | 3rd Qu | Max
62  | 128    | 210    | 236.6 | 315.2  | 578
[Histogram “School Size”: x-axis Number of Students (0 to 600), y-axis
Frequency (0 to 14); 40 Schools]
22
• Fixed Effects Regression
(high school)
• Outcome: One of the six “theta” variables created.
• Covariates:
– Prior grade achievement in same outcome
– Free/reduced lunch status
– English Language Learner status
– Special Education status
• Empirical Bayes shrinkage estimators
23
Caveats
• This is a very simple VAM.
• Limited set of covariates, no school-level vars.
• Only a single longitudinal cohort of students.
– (Though we did examine possible adjustments.
Results not shown here.)
24
Results
25
Grade |   5   |   6   |   7   |   8   |   9
5     | (.76) |  .85  |  .82  |  .77  |  .77
6     |  .88  | (.78) |  .86  |  .82  |  .80
7     |  .87  |  .91  | (.78) |  .86  |  .83
8     |  .87  |  .89  |  .91  | (.76) |  .85
9     |  .84  |  .88  |  .89  |  .89  | (.74)
Note how strong these correlations are even after 4 years.
math lower triangle; reading upper triangle
26
[Figure: panels “bifactor math” and “g math”; axis ticks show grades 5 to 9
and values from -0.5 to 1.0. An annotation on the plot begins “Horizontal
blue line at”.]
27
[Figure: panels “g math” and “bf math”; axis ticks show g5 to g9 and values
from -0.5 to 1.0.]
28
[Figure: panels “bifactor math” and “g math”; axis ticks show grades 5 to 9
and values from -0.5 to 1.0.]
Seems clear that
something is
data in 2005 so we
analyses that
follow.
29
Marginal Reliabilities
The bifactor math and reading estimates are rather noisy:
reliability is low at the student level.
30
Grade |   5    |   6    |   8    |   9
5     | (-.23) |  .37   |  .03   |  .06
6     |  .52   | (-.13) | -.02   | -.10
8     |  .45   |  .54   | (-.20) |  .27
9     |  .44   |  .52   |  .56   | (-.18)
math lower diagonal; reading upper diagonal
31
Regression Results with
Unidimensional Outcomes
Unidimensional Approach                 | Math   | [Reading] | Combined
[Prior grade achievement]               | 0.79*  | 0.81*     | 0.86*
Free/Reduced Price Lunch Eligible       | -0.04  | -0.07*    | -0.03
English Language Learner                | -0.04  | -0.01     | -0.01
Student has an IEP                      | -0.11* | -0.15*    | -0.10*
R2 for model w/ school fixed effects    | 0.761  | 0.800     | 0.855
R2 for model w/ NO school fixed effects | 0.734  | 0.785     | 0.838
Increase in R2 due to schools           | 0.027  | 0.014     | 0.016
Note: Each outcome is standardized, so coefficients can be interpreted in an effect
size metric. * p < .05
32
Regression Results with
Bifactor Outcomes
Multidimensional Approach               | Math   | [Reading] | g
[Prior grade achievement]               | 0.33*  | 0.48*     | 0.84*
Free/Reduced Price Lunch Eligible       | -0.04  | -0.21*    | 0.00
English Language Learner                | 0.07   | -0.20*    | 0.01
Student has an IEP                      | -0.02  | -0.22*    | -0.11*
R2 for model w/ school fixed effects    | 0.147  | 0.356     | 0.814
R2 for model w/ NO school fixed effects | 0.116  | 0.331     | 0.793
Increase in R2 due to schools           | 0.032  | 0.025     | 0.021
Note: Each outcome is standardized, so coefficients can be interpreted in an effect
size metric. * p < .05
33
School “Effects” Distributions from
Unidimensional vs. Bifactor Outcomes
Unidimensional: SD Math = 0.22, SD Comp = 0.18
Bifactor: SD Math = 0.11, SD g = 0.21
Note: These are
shrunken VA estimates
34
Unidimensional vs. Bifactor Math
SD Uni Math = 0.22
SD BF Math = 0.11
Note: These are
shrunken VA estimates
35
Note: These are
shrunken VA estimates
36
VA Comparisons:
Uni Math, Uni Reading vs. g
Value-added for math seems mostly redundant with value-added for g (r = .98);
but looking at reading separately yields some unique information (r = .82).
37
VA g is equivalent to VA from
38
Math vs. Reading: With and Without g
39
Math vs. Reading VA within Method
Bifactor VA by Subject
Unidimensional VA by Subject
40
Relationship of VA with
School-Level Status Variables
If low correlations with these variables were considered an indication of a VA indicator
that successfully leveled the playing field, the school effects associated with bifactor
math outcomes would “win”.
41
Discussion
42
Summary
• When math and reading outcomes are quantitatively
combined (VAcomp or taking average of VA across subjects),
this is essentially equivalent to estimating VA for “g”.
– Math items load weakly on math-specific bifactor
• Evidence that math and reading bifactors are not just noise.
• School fixed effects explain more variability in math/reading
factors than in traditional unidimensional measures.
missed if math and reading were combined.
43
Limitations & Next Steps
Limitations:
Next steps:
blueprints.
• Do results generalize to
– schools & districts throughout the state?
– multiple cohorts of students?
– other tests?
– more complex VAMs (control for unit-level aggregates)?
44
Tough Conceptual Questions
• What is g?
– Is it sensitive to instruction?
– Is it what we want to hold teachers and schools
accountable for increasing?
• If a test measures something beyond g, what
is that something? Can it be distinguished?
45
Claims from the Smarter-Balanced
Large-Scale Assessment Consortium
In the domain of mathematics:
Claim 1: Students can explain and apply mathematical concepts and interpret
and carry out mathematical procedures with precision and fluency.
Claim 2: Students can solve a range of complex well-posed problems in pure
and applied mathematics, making productive use of knowledge and problem-solving strategies.
Claim 3: Students can clearly and precisely construct viable arguments to
support their own reasoning and to critique the reasoning of others.
Claim 4: Students can analyze complex, real-world scenarios and can
construct and use mathematical models to interpret and solve problems.
Should each of these claims be measured with a unique score? Should we
expect variability in teacher efficacy on each? Or are all of these claims
wrapped up in g?
46
```
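
The logit link on the “Bifactor Model” slide can be made concrete with a small sketch. It is only an illustration under stated assumptions: every item is treated as dichotomous (the CSAP tests also include CR items, which would need a polytomous model), and the function names, the item dictionaries, and all parameter values are hypothetical rather than taken from the talk or from IRTPRO.

```python
import math

def item_prob(theta_g, theta_s, a_g, a_s, b):
    """P(correct) for one item under logit(p) = a_g*theta_g + a_s*theta_s + b."""
    z = a_g * theta_g + a_s * theta_s + b
    return 1.0 / (1.0 + math.exp(-z))

def pattern_likelihood(responses, theta_g, theta_specific, items):
    """Product over items of P(y_i | theta_g, theta_j).

    Each item loads on the general factor and on exactly one specific
    factor, named by item["factor"] (hypothetical field name).
    """
    like = 1.0
    for y, item in zip(responses, items):
        p = item_prob(theta_g, theta_specific[item["factor"]],
                      item["a_g"], item["a_s"], item["b"])
        like *= p if y == 1 else 1.0 - p
    return like

# Made-up item parameters: one math item and one reading item.
items = [
    {"factor": "math", "a_g": 1.2, "a_s": 0.4, "b": 0.0},
    {"factor": "reading", "a_g": 1.0, "a_s": 0.8, "b": -0.5},
]
thetas = {"math": 0.0, "reading": 0.0}
likelihood = pattern_likelihood([1, 1], 0.0, thetas, items)
```

A math item with a small `a_s` relative to `a_g` behaves almost like an item on a unidimensional g scale, which is the situation the Summary slide describes when it says math items load weakly on the math-specific bifactor.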
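
The value-added model slide lists “Empirical Bayes shrinkage estimators” without giving a formula. The sketch below shows one common textbook form, offered as an assumption rather than as the authors' estimator: each raw school effect is pulled toward the grand mean by a weight equal to its estimated reliability, with the between-school variance estimated by a simple method of moments. The function name and the example numbers are hypothetical.

```python
from statistics import mean, pvariance

def eb_shrink(raw_effects, std_errors):
    """Shrink raw school effects toward their mean (assumed textbook form)."""
    grand = mean(raw_effects)
    sampling_var = [se ** 2 for se in std_errors]
    # Method-of-moments estimate of true between-school variance,
    # floored at zero so the weights stay in [0, 1].
    tau2 = max(pvariance(raw_effects) - mean(sampling_var), 0.0)
    shrunken = []
    for est, v in zip(raw_effects, sampling_var):
        weight = tau2 / (tau2 + v) if (tau2 + v) > 0 else 0.0
        shrunken.append(grand + weight * (est - grand))
    return shrunken

# Made-up fixed-effect estimates and standard errors for four schools.
raw = [0.4, -0.4, 0.1, -0.1]
se = [0.1, 0.1, 0.3, 0.3]
shrunk = eb_shrink(raw, se)
```

Schools with larger standard errors (typically smaller schools) are pulled harder toward the mean, which is why the slides label the reported SDs as coming from shrunken VA estimates.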
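
The compensatory and conjunctive approaches can give different answers for the same school, which a short sketch makes visible. The cut scores, weights, and the “lowest category wins” conjunctive rule are all hypothetical choices made for illustration; the slide only says that some rule must reduce the decision matrix to the K = 4 categories.

```python
from bisect import bisect_right

LABELS = {1: "Ineffective", 2: "Partially Effective",
          3: "Effective", 4: "Highly Effective"}

def categorize(value, cuts):
    """Map a value-added estimate to 1..4 using three ascending cut scores."""
    return bisect_right(cuts, value) + 1

def compensatory(va_by_outcome, weights, cuts):
    """Classify on the weighted average of value-added across outcomes."""
    total = sum(weights[k] for k in va_by_outcome)
    avg = sum(va_by_outcome[k] * weights[k] for k in va_by_outcome) / total
    return categorize(avg, cuts)

def conjunctive(va_by_outcome, cuts):
    """Classify each outcome separately; the overall rating is the lowest."""
    return min(categorize(v, cuts) for v in va_by_outcome.values())

cuts = [-0.2, 0.0, 0.2]                     # hypothetical cut scores
school = {"math": 0.3, "reading": -0.25}    # made-up VA estimates
comp_rating = compensatory(school, {"math": 1.0, "reading": 1.0}, cuts)
conj_rating = conjunctive(school, cuts)
```

With these made-up numbers the strong math result offsets the weak reading result under the compensatory rule, while the conjunctive rule rates the school on its weakest outcome.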