Teacher Evaluation and Performance Measurement
Doug Staiger, Dartmouth College
Not this.
Satisfactory (or equivalent)
Unsatisfactory (or equivalent)
Weisberg, D., Sexton, S., Mulhern, J. & Keeling, D. (2009) The Widget Effect: Our National Failure to Acknowledge
and Act on Differences in Teacher Effectiveness. New York: The New Teacher Project.
Transformative Feedback
Recent Work on Teacher Evaluation
• Efforts to identify effective teaching using achievement gains
– Work with Tom Kane & others in LAUSD, NYC, Charlotte…
www.dartmouth.edu/~dstaiger
• Efforts to better identify effective teaching
– Measures of Effective Teaching (MET) Project (Bill & Melinda Gates Foundation)
www.metproject.org
– National Center for Teacher Effectiveness (NCTE) (US Department of Education)
www.gse.harvard.edu/ncte
The Measures of Effective Teaching Project
Participating Teachers
• Two school years: 2009-10 and 2010-11
• Grades 4-8: ELA and Math
• High School: ELA I, Algebra I and Biology
The MET data is unique …

… in the variety of indicators tested:
• 5 instruments for classroom observations (use FFT here)
• Student surveys (Tripod Survey)
• Value-added on state tests

… in its scale:
• 3,000 teachers
• 22,500 observation scores (7,500 lesson videos x 3 scores)
• 900+ trained observers
• 44,500 students completing surveys and supplemental assessments in year 1
• 3,120 additional observations by principals/peer observers in Hillsborough County, FL

… and in the variety of student outcomes studied:
• Gains on state math and ELA tests
• Gains on supplemental tests (BAM & SAT9 OE)
• Student-reported outcomes (effort and enjoyment in class, grit)
What is “Effective” Teaching?
• Can be an inputs-based concept
– Observable actions or characteristics
• Can be an outcomes-based concept
– Measured by student success
• Ultimately, we care about impact on student outcomes
– Current focus on standardized exams
– Interest in other outcomes (college, non-cognitive)
Multiple Measures of Teaching Effectiveness
Measure #1
Student Achievement Gains
(“Value Added”)
Basics of Value Added Analysis
• Teacher value added compares actual student achievement at the end of the year to an expectation for each student
• Difference between actual and expected achievement, averaged over all of teacher’s students
• Expected achievement is typical achievement for other students who looked similar at start of year
– Same prior-year test scores
– Same demographics, program participation
– Same characteristics of peers in classroom or school
• Various flavors, all work similarly
– Student growth percentiles
– Average change in score or percentile
– Based on prior year test or Fall pre-test
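As a minimal sketch of the calculation described on this slide, the Python below fits a one-variable least-squares line of end-of-year score on prior-year score, treats the fitted value as each student's expected achievement, and averages actual minus expected by teacher. The function name `value_added`, the record layout, and the six sample records are hypothetical. Real models also adjust for demographics, program participation, and peer characteristics, which this sketch leaves out.

```python
from collections import defaultdict
from statistics import mean


def value_added(records):
    """Return {teacher: mean(actual - expected)} for (teacher, prior, end) records.

    Expected achievement is the least-squares prediction of the end-of-year
    score from the prior-year score alone (an illustrative simplification).
    """
    prior = [r[1] for r in records]
    end = [r[2] for r in records]
    x_bar, y_bar = mean(prior), mean(end)
    sxx = sum((x - x_bar) ** 2 for x in prior)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(prior, end))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Collect each student's residual (actual minus expected) under the teacher.
    residuals = defaultdict(list)
    for teacher, x, y in records:
        expected = intercept + slope * x
        residuals[teacher].append(y - expected)
    return {teacher: mean(vals) for teacher, vals in residuals.items()}


# Hypothetical data: two teachers whose students start with the same prior scores.
records = [
    ("A", 40, 52), ("A", 60, 70), ("A", 80, 88),
    ("B", 40, 44), ("B", 60, 62), ("B", 80, 80),
]
va = value_added(records)
```

In this made-up example teacher A's students finish 4 points above expectation on average and teacher B's finish 4 points below.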
There are Large Differences in Teacher Effects on Student Achievement Gains
• Most evidence from “value added” analysis, but similar findings from randomized experiments
• Huge literature about “teacher effects” on achievement
– Large persistent variation across teachers
– Difficult to predict at hire
– Partially predictable after hire
– Improve only in the first few years of teaching
– Not related to most determinants of pay (certification, degrees, experience beyond first few years)
Large Variation in Value Added of LAUSD Teachers is Not Related to Teacher Certification
Figure: Teacher Impacts on Math Performance by Initial Certification. Series: Traditionally Certified, Alternatively Certified, Uncertified; x-axis: Change in Percentile Rank of Average Student (-15 to 15).
Note: Classroom-level impacts on average student performance, controlling for baseline scores,
student demographics and program participation. LAUSD elementary teachers, grade 2 through 5.
Variation in Value Added of LAUSD Teachers is Related to Prior Performance
Figure: Teacher Impacts on Math Performance in Third Year by Ranking After First Two Years. Series: Bottom, 2nd Quartile, 3rd Quartile, Top Quartile; x-axis: Change in Percentile Rank of Average Student (-15 to 15).
Note: Classroom-level impacts on average student performance, controlling for baseline scores,
student demographics and program participation. LAUSD elementary teachers, < 4 years experience.
Why Not Just Hire Good Teachers?
• Wise selection is the best means of improving the school system, and the greatest lack of economy exists wherever teachers have been poorly chosen.
– Frank Pierrepont Graves, NYS Commissioner, 1932
• Unfortunately, easier said than done
– Decades of work on type of certification, graduate education, exam scores, GPA, college selectivity, TFA
– (Very) small, positive effects on student outcomes
Large Variation in Value Added of NYC Teachers is Not Related to Recruitment Channel
Figure: Series: Traditionally Certified, Teach for America, Teaching Fellow, Uncertified; x-axis: Student Level Standard Deviations (-.4 to .4).
Note: Shown are estimates of teachers' impacts on average student performance, controlling for teachers' experience levels and students' baseline
scores, demographics and program participation; includes teachers of grades 4-8 hired since the 1999-2000 school year.
Of Course, Teacher Impact on State Test Score is Not All We Care About
• Depends on design & content of test
• Test scores are proximate measures
– But recent evidence suggests they capture long-run impact on student learning and other outcomes
• Test scores are only one dimension of performance
– Non-cognitive skills (grit, dependability, …)
Value Added is Controversial
• “We need to find a way to measure classroom success and teacher effectiveness. Pretending that student outcomes are not part of the equation is like pretending that professional basketball has nothing to do with the score.” (Arne Duncan 2009)
• “There is no way that any of this current data could actually, fairly, honestly or with any integrity be used to isolate the contributions of an individual teacher.” (Randi Weingarten 2008)
What we learned from MET:
Value-added measures
• Identified teachers who caused students to learn more
on state tests following random assignment.
• The same teachers also caused students to learn more on
supplemental assessments and enjoy class more.
• Low year-to-year correlations in value-added (and other
performance measures) understate year-to-career
correlations.
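One way to see why a low year-to-year correlation can understate the year-to-career correlation is a simple measurement model, which is an assumption here and not something the slides spell out: each year's measure equals a persistent teacher component plus independent noise of equal variance. Under that model the year-to-year correlation r is the share of variance that is persistent, the correlation between one year's measure and the persistent component is the square root of r, and the Spearman-Brown formula gives the reliability of an average over several years. The function names and the example value of .36 are hypothetical.

```python
import math


def year_to_career_correlation(year_to_year_r):
    """Correlation of a single-year measure with the persistent component.

    Assumes measure = persistent effect + independent noise each year, so the
    year-to-year correlation equals the persistent share of variance.
    """
    if not 0.0 <= year_to_year_r <= 1.0:
        raise ValueError("year_to_year_r must be between 0 and 1")
    return math.sqrt(year_to_year_r)


def reliability_of_average(year_to_year_r, n_years):
    """Spearman-Brown reliability of the mean of n_years single-year measures."""
    r = year_to_year_r
    return n_years * r / (1 + (n_years - 1) * r)


# Hypothetical year-to-year correlation of .36.
single_year = year_to_career_correlation(0.36)
three_year_average = reliability_of_average(0.36, 3)
```

With a year-to-year correlation of .36, the implied correlation between one year's measure and the persistent component is .6 under these assumptions.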
Figure 1. Actual and Predicted Achievement of Randomized Classrooms (Math)
x-axis: Predicted achievement using teacher's past measures of teaching (-.1 to .1). Reference line: Actual = Predicted.
Note: Teachers were sorted into 20 groups by their predicted student achievement relative to the randomization group mean.
Means are reported for each of the 20. Predictions are adjusted for non-compliance.
Figure 2. Actual and Predicted Achievement of Randomized Classrooms (ELA)
x-axis: Predicted achievement using teacher's past measures of teaching (-.1 to .1). Reference line: Actual = Predicted.
Note: Teachers were sorted into 20 groups by their predicted student achievement relative to the randomization group mean.
Means are reported for each of the 20. Predictions are adjusted for non-compliance.
Measure #2
Classroom Observations
Classroom Observation
Using Digital Video
What you can expect from us:
Helping Districts Test Their Own New Classroom Observations
Access to Validation Engine:
1. SEA/LEA chooses a rubric, trains raters
2. The MET Project delivers sample videos
3. SEA/LEA ratings used to
– Predict value added
– Gauge reliability
Two Cross-Subject Observation Instruments
Instrument | Developer | Origin | Instructional Focus | Structure | Scoring
Framework for Teaching | Charlotte Danielson | Outgrowth of ETS’s PRAXIS III licensing exam | Constructivism; Intellectual Engagement | 4 domains; 22 components (MET uses 8 components*) | 4 Points
Classroom Assessment Scoring System (CLASS) | Robert Pianta, Univ. of Virginia | Tool for research on early childhood development | Teacher-student interactions | 3 domains; 12 dimensions | 7 Points
*not: “flexibility & responsiveness” & “organization of physical space”
FFT competencies scored:
CLASSROOM ENVIRONMENT
Creating an environment of respect and rapport
Establishing a culture of learning
Managing classroom procedures
Managing Student Behavior
INSTRUCTION
Communicating with Students
Using Questioning and Discussion Techniques
Engaging Students in Learning
Using Assessments in Instruction
Math Observation Instruments
Instrument | Developer | Origin | Instructional Focus | Structure | Scoring
Mathematical Quality of Instruction (MQI) | Heather Hill, Harvard | Outgrowth from written test of math teaching knowledge | Math errors and imprecision | 6 overall elements of instruction | 3 Points
UTEACH Observation Protocol (UTOP) | Michael Marder, Univ. of Texas-Austin | Teacher prep program for math & science majors | Values different modes, from direct instruction to inquiry-based | 4 sections; 22 subsections | 5 Points
ELA Observation Instrument
Instrument | Developer | Origin | Instructional Focus | Structure | Scoring
Protocol for Language Arts Teaching Observations (PLATO) | Pam Grossman, Stanford | Research on effective middle grade ELA instruction | Modeling, explicit teaching of strategies, guided practice | 13 elements (6 elements included in MET study) | 4 Points
What we learned from MET:
Classroom observations:
• Observation scores were correlated with a teacher’s value-added (.15-.27).
• Different instruments were highly correlated with each other
(although subject-specific instruments were distinct from the
general-pedagogical instruments).
• Reliability requires certified observers and more than one
observer per teacher (because rater judgments differ).
• Principals rate their own teachers higher than other
observers do, but their rankings are similar.
• When teachers select their own videos, scores are higher,
but ranking remains the same.
Four Steps
Four Steps to High-Quality
Classroom Observations
Step 1: Define Expectations
Framework for Teaching (Danielson)
Unsatisfactory: Yes/no questions, posed in rapid succession; teacher asks all questions; same few students participate.
Basic: Some questions ask for student explanations; uneven attempts to engage all students.
Proficient: Most questions ask for explanation; discussion develops/teacher steps aside; all students participate.
Advanced: All questions high quality; students initiate some questions; students engage other students.
Actual scores for 7500 lessons.
Step 2: Ensure Accuracy of Observers
Step 3: Monitor Reliability
More than 1 observer
Reliability gain from one more observer: +.16
Reliability gain from one more lesson: +.07
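Why a second observer can help more than a second lesson can be illustrated with a small variance-components sketch. It assumes that a teacher's score is the sum of a teacher effect, a rater-specific leniency or severity toward that teacher, a lesson effect, and residual noise, and that lessons are divided evenly among raters. The function name and the variance shares below are hypothetical, chosen only for illustration; they are not MET estimates and are not meant to reproduce the +.16 and +.07 figures.

```python
def reliability(var_teacher, var_rater, var_lesson, var_resid, n_lessons, n_raters):
    """Share of the variance in a teacher's average score that reflects the teacher.

    Rater leniency only averages out across raters, while lesson and residual
    noise average out across lessons (illustrative assumptions).
    """
    noise = var_rater / n_raters + (var_lesson + var_resid) / n_lessons
    return var_teacher / (var_teacher + noise)


# Hypothetical variance shares (they sum to 1 for a single score).
parts = dict(var_teacher=0.4, var_rater=0.2, var_lesson=0.1, var_resid=0.3)

base = reliability(**parts, n_lessons=1, n_raters=1)        # one lesson, one rater
same_rater = reliability(**parts, n_lessons=2, n_raters=1)  # second lesson, same rater
new_rater = reliability(**parts, n_lessons=2, n_raters=2)   # second lesson, second rater
```

Under these made-up shares, a second lesson scored by the same rater raises reliability from .40 to .50, while a second lesson scored by a different rater raises it to about .57, because only the second rater dilutes the rater-specific component.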
Step 4: Verify Alignment with Outcomes
Teachers with Higher Observation Scores Had Students Who Learned More
Measure #3
What do students say?
Students Distinguish Between Teachers
Percent of Students by Classroom Agreeing
What we learned from MET:
Student surveys:
• Surveys are a low-cost way to cover untested grades
and subjects.
• Student surveys are related to teacher value-added
(.15-.25).
• Student surveys are the most reliable measures we
tested.
Multiple Measures
The “Dynamic Trio”:
Classroom observations, student
feedback and student achievement gains.
Dynamic Trio
Three Criteria:
• Predictive power: Which measure could most accurately identify teachers likely to have large gains when working with another group of students?
• Reliability: Which measures were most stable from section to section or year to year for a given teacher?
• Potential for Diagnostic Insight: Which have the potential to help a teacher see areas of practice needing improvement? (We’ve not tested this yet.)
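The predictive-power criterion can be sketched as a top-versus-bottom-quartile comparison: rank teachers on a measure taken with one group of students, then compare the average value added of the top and bottom quarters with another group. The function name `top_bottom_quartile_gap` and the eight sample teachers are hypothetical, and this is only one plausible way to operationalize the criterion.

```python
from statistics import mean


def top_bottom_quartile_gap(measure, outcome):
    """Mean outcome of the top quarter minus the bottom quarter, ranked by measure.

    measure[i] is teacher i's score on the measure being evaluated;
    outcome[i] is the same teacher's value added with a different group of students.
    """
    n = len(measure)
    if n != len(outcome):
        raise ValueError("measure and outcome must have the same length")
    k = n // 4
    if k == 0:
        raise ValueError("need at least 4 teachers")
    order = sorted(range(n), key=lambda i: measure[i])
    bottom = [outcome[i] for i in order[:k]]
    top = [outcome[i] for i in order[-k:]]
    return mean(top) - mean(bottom)


# Hypothetical data for eight teachers.
measure = [1, 2, 3, 4, 5, 6, 7, 8]
other_class_va = [-0.2, -0.1, -0.05, 0.0, 0.0, 0.05, 0.1, 0.2]
gap = top_bottom_quartile_gap(measure, other_class_va)
```

A larger gap means the measure separates teachers whose other students gain more from those whose other students gain less.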
Measures have different strengths …and weaknesses
Columns: Predictive power, Reliability, Potential for Diagnostic Insight
Rows: Value-added, Student survey, Observation
Combining Measures Improved Reliability as well as Predictive Power
Figure: The Reliability and Predictive Power of Measures of Teaching. Points: VA alone; Combined (Criterion Weights); Combined (Equal Weights); Student survey alone; Observation alone (FFT). x-axis: Reliability (0 to .7); y-axis: Difference in Math VA (Top 25% vs. Bottom 25%) (.05 to .25).
Note: Table 16 of the research report. Reliability based on one course section, 2 observations.
Note: For the equally weighted combination, we assigned a weight of .33 to each of the three measures. The criterion weights were chosen to
maximize ability to predict a teacher’s value-added with other students. The next MET report will explore different weighting schemes.
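The equally weighted combination described in the note can be sketched as a weighted sum of standardized measures. The code below assumes each measure is first converted to z-scores across teachers so that the weights are comparable, which is an assumption since the slides do not state the scaling. The function names and the three sample teachers are hypothetical, and the criterion weights are not reproduced because they would have to be estimated from data.

```python
from statistics import mean, pstdev


def standardize(values):
    """Convert raw scores to z-scores (requires some variation across teachers)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def composite(measures, weights):
    """Weighted sum of standardized measures, one composite score per teacher.

    measures: {name: [score per teacher]}; weights: {name: weight}.
    """
    z = {name: standardize(vals) for name, vals in measures.items()}
    n_teachers = len(next(iter(z.values())))
    return [
        sum(weights[name] * z[name][i] for name in z)
        for i in range(n_teachers)
    ]


# Hypothetical scores for three teachers on the three measures.
measures = {
    "value_added": [-0.1, 0.0, 0.1],
    "student_survey": [2.5, 3.0, 3.5],
    "observation": [2.0, 2.5, 3.0],
}
equal_weights = {"value_added": 0.33, "student_survey": 0.33, "observation": 0.33}
scores = composite(measures, equal_weights)
```

Other weighting schemes can be tried by passing a different `weights` dictionary.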
What we learned from MET:
Combining measures:
• The teachers identified as more effective caused
students to learn more following random assignment.
• Combining value added with student surveys and
classroom observations produces two benefits:
• Increased reliability
• Increased correlation with other outcomes such as value-added
on supplemental assessments and happiness in class
• Weighting value-added below .33, though, lowered
correlation with other outcomes and lowered reliability.
Can the measures be used for “high stakes”?
Scenario 1: Teacher
You have been teaching biology for 10 years and want to improve your practice. What weaknesses should you focus on and how will you know if you're making progress?
Scenario 2: Principal
A probationary teacher in your school is approaching the end of their 2nd year. If you retain him/her, the teacher automatically earns tenure under the collective bargaining agreement. Should you grant tenure (or recruit a new novice teacher)?
Scenario 3: Superintendent
Your district is considering offering coaching opportunities/higher pay to a subset of your teachers. Should you (i) allocate those slots on the basis of seniority, or (ii) ensure that only excellent instructors are coaches? How would you measure effectiveness fairly?
• High-stakes decisions are being made now, with little or no data.
• No information is perfect, but better information should lead to better decisions and fewer mistakes.
No information is perfect.
But better information → better decisions
How do these compare to existing measures?
• Masters Degrees
• Years of Experience
• Classroom Observations Alone
Compared to What?
Compared to MA Degrees and Years of Experience, the Combined Measure Identifies Larger Differences
… on state tests
… and on low stakes assessments
… as well as on student-reported outcomes.
The Value of Going Beyond Classroom Observation
• Observations
• Observations + Student Perceptions
• Observations + Student Perceptions + VA on state tests
Compared to Classroom Observations Alone, the Combined Measure Identifies Larger Differences (Math Value Added)
Figure: y-axis: Average math Value Added, Other Class (-.2 to .3); x-axis: Percentile Rank on FFT (0 to 100). Series: Rank using FFT only; Rank using FFT and Tripod; Rank using FFT, Tripod, and Value Added.
Improving Teaching
What are Districts Doing?
Robust evaluation systems themselves
improve teaching outcomes
Source: Eric S. Taylor and John H. Tyler, “Can Teacher Evaluation Improve Teaching?”
Education Next, Fall 2012
Teacher Effectiveness Continues to
Improve in Better Environments
Source: Matthew A. Kraft and John P. Papay, “Can Professional Environments in
Schools Promote Teacher Development? Explaining Heterogeneity in Returns to
Teaching Experience,” January 2013 (on NCTE website).
The Best Foot Forward Project
1. Teachers record their own lessons.
• Record ≥1 lesson every 2 weeks.
• Submit 5 lessons over course of the year.
• Viewed by principals, content experts.
2. Observers view and discuss videos with teachers.
• Observers trained to use video for feedback.
• Identify discrete, coachable changes.
3. Teachers can share videos with each other.
4. Students provide anonymous feedback.
Next Up: Dashboard for Tracking
Teacher Evaluations and Benchmarking Performance
1. Distribution of Observation Scores: What are the
differences in scores and are the differences between
schools, districts, grades and subjects larger than might
have occurred by chance?
2. Observations and Value-Added: What are the
relationships among the different measures? Do they
differ by district, school, grade level, subject? Are they
weaker/stronger than we observed in MET?
3. Reliability: How does each measure vary from school
to school and year to year?
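The first dashboard question, whether differences between schools are larger than might have occurred by chance, could be approached with a permutation test. This is offered as one possible technique and not as the dashboard's actual method. The sketch compares the observed spread of school mean observation scores with the spread obtained after randomly reassigning scores to schools. The function names, the seed, and the sample scores are hypothetical.

```python
import random
from statistics import mean


def between_school_spread(scores, schools):
    """Variance of the (unweighted) school mean scores."""
    by_school = {}
    for score, school in zip(scores, schools):
        by_school.setdefault(school, []).append(score)
    means = [mean(vals) for vals in by_school.values()]
    grand = mean(means)
    return sum((m - grand) ** 2 for m in means) / len(means)


def permutation_p_value(scores, schools, n_perm=2000, seed=0):
    """Share of random reassignments with a spread at least as large as observed."""
    rng = random.Random(seed)
    observed = between_school_spread(scores, schools)
    shuffled = list(scores)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between score and school
        if between_school_spread(shuffled, schools) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical observation scores for four teachers in each of three schools.
scores = [3.4, 3.5, 3.6, 3.3, 2.4, 2.5, 2.6, 2.3, 2.9, 3.0, 3.1, 2.8]
schools = ["X"] * 4 + ["Y"] * 4 + ["Z"] * 4
p_value = permutation_p_value(scores, schools)
```

A small p-value suggests the school-to-school differences are unlikely to be chance alone; the same approach could be repeated by district, grade, or subject.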
Useful Resources
Available at: http://www.metproject.org/resources.php
• Student surveys: Tripod survey and “Asking Students about Teaching Practitioner Brief”
• Roster Validation: Report by Battelle for Kids on ways to allow teachers to verify students in their class: “Identifying The Importance of Accurately Linking Instruction to Students to Determine Teacher Effectiveness”
• Software for Certifying Observers using Pre-Scored Videos: Certification engine from Empirical Education
Available at: http://www.gse.harvard.edu/ncte/resources/default.php
• Classroom Observation: Links to FFT, CLASS, etc., and webinars with six organizations currently supporting classroom observations
Additional examples of sites with useful resources:
• TNTP: http://tntp.org/ideas-and-innovations
• Pearson: http://educatoreffectiveness.pearsonassessments.com/
