### Using Remark Statistics to Evaluate Multiple

```
Using Statistics to Evaluate Multiple Choice Test Items
FACULTY DEVELOPMENT
PROFESSIONAL SERIES
OFFICE OF MEDICAL EDUCATION
TULANE UNIVERSITY SCHOOL OF MEDICINE
Objectives
• Define test reliability
• Interpret KR-20 statistic
• Evaluate item difficulty (p value)
• Define and interpret the point biserial correlation
• Evaluate distractor quality
• Differentiate among poor, fair, and good test items
What is Reliability?
• Test reliability is a measure of the accuracy, consistency, or precision of the test scores.
• Statistics:
   • Coefficient (Cronbach) Alpha – generally used for surveys or tests that have more than one correct answer
   • Kuder-Richardson Formula 20 (KR-20) – measures inter-item consistency, or how well your exam measures a single construct. Used for knowledge tests where items are scored correct/incorrect (dichotomous)
   • KR-21 – similar to KR-20 but underestimates the reliability of an exam if questions are of varying difficulty
Interpreting KR-20
• KR-20 statistic is influenced by:
   • Number of test items on the exam
   • Student performance on each item
   • Variance for the set of student test scores
• Range: 0.00 – 1.00
   • Values near 0.00 – weak relationship among items on the test
   • Values near 1.00 – strong relationship among items on the test
• Medical school exams should have a KR-20 of .70 or higher
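A sketch of the standard KR-20 formula, showing where the three influences listed above enter. The worked numbers are hypothetical and not taken from a real exam:
   KR-20 = (k / (k − 1)) × (1 − Σ p_i·q_i / σ²)
   k   = number of test items
   p_i = proportion of students answering item i correctly; q_i = 1 − p_i
   σ²  = variance of the students' total test scores
   Ex. k = 50, Σ p_i·q_i = 8.0, σ² = 36:
   KR-20 = (50/49) × (1 − 8.0/36) ≈ 1.02 × 0.78 ≈ .79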
Improving Reliability
Reliability can be improved by:
• Writing clear and simple directions
• Ensuring test items are clearly written and follow NBME guidelines for construction
• Assuring that test items match course objectives and content
• Adding test items; longer tests produce more reliable scores
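One common way to estimate the effect of adding items is the Spearman-Brown formula. It is not named in this presentation and is shown here only as an illustration; it assumes the added items are comparable in quality to the existing ones:
   r_new = (n × r) / (1 + (n − 1) × r)
   r = current reliability; n = factor by which the test is lengthened
   Ex. (hypothetical) doubling a test (n = 2) that has r = .60:
   r_new = (2 × .60) / (1 + .60) = 1.20 / 1.60 = .75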
Item Difficulty
• Item Difficulty (p value) – measure of the proportion of students who answered a test item correctly
• Range: 0.00 – 1.00
   Ex. A p value of .56 means that 56% of students answered the question correctly.
   A p value of 1.00 means that 100% of students answered the question correctly.
• For medical school tests where there is an emphasis on mastery, MOST items should have a p value of .70 or higher.
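The calculation behind the p value, as a minimal worked example (the class size and counts are hypothetical):
   p = (number of students answering correctly) / (number of students answering the item)
   Ex. 28 of 50 students answer correctly: p = 28 / 50 = .56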
What is a point biserial correlation?
The point biserial correlation:
• Measures test item discrimination
• Ranges from -1.00 to 1.00
• A positive point biserial indicates that students scoring high on the total exam answered a test item correctly more frequently than low-scoring students.
• A negative point biserial indicates that low-scoring students on the total test did better on a test item than high-scoring students.
As a general rule, a point biserial of ≥.20 is desirable.
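A sketch of the usual point biserial formula, with a hypothetical worked example. The symbols M1, M0, and s are introduced here for illustration and do not appear in the original slides:
   r_pb = ((M1 − M0) / s) × √(p × (1 − p))
   M1 = mean total score of students who answered the item correctly
   M0 = mean total score of students who answered the item incorrectly
   s  = standard deviation of all total scores (population form, dividing by N)
   p  = item difficulty (proportion answering correctly)
   Ex. M1 = 82, M0 = 74, s = 10, p = .80:
   r_pb = (8 / 10) × √(.80 × .20) = .80 × .40 = .32, which meets the ≥.20 guideline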
Distractor Analysis
• Addresses the performance of incorrect response options.
• Incorrect options should be plausible but incorrect.
• If no one chooses a particular option, the option is not contributing to the performance of the test item.
   • The presence of one or more implausible distractors can make the item artificially easier than it should be.
Point biserial analysis
• Items that are very easy or very difficult will have low ability to discriminate.
   • Such items are often needed to adequately sample course content and objectives.
• A negative point biserial suggests one of the following:
   • The item was keyed incorrectly.
   • The test item was poorly constructed or was misleading.
   • The content of the item was inadequately taught.
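One way to see why very easy or very difficult items discriminate poorly: the usual point biserial formula multiplies the standardized gap between the mean total scores of students who got the item right and those who got it wrong by √(p × (1 − p)), where p is the item difficulty. That factor shrinks as p approaches 0 or 1 (illustrative values only):
   p = .50  →  √(.50 × .50) = .50
   p = .90  →  √(.90 × .10) = .30
   p = .98  →  √(.98 × .02) = .14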
Test Item Analysis
Example 1
Interpretation:
• p value = 83.51
   • 83.51% of class answered the item correctly
• point biserial of .40
   • ≥.20 – high-scoring students were more likely to choose the correct answer
• all distractors chosen
• GOOD QUESTION
Test Item Analysis
Example 2
Interpretation:
• p value = 25.57
   • only 25.57% of class answered the item correctly
• point biserial = -0.14
   • <.20 – low-scoring students were more likely to choose the correct answer
• POOR QUESTION
Test Item Analysis
Example 3
Interpretation:
• p value = 97.73
   • 97.73% of class answered the item correctly
• point biserial = 0.08
   • <.20 – BUT almost all students answered correctly, so the item is unable to discriminate. This is okay if the item tests a concept all students are expected to know.
• FAIR QUESTION
Self-Assessment
Review the following test item statistics:
... this test item?
• The distractors were implausible and should be replaced.
• Low-scoring students got this item correct more often than high-scoring students.
• More than 10% of the class answered this question incorrectly.
• This test item showed high discriminative ability and should be retained.
Summary
REVIEW STEPS
• Evaluate KR-20
• Evaluate p value of each test item
• Evaluate point biserial of correct
• Evaluate the value of distractors
• Revise or rewrite items as needed