LG675_4 - University of Essex

LG675
Session 4: Reliability I
Sophia Skoufaki
[email protected]
8/2/2012

What does ‘reliability’ mean in the context of applied linguistic research?
  • Definition and examples

Which are the two broad categories of reliability tests?

How can we use SPSS to examine the reliability of a norm-referenced measure?
  • Work with typical scenarios

Reliability: broad definitions
• The degree to which a data-collection instrument (e.g., a language test, questionnaire) yields consistent results.
• The degree to which a person categorises linguistic output consistently (as compared to himself/herself or someone else).

Reliability in applied linguistic research: examples
a. A researcher who has created a vocabulary-knowledge test wants to see whether any questions in this test are inconsistent with the test as a whole (Gyllstad 2009).
b. A researcher who has collected data through a questionnaire she designed wants to see whether any questionnaire items are inconsistent with the questionnaire as a whole (Sasaki 1996).
c. A researcher wants to see whether she and another coder agree to a great extent in their coding of idiom-meaning guesses given by EFL learners as right or wrong (Skoufaki 2008).

Two kinds of reliability

Reliability of norm-referenced data-collection instruments

Reliability of criterion-referenced data-collection instruments, AKA ‘dependability’.

Classification of data-collection instruments according to the basis of grading
Data-collection instruments
• Norm-referenced
• Criterion-referenced

Norm-referenced

“Each student’s score on such a test is
interpreted relative to the scores of all other
students who took the test. Such comparisons
are usually done with reference to the concept of
the normal distribution …” (Brown 2005)

In the case of language tests, these tests assess
knowledge and skills not based on specific
content taught.
Criterion-referenced

“… the purpose of a criterion-referenced test is to make
a decision about whether an individual test taker has
achieved a pre-specified criterion…” (Fulcher 2010)

“The interpretation of scores on a CRT is considered
absolute in the sense that each student’s score is
meaningful without reference to the other students’
scores.” (Brown 2005)

In the case of language tests, these tests assess
knowledge and skills based on specific content taught.
How reliability is assessed in norm-referenced data-collection instruments
Norm-referenced reliability
• Test-retest
• Equivalent forms
• Internal consistency

Which reliability test we will use also depends on the nature of the data
• ratio
• interval
• ordinal
• nominal

Scoring scales

Nominal: Numbers are arbitrary; they
distinguish between groups of individuals
(e.g., gender, country of residence)

Ordinal: Numbers show greater or lesser
amount of something; they distinguish between
groups of individuals and they rank them
(e.g., students in a class can be ordered)
Scoring scales (cont.)

Interval: Numbers show a greater or lesser amount of something and the difference between adjacent numbers remains stable throughout the scale; numbers distinguish between groups of individuals, rank them, and show how large the difference is between two numbers
(e.g., in tests where people have to get a minimum score to pass)

Ratio: This scale contains a zero point, for cases which completely lack a characteristic; numbers do all the things that numbers in interval scales do and they include a zero point
(e.g., in length, time)

Assessing reliability through the test-retest or equivalent-forms approach
• The procedure for this kind of reliability test is to ask the same people to do the same test again (test-retest) or an equivalent version of this test (equivalent forms).
• Correlational or correlation-like statistics are used to see how similar the participants’ scores are across the two tests.

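As a rough illustration of the correlational approach described on this slide, the sketch below computes a Pearson correlation between two administrations of the same test. The scores are invented for the example (they are not the handout's data sets), and the function name `pearson_r` is only an illustrative choice. Pearson r is one possible statistic here; whether it suits a given data set depends on the scale and distribution of the scores.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the two means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each list.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical scores: five learners sit the same test twice.
first_sitting = [12, 15, 9, 18, 14]
second_sitting = [13, 14, 10, 17, 15]

test_retest_r = pearson_r(first_sitting, second_sitting)
print(round(test_retest_r, 3))
```

A value close to 1 would suggest that learners keep roughly the same relative standing across the two sittings.
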
SPSS: Testing test-retest reliability
• Open SPSS and input the data from Set 1 on page 5 of Phil’s ‘Simple statistical approaches to reliability and item analysis’ handout.
• Do this activity.
• Then do the activity with Sets 2 and 3.

Rater reliability: The degree to which
a) a rater rates test-takers’ performance consistently (intra-rater reliability) and
b) two or more raters who rate test-takers’ performance give ratings which agree among themselves (inter-rater agreement)

Ways of assessing internal-consistency reliability

Split-half reliability
• We split the test items in half. Then we do a correlation between the scores of the halves. Because our finding will indicate how reliable half our test is (not all of it) and the longer a test is, the higher its reliability, we need to adjust the finding. We use the Spearman-Brown prophecy formula for that.

Or

Statistic that compares the distribution of the scores that each item got with the distribution of the scores the whole test got
• E.g.: Cronbach’s α or Kuder-Richardson formula 20 or 21

In both cases, the higher the similarity found, the higher the internal-consistency reliability.
Cronbach’s α is the most frequently used internal-consistency reliability statistic.

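The two approaches on this slide can be sketched in a few lines of Python. This is a minimal illustration and not the SPSS procedure. The item scores are invented, the odd/even split is only one of several possible ways to halve a test, and the function names are hypothetical. The Spearman-Brown step uses the usual two-halves form, adjusted r = 2r / (1 + r), and Cronbach's α is computed as (k / (k - 1)) * (1 - sum of item variances / variance of total scores).

```python
from math import sqrt
from statistics import pvariance


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


def spearman_brown(half_r):
    """Adjust a half-test correlation to estimate full-length reliability."""
    return 2 * half_r / (1 + half_r)


def split_half_reliability(item_scores):
    """Correlate odd-item and even-item totals, then apply Spearman-Brown."""
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, ...
    return spearman_brown(pearson_r(odd_totals, even_totals))


def cronbach_alpha(item_scores):
    """Cronbach's alpha from a list of rows (one row of item scores per person)."""
    k = len(item_scores[0])
    # Population variance is used throughout; the n versus n-1 choice
    # cancels out because alpha depends only on a ratio of variances.
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 5 test-takers x 4 items, scored 1 (right) or 0 (wrong).
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

split_half = split_half_reliability(scores)
alpha = cronbach_alpha(scores)
print(round(split_half, 3), round(alpha, 3))
```

With so few items and people the two estimates can differ noticeably, which is one reason the choice of statistic should be reported alongside the value.
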
SPSS: Assessing internal-consistency reliability with Cronbach’s α
• This is an activity from Brown (2005). He split the scores from a cloze test into odd and even numbered ones, as shown in the table in your handout.
• Open the file ‘Brown_2005.sav’ in SPSS.
• Then click on Analyze...Scale...Reliability analysis....
• In the Model box, choose Alpha.
• Click on Statistics and tick Scale and Correlations.

Assessing intra-judge agreement or interjudge agreement between two judges
• When data are interval, correlations can be used (Pearson r if the data are normally distributed and Spearman rho if they are not).
• When there are more than two judges and the data are interval, Cronbach’s α can be used.
• When data are categorical, we can calculate agreement percentage (e.g., the two raters agreed 30% of the time) or Cohen’s Kappa. Kappa corrects for the chance agreement between judges. However, the agreement percentage is good enough in some studies and Kappa has been criticised (see Phil’s handout, pp. 14-17).

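To make the difference between agreement percentage and Cohen's Kappa concrete, here is a small sketch for two judges coding the same items into categories. The codings are invented for the example and the function names are hypothetical. Kappa is computed in its standard unweighted form, (observed agreement - chance agreement) / (1 - chance agreement), where chance agreement comes from each judge's own category proportions.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of items on which the two raters chose the same category."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty, equal-length lists of codings")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's Kappa for two raters (undefined if chance agreement is 1)."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: for each category, the product of the two raters'
    # proportions of using that category, summed over categories.
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical codings of ten learner answers as right or wrong.
coder_1 = ["right", "right", "wrong", "right", "wrong",
           "wrong", "right", "right", "wrong", "right"]
coder_2 = ["right", "wrong", "wrong", "right", "wrong",
           "right", "right", "right", "wrong", "right"]

agreement = percent_agreement(coder_1, coder_2)
kappa = cohens_kappa(coder_1, coder_2)
print(round(agreement, 2), round(kappa, 2))
```

In this invented example the judges agree on most items, yet Kappa comes out lower than the raw agreement because part of that agreement would be expected by chance.
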
SPSS: Assessing interjudge agreement with Cronbach’s α
• This is an activity from Larson-Hall (2010).
• Go to http://cw.routledge.com/textbooks/9780805861853/spss-data-sets.asp
• Download the file MunroDerwingMorton and open it in SPSS.
• Click on Analyze...Scale...Reliability analysis....
• In the Model box, choose Alpha.
• Click on Statistics and tick Scale, Scale if item deleted and Correlations.

SPSS: Assessing rater reliability through Cohen’s Kappa
• The file ‘Kappa data.sav’ contains the results of an error tagging task that I and a former colleague of mine performed on some paragraphs written by learners of English.
• Each number is an error category (e.g., 2=spelling error). There are 7 categories.
• For SPSS to understand what each row means, you should weight the cases by the ‘Count’ variable (Data...Weight Cases).

SPSS: Assessing rater reliability through Cohen’s Kappa (cont.)
• Go to Analyze...Descriptive Statistics...Crosstabs.
• Move one of the judge variables into the ‘Row(s)’ box and the other into the ‘Column(s)’ box.
• You shouldn’t do anything with the ‘Count’ variable.
• In ‘Statistics’ tick ‘Kappa’.

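The role of the ‘Count’ variable may be easier to see outside SPSS. In the sketch below each row holds a pair of categories (judge 1, judge 2) plus the number of times that pair occurred, which is what weighting by a count variable expresses. The rows are invented (three categories, not the seven in ‘Kappa data.sav’), and the function names are hypothetical. The code builds the contingency table and then computes unweighted Kappa from it.

```python
def crosstab(rows, categories):
    """Build a judge-1 by judge-2 table of counts from (judge1, judge2, count) rows."""
    index = {category: i for i, category in enumerate(categories)}
    table = [[0] * len(categories) for _ in categories]
    for judge1, judge2, count in rows:
        # Each row contributes 'count' cases to one cell of the table.
        table[index[judge1]][index[judge2]] += count
    return table


def kappa_from_table(table):
    """Unweighted Cohen's Kappa from a square contingency table of counts."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Observed agreement: cases on the diagonal, where both judges agree.
    p_observed = sum(table[i][i] for i in range(k)) / n
    # Chance agreement from the two judges' marginal totals.
    p_chance = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical rows: (category given by judge 1, category given by judge 2, count).
weighted_rows = [
    (1, 1, 20), (1, 2, 5),
    (2, 1, 3), (2, 2, 15), (2, 3, 2),
    (3, 2, 4), (3, 3, 11),
]

table = crosstab(weighted_rows, categories=[1, 2, 3])
kappa = kappa_from_table(table)
for row in table:
    print(row)
print(round(kappa, 3))
```

The printed table has the same layout as a judge-by-judge contingency table, so its cells are the raw numbers one would type into an online Kappa calculator.
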
Kappa test
Kappa at http://faculty.vassar.edu/lowry/kappa.html
• Go to the website above and select the number of categories in the data.
• In the table, enter the raw numbers as they appear in the SPSS contingency table.
• Click Calculate.
• The result is not only the same Kappa value as in SPSS, but also three more.
• Phil recommends using one of the Kappas which have a possible maximum value (see p. 18 of his handout).

Next week
• Item analysis for norm-referenced measures
• Reliability tests for criterion-referenced measures
• Validity tests

References

Brown, J.D. 2005. Testing in language programs: a comprehensive guide to English language assessment. New York: McGraw Hill.
Fulcher, G. 2010. Practical language testing. London: Hodder Education.
Gyllstad, H. 2009. Designing and evaluating tests of receptive collocation knowledge: COLLEX and COLLMATCH. In Barfield, A. and Gyllstad, H. (eds.) Researching Collocations in Another Language: Multiple Interpretations (pp. 153-170). London: Palgrave Macmillan.
Larson-Hall, J. 2010. A guide to doing statistics in second language research using SPSS. London: Routledge.
Sasaki, C. L. 1996. Teacher preferences of student behavior in Japan. JALT Journal 18 (2), 229-239.
Scholfield, P. 2011. Simple statistical approaches to reliability and item analysis. LG675 Handout. University of Essex.
Skoufaki, S. 2008. Investigating the source of idiom transparency intuitions. Metaphor and Symbol 24(1), 20-41.

Suggested readings

On the meaning of ‘reliability’ (particularly in relation to language
testing)
Bachman, L.F. and Palmer, A.S. 1996. Language Testing in Practice.
Oxford: Oxford University Press. (pp.19-21)
Bachman, L.F. 2004. Statistical Analyses for Language Assessment.
Cambridge University Press. (Chapter 5)
Brown, J.D. 2005. Testing in language programs: a comprehensive
guide to English language assessment. New York: McGraw Hill.
Fulcher, G. 2010. Practical language testing. London: Hodder
Education. (pp.46-7)
Hughes, A. 2003. Testing for Language Teachers. (2nd ed.) Cambridge:
Cambridge University Press. (pp. 36-44)

On the statistics used to assess language test reliability
Bachman, L.F. 2004. Statistical Analyses for Language Assessment.
Cambridge University Press. (chapter 5)
Brown, J.D. 2005. Testing in language programs: a comprehensive
guide to English language assessment. New York: McGraw Hill.
(chapter 8)
Suggested readings (cont.)

Brown, J.D. 1997. Reliability of surveys. Shiken: JALT
Testing & Evaluation SIG Newsletter 1 (2), 18-21.
Field, A. 2009. Discovering statistics using SPSS. (3rd
ed.) London: Sage. (sections 17.9, 17.10)
Fulcher, G. 2010. Practical language testing. London:
Hodder Education. (pp.47-52)
Howell, D.C. 2007. Statistical methods for psychology.
Calif.: Wadsworth. (pp. 165-166)
Larson-Hall, J. 2010. A guide to doing statistics in
second language research using SPSS. London:
Routledge. (sections 6.4, 6.5.4, 6.5.5)

Homework
• The file ‘P-FP Sophia Sumei.xls’ contains the number of pauses (unfilled, filled, and total) in some spoken samples of learners of English according to my and a former colleague’s judgment.
• Which of the aforementioned statistical tests of interjudge agreement seem appropriate for this kind of data?
• What else would you need to find out about the data in order to decide which test is the most appropriate?
