Three lessons from the historiography of language testing

Centre for Research in English Language Learning and Assessment
Cyril J. Weir
Why is a knowledge of the past important for language
“We consider past developments in the field in order to have a richer,
more accurate, and complete understanding of our field.
It enables us to understand:
• the state of our knowledge at any given point in time
• the causes that have prompted change
• the time it has taken for ideas to change
• the links to developments in other disciplines.”
Micheline Chalhoub-Deville
“Familiarity with how a construct was measured in the past speaks to
the importance of humility and maturity as we realize a path has
been trodden before us.”
Lynda Taylor
“If I have seen further, it is by standing on ye sholders
of giants”
Isaac Newton in a Letter to Robert Hooke (15 February 1676)
“Bernard of Chartres used to compare us to
[puny] dwarfs perched on the shoulders of
giants. He pointed out that we see more and
farther than our predecessors, not because
we have keener vision or greater height, but
because we are lifted up and borne aloft on
their gigantic stature.”
John of Salisbury (1159) Metalogicon (quoted in Wikipedia)
Temporal Awareness
Bernard Spolsky reminds us of the need to ensure temporal awareness:
…pride of place for a direct measure of oral language
proficiency is usually granted to the oral interview created
by the Foreign Service Institute (FSI) of the US State
Department developed originally between 1952-56…
Spolsky, B. (1990: 158). Oral examination: an historical note. Language Testing, 7 (2)
Jack Roach
Spolsky continues: “It turns out to be the case, however, that many of
the important issues the FSI linguists had to struggle with, especially
those concerning reliability, had been anticipated and intelligently
ventilated in a paper written some years before the FSI activity started,
printed and circulated internally among examiners of the University of
Cambridge Local Examinations Syndicate (UCLES)”.
Some problems of oral examinations in modern languages: an experimental approach based on
the Cambridge Examinations in English for foreign students (J O Roach 1945)
Lynda Taylor suggested Roach's work in the 1930s and 1940s on issues in
speaking assessment has much to teach the LT world. Cambridge had been
conducting oral tests for 40 years already by the time of the FSI development
(an oral component including conversation was part of the original CPE in 1913).
“Thanks in large part to the influence of Roach (Assistant Secretary to the
Syndicate from 1925 to 1945), Cambridge was already well-sighted on many
of the key issues, e.g. face-to-face format to allow for reciprocal interaction,
multiple task design, scales with some sort of descriptors attached, rater
training and standardization.”
Glenn Fulcher shared this earlier example with me:
"The earliest record of an attempt to assess second language speaking dates to the
first few years after Rev. George Fisher became Headmaster of the Greenwich Royal
Hospital School in 1834. In order to improve and record academic achievement, he
instituted a “Scale Book”, which recorded performance on a scale of 1 to 5 with
quarter intervals. A scale was created for French as a second language, with typical
speaking prompts to which boys would be expected to respond at each level...”
Chadwick, E. (1864). Statistics of educational results. Museum: A Quarterly Magazine of Education,
Literature and Science, 3, 479-484.
Cadenhead, K. and Robinson, R. (1987). Fisher’s “Scale Book”: An Early Attempt at Educational
Measurement. Educational Measurement: Issues and Practice 6(4), 15 – 18.
Edward L. Thorndike’s standardised scales
Barry O’Sullivan drew my attention to Thorndike’s work in the early 20th century on the
creation of standardized scales.
Instead of estimating a scale based simply on connoisseurship, as was often the
case in the United Kingdom, Thorndike took a large sample of handwritten scripts
and used a large number of teachers to rank these scripts in order.
From the data he created a scale upon which he placed each script. He then
provided a set of exemplar scripts at various levels to operationalise the scale
from an absolute zero base, with scale points defined and their distances
Teachers were asked to compare their students’ scripts with those samples on the
scale and identify the closest match to give the level.
F.Y. Edgeworth 1888
• Weir (1983) noted that in the C19th Edgeworth (1888) had observed
that one-third of scripts marked by different examiners in the British
civil service examinations received a different mark and, further, that in
a re-examination of scripts by the same examiner one-seventh received
a different mark.
• Edgeworth offered two solutions to these problems in scoring validity:
increasing the number of components in an exam and multiple
marking. He argued the more components that were aggregated, the
more likely that individual marker errors would be eliminated. He also
stressed that the more markers that were involved in examining a
script, the more likely it was that a ‘true value’ would emerge.
Edgeworth, F.Y. (1888). The statistics of examinations. Journal of the Royal Statistical Society, LI,
Edgeworth, F. Y. (1890). The element of chance in competitive examinations. Journal of the Royal
Statistical Society, 53, 460-75 and 644-63.
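Edgeworth's case for multiple marking can be illustrated with a small sketch. It is a modern restatement rather than his own calculation, and it rests on assumptions that are not in the sources above: marker errors are independent, unbiased and normally distributed, and the figures (a 'true value' of 60 and a marker standard deviation of 6) are hypothetical. Under those assumptions the spread of the averaged mark shrinks with the square root of the number of markers.

```python
import random
import statistics


def standard_error(marker_sd: float, n_markers: int) -> float:
    """Expected spread of the averaged mark, assuming independent marker errors."""
    return marker_sd / n_markers ** 0.5


def spread_of_averages(true_mark: float, marker_sd: float, n_markers: int,
                       trials: int = 2000, seed: int = 0) -> float:
    """Simulate many scripts, each marked by n_markers, and measure how far
    the averaged marks scatter around the hypothetical true mark."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    averages = []
    for _ in range(trials):
        marks = [rng.gauss(true_mark, marker_sd) for _ in range(n_markers)]
        averages.append(statistics.fmean(marks))
    return statistics.pstdev(averages)


if __name__ == "__main__":
    for n in (1, 2, 4, 9):
        print(n, round(standard_error(6.0, n), 2),
              round(spread_of_averages(60.0, 6.0, n), 2))
```

The same arithmetic applies to his other remedy, aggregating more components in an exam, provided the errors on each component are likewise assumed to be independent.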
First Lesson
There is nothing new under the sun, but there
are lots of old things we don't know.
Ambrose Bierce, The Devil's Dictionary
[which we should know, or face the ignominy of
our work being seen as temporally and/or
geographically challenged…]
Part 2
Blind monks examining an elephant
Hanabusa Itcho 1652-1724
A while back Fred Davidson brought the following
apposite quotation to the attention of the L-Test list:
"Despite some exceptional instances, the first logical
step in the development of psychometrics seems to
be to devise a series of instruments each of which
measures something accurately, regardless of what
that something may be; and the second, and
following step, to discover what that something is."
O'Connor, J. 1934. Psychometrics: A Study of Psychological
Measurements. Cambridge, MA: Harvard University Press, p. xvi.
Construct: the sine qua non of language testing
The more fully we are able to describe the construct we
are attempting to measure at the a priori stage, the more
meaningful might be the statistical procedures
contributing to construct validation that can subsequently
be applied to the results on the test.
Statistical data do not in themselves generate conceptual
labels. We can never escape from the need to define
what is being measured, just as we are obliged to
investigate how adequate a test is in operation.
Measured Constructs 1913-2012
• Attempts at explicit construct definition are a
relatively recent phenomenon.
• In the first part of the C20th seemingly little overt
attention was paid to the underlying
construct(s) in language tests.
• It is only really in the 1960s that it becomes an
explicit concern in the work of language
testers such as J.B. Carroll, Bernard Spolsky
and Alan Davies.
The influence of ideas from language teaching
on testing 1913-2012
Changing priorities in approaches to language
learning/teaching obtaining at various stages in the
C20th had an influence on language testing in the
UK. These included:
• the Grammar Translation or Traditional Method, based
upon the method used for the teaching of classical languages
• the direct method, with its focus on spoken language,
promoted in continental Europe for the formal
education system
• the structuralist approach
• the communicative approach, with its focus on the needs
of learners to use language for real life communication
Cambridge Certificate of Proficiency in English 1913
CPE in 1913 can be seen as a hybrid creation which drew on a number
of legacies (academic and social) from the past concerning what was to
be taught and how:
• (i) the Grammar Translation Approach, reflected in the inclusion of
translation tasks and questions on English grammar. “The prime
object of scholastic education is the training of the mental faculties”
(R.W. Hiley 1887 Journal of Education Vol IX: 308);
• (ii) the Reform Movement (Viëtor 1882, Passy 1899, Jespersen
1904), reflected in the inclusion of a phonetics paper and an Oral
paper. The assistance of modern ideas from phonetics allowed for a
new pedagogical approach rooted in the spoken language.
Henry Sweet (1845-1912): a champion of the oral approach
Sweet’s (1899) The Practical Study of Languages: A Guide for Teachers and
Learners is regarded by Howatt (1984:202) as one of the best language
teaching methodology books ever written: “… unsurpassed in the history of
linguistic pedagogy”.
The papers in CPE 1913 correspond closely to the chapters in his book.
Lesson 2
Alan Davies wrote in the first issue of the journal Language Testing:
“…in the end no empirical study can improve a test’s
validity... What is most important is the preliminary
thinking and the preliminary analysis as to the
nature of the language learning we aim to capture.”
Davies, A. (1984). Validating three tests of English language proficiency.
Language Testing, 1 (1), 50-69.
Part 3
The wider picture
"People make their own history, but they do not
make it as they please; they do not make it
under self-selected circumstances, but under
circumstances existing already, given and
transmitted from the past.”
Karl Marx, The Eighteenth Brumaire of Louis Bonaparte, Part 1
Different gods, different mountain tops
Substantive differences grew between the UK and
the USA in their approaches to testing from 1913 to 1970.
In the US the predominant focus was on scoring
validity, in particular the psychometric qualities
of a test, with a predilection for MCQ.
In the UK, for example in Cambridge English
language examinations, there was a far greater
concern with content validity: a concern with the
how in the US as against the what in the UK.
An important reason for the Atlantic rift can
be found in the differing socio-economic
contexts prevailing in Britain and the USA in
the early C20th.
The compelling need to produce tests on an
industrial scale in the US strongly influenced
testing organizations in the direction of
objective multiple choice methods at a very
early stage.
Population explosion
Resnick (1982: 177, 187) describes how in
US schools “the need to identify those
who had the least probability of being
able to carry on normal work for their
age, was stimulated by the
demographic explosion…
In 1870 there were about 80,000
students … by 1910 there were
Allocating wartime jobs in the military
Glenn Fulcher (1999: 390) describes how a serious logistical
challenge faced the army in WW1. This was to result in the
increased use of objective test formats in intelligence tests.
Resnick (1982:182) records the successful placement in
appropriate jobs of 1.7 million army recruits mobilised in 1917-18
through the administration of the US Army's Alpha and Beta
tests, following Robert Yerkes’ successful advocacy of these.
Harold Ormsby recently described this on the L-Test list as “… a
significant moment in the history of mass testing.” Fred
Davidson saw it as “‘proof’ that large-scale normative
psychometric testing could work”.
In short, the pressure of numbers was one of the
main factors which helped drive US testing in
schools and in the military in the direction of
psychometrically driven tests, especially MCQ.
In the UK in 1913 there were only 3 candidates for
the Cambridge Certificate of Proficiency in English
(CPE), 15 in 1931 and 750 by 1939. A cottage
industry as against an industrial behemoth.
Samelson concludes: “The multiple choice
test – efficient, quantitative, objective,
capable of sampling wide areas of subject
matter and easily generating data for
complicated statistical analysis – had
become the symbol or synonym of
American Education.”
Samelson, F. (1987). Was early mental testing (a) racist inspired, (b) objective science, (c) a technology
for democracy, (d) the origin of multiple-choice exams, (e) none of the above? (Mark the RIGHT
answer). In M. M. Sokal (Ed.), Psychological Testing and American Society, 1890-1930. New
Brunswick and London: Rutgers University Press.
“Perfidious Albion”: English as an
instrument of UK foreign policy
Pennycook (1994:134) viewed the attempt to
spread English around the globe in the C20th as
part of a wider focus on cultural and linguistic
expansion in preference to the earlier material
exploitation by the western powers.
A “search for new means of social and political
control in the world” saw “the prodigious
spread” of English.
Spreading English around the world
The Cambridge examination [CPE]…. was seized on
by Jack Roach when he joined the Syndicate after
the First World War for both ideological and
personal reasons. He thought an international test
would realize his ‘modest ambition of making
English the world language’ (Roach 1956: 2) and he
saw a role for his own activities…
Roach, in 1929, hoped for ‘the reaffirmation and
spread of British influence’
Spolsky (2004: 305)
Propagation by simplification
Richard Smith (2004:229-31) identified a politically
motivated focus on lexical content in British ELT from
the 1930s until the end of WW2. He describes how:
“Discussions in the emerging UK ‘centre’ from the
mid-1930s until the end of World War II had focused
quite explicitly and narrowly on needs to propagate
English as a world language via simplification of the
lexical contents of instruction”
LCE 1939
• 1939 saw the introduction of the Cambridge Lower Certificate in
English (LCE, which became FCE in 1975).
• The UCLES Regulations for the 1939 LCE examination papers
(December 1938) reveal a lot about LCE constructs; in particular
the references to ‘simplified’ texts, ‘simple English’ and
‘relatively limited vocabulary’.
• Developing a test at a lower level than CPE with a large potential
candidature fitted well with the expansionist rationale for ELT
that Smith described and Roach subscribed to.
• 1941 saw the signing of an agreement between Cambridge and
the British Council to spread the former’s English examinations
round the world.
Socio-Economic Forces: the obligation to define
multiple levels of proficiency
Progress towards a European Economic Community from the 1970s onwards
brought with it a felt need on the part of intergovernmental agencies in
Europe to define language teaching and learning goals more precisely and to
make a start on delineating the stages of progression across the language
proficiency spectrum.
The result was a more granular approach to construct definition at different
proficiency levels in Europe, which does not appear to have been a major
concern for testers in the United States. It resulted, for example, in additional
Cambridge English language examinations across further proficiency levels
from 1980 onwards (PET at the B1 CEFR level in 1980, CAE at the CEFR C1
level in 1991 and KET at the CEFR A2 level in 1994).
With the need for granularity came the need to establish specific criterial
contextual and cognitive parameters to differentiate the six proficiency levels
as per the CEFR.
Third Lesson
The last word goes to Bernard Spolsky (1990a:159)
who impresses on us the need to take account of:
“…external, non theoretical, institutional social forces,
that on deeper analysis, often turn out to be a much
more powerful explanation of actual practice…
A clearer view of the history of the field will emerge
once we are willing to look carefully at not just the
ideas that underlie it, but also the institutional, social
and economic situations in which they are realized.”
