### “A chi-square test showed that…” – or did it really?

```«A chi-square test showed that...»
– or did it really?
Bård Uri Jensen
http://privat.hihm.no/buj/
Allowing [statistical software] to do our thinking
is a sure recipe for disaster.
(Good & Hardin, 2012, p. xi)
«Simple» statistical tests
• chi-square (χ²) test
• t-test
Statistical hypothesis testing
1. Formulate a hypothesis
   - E.g. in L2 Norwegian, Vietnamese speakers make more TENSE errors than Somali speakers.
2. Formulate a null hypothesis
   - Vietnamese and Somali speakers have the same rate of TENSE errors.
3. «Disprove» the null hypothesis = demonstrate that the data are unlikely under it
   - E.g. less than 5% chance of results at least this extreme if the null hypothesis were true
   - = «Significance»
• We choose α according to what we consider an acceptable risk of a false positive conclusion
   - Often 5% in linguistic research
Conditions of use
• Independent observations
  - chi-square test
  - t-test
• Parametric assumptions
  - t-test
• The dangers of repeated testing
  - any test
A simple example from ornithology
A simple example from corpus linguistics
• An important condition of use for
  - chi-squared test
  - t-test
• The observations should be independent.
  - The observations should be of different individuals.
«Chi-square is a much-abused test in second language
research studies, and often one of its assumptions (that of
independence of data) is violated as a matter of course.»
Larson-Hall (2010, p.206)
Example 1:
Chi-squared test, non-independent observations
• Blom & Paradis 2013
  - Journal of Speech, Language, and Hearing Research
  - On past tense production in L2 children with language impairment
• 48 children with English as L2
• Overregularization of past tense
  - Hypothesis: Less common in verb stems ending in /d/ or /t/

              overregularization   zero marking
  d# or t#    16                   69
  others      42                   98

• χ²(1) = 3.45, p (one-sided) = 0.032
• Problem: n = 85 + 140 = 225 observations, but only N = 48 children
• Observations are not independent, so the result is invalid.
• Solution A:
  - Pick just one observation from each author/speaker
• “To exclude the author as one more relevant factor, the database
was cleaned so that there is only one example for each verb
from any single author.”
Sokolova 2012, p. 94
• Solution B:
  - Calculate average values for each informant
  - Use the average values as independent observations
  - Test significance with an appropriate test, e.g. t-test or U-test
  - Gujord 2013
• Both these solutions might require a larger corpus!
• «Solution» C:
  - Alter the research question
  - Danckaert 2011
Example 2:
T-test, non-independent observations
• Klavan 2012
  - PhD thesis from Tartu University
  - Investigation of adposition ‘peal’ and adessive case
• 450 observations of each, from 2 corpora
• t = 8.02, p < 0.001
• Conclusion: adessive phrases are longer than ‘peal’-phrases
• Problem: Observations are not independent.
• The conclusion is invalid.
Example 3:
T-test, non-normal populations
• Hunter (2011, p. 48)
  - PhD thesis from Birmingham University
  - On grammaticality judgements by L2 students
• Conclusion:
  - the accuracy (max. = 1) for the teacher group (M = .98, SD = .14)
    was significantly higher than the student group (M = .64, SD = .49),
    t(1) = 4.9, p < .001.
• Problem:
  - Mean = 0.98, maximum value = 1
  - Standard deviation = 0.14
• The distribution cannot possibly be normal: with the mean this close to
  the maximum, a normal curve with SD 0.14 would put a large share of
  scores above 1.
• The result is invalid.
Example 4
Repeated testing
• Leedham 2011
  - PhD thesis, The Open University
  - Features in the writing of Chinese students in UK universities
• Conclusion:
  - There are differences in frequencies of certain phrases
    between 3rd year students and younger students
• Problem:
  - Repeated testing without adjusting the probability values
• Some of the results are not valid.
Moral
There are no simple tests.
1. You should understand the conditions of the test.
2. You should take the conditions into account.
3. You should document properly
   - how you perform the test,
   - what numbers you put into it,
   - how the conditions are met.
«A chi-square test showed that the difference is significant.»
Is it really that important?
• «[C]ompared to other social sciences (e.g., psychology,
communication, sociology, anthropology, …) or branches of
linguistics (e.g., psycholinguistics, phonetics, sociolinguistics…),
most of corpus linguistics has paradoxically only begun to
develop this methodological awareness.»
Gries (forthcoming, p.1)
• «It has become increasingly apparent over a period of several
years that psychologists, taken in the aggregate, employ the
chi-square test incorrectly.»
Lewis and Burke (1949)
Whose responsibility is it?
«Corpus linguistics needs to ‘catch up’ [...]»
Gries (forthcoming, p.1)
References (http://privat.hihm.no/buj)
Boneau, A. C. (1960). The effects of violations of assumptions underlying the t test. Psychological Bulletin, 57(1), 49-64.
Good, P. I. & Hardin, J. W. (2012). Common errors in statistics (and how to avoid them). Hoboken: John Wiley.
Gries, S. (forthcoming). Quantitative designs and statistical techniques. http://www.linguistics.ucsb.edu/faculty/stgries/research/InProgr_STG_QuantDesAndMethCorpLing_CUPHb.pdf
Larson-Hall, J. (2010). A Guide to Doing Statistics in Second Language Research Using SPSS. New York: Routledge.
Lewis, D. & Burke, C. J. (1949). The use and misuse of the chi-square test. Psychological Bulletin, 46(6), 433-489.
Blom & Paradis (2013). Past Tense Production by English Second Language Learners With and Without Language Impairment. Journal of Speech, Language, and Hearing Research, 56, 281-294.
Danckaert, L. (2011). On the left periphery of Latin embedded clauses. Ph.D. thesis. University of Gent.
Gujord, A. H. (2013). Grammatical encoding of past time in L2 Norwegian: The roles of L1 influence and verb semantics. Ph.D. thesis. University of Bergen.
Hunter, J. D. (2011). A multi-method investigation of the effectiveness and utility of delayed corrective feedback in second-language oral production. Ph.D. thesis. University of Birmingham.
Klavan, J. (2012). Evidence in linguistics: corpus-linguistic and experimental methods for studying grammatical synonymy. Ph.D. thesis. University of Tartu.
Leedham, M. (2011). A corpus-driven study of features of Chinese students’ undergraduate writing in UK universities. Ph.D. thesis. The Open University.
Sokolova, S. (2012). Asymmetries in Linguistic Construal: Russian Prefixes and the Locative Alternation. Ph.D. thesis. University of Tromsø.
```
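
The figures reported in Example 1 can be checked by hand. The sketch below uses only the four counts shown on the slide and computes a Pearson chi-square without continuity correction; that the original authors used exactly this variant is an assumption, although the numbers agree. The function name `pearson_chi_square` is illustrative only.

```python
import math

# Counts from Example 1 as shown on the slide.
# Rows = verb stem type, columns = (overregularization, zero marking).
table = [
    [16, 69],  # stems ending in d# or t#
    [42, 98],  # other stems
]

def pearson_chi_square(table):
    """Return (chi-square statistic, total n), no continuity correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

chi2, n = pearson_chi_square(table)

# For df = 1, the upper-tail chi-square probability is erfc(sqrt(chi2 / 2)).
p_two_sided = math.erfc(math.sqrt(chi2 / 2))
p_one_sided = p_two_sided / 2

print(f"chi2(1) = {chi2:.2f}, n = {n}")
print(f"p (two-sided) = {p_two_sided:.3f}, p (one-sided) = {p_one_sided:.3f}")
```

The arithmetic matches the reported χ²(1) = 3.45 and one-sided p = 0.032, but only by treating all 225 tokens as independent observations. With 48 children contributing those tokens, that is the assumption the slide calls into question.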
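
Solution B (one average per informant, then a test on those averages) might be sketched as follows. Everything in the data is hypothetical: the informant IDs, the groups `"A"` and `"B"`, and the error codes are invented for illustration and come from none of the cited studies. The helper names `per_informant_rates` and `welch_t` are likewise assumptions. The sketch computes Welch's t statistic and its degrees of freedom only; the p-value would then be looked up from a t distribution, or a U-test could be used instead, as the slide suggests.

```python
import math
from collections import defaultdict
from statistics import mean, variance

# Hypothetical token-level records: (informant_id, group, is_error).
records = [
    ("v1", "A", 1), ("v1", "A", 0), ("v1", "A", 1), ("v1", "A", 1),
    ("v2", "A", 0), ("v2", "A", 1),
    ("v3", "A", 1), ("v3", "A", 1), ("v3", "A", 0),
    ("s1", "B", 0), ("s1", "B", 0), ("s1", "B", 1), ("s1", "B", 0),
    ("s2", "B", 0), ("s2", "B", 1),
    ("s3", "B", 0), ("s3", "B", 0), ("s3", "B", 0),
]

def per_informant_rates(records):
    """Collapse tokens to one error rate per informant, keyed by group."""
    tokens = defaultdict(list)
    group_of = {}
    for informant, group, is_error in records:
        tokens[informant].append(is_error)
        group_of[informant] = group
    rates = defaultdict(list)
    for informant, values in tokens.items():
        rates[group_of[informant]].append(mean(values))
    return dict(rates)

def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

rates = per_informant_rates(records)
t, df = welch_t(rates["A"], rates["B"])
print(f"informants per group: {len(rates['A'])} and {len(rates['B'])}")
print(f"Welch t = {t:.2f}, df = {df:.1f}")
```

The sample size that enters the test is now the number of informants, not the number of tokens, which is why this approach may call for a larger corpus. Per-informant rates are bounded between 0 and 1, so the parametric assumptions of the t-test still need checking.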
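
The objection in Example 3 can be made concrete with a quick calculation. The sketch below takes the teacher group's reported mean and standard deviation and asks how much of a normal distribution with those parameters would lie above the maximum possible score. It uses only the values quoted on the slide.

```python
from statistics import NormalDist

# Values reported for the teacher group in Example 3.
mean_score, sd, maximum = 0.98, 0.14, 1.0

# Share of a normal distribution with this mean and SD that would
# fall above the largest score that can actually occur.
share_above_max = 1 - NormalDist(mu=mean_score, sigma=sd).cdf(maximum)

print(f"Share of N({mean_score}, {sd}) above {maximum}: {share_above_max:.1%}")
```

Roughly 44% of such a normal curve would sit above 1, a region where no observation can fall, so the scores must be strongly skewed and piled up at the ceiling.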
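
The danger of repeated testing raised in Example 4 follows from simple arithmetic. The sketch below shows how the chance of at least one false positive grows with the number of tests, assuming the tests are independent, and applies a Bonferroni adjustment as one common correction. The p-values are hypothetical and taken from none of the cited studies; the function names are illustrative.

```python
def familywise_error_rate(alpha, k):
    """Chance of at least one false positive in k independent tests."""
    return 1 - (1 - alpha) ** k

def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]

alpha = 0.05
for k in (1, 5, 20):
    print(f"{k:2d} tests: familywise error rate = {familywise_error_rate(alpha, k):.3f}")

# Hypothetical unadjusted p-values from five tests on the same data.
hypothetical_p = [0.004, 0.020, 0.030, 0.045, 0.300]
adjusted = bonferroni(hypothetical_p)

for raw, adj in zip(hypothetical_p, adjusted):
    verdict = "significant" if adj < alpha else "not significant"
    print(f"p = {raw:.3f} -> adjusted p = {adj:.3f} ({verdict})")
```

In this made-up set, four of the five unadjusted p-values fall below 0.05, but only one survives the adjustment. Bonferroni is conservative; it serves here only to illustrate why unadjusted results from many tests cannot all be trusted.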