### PPT slides

```Current Developments in
Quantitative Research
Methods
LOT Winter School
January 2014
Luke Plonsky
Welcome & Introductions
Course Introduction
 Methodological reform (revolution?) taking place
 Goal: more accurately inform theory, practice, and future research
Theory
Research
Practice
 Content objectives: conceptual and practical (but mostly
conceptual)
 Inform participants’ current and future research efforts
 Motivate future inquiry with a methodological focus
 Not stats-heavy, technical; assumed: basic knowledge of
descriptive and inferential statistics (e.g., M, SD, t test, ANOVA)
 Examples mostly from second language (L2) research
 Lecture, all-group discussion, and small-group discussion → ask
Qs at any time!
Course Overview
 Monday/today: Statistical power, effect sizes, and
fallacies of statistical significance
 Tuesday: Meta-analysis and the synthetic approach
 Wednesday: Assessing methodological quality
 Thursday: Replication research
 Friday: Data transparency, reporting practices, and
visualization techniques
Statistical power, effect sizes,
and fallacies of statistical
significance
Luke Plonsky
Current Developments in
Quantitative Research Methods
Day 1
Review of Common Stats: Comparing Means
[Figure: mean scores (DV) by groups (IV), with t test and ANOVA comparisons labeled; from Bialystok & Miller (1999)]
Review of Common Stats: Correlations
 Question: What is the relationship between two
(continuous) variables?
 Positive, negative, curvilinear
 Strong, weak, moderate, none
from DeKeyser
(2000)
A Model of Research
Conduct a study
(e.g., the effects of A on B)
p < 0.05
p > 0.05
Important finding / Get published!
Modify relevant theory, research, practice
Trash
p Values
(Another quick review)
Q: Wait, real quick: What’s a p value?
A1: The probability of obtaining results at least as extreme as those observed
(e.g., differences between groups; relationship between variables) given NO
difference between groups / no relationship between variables
A2: NOT an indication of the magnitude, importance, direction, or
replicability of an effect/relationship
WHAT WE REALLY WANT TO KNOW!
 Also: Observed p values vary as a function of sample size (N), effect size
(e.g., Cohen’s d), and variance.
OK, on to the Controversy
 60+ years and 400+ articles (e.g.,
Schmidt, 1996; Thompson, 2001)
“The almost universal reliance on merely
refuting the null hypothesis as the standard
method for corroborating substantive theories …
is a terrible mistake, is basically unsound, poor
scientific strategy, and one of the worst things
that ever happened in the history of psychology”
(Meehl, 1967, p. 72).
 APA Task Force on Statistical
Inference (Wilkinson & TFSI, 1999)
 AL: strict (dogmatic?) adherence to
NHST; very little discussion until
recently (Crookes, 1991; Ellis, 2006; Larson-Hall, 2010; Lazaraton,
1991; Nassaji, 2012; Norris, 2013; Norris & Ortega, 2000, 2006; Oswald &
Plonsky, 2010; Plonsky, 2011, 2012, 2013; Plonsky & Gass, 2011)
http://oak.ucc.nau.edu/ldp3/bib_nhst.html
(Anderson et al., 2000)
Wilkinson & TFSI (1999)
 Purpose: “to initiate discussion in the field about changes in
current practices of data analysis and reporting”
 General recommendations: be transparent; calculate power
a priori; inspect data descriptively and visually; simpler
analyses are best
 Specifics: report exact p values; report (contextualized) ESs
for all tests; CIs
Main arguments against NHST?
NHST is Unreliable
 The effects of A and B are always different—in some decimal
place—for any A and B. Thus asking ‘are the effects different?’
is foolish (Tukey, 1991, p. 100).
Study  N1   N2   M1 (SD1)  M2 (SD2)  p       d
1      5    5    15 (3)    18 (4)    .2265   0.85
2      ↑15  ↑15  15 (3)    18 (4)    ↓.0276  0.85
3      ↑45  ↑45  15 (3)    18 (4)    ↓.0001  0.85
The (nil) hypothesis that
d = 0 is (almost) always
false! (Cohen, 1994)
NHST is Unreliable (Cont’d)
 Same goes for p values based on correlations
 Remember: r = .30 = 0.30 = 0.30
p = .05
NHST is Unreliable (Cont’d)
“[with NHST] … tired researchers, having collected data
on hundreds of subjects, then conduct a statistical test to
evaluate whether there were a lot of subjects, which the
researchers already know, because they collected the
data and know they are tired.”
Thompson, 1992, p. 436
NHST is Crude and Uninformative
 Continuous data → yes/no dichotomy
 p values say nothing about:
 Replicability
 Theoretical or practical importance
 Magnitude of effects
 p > .05 ≠ zero effect size: The absence of evidence for differences is not
evidence for equivalence (Kline, 2004, p. 67)
 Large p values can correspond to large effects and vice versa
 Other explanations for p > .05?
 small sample/low power/high sampling error; small (i.e., hard-to-detect)
effect size; unreliable instruments; weak treatment; other hidden variables; …
 Appropriate for a limited period of exploratory research
 (Should be an) inverse relationship between theoretical maturity
and reliance on p
NHST is Crude and Uninformative
What could these t tests and resulting p values
possibly contribute here?
Do you see any similar patterns here?
(Hint: look at the p values and ESs)
Key:
p > .05 but sizeable d
p < .05 but not large d
p > .05 w/neg. d
Taylor et al. (2006)
NHST is Arbitrary
 …surely, God loves the .06 nearly as much as the .05
(Rosnow & Rosenthal, 1989, p. 1277)
 How much more (or less) would we know if the conventional
alpha level was .03 (or .15)?
 What if tests of statistical significance never existed? (Harlow et
al., 1997)
NHST is Counter-productive
 Adherence to NHST (and p values) constrains progress of theory
→ inefficient research efforts
 NHST & publication bias (Rothstein,
et al., 2005)
 Scenario: 100 intervention studies; H0 is true (i.e., no difference
between treatments A and B with alpha .05)
 (At least) 5 studies will find p < .05
 95 studies will sit unpublished, or be re-run until p < .05
 Type 1 error (false positive) in published studies = 100%
 Treatment effects (which are nil) become grossly overestimated
Conduct a study
(e.g., the effects of A on B)
p < 0.05
p > 0.05
Important finding / Get published! (jelly beans cause acne)
Modify relevant theory, research, practice
Trash
Summary
 (Quantitative) linguistics research relies heavily on NHST, which
is…
 highly controversial at best and possibly dangerous and to-be-avoided;
 unreliable;
 crude and uninformative;
 arbitrary; and
 counter-productive
OK, but what can we do to improve?
Power
(Or: a possible solution to our obsession with p values?)
Statistical Power
 What is it?
 Why does it matter?
 How many participants do I need? (A very practical and
common question)
What kind of power is needed vs. typical?
 Table 2 in Cohen (1992)
d=0.2
d=0.5
d=0.8
Are these Ns typical in linguistics research?
What kind of power is needed vs. typical?
 Plonsky & Gass (2011)
 2% conducted a power analysis
 Median d = 0.65 + median n = 22
 Overall post hoc power = .56
 Plonsky (2013)
 1% (6/606 studies) conducted a power analysis
 median d = .71 (inflated?) + median n = 19
 Overall post hoc power = .57
 What does this mean for
 Internal validity (and, hence, external validity/generalizability)?
 Past research?
 Theory-building?
 Practical implications?
 Availability bias in meta-analyses?
The “Power Problem” in L2 Research
(Plonsky, 2013, in press)
 Rarely analyze power
 Small samples
 Heavy reliance on NHST (median = 18)
 Effects not generally very large
 Omission of non-statistical results
 Rarely check assumptions
 Rarely use multivariate statistics
Tools for Power Analysis
 Cohen’s (1988, 1992) power tables
 A priori
 Conceptually?
 Practically: http://danielsoper.com/statcalc3/calc.aspx?id=47
 Post hoc
 Conceptually?
 Practically: http://danielsoper.com/statcalc3/calc.aspx?id=49
Quick Review
What if you can’t get enough
power?
 This may be the case when, for example…
 You’re studying a very small or hard-to-find population (L3
learners of Swahili with L1 Korean)
 You have limited funding for running participants
 Your phenomenon/relationship/effect of interest is small (i.e.,
hard to detect)
 Avoid or limit inferential stats
 Form fewer (sub)groups → fewer contrasts
 Focus on descriptives (including effect sizes and CIs)
 ‘Bootstrap’ the data?
Bootstrapping
 Random re-sampling from observed data to produce a simulated but
more stable outcome (see Larson-Hall & Herrington, 2010)
 (More) robust to: outliers, non-normal data → common
 Larson-Hall & Herrington (2010)
 ANOVA: p>.05 between NSs (n=15) and 3 learner groups (n=14, 15, 15)
 Tukey post hocs: p < .05 ONLY between NSs and Group A (p = .002); pb =
.407; pc = .834
 Bootstrapped post hoc tests → p < .05 for all three groups
 p values non-statistical due to a lack of power; Type II error
 Plonsky et al. (in press)
 Re-analyzed raw data from 26 primary L2 studies
 4 (of 16) Type I ‘misfits’ (i.e., 25% Type I ‘misfit’ rate)
 0 Type II ‘misfits’
 Too much power (via large N) → inflated findings?
BUT EVEN WITH GREATER POWER
VIA BOOTSTRAPPING, OUR
RESULTS ARE STILL BASED ON THE
FLAWED NOTION OF STATISTICAL
SIGNIFICANCE
EFFECT SIZES!
(Or: a MUCH BETTER solution to our obsession with p values)
Effect Sizes
 What are they? How do we calculate them?
 What advantages do ESs provide over p values?
 How can we interpret ESs?
What is an effect size?
 A quantitative indication of the strength of a relationship
or an effect
 Common effect sizes
 Standardized mean differences (Cohen’s d)
 (M1 − M2) / SDpooled (see Excel macro for calculating d)
 Correlation coefficients (e.g., r)
 Shared variance (R², η²)
 Odds Ratios (likelihood of A given B)
 Percentages
Why Effect Sizes?
- An alternative to NHST (p)
Null Hypothesis Significance Testing (p) vs. Effect Sizes (d)
 Unreliable: result dependent on sample size (e.g., Kline, 2009)
ESs: not dependent on N
 Crude and uninformative: a) forces continuous data into a yes/no
dichotomy; b) tells us nothing about practical significance or magnitude (e.g.,
Cohen, 1994)
ESs: Express magnitude/size of relationship
(i.e., WHAT WE REALLY WANT TO KNOW)
 Arbitrary: …surely, God loves the .06 nearly as much as the .05 (Rosnow &
Rosenthal, 1989, p. 1277)
ESs: Continuous and can be compared/combined
across studies
Research Questions and Their
 Think of a study you read recently or one that you’re
working on.
 What were the RQs?
 Were they phrased dichotomously (Do …? Is there a
difference …?)?
 If so, what kind of answer can come from such a RQ?
 How might the findings differ with an emphasis on magnitude
rather than presence/absence of a relationship or effect?
Why Effect Sizes?
- Journal Requirements
 APA Publication Manual, 6th Edition
 Three major L2 Journals: Language Learning, TESOL
Quarterly, Modern Language Journal
• Plonsky & Gass (2011): 0% (1980s) → 0% (1990s) → 27% (2000s)
• Plonsky (2013): 3% (1990s) → 42% (2000s)
So now effect sizes get reported more often…
…but very rarely do we interpret them
What do they mean anyway?
What does d = 0.50 (or 0.10, or 1.00…) mean?
How big is ‘big’? And how small is ‘small’?
What implications do these effects have for future research, theory, and practice?
ESs: Summary
 ESs are best understood in relation to other, field-specific effects
 d ≈ 0.40 (small)
 d ≈ 0.70 (medium)
 d ≈ 1.00 (large)
Empirically-based, field-specific scale for
d values in L2 research
 …if people interpreted effect sizes [using fixed benchmarks] with the same
rigidity that .05 has been used in statistical testing, we would merely be being
stupid in another metric (Thompson, 2001, pp. 82–83).
 Theoretical and methodological maturity (over time)
 SD units
 Research setting (lab vs. classroom; SL vs. FL)
 Length/intensity of treatment
 Manipulation of IVs
 Publication bias
 Sample size / sampling error
 Instrument reliability
A Revised Model of Research
Conduct a study
(e.g., the effects of A on B)
p < 0.05
d=?
p > 0.05
d=?
Accumulation of results (via meta-analysis)
More precise and reliable estimate of effects
Modify relevant theory, research, practice
Trash
Based on our discussion today,
what changes would you suggest
to the field?
10 Suggestions for Reform
1. A diminished reliance on NHST / p-values
2. Drop the “significant” from “statistically significant”
3. Focus on the practical and theoretical importance of results
4. Better educate ourselves and future generations of researchers →
Emphasize: ESs, alternatives to NHST, synthetic-mindedness in
primary research → De-emphasize NHST
5. ESs (for all findings, not only when p < .05)
6. CIs (for all findings, not only when p < .05) → “a quiet but insistent
reminder that no knowledge is complete or perfect” (Sagan, 1996)
7. Replication (to mitigate effects of low power)
8. Examine data visually
9. Meta-analysis / a synthetic approach
10. Initiative from the top down
 Beyond significance testing (Kline, 2013)
 The cult of statistical significance (McCloskey, 2008)
 Understanding the new statistics (Cumming, 2012)
 Effect sizes for research (Grissom & Kim, 2012, 2nd
ed.)
 Statistical power analysis for the behavioral sciences
(Cohen, 1988, 2nd ed.)
Connections to Other Topics
to be Discussed this Week
 Meta-analysis (relies on ESs rather than p values)
(TUESDAY)
 Replication (THURSDAY)
 Reporting practices (full descriptives including ES, always;
data transparency, etc.) (FRIDAY)
Tomorrow: Meta-analysis
 Motivation for and benefits of (conceptual
understanding)
 Procedures/techniques (practical understanding)
```
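
The "NHST is Unreliable" slide compares three hypothetical studies with identical means and SDs (15 (3) vs. 18 (4)) and growing group sizes (5, 15, 45). The sketch below is one way to reproduce that pattern from the summary statistics alone. It assumes a pooled-variance independent-samples t test, which the slides do not specify, so the computed p values may differ slightly from those shown on the slide. The function names (`cohens_d`, `two_tailed_p`, `summarize`) are illustrative, and the p value is obtained by numerically integrating the t density with Simpson's rule so that only the standard library is needed.

```python
import math


def cohens_d(m1, sd1, m2, sd2, n1, n2):
    """Standardized mean difference (M2 - M1) / pooled SD."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (m2 - m1) / math.sqrt(pooled_var)


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p value: 1 minus twice the area under the density on [0, |t|].

    The area is approximated with Simpson's rule (steps must be even).
    """
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return max(0.0, 1.0 - 2.0 * area)


def summarize(n, m1=15, sd1=3, m2=18, sd2=4):
    """Return (d, p) for two equal-sized groups of n participants each."""
    d = cohens_d(m1, sd1, m2, sd2, n, n)
    t = d * math.sqrt(n * n / (n + n))  # pooled-variance t statistic
    df = 2 * n - 2
    return d, two_tailed_p(t, df)


# Group sizes taken from the three hypothetical studies on the slide.
results = {n: summarize(n) for n in (5, 15, 45)}

for n, (d, p) in results.items():
    print(f"n per group = {n:2d}   d = {d:.2f}   p = {p:.4f}")
```

Running it shows d staying at about 0.85 in every row while p drops below .05 only once the groups get larger, which is the slide's point: with the effect held constant, the p value mostly reflects sample size.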