### Approximate Randomization tests

```
Approximate Randomization tests
February 5th, 2013
Classic t-test
Why AR testing?
• Classic tests often assume a given distribution (Student's t, normal, …) of the variable
• This is ≈ok for recall, but not for precision or F-score
• The set of hypotheses that can be tested with nonparametric tests is limited
Illustration
• 30,000 runs, 1000 instances, 500 of class A
• True positives (TP): 400 (stdev: 80)
• False positives (FP): 60 (stdev: 15)
• Assumption: true and false positives for class A are normally distributed. This is already an approximation since TP and FP are restricted by 0 and the number of instances.
Definitions
• Recall = correctly predicted A / A in reference
= correctly predicted A / constant
→ If the number of correctly predicted A is normal, recall is normal.
• Precision = correctly predicted A / A in system
→ A in system = TP + FP, so precision = TP / (TP + FP) is a non-linear combination of TP and FP. Precision is not normal.
• F-score: non-linear combination of recall and precision
→ Not normal.
Approximate randomization test
• No assumption on distribution
• Can handle complicated statistics
• Only assumption: independence between
shuffled elements
• References:
– Computer Intensive Methods for Testing
Hypotheses, Noreen, 1989.
– More accurate tests for the statistical significance of result differences, Yeh, 2000.
Basic idea
• Exact randomization test
          Glass 1   Glass 2   Glass 3   Glass 4
Contents  Polish    Russian   Budget    ?
Expert    Polish    Budget    Russian   ?
Exact probability
H0: expert is independent of contents
P(ncorrect ≥ 2) = 7/24 = 0.29
Thus, do not reject H0 because the probability is larger than alpha = 0.05.
Approximate probability
• The number of permutations is n!, so it grows quickly with n
• If there are too many permutations to compute, approximate: P = (nge + 1) / (NS + 1)
– nge: number of times pseudostatistic ≥ actual statistic
– NS: number of shuffles
– +1: correction for validity
DIFFERENT SETUPS
Translation to instances
• Each glass is an instance
• Contents and expert are two labeling systems
• Contents has an accuracy of 100%, expert has
an accuracy of 50%
• The statistic can be precision, F-score, recall, … instead of accuracy
Stratified shuffling
• For labeled instances, it makes no sense to move the class label of one instance to another instance
• Only shuffle the labels of the two systems within each instance
MBT
• Assumption of independence between instances
• Shuffle per sentence rather than per token
Token   System 1   System 2
This    DT         NNS
is      VBZ        VB
nice    JJ         RB
.       .          .
Term extraction
• Shuffling extracted terms between output of
two term extraction systems
Reference
System 1
System 2
happy
happy
good
good
lively
happy
angry
Script
• http://www.clips.ua.ac.be/~vincent/software.html#art
• http://www.clips.ua.ac.be/scripts/art
• Options:
– Exact and approximate randomization tests
– Instance based, also for MBT
– Term extraction based
– Stratified shuffling
– Two-sided / one-sided (check code!)
Remarks on usage
• It makes no sense to shuffle if the exact randomization can be computed
• The value of p depends on NS: the smallest attainable p is 1 / (NS + 1), so the larger NS, the lower p can be
• Validity check
– Sign-test
– Re-test: to alleviate bad randomization
Sign test
• Can be compared with P for accuracy
• H0: correctness is independent of system, i.e. P(green) = 0.5
• Binomial test
[Figure: System 1 / System 2]
Interpretation (1)
Reference   System 1   System 2
A           A          B
B           A          B
C           A          B
How much do these two systems differ based on precision for the A label?
– Maximally
– Intermediate
– Minimally
Interpretation (2)
Labels per reference instance (System 1, System 2) and precision for A:
A    B    C     System 1   System 2   Δ
AB   AB   AB    1/3        0          1/3
BA   AB   AB    0          1          -1
AB   AB   BA    1/2        0          1/2
BA   BA   AB    0          1/2        -1/2
BA   AB   BA    0          1/2        -1/2
AB   BA   BA    1          0          1
BA   BA   BA    0          1/3        -1/3
AB   BA   AB    1/2        0          1/2
Conclusion
• Approximate randomization testing can be used for many applications.
• The basic idea is to evaluate how (im)probable the actual difference between two systems is when all possible permutations of the outputs are evaluated.
• The difference can be computed in many ways, as long as the shuffled elements are independent.
```
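
The exact probability from the glass example and the approximation P = (nge + 1) / (NS + 1) can be sketched in Python as follows. This is a minimal illustration, not the `art` script itself: the function names (`precision`, `exact_matches_p`, `approximate_randomization`) are hypothetical, the test is assumed to be two-sided on the absolute difference, and precision is taken to be 0 when a label is never predicted, as in the Interpretation (2) table.

```python
import itertools
import random
from fractions import Fraction


def precision(reference, predicted, label):
    """Precision for `label`; assumed to be 0 when the label is never predicted."""
    refs_of_predicted = [r for r, p in zip(reference, predicted) if p == label]
    if not refs_of_predicted:
        return Fraction(0)
    correct = sum(r == label for r in refs_of_predicted)
    return Fraction(correct, len(refs_of_predicted))


def exact_matches_p(n_items, n_correct):
    """Exact P(ncorrect >= n_correct) over all n! orderings of the answers."""
    perms = list(itertools.permutations(range(n_items)))
    hits = sum(
        sum(i == v for i, v in enumerate(perm)) >= n_correct for perm in perms
    )
    return Fraction(hits, len(perms))


def approximate_randomization(reference, sys1, sys2, statistic,
                              shuffles=10000, seed=0):
    """Two-sided approximate randomization test with stratified shuffling.

    `statistic(reference, predicted)` returns the score of one system.
    Returns P = (nge + 1) / (NS + 1).
    """
    rng = random.Random(seed)
    actual = abs(statistic(reference, sys1) - statistic(reference, sys2))
    nge = 0
    for _ in range(shuffles):
        a, b = [], []
        for x, y in zip(sys1, sys2):
            # Stratified shuffling: swap the two outputs within one instance.
            if rng.random() < 0.5:
                x, y = y, x
            a.append(x)
            b.append(y)
        pseudo = abs(statistic(reference, a) - statistic(reference, b))
        if pseudo >= actual:
            nge += 1
    return (nge + 1) / (shuffles + 1)


def precision_a(reference, predicted):
    return precision(reference, predicted, "A")


# Glass example: 4 glasses, expert has at least 2 correct.
print(exact_matches_p(4, 2))  # 7/24

# Interpretation example: every shuffle gives |Δ| >= 1/3, so p is 1.
reference = ["A", "B", "C"]
system1 = ["A", "A", "A"]
system2 = ["B", "B", "B"]
print(approximate_randomization(reference, system1, system2,
                                precision_a, shuffles=1000))  # 1.0
```

Exact fractions are used for the statistic so that the comparison pseudostatistic ≥ actual statistic is not affected by floating-point rounding. With only three instances there are 2³ = 8 possible swaps, so the exact test would be the better choice here; the shuffled version is shown only to illustrate the formula.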