### Sensitivity & specificity

```
Evaluation of segmentation
Example
Reference standard & segmentation
Segmentation performance
• Qualitative/subjective evaluation: the easy way
out, sometimes the only option
• Quantitative evaluation preferable in general
• A wide variety of performance measures exists
• Many measures are applicable outside the
segmentation domain as well
• Focus here is on two-class problems
Some terms
• Ground truth = the real thing
• Gold standard = the best we can get
• Bronze standard = gold standard with limitations
• Reference standard = preferred term for gold
standard in the medical community
What to evaluate?
• Without reference standard, subjective or
qualitative evaluation is hard to avoid
• Region/pixel based comparisons
• Border/surface comparisons
• (a selection of) Points
• Global performance measures versus local
measures
What region to evaluate over?
Combination of reference and result
[Figure: overlay of reference and result; legend: true positive, true negative, false negative, false positive]
[Figure: false positives]
[Figure: false negatives]
Confusion matrix (Contingency table)
                      Segmentation
                      negative       positive
Reference negative    191152 (TN)    3813 (FP)
Reference positive    9764 (FN)      19648 (TP)
Do not get confused!
• False positives are actually negatives
• False negatives are actually positives
Confusion matrix (Contingency table)
                      Segmentation
                      negative     positive
Reference negative    .852 (TN)    .017 (FP)
Reference positive    .044 (FN)    .088 (TP)
Accuracy, sensitivity, specificity
sensitivity = true positive fraction
= 1 – false negative fraction
= TP / (TP + FN)
specificity = true negative fraction
= 1 – false positive fraction
= TN / (TN + FP)
accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy
• Range: from 0 to 1
• Useful measure, but:
• Depends on prior probability (prevalence); in
other words: on amount of background
• Even ‘stupid’ methods can achieve high
accuracy (e.g. ‘all background’, or ‘most likely
class’ systems)
Sensitivity & specificity
• Are intertwined
• ‘stupid’ methods can achieve arbitrarily large
sensitivity/specificity at the expense of low
specificity/sensitivity
• Do not depend on prior probability
• Are useful when false positives and false
negatives have different consequences
[Figure: example cases labelled P (positive) or N (negative), grouped into
true positives (TP), false positives (FP), false negatives (FN) and
true negatives (TN)]
sensitivity = true positive fraction
= 1 - false negative fraction
= TP / (TP + FN)
specificity = true negative fraction
= 1 - false positive fraction
= TN / (TN + FP)
accuracy = (TP + TN) / (TP + TN + FP + FN)
[Figure: example cases labelled P or N, with the count per group]
true positives (TP) = 3
false positives (FP) = 3
false negatives (FN) = 2
true negatives (TN) = 4
sensitivity = TP / (TP + FN) = 3 / 5 = 0.6
specificity = TN / (TN + FP) = 4 / 7 = 0.57
accuracy = (TP + TN) / (TP + TN + FP + FN) = 7 / 12 = 0.58
[Figure: the example cases as labelled by algorithm 1 and by algorithm 2]

              TP   FP   FN   TN   sensitivity    specificity     accuracy
algorithm 1   3    3    2    4    3 / 5 = 0.6    4 / 7 = 0.57    7 / 12 = 0.58
algorithm 2   4    5    1    2    4 / 5 = 0.8    2 / 7 = 0.29    6 / 12 = 0.5
Which system is better?
Back to the retinal image…
                      Result
                      negative     positive
Reference negative    .852 (TN)    .017 (FP)
Reference positive    .044 (FN)    .088 (TP)

Accuracy: 0.93949
Sensitivity: 0.668027
Specificity: 0.980443
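Overlap = intersection / union = TP / (TP + FP + FN)
[Figure: reference and segmentation drawn as overlapping regions; FN = reference only, TP = intersection, FP = segmentation only, TN = background]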
Overlap
• Overlap ranges from 0 (no overlap) to 1
(complete overlap)
• The background (TN) is disregarded in the
overlap measure
• Small objects with irregular borders have lower
overlap values than big compact objects
Kappa
• Accuracy would not be zero if we used a system
that is ‘guessing’
• A ‘guessing’ system should get a ‘zero’ mark
(remember multiple choice exams…)
• Kappa is an attempt to measure ‘accuracy in
excess of accuracy expected by chance’
Kappa
                      Result
                      negative    positive    total
Reference negative    191152      3813        194965
Reference positive    9764        19648       29412
Total                 200916      23461       224377

System positive rate: 23461 / 224377 = .105
System accuracy: (191152 + 19648) / 224377 = .939
True positives of a guessing system:
(system positive rate) * (total number of positives)
= .105 * 29412 ≈ 3075 (using the unrounded rate)
… etc
Accuracy of the guessing system: .792
Kappa
• acc_guess = the accuracy of a randomly guessing
system with a given positive (or negative) rate
• kappa = (acc - acc_guess) / (1 - acc_guess)
• In our case: kappa = (.939 - .792) / (1 - .792) = .707
Kappa
• Maximum value is 1, can be negative
• A ‘guessing’ system has kappa = 0
• ‘Stupid systems’ (‘all background’ or ‘most likely
class’) have kappa = 0
• Systems with negative kappa have ‘worse than
chance’ performance
Positive/negative predictive value
• PPV and NPV depend on prevalence, contrary to sensitivity
and specificity
ROC analysis
Evaluating algorithms
• Most algorithms can produce a continuous instead of a
discrete output, monotonically related to the probability that
a case is positive.
• Using a variable threshold on such a continuous output, a
user can choose the (sensitivity, specificity) of the system.
This is formalized in an ROC (receiver operating
characteristic) analysis.
[Figure: reference standard & segmentation]
[Figure: reference standard & soft segmentation]
ROC analysis
[Figure: distributions Pn(x) (negatives) and Pp(x) (positives) of the output x, with the true positive, true negative and false positive fractions indicated]
ROC curve
[Figure: ROC curve, both axes from 0 to 1.
Vertical axis: true positive probability = true positive fraction = sensitivity = detection rate.
Horizontal axis: false positive probability = false positive fraction = 1 - specificity = chance of false alarm]
ROC curves
• Originally proposed in radar detection theory
• Formalizes the trade-off between sensitivity and specificity
• Makes the discriminability and decision bias explicit
• Each hard classification is one operating point on the ROC
curve
ROC curves
• A single measure for the performance of a
system is the area under the ROC curve Az
• A system that randomly generates a label with
probability p has an ROC curve that is a straight
line from (0,0) to (1,1), Az = 0.5
• A perfect system has Az = 1
• Az does not depend on prior probabilities
(prevalence)
ROC curves
• If one assumes Pn(x) and Pp(x) are Gaussian, two
parameters determine the curve: the difference between
the means and the ratio of the standard deviations. They
can be estimated with a maximum-likelihood procedure.
• There are procedures to obtain confidence intervals for
ROC curves and to test whether the Az values of two curves
are significantly different.
Intuitive meaning for Az
• Is there an intuitive meaning for Az?
• Consider the two-alternative forced-choice
experiment: an observer is confronted with one
positive and one negative case, both randomly
chosen. The observer must select the positive
case. What is the chance that the observer does
this correctly?
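[Figure: distributions Pp(x) and Pn(x) along x]

chance of correct decision in the 2-AFC experiment
  = \int_{-\infty}^{\infty} dx \, P_n(x) \int_{x}^{\infty} dx' \, P_p(x')

A_z = \int_{-\infty}^{\infty} dx \, P_n(x) \int_{x}^{\infty} dx' \, P_p(x')

[Figure: ROC curve, true positive probability versus false positive probability, both from 0 to 1.
The inner integral \int_{x}^{\infty} dx' \, P_p(x') is the true positive fraction at threshold x;
P_n(x) dx is the width of the corresponding false positive fraction column]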
Az as a segmentation performance measure
• Ranges from 0.5 to 1
• Soft labeling is required (not easy for humans in
segmentation)
• Independent of system threshold (operating
point) and prevalence (priors)
• Depends on ‘amount of background’ though!
Summary
• Various pixel-based measures were considered
for two-class, hard (binary) classification results:
– Accuracy
– Sensitivity, specificity
– Overlap
– Kappa
• ROC
```
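The hard-classification measures from the slides (accuracy, sensitivity, specificity, overlap, kappa) can all be computed from the four confusion-matrix counts. The following is a minimal sketch, not the lecturer's implementation: the function name `segmentation_metrics` and the dictionary keys are made up for illustration, and the counts are the retinal-image numbers quoted in the slides. The guessing accuracy is computed as described on the kappa slides, for a system that labels at random with the same positive rate as the system under evaluation.

```python
def segmentation_metrics(tp, fp, fn, tn):
    """Pixel-based measures for a two-class (binary) segmentation.

    Hypothetical helper for illustration; the formulas follow the slides.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)      # true positive fraction
    specificity = tn / (tn + fp)      # true negative fraction
    overlap = tp / (tp + fp + fn)     # intersection / union, TN is ignored

    # Accuracy expected from a system that guesses "positive" at random
    # with the same positive rate as the evaluated system.
    positive_rate = (tp + fp) / total
    prevalence = (tp + fn) / total
    accuracy_guessing = (positive_rate * prevalence
                         + (1 - positive_rate) * (1 - prevalence))
    kappa = (accuracy - accuracy_guessing) / (1 - accuracy_guessing)

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "overlap": overlap,
        "accuracy_guessing": accuracy_guessing,
        "kappa": kappa,
    }


# Counts from the retinal image example in the slides.
metrics = segmentation_metrics(tp=19648, fp=3813, fn=9764, tn=191152)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")

# An "all background" system: every pixel is labelled negative.
all_background = segmentation_metrics(tp=0, fp=0, fn=29412, tn=194965)
print(f"all background: accuracy={all_background['accuracy']:.3f}, "
      f"kappa={all_background['kappa']:.3f}")
```

Without intermediate rounding, kappa comes out at roughly 0.71, close to the .707 the slides obtain from rounded values. The "all background" call illustrates the point made about accuracy and kappa: high accuracy, but kappa of zero.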
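The slides state that PPV and NPV depend on prevalence while sensitivity and specificity do not, but give no formulas. The sketch below assumes the standard definitions PPV = TP / (TP + FP) and NPV = TN / (TN + FN), which are not spelled out in the slides. It holds sensitivity and specificity fixed at the rounded retinal-image values and varies the prevalence; the function name `predictive_values` and the chosen prevalences are illustrative only.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a system with fixed sensitivity/specificity.

    Assumes the standard definitions PPV = TP/(TP+FP), NPV = TN/(TN+FN),
    expressed here as fractions of the whole population.
    """
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv, npv


# Same operating point, different amounts of foreground (illustrative values).
for prevalence in (0.01, 0.13, 0.5):
    ppv, npv = predictive_values(0.668, 0.980, prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

With the operating point unchanged, PPV rises and NPV falls as the prevalence increases, which is the dependence the slide refers to.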
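The ROC slides describe sweeping a threshold over a continuous output and interpret Az as the chance of a correct decision in a two-alternative forced-choice (2-AFC) experiment. The sketch below illustrates both on synthetic data. It assumes Gaussian score distributions with made-up means and sample sizes, uses the trapezoidal rule for the area (the slides do not prescribe an estimator), and all function names are hypothetical.

```python
import random


def roc_points(neg_scores, pos_scores):
    """Sweep a threshold over all observed scores; return (FPF, TPF) points."""
    thresholds = sorted(set(neg_scores) | set(pos_scores), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    for t in thresholds:
        tpf = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpf = sum(s >= t for s in neg_scores) / len(neg_scores)
        points.append((fpf, tpf))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def two_afc_probability(neg_scores, pos_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Synthetic soft outputs: assumed Gaussian Pn(x) and Pp(x), illustrative only.
rng = random.Random(0)
neg = [rng.gauss(0.0, 1.0) for _ in range(200)]
pos = [rng.gauss(1.5, 1.0) for _ in range(200)]

az = area_under_curve(roc_points(neg, pos))
afc = two_afc_probability(neg, pos)
print(f"Az (trapezoidal) = {az:.4f}")
print(f"2-AFC fraction   = {afc:.4f}")
```

On a finite sample the two numbers agree up to floating-point error, which is the empirical counterpart of the two identical integrals in the slides.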