### PPT

```
Performance Metrics for
Outline
• Introduction to Performance Metrics
• Supervised Learning Performance Metrics
• Unsupervised Learning Performance Metrics
• Optimizing Metrics
• Statistical Significance Techniques
• Model Comparison

Introduction to Performance Metrics
A performance metric measures how well your data mining algorithm is
performing on a given dataset.
For example, if we apply a classification algorithm to a dataset, we first
check how many of the data points were classified correctly. This is
a performance metric, and its formal name is “accuracy.”
Performance metrics also help us decide whether one algorithm is better or
worse than another.
For example, suppose classification algorithm A classifies 80% of data points
correctly and classification algorithm B classifies 90% of data
points correctly. We immediately see that algorithm B is doing better.
There are some intricacies, which we will discuss in this chapter.

Supervised Learning Performance Metrics
Metrics that are applied when the ground truth is available
Outline:
• 2X2 Confusion Matrix
• Multi-level Confusion Matrix
• Visual Metrics
• Cross-validation

2X2 Confusion Matrix
A 2X2 matrix is used to tabulate the results of a 2-class supervised learning problem;
entry (i,j) represents the number of elements with class label i that are predicted to
have class label j.

                    Predicted Class
Actual Class    +                       -                       Total
+               f++ (True Positive)     f+- (False Negative)    C = f++ + f+-
-               f-+ (False Positive)    f-- (True Negative)     D = f-+ + f--
Total           A = f++ + f-+           B = f+- + f--           T = f++ + f-+ + f+- + f--

+ and – are the two class labels

2X2 Confusion Matrix
Example
Results from a classification algorithm

Vertex ID    Actual Class    Predicted Class
1            +               +
2            +               +
3            +               +
4            +               +
5            +               -
6            -               +
7            -               +
8            -               -

Corresponding 2X2 matrix for the given table

                    Predicted Class
Actual Class    +        -        Total
+               4        1        C = 5
-               2        1        D = 3
Total           A = 6    B = 2    T = 8

• True Positive = 4
• False Negative = 1
• False Positive = 2
• True Negative = 1

2X2 Confusion Matrix
Performance Metrics
A walk-through of the metrics, in the 2X2 matrix notation (f++, f+-, f-+, f-- and the totals A, C, T):
1. Accuracy is the proportion of correct predictions: (f++ + f--) / T
2. Error rate is the proportion of incorrect predictions: (f+- + f-+) / T
3. Recall is the proportion of “+” data points predicted as “+”: f++ / C
4. Precision is the proportion of data points predicted as “+” that are
truly “+”: f++ / A

Multi-level Confusion Matrix
An nXn matrix, where n is the number of classes and entry (i,j) represents the
number of elements with class label i, but predicted to have class label j

Multi-level Confusion Matrix
Example

                                  Predicted Class
Actual Class                      Class 1    Class 2    Class 3    Marginal Sum of Actual Values
Class 1                           2          1          1          4
Class 2                           1          2          1          4
Class 3                           1          2          3          6
Marginal Sum of Predictions       4          5          5          T = 14

Multi-level Confusion Matrix
Conversion to 2X2

                 Predicted Class
Actual Class     Class 1    Class 2    Class 3
Class 1          2          1          1
Class 2          1          2          1
Class 3          1          2          3

2X2 Matrix Specific to Class 1
(f++ is the Class 1 diagonal entry, f+- the rest of the Class 1 row,
f-+ the rest of the Class 1 column, and f-- everything else)

                     Predicted Class
Actual Class         Class 1 (+)    Not Class 1 (-)    Total
Class 1 (+)          2              2                  C = 4
Not Class 1 (-)      2              8                  D = 10
Total                A = 4          B = 10             T = 14

We can now apply all the 2X2 metrics:
Accuracy = (2 + 8)/14 = 10/14
Error = (2 + 2)/14 = 4/14
Recall = 2/4
Precision = 2/4

Multi-level Confusion Matrix
Performance Metrics

                 Predicted Class
Actual Class     Class 1    Class 2    Class 3
Class 1          2          1          1
Class 2          1          2          1
Class 3          1          2          3

1. Critical Success Index or Threat Score: for each class L, the ratio of correct predictions
for L to the number of vertices that either belong to L or are predicted as L
2. Bias: for each class L, the ratio of the total points with class label L to the
number of points predicted as L
Bias helps us understand whether a model is over- or
under-predicting a class

Confusion Matrix
R-code
library(PerformanceMetrics)
# 2X2 confusion matrix from the example
data(M)
M
#      [,1] [,2]
# [1,]    4    1
# [2,]    2    1
twoCrossConfusionMatrixMetrics(M)
# 3X3 multi-level confusion matrix from the example
data(MultiLevelM)
MultiLevelM
#      [,1] [,2] [,3]
# [1,]    2    1    1
# [2,]    1    2    1
# [3,]    1    2    3
multilevelConfusionMatrixMetrics(MultiLevelM)

Visual Metrics
Metrics that are plotted on a graph to give a visual
picture of the performance of two-class classifiers

[Figure: ROC plot. x-axis: False Positive Rate (0 to 1); y-axis: True Positive Rate (0 to 1).
(0,1) is the ideal point; (1,1) predicts the +ve class all the time; (0,0) predicts the –ve
class all the time; the diagonal from (0,0) to (1,1) corresponds to AUC = 0.5.]

Plot the performance of multiple models to
decide which one performs best

Understanding Model Performance
based on ROC Plot

[Figure: ROC plot. x-axis: False Positive Rate (0 to 1); y-axis: True Positive Rate (0 to 1);
diagonal labeled AUC = 0.5.]

Upper left: models that lie here have good performance.
Note: This is where you aim to get the model.

Upper right: models that lie here are liberal.
1. They will predict “+” with little evidence
2. High false positives

Lower left: models that lie here are conservative.
1. They will not predict “+” unless there is strong evidence
2. Low false positives but high false negatives

Lower right (below the diagonal): models that lie here perform worse than random.
Note: Models here can be negated to move them to the upper left.

ROC Plot
Example

[Figure: ROC plot. x-axis: False Positive Rate (0 to 1); y-axis: True Positive Rate (0 to 1).]
Models plotted as (False Positive Rate, True Positive Rate):
M1 (0.1,0.8)
M2 (0.5,0.5)
M3 (0.3,0.5)

M1’s performance lies furthest toward the upper left and hence M1 is considered
the best model.

Cross-validation
Cross-validation, also called rotation estimation, is a way to analyze
how a predictive data mining model will perform on an unknown dataset,
i.e., how well the model generalizes
Strategy:
1. Divide the dataset into two non-overlapping subsets
2. One subset is called the “test” set and the other the “training” set
3. Build the model using the “training” set
4. Obtain predictions for the “test” set
5. Use the “test” set predictions to calculate all the performance metrics
Typically cross-validation is performed for multiple iterations,
selecting a different non-overlapping test and training set each time

Types of Cross-validation
• Hold-out: a random 1/3rd of the data is used as test and the
remaining 2/3rd as training
• k-fold: divide the data into k partitions; use one partition as test
and the remaining k-1 partitions for training
• Leave-one-out: special case of k-fold where k equals the number of data
points, so each test set contains a single point
Note: Selection of data points is typically done in a stratified manner, i.e.,
the class distribution in the test set is similar to that in the training set

Unsupervised Learning Performance Metrics
Metrics that are applied when the ground truth is
not always available (e.g., clustering tasks)
Outline:
• Evaluation Using Prior Knowledge
• Evaluation Using Cluster Properties

Evaluation Using Prior Knowledge
One way to test the effectiveness of an unsupervised learning method is to take a
dataset D with known class labels, strip the labels, and provide the set as
input to an unsupervised learning algorithm, U. The resulting clusters are then
compared with the prior knowledge to judge the performance of U
To evaluate performance
1. Contingency Table
2. Ideal and Observed Matrices

Contingency Table

                     Same Cluster    Different Cluster
Same Class           u11             u10
Different Class      u01             u00

(A) To fill the table, initialize u11, u01, u10, u00 to 0
(B) Then, for each pair of points of the form (v,w):
1. if v and w belong to the same class and the same cluster, increment u11
2. if v and w belong to the same class but different clusters, increment u10
3. if v and w belong to different classes but the same cluster, increment u01
4. if v and w belong to different classes and different clusters, increment u00

Contingency Table
Performance Metrics
Example Matrix

                     Same Cluster    Different Cluster
Same Class           9               4
Different Class      3               12

• Rand Statistic, also called the simple matching coefficient, is a measure
in which placing a pair of points with the same class label in the same cluster
and placing a pair of points with different class labels in different clusters are
given equal importance, i.e., it accounts for both the specificity and the sensitivity of the
clustering
Rand = (u11 + u00) / (u11 + u10 + u01 + u00); for the example, (9 + 12)/28 = 0.75
• Jaccard Coefficient can be used when placing a pair of points with the
same class label in the same cluster is of primary importance
Jaccard = u11 / (u11 + u10 + u01); for the example, 9/16 = 0.5625

Ideal and Observed Matrices
Given that the number of points is T, the ideal-matrix is a TxT matrix, where each
cell (i,j) has a 1 if the points i and j belong to the same class and a 0 if they belong to
different classes. The observed-matrix is a TxT matrix, where a cell (i,j) has a 1 if
the points i and j belong to the same cluster and a 0 if they belong to different
clusters
• Mantel Test is a statistical test of the correlation between two matrices of the
same rank. The two matrices, in this case, are symmetric and, hence, it is
sufficient to analyze the lower or upper triangle of each matrix

Evaluation Using Prior Knowledge
R-code
library(PerformanceMetrics)
# Contingency table from the example
data(ContingencyTable)
ContingencyTable
#      [,1] [,2]
# [1,]    9    4
# [2,]    3   12
contingencyTableMetrics(ContingencyTable)

Evaluation Using Cluster Properties
In the absence of prior knowledge we have to rely on the information from
the clusters themselves to evaluate performance.
1. Cohesion measures how closely objects in the same cluster are related
2. Separation measures how distinct or separated a cluster is from all
the other clusters
Here, gi refers to cluster i, W is total number of clusters, x and y are data points,
proximity can be any similarity measure (e.g., cosine similarity)
We want the cohesion to be close to 1 and separation to be close to 0
Optimizing Metrics
Performance metrics that act as optimization
functions for a data mining algorithm
Outline:
• Sum of Squared Errors
• Preserved Variability
Sum of Squared Errors
Sum of squared errors (SSE) is typically used in clustering algorithms to measure
the quality of the clusters obtained. It takes into consideration the
distance from each point in a cluster to its cluster center (centroid or some other
chosen representative).
For dj, a point in cluster gi, where mi is the cluster center of gi, and W, the total
number of clusters, SSE is defined as follows:
SSE = \sum_{i=1}^{W} \sum_{d_j \in g_i} distance(d_j, m_i)^2
This value is small when points are close to their cluster center, indicating a good
clustering. Similarly, a large SSE indicates a poor clustering. Thus, clustering
algorithms aim to minimize SSE.

Preserved Variability
Preserved variability is typically used in eigenvector-based dimension reduction
techniques to quantify the variance preserved by the chosen dimensions. The
objective of the dimension reduction technique is to maximize this parameter.
Given that the point is represented in r dimensions, of which k are chosen (k << r), and the
eigenvalues are λ1 >= λ2 >= ... >= λr-1 >= λr, the preserved variability (PV) is calculated as follows:
PV = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{r} \lambda_i}
The value of this parameter depends on the number of dimensions chosen: the
more included, the higher the value. Choosing all the dimensions will result in the
perfect score of 1.

Statistical Significance Techniques
• Methods used to assess a p-value for the different performance
metrics
Scenario:
– Say we obtain cohesion = 0.99 for clustering algorithm A. At first glance,
0.99 looks like a very good score.
– However, it is possible that the underlying data is structured in such a way that you
would get 0.99 no matter how you cluster the data.
– In that case, 0.99 is not very significant. One way to decide this is to use statistical
significance estimation.
We will discuss the Monte Carlo procedure in the next slide.

Monte Carlo Procedure
Empirical p-value Estimation
The Monte Carlo procedure uses random sampling to assess whether a
particular performance metric value we obtain could have been attained at random.
For example, if we obtain a cohesion score of 0.99 for a cluster of size 5, we would be
inclined to think that it is a very cohesive cluster. However, this value could have
resulted from the nature of the data and not from the algorithm. To test the
significance of this 0.99 value we
1. Sample N (usually 1000) random sets of size 5 from the dataset
2. Recalculate the cohesion for each of the N sets
3. Count R: the number of random sets with value >= 0.99 (the original score of the cluster)
4. The empirical p-value for the cluster of size 5 with 0.99 score is given by R/N
5. We apply a cutoff, say 0.05, to decide if 0.99 is significant
Steps 1-4 are the Monte Carlo method for p-value estimation.

Model Comparison
Metrics that compare the performance of different algorithms
Scenario:
1) Model 1 provides an accuracy of 70% and Model 2 provides an accuracy
of 75%
2) At first glance, Model 2 seems better; however, it could be that
Model 1 is predicting Class1 better than Class2
3) Suppose Class1 is indeed more important than Class2 for our problem
4) We can use model comparison methods to take this notion of
“importance” into consideration when we pick one model over another
Cost-based Analysis is an important model comparison method discussed
in the next few slides.

Cost-based Analysis
In real-world applications, certain aspects of model performance are considered
more important than others. For example, if a person with cancer is diagnosed as
cancer-free, or vice versa, then the prediction model should be especially penalized.
This penalty can be introduced in the form of a cost matrix.

Cost Matrix
                    Predicted Class
Actual Class    +      -
+               c11    c10
-               c01    c00

c11 is associated with f11 or u11
c10 is associated with f10 or u10
c01 is associated with f01 or u01
c00 is associated with f00 or u00

Cost-based Analysis
Cost of a Model
The cost and confusion matrices for Model M are given below

Confusion Matrix
                    Predicted Class
Actual Class    +      -
+               f11    f10
-               f01    f00

Cost Matrix
                    Predicted Class
Actual Class    +      -
+               c11    c10
-               c01    c00

Cost of Model M is given as:
C_M = c_{11} f_{11} + c_{10} f_{10} + c_{01} f_{01} + c_{00} f_{00}

Cost-based Analysis
Comparing Two Models
This analysis is typically used to select one model when we have more than one
choice, obtained by using different algorithms or different parameters for the learning
algorithms.

Cost Matrix
                    Predicted Class
Actual Class    +      -
+               -20    100
-               45     -10

Confusion Matrix of Mx
                    Predicted Class
Actual Class    +      -
+               4      1
-               2      1

Confusion Matrix of My
                    Predicted Class
Actual Class    +      -
+               3      2
-               2      1

Cost of Mx: (-20)(4) + (100)(1) + (45)(2) + (-10)(1) = 100
Cost of My: (-20)(3) + (100)(2) + (45)(2) + (-10)(1) = 220
CMx < CMy
Based purely on the cost model, Mx is the better model

Cost-based Analysis
R-code
library(PerformanceMetrics)
# Confusion matrices of the two models and the cost matrix
data(Mx)
data(My)
data(CostMatrix)
Mx
#      [,1] [,2]
# [1,]    4    1
# [2,]    2    1
My
#      [,1] [,2]
# [1,]    3    2
# [2,]    2    1
# Cost of each model under CostMatrix
costAnalysis(Mx,CostMatrix)
costAnalysis(My,CostMatrix)
```
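
The 2X2 confusion matrix metrics from the slides (accuracy, error rate, recall, precision) can be sketched in a few lines of Python. This is only an illustration, not the deck's `PerformanceMetrics` R package: the function name `confusion_2x2_metrics` and the layout (rows are the actual class, columns the predicted class, with `+` first) are assumptions made for this example.

```python
def confusion_2x2_metrics(m):
    """Compute basic metrics from a 2x2 confusion matrix.

    Assumed layout: m[0] is the actual '+' row, m[1] the actual '-' row;
    column 0 is predicted '+', column 1 is predicted '-'.
    """
    (tp, fn), (fp, tn) = m
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,    # correct predictions / all points
        "error_rate": (fp + fn) / total,  # incorrect predictions / all points
        "recall": tp / (tp + fn),         # '+' points that were predicted '+'
        "precision": tp / (tp + fp),      # predicted '+' points that are truly '+'
    }


# Matrix from the slide example: 4 TP, 1 FN, 2 FP, 1 TN
M = [[4, 1], [2, 1]]
metrics = confusion_2x2_metrics(M)
print(metrics)
```

For the slide's matrix this gives an accuracy of 5/8 and a recall of 4/5; accuracy and error rate always sum to 1.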
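
The conversion of a multi-level confusion matrix to a class-specific 2X2 matrix, along with the critical success index and bias, might look like the following Python sketch. The helper names (`one_vs_rest`, `critical_success_index`, `bias`) are hypothetical. The sketch assumes the critical success index counts each vertex that belongs to L or is predicted as L exactly once (TP + FN + FP), and it follows the deck's definition of bias as actual count over predicted count.

```python
def one_vs_rest(m, k):
    """Collapse an n x n confusion matrix to a 2x2 matrix for class index k."""
    n = len(m)
    tp = m[k][k]                                  # actual k, predicted k
    fn = sum(m[k]) - tp                           # actual k, predicted other
    fp = sum(m[i][k] for i in range(n)) - tp      # actual other, predicted k
    tn = sum(sum(row) for row in m) - tp - fn - fp
    return [[tp, fn], [fp, tn]]


def critical_success_index(m, k):
    """Correct predictions for class k over points that are k or predicted k."""
    (tp, fn), (fp, _tn) = one_vs_rest(m, k)
    return tp / (tp + fn + fp)


def bias(m, k):
    """Points with actual class k divided by points predicted as class k."""
    actual = sum(m[k])
    predicted = sum(row[k] for row in m)
    return actual / predicted


# 3x3 matrix from the slide example (rows: actual, columns: predicted)
MultiLevelM = [[2, 1, 1], [1, 2, 1], [1, 2, 3]]
class1_2x2 = one_vs_rest(MultiLevelM, 0)
print(class1_2x2)
print(critical_success_index(MultiLevelM, 0), bias(MultiLevelM, 2))
```

Under the deck's definition, a bias above 1 means the class is predicted less often than it actually occurs; Class 3 in the example has 6 actual points and 5 predictions.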
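
The k-fold strategy from the cross-validation slides can be illustrated by generating the index sets for each fold. This is a minimal sketch under simplifying assumptions: the function name `k_fold_splits` is made up for this example, and it neither shuffles nor stratifies the data, both of which a real experiment would normally do.

```python
def k_fold_splits(n_points, k):
    """Yield (train_indices, test_indices) for each of k folds.

    Points are taken in their given order; each point appears in exactly
    one test fold. No shuffling or stratification is performed.
    """
    indices = list(range(n_points))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_points // k + (1 if i < n_points % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 4-fold split of 8 points, and leave-one-out (k equal to the number of points)
folds = list(k_fold_splits(8, 4))
leave_one_out = list(k_fold_splits(8, 8))
for train, test in folds:
    print("train:", train, "test:", test)
```

Setting k to the number of data points yields leave-one-out, where every test set holds a single point.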
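
The pair-counting procedure for the contingency table, and the Rand statistic and Jaccard coefficient derived from it, can be sketched as below. The function names and the toy labels are assumptions for illustration; the formulas used are the standard pair-counting definitions.

```python
from itertools import combinations


def contingency_table(classes, clusters):
    """Count point pairs by (same class?, same cluster?).

    Returns [[u11, u10], [u01, u00]] in the layout used on the slides.
    """
    u11 = u10 = u01 = u00 = 0
    for v, w in combinations(range(len(classes)), 2):
        same_class = classes[v] == classes[w]
        same_cluster = clusters[v] == clusters[w]
        if same_class and same_cluster:
            u11 += 1
        elif same_class:
            u10 += 1
        elif same_cluster:
            u01 += 1
        else:
            u00 += 1
    return [[u11, u10], [u01, u00]]


def rand_statistic(table):
    """Fraction of pairs on which class labels and clustering agree."""
    (u11, u10), (u01, u00) = table
    return (u11 + u00) / (u11 + u10 + u01 + u00)


def jaccard_coefficient(table):
    """Like Rand, but ignores pairs that differ in both class and cluster."""
    (u11, u10), (u01, _u00) = table
    return u11 / (u11 + u10 + u01)


# Table from the slide example, plus a toy labeling (hypothetical data)
slide_table = [[9, 4], [3, 12]]
toy_table = contingency_table(["a", "a", "b", "b"], [0, 0, 0, 1])
print(rand_statistic(slide_table), jaccard_coefficient(slide_table))
print(toy_table)
```

For the slide's example table the Rand statistic is 21/28 and the Jaccard coefficient is 9/16.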
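
Steps 1 to 4 of the Monte Carlo procedure can be written as a short, generic function. This is a sketch under stated assumptions: `empirical_p_value` and `toy_cohesion` are hypothetical names, and `toy_cohesion` is a stand-in metric invented for this example, not the deck's cohesion formula.

```python
import random
from itertools import combinations


def empirical_p_value(population, set_size, metric, observed, n_samples=1000, seed=0):
    """Estimate how often a random set scores at least `observed`.

    Draws `n_samples` random subsets of `set_size` points, scores each with
    `metric`, and returns R / N, where R counts scores >= observed.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    hits = 0
    for _ in range(n_samples):
        sample = rng.sample(population, set_size)
        if metric(sample) >= observed:
            hits += 1
    return hits / n_samples


def toy_cohesion(points):
    """Stand-in score in (0, 1]: higher when 1-D points lie closer together."""
    pairs = list(combinations(points, 2))
    mean_gap = sum(abs(x - y) for x, y in pairs) / len(pairs)
    return 1.0 / (1.0 + mean_gap)


population = list(range(100))     # hypothetical 1-D dataset
cluster = [10, 11, 12, 13, 14]    # hypothetical cluster of size 5
p_value = empirical_p_value(population, len(cluster), toy_cohesion, toy_cohesion(cluster))
print(p_value)
```

A cutoff such as 0.05 (step 5) is then applied to `p_value` to decide whether the observed score is significant.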
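
The cost of a model is the element-wise product of its confusion matrix and the cost matrix, summed over all four cells. The sketch below applies that to the matrices from the "Comparing Two Models" slide; `model_cost` is a hypothetical helper name, not the `costAnalysis` function of the R package.

```python
def model_cost(confusion, cost):
    """Sum of confusion[i][j] * cost[i][j] over every cell."""
    return sum(
        f * c
        for f_row, c_row in zip(confusion, cost)
        for f, c in zip(f_row, c_row)
    )


# Matrices from the slide (rows: actual +/-, columns: predicted +/-)
CostMatrix = [[-20, 100], [45, -10]]
Mx = [[4, 1], [2, 1]]
My = [[3, 2], [2, 1]]

cost_mx = model_cost(Mx, CostMatrix)
cost_my = model_cost(My, CostMatrix)
print(cost_mx, cost_my)
# The model with the lower total cost is preferred
print("Mx" if cost_mx < cost_my else "My")
```

With these inputs the totals are 100 for Mx and 220 for My, so Mx is preferred under this cost matrix.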