slides - Computer Science and Engineering

```CLUSTERABILITY
A THEORETICAL STUDY
Margareta Ackerman
Joint work with Shai Ben-David
The theory-practice gap
Clustering is one of the most widely used tools for
exploratory data analysis.
Social Sciences
Biology
Astronomy
Computer Science
.
.
.
All apply clustering to gain a first understanding of the
structure of large data sets.
Yet, there is distressingly little
theoretical understanding of clustering.
Inherent obstacles
• Clustering is not well defined.
There is a wide variety of different clustering tasks, with
different (often implicit) measures of quality.
• In most practical clustering tasks there is
no clear ground truth to evaluate your solution by.
(in contrast with classification tasks, in which you can
have a hold out labeled set to evaluate the classifier against).
Common solutions
Objective utility functions
Sum Of In-Cluster Distances, Average Distances to Center Points, Cut Weight,
Spectral Clustering, etc. (Shmoys, Charikar, Meyerson, Luxburg, ..)
Analyze the computational complexity of discrete optimization problems.
Consider a restricted set of distributions (“generative
models”):
Ex. Mixtures of Gaussians [Dasgupta ‘99], [Vempala, ’03], [Kannan et al ‘04],
[Achlitopas, McSherry ‘05].
Recover the parameters of the model generating the data.
Many more….
Quest for a General Theory
What can we say independently of any
specific algorithm,
specific objective function,
or specific generative data model
?
Axioms of clustering
Ex. Clustering functions [Puzicha, Hofmann, Buhmann ‘00], [Kleinberg ‘02]
Axioms of clustering quality measures [Ackerman and Ben-David, ‘08].
Find axioms that define clustering.
Why study clusterability?
Even if a data set has no
meaningful structure, any
clustering algorithm will find
some partition of the data set.
Clusterability aims to determine whether a data
set can be meaningfully clustered.
Notions of clusterability quantify the degree
of clustered structure in a data set.
Our contributions
We set out to explore notions of clusterability, compare
them, and find patterns of similarity.
Computational complexity of clustering
Data sets that are more clusterable are computationally
easier to cluster well.
Hardness
Determining whether a data set is clusterable is usually
NP-hard.
Comparison
Notions of clusterability are pairwise inconsistent.
Outline









Definitions and notation
Clusterability and the complexity of clustering
Introduce a new notion of clusterability: PC-clusterability
Worst pair ratio (Epter, Krishnamoorthy, and Zaki, 1999)
Separability (Ostrovsky, Rabani, Schulman, and Swamy, 2006)
Variance ratio (Zhang, 2001)
Comparison of notions of clusterability
The hardness of determining clusterability
Summary and future work
Definitions and Notation



A k-clustering of X is a k-partition of X
A loss function (or cost function) is a function that takes a
clustering and outputs a real number. Ex. k-median, k-means.
Let OPTL,k(X) denote the minimal loss over all
k-clusterings of X.
OPTL,k(X) = min{L(C): C a k-clustering of X}.
Definitions and Notation
A clustering C = {X1, X2, … , Xk } is center-based if there are centers
ci in Xi, such that all points in Xi are closer to ci than to any other
center.
Outline









Definitions and notation
Clusterability and the complexity of clustering
Introduce a new notion of clusterability: PC-clusterability
Worst pair ratio (Epter, Krishnamoorthy, and Zaki, 1999)
Separability (Ostrovsky, Rabani, Schulman, and Swamy, 2006)
Variance ratio (Zhang, 2001)
Comparison of notions of clusterability
The hardness of determining clusterability
Summary and future work
Better clusterability implies that data
is easier to cluster well
In most formulations, clustering is an NP-hard problem. (ex. k-means and
k-median clustering, correlation clustering, ect.)
 When a data set has no significant clustered structure,
there is no sense in clustering it.
 Clustering makes sense only on data sets that have
meaningful clustering structure.
 We show that the more clusterable a data set is, the
easier it is to cluster well.
 Clustering is hard only when there isn’t sufficient
clustering structure in the data set.
Outline









Definitions and notation
Clusterability and the complexity of clustering
Introduce a new notion of clusterability: PC-clusterability
Worst pair ratio (Epter, Krishnamoorthy, and Zaki, 1999)
Separability (Ostrovsky, Rabani, Schulman, and Swamy, 2006)
Variance ratio (Zhang, 2001)
Comparison of notions of clusterability
The hardness of determining clusterability
Summary and future work
Center Perturbation Clusterability:
A high level definition
Call a clustering optimal when it has optimal cost by
some loss function L.
If a clustering looks like the optimal clustering, we
might expect that its cost is also near-optimal (in
well-clusterable data sets.)
A data set X is CP-clusterable if all clusterings that
are structurally similar to an optimal clustering of
X have near-optimal cost.
Center Perturbation Clusterability
Two center-based clusterings C and C’ are ε-close if there exist
centers c1, c2, c3 ,…, ck of C and c’1, c’2, c’3 ,…, c’k of C’ s.t.
|ci -c’i| ≤ ε.
Definition:
A data set X is (ε,δ)-CP Clusterable (for k) if for every
clustering C of X that is ε-close to some optimal kclustering of X,
L(C) ≤(1+ δ) OPTL,k(X) .
Good PC-clusterability implies that it is easy to
cluster well
Theorem:
Given a data set X in Rm that is (rad(X)/sqrt(l),δ)-PC
clusterable (for k), there is an algorithm that runs in time
polynomial in n, and outputs a k-clustering C such that
L(C) ≤(1+ δ)OPTL,k(X).
This result holds with any loss function where the optimal
clusterings are center-based.
Proof that PC-clusterability implies
that it is easy to cluster well
Let an l-sequence denote a collection of l elements of X (not
necessarily distinct).
Algorithm 1:
For each k-tuple of l-sequences of X
S := centers of mass of the l-sequences
Ctemp := the clustering that S induces on X
If L(Ctemp ) < L(C)
C := Ctemp
Return C
Proof that PC-clusterability implies
that it is easy to cluster well
Theorem [Maurey, 1981]:
For any fix l ≥1, and each x in the convex hull of X in Rm,
there exist x1, x2, ,…, xl ɛ X such that the average of the
xi’s is at most rad(X)/sqrt(l ) away from x.



By Maurey’s result, there is a clustering C’ examined by Algorithm 1
that is rad(X)/sqrt(l )-close to an optimal clustering of X.
Since Algorithm 1 selects the minimal loss clustering of the ones it
examines, L(C)≤ L(C‘).
L(C)≤ L(C ‘) ≤ (1+ δ) OPTL,k(X).
Outline









Definitions and notation
Clusterability and the complexity of clustering
Introduce a new notion of clusterability: PC-clusterability
Worst pair ratio (Epter, Krishnamoorthy, and Zaki, 1999)
Separability (Ostrovsky, Rabani, Schulman, and Swamy, 2006)
Variance ratio (Zhang, 2001)
Comparison of notions of clusterability
The hardness of determining clusterability
Summary and future work
Worst Pair Ratio Clusterability:
Preliminaries
Introduced by Epter, Krishnamoorthy, and Zaki in 1999.


The width of a clustering is the maximum
distance between points in the same
cluster (over all clusters).
The split of a clustering is the minimum
distance between points in different
clusters.
Definition:
The Worst Pair Ratio of X (w.r.t. k) is
WPRk(X)=max{split(C)/width(C) : C a k-clustering of X}.
Better WPR-clusterability implies that it is
easier to cluster well
Theorem:
Given a data set X (on n elements) where
WPRk (X) > 1, we can find a k-clustering of X with
the maximal split over width ratio in time
O(n 2 log n).
Outline









Definitions and notation
Clusterability and the complexity of clustering
Introduce a new notion of clusterability: PC-clusterability
Worst pair ratio (Epter, Krishnamoorthy, and Zaki, 1999)
Separability (Ostrovsky, Rabani, Schulman, and Swamy, 2006)
Variance ratio (Zhang, 2001)
Comparison of notions of clusterability
The hardness of determining clusterability
Summary and future work
Separability clusterability
Introduced by Ostrovsky, Rabani, Schulman, and Swamy in 2006.
Definition:
A data set X is (k,ε)-separable if
OPTL,k(X) ≤ εOPTL,k-1(X).


Separability measures how much is gained in the transition
from k-1 to k clusters.
In the original definition, L is k-means.
Better separability implies that it is easier
to cluster well
Theorem [Theorem 4.13, Ostrovsky et al. 2006]:
Given a (k, ε2)-separable data set X in Rm, for small
enough ε, we can find a k-clustering with k-means loss
at most (1- ε 2)/(1-37 ε 2)OPTk-means, k(X) away from the
optimal, with probability at least 1- O(ε2) in time O(nm).
Outline









Definitions and notation
Clusterability and the complexity of clustering
Introduce a new notion of clusterability: PC-clusterability
Worst pair ratio (Epter, Krishnamoorthy, and Zaki, 1999)
Separability (Ostrovsky, Rabani, Schulman, and Swamy, 2006)
Variance ratio (Zhang, 2001)
Comparison of notions of clusterability
The hardness of determining clusterability
Summary and future work
Variance Ratio – Preliminaries.
Introduced by Zhang in 2001.
The within cluster variance of C is W (C ) 
k
| Xi | 2
 (Xi )

i 1 | X |
The between cluster variance of C is
k
| Xi |
(centerOfMass ( X i )  centerOfMass ( X ))2
i 1 | X |
B(C )  
Definition:
The Variance Ratio of a data set X (for k) is
Vark(X) = max{B(C)/W(C) : C a k-clustering of X}
Better VR clusterability implies that it is
easier to cluster well
Theorem:
Given a (2, ε2)-separable data set X in Rm, we can find a 2clustering with k-means loss at most 1-Θ(ε2) away from
the optimal, with probability at least 1- O(ε2) in time
O(nm).
Proof:
-
We can show that VR 2(X) = 1/S 2(X) -1.
The result follows from a theorem by Ostrovsky et al. for 2-clusterings
[Theorem 3.5, Ostrovsky et al. 2006].
Summary of notions of clusterability

Center-perturbation:
Whenever a clustering is structurally similar to the optimal
clustering, its cost is near-optimal.

Separability:
Loss of the optimal k-clustering/Loss of the optimal (k-1)-clustering

Variance Ratio:
Between-cluster variance/Within-cluster variance.

Worst Pair Ratio:
Split/width.
Outline









Definitions and notation
Clusterability and the complexity of clustering
Introduce a new notion of clusterability: PC-clusterability
Worst pair ratio (Epter, Krishnamoorthy, and Zaki, 1999)
Separability (Ostrovsky, Rabani, Schulman, and Swamy, 2006)
Variance ratio (Zhang, 2001)
Comparison of notions of clusterability
The hardness of determining clusterability
Summary and future work
Comparing notions of clusterability
Worst
Pair
Ratio
CenterPerturbation
Separability
Variance
Ratio
An arrow from notion A to notion B indicates good clusterability
by notion A implies good clusterability by notion B.
No two notions are equivalent.
Outline









Definitions and notation
Clusterability and the complexity of clustering
Introduce a new notion of clusterability: PC-clusterability
Worst pair ratio (Epter, Krishnamoorthy, and Zaki, 1999)
Separability (Ostrovsky, Rabani, Schulman, and Swamy, 2006)
Variance ratio (Zhang, 2001)
Comparison of notions of clusterability
The hardness of determining clusterability
Summary and future work
Computational complexity of
determining the clusterability value
What is the computational complexity of
determining the clusterability of a data set?



It is NP-hard to determine whether a data set is
(k,ε)-separable.
It is NP-hard to find the Variance Ratio of a data
set.
It a data set is well-clusterable by WPR, then the
WPR can found in polynomial time.
Outline









Definitions and notation
Clusterability and the complexity of clustering
Introduce a new notion of clusterability: PC-clusterability
Worst pair ratio (Epter, Krishnamoorthy, and Zaki, 1999)
Separability (Ostrovsky, Rabani, Schulman, and Swamy, 2006)
Variance ratio (Zhang, 2001)
Comparison of notions of clusterability
The hardness of determining clusterability
Summary and future work
Summary




We initiate a study of clusterability and
introduce a new notion of clusterability.
We show that three previous and the new notion
of clusterabilty are pairwise distinct.
Better clusterability implies that it is easier to
cluster well for distinct notions of clusterability.
Determining the degree of clusterability is
usually NP-hard.
Future work



Property: The more clusterable a data set is, the
easier it is, computationally, to find a nearoptimal clustering of the data.
Does this property hold for other natural notions
of clusterability?
Can clusterability be axiomatized?
Can it be shown that the above property holds for
all reasonable notions of clusterability?
Appendix: Variance Ratio does not imply
the other notions


Select a data set X1 with arbitrarily poor (k-1)-clusterability according
to any other notion B (separability, Worst Pair Ratio, or CenterPerturbation)
Create data set X2 by taking X1 and adding a single point very far away,
making it’s own cluster in the optimal k-clustering.

X2 can have arbitrarily good Variance Ratio.

Clusterability is the same on data sets X1 and X2 by notion B.
X1
X2