Lecture 12: Clustering (2)

Clustering (2)
Center-based algorithms
Fuzzy k-means
Density-based algorithms (DBSCAN as an example)
Evaluation of clustering results
Figures and equations from Data Clustering by Gan et al.
Center-based clustering
Have objective functions that define how good a solution is;
the goal is to minimize the objective function.
Efficient for large/high-dimensional datasets.
The clusters are assumed to be convex shaped.
The cluster center is representative of the cluster.
Some model-based clustering methods, e.g. Gaussian
mixtures, are center-based.
Center-based clustering
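K-means clustering. Let $C_1, C_2, \dots, C_k$ be k
disjoint clusters.
Error is defined as the sum of the distances from each point to
its cluster center:
\[ E = \sum_{i=1}^{k} \sum_{x \in C_i} d\big(x, \mu(C_i)\big) \]
where $\mu(C_i)$ is the center (mean) of cluster $C_i$.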
Center-based clustering
The k-means algorithm:
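The algorithm figure did not survive in these notes, so here is a minimal sketch of the usual two-step k-means loop (assign each point to its nearest center, then recompute each center as the mean of its members). The function names, the random choice of initial centers, and the handling of empty clusters are assumptions made for illustration, not details taken from the lecture.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of numeric tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # memberships unchanged, so the centers would not move
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels
```

Calling `kmeans(points, 2)` on two well-separated groups of points should place one center in each group; different seeds can give different results on harder data, because the procedure only finds a local minimum.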
Center-based clustering
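Understanding k-means as an optimization
procedure:
The objective function is:
\[ P(W, Q) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{il}\, d(x_i, q_l) \]
where $d$ is the squared Euclidean distance, $Q = \{q_1, \dots, q_k\}$
are the cluster centers, and $W = [w_{il}]$ is the $n \times k$
membership matrix.
Minimize $P(W, Q)$ subject to:
\[ \sum_{l=1}^{k} w_{il} = 1, \qquad w_{il} \in \{0, 1\}, \qquad i = 1, \dots, n, \; l = 1, \dots, k \]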
Center-based clustering
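The solution is found by iteratively solving two subproblems:
When the centers $Q = \hat{Q}$ are fixed, $P(W, \hat{Q})$
is minimized if and only if each point is assigned to its nearest center:
\[ w_{il} = \begin{cases} 1 & \text{if } d(x_i, q_l) = \min_{1 \le t \le k} d(x_i, q_t) \\ 0 & \text{otherwise} \end{cases} \]
When the memberships $W = \hat{W}$ are fixed, $P(\hat{W}, Q)$
is minimized if and only if each center is the mean of its assigned points:
\[ q_{lj} = \frac{\sum_{i=1}^{n} w_{il}\, x_{ij}}{\sum_{i=1}^{n} w_{il}}, \qquad l = 1, \dots, k, \; j = 1, \dots, d \]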
Center-based clustering
In terms of optimization, the k-means procedure
is greedy.
Every iteration decreases the value of the
objective function; the algorithm converges to a
local minimum after a finite number of iterations.
Results depend on the initial values.
The computational complexity is proportional to
the size of the dataset, so the algorithm is efficient on large data.
The clusters identified are mostly ball-shaped.
Works only on numerical data.
Center-based clustering
A variant of k-means to save computing time: the
compare-means algorithm. (There are many.)
Based on triangle inequality,
d(x, mi)+d(x, mj)≥d(mi, mj)
d(x, mj)≥d(mi, mj)-d(x, mi)
If d(mi, mj)≥2d(x, mi),
then d(x, mj)≥d(x, mi)
In every iteration, the small number of between-mean
distances is computed first. Then, for every x, its distance
to the closest known mean is compared with the
between-mean distances to find which of the d(x, mj)
really need to be computed.
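The pruning rule above can be sketched as follows. This is an illustrative sketch, not the published compare-means implementation: the function name and the choice to start from the first mean are assumptions. It uses the plain (not squared) Euclidean distance, since the triangle inequality requires a metric.

```python
import math

def nearest_mean_pruned(x, means, mean_dists):
    """Return (index of the nearest mean, number of point-to-mean distances computed).

    mean_dists[i][j] holds the precomputed distance d(m_i, m_j).
    """
    best, best_d = 0, math.dist(x, means[0])
    computed = 1
    for j in range(1, len(means)):
        # If d(m_best, m_j) >= 2 d(x, m_best), then d(x, m_j) >= d(x, m_best),
        # so m_j cannot be closer and its distance need not be computed.
        if mean_dists[best][j] >= 2 * best_d:
            continue
        d = math.dist(x, means[j])
        computed += 1
        if d < best_d:
            best, best_d = j, d
    return best, computed

means = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
# Between-mean distances are computed once per iteration.
mean_dists = [[math.dist(a, b) for b in means] for a in means]
```

For a point close to the first mean, such as `(0.5, 0.5)`, both other means are skipped and only one distance is evaluated.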
Center-based clustering
Automated selection of k?
The x-means algorithm based on AIC/BIC.
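A family of models $M_j$ at different k is scored by:
\[ \mathrm{BIC}(M_j) = \hat{l}_j(D) - \frac{p_j}{2} \log n \]
$\hat{l}_j(D)$ is the log-likelihood of the data given the jth
model, taken at the maximum-likelihood estimate. $p_j$ is the
number of parameters and n is the sample size.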
We have to assume a model to get the likelihood.
The convenient one is Gaussian.
Center-based clustering
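Under the assumption of identical spherical
Gaussians (n is sample size; k is
number of centroids; $n_{(i)}$ is the size of the cluster containing $x_i$):
μ(i) is the centroid associated with xi.
The log-likelihood is:
\[ l(D) = \sum_{i=1}^{n} \left[ \log\frac{n_{(i)}}{n} - \frac{d}{2}\log\left(2\pi\sigma^2\right) - \frac{\lVert x_i - \mu_{(i)} \rVert^2}{2\sigma^2} \right] \]
The number of parameters is (d is dimension):
\[ p = (k - 1) + dk + 1 \]
(class probabilities + parameters for mean & variance)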
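As a rough illustration of scoring a k-means result under the identical spherical Gaussian model, the sketch below computes a BIC-style score of the form log-likelihood minus (p/2) log n. The pooled variance estimate SSE / (d (n - k)) and the function names are assumptions made here; the x-means paper should be consulted for the exact estimator it uses.

```python
import math

def num_params(k, d):
    """(k - 1) class probabilities + d*k centroid coordinates + 1 shared variance."""
    return (k - 1) + d * k + 1

def bic_spherical(points, centers, labels):
    """BIC-style score (higher is better) for a hard clustering.

    Assumes n > k and that the within-cluster sum of squares is positive.
    """
    n, d, k = len(points), len(points[0]), len(centers)
    sse = sum(sum((a - b) ** 2 for a, b in zip(p, centers[lab]))
              for p, lab in zip(points, labels))
    var = sse / (d * (n - k))  # assumption: pooled per-dimension variance
    sizes = [labels.count(j) for j in range(k)]
    loglik = sum(math.log(sizes[lab] / n) for lab in labels)  # class probabilities
    loglik -= 0.5 * n * d * math.log(2 * math.pi * var)       # Gaussian normalizer
    loglik -= sse / (2 * var)                                 # squared-error term
    return loglik - 0.5 * num_params(k, d) * math.log(n)
```

Comparing the score across candidate values of k, and keeping the k with the highest score, is the basic idea behind the automated selection.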
Center-based clustering
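K-harmonic means --- insensitive to initialization.
K-means error:
\[ E = \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert x_i - \mu_j \rVert^2 \]
K-harmonic means replaces the minimum with the harmonic mean
of the squared distances to all k centers:
\[ E = \sum_{i=1}^{n} \frac{k}{\sum_{j=1}^{k} \lVert x_i - \mu_j \rVert^{-2}} \]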
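To make the difference between the two error functions concrete, here is a small sketch that evaluates both for a given set of centers. It assumes squared Euclidean distances and the harmonic mean over all k centers; the function names and the small floor that avoids division by zero are illustrative choices.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans_error(points, centers):
    """Sum over points of the squared distance to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def khm_error(points, centers, floor=1e-12):
    """Sum over points of the harmonic mean of squared distances to all centers."""
    k = len(centers)
    total = 0.0
    for p in points:
        # floor guards against a point that coincides with a center
        total += k / sum(1.0 / max(sq_dist(p, c), floor) for c in centers)
    return total
```

Because every center contributes to the harmonic mean, a center that starts far from the data still receives a pull toward it, which is one way to read the claim of insensitivity to initialization.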
Center-based clustering
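K-modes algorithm for categorical data.
Let x be a d-vector with categorical attributes. For a
group of x’s, $x_1, \dots, x_n$, the mode is defined as the vector q
that minimizes
\[ D(q) = \sum_{i=1}^{n} d_{sim}(x_i, q) \]
where
\[ d_{sim}(x, y) = \sum_{j=1}^{d} \delta(x_j, y_j), \qquad \delta(x_j, y_j) = \begin{cases} 0 & \text{if } x_j = y_j \\ 1 & \text{otherwise} \end{cases} \]
The objective function is similar to the one for the
original k-means.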
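A minimal sketch of the two ingredients of k-modes, assuming the simple matching distance (the number of attributes on which two vectors disagree): the distance itself and the mode of a group, taken attribute by attribute. The function names are illustrative.

```python
from collections import Counter

def matching_distance(x, y):
    """Number of attributes on which x and y differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

def mode_vector(rows):
    """Most frequent category in each attribute.

    Taking the per-attribute mode minimizes the total matching distance
    to the rows; ties go to the category seen first.
    """
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

rows = [("a", "x"), ("a", "y"), ("b", "y")]
q = mode_vector(rows)
```

Replacing the mean with `mode_vector` and the squared Euclidean distance with `matching_distance` in the k-means loop gives the basic k-modes iteration.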
Center-based clustering
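For mixed data types, the k-probabilities algorithm.
Use the Gower distance, built from a per-attribute distance
$d(x_k, y_k)$ and a weight $w(x_k, y_k)$ that is 0 when attribute k
cannot be compared.
For quantitative attributes ($R_k$ is the range of attribute k),
$d(x_k, y_k) = |x_k - y_k| / R_k$; $w(x_k, y_k) = 1$ if neither is missing.
For binary attributes,
$d(x_k, y_k) = 0$ if xk=1 & yk=1, and 1 otherwise; $w(x_k, y_k) = 1$ if xk=1 or yk=1.
For nominal attributes,
$d(x_k, y_k) = 0$ if xk=yk, and 1 otherwise; $w(x_k, y_k) = 1$ if neither is missing.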
Center-based clustering
The distance from a sample i to cluster p:
wipk=1 if we can compare the kth variable
between case i and cluster p; 0 otherwise.
wk=1 if neither xik nor μpk is missing.
Objective function is
Center-based clustering
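K-prototypes algorithm for mixed-type data.
Between any two points, with attributes 1, ..., p numerical and
attributes p+1, ..., d categorical, the distance is defined:
\[ d(x, y) = \sum_{j=1}^{p} (x_j - y_j)^2 + \gamma \sum_{j=p+1}^{d} \delta(x_j, y_j) \]
where $\delta(x_j, y_j) = 0$ if $x_j = y_j$ and 1 otherwise.
γ is a parameter to balance between continuous
and categorical variables.
Cost function to minimize:
\[ P(W, Q) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{il}\, d(x_i, q_l) \]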
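The mixed distance can be sketched in a few lines, assuming the numerical attributes come first in each record, followed by the categorical ones. The function name and the `n_numeric` argument are illustrative.

```python
def kprototypes_distance(x, y, n_numeric, gamma):
    """Squared Euclidean part on the first n_numeric attributes
    plus gamma times the number of mismatches on the remaining ones."""
    numeric = sum((a - b) ** 2 for a, b in zip(x[:n_numeric], y[:n_numeric]))
    categorical = sum(1 for a, b in zip(x[n_numeric:], y[n_numeric:]) if a != b)
    return numeric + gamma * categorical

x = (1.0, 2.0, "red", "small")
y = (2.0, 4.0, "red", "medium")
dist = kprototypes_distance(x, y, n_numeric=2, gamma=0.5)
```

A larger `gamma` makes categorical mismatches count more; with `gamma = 0` the categorical attributes are ignored.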
Fuzzy k-means
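Soft clustering --- an observation can be assigned
to multiple clusters. With n samples and c
partitions, the fuzzy c-partition matrix (c × n) is:
\[ U = [u_{ij}], \qquad u_{ij} \in [0, 1], \qquad \sum_{i=1}^{c} u_{ij} = 1 \text{ for all } j, \qquad \sum_{j=1}^{n} u_{ij} > 0 \text{ for all } i \]
Taking the maximum membership for every sample gives back a hard
partition:
\[ w_{ij} = \begin{cases} 1 & \text{if } u_{ij} = \max_{1 \le l \le c} u_{lj} \\ 0 & \text{otherwise} \end{cases} \]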
Fuzzy k-means
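The objective function is:
\[ J_q(U, V) = \sum_{j=1}^{n} \sum_{i=1}^{k} u_{ij}^{\,q}\, d^2(x_j, V_i) \]
q > 1 controls the “fuzziness”.
Vi is the centroid of cluster i, uij is the degree of
membership of xj belonging to cluster i, d(xj, Vi) is the
distance from xj to Vi, and k is the number of clusters.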
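The objective is usually minimized by alternating two updates, which the sketch below writes out under the assumption of Euclidean distance: memberships proportional to d^(-2/(q-1)), and centroids as means weighted by u^q. These are the standard fuzzy c-means update rules rather than something stated in these notes, and the function names are illustrative.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def update_memberships(points, centers, q=2.0):
    """Return u with u[i][j] = membership of points[j] in cluster i (columns sum to 1)."""
    k, n = len(centers), len(points)
    u = [[0.0] * n for _ in range(k)]
    for j, x in enumerate(points):
        d2 = [sq_dist(x, v) for v in centers]
        zeros = [i for i, d in enumerate(d2) if d == 0.0]
        if zeros:
            # The point coincides with one or more centers: share membership among them.
            for i in zeros:
                u[i][j] = 1.0 / len(zeros)
            continue
        inv = [d ** (-1.0 / (q - 1.0)) for d in d2]
        total = sum(inv)
        for i in range(k):
            u[i][j] = inv[i] / total
    return u

def update_centers(points, u, q=2.0):
    """Each centroid is the mean of the points weighted by u[i][j] ** q."""
    dim = len(points[0])
    centers = []
    for row in u:
        w = [m ** q for m in row]
        total = sum(w)
        centers.append([sum(wj * x[t] for wj, x in zip(w, points)) / total
                        for t in range(dim)])
    return centers
```

Repeating `update_memberships` and `update_centers` until the memberships stop changing gives a basic fuzzy k-means loop; as q approaches 1 the memberships approach a hard assignment.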
Density-based algorithms (DBSCAN as an example)
Capable of finding arbitrarily shaped clusters.
Clusters are defined as dense regions surrounded
by low-density regions.
Automatically selects the number of clusters.
Needs only one scan through the original data set.
Density-based algorithms (DBSCAN as an example)
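Define the ε-neighborhood of a point x as:
\[ N_\varepsilon(x) = \{\, y \in D : d(x, y) \le \varepsilon \,\} \]
A point x is “directly density-reachable” from
point y if:
\[ x \in N_\varepsilon(y) \quad \text{and} \quad |N_\varepsilon(y)| \ge N_{min} \]
This relationship is not symmetric.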
Density-based algorithms (DBSCAN as an example)
Point x is “density-reachable” from point y if there’s a
sequence of points
x, x1, x2, ..., xi, y
where each point is directly density-reachable from
the next one.
Points x and y are “density-connected” if there exists
a point z such that both x and y are density-reachable from z.
(* all the relationships are with respect to ε and Nmin)
Density-based algorithms (DBSCAN as an example)
The definition of “cluster”:
Let D be the dataset. A cluster C with respect to ε
and Nmin satisfies:
(1) If x is in C and y is density-reachable from x,
then y is in C;
(2) Every pair of x and y in C are density-connected.
“Noise” is a set of points that don’t belong to any
cluster.
Density-based algorithms (DBSCAN as an example)
[Figure: example data set with a core point, a border point, and noise labeled]
Density-based algorithms (DBSCAN as an example)
Algorithm:
Start with an arbitrary point x and find all points that
are density-reachable from x. (If x is a core point,
then a cluster is found; if x is a border or noise
point, no point is density-reachable from it.)
Visit the next unclassified point.
Two clusters may be merged if close enough.
Cluster distance is defined by single linkage:
\[ d(C_1, C_2) = \min_{x \in C_1,\; y \in C_2} d(x, y) \]
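The scan described above can be sketched as follows. This is a simplified illustration with a brute-force neighborhood search; it omits the optional cluster-merging step, and the names `region_query`, `dbscan`, and `NOISE` are chosen here for illustration.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points in the eps-neighborhood of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, n_min):
    """Return one label per point: 0, 1, ... for clusters, NOISE for noise."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already classified
        neighbors = region_query(points, i, eps)
        if len(neighbors) < n_min:
            labels[i] = NOISE  # not a core point; may become a border point later
            continue
        labels[i] = cluster  # i is a core point, so a new cluster starts here
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue  # already classified, do not expand again
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= n_min:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
        cluster += 1
    return labels
```

Only core points are expanded, which is why nothing is density-reachable from a border or noise point. A spatial index would replace the brute-force `region_query` on larger data.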
How to choose ε and Nmin?
A heuristic method called the “sorted k-dist
graph”.
For each point, Fk is the distance to its kth nearest neighbor.
Sort the Fk(D) (from the entire dataset) in
descending order and plot.
Find the k* where k>k* doesn’t bring much
change to the graph. Set Nmin=k*.
Find the first point z0 in the first valley of the graph. Set
ε=Fk*(z0).
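The values plotted in the sorted k-dist graph can be computed with a short brute-force sketch, assuming Euclidean distance. The function name is illustrative, and choosing the valley from the plot remains a visual judgment.

```python
import math

def sorted_k_dist(points, k):
    """Distance from each point to its k-th nearest neighbor, in descending order."""
    values = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        values.append(dists[k - 1])
    return sorted(values, reverse=True)

pts = [(0,), (1,), (2,), (10,)]
curve = sorted_k_dist(pts, 1)
```

In this toy example the isolated point at 10 produces the single large value at the start of the curve, and the flat tail after it corresponds to points inside a dense region.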
Evaluation
External criteria approach: comparing clustering
results (C) with a pre-specified partition (P).
For all M = a+b+c+d pairs of samples:

                             In same cluster in C    In different clusters in C
In same cluster in P                  a                          c
In different clusters in P            b                          d
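The four pair counts can be tallied directly from two label vectors. The sketch below assumes the common convention that a counts pairs together in both C and P, b pairs together in C but apart in P, c pairs apart in C but together in P, and d pairs apart in both. The Rand and Jaccard indices are shown as two standard pair-counting summaries, which may or may not be the ones used in the lecture.

```python
from itertools import combinations

def pair_counts(c_labels, p_labels):
    """Return (a, b, c, d) over all pairs of samples."""
    a = b = c = d = 0
    for i, j in combinations(range(len(c_labels)), 2):
        same_c = c_labels[i] == c_labels[j]
        same_p = p_labels[i] == p_labels[j]
        if same_c and same_p:
            a += 1
        elif same_c:
            b += 1
        elif same_p:
            c += 1
        else:
            d += 1
    return a, b, c, d

def rand_index(a, b, c, d):
    """Fraction of pairs on which C and P agree."""
    return (a + d) / (a + b + c + d)

def jaccard_index(a, b, c, d):
    """Like the Rand index, but ignoring pairs that are apart in both."""
    return a / (a + b + c)

counts = pair_counts([0, 0, 1, 1], [0, 0, 0, 1])
```

Both indices are symmetric in b and c, so they do not depend on which disagreement is called b and which is called c.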
Evaluation
Monte Carlo methods based on H0 (random
generation), or bootstrap are needed to find
significance.
Evaluation
External criteria:
An alternative is to compare the proximity matrix
Q with the given partition P.
Define matrix Y based on P:
Evaluation
Internal criteria: evaluate clustering structure
by features of the dataset (mostly the proximity
matrix of the data).
Example:
For hierarchical clustering,
Pc: cophenetic matrix; the ijth element
represents the proximity level at which two data
points xi and xj are first joined into the same
cluster.
P: proximity matrix.
Evaluation
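Cophenetic correlation coefficient index:
\[ \mathrm{CPCC} = \frac{\frac{1}{M}\sum_{i<j} d_{ij}\, c_{ij} - \mu_P\, \mu_C}{\sqrt{\left(\frac{1}{M}\sum_{i<j} d_{ij}^2 - \mu_P^2\right)\left(\frac{1}{M}\sum_{i<j} c_{ij}^2 - \mu_C^2\right)}}, \qquad M = \frac{n(n-1)}{2} \]
dij: elements in P, with mean $\mu_P$
cij: elements in Pc, with mean $\mu_C$
CPCC is in [-1,1]. Higher value indicates better
agreement.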
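Since the index is the Pearson correlation between the upper-triangle entries of P and Pc, it can be sketched directly. The function name and the flat-list input format are assumptions made for illustration; building Pc from a dendrogram is not shown.

```python
import math

def cpcc(d, c):
    """Correlation between proximities d_ij and cophenetic values c_ij.

    d and c are equal-length lists holding the entries with i < j,
    in the same order. Both lists must have non-zero variance.
    """
    m = len(d)
    mean_d, mean_c = sum(d) / m, sum(c) / m
    cov = sum(x * y for x, y in zip(d, c)) / m - mean_d * mean_c
    var_d = sum(x * x for x in d) / m - mean_d ** 2
    var_c = sum(y * y for y in c) / m - mean_c ** 2
    return cov / math.sqrt(var_d * var_c)
```

A value near 1 means the dendrogram's merge levels preserve the ordering and spacing of the original proximities.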
Evaluation
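Relative criteria: choose the best result out of a set
according to a predefined criterion.
Example:
Modified Hubert’s Γ statistic:
\[ \Gamma = \frac{1}{M} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} P_{ij}\, Q_{ij}, \qquad M = \frac{n(n-1)}{2} \]
P is the proximity matrix of the data; $Q_{ij}$ is the distance
between the centers of the clusters to which xi and xj belong.
High value indicates compact clusters.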