Stream Clustering

Report
Stream Clustering
CSE 902
Big Data
Stream analysis
Stream: Continuous flow of data
Challenges
◦ Volume: Not possible to store all the data
◦ One-time access: Not possible to process the data using multiple
passes
◦ Real-time analysis: Certain applications need real-time analysis of
the data
◦ Temporal Locality: Data evolves over time, so model should be
adaptive.
Stream Clustering
Topic
cluster
Article
Listings
Stream Clustering
• Online Phase
• Summarize the data into
memory-efficient data
structures
• Offline Phase
• Use a clustering
algorithm to find the
data partition
Stream Clustering Algorithms
Data Structures
Examples
Prototypes
Stream, Stream Lsearch
CF-Trees
Scalable k-means, single pass k-means
Microcluster Trees
ClusTree, DenStream, HP-Stream
Grids
D-Stream, ODAC
Coreset Tree
StreamKM++
Prototypes
Stream, LSearch
CF-Trees
Summarize the data in each CF-vector
• Linear sum of data points
• Squared sum of data points
• Number of points
Scalable k-means, Single pass k-means
Microclusters
CF-Trees with “time” element
CluStream
• Linear sum and square sum of timestamps
• Delete old microclusters/merging microclusters if
their timestamps are close to each other
Sliding Window Clustering
• Timestamp of the most recent data point added to
the vector
• Maintain only the most recent T microclusters
DenStream
• Microclusters are associated with weights based on
recency
• Outliers detected by creating separate microcluster
Grids
D-Stream
• Assign the data to grids
• Grids weighted by recency of points
added to it
• Each grid associated with a label
DGClust
• Distributed clustering of sensor data
• Sensors maintain local copies of the grid
and communicate updates to the grid to
a central site
StreamKM++ (Coresets)
 A weighted set S is a (, ) coreset for a data set D if the clustering of S approximates the
clustering of D with an error margin of 
1 −  ∗  ,  ≤  ,  ≤ (1 + ) ∗ dist D, C
 Maintain data in buckets 1 , 2 …  .
Buckets 2 to  contain either exactly
contains 0 or m points. 1 can have
any number of points between 0 to m
points.
 Merge data in buckets using coreset
tree.
StreamKM++: A Clustering Algorithm for Data Streams, Ackermann, Journal of Experimental Algorithmics 2012
Kernel-based Clustering

 ( x , y )  ( x , 2 xy , y )
2
2
T
 ,  = () ()
Kernel-based Stream Clustering
 Use non-linear distance measures to define similarity between
data points in the stream
 Challenges
 Quadratic running time complexity
 Computationally expensive to compute centers using linear sums and
squared sums (CF-vector approach will not work)
Stream Kernel k-means (sKKM)
1
(1 , 2 , …  )
Kernel k-means
2
3
Weighted
Kernel k-means
(1 , 2 , …  )
Approximation of Kernel k-Means for Streaming Data, Havens, ICPR 2012
History from only the preceding data chunk retained
Statistical Leverage Scores
Measures the influence of a point in the low-rank approximation

  
=
=1
Leverage score  =

 2
(
=1  )
Statistical Leverage Scores
Used to characterize the matrices which can be approximated accurately with a sample of the
entries
0 0 1
= 0 0 0
0 0 0
1 0
= 0 0
0 1
0 1
1 0
0 0
0 0 0 0 −1
0 0 0 −1 0
0 0 1 0
0
Leverage scores are 1, 1, 1 – all rows are equally important
All the entries of the matrix need to be sampled
If singular vectors/eigenvectors are spread out (uncorrelated with the standard basis),
then we can approximate the matrix with a small number of samples
Approximate Stream kernel k-means
o Uses statistical leverage score to determine which data points in the stream are
potentially “important”
o Retain the important points and discard the rest
o Use an approximate version of kernel k-means to obtain the clusters – Linear
time complexity
o Bounded amount of memory
Approximate Stream kernel k-means
Importance Sampling
Sampling probability
 =  Σ  
 =
1


2
2
Kernel matrix construction
 =
−1
∅
ℎ  
∅ ( ,  )
−1 ℎ  1 − 
Clustering
• Using kernel k-means to recluster M each time a point is added will be expensive
• Reduce complexity by employing a low-dimensional representation of the data

Kernel k-means
min

max
 ∈ 0,1  ×   ∙ ∈
=1 =1

 ∙ − ( ,∙)
 
2

• Constrain the cluster centers to the top k eigenvectors of the kernel matrix

“Approximate”
Kernel k-means
min ×  max
 ∈ 0,1
 ∙ ∈

=1 =1

 ∙ − ( ,∙)
 
 =  (1 , …  )
2

Clustering
“Approximate”
Kernel k-means

min ×  max
 ∈ 0,1
 ∙ ∈

=1 =1

 ∙ − ( ,∙)
 
2

 =  (1 , …  )
 ∙ =
1
 Σ 1/2

max ×  ( Σ    )
 ∈ 0,1
Solve by running k-means on  Σ 1/2 - ( 2 ) running time complexity
Updating eigenvectors
• Only eigenvectors and eigenvalues of kernel matrix are required for both sampling and
clustering
• Update the eigenvectors and eigenvalues incrementally
( +  3 ) running
time complexity
 = Σ 
 +  = 

 =  −   
Component orthogonal to 


Σ′ 



Σ ′ contains the eigenvalues of
Σ 
sparse matrix 


Approximate Stream Kernel k-means
Network Traffic Monitoring
 Clustering used to detect intrusions in the network
 Network Intrusion Data set
 TCP dump data from seven weeks of LAN traffic
 10 classes: 9 types of intrusions, 1 class of legitimate traffic.
Running Time in milliseconds
(per data point)
Cluster Accuracy
(NMI)
Approximate stream kernel k-means
6.6
14.2
StreamKM++
0.8
7.0
sKKM
42.1
13.3
Around 200 points clustered per second
Summary
 Efficient kernel-based stream clustering algorithm - linear running time complexity
 Memory required is bounded
 Real-time clustering is possible
 Limitation: does not account for data evolution

similar documents