K-MEANS ALGORITHM
Jelena Vukovic 53/07
jeca.zr@gmail.com
Elektrotehnički fakultet u Beogradu

Introduction
• Basic idea of the k-means algorithm
• Detailed explanation
• Most common problems of the algorithm
• Applications
• Possible improvements

Basic principles of the algorithm
• Given a set of points (x1, x2, …, xn)
• Partition the n points into k sets (n > k): S1, S2, …, Sk
• The goal is to minimize the within-cluster sum of squares:
  \arg\min_{S} \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2
• µi is the mean of the points in Si

The algorithm
• Choose the number of means (k) and their initial positions
• Iterate until the assignments no longer change:
  1. Assign each point to the nearest mean
  2. Move each mean to the center of its cluster
• The complexity is O(n * k * I * d)
  • n – number of points
  • k – number of clusters
  • I – number of iterations
  • d – number of attributes
[Figure panels: Assign points to nearest mean; Move means; Re-assign points]

K nearest neighbors
• Similar in name and in its reliance on a distance measure, but k-NN is a supervised classifier, whereas k-means is unsupervised clustering
• In k-NN the decision is made by a simple majority vote of the k closest neighbors
• In k-means the Euclidean distance measure is used

Some limitations of the algorithm
• The number of clusters needs to be known in advance
• The result depends on the initial positions of the means
• Problems appear when clusters have different
  • Shapes
  • Sizes
  • Densities

Initial centroids problem
• Random initialization (the most common)
• Multiple runs
• Testing on a data sample
• Analyzing the data

Different density
[Figure panels: Original points; 3 Clusters]

Non-globular shapes
[Figure panels: Original points; 2 Clusters]

Pros and cons
Pros
• Simple to implement
• Fast
• Not computationally demanding
Cons
• K needs to be known
• Globular (roughly spherical) cluster shape is assumed
• Requires some knowledge about the data in advance
• May run many iterations without significant changes in the clusters

Applications of the algorithm
• Many different uses
  • Computer vision
  • Market segmentation
  • Geostatistics
  • Astronomy
  • etc.

Improvements
• Pre-process the data in order to better estimate k
• Run multiple iterations in parallel with different centroid initializations
• Ignore outliers (possible errors in the data) to avoid non-standard cluster shapes

Thank you!
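
Example sketch
The two-step iteration from "The algorithm" and the multiple-runs idea from "Improvements" can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the implementation behind these slides: the names k_means, best_of_runs, squared_distance and the parameters max_iter, runs and seed are hypothetical, the initial means are k randomly chosen input points, and the runs are executed one after another rather than in parallel.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iter=100, seed=None):
    """One run of the two-step iteration; returns (means, labels, wcss).

    Assumes points is a list of equal-length tuples and max_iter >= 1.
    """
    rng = random.Random(seed)
    # Random initialization (assumption): k distinct input points as means.
    means = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 1: assign each point to the nearest mean.
        new_labels = [
            min(range(k), key=lambda i: squared_distance(p, means[i]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the means would not move
        labels = new_labels
        # Step 2: move each mean to the center of its cluster.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its previous mean
                means[i] = tuple(sum(c) / len(members) for c in zip(*members))
    # Within-cluster sum of squares, the quantity k-means tries to minimize.
    wcss = sum(squared_distance(p, means[lab]) for p, lab in zip(points, labels))
    return means, labels, wcss


def best_of_runs(points, k, runs=10, seed=0):
    """Repeat k_means with different seeds and keep the lowest-WCSS result."""
    results = [k_means(points, k, seed=seed + r) for r in range(runs)]
    return min(results, key=lambda res: res[2])


# Usage on two small, well-separated groups of 2-D points (made-up data).
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
means, labels, wcss = best_of_runs(points, k=2)
```

Keeping the run with the lowest within-cluster sum of squares is one simple way to reduce the dependence on the initial centroids. It does not help with the other limitations listed above, such as clusters of different density or non-globular shape.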