Report

Advanced Statistical Methods for Research Math 736/836 Cluster Analysis Part 2: K-means clustering. Watch out for the occasional paper clip! ©2009 Philip J. Ramsey, Ph.D. 1 K-means clustering is quite different in its approach from hierarchical clustering. It is based on disjoint clusters. K-means clustering tends to have the same biases as Ward’s method in terms of characteristics of the clusters. In K-means clustering we may start with a predetermined number of clusters K and assign observations to the clusters in such a way that the within cluster variation is minimized and the between cluster variation is maximized. It is for this reason that Everitt and Dunn refer to this as an optimization method. The idea is to minimize the distances between observations within each cluster while maximizing the distance between the K clusters. This method should typically only be used for larger datasets, say at least 200. The solutions are highly unstable with smaller datasets. ©2009 Philip J. Ramsey, Ph.D. 2 The basic K-means cluster algorithm is to start with some prespecified number of clusters or use an algorithm to pick an optimal number of clusters if the user has no a priori idea how many clusters should exist in the data. The first step is to designate a set of N seed values and assign observations to clusters based on the nearness of the seed values. Next compute the initial K centroids based on the groupings from the seed points. Then points are reassigned based upon nearness to one of the centroids. Keep reassigning observations to clusters based upon the nearness of the centroids. Each time a point is moved from one cluster to another, recalculate the centroids for the clusters involved in the swap. Continue the process until no further reassignments take place. ©2009 Philip J. Ramsey, Ph.D. 3 Example: We use the JMP sample data set Cytometry.JMP. We will perform K-means clustering with K = 4 clusters specified. The variables CD3 and CD8 are used for the initial clustering attempt. ©2009 Philip J. Ramsey, Ph.D. 4 Example: In K-Means cluster report window we are given a number of options under the Control Panel, however the JMP defaults are recommended – see the JMP help file if you wish to change some of the defaults. We can click on “Go” to have JMP find a solution, or the user can click on “Step” to see each iteration of the process. ©2009 Philip J. Ramsey, Ph.D. 5 Example: After 16 iterations JMP converges to a solution for 4 clusters. ©2009 Philip J. Ramsey, Ph.D. 6 Example: Under the hotspot at the top of the report you can choose to see a biplot that displays the variable rays and the clusters. ©2009 Philip J. Ramsey, Ph.D. 7 Example: An important question is how many clusters to begin without a scientific basis for specifying the number of clusters. One good approach is try some scatter plots and see if any natural clustering seems to occur. Of course this is limited to 2D or 3D plots but is a valid basis when no a priori information exists. Bivariate Fit of CD8 By CD3 A Fit Y by X plot indicates that 4 clusters may exist. A 3D Plot in CD3, CD8, and CD4 also indicated 4 clusters. 500 400 CD8 300 200 100 0 0 100 200 300 400 500 CD3 ©2009 Philip J. Ramsey, Ph.D. 8 Example: A 3D plot of CD3, CD8, and MCB indicated the possibility of 5 clusters. ©2009 Philip J. Ramsey, Ph.D. 9 Example: K-means clustering was repeated with CD3, CD8, and MCB and 5 clusters specified. ©2009 Philip J. Ramsey, Ph.D. 10 Example: Below is a 3D biplot for the five clusters and three variables. ©2009 Philip J. Ramsey, Ph.D. 11 JMP provides a fairly modern version of K-means clustering based on Self Organizing Maps (SOMs). K-means clustering can be thought of as a special case of SOMs methodology. JMP provides a limited but useful application of SOMs to cluster analysis. SOMs are a data visualization technique which reduce the dimensions of data through the use of self-organizing neural networks. SOMs reduce dimensions by producing a 2D map which plots the similarities of the data by grouping similar data items together. The goal of a SOM is to form clusters in a particular layout on a cluster grid, such that points in clusters that are near each other in the SOM grid are also near each other in multivariate space. In classical K-means clustering, the structure of the clusters is arbitrary, but in SOMs the clusters have the grid structure. This grid structure may be of value in interpreting the clusters. ©2009 Philip J. Ramsey, Ph.D. 12 Example: We continue with the Cytometry example. Iterative Clustering Control Panel Self Organizing Map The bandwidth option is a weighting function used in estimating the clusters. JMP help does not provide details on its form. However, the choice of bandwidth can impact the final cluster. Try various values of the bandwidth and see if a more interpretable cluster analysis occurs. Standardize data by Std Dev Color while clustering Shift distances using sampling rates Use within-cluster std deviations Bandwidth: 0.75 Cluster Summary Step Criterion 18 0 Cluster Count Max Dist 1 1785 2.75310404 2 988 2.21211425 3 815 1.77335177 4 1412 1.30760703 Cluster Means Cluster CD3 CD8 1 314.022388 291.282325 2 344.491485 117.459102 3 139.366061 135.557004 4 218.502141 108.402641 Cluster Standard Deviations Cluster ©2009 Philip J. Ramsey, Ph.D. CD3 CD8 1 22.1007724 25.7207367 2 28.5748537 37.0754192 3 33.4412448 40.2863212 4 23.7036989 35.3287975 13 Example: Below is the initial SOM layout of the four cluster locations for the Cytometry example. Biplot 2 CD8 1 3 1 4 2 P rin 2 0 -1 CD3 -2 -3 -4 -3 -2 -1 0 1 2 3 4 Prin 1 Eigenvalues 1.4604529 ©2009 Philip J. Ramsey, Ph.D. 0.5395471 14 Example: Below is the final SOM for the Cytometry example using a bandwidth of 0.75 (left biplot) and 0.5 (right biplot). Changing the bandwidth can significantly change the map. Biplot Biplot 2 CD8 2 CD8 1 3 1 1 3 1 4 P rin 2 P rin 2 0 -1 4 0 2 CD3 -1 2 -2 -2 CD3 -3 -3 -4 -3 -2 -1 0 1 2 3 4 -4 -3 -2 0 1 2 3 4 Prin 1 Prin 1 Eigenvalues Eigenvalues 1.4604529 -1 0.5395471 ©2009 Philip J. Ramsey, Ph.D. 1.4604529 0.5395471 15 Example: Below is the final SOM for the Cytometry example using a bandwidth of 0.90. Notice that with a high bandwidth we do not even have four clusters – there is too much smoothing. Biplot CD8 2 1 2 P rin 2 4 0 -1 1 3 -2 CD3 -3 -4 -3 -2 -1 0 1 2 3 4 Prin 1 Eigenvalues 1.4604529 ©2009 Philip J. Ramsey, Ph.D. 0.5395471 16 Normal Mixtures is an iterative technique implemented in the Kmeans clustering platform in JMP, although it is not a traditional Kmeans clustering algorithm. Both K-means and Hierarchical clustering methods attempt to group observations into clusters. However the normal mixture approach is more of an estimation method to characterize the cluster groups. If clusters overlap, assigning each observation to one cluster is problematic. In the overlap areas, there are observations from several clusters sharing the same space. Rather than classifying each observation into a cluster, the mixtures method estimates the probability that an observation is in each cluster based upon the assumption of a multivariate normal distribution for each cluster. The user must specify the number of clusters. The method produces fuzzy clusters that may overlap based on the assignments using the probabilities of cluster membership. ©2009 Philip J. Ramsey, Ph.D. 17 The concept of normal mixtures is that the observed multivariate data are generated as a finite mixture of K multivariate normal distributions. With each distribution contributing some proportion p to the total number of observations N. In cluster analysis the user specifies the number of multivariate distributions that must be estimated. Once the parameters of the K multivariate distributions are estimated, then it is possible to calculate probabilities that each of the N observations came from one of the distributions. Generally, the observation can be assigned to the cluster for which it has the highest estimated probability of belonging. In this sense the observations are not strictly assigned to clusters, but rather can be assigned classifications based on the probabilities. An algorithm estimates the distribution parameters and proportions. ©2009 Philip J. Ramsey, Ph.D. 18 To illustrate the concept of finite normal mixtures, suppose we have the bivariate case – there are two variables of interest. These two variables take on values that come from one of two distributions (K = 2) with proportions p and (1-p) respectively. For each of the two distributions we have to estimate the centroid μ, the covariance matrix , and the proportion (unless known in advance). In clustering we assume the proportions are unknown. Our overall bivariate distribution is estimated from the observations as a combination of the two underlying normal distributions. f X 1, X 2 p M V N μ A ; Σ A 1 p M V N μ B ; Σ B For some number of distributions m > 2 and a vector of variables X, the formula generalizes to m f X pi M V N μ i ; Σ i i 1 ©2009 Philip J. Ramsey, Ph.D. 19 The parameters and proportions for the distributions are estimated using the method of maximum likelihood. We will not delve into the mathematical details, however a sketch of the approach is provided by Everitt and Dunn text in the Cluster Analysis chapter. Example: We will use the Cytometry data. We have K = 4 clusters for two variables CD3 and CD8, so we have to estimate m = 4 bivariate normal distributions and proportions. We will use the Kmeans, Normal Mixtures option to estimate the proportions and parameters of the distributions. ©2009 Philip J. Ramsey, Ph.D. 20 Example continued: The estimated proportions and distribution parameters are given below. JMP only provides correlation estimates. Cluster Summary Correlations for Normal Mixtures Step Criterion 500 3.793e-9 Cluster Cluster1 Proportion Mixture Count 1 0.39396573 1905.11684 2 0.17119194 844.329154 3 0.34261552 1686.30176 4 0.09222681 564.252252 Cluster Means Cluster CD3 CD8 CD3 CD8 CD3 1.0000 0.3751 CD8 0.3751 1.0000 CD3 CD8 CD3 1.0000 0.0121 CD8 0.0121 1.0000 CD3 CD8 Cluster2 Cluster3 1 218.339067 129.156625 2 339.684674 88.7314432 CD3 1.0000 0.3801 3 318.018017 306.756268 CD8 0.3801 1.0000 4 120.184407 96.4038892 CD3 CD8 CD3 1.0000 0.2413 CD8 0.2413 1.0000 Cluster4 Cluster Standard Deviations Cluster CD3 CD8 1 52.5587037 45.1128235 2 25.8393687 33.2087219 3 19.8353249 20.8102507 4 39.0601389 26.6025811 Note: StdDev slightly inflated for regularization. ©2009 Philip J. Ramsey, Ph.D. 21 Example continued: Below is a 3D view of the mixture density. Notice that cluster 4 does not resolve due to a low proportion. #1 #3 #2 #4 ©2009 Philip J. Ramsey, Ph.D. 22 Example continued: Below is a Fit Y by X plot of the clusters showing 95% density ellipses for each cluster. Bivariate Fit of CD8 By CD3 500 400 3 CD8 300 200 1 4 100 2 0 0 100 200 300 400 500 CD3 Bivariate Normal Ellipse P=0.900 Cluster==1 ©2009 Philip J. Ramsey, Ph.D.Bivariate Normal Ellipse P=0.900 Cluster==2 23 Example continued: Below is a partial view of the spreadsheet with the calculated probabilities of cluster membership for the observations. ©2009 Philip J. Ramsey, Ph.D. 24 It is possible to use K-means clustering with predetermined or fixed clusters. As an example, there may be known species that make up a multivariate dataset and the species are the basis for the clusters. The estimated cluster formula could then be saved to the spreadsheet and then applied to a new set of measurements where the identity of the species is not known. This takes clustering into the realm of classification (supervised learning). We use the JMP sample dataset OwlDiet. JMP to demonstrate fixed clustering using Normal Mixtures. The data consists of measurements on seven species of Malaysian rats that are eaten by owls. There are two data sets. The first (classification) data set had 179 observations and contained measures of skull length, teeth row, palentine foramen, and jaw width for seven known species. The second (observational) set had 115 observations and no information regarding species. We will try to identify the unknown species. ©2009 Philip J. Ramsey, Ph.D. 25 Example: For the owl data we specify species as the “Centers” variable in the Cluster Analysis launch window. The number of clusters for K-means clustering is now set by “Species.” ©2009 Philip J. Ramsey, Ph.D. 26 Example: The observations with no designated species will be classified into their nearest cluster. Select the “Save Clusters” option to see the classifications for the observations missing a species designation. Cluster Summary Step Cluster Summary Criterion 0 Cluster 0 Criterion 999 3.43e-13 species Proportion annandalfi 0.14285714 0 1 2 argentiventer 0.14285714 0 3 exulans 0.14285714 0 3 4 rajah 0.14285714 0 5 surifer 0.14285714 1 Mixture Count Step Cluster species Proportion Mixture Count annandalfi 0.36207334 28.4645603 2 argentiventer 0.12937488 28.680347 exulans 0.05468071 13.4915849 4 rajah 0.17687071 0 5 surifer 0.01698639 3.8828717 52 6 tiomanicus 0.14285714 0 6 tiomanicus 0.04568538 6.96155194 7 whiteheadi 0.14285714 0 7 whiteheadi 0.2143286 160.519084 Cluster centers defined by species ©2009 Philip J. Ramsey, Ph.D. Cluster centers defined by species 27 Example : Notice in the partial spreadsheet view that the observations without a species are assigned a cluster (or species). ©2009 Philip J. Ramsey, Ph.D. 28 Example : If you would like the Cluster column to display species designations, change the column data type to character (Column Info window), while the cluster column is highlighted click on the Column menu and select “recode.” Now recode the cluster numbers to species. ©2009 Philip J. Ramsey, Ph.D. 29 Example : A biplot with the 7 clusters marked on the plot. Palantine foramen is not highly correlated with the other variables. The species whiteheadi (cluster 7) has a very small palatine foramen. The species exulans (cluster 3) have a small skull length, teeth, and jaw. Notice considerable overlap in the clusters. ©2009 Philip J. Ramsey, Ph.D. Exulans palatine foramen jaw length teeth row skull length whiteheadi 30 Example : A 3D biplot with the 7 clusters marked on the plot as ellipsoids. Scatterplot 3D Exulans whiteheadi Principal Components ©2009 Philip J. Ramsey, Ph.D. Prin1 Prin2 Prin3 31