Cluster analysis more usually referred to as clustering, is a common data mining task. In clustering the goal is to divide the data set into groups so that objects in the same group are similar to one another while objects in different groups are different to one another. In other words the goal is to minimize the intra-cluster distance while maximizing the inter-cluster distance.
Clustering has wide applicability being used in areas as diverse as text mining, bioinformatics, customer segmentation, spatial data analysis and many others. Clustering is different to classification and prediction in that it is an unsupervised learning method and so there is no test set against which to gauge accuracy. There are other ways to determine the validity of results but one could argue that there is an element of art to good clustering.
Types of Clustering
Approaches to clustering include:
Partitioning algorithms try to split the n data objects into k partitions or clusters mostly using distance measures. An example is k means clustering where a value for k is specified by the analyst and the algorithm then selects k objects in the data set as initial cluster centers. Each remaining object is then assigned to its cluster centre. The algorithm then calculates new cluster centres by taking the mean of the existing clusters. Objects are then checked to see if they should be reassigned to different clusters and this process iterates until convergence is achieved.
Hierarchical clustering also works by partitioning data but does not require to specify the number of partitions in advance. It may be agglomerative, in which case each object is initially considered a cluster and clusters are merged until there is just one large cluster, or divisive in which all objects are initially in one cluster and then get further subdivided. There are different methods of agglomerating clusters which may include complete linkage, single linkage and average linkage among others.
Density Based Methods
Unlike partitioning and hierarchical methods which are designed to find globular clusters, density based methods can find clusters of other types. They work by forming clusters in dense regions of the data space which are separated from one another by sparser regions.
Grid based methods assign objects to cells that form a grid structure and then compute the density of each cell. Cells with low density are eliminated and clusters are formed from the remaining adjacent cells. One of the advantages of these types of algorithms is their fast processing time.
There are many other approaches to clustering which can be useful for specialised mining tasks, including model based methods, fuzzy clustering and correlation based approaches. Some of these algorithms can for example assign objects to clusters probabilistically rather than assigning them to single clusters. Clustering is a dynamic field of research and researchers continue to refine existing algorithms and develop new ones.
Clustering involves grouping similar objects together so an important consideration in clustering is how to measure object similarity. More details on proximity measures for data mining are available in this previous article. There is a wide choice of distance measures to choose from when performing clustering and the choice can greatly impact the validity of the results. How this is so can perhaps best be illustrated by an example from practice.
Below is a graph of the results of clustering a data set consisting of outputs and outcomes of further education and training courses using the Euclidean distance metric. Euclidean distance is the shortest straight line distance between two points and is often used to assess similarity of objects measured on a continuous numeric scale.
This graph shows the first two principal components of the data set plotted against the results of a Partitioning around Medoids clustering algorithm with a k value of 2. Partitioning around Medoids is essentially a more robust form of the k means clustering algorithm. The resulting clusters in this case don’t look particularly well defined or compact and a silhouette plot of the analysis confirms this.
The silhouette plot above shows how similar objects are to other objects in their own cluster versus how similar they are to objects in other clusters. The objects in cluster 1 actually look well clustered here, however objects in cluster 2 much less so. In fact many of the objects in cluster 2 have a negative silhouette coefficient indicating they are in the wrong cluster. The average silhouette width is only 0.25 which is much too low and the result of objects in cluster 2 dragging down the average width.
Another way of assessing the validity of a clustering analysis is to run a classification algorithm on the data using cluster membership as the class label. Running a decision tree using descriptive features of courses to predict cluster membership indicates that this clustering solution is far from optimal achieving an accuracy of not much above chance (around 57%).
However one important characteristic of this data set that needs to be taken into account when determining how to analyse it, is its sparsity i.e. the number of zeros in the data. Although the data was measured on a continuous numerical scale the number of zeros in this data set is very high so a similarity measure other than Euclidean is indicated. Running the same algorithm on the cosine distance matrix of the data set produces a much better clustering solution.
Results of the silhouette plot of this analysis are much more encouraging. The silhouette width is higher and nearly all objects have positive silhouette coefficient values.
Running a decision tree using these clusters as class labels results in much better accuracy (over 70%). This shows that given just the descriptive attributes of a course e.g. things like course duration, certification etc, and even before a course starts, we can predict with over 70% accuracy into which cluster its outputs and outcomes will go. Accuracy could likely be improved further by better feature engineering.
The only difference between these two analyses is that one used Euclidean distance to measure object similarity while the other used Cosine distance. Similarity measurement is just one factor to be taken into account in clustering, there are others but space precludes going into them all here. A second part of this article will demonstrate an application of clustering in r.