Clustering Analysis in R – Part 1

The last article provided a brief introduction to clustering. This one demonstrates how to conduct a basic clustering analysis in the statistical computing environment R (I have actually split it into 2 parts as it got rather long!). For demos like this it is easiest to use a small data set, ideally with few features relative to instances. The one used in this example is the Acidosis Patients data set available from this collection of clustering data sets. This data set has 40 instances, each corresponding to a patient and 6 features each corresponding to a measurement of blood or cerebrospinal fluid.

Acidosis is a condition that may result from metabolic or respiratory causes. Metabolic acidosis occurs when the pH of the blood falls below a certain level due to overproduction of acid or excessive loss of bicarbonate. Respiratory acidosis is caused by an excessive buildup of carbon dioxide usually from poor lung function. This data set contains 6 measurements for 40 patients. Measurements are pH of cerebrospinal fluid and blood (in nmol/L), Bicarbonate (HCO3) in cerebrospinal fluid and blood (in mmol/L) and Carbon Dioxide (CO2) pressure in cerebrospinal fluid and blood (in mmHg).

The goal of cluster analysis is to place the data into groups so that instances in one group are more similar to one another and less similar to instances in other groups. If running a cluster analysis on this data set then the algorithm tries to place the patients into groups based on similarity of their blood and cerebrospinal fluid measurements. This might be useful for example if a clinician needed to group patients based on severity of symptoms or for the purposes of treatment.

Before starting the analysis it is necessary to load the libraries and data. If you don’t already have the libraries installed then you will need to download and install them before running the code below. I’d strongly advise using  RStudio as an IDE for your R programming (you can get it here and it’s free).

library (cluster)
library (clustertend)
library (factoextra)
library (fpc)
library (clValid)
data <- read.table("acidosis.txt", header=TRUE)
head (data)

#first 6 rows of the data
  X1    X2   X3   X4   X5   X6
1 39.8 38.0 22.2 23.2 38.8 36.5
2 53.7 37.2 18.7 18.5 45.1 28.3
3 47.3 39.8 23.3 22.1 48.2 36.4
4 41.7 37.6 22.8 22.3 41.6 34.6
5 44.7 38.5 24.8 24.4 48.5 38.8
6 47.9 39.8 22.0 23.3 46.2 38.5

Above is the first 6 rows of the data set. We can see that all the variables are numeric and look to be within similar ranges (this can be confirmed by running a summary command) so it isn’t necessary to normalise the data. We will add variable names before proceeding however and take a quick look at a plot of the dataset

colnames (data) <- c("pHCerbero", "pHBlood", "HCO3Cerebro", "HCO3Blood", "CO2Cerbero", "CO2Blood")
plot (data, main="Pairwise Scatterplots of the Data")

We can see that there is quite a clear and strong linear relationship between some variables e.g. measures of CO2 in CerebroSpinal Fluid and Blood. On a larger real world dataset this would need to be looked at further as highly correlated variables can be problematic in a cluster analysis.

Much of the code for the next part of the exercise, the clustering analysis, is based on that in Kassambara’s excellent and very accessible book “A Practical Guide to Cluster Analysis in R”.  There’s lots more in the book too. You can purchase here if interested.

This quick intro to cluster analysis contains the following 4 steps:

1.Assess clustering tendency of the data
2.Estimate number of clusters
3.Run Clustering Algorithm
4.Assess Validity of Clustering Results

Clustering Tendency

Clustering algorithms are designed to find clusters in data (it’s what they do!) so you can apply a clustering algorithm to any data set. Before doing this however it’s worth assessing whether there actually are clusters in the data set. If you apply a clustering algorithm to a data set that doesn’t have natural clusters in it, the algorithm will still find clusters but the results probably won’t be very useful.

There are a few ways to do this. One is simply to plot the data and visually assess it for clusters. There are 6 features in this data-set which is too many for a standard 2-D graph so firstly the dimensionality should be reduced. If we first run principal components on the data-set, we can then plot the acidosis data against the first two principal components.

fviz_pca_ind(prcomp(data), title = "PCA - Acidosis Data", palette = "jco", 
geom = "point", ggtheme = theme_classic(), legend = "bottom")

In this case the first two principal components explain over 95% of the variance in the data-set. It looks like there are clusters in this data-set though there seems to be some outlying points also. Ignoring these it looks like there are possibly 2-3 clusters in this data set. One thing to consider here is the graph appears to indicate the presence of a non-linear relationship in the data. In a more rigorous analysis this should be investigated further since it might influence the type of clustering method chosen.

Another approach to assessing clustering tendency involves running a Hopkins test on the data. The Hopkins statistic is sometimes used as a proxy measure for clustering tendency. The Hopkins statistic measures the likelihood that the data comes from a uniform distribution.

To calculate Hopkins statistic an artificial data set is generated from a uniform random distribution with the same variance as the data set being tested. The mean distance of points in the artificial data set to their nearest neighbours in the real data set is then divided by the sum of the mean distances from points in the real and artificial data sets to nearest neighbours in the real data set. A value close to 1 (i.e. the value for the mean distance of points to neighbours in real data set is small) may indicate that the data set is highly clusterable. A value of 0.5 indicates the data set is likely to be uniformly distributed and so may not be clusterable.

hopkins <- hopkins (data, n=nrow(data)-1)
[1] 0.2054393

Thankfully R can do all the calculations for you in just one line of code! I have calculated Hopkins statistic using the Clustertend package in R.  In this case we get a value of approximately 0.21 which indicates the likely presence of clusters in the acidosis data set. If the value were close to 0.5 then we would need to consider whether the data set is actually clusterable.

Another way of assessing clustering tendency is to use Visual Assessment of Clustering Tendency (VAT). This involves computing and displaying the dissimilarity matrix of points in the data set.

fviz_dist(dist(data), show_labels = FALSE) +
  labs(title = "Acidosis data - dissimilarity matrix")

The colours in the image indicate the level of dissimilarity between observations with blue indicative of greater similarity and red indicative of greater dissimilarity. It is apparent that there are patterns in the VAT image for the acidosis data. This is clear if we compare this image to a VAT image generated from random data.

 random <- apply(data, 2, function(x){runif(length(x), min(x), (max(x)))})
 fviz_dist(dist(random), show_labels = FALSE) + labs(title = "Random Data")

The image above is produced from random data and the difference between this and that generated from the acidosis data set is readily apparent. In short it appears the acidosis data-set is clusterable. We can now move onto the second part of the analysis estimating how many clusters exist in the data-set.

Estimate number of Clusters
After determining that there are clusters in the data-set, then depending on the clustering algorithm (e.g. partitioning algorithms) we are using we may need to specify the number of clusters that the algorithm should find. There are plenty of statistical indices that can be used to estimate the number of clusters and one nice function in the NbClust package computes up to 30 different indices and presents the results in terms of a majority vote for the correct number of clusters.

 clust_indices <- NbClust(data = data, diss = NULL, distance = "euclidean", = 2, = 10, method = "kmeans")

* Among all indices:                                                
* 5 proposed 2 as the best number of clusters 
* 9 proposed 3 as the best number of clusters 
* 5 proposed 4 as the best number of clusters 
* 1 proposed 5 as the best number of clusters 
* 2 proposed 8 as the best number of clusters 
* 2 proposed 10 as the best number of clusters 

                   ***** Conclusion *****                            
* According to the majority rule, the best number of clusters is  3 

In the above example the function computes indices for each cluster number from 2 to 10 using Kmeans clustering and Euclidean as the distance measure. The results show that out of the twenty-four indices calculated, nine suggested 3 as the best number of clusters. Of course this assumes that k means is the best clustering algorithm for this data-set. It is possible to specify other clustering algorithms but only one at a time. See the help documentation for more information.

                     KL      CH Hartigan     CCC   Scott      Marriot  TrCovW   TraceW
Number_clusters 10.0000  5.0000   3.0000  2.0000  4.0000 4.000000e+00       3    3.000
Value_Index     17.4594 54.6713  26.3159 15.3889 73.5798 1.470831e+15 1710455 3075.002
                Friedman  Rubin Cindex     DB Silhouette   Duda PseudoT2   Beale Ratkowsky
Number_clusters     4.00   8.00 3.0000 4.0000     2.0000 3.0000   3.0000  3.0000    3.0000
Value_Index      3694.14 -19.97 0.3089 0.6341     0.5263 1.0819  -1.8171 -0.2774    0.4688
                    Ball PtBiserial   Frey McClain   Dunn Hubert SDindex Dindex    SDbw
Number_clusters    3.000     2.0000 2.0000  2.0000 8.0000      0  4.0000      0 10.0000
Value_Index     2683.725     0.6263 1.5061  0.1848 0.3257      0  0.1852      0  0.0719

The above output shows that the indices Hartigan, TrCovW, TraceW, Cindex, Duda, PseduoT2, Beale, Ratkowsky and Ball are the ones which assessed 3 as the most suitable number of clusters. It’s beyond the scope of this article to explain how each of these are calculated but references to the relevant papers are in the help documentation linked to above. This suggests that 3 may be the most appropriate number of clusters for this data-set (at least using k-means algorithm) though 2 and 4 were also commonly suggested. That concludes the first half of the analysis. In the next part I’ll cluster the data and look at some validation statistics.


Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.