Clustering Analysis in R – part 2

This is Part 2 of a clustering demo in R. Part 1, which covers assessing the clustering tendency of the data and deciding on the number of clusters, can be read here. This part looks at performing clustering using the Partitioning Around Medoids (PAM) algorithm and validating the results.

Perform Clustering
For this example I am using the Partitioning Around Medoids (PAM) algorithm, a more robust variation of k-means. I set the number of clusters to 3 based on the results of the previous step in the analysis. Euclidean distance is used as the distance metric, which is appropriate here as the attributes are all numeric and the clustering should reflect differences in the measurement levels of the attributes.

library(cluster)

# Run PAM with k = 3 on the raw (unstandardised) data
pam.res3 <- pam(data, 3, metric = "euclidean", stand = FALSE)
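The cluster sizes (and some per-cluster diagnostics) can be checked directly from the fitted object:

# Number of objects assigned to each cluster
table(pam.res3$clustering)

# Per-cluster diagnostics: size, diameter, separation, ...
pam.res3$clusinfo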

The resulting 3 clusters have sizes of 21, 6 and 13. We can now plot the results.

library(factoextra)

# Visualise the clusters against the first two principal components
fviz_cluster(pam.res3,
             palette = c("#FC4E07", "#00AFBB", "#E7B800"),
             ellipse.type = "euclid",  # draw Euclidean distance ellipses
             star.plot = TRUE,         # segments from medoids to members
             repel = TRUE,             # avoid overlapping point labels
             ggtheme = theme_minimal())

[Figure: cluster plot of the 3-cluster PAM solution]
This graph shows the 3 clusters plotted against the first two principal components of the data, together with the cluster each object belongs to. Note the slight difference in the variance explained by the first two principal components compared to the graph in the previous part of the analysis. This is because here the variables have been standardised prior to computing the PCA.
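To see where these percentages come from, the standardised PCA can be reproduced directly (fviz_cluster standardises the data before projecting by default); a minimal sketch:

# PCA on the standardised variables, matching the projection in the plot
pca <- prcomp(data, scale. = TRUE)
summary(pca)  # proportion of variance explained by each component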

Cluster Validation
The last part of this analysis is to validate the clusters found. Some validation measures were already computed when running NbClust. The silhouette coefficient measures how close an object is to the other objects in its own cluster compared with those in the neighbouring cluster; values close to 1 indicate the object is well clustered.
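The silhouette plot below can be reproduced with factoextra's fviz_silhouette; the palette and theme here are my own choices rather than anything prescribed:

# Silhouette plot for the 3-cluster solution
fviz_silhouette(pam.res3, palette = "jco", ggtheme = theme_classic())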

[Figure: silhouette plot for the 3-cluster solution]
The 3-cluster solution results in an average silhouette width of 0.46; Cluster 1 has the highest average width (0.51). One of the objects in Cluster 2 has a negative silhouette coefficient, indicating it is likely in the wrong cluster.
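These figures can also be read straight from the PAM object; a quick check, assuming pam.res3 from above:

# Overall and per-cluster average silhouette widths
pam.res3$silinfo$avg.width
pam.res3$silinfo$clus.avg.widths

# Objects with negative silhouette widths (likely misplaced)
sil <- pam.res3$silinfo$widths
sil[sil[, "sil_width"] < 0, ]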

# Attach cluster membership to the data and compare cluster means
data_clustered3 <- cbind(data, cluster = pam.res3$clustering)
aggregate(data_clustered3[, 1:6], by = list(cluster = data_clustered3$cluster), mean)

cluster  pHCerbero  pHBlood   HCO3Cerebro  HCO3Blood  CO2Cerbero  CO2Blood
1        46.79524   37.57143  21.62381     22.390476  43.91429    33.73810
2        44.33333   55.88333  13.21667      9.483333  25.38333    20.28333
3        50.17692   41.37692  26.40000     29.353846  57.56154    48.26154

The table above compares the mean value of the objects in each cluster. Cluster 3 has the highest mean value for every measurement except blood pH, for which Cluster 2 has the highest value.

For comparison we can visualise the clustering results when k = 2. In this case we get one large cluster with 34 objects and a smaller cluster containing 6. The larger cluster has higher mean values on all measurements except blood pH.

# Re-run PAM with k = 2 and plot the silhouette widths
pam.res2 <- pam(data, 2, metric = "euclidean", stand = FALSE)
fviz_silhouette(pam.res2, palette = "jco", ggtheme = theme_classic())

[Figure: silhouette plot for the 2-cluster solution]
The 2-cluster solution has a higher average silhouette width and no objects with negative silhouette coefficients, so it may be preferable. In addition to statistical measures of cluster validation, domain knowledge is important in interpreting clustering results. For example, a physician would be guided by his or her specialist knowledge in determining how many clusters were appropriate.

Another way of validating the clustering solution would be to use cluster membership as a class label for prediction using features such as symptomology or other patient characteristics. If cluster membership can be predicted with a high degree of accuracy, then the clustering solution is likely providing a useful summary of the data.
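As a rough sketch of this idea, assuming a hypothetical data frame patient_features holding characteristics that were not used in the clustering, a simple decision tree could be fitted with rpart:

library(rpart)

# patient_features is hypothetical: symptomology or other patient
# characteristics not used to build the clusters
train_df <- data.frame(patient_features,
                       cluster = factor(pam.res3$clustering))

# Fit a classification tree predicting cluster membership
tree_fit <- rpart(cluster ~ ., data = train_df, method = "class")

# In-sample accuracy; with only 40 objects, cross-validation
# would give a more honest estimate
pred <- predict(tree_fit, train_df, type = "class")
mean(pred == train_df$cluster)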

That’s the end of this quick introduction to clustering in R. Hopefully it was useful and inspires you to try out some cluster analyses on small datasets similar to this one. It’s the best way to start learning about the algorithms and their different parameters.
