This is part 2 of a clustering demo in R. You can read Part 1 here, which deals with assessing the clustering tendency of the data and deciding on the number of clusters. This part looks at performing clustering using the Partitioning Around Medoids (PAM) algorithm and validating the results.
For this example I am using the Partitioning Around Medoids (PAM) algorithm, a more robust variant of k-means. I set the number of clusters to 3 based on the results of the previous step in the analysis. Euclidean distance is used as the distance metric, which is appropriate here as the attributes are all numeric and the clustering should reflect differences in the measured levels of the attributes.
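What makes PAM more robust than k-means is that its cluster centres (medoids) must be actual observations rather than averages, so a single extreme value cannot drag a centre away from the data. A minimal sketch on synthetic 2-D data (not the patient dataset used in this analysis):

```r
library(cluster)  # provides pam(); a recommended package shipped with R

# Two well-separated groups of synthetic points
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

fit <- pam(x, 2, metric = "euclidean")
fit$medoids                          # two rows taken directly from x
fit$id.med                           # their row indices in x
all(fit$medoids == x[fit$id.med, ])  # TRUE: medoids are real observations
```

Because the medoid is the observation minimising total dissimilarity within its cluster, outliers influence it far less than they influence a k-means centroid.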
library(cluster)
pam.res3 <- pam(data, 3, metric = "euclidean", stand = FALSE)
The resulting 3 clusters have sizes of 21, 6 and 13. After running the algorithm we can plot the results.
library(factoextra)
fviz_cluster(pam.res3, palette = c("#FC4E07", "#00AFBB", "#E7B800"),
             ellipse.type = "euclid", star.plot = TRUE, repel = TRUE,
             ggtheme = theme_minimal())
This graph shows the 3 clusters plotted against the first two principal components of the data, with each object coloured by cluster membership. Note the slight difference in the variance explained by the first two principal components compared with the graph in the previous part of the analysis; this is because here the variables were standardised before computing the PCA.
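The effect of standardisation on the variance explained can be checked directly with `prcomp`. A sketch using the built-in iris measurements as a stand-in for the patient data:

```r
# Stand-in numeric dataset; the patient data from this analysis
# would be used in the same way.
X <- iris[, 1:4]

pca_raw    <- prcomp(X, scale. = FALSE)  # PCA on raw covariances
pca_scaled <- prcomp(X, scale. = TRUE)   # PCA on correlations (standardised)

# Proportion of total variance captured by the first two components
prop_raw    <- sum(pca_raw$sdev[1:2]^2)    / sum(pca_raw$sdev^2)
prop_scaled <- sum(pca_scaled$sdev[1:2]^2) / sum(pca_scaled$sdev^2)
prop_raw
prop_scaled
```

The two proportions differ because, without scaling, variables measured on larger scales dominate the leading components.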
The last part of this analysis is to validate the clusters found. Some validation measures have already been computed when running NbClust. The silhouette coefficient measures how close an object is to the other objects in its own cluster versus those in the neighbouring cluster. Values close to 1 indicate the object is well clustered.
The 3-cluster solution results in an average silhouette width of 0.46, with Cluster 1 having the highest average width (0.51). One of the objects in Cluster 2 has a negative silhouette coefficient, indicating it is likely in the wrong cluster.
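These silhouette figures can be read straight off the `pam` object, which stores them in its `silinfo` component. A sketch, again using iris as a stand-in for the patient data:

```r
library(cluster)

# Fit PAM with k = 3 on the stand-in data
fit <- pam(iris[, 1:4], 3, metric = "euclidean")

fit$silinfo$avg.width        # overall average silhouette width
fit$silinfo$clus.avg.widths  # average width per cluster

# Objects with negative silhouette widths are candidates for
# reassignment to their neighbouring cluster
w <- fit$silinfo$widths
w[w[, "sil_width"] < 0, ]
```

The `widths` matrix also records each object's neighbouring cluster, which shows where a poorly placed object would move.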
data_clustered3 <- cbind(data, pam.res3$clustering)
aggregate(data_clustered3[,1:6], list(data_clustered3$`pam.res3$clustering`), mean)

  pHCerbero  pHBlood HCO3Cerebro HCO3Blood CO2Cerbero CO2Blood
1  46.79524 37.57143    21.62381 22.390476   43.91429 33.73810
2  44.33333 55.88333    13.21667  9.483333   25.38333 20.28333
3  50.17692 41.37692    26.40000 29.353846   57.56154 48.26154
The table above compares the mean values of the objects in each cluster. Cluster 3 has the highest mean for every measurement except blood pH, for which Cluster 2 has the highest value.
For comparison we can visualise the results of clustering with k = 2. In this case we get one large cluster of 34 objects and a smaller cluster of 6. The larger cluster has higher mean values on all measurements except blood pH.
pam.res2 <- pam(data, 2, metric = "euclidean", stand = FALSE)
fviz_silhouette(pam.res2, palette = "jco", ggtheme = theme_classic())
The 2-cluster solution has a higher average silhouette width and no objects with negative silhouette coefficients, so it may be preferable. In addition to statistical measures of cluster validation, domain knowledge is important in interpreting clustering results. For example, a physician would be guided by his or her specialist knowledge in determining how many clusters were appropriate.
Another way of validating the clustering solution would be to use cluster membership as a class label for prediction, using features such as symptomology or other patient characteristics. If we can predict cluster membership with a high degree of accuracy, it is likely the clustering solution provides a useful summary of the data.
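A hypothetical sketch of this idea, using iris as a stand-in for the patient data and a decision tree as the classifier. Note that here the predictors are the same variables used for clustering, which is circular; in the real analysis the predictors would be features not used for clustering, such as symptomology:

```r
library(cluster)
library(rpart)  # decision trees; a recommended package shipped with R

# Cluster the stand-in data and attach the labels as a class variable
fit <- pam(iris[, 1:4], 3, metric = "euclidean")
labelled <- data.frame(iris[, 1:4], cluster = factor(fit$clustering))

# Predict cluster membership from the features
tree <- rpart(cluster ~ ., data = labelled, method = "class")
acc <- mean(predict(tree, labelled, type = "class") == labelled$cluster)
acc  # in-sample accuracy; high values suggest the clusters are learnable
```

In practice one would use held-out data or cross-validation rather than in-sample accuracy, but the principle is the same: clusters that can be recovered from independent features are more likely to reflect real structure.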
That’s the end of this quick introduction to clustering in R. Hopefully it was useful and inspires you to try out some cluster analyses on small datasets similar to this one. It’s the best way to start learning about the algorithms and their different parameters.