Data mining is the process of finding interesting patterns in large quantities of data. This process of knowledge discovery involves various steps, the most obvious of these being the application of algorithms to the data set to discover patterns as in, for example, clustering. The goal of clustering is to find natural groups in the data so that objects in a group are similar to other objects in the same group and dissimilar to objects in other groups. When implementing clustering (and also some classification algorithms such as k nearest neighbours), it is important to be able to quantify the proximity of objects to one another. Proximity may be measured in terms of how alike objects are to one another (object similarity) or how unlike they are (object dissimilarity). The term distance measure is often used instead of dissimilarity measure.
Similarity measures will usually take a value between 0 and 1 with values closer to 1 signifying greater similarity. Distance measures can take any value between 0 and \infty with higher values signifying greater dissimilarity or distance.
One thing that’s worth bearing in mind is that not all dissimilarity measures are distance metrics even though the terms are sometimes used erroneously as if they were interchangeable. A distance metric is a function d(x,y) that applied to two objects, x and y, produces a real number and satisfies the following conditions …
Non Negativity – d(x,y)\ge 0 The result of distance function must be equal to or greater than 0.
Identity of Indiscernibles – The result of the distance function, d(x,y) is zero if and only if x=y
Distance Symmetry– d(x,y) = d(y,x) The result of the distance function is symmetric
Triangle Inequality– Introducing a third object or point z cannot produce a smaller result than applying the function to two points i.e. d(x,y)\le d(x,z) + d(z,y)

Common Proximity Measures
Euclidean Distance embodies the popular notion of distance as the shortest straight line between two points. Named after the Greek mathematician and father of geometry Euclid, it is defined as d(x,y) = \sqrt{\sum_{i=1}^n (X_i  Y_i)^2}. Euclidean distance is used with numeric data and is the default distance measure option in most statistical software packages.
Manhattan Distance is the shortest distance between two objects travelling across a grid i.e. horizontally and vertically only, so named because it is the shortest distance between city blocks. It is defined as \sum_{i=1}^n {\vert}X_i  Y_i{\vert}.
Manhattan and Euclidean are specific examples of the Minkowski Distance d(x,y) = {\sqrt[p]{\sum_{i=1}^n{\vert}X_i  Y_i{\vert}}^p}. It is apparent that when p=1 then the Minkowski distance and Manhattan distance are the same and when p=2 the Minkowski and Euclidean distances are equal. The Chebyshev Distance, \max_{i=1}^n{\vert}x_iy_i{\vert}, corresponds to the Minkowski distance as p approaches \infty
What if the data is not numeric? One way of calculating the distance between objects described by nominal attributes is to look at the ratio of mismatches defined by d(x,y) = {pm \over p} where m is the number of matches of attributes between objects x and y and p is the total number of attributes describing the objects. To convert this dissimilarity measure to a similarity measure then simply subtract it from 1.
If the data is binary and either outcome is equally important (i.e. symmetric binary) then Simple Matching Coefficient (SMC) can be used to calculate similarity between objects. SMC is given by calculating the ratio of matching attributes to total attributes i.e. {M_{00} + M_{11}}\over{M_{00} + M_{01} + M_{10} + M_{11}}. Simple matching distance (SMD) is calculated as 1SMC.
If dealing with asymmetric binary data then Jaccard similarity can be used. Jaccard similarity is defined as {\vert}X\cap Y{\vert}\over{\vert}X\cup Y {\vert}. To calculate the Jaccard distance then subtract Jaccard similarity from 1. Alternatively Jaccard distance can be written as {\vert}X\cup Y{\vert}  {\vert}X\cap Y{\vert}\over{\vert}X\cup Y {\vert}
Cosine Similarity is a measure that calculates the cosine of the angle between two vectors. This measure is typically used in sparse high dimensional data sets such as those encountered in text mining. Cosine similarity is defined as \sum_{i=1}^nX_iY_i\over{\sqrt{\sum_{i=1}^nX_i^2}\sqrt{\sum_{i=1}^nY_i^2}} and if carried out on normalised data is equivalent to calculating Pearson Correlation Coefficient on the data. Cosine similarity is used with numerical or binary data.
The above is a list of common proximity measures used in data mining. There are many others. The choice of measure used will be determined by the data set. If looking to calculate proximity measures in R then for the more common ones you can use the ‘dist’ function in the stats package. Alternatively for a more comprehensive list of of measures take a look at the proxy package.
2 comments