Proximity Measures for Data Mining

Data mining is the process of finding interesting patterns in large quantities of data. This process of knowledge discovery involves various steps, the most obvious being the application of algorithms to the data set to discover patterns, as in, for example, clustering. The goal of clustering is to find natural groups in the data so that objects in a group are similar to other objects in the same group and dissimilar to objects in other groups. When implementing clustering (and also some classification algorithms such as k-nearest neighbours), it is important to be able to quantify the proximity of objects to one another. Proximity may be measured in terms of how alike objects are (object similarity) or how unlike they are (object dissimilarity). The term distance measure is often used instead of dissimilarity measure.

Similarity measures will usually take a value between 0 and 1 with values closer to 1 signifying greater similarity. Distance measures can take any value between 0 and \infty with higher values signifying greater dissimilarity or distance.

One thing that’s worth bearing in mind is that not all dissimilarity measures are distance metrics, even though the terms are sometimes used erroneously as if they were interchangeable. A distance metric is a function d(x,y) that, applied to two objects x and y, produces a real number and satisfies the following conditions …
Non-Negativity – d(x,y)\ge 0 – The result of the distance function must be greater than or equal to 0.
Identity of Indiscernibles – d(x,y) = 0 if and only if x=y – The distance is zero precisely when the two objects are the same.
Distance Symmetry – d(x,y) = d(y,x) – The result of the distance function is the same regardless of the order of its arguments.
Triangle Inequality – d(x,y) \le d(x,z) + d(z,y) – Introducing a third object or point z cannot produce a smaller result than applying the function to the two points directly.
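
As a quick illustration, these conditions can be checked numerically. Below is a minimal R sketch (the vectors x, y and z are made-up examples) that tests all four conditions for the Euclidean distance; each comparison should evaluate to TRUE.

    # Made-up example points for illustration
    x <- c(1, 2); y <- c(4, 6); z <- c(0, 5)
    d <- function(a, b) sqrt(sum((a - b)^2))  # Euclidean distance
    d(x, y) >= 0                              # non-negativity
    d(x, x) == 0                              # identity of indiscernibles
    d(x, y) == d(y, x)                        # symmetry
    d(x, y) <= d(x, z) + d(z, y)              # triangle inequality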

Common Proximity Measures

Euclidean Distance embodies the popular notion of distance as the length of the straight line joining two points. Named after the Greek mathematician and father of geometry Euclid, it is defined as d(x,y) = \sqrt{\sum_{i=1}^n (X_i - Y_i)^2}. Euclidean distance is used with numeric data and is the default distance measure option in most statistical software packages.
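
For example, in R (using two made-up vectors) Euclidean distance can be computed directly from the definition or via the built-in dist function from the stats package:

    x <- c(0, 3, 4, 5)
    y <- c(7, 6, 3, -1)
    sqrt(sum((x - y)^2))   # from the definition: 9.746794
    dist(rbind(x, y))      # dist defaults to method = "euclidean"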

Manhattan Distance is the shortest distance between two points when travelling across a grid i.e. horizontally and vertically only, so named because it resembles the distance travelled between city blocks in a grid-based city such as Manhattan. It is defined as d(x,y) = \sum_{i=1}^n {\vert}X_i - Y_i{\vert}.

Manhattan and Euclidean are specific examples of the Minkowski Distance d(x,y) = \sqrt[p]{\sum_{i=1}^n {\vert}X_i - Y_i{\vert}^p}. It is apparent that when p=1 the Minkowski and Manhattan distances are the same, and when p=2 the Minkowski and Euclidean distances are equal. The Chebyshev Distance, \max_{i=1}^n {\vert}X_i - Y_i{\vert}, corresponds to the Minkowski distance as p approaches \infty.
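
The whole Minkowski family is available through the method argument of R's dist function; a short sketch using made-up vectors:

    m <- rbind(c(0, 3, 4, 5), c(7, 6, 3, -1))
    dist(m, method = "manhattan")          # p = 1
    dist(m, method = "euclidean")          # p = 2
    dist(m, method = "minkowski", p = 3)   # general p
    dist(m, method = "maximum")            # Chebyshev, p -> infinity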

What if the data is not numeric? One way of calculating the distance between objects described by nominal attributes is to look at the ratio of mismatches, defined by d(x,y) = {p-m \over p} where m is the number of attributes on which objects x and y match and p is the total number of attributes describing the objects. To convert this dissimilarity measure to a similarity measure, simply subtract it from 1.
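
A minimal R sketch of this mismatch ratio, using two made-up records with nominal attributes:

    x <- c(colour = "red", size = "large", shape = "round")
    y <- c(colour = "red", size = "small", shape = "round")
    m <- sum(x == y)   # number of matching attributes: 2
    p <- length(x)     # total number of attributes: 3
    (p - m) / p        # dissimilarity: 0.333
    1 - (p - m) / p    # corresponding similarity: 0.667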

If the data is binary and either outcome is equally important (i.e. symmetric binary) then the Simple Matching Coefficient (SMC) can be used to calculate similarity between objects. SMC is the ratio of matching attributes to total attributes, i.e. {M_{00} + M_{11} \over M_{00} + M_{01} + M_{10} + M_{11}}, where M_{11} is the number of attributes on which both objects take the value 1, M_{00} the number on which both take the value 0, and M_{01} and M_{10} the numbers on which they differ. Simple matching distance (SMD) is calculated as 1-SMC.
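
In R, SMC reduces to the proportion of positions on which two binary vectors agree; a minimal sketch with made-up data:

    x <- c(1, 0, 0, 1, 1, 0)
    y <- c(1, 1, 0, 0, 1, 0)
    smc <- mean(x == y)   # (M00 + M11) / total attributes: 0.667
    1 - smc               # simple matching distance: 0.333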

If dealing with asymmetric binary data, where matching 0s carry no information, then Jaccard similarity can be used. Treating the objects as sets of the attributes that are present, Jaccard similarity is defined as {{\vert}X \cap Y{\vert} \over {\vert}X \cup Y{\vert}}; in terms of binary attributes this is {M_{11} \over M_{01} + M_{10} + M_{11}}. To calculate the Jaccard distance, subtract the Jaccard similarity from 1. Alternatively, the Jaccard distance can be written as {{\vert}X \cup Y{\vert} - {\vert}X \cap Y{\vert} \over {\vert}X \cup Y{\vert}}.
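
In R the counts can be computed directly, and the 'binary' method of dist returns the Jaccard distance; a sketch with made-up binary vectors:

    x <- c(1, 0, 0, 1, 1, 0)
    y <- c(1, 1, 0, 0, 1, 0)
    m11 <- sum(x == 1 & y == 1)            # both present: 2
    m10 <- sum(x == 1 & y == 0)            # mismatches: 1
    m01 <- sum(x == 0 & y == 1)            # mismatches: 1
    m11 / (m11 + m01 + m10)                # Jaccard similarity: 0.5
    dist(rbind(x, y), method = "binary")   # Jaccard distance: 0.5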

Cosine Similarity is a measure that calculates the cosine of the angle between two vectors. This measure is typically used on sparse, high-dimensional data sets such as those encountered in text mining. Cosine similarity is defined as {\sum_{i=1}^n X_iY_i \over \sqrt{\sum_{i=1}^n X_i^2}\sqrt{\sum_{i=1}^n Y_i^2}} and, if carried out on mean-centred data, is equivalent to calculating the Pearson Correlation Coefficient on the data. Cosine similarity is used with numerical or binary data.
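
Base R has no dedicated cosine similarity function, but it is a one-liner from the definition; the sketch below (with made-up vectors) also illustrates the link to Pearson correlation by mean-centring first:

    x <- c(2, 0, 1, 3)
    y <- c(1, 1, 0, 2)
    sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))       # cosine similarity: 0.873
    xc <- x - mean(x); yc <- y - mean(y)                 # mean-centre the data
    sum(xc * yc) / (sqrt(sum(xc^2)) * sqrt(sum(yc^2)))   # 0.632, equal to ...
    cor(x, y)                                            # ... Pearson correlation: 0.632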

The above is a list of common proximity measures used in data mining. There are many others. The choice of measure will be determined by the type and scale of the data in the data set. If looking to calculate proximity measures in R then for the more common ones you can use the ‘dist’ function in the stats package. Alternatively, for a more comprehensive list of measures take a look at the proxy package.
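
For instance (assuming the proxy package has been installed from CRAN), its simil and dist functions expose a large registry of measures, including several of those above:

    library(proxy)                           # note: proxy masks stats::dist
    summary(pr_DB)                           # list the registered measures
    x <- c(1, 0, 0, 1, 1, 0)
    y <- c(1, 1, 0, 0, 1, 0)
    simil(rbind(x, y), method = "Jaccard")   # Jaccard similarity: 0.5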
