One of the oldest and most popular classification algorithms is nearest neighbors algorithm. It’s also one of the easiest algorithms to understand so is a good place to start when learning about data mining algorithms. Part one of this article provides a brief introduction to, and overview of, k-NN. Part two will demonstrate an implementation of it in r.
Essentially the nearest neighbors algorithm is based on the premise that the more features objects have in common the more likely they are to belong to the same class. Nearest neighbors is a non-parametric method so it is not reliant on assumptions about the underlying distribution of the data set. It is called a lazy learning method because unlike most classification algorithms it does not attempt to model the data set. Instead test cases are compared to other cases in the data set to determine their class.
Given a test instance (x) its class is calculated as follows …
1. Compute the distance of each instance in the training set from x.
2. Rank the instances in order of increasing distance from x.
3. Select the first k instances from the ranked set.
4. Assign the majority class label in these instances to x. If there is a tie, then assign the class label at random.
The diagram above shows a simplified example with the different coloured shapes representing different class labels and the question mark denoting the test instance. In this example k is 3 and the k-NN are circled on the graph. The test instance in this example would get the class label designated by the blue triangle.
In comparison with other algorithms there are few parameters that can be tweaked when deploying this algorithm, just the value for k and the distance metric. The optimal value for k may be chosen by using cross validation to estimate the misclassification rate for different values of k and choosing the k value that minimises this rate. With regard to distance metrics I will probably do a whole other post on that but suffice it to say that Euclidean distance is normally used unless there is a compelling reason not to. One other thing to bear in mind is that before using a metric like Euclidean distance attributes may need to be scaled. This is particularly true if the data set contains attributes with very different ranges.
Points to Consider
Computational Complexity – k-NN in its basic form defers all computation until presented with a test instance. It then runs through the whole data set to rank the distances between labelled instances and the test. With large data sets this can become computationally expensive so variants of k-NN have been proposed to tackle this. They usually rely on some pre-processing of the data-set to build data structures that aid subsequent processing and reduce the computational burden. Examples of this approach are k-d trees and ball trees.
The Curse of Dimensionality – This is also something I will probably return to in more detail in another post. It is a problem relevant to other classification algorithms also and not just k-NN. It refers to the tendency for classifier performance to degrade severely when the number of dimensions in the data set is large. I won’t delve into the mathematics here but basically using too many features to train a classifier results in overfitting. In the case of k-NN, having many dimensions means that the distance between data points converge. k-NN is based on the idea that objects sharing a class label will be located close together in feature space but nearest neighbors can start to lose its meaning in high dimensional space when all points are located far away from each other. Using a statistical technique like Principal Components Analysis can reduce the number of features that need to be considered in the analysis and may ameliorate the effects of the The Curse of Dimensionality. There are other options too but I might return to those again at another time.
Susceptibility to Noise – With k-NN the class label is determined only by the classes of the nearest neighbours which means this algorithm is quite susceptible to noise in the data set. There are pre processing techniques that can be used to clean up the data set and in fact this is where much of the time is spent on data science projects. Alternatively, or perhaps additionally, there are new hybrid variants of k-NN which can handle noise better than the basic algorithm. See for example this recent paper by Yu et al.
Thus completes my quick overview of kNN. As always comments and discussion are welcome. I will do a Part 2 to this article showing how to implement kNN in r.