Part 2 of this intro to k-NN demonstrates an implementation of the algorithm in R. Part 1 discussed the algorithm itself. I have chosen the Banknote Authentication data set from the UCI Machine Learning Repository to work with. This data set consists of measurements taken from 400 x 400 pixel pictures of forged and genuine bank notes. The pictures were grey scale with a resolution of about 660 dpi. A wavelet transform tool was used to extract features from the pictures. The features used are the variance, skewness and kurtosis of the wavelet-transformed image and the entropy of the image. The class label is whether the bank note is genuine or not (0=no, 1=yes). k-NN should be a reasonable choice of algorithm for this data set, as the features are numerical and there are not too many of them in relation to the number of instances, though other factors (e.g. the amount of noise in the data set) are also important.
The first step is to load the packages necessary for the analysis and to import the data set. Note that if you do not have these packages installed you will first need to install them from CRAN. I have used the caret package for the analysis. This is an excellent package for machine learning in R. It also has some nice preprocessing functions. Once the data is imported, column names need to be added.
library(readr)
library(caret)
library(e1071)

# Import the data (the file has no header row) and add column names
banknote_data <- read_csv("~/banknote_data.txt", col_names = FALSE)
names(banknote_data) <- c("varianceWTI", "skewnessWTI", "kurtosisWTI", "entropy", "genuine")

# Inspect the structure, check for missing values and summarise the data
str(banknote_data)
anyNA(banknote_data)
summary(banknote_data)
sum(banknote_data$genuine)
The output from the last four commands in the code above indicates that there are 1372 instances and 5 attributes, that there are no NA values in the data set, and that the ranges of the attributes are similar. There are 610 examples of genuine bank notes in the data set and therefore 762 forged examples.
set.seed(300)
for_Train <- createDataPartition(y = banknote_data$genuine, p = 0.7, list = FALSE)
training <- banknote_data[for_Train, ]
testing <- banknote_data[-for_Train, ]
The code above creates a training set and a test set. 70% of the data goes into the training set and the remainder into the test set. The call to set.seed is included for the purposes of reproducibility. Results may differ slightly if a different seed value is used.
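The idea behind a balanced split can be sketched in base R. According to its documentation, createDataPartition samples within groups of the outcome so that the training and test sets have similar outcome distributions. The sketch below is not caret's implementation; it simply samples 70% of the rows within each class of a made-up label vector (the names labels, idx_by_class and train_idx are illustrative only).

```r
# Hypothetical toy labels: 60 forged (0) and 40 genuine (1); not the real data.
set.seed(300)
labels <- c(rep(0, 60), rep(1, 40))

# Sample 70% of the row indices within each class, then combine them.
idx_by_class <- split(seq_along(labels), labels)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

train_labels <- labels[train_idx]
test_labels  <- labels[-train_idx]

# The class proportions are the same in both subsets.
prop.table(table(train_labels))
prop.table(table(test_labels))
```

Because the sampling is done separately for each class, both subsets keep the 60/40 class mix of the toy labels whichever seed is used.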
training$genuine <- as.factor(training$genuine)
t_control <- trainControl(method = "cv", number = 10)
knn_fit <- train(genuine ~ ., data = training, method = "knn",
                 trControl = t_control, preProcess = c("range"),
                 tuneLength = 10)
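In the t_control object above I have specified that 10-fold cross-validation is to be used in the training phase. This is then passed as an argument to the train function. I have also specified that the data should be normalised using the preProcess argument; the "range" method rescales each predictor to the interval [0, 1]. If we examine this data set it looks like normalising the data won’t make much difference in this case as the variables have similar ranges. However, it could be important if the variables had very different measurement scales, because k-NN relies on distances between instances. If predictive accuracy is the goal then, ideally, the statistics used for normalisation (the minimum and maximum for range scaling, or the mean and standard deviation for standardisation) should be computed on the training set and cached so that the same values can be used to normalise the test set. When preProcess is passed to train, caret estimates these values from the training data and reuses them when predict is called on new data. Again, though, how much this matters depends on the data you are working with.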
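The point about reusing training-set statistics on the test set can be illustrated with a small base R sketch. It uses the minimum and maximum, since those are the statistics that range scaling depends on. The values and the helper name rescale are made up for illustration and are not part of caret.

```r
# Hypothetical toy feature values; not taken from the banknote data.
train_x <- c(2, 4, 6, 10)
test_x  <- c(3, 12)

# Learn the scaling parameters from the training set only.
train_min <- min(train_x)
train_max <- max(train_x)

# Min-max rescaling with externally supplied bounds.
rescale <- function(x, lo, hi) (x - lo) / (hi - lo)

train_scaled <- rescale(train_x, train_min, train_max)  # lies in [0, 1]
test_scaled  <- rescale(test_x,  train_min, train_max)  # may fall outside [0, 1]

train_scaled
test_scaled
```

A test value larger than the training maximum maps to a number above 1. That is expected: the test set is scaled with the training bounds rather than its own.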
k-Nearest Neighbors

961 samples
  4 predictor
  2 classes: '0', '1'

Pre-processing: re-scaling to [0, 1] (4)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 865, 865, 865, 865, 865, 865, ...
Resampling results across tuning parameters:

  k   Accuracy   Kappa
   5  0.9989583  0.9979058
   7  0.9989583  0.9979058
   9  0.9937607  0.9874384
  11  0.9927191  0.9853441
  13  0.9927191  0.9853441
  15  0.9927191  0.9853441
  17  0.9916881  0.9832764
  19  0.9916881  0.9832764
  21  0.9916881  0.9832764
  23  0.9906465  0.9811748

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 7.
As can be seen from the k-NN fit above, cross-validated accuracy is identical when k is equal to 5 or 7. Accuracy is extremely high (almost 100%), which gives some concern that overfitting may be occurring. To test this we run k-NN with a k value of 7 against the test set to determine classification accuracy. I output the confusion matrix, which is basically a table showing a classifier's performance on test data for which the class label is known.
test_result <- predict(knn_fit, newdata = testing)
# Convert the reference labels to a factor so that they match the predicted factor
confusionMatrix(test_result, as.factor(testing$genuine))

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 236   0
         1   1 174

               Accuracy : 0.9976
                 95% CI : (0.9865, 0.9999)
    No Information Rate : 0.5766
    P-Value [Acc > NIR] : <2e-16

                  Kappa : 0.995
 Mcnemar's Test P-Value : 1

            Sensitivity : 0.9958
            Specificity : 1.0000
         Pos Pred Value : 1.0000
         Neg Pred Value : 0.9943
             Prevalence : 0.5766
         Detection Rate : 0.5742
   Detection Prevalence : 0.5742
      Balanced Accuracy : 0.9979
The confusion matrix above indicates that classification accuracy on the test instances is also very high (99.76%). Only one of the 411 test instances was misclassified, so in this case k-NN appears to be doing a good job of discriminating between fake and genuine bank notes. The one error, though, was the misclassification of a fake note as genuine, which is probably something we want to avoid!
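As a check on how the headline statistics follow from the four cell counts, the arithmetic can be reproduced in base R. The counts are copied from the confusion matrix above. The sketch assumes that class 0 is treated as the positive class, which is consistent with the reported sensitivity of 236/237; the variable names are my own.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference).
cm <- matrix(c(236, 1, 0, 174), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

# Proportion of all test instances on the diagonal: 410 / 411.
accuracy <- sum(diag(cm)) / sum(cm)

# Assumption: class "0" is the positive class.
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # 236 / 237
specificity <- cm["1", "1"] / sum(cm[, "1"])  # 174 / 174

# No information rate: the share of the largest reference class, 237 / 411.
no_info_rate <- max(colSums(cm)) / sum(cm)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, no_info_rate = no_info_rate), 4)
```

The rounded values (0.9976, 0.9958, 1 and 0.5766) match the corresponding lines of the output above.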
This ends my brief introduction to k-NN. The code above can be modified to run the algorithm with other data sets and this is a good way of learning about the algorithm and how it is implemented in R.