Classification is a common machine learning task: we have a data set of labelled examples with which we build a model that can then be used to (hopefully accurately!) assign a class to new unlabelled examples. There are various points at which we might want to test the performance of the model. Initially we might tune parameters or hyperparameters using cross-validation, then check the best performing models on the test set. If putting the model into production we may also want to test it on live data, and we might even use different evaluation measures at different stages of this process. This article discusses some frequently used measures for evaluating the performance of classification models.

There are various evaluation measures that can be used to assess classifier performance. Quite commonly used are accuracy and measures related to the confusion matrix. The best way to understand these is with a practical example. To that end I am using the credit card dataset from the UCI Machine Learning Repository. This dataset contains records from 30,000 customers in Taiwan measured on some demographic variables and variables relating to previous credit card payment history. The response variable is whether the person made their next credit card payment or not. Of the 30,000, 6,636 did not pay (i.e. defaulted) and 23,364 made their payment. We want to build a classifier using this dataset to predict whether a person will default on the next payment or not and then check its accuracy.

First the data was split into training and testing sets in the ratio 70:30, and then I ran a decision tree algorithm (rpart via caret) on the training set. For this example, I just ran the algorithm without any preprocessing or varying the hyperparameters, since the results are only being used to illustrate some evaluation measures.

    library(readr)
    library(caret)
    library(rpart)

    # Read the data, keeping the response as character for now
    credcard <- read_csv("~/credit_card_data.csv",
                         col_types = cols(`default payment next month` = col_character()))

    # Drop the first column (an ID column in the UCI file)
    credcard <- credcard[-1]

    # Convert the response to a factor and give its levels readable names
    credcard$`default payment next month` <- as.factor(credcard$`default payment next month`)
    levels(credcard$`default payment next month`) <- c("Paid", "Defaulted")

    # 70:30 stratified split into training and testing sets
    inTrain <- createDataPartition(y = credcard$`default payment next month`,
                                   p = 0.70, list = FALSE)
    Training <- credcard[inTrain, ]
    Testing <- credcard[-inTrain, ]

    # Fit a decision tree with caret's default resampling and tuning grid
    set.seed(647)
    model1rpart <- train(`default payment next month` ~ ., method = "rpart", data = Training)
    model1rpart

This gives the following output:

    CART

    21001 samples
       23 predictor
        2 classes: 'Paid', 'Defaulted'

    No pre-processing
    Resampling: Bootstrapped (25 reps)
    Summary of sample sizes: 21001, 21001, 21001, 21001, 21001, 21001, ...
    Resampling results across tuning parameters:

      cp           Accuracy   Kappa
      0.002367628  0.8197736  0.3703975
      0.003443823  0.8199043  0.3658651
      0.190271201  0.7952824  0.1519863

    Accuracy was used to select the optimal model using the largest value.
    The final value used for the model was cp = 0.003443823.

Results show that the best model had a complexity parameter (cp) of 0.003 and accuracy of almost 82% with a kappa of 0.366. The next step is to check the accuracy of this model on the held-out test set and generate a confusion matrix. The performance measures for the classifier are below.

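    # Predict on the test set and compare with the true labels,
    # treating "Defaulted" as the positive class
    pred1rpart <- predict(model1rpart, Testing)
    confusionMatrix(pred1rpart, Testing$`default payment next month`, positive = "Defaulted")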

    Confusion Matrix and Statistics

               Reference
    Prediction  Paid Defaulted
      Paid      6692      1333
      Defaulted  317       657

                   Accuracy : 0.8166
                     95% CI : (0.8085, 0.8246)
        No Information Rate : 0.7789
        P-Value [Acc > NIR] : < 2.2e-16

                      Kappa : 0.3487
     Mcnemar's Test P-Value : < 2.2e-16

                Sensitivity : 0.33015
                Specificity : 0.95477
             Pos Pred Value : 0.67454
             Neg Pred Value : 0.83389
                 Prevalence : 0.22114
             Detection Rate : 0.07301
       Detection Prevalence : 0.10823
          Balanced Accuracy : 0.64246

           'Positive' Class : Defaulted

Leaving the confusion matrix aside for the moment, the first thing to note is that **Accuracy** (number of correct predictions divided by total number of predictions) is 81.66% with a **95% confidence interval** of 0.8085 to 0.8246. This range comes from a procedure that would capture the true accuracy of the model in 95% of repeated samples. This might sound like good accuracy, and it might be ok if the sample were split 50/50 between defaulters and non-defaulters. In actuality there are around 3.5 times as many non-defaulters as defaulters in the dataset. The **no-information rate** is 0.7789. This is the accuracy achievable by always predicting the majority class label. In this case, if asked to predict whether a person will default or not, by always choosing “won’t default” we can achieve nearly 78% accuracy on the test set. Our 81.66% accuracy doesn’t look so good now. Nonetheless it’s enough for the model to offer significantly better performance than the no-information rate, as indicated by the **p-value**. Statistical significance and practical significance can be two different things, however, and we might want to look at other metrics to evaluate this model.
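As a rough check, these figures can be reproduced in base R from the four counts in the confusion matrix. This is only a sketch: I am assuming that an exact binomial test (`binom.test`) is a reasonable way to get the interval and the one-sided p-value, and the variable names are my own.

```r
# Counts taken from the confusion matrix in the output above
correct <- 6692 + 657                 # correctly classified examples
total   <- 6692 + 1333 + 317 + 657    # all test set examples

accuracy <- correct / total

# No-information rate: share of the majority class among the true labels
n_paid      <- 6692 + 317
n_defaulted <- 1333 + 657
nir <- max(n_paid, n_defaulted) / total

# Assumption: exact binomial interval and one-sided test of accuracy > NIR
ci    <- binom.test(correct, total)$conf.int
p_val <- binom.test(correct, total, p = nir, alternative = "greater")$p.value

round(c(accuracy = accuracy, nir = nir, lower = ci[1], upper = ci[2]), 4)
p_val
```

The p-value is tiny because the test set is large, which is why it is worth looking beyond significance alone.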

The **Kappa** statistic shows how well our classifier’s predictions matched the actual class labels while controlling for the agreement that would be expected by chance. Kappa for this model is 0.3487, which is relatively low. According to the guidelines proposed by Landis & Koch (1977), it represents only fair agreement between our classifier and the true class labels once chance agreement is controlled for.
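To make the idea concrete, here is a small base R sketch of Cohen's kappa computed from the confusion matrix counts. Observed agreement is the accuracy, and expected agreement is what we would get by chance given the row and column totals. The variable names are mine, chosen for illustration.

```r
# Confusion matrix from the output above (rows = prediction, columns = reference)
cm <- matrix(c(6692, 317, 1333, 657), nrow = 2,
             dimnames = list(Prediction = c("Paid", "Defaulted"),
                             Reference  = c("Paid", "Defaulted")))

n <- sum(cm)

# Observed agreement: proportion of examples on the diagonal
p_observed <- sum(diag(cm)) / n

# Expected chance agreement: product of matching row and column totals
p_expected <- sum(rowSums(cm) * colSums(cm)) / n^2

kappa_hat <- (p_observed - p_expected) / (1 - p_expected)
round(kappa_hat, 4)
```

Chance agreement here is about 0.72, so the gap between that and the observed 0.82 is small relative to the room available for improvement.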

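To return to the confusion matrix (also often referred to as an error matrix), below is a modified version of the confusion matrix from the R output above, with the four cells labelled A to D. Cells A and D hold the examples which were correctly classified, while cells B and C hold those that were incorrectly classified. Our example is a binary classifier, but it is possible to construct a confusion matrix for multiclass classification problems also.

| | Reference: Paid | Reference: Defaulted |
|---|---|---|
| Prediction: Paid | Cell A: 6,692 | Cell B: 1,333 |
| Prediction: Defaulted | Cell C: 317 | Cell D: 657 |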

By convention the positive class is often in the first column of the confusion matrix, but I have manually set “Defaulted” as the positive class for this example, since for a task like this it might be more important to correctly identify defaulters. The confusion matrix above shows that we correctly classified 6,692 examples as paid and incorrectly classified 317 paid as defaulted. We correctly classified 657 defaulters and incorrectly classified 1,333 defaulters as paid.
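For anyone who wants to see where such a matrix comes from without caret, here is a minimal base R sketch. Since the original predictions are not reproduced here, I rebuild two label vectors that give the same four counts. The vectors `reference` and `predicted` are stand-ins of my own, not the actual test data.

```r
lv <- c("Paid", "Defaulted")

# Stand-in label vectors that reproduce the four counts in the matrix above
counts    <- c(6692, 317, 1333, 657)
reference <- factor(rep(c("Paid", "Paid", "Defaulted", "Defaulted"), times = counts),
                    levels = lv)
predicted <- factor(rep(c("Paid", "Defaulted", "Paid", "Defaulted"), times = counts),
                    levels = lv)

# Cross-tabulate predictions (rows) against true labels (columns)
cm <- table(Prediction = predicted, Reference = reference)
print(cm)

# Individual cells can be looked up by name
cm["Defaulted", "Defaulted"]   # correctly identified defaulters
cm["Paid", "Defaulted"]        # defaulters wrongly predicted as paid
```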

We can see how the metrics in the R output above were calculated:

**Sensitivity** – also referred to as true positive rate or recall, shows the proportion of the positive class correctly predicted. Here that is the proportion of defaulters correctly predicted. For this example we can calculate it using the formula Cell D/(Cell B + Cell D). 657/(1333 + 657) = 0.33015, so our classifier correctly predicted around a third of those who defaulted.

**Specificity** – also referred to as true negative rate, shows the proportion of the negative class correctly predicted. Here that is the proportion of those who paid that were correctly predicted. It is given by Cell A/(Cell A + Cell C). 6692/(6692 + 317) = 0.95477, so we correctly predicted more than 95% of those who made their next payment.

**Positive Predictive Value** – also referred to as precision, shows the number of the positive class correctly predicted as a proportion of the total positive class predictions made. Cell D/(Cell C + Cell D). 657/(657 + 317) = 0.67454, so about two thirds of our default predictions were correct.

**Negative Predictive Value** – shows the number of the negative class correctly predicted as a proportion of the total negative class predictions made. Cell A/(Cell A + Cell B). 6692/(6692 + 1333) = 0.83389.
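The four ratios above are easy to check by hand in base R. In this sketch the cell labels follow the formulas in the text (A = predicted paid and actually paid, B = predicted paid but defaulted, C = predicted defaulted but paid, D = predicted defaulted and actually defaulted); the vector `cell` is just my own container for them.

```r
# Cell counts, labelled as in the formulas above
cell <- c(A = 6692,   # predicted Paid,      actually Paid
          B = 1333,   # predicted Paid,      actually Defaulted
          C = 317,    # predicted Defaulted, actually Paid
          D = 657)    # predicted Defaulted, actually Defaulted

sensitivity <- cell[["D"]] / (cell[["B"]] + cell[["D"]])   # recall for defaulters
specificity <- cell[["A"]] / (cell[["A"]] + cell[["C"]])   # recall for payers
ppv         <- cell[["D"]] / (cell[["C"]] + cell[["D"]])   # precision
npv         <- cell[["A"]] / (cell[["A"]] + cell[["B"]])

round(c(sensitivity = sensitivity, specificity = specificity,
        ppv = ppv, npv = npv), 5)
```

Sensitivity and specificity divide by column totals (the true labels), while the two predictive values divide by row totals (the predictions made).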

**Prevalence** – shows how often the positive class actually occurs in our sample. In this example it is (Cell B + Cell D)/(Cell A + Cell B + Cell C + Cell D). (1333 + 657)/(6692 + 1333 + 317 + 657) = 0.22114.

**Detection Rate** – shows the number of correct positive class predictions made as a proportion of all of the predictions made. Cell D/(Cell A + Cell B + Cell C + Cell D). 657/(6692 + 1333 + 317 + 657) = 0.07301.

**Detection Prevalence** – shows the number of positive class predictions made as a proportion of all predictions. (Cell C + Cell D)/(Cell A + Cell B + Cell C + Cell D) = (657 + 317)/(6692 + 1333 + 317 + 657) = 0.10823.

**Balanced Accuracy** – essentially takes the average of the true positive and true negative rates, i.e. (sensitivity + specificity)/2. (0.33015 + 0.95477)/2 = 0.64246.
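The remaining four measures can be checked the same way. Again this is only a sketch in base R, using the same cell labels as the formulas in the text, with variable names of my own choosing.

```r
# Cell counts, labelled as in the formulas above
cell <- c(A = 6692,   # predicted Paid,      actually Paid
          B = 1333,   # predicted Paid,      actually Defaulted
          C = 317,    # predicted Defaulted, actually Paid
          D = 657)    # predicted Defaulted, actually Defaulted
total <- sum(cell)

prevalence           <- (cell[["B"]] + cell[["D"]]) / total   # actual defaulters
detection_rate       <- cell[["D"]] / total                   # correct default predictions
detection_prevalence <- (cell[["C"]] + cell[["D"]]) / total   # all default predictions

# Balanced accuracy averages the per-class recall
sensitivity <- cell[["D"]] / (cell[["B"]] + cell[["D"]])
specificity <- cell[["A"]] / (cell[["A"]] + cell[["C"]])
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(prevalence = prevalence, detection_rate = detection_rate,
        detection_prevalence = detection_prevalence,
        balanced_accuracy = balanced_accuracy), 5)
```

Balanced accuracy is well below the plain accuracy of 0.8166 because it gives the poorly predicted minority class the same weight as the majority class.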

Our results are typical of a dataset where the class labels are imbalanced, in that prediction accuracy for the majority class label is good, while accuracy for the minority label (which may be the one we are most interested in predicting correctly) is not so good. Our classifier correctly predicted only about a third of defaulters (its sensitivity). About 22% of the test set were actually defaulters (prevalence) and we predicted default in just under 11% of cases (detection prevalence), with about two thirds of these being correct (its positive predictive value).

The above is just a quick overview of some of the commonly used evaluation measures for classification models. There are others, e.g. AUC, log loss and so on, but discussion of those is for another time.