How to Evaluate a Binary Classifier?

Evaluating a binary classifier is a common task in machine learning. In this blog, we will talk about which metrics to use to evaluate a binary classifier, why multiple metrics are needed, how to plot a receiver operating characteristic (ROC) curve, how a random classifier works, and why a precision-recall curve is sometimes necessary.

For a binary classification problem, a perfect classifier detects both the positive samples and the negative samples. In practice, however, a classifier that is very sensitive to positive samples tends to mistake negative samples for positive ones. There is a trade-off between identifying all the positive samples and sacrificing negative samples. Thus two metrics, the true positive rate (TPR) and the false positive rate (FPR), are used to measure the performance of a binary classifier. The TPR, equal to TP/P, quantifies the fraction of truly positive samples identified, where TP is the number of truly positive samples among all the samples predicted to be positive and P is the total number of positive samples. The FPR is equal to FP/N, where FP is the number of negative samples predicted to be positive and N is the total number of negative samples. The FPR can be thought of as the price paid for identifying positive samples. A perfect classifier pays no price to detect all the positive samples, i.e., TPR = 1 and FPR = 0; in practice, the FPR increases as the TPR increases.
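To make the definitions concrete, here is a minimal sketch of computing the TPR and FPR from ground-truth labels and hard predictions. The arrays are made-up illustrative data, not the output of any real classifier.

```python
import numpy as np

# Hypothetical labels and predictions (1 = positive, 0 = negative).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])

TP = np.sum((y_pred == 1) & (y_true == 1))  # positives correctly flagged
FP = np.sum((y_pred == 1) & (y_true == 0))  # negatives wrongly flagged
P = np.sum(y_true == 1)                     # total positive samples
N = np.sum(y_true == 0)                     # total negative samples

tpr = TP / P  # 2/3 for this toy data
fpr = FP / N  # 1/5 for this toy data
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```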

To measure how the FPR changes as the TPR increases, we turn to the ROC curve. A binary classifier usually generates a score for each sample, and we can define a cut-off above which a sample is predicted to be positive. As the cut-off varies, the TPR and FPR change accordingly, so we obtain multiple (FPR, TPR) pairs, and connecting all the points gives a curve called the ROC curve. How, then, do we measure the overall behavior in terms of the ROC curve? The usual metric is the area under the ROC curve.
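The sketch below shows one way to plot a ROC curve and compute the area under it, assuming scikit-learn and matplotlib are available. The `y_score` array stands in for the scores a real classifier would produce; the values here are made up.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical ground-truth labels and classifier scores.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9])

# roc_curve sweeps the cut-off over the scores and returns one
# (FPR, TPR) pair per threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

plt.plot(fpr, tpr, label=f"ROC curve (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```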

It is common to see the diagonal on a ROC curve plot. The diagonal represents the performance of a random classifier, which works as follows. To detect k percent of the positive samples with a random classifier, we can randomly select k percent of all the samples and label them as positive. Because the selection is random, about k percent of the positive samples end up among the samples predicted to be positive, and likewise about k percent of the negative samples are labeled as positive. So the TPR is always equal to the FPR for a random classifier, independent of the composition of the samples.
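A quick simulation (with hypothetical numbers) illustrates this: label a random k-fraction of all samples as positive and check that the TPR and FPR both come out close to k, even on a heavily imbalanced dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.05).astype(int)  # ~5% positives
k = 0.3

# Randomly call a k-fraction of all samples positive.
y_pred = (rng.random(y_true.size) < k).astype(int)

tpr = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)
fpr = np.sum((y_pred == 1) & (y_true == 0)) / np.sum(y_true == 0)
print(f"k = {k}, TPR = {tpr:.3f}, FPR = {fpr:.3f}")  # both come out near 0.3
```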

Although the ROC curve for a classifier does not depend on the composition of the dataset, it can look misleadingly good when most of the samples in the dataset are negative. The reason is that the FPR increases only slightly as the classifier labels more samples as positive. In this case, we need another metric to measure the performance of a classifier: precision. Precision is equal to TP/PP, where TP is the same as before and PP is the number of samples predicted to be positive. As the cut-off varies, we can likewise obtain a precision-recall curve, and the area under that curve, or the average precision, can be used to measure the overall performance.
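A precision-recall curve can be drawn in much the same way as the ROC curve. The sketch below again assumes scikit-learn and matplotlib, with made-up labels and scores standing in for real data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score

# Hypothetical ground-truth labels and classifier scores.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9])

# precision_recall_curve sweeps the cut-off over the scores and returns
# one (precision, recall) pair per threshold.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
ap = average_precision_score(y_true, y_score)

plt.plot(recall, precision, label=f"PR curve (AP = {ap:.2f})")
plt.xlabel("Recall (TPR)")
plt.ylabel("Precision")
plt.legend()
plt.show()
```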