Selecting a threshold
Selecting a threshold that translates predicted probabilities into predicted classifications can be a difficult choice. The decision requires consideration of how results will be used in decision-making, as well as information about the distribution of the results. The following discussion provides some different potential approaches.
1. Specify a fixed value for the threshold.
Specifying a value, such as 0.5, to distinguish between the classes is straightforward and may be most useful in some contexts. For example, in a marketing campaign where the target audience is balanced between interested and not interested groups, a 0.5 threshold could be used to classify customers who are likely to respond to the campaign. Note that it is important to examine the distribution of the predicted probabilities before deciding on a threshold. If the predicted probabilities are clustered primarily below 0.5, for instance, then a threshold of 0.5 will result in few units being classified as positive (‘1’ or ‘yes’ on the binary outcome). Context also matters of course. For example, when classifying cancer, even if a patient has a 0.3 probability of having cancer, you would classify them to be 1.
2. Match the predicted proportion to the observed proportion.
Here, the threshold is set such that the proportion of predicted positive cases is equal to the observed proportion of positives in the dataset. It involves sorting predicted probabilities and selecting a threshold that aligns the predicted and observed ratios. This approach can be more adaptive than a fixed threshold, especially for imbalanced datasets. However, it doesn’t directly consider the costs associated with different types of errors. This approach is often used when maintaining the natural class distribution in predictions is a priority. For example, in credit scoring, if 20% of the applicants historically default, this approach would set a threshold to classify the riskiest 20% of new applicants as “high risk” for default.
3. Set the threshold equal to the the median of predicted probabilities.
This approach assumes that the distribution of probabilities is symmetric and aids in ensuring a balance in classification. While helping in certain scenarios to balance classifications, it’s not suitable for highly skewed or multi-modal distributions of predicted probabilities. For example, imagine a service wants to predict which current subscribers are less likely to renew their subscriptions so they can decide who to offer special renewal incentives. Given a symmetric distribution of predicted probabilities for renewal, the median threshold can be applied so that the incentive program is neither too broad (which could be costly and less effective) nor too narrow (missing out on potential non-renewals).
4. Find a threshold that targets resource constraints.
It may be helpful to find a threshold that takes resource constraints into account. For instance, if resources are available to intervene with only 100 individuals, you can find the threshold to identify the top 100 most at-risk individuals. This approach makes the model operationally relevant.
5. Target a desired positive rate.
In cases like rare disease detection, a high true positive rate is critical, even if it increases false positives. For example, in medical diagnostics for a rare disease, setting a threshold to ensure at least 95% of actual cases are identified, even if it results in more false positives, to ensure maximum detection of the disease. Of course, the trade-off of finding the threshold that ensures a high detection rate of positive cases can also lead to increased false alarms.
Code chunks in 02_Learning_Training_and_Validation.Rmd
compute many (but not all) of these threshold targets. We will go over this in class.