Getting specific about empirically defining bias
There are different ways to define bias or, conversely, the desired goal of fairness, and different stakeholders may give priority to different definitions. The literature on the bias/fairness of predictive models is vast, and it lacks consensus on the best definitions or measures. (It also lacks consistency in the names assigned to different concepts and measures.) It is therefore important to focus on the questions various stakeholders would want answered in order to assess fairness.
Sticking with our example about pretrial risk assessment, consider the questions below. Each question illustrates a different notion or definition of fairness in this context, and each maps to a different way of measuring bias empirically. As above, for ease of presentation these questions focus on race, but the same questions can be asked for other demographic groups for which it is important to assess bias/fairness. As will be discussed below, it is generally not possible to satisfy all of these definitions of fairness simultaneously, and some ways of defining and measuring fairness have limitations.
- For a particular risk score, is there the same chance of failure (failure to appear, new criminal activity, or another adverse outcome) in all racial groups during the pretrial period (or during a fixed period)?
In other words, do risk scores mean the same thing regardless of race? The statistical concept that answers this question is usually referred to as calibration or predictive parity.¹ Here, “failure” rates are computed for each risk score and subgroup of interest, and subgroup failure rates are then compared within risk scores; a minimal code sketch of this comparison appears after the notes below.
Note that this question about fairness focuses on the risk scores themselves (for example, scores that might range from 1 to 6) rather than defendants’ risk categories (for example, high risk or not), even though the categories are what help judges decide whether or not to detain a person. Moving from a risk score to a risk category can lead to very different conclusions about whether or not a risk assessment is fair. Therefore, when gauging bias, it is important to look not only at the assessment itself but also at the recommendations for how to use its results.
It is also important to note that if the outcome itself is measured with bias, then observed differences between two subgroups with the same risk score could be due to that bias. For example, if Black people are more likely to be arrested for the same behavior (for example, drug possession), then any assessment based on those arrest data is fundamentally flawed.
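As a concrete illustration, here is a minimal sketch of the calibration check in Python. It assumes a hypothetical data set with one row per defendant and columns `score`, `race`, and `failed`; these names and values are illustrative, not from the source or from any real tool.

```python
import pandas as pd

# Hypothetical pretrial data; column names and values are illustrative.
df = pd.DataFrame({
    "score":  [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "race":   ["Black", "White"] * 6,
    "failed": [0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1],  # 1 = failed pretrial
})

# Calibration / predictive parity: within each risk score, compare the
# observed failure rate across racial groups.
failure_rates = (
    df.groupby(["score", "race"])["failed"]
      .mean()
      .unstack("race")
)
print(failure_rates)
# Scores are calibrated if, within each row (risk score), the failure
# rates are approximately equal across the group columns.
```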
- Among those who would not fail if released, is the percentage assigned a particular risk score the same across all racial groups? This statistical concept has been referred to as error rate balance (among other terms).
This definition, while desirable for complete fairness, introduces some challenges. To achieve this type of fairness, a risk-assessment tool would need to treat people differently on the basis of race, which may not be ethical or even constitutional. Moreover, when base rates of failure differ across racial groups, optimizing a model for this form of fairness makes it mathematically impossible to also achieve predictive parity, the first form of fairness discussed above. (For example, see Corbett-Davies, Pierson, Feller, Goel, & Huq (2017) and Chouldechova (2017).)
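A hedged sketch of how error rate balance could be examined, using the same hypothetical data layout as above: among observed nonfailures, compare how risk scores are distributed across groups.

```python
import pandas as pd

# Hypothetical data, structured as in the calibration sketch above.
df = pd.DataFrame({
    "score":  [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "race":   ["Black", "White"] * 6,
    "failed": [0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1],
})

# Error rate balance: among people who did not fail, compare the share
# assigned each risk score across racial groups.
nonfailures = df[df["failed"] == 0]
score_shares = (
    nonfailures.groupby("race")["score"]
               .value_counts(normalize=True)
               .unstack("score", fill_value=0)
)
print(score_shares)
# Balance holds if the rows (groups) show similar shares at each score,
# i.e., nonfailing defendants in one group are not systematically
# assigned higher scores than nonfailing defendants in another.
```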
- Do all racial groups have the same distribution of risk scores?
It is understandable that stakeholders would want to know the answer to this question. However, it is impossible to untangle the extent to which different distributions of risk scores reflect bias rather than actual differences in underlying risk across races. Bias could arise, for example, from different arrest rates produced by differential policing strategies or racism. Alternatively, genuinely different underlying distributions could occur if there are differential rates of prior criminal activity and behavior that are in turn due to socioeconomic factors correlated with race.
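The comparison itself is straightforward to compute, even though, as just noted, it cannot reveal why the distributions differ. A sketch under the same hypothetical data assumptions (the chi-square test here is one common choice, not something prescribed by the source):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data, as in the earlier sketches.
df = pd.DataFrame({
    "score": [1, 2, 2, 3, 4, 5, 1, 1, 2, 3, 3, 4],
    "race":  ["Black"] * 6 + ["White"] * 6,
})

# Share of each group receiving each risk score.
score_dist = pd.crosstab(df["race"], df["score"], normalize="index")
print(score_dist)

# Chi-square test of whether score and group are independent. Note that
# even a significant difference cannot, by itself, distinguish bias in
# the inputs from real differences in underlying risk.
chi2, p, dof, expected = chi2_contingency(pd.crosstab(df["race"], df["score"]))
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
```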
- Do the risk scores produced by a risk assessment distinguish between failures and nonfailures equally well across all racial groups?
Several measures address this form of fairness, such as relative rates of predictive accuracy, false positive rates, false negative rates, F1 scores (the harmonic mean of precision and recall, which reflects both false positives and false negatives), and area under the curve (AUC) measures. Unfortunately, these measures are also susceptible to the tensions between the first two definitions above, and they can be hard to interpret if the true distribution of risk differs across groups. Some of these measures (in particular the AUC) do not require specifying a decision threshold for categorizing low and high risk, which avoids some biases introduced by establishing such thresholds. Not specifying thresholds, however, could conceal biases that would arise when the risk tool is used in practice, since these tools are ultimately tied to a hard threshold of detention or release.
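As an illustration of the threshold-free approach, the sketch below computes AUC separately by group with scikit-learn, again on hypothetical data:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical data: higher scores are meant to indicate higher risk.
df = pd.DataFrame({
    "score":  [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
    "race":   ["Black"] * 6 + ["White"] * 6,
    "failed": [0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1],
})

# AUC: the probability that a randomly chosen failure receives a higher
# score than a randomly chosen nonfailure. No low/high-risk cutoff is
# needed, so threshold-induced bias is avoided; but, as noted above,
# this also hides threshold-related bias that would appear in actual use.
for group, sub in df.groupby("race"):
    auc = roc_auc_score(sub["failed"], sub["score"])
    print(f"{group}: AUC = {auc:.3f}")
```

Comparing the per-group AUCs gauges whether the scores rank-order risk equally well in each group; thresholded measures such as false positive or false negative rates would additionally require choosing the cutoff whose effects the passage above describes.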
Note: the text on this page is largely taken from Porter, Redcross, & Miratrix (2020).
Footnotes
1. The terminology for different types of bias is not consistent in the literature, and other terms have been used as well.