Missing data - outcomes

Outcome values could be missing for multiple reasons:

  1. The outcome has not been sufficiently defined. For example, imagine a company that wants to predict whether a customer signs up for an advertised service. The binary outcome variable may equal 1 if a customer signs up and 0 if they decline. But what about those who have not yet indicated “yes” (1) or “no” (0)? They may have a missing value. It is critical here to define a window of time for observing the outcome; that is, we need to define by when the sign-up must occur (e.g., within 1 month). The missing values would then take the value of 0, indicating that “no,” the customer did not sign up for the service within 1 month of being shown the advertisement (see the first sketch after this list).

  2. All the same reasons predictors can be missing. If an outcome is missing due to random typos, technical glitches, or general messiness, this is probably not a problem; the corresponding unit/observation would simply be excluded from modeling. However, if an outcome variable’s missingness is nonnegligible and due to nonrandom reasons (e.g., some individuals were hesitant to report the information for the outcome variable), this can introduce serious problems for the analysis. If the outcome is observed only for a subgroup that differs from the overall population in your data in some meaningful way, then your model will only describe that subgroup. And because that subgroup can only be identified by its nonmissing outcome data, you will end up applying the model to the broader population, producing unreliable results when the model is deployed on data in which the outcomes are unknown (see the second sketch after this list).

  3. The outcome cannot be observed for a subgroup. This is another version of #2, but it arises somewhat differently. Consider the following example: across the country, release and detention decisions for defendants in the pretrial period (that is, the period after arrest while a criminal case is being adjudicated) are increasingly guided by risk assessments, which rely on data to estimate defendants’ risk of failing to appear for a court date or of being charged with new criminal activity if released pending trial. In this setting, predictive models are generally fit only to those defendants who were not detained while awaiting trial. Those who were detained, potentially because they were unable to make bail, are excluded because no outcome related to new criminal activity or failure to appear can be observed for them (this is referred to as “censoring”). This censoring can cause multiple problems. First, if the people who are detained are systematically different from the people who are not detained, the final models may not generalize: they may not accurately predict risk for the people who were detained. Additionally, if detention patterns differ by racial group, bias may be introduced by fitting models on different subsets of the racial groups. One approach to address the censoring problem when validating pretrial risk assessments is to impute the missing outcomes for detained or partly detained defendants. This imputation could use race and other characteristics not included in the final risk model. (Because imputation is only used to build the model, these defendant characteristics will not, in the end, be used as risk factors by the final predictive model and thus the overall risk assessment.) As long as the imputation captures any differential relationships between detention and subgroups, the subsequent model-fitting process will be less vulnerable to biases from censoring (see the third sketch after this list).
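
A minimal sketch of the first point, using pandas and hypothetical column names (the example data and field names are not from the source): once an observation window is defined, customers with no recorded sign-up inside that window get an outcome of 0 rather than a missing value.

```python
import pandas as pd

# Hypothetical sign-up data: one row per customer who saw the ad.
ads = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "ad_date": pd.to_datetime(["2023-01-05", "2023-01-10", "2023-01-12"]),
    "signup_date": pd.to_datetime(["2023-01-20", None, "2023-03-01"]),
})

window = pd.Timedelta(days=30)  # the chosen observation window (e.g., 1 month)

# Outcome = 1 only if a sign-up was recorded within the window after the ad;
# customers with no sign-up, or a sign-up after the window, are coded 0
# rather than left missing.
signed_up_in_window = (
    ads["signup_date"].notna()
    & (ads["signup_date"] - ads["ad_date"] <= window)
)
ads["outcome"] = signed_up_in_window.astype(int)
print(ads[["customer_id", "outcome"]])
```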
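
For the second point, a small sketch (with synthetic data, not from the source) of a basic check before dropping rows with missing outcomes: compare predictor distributions between rows with and without an observed outcome. Large differences suggest the missingness is nonrandom and that a complete-case model may only describe the subgroup that reported an outcome.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
})

# Simulate nonrandom missingness: older customers report the outcome less often.
p_missing = (df["age"] - 18) / 120
df["outcome"] = np.where(rng.random(n) < p_missing, np.nan, rng.integers(0, 2, size=n))

# Compare predictor means for rows with vs. without an observed outcome.
df["outcome_observed"] = df["outcome"].notna()
print(df.groupby("outcome_observed")[["age", "income"]].mean())

# Complete-case data that would actually be used for modeling.
modeling_df = df[df["outcome_observed"]]
print(f"Dropped {n - len(modeling_df)} rows with missing outcomes")
```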
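
For the third point, one way the imputation idea could look in code, using synthetic data, hypothetical feature names, and scikit-learn (the source describes the approach but not a specific implementation): a first-stage model, fit only on released defendants and allowed to use extra characteristics such as race, imputes outcomes for detained defendants; the final risk model is then fit on everyone using only the permitted risk factors.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2_000
df = pd.DataFrame({
    "prior_arrests": rng.poisson(2, n),
    "age": rng.integers(18, 70, n),
    "race": rng.integers(0, 2, n),       # used during imputation only
    "detained": rng.random(n) < 0.3,     # outcome is censored for these rows
})
# Observed outcome (e.g., failure to appear) exists only for released defendants.
df["outcome"] = np.where(
    df["detained"], np.nan,
    (rng.random(n) < 0.2 + 0.05 * df["prior_arrests"]).astype(float),
)

impute_features = ["prior_arrests", "age", "race"]
risk_features = ["prior_arrests", "age"]  # final model excludes race

# First stage: fit the imputation model on released defendants only.
released = df[~df["detained"]]
imputer = LogisticRegression(max_iter=1_000).fit(
    released[impute_features], released["outcome"].astype(int)
)

# Fill in outcomes for detained defendants with the imputation model's predictions.
detained_mask = df["detained"]
df.loc[detained_mask, "outcome"] = imputer.predict(df.loc[detained_mask, impute_features])

# Final risk model: fit on all defendants, using only the allowed risk factors.
risk_model = LogisticRegression(max_iter=1_000).fit(df[risk_features], df["outcome"].astype(int))
print(risk_model.coef_)
```

In practice the imputation step could be more sophisticated (e.g., multiple imputation), but the key design choice is the same: the sensitive characteristics inform the imputation only and never appear in the deployed risk model.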
