When are we making predictions?

Deciding when to make predictions is not always straightforward, but it is critical to developing a valid prediction model.

The key is to balance two important considerations:

  1. We want to run predictive models early enough for the findings to be actionable. In our case, we want caseworkers to be able to help clients early enough that they can address challenges.

  2. But running predictive models later may mean more measures are available to include in our models, and therefore potentially more accurate models.


In our TANF program example, through conversations with the program managers and caseworkers on our project team, we learned (a) that interventions should be introduced very early, so they wanted risk scores available soon after client intake and approval for the education and training program. But we also learned (b) that there can be a substantial lag in the entry of most of the data collected during the intake and approval processes for TANF clients.

We therefore decided to specify our prediction timepoint as one month after the approval process for the education and training program.
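To make this concrete, here is a minimal sketch of how that timepoint decision might look in code. It assumes a hypothetical client-level table with an `approval_date` column; all names and dates are illustrative, not drawn from any actual TANF data system.

```python
import pandas as pd

# Hypothetical client-level table; column names and dates are illustrative.
clients = pd.DataFrame({
    "client_id": [101, 102, 103],
    "approval_date": pd.to_datetime(["2024-01-03", "2024-01-17", "2024-02-02"]),
})

# Prediction timepoint: one month after approval for the education
# and training program, per the decision described above.
clients["prediction_date"] = clients["approval_date"] + pd.DateOffset(months=1)

print(clients)
```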


A common mistake is failing to give adequate thought to when predictions are being made and to which measures are available at that time. If we ignore this issue, we might generate our risk scores (a) too early, which means our models might use measures that would not yet be available in practice, or (b) too late, which means our models might use measures that are collected so close to the outcome that there is little time left to act.

First, we discuss (a), generating risk scores too early in the process. If we are not careful about understanding not just the data, but the data collection process, then we might accidentally use data that would not realistically be available at our chosen time point. If we had not spoken in depth with the program managers, we would not have known that client intake data was not immediately available, but instead could take a few weeks to be input into the system. As discussed above, if we make this mistake, our models would look effective in our test phase, but would have poor performance in practice due to missing data.
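One way to guard against this mistake is to filter on when each value was actually *entered* into the system, not when it was collected. The sketch below assumes hypothetical tables with `entry_date` and `prediction_date` columns; in practice, these timestamps would have to come from the data system's audit or entry-log fields, if they exist.

```python
import pandas as pd

# Hypothetical tables; all column and measure names are illustrative.
prediction_dates = pd.DataFrame({
    "client_id": [101, 102],
    "prediction_date": pd.to_datetime(["2024-02-03", "2024-02-17"]),
})
measures = pd.DataFrame({
    "client_id": [101, 101, 102],
    "measure": ["education_level", "work_history", "education_level"],
    "entry_date": pd.to_datetime(["2024-01-20", "2024-02-15", "2024-02-25"]),
})

# Keep only measures whose *entry* date falls on or before the client's
# prediction date; anything still sitting in the data-entry backlog at
# that point would not be visible to the model in production.
joined = measures.merge(prediction_dates, on="client_id")
available = joined[joined["entry_date"] <= joined["prediction_date"]]
print(available)
```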

Second, we discuss (b), generating risk scores too late in the process. This strategy could lead to excellent performance (high accuracy) but useless predictions. To illustrate this type of mistake, suppose our data include a measure of whether a TANF client has submitted a job application at the end of a job preparation class. Including this measure in our model could improve its predictive performance. However, because this measure is collected toward the end of the education and training program, waiting to generate risk scores until after it is collected (setting our timepoint at the end of the program) leaves little time to take action to help those at highest risk of not finding and sustaining employment.
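A quick check on this mistake is to compute how much lead time each candidate measure would leave between the prediction and the outcome. The measure names and dates below are hypothetical, chosen only to illustrate the contrast described above.

```python
import pandas as pd

# Hypothetical collection dates for two candidate measures, and the
# outcome date, for one illustrative client.
outcome_date = pd.Timestamp("2024-07-01")  # employment outcome measured
collection_dates = pd.Series({
    "intake_education_level": pd.Timestamp("2024-01-10"),
    "job_application_submitted": pd.Timestamp("2024-06-20"),
})

# Lead time each measure would leave for caseworkers if we waited for
# it to be collected before generating risk scores.
lead_time_days = (outcome_date - collection_dates).dt.days
print(lead_time_days)
# intake_education_level:    173 days -> plenty of time to intervene
# job_application_submitted:  11 days -> boosts accuracy, too late to act
```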

Think about this issue when you read or hear about the high accuracy of predictive models. If a company boasts that it predicts an outcome with very high accuracy, it is important to ask: when are the predictions being made, which measures are actually available at that time, and is there still enough time to act on the results?
