Identifying data
We want to be thoughtful about the data that we will use for our proof-of-concept and that we will split for training, validating and testing. Here are some important considerations for identifying data:
The data should be representative of the population and context for the predictive analytics objective. Here we want to make sure we have data for units experiencing a context similar to the one we expect for the units for whom we will ultimately deploy predictive analytics.
Recall our earlier example, which centered on predicting TANF training program participants’ risk of not finding and sustaining employment. When deciding which years of past data to include in our learner workflow, we should examine whether and when the eligibility criteria for the training program changed substantially. We may want to restrict our data to a period in which the eligibility criteria closely resemble those currently in place for the TANF training program.
In our other example, which focuses on predicting 10th grade students’ risk of not graduating on time, it is important to investigate graduation requirements. If the student cohorts used to train our models faced more stringent or more lenient graduation standards than the cohorts to which we apply the models, the models might not generalize well to years beyond our training-validation-testing workflow. (A simple sketch of restricting data to comparable cohorts follows.)
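For illustration, here is a minimal sketch of restricting an analysis file to comparable cohorts, assuming a pandas DataFrame; the file name, the cohort_year column, and the cutoff year are hypothetical placeholders rather than details from our examples.

```python
import pandas as pd

# Minimal sketch: keep only cohorts whose program rules resemble current rules.
# The file name, column name, and cutoff year below are hypothetical.
df = pd.read_csv("analysis_file.csv")

# Suppose eligibility (or graduation) criteria changed substantially before 2016,
# so earlier cohorts are not comparable to the units we would score at deployment.
FIRST_COMPARABLE_YEAR = 2016
df_restricted = df[df["cohort_year"] >= FIRST_COMPARABLE_YEAR].copy()

print(f"Kept {len(df_restricted)} of {len(df)} records from comparable cohorts.")
```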
Important measures should be consistently available. Here we want to assess whether we have sufficient consistency in the way measures are defined and entered into data systems. For example:
Have the data integrations been maintained consistently over time? Specifically, if the data we intend to use combine multiple underlying data sources, have these merges been conducted consistently? If not, key measures may be missing for some periods. Additionally, are these integrations/merges sustainable going forward? If certain data cannot be reliably obtained and integrated on an ongoing basis, they should be left out of the predictive analytics proof-of-concept. We need to consider not only historical data but also what will be available in the future: if particular data will not be available when predictive analytics is deployed, a proof-of-concept that relies on those data will not be realistic. (A simple year-by-year check of measure availability is sketched after this list.)
Was there a rollout of a new data system that had early implementation challenges? If so, we may want to exclude data from those early years from our training-validation-testing workflow.
Is survey data part of our data universe for predictive analytics? If so, do response rates vary substantially over time? Variation in response rates could (but does not necessarily) mean that different populations are being captured with each survey. We would need to investigate this.
Were there broad changes in how measures were defined or coded? We may want to limit the data we will use to train, validate and test models to a period in which the measures are consistent with the present (that is, close to the time when any model would be deployed).
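The checks above lend themselves to simple descriptive diagnostics. Below is a minimal sketch, assuming a pandas DataFrame with a year column and a handful of placeholder measure names; it tabulates missingness and survey response rates by year so that abrupt shifts, which may signal broken integrations, rollout problems, or redefined measures, stand out.

```python
import pandas as pd

# Minimal sketch of year-by-year consistency checks. The file name, the "year"
# column, and the measure names are hypothetical placeholders.
df = pd.read_csv("analysis_file.csv")

key_measures = ["earnings_q1", "training_hours", "survey_response"]

# Share of records missing each key measure, by year. Sudden jumps can flag a
# broken data integration, a rocky system rollout, or a redefined measure.
missing_by_year = df[key_measures].isna().groupby(df["year"]).mean()
print(missing_by_year.round(3))

# If a survey completion indicator exists, response rates by year show whether
# the surveyed population may have shifted over time.
if "survey_response" in df.columns:
    print(df.groupby("year")["survey_response"].mean().round(3))
```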
Note that inconsistencies in some measures may have implications for which measures we include in our prediction sets rather than for how we restrict our data for the proof-of-concept. That is, if some measures change over time, we may need to exclude just those measures from our models. If key predictors are sufficiently consistent over time, we may not need to restrict the data at all. We will return to measure consistency later, when we discuss data quality issues.
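When only a few measures are affected, the adjustment can be as simple as trimming the prediction set rather than the data. A minimal sketch, with hypothetical predictor names:

```python
# Minimal sketch: exclude only the measures whose definitions changed, keeping
# all years of data. The predictor names below are hypothetical.
predictors = ["prior_earnings", "training_hours", "case_type_code", "region"]

# Suppose "case_type_code" was recoded midway through our data period and cannot
# be harmonized across years; drop it from the prediction set.
inconsistent = {"case_type_code"}
predictors = [p for p in predictors if p not in inconsistent]
print(predictors)
```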