Data model for the code templates
Data model for training, validation and testing data
The code templates rely on data sets that follow a particular "data model," or format. The data model has the following required specifications:

- Each row corresponds to the unit of analysis. For example:
  - If you are predicting students' probabilities of graduating on time, each row corresponds to a unique student.
  - If you are predicting a community's probability of being impacted by devastating flooding, each row corresponds to a community.
  - If you are predicting a customer's probability of signing up for an offer, each row corresponds to a customer by promotion (i.e., if a customer is marketed multiple promotions, they may have one row per promotion). In this case, the unit is not the customer; it is the "customer/promotion."
- Each unit of analysis appears in only one row. There should be no repeated rows for the same unit. That is, if there are variables with repeated measures over time, the data should be arranged in "wide format" rather than "long format." For example, imagine your unit of analysis is the individual, but your data has multiple records for each individual, each corresponding to a job, with variables for wage and length of time at the job. In this case, the data need to be rearranged so that each record has variables such as "wage1," "wage2," etc. and "lengthjob1," "lengthjob2," etc.
- The columns should contain:
  - A unique identification number or code (ID). This can be of class `numeric` or `character`.
  - Outcome(s) of interest. This should be a binary variable with class `factor`. The variable should take values of 0 or 1. There should be no missing values.
  - Pre-processed potential predictors. These should be either `numeric` or `factor` variables. There should be no missing values. That is, this data set does not include raw variables with missingness but rather new versions of the variables that address missingness according to the guidance [here](data2_missingpredictors.qmd).
  - Variables to be used for assessing equity. For example, you would include variables such as race, gender, or any other variables for which you want to compare model performance or estimate measures of bias. These variables should all be categorical and have class `factor`. Missingness of these variables should be handled with the same approach as used for predictors.
  - Variables needed for stratification. For example, if cross-validation will be stratified by a location variable, make sure that variable is included. It should be categorical and have class `factor`. Missingness should be handled with the same approach as used for predictors.
- Across the different data sets used for training, validating, and testing, the columns should be identical.
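The long-to-wide rearrangement described above can be done in base R with `reshape()`. Here is a minimal sketch using hypothetical job records (the variable names `id`, `job`, `wage`, and `lengthjob` are illustrative, not part of the templates):

```r
# Hypothetical long-format data: one row per individual per job
long <- data.frame(
  id        = c(1, 1, 2),
  job       = c(1, 2, 1),
  wage      = c(15, 18, 22),
  lengthjob = c(12, 6, 30)
)

# Rearrange to wide format so that each individual has exactly one row,
# with columns wage1, wage2, lengthjob1, lengthjob2
wide <- reshape(long, direction = "wide", idvar = "id", timevar = "job",
                v.names = c("wage", "lengthjob"), sep = "")
wide
```

Individuals with fewer repeated measures than others (here, individual 2 has only one job) simply get `NA` in the extra wide-format columns; those `NA`s would then be handled in pre-processing like any other missingness.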
Here is a mini example of a training data file that aligns with the data model:

  ID SchoolA SchoolB GradOnTime Math.Proficient Math.AboveProficient Math.Missing Chronic.Absent Race
1  1       1       0          1               0                    0            0              0    1
2  2       1       0          0               1                    1            0              0    2
3  3       0       1          1               0                    0            1              0    3
4  4       0       1          0               0                    1            0              0    2
5  5       0       1          1               1                    0            0              1    1
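The specifications above can also be checked programmatically. Here is a sketch in base R that recreates the mini training example and asserts the key constraints; the checks are illustrative, not part of the code templates:

```r
# Recreate the mini training example (binary variables stored as factors)
dat <- data.frame(
  ID                   = 1:5,
  SchoolA              = factor(c(1, 1, 0, 0, 0)),
  SchoolB              = factor(c(0, 0, 1, 1, 1)),
  GradOnTime           = factor(c(1, 0, 1, 0, 1)),
  Math.Proficient      = factor(c(0, 1, 0, 0, 1)),
  Math.AboveProficient = factor(c(0, 1, 0, 1, 0)),
  Math.Missing         = factor(c(0, 0, 1, 0, 0)),
  Chronic.Absent       = factor(c(0, 0, 0, 0, 1)),
  Race                 = factor(c(1, 2, 3, 2, 1))
)

# One row per unit of analysis: no duplicate IDs
stopifnot(!anyDuplicated(dat$ID))

# Outcome is a 0/1 factor with no missing values
stopifnot(is.factor(dat$GradOnTime),
          all(levels(dat$GradOnTime) %in% c("0", "1")),
          !anyNA(dat$GradOnTime))

# No missing values anywhere: missingness was handled in pre-processing
stopifnot(!anyNA(dat))
```

If any assertion fails, `stopifnot()` stops with an error naming the violated condition, which makes these checks convenient to run right before the modeling steps.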
Data model for meta data
Having a meta data file is also recommended. A meta data file is a machine-readable codebook that summarizes your data. A meta data file for the training/validation data is valuable for (1) documenting and describing the data being fed into the predictive analytics steps and (2) in some cases, selecting variables based on information in the meta data rather than having to type them all in. A meta data file for the testing data is valuable for comparing the variable distributions to those in the training data. Here are specifications for the meta data files:
- Each row corresponds to a variable.
- Columns may include but are not limited to the following:
  - Variable type. E.g., "ID", "outcome", "predictor", "protectedAttribute", etc.
  - Summary statistics. E.g., min, max, mean, median, percentiles (e.g., 5th, 25th, 75th, 95th).
  - Data source. If your data results from integrating multiple data sources, you might want to include a column that indicates the source of each variable.
  - Time point. If variables are entered or integrated into your data at different time points, you can indicate that here. This provides a nice check that variables are available for your prediction time point. If you are repeating predictive analytics at multiple time points, this column may be essential.
  - Labels. You may want to include a column that provides a short description of each variable.
  - Other information. You may want to include other information that is helpful to document or that will help you specify predictor sets. For example, perhaps your data includes a set of variables that come from three different assessments. Perhaps you want to add variables from one of the assessments to a predictor set and then add variables from the other two assessments to a larger predictor set to see the predictive value they add. Rather than typing the assessment variables into your R notebook, you could use the meta data file to grab all the variables that correspond to each kind of assessment.
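Much of a meta data file can be derived from the training data itself. Here is a sketch in base R that computes summary statistics for a toy data frame; the column names and the hand-assigned `varType` values mirror the example below but are otherwise illustrative:

```r
# Toy training data (a numeric ID, a factor outcome, a numeric predictor)
dat <- data.frame(ID         = 1:5,
                  GradOnTime = factor(c(1, 0, 1, 0, 1)),
                  Math.Score = c(10, 25, NA, 40, 35))

# Helper: apply a summary function to numeric columns, NA otherwise
num_or_na <- function(x, f) if (is.numeric(x)) f(x, na.rm = TRUE) else NA

# Build the meta data table, one row per variable
meta <- data.frame(
  varName = names(dat),
  varType = c("ID", "outcome", "predictor"),  # assigned by hand here
  min     = sapply(dat, num_or_na, min),
  max     = sapply(dat, num_or_na, max),
  mean    = sapply(dat, num_or_na, mean),
  row.names = NULL
)
meta
```

Columns such as the data source, time point, and labels cannot be computed from the data and would be filled in by hand or merged from an existing codebook.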
Here is a mini example of a meta data file that aligns with the data model. I have included just a few example summary statistics to keep the example short.
  varName              varType      min max median mean source  timepoint label
1 ID                   ID             1   5      3  3.0 studRec <NA>      ID
2 SchoolA              predictor      0   1      0  0.4 studRec S1end     Attends School A
3 SchoolB              predictor      0   1      1  0.6 studRec S1end     Attends School B
4 GradOnTime           outcome        0   1      1  0.6 studRec S1end     Graduated on time
5 Math.Proficient      predictor      0   1      0  0.4 studRec S1end     Proficient but not above on fall state math test
6 Math.AboveProficient predictor      0   1      0  0.4 studRec S1end     Above proficient on fall state math test
7 Math.Missing         predictor      0   1      0  0.2 studRec S1end     Missing score on fall state math test
8 Chronic.Absent       predictor      0   1      0  0.2 studRec S2end     Whether chronically absent across year
9 Race                 equityAttrib   1   3      2   NA studRec S1start   Self-reported race on enrollment form
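With a meta data file in this shape, predictor sets can be pulled out of the `varType` (or `source`) column instead of being typed by hand. Here is a sketch using a pared-down version of the example above (the `varName`/`varType` column names are the ones from this example, not a requirement of the templates):

```r
# Pared-down meta data table mirroring the example above
meta <- data.frame(
  varName = c("ID", "SchoolA", "GradOnTime", "Math.Proficient"),
  varType = c("ID", "predictor", "outcome", "predictor"),
  stringsAsFactors = FALSE
)

# Grab all predictor names without typing them out
predictors <- meta$varName[meta$varType == "predictor"]
predictors
```

The resulting character vector can then be used to subset the training data, e.g. `dat[, predictors]`, or to build a model formula.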