Data model for the code templates
Data model for training, validation and testing data
The code templates rely on data sets that follow a particular "data model," or format. The data model has the following required specifications:

- Each row corresponds to the unit of analysis. For example:
  - If you are predicting students' probabilities of graduating on time, each row corresponds to a unique student.
  - If you are predicting a community's probability of being impacted by devastating flooding, each row corresponds to a community.
  - If you are predicting a customer's probability of signing up for an offer, each row corresponds to a customer by promotion (i.e., if a customer is marketed multiple promotions, they may have one row per promotion). In this case, the unit is not the customer; it is the "customer/promotion."
- Each unit of analysis appears in only one row. There should be no repeated rows for the same unit. That is, if there are variables with repeated measures over time, the data should be arranged in "wide format" rather than "long format." For example, imagine your unit of analysis is the individual, but your data has multiple records for each individual, each corresponding to a job, with variables for wage and length of time at the job. In this case, the data need to be rearranged so that each record has variables such as "wage1," "wage2," etc. and "lengthjob1," "lengthjob2," etc.
- The columns should contain:
  - A unique identification number or code (ID). This can be of class `numeric` or `character`.
  - Outcome(s) of interest. This should be a binary variable with class `factor`. The variable should take values of 0 or 1. There should be no missing values.
  - Pre-processed potential predictors. These should be either `numeric` or `factor` variables. There should be no missing values. That is, this data set does not include raw variables with missingness but rather new versions of the variables that address missingness according to the guidance [here](data2_missingpredictors.qmd).
  - Variables to be used for assessing equity. For example, you would include variables such as race, gender, or any other variables for which you want to compare model performance or estimate measures of bias. These variables should all be categorical and have class `factor`. Missingness of these variables should be handled with the same approach as used for predictors.
  - Variables needed for stratification. For example, if cross-validation will be stratified by a location variable, make sure that variable is included. It should be categorical and have class `factor`. Missingness should be handled with the same approach as used for predictors.
- Across the different data sets used for training, validating, and testing, the columns should be identical.
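The long-to-wide rearrangement described above can be done in base R with `reshape()`. Here is a minimal sketch using hypothetical job records (the variable names `id`, `job`, `wage`, and `lengthjob` are illustrative, not part of the templates):

```r
# Hypothetical long-format data: one row per individual per job
long <- data.frame(
  id        = c(1, 1, 2),
  job       = c(1, 2, 1),
  wage      = c(15, 18, 22),
  lengthjob = c(12, 6, 30)
)

# Rearrange to wide format so that each individual has exactly one row,
# with columns wage1, wage2, lengthjob1, lengthjob2
wide <- reshape(long, direction = "wide", idvar = "id", timevar = "job",
                v.names = c("wage", "lengthjob"), sep = "")
wide
```

Individuals with fewer repeated measures than others (here, individual 2 has only one job) simply get `NA` in the extra wide-format columns; those `NA`s would then be handled in pre-processing like any other missingness.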
Here is a mini example of a training data file that aligns with the data model:

  ID SchoolA SchoolB GradOnTime Math.Proficient Math.AboveProficient Math.Missing Chronic.Absent Race
1  1       1       0          1               0                    0            0              0    1
2  2       1       0          0               1                    1            0              0    2
3  3       0       1          1               0                    0            1              0    3
4  4       0       1          0               0                    1            0              0    2
5  5       0       1          1               1                    0            0              1    1
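The specifications above can also be checked programmatically. Here is a sketch in base R that recreates the mini training example and asserts the key constraints; the checks are illustrative, not part of the code templates:

```r
# Recreate the mini training example (binary variables stored as factors)
dat <- data.frame(
  ID                   = 1:5,
  SchoolA              = factor(c(1, 1, 0, 0, 0)),
  SchoolB              = factor(c(0, 0, 1, 1, 1)),
  GradOnTime           = factor(c(1, 0, 1, 0, 1)),
  Math.Proficient      = factor(c(0, 1, 0, 0, 1)),
  Math.AboveProficient = factor(c(0, 1, 0, 1, 0)),
  Math.Missing         = factor(c(0, 0, 1, 0, 0)),
  Chronic.Absent       = factor(c(0, 0, 0, 0, 1)),
  Race                 = factor(c(1, 2, 3, 2, 1))
)

# One row per unit of analysis: no duplicate IDs
stopifnot(!anyDuplicated(dat$ID))

# Outcome is a 0/1 factor with no missing values
stopifnot(is.factor(dat$GradOnTime),
          all(levels(dat$GradOnTime) %in% c("0", "1")),
          !anyNA(dat$GradOnTime))

# No missing values anywhere: missingness was handled in pre-processing
stopifnot(!anyNA(dat))
```

If any assertion fails, `stopifnot()` stops with an error naming the violated condition, which makes these checks convenient to run right before the modeling steps.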
Data model for meta data
Having a meta data file is also recommended. A meta data file is a machine-readable codebook that summarizes your data. A meta data file for the training/validation data is valuable for (1) documenting and describing the data being fed into the predictive analytics steps and (2) in some cases, selecting variables based on information in the meta data rather than having to type them all in. A meta data file for the testing data is valuable for comparing the variable distributions to those in the training data. Here are specifications for the meta data files:
- Each row corresponds to a variable.
- Columns may include but are not limited to the following:
  - Variable type. E.g., "ID", "outcome", "predictor", "protectedAttribute", etc.
  - Summary statistics. E.g., min, max, mean, median, percentiles (e.g., 5th, 25th, 75th, 95th).
  - Data source. If your data results from integrating multiple data sources, you might want to include a column that indicates the source of each variable.
  - Time point. If variables are entered or integrated into your data at different time points, you can indicate that here. This provides a nice check that variables are available for your prediction time point. If you are repeating predictive analytics at multiple time points, this column may be essential.
  - Labels. You may want to include a column that provides a short description of each variable.
  - Other information. You may want to include other information that is helpful to document or that will help you specify predictor sets. For example, perhaps your data includes a set of variables that come from three different assessments. Perhaps you want to add variables from one of the assessments to a predictor set and then add variables from the other two assessments to a larger predictor set to see the predictive value they add. Rather than typing the assessment variables into your R notebook, you could use the meta data file to grab all the variables that correspond to each kind of assessment.
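Much of a meta data file can be derived from the training data itself. Here is a sketch in base R that computes summary statistics for a toy data frame; the column names and the hand-assigned `varType` values mirror the example below but are otherwise illustrative:

```r
# Toy training data (a numeric ID, a factor outcome, a numeric predictor)
dat <- data.frame(ID         = 1:5,
                  GradOnTime = factor(c(1, 0, 1, 0, 1)),
                  Math.Score = c(10, 25, NA, 40, 35))

# Helper: apply a summary function to numeric columns, NA otherwise
num_or_na <- function(x, f) if (is.numeric(x)) f(x, na.rm = TRUE) else NA

# Build the meta data table, one row per variable
meta <- data.frame(
  varName = names(dat),
  varType = c("ID", "outcome", "predictor"),  # assigned by hand here
  min     = sapply(dat, num_or_na, min),
  max     = sapply(dat, num_or_na, max),
  mean    = sapply(dat, num_or_na, mean),
  row.names = NULL
)
meta
```

Columns such as the data source, time point, and labels cannot be computed from the data and would be filled in by hand or merged from an existing codebook.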
Here is a mini example of a meta data file that aligns with the data model. I have included just a few example summary statistics to keep the example short.
  varName              varType      min max median mean source  timepoint label
1 ID                   ID             1   5      3  3.0 studRec <NA>      ID
2 SchoolA              predictor      0   1      0  0.4 studRec S1end     Attends School A
3 SchoolB              predictor      0   1      1  0.6 studRec S1end     Attends School B
4 GradOnTime           outcome        0   1      1  0.6 studRec S1end     Graduated on time
5 Math.Proficient      predictor      0   1      0  0.4 studRec S1end     Proficient but not above on fall state math test
6 Math.AboveProficient predictor      0   1      0  0.4 studRec S1end     Above proficient on fall state math test
7 Math.Missing         predictor      0   1      0  0.2 studRec S1end     Missing score on fall state math test
8 Chronic.Absent       predictor      0   1      0  0.2 studRec S2end     Whether chronically absent across year
9 Race                 equityAttrib   1   3      2   NA studRec S1start   Self-reported race on enrollment form
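With a meta data file in this shape, predictor sets can be pulled out of the `varType` (or `source`) column instead of being typed by hand. Here is a sketch using a pared-down version of the example above (the `varName`/`varType` column names are the ones from this example, not a requirement of the templates):

```r
# Pared-down meta data table mirroring the example above
meta <- data.frame(
  varName = c("ID", "SchoolA", "GradOnTime", "Math.Proficient"),
  varType = c("ID", "predictor", "outcome", "predictor"),
  stringsAsFactors = FALSE
)

# Grab all predictor names without typing them out
predictors <- meta$varName[meta$varType == "predictor"]
predictors
```

The resulting character vector can then be used to subset the training data, e.g. `dat[, predictors]`, or to build a model formula.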