Extracting predictive value
Here we discuss data preparation steps focused on extracting helpful information from data in order to enhance the performance of our predictive models. This set of steps is often referred to as “feature engineering” (as predictors are often referred to as “features”). The data preparation process can be guided by EDA in some cases, but it also involves domain knowledge, intuition, and experimentation to represent data in a manner that makes it more suitable or informative, often leading to improved model accuracy and generalization. After addressing missingness according to the guidance on the previous pages, data preparation, or feature engineering, involves the following:
Variable Creation: This involves deriving new variables from existing ones.
This process is inherently creative and often benefits from domain expertise, both in terms of understanding the dataset’s content and discerning factors that might influence the outcome in question.
Single Variable Extraction: Sometimes, rich information can be gleaned from just one raw variable. Take, for instance, the “Titanic” dataset on Kaggle. Here, a key variable is “name”, which frequently contains titles such as Mr., Mrs., Miss., Master, Dr., Lady, Sir, Countess, and so on. These titles can hint at a passenger’s gender, marital status, social class, and occasionally, their profession. With programming languages like R, string manipulation techniques can be employed to break down the Name column and distill these titles. Subsequently, these titles can be categorized in various ways to encapsulate essential details.
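As a minimal sketch in R (assuming the {dplyr} and {stringr} packages, with a toy stand-in for the Kaggle Name column), one might pull out each title with a regular expression and then collapse the less common titles into a catch-all group:

```r
library(dplyr)
library(stringr)

# Toy stand-in for the Kaggle "Name" column
titanic <- tibble(
  Name = c("Braund, Mr. Owen Harris",
           "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
           "Heikkinen, Miss. Laina",
           "Allen, Master. William Henry")
)

titanic <- titanic %>%
  mutate(
    # The title sits between the comma and the first period, e.g. "Mr", "Miss"
    Title = str_extract(Name, "(?<=, )[^.]+(?=\\.)"),
    # Collapse less common titles into a single catch-all category
    Title_group = case_when(
      Title %in% c("Mr", "Mrs", "Miss", "Master") ~ Title,
      TRUE                                        ~ "Other"
    )
  )
```

The grouping rule here is illustrative; how finely you keep or collapse titles is itself a feature-engineering decision.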
Aggregating Across Variables: At times, it’s useful to amalgamate information from multiple raw variables. Consider a dataset detailing daily student attendance over an academic year. From this raw data, one could derive: (a) an overall attendance rate; (b) a binary indicator denoting if a student surpasses a set number of absences, labeling them as “chronically absent”; (c) an assessment of whether a student’s attendance improved or deteriorated over the course of the year, and so on.
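A sketch of all three derived variables, assuming a hypothetical long-format data frame with one row per student per school day (columns student_id, day, and present coded 1/0):

```r
library(dplyr)

# Simulated toy data: 3 students observed over a 180-day school year
set.seed(1)
attendance <- expand.grid(student_id = 1:3, day = 1:180)
attendance$present <- rbinom(nrow(attendance), size = 1, prob = 0.93)

attendance_features <- attendance %>%
  group_by(student_id) %>%
  summarise(
    # (a) overall attendance rate
    attendance_rate = mean(present),
    # (b) chronic-absence flag; the 18-day cut-point is illustrative only
    chronically_absent = as.integer(sum(present == 0) > 18),
    # (c) crude trend: second-half attendance rate minus first-half rate
    attendance_trend = mean(present[day > 90]) - mean(present[day <= 90])
  )
```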
Variable Transformations: Such as taking the logarithm of a variable, standardizing/normalizing variables, or binning a continuous variable (see the sketch after this list).
- If a predictor is highly skewed, it may benefit model performance to transform it so that its distribution is more symmetric. Some algorithms, including linear regression, support vector machines, and neural networks, use polynomial calculations that can benefit from non-skewed data. Taking the log of a skewed variable or using a Box-Cox transformation can help reduce skew. Other transformations (e.g., smoothing for time-series data or basis expansion and splines for variables with nonlinear patterns) are also used but are beyond the scope of this class. For an optional deeper dive, you can check out Chapter 6 of Kuhn & Johnson (2020).
- For some machine learning algorithms, it will be desirable to “mean-center” continuous variables. This involves subtracting the mean of that variable from each observation, which results in the variable having a new mean of zero. Sometimes, we also “standardize” the variables (compute a Z-score), which means that after we mean-center, we divide by the standard deviation.1 When and why to do this gets pretty complicated. I will try to give you guidance about which machine learning algorithms need which kinds of transformations so that you can align predictor sets and modeling approaches accordingly.
- As discussed when handling missing data, another common type of predictor transformation is to discretize a numeric predictor into a series of categorical predictors. Common reasons for this transformation are to simplify interpretation, to allow for more flexible relationships between the predictor and outcome, or to address missingness. But there are cautions: it may be better to avoid discretization when the cut-points are arbitrary. Try to rely on meaningful cut-points (e.g., scores that indicate proficiency on a math test). Often, numeric predictors in their continuous form are more predictive. The good news is that you do not necessarily have to choose one approach. You can have both a continuous and a discretized version of a predictor and put the different versions in different predictor sets for your learners.
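Here is a minimal sketch of these three kinds of transformations using the {recipes} package (one option among several; the column names and cut-points below are purely illustrative):

```r
library(dplyr)    # for the pipe
library(recipes)

# Toy training data with a right-skewed predictor and an arbitrary outcome
set.seed(2)
train_df <- data.frame(
  y          = rnorm(200),
  income     = rlnorm(200, meanlog = 10, sdlog = 1),
  age        = rnorm(200, mean = 40, sd = 12),
  math_score = runif(200, min = 0, max = 100)
)

rec <- recipe(y ~ income + age + math_score, data = train_df) %>%
  step_log(income, offset = 1) %>%          # reduce skew: log(income + 1)
  step_normalize(age) %>%                   # mean-center and convert to z-scores
  step_cut(math_score, breaks = c(60, 80))  # discretize at (illustrative) proficiency cuts

prepped     <- prep(rec, training = train_df)  # parameters estimated from training data only
train_ready <- bake(prepped, new_data = NULL)  # apply the transformations to the training set
```

Note that prep() estimates the transformation parameters (for example, the mean and standard deviation used by step_normalize()) from the training data only, which matters for the data-leakage discussion below.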
Interactions: Creating variables that represent the interaction of two or more other variables.
Recall that interactions occur when predictors in combination have a different relationship with the outcome than when each is considered separately. We will learn later that some machine learning algorithms are helpful in finding interactions. However, sometimes the project team has subject-matter knowledge about potential interactions. They may want to specify these interactions in one or more learners, especially if they are using a regression model. One way to do this is to create interaction variables by multiplying the predictors together or combining them in a way that makes sense to the team.
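As a sketch, with hypothetical predictors hours_studied and attended_tutoring, the interaction can be created by hand or expanded automatically by a regression formula:

```r
# Toy training data; variable names are illustrative
set.seed(3)
students <- data.frame(
  score             = rnorm(100, mean = 75, sd = 10),
  hours_studied     = runif(100, min = 0, max = 20),
  attended_tutoring = rbinom(100, size = 1, prob = 0.4)
)

# Hand-crafted interaction variable: the product of the two predictors
students$study_x_tutoring <- students$hours_studied * students$attended_tutoring

# Equivalently, a regression formula can expand the interaction automatically:
fit <- lm(score ~ hours_studied * attended_tutoring, data = students)
```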
Dimensionality Reduction: Techniques to reduce the number of predictors/independent variables/features.
Finally, multiple predictors can be engineered into a smaller number of predictors. There are multiple approaches for this. It is helpful to be aware of these approaches, but they are beyond the scope of this class. If you are curious to learn more, these blog posts are helpful optional reading:
A note about data leakage…
As mentioned earlier, data leakage refers to the (typically accidental) practice of “peeking” at validation or testing data while focused on modeling in an earlier stage. Data leakage can occur in data preparation and feature engineering when data transformations are carried out with information from the combined data rather than data specific to the training set. For example, when mean-centering continuous variables, if we calculate the mean based on the full data set, information from the validation and test sets is informing our training process. This can lead to potential overfitting, although the effects are negligible if the training, validation, and test data are from similar distributions (Yang & Kastner, 2022). To avoid data leakage, it is considered good practice to save data preprocessing parameters, such as the mean and standard deviation derived from the training data set, and apply them to future incoming unseen data.
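A minimal sketch of leakage-safe standardization, assuming hypothetical train/test splits with an income column:

```r
# Toy train/test splits; the `income` column is illustrative
set.seed(4)
train_df <- data.frame(income = rlnorm(150, meanlog = 10, sdlog = 1))
test_df  <- data.frame(income = rlnorm(50,  meanlog = 10, sdlog = 1))

# Estimate preprocessing parameters from the training data only
income_mean <- mean(train_df$income)
income_sd   <- sd(train_df$income)

# Apply the saved training-set parameters to every split (and to future unseen data)
train_df$income_z <- (train_df$income - income_mean) / income_sd
test_df$income_z  <- (test_df$income  - income_mean) / income_sd
```

Workflows like the prep()/bake() sketch shown earlier handle this bookkeeping automatically, since prep() learns its parameters from the training data alone.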
For an optional deeper dive into the topic of data leakage, see Data Leakage in Machine Learning: How It Can Be Detected and Minimize the Risk, or see Chapter 5 in Designing Machine Learning Systems (Huyen, 2022).
References
Footnotes
Mean centering supports numeric stability for machine learning algorithms that involve matrix inversion or optimization because it keeps the values on a smaller, comparable scale. Algorithms that rely on gradient descent can converge faster when data is centered and normalized because it shapes the cost function into a more bowl-like form, making it easier to reach the global minimum. Some algorithms assume that the data is centered around zero. For instance, LASSO and Ridge regression add penalty terms based on the magnitude of the coefficients; if the predictors are not centered and scaled, the penalty might be applied unevenly because it is influenced by the scale of the predictors. Also, many machine learning algorithms, such as Support Vector Machines, k-means clustering, and deep learning networks, work better or converge faster when data variables are standardized to z-scores. Mean centering and standardizing might not always be necessary or beneficial; it largely depends on the context, the nature of the data, and the specific algorithm being employed. For example, in tree-based algorithms like Decision Trees or Random Forests, mean centering usually doesn’t offer any advantage because these algorithms are scale-invariant.↩︎