Guidance for using ML

Now that we have a basic understanding of some prevalent machine learning algorithms, you are probably wondering how to decide which ones to use when. Here is an attempt to provide some guidance:¹

  1. Start with a simple benchmark. As discussed earlier, by specifying a simple benchmark learner (a small set of predictors combined with a logistic regression model or another simple technique), we can empirically investigate whether adding more complex methods (ML algorithms) improves predictive performance.
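
A minimal sketch of this comparison, assuming Python and scikit-learn (the course templates may use a different language); synthetic data stands in for a real predictor set, and cross-validated AUC is the performance metric:

```python
# Compare a simple benchmark learner against a more complex one on held-out folds.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary-outcome data with 20 candidate predictors (stand-in for real data)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Benchmark learner: a small predictor set + logistic regression
benchmark_auc = cross_val_score(
    LogisticRegression(max_iter=1000), X[:, :3], y, cv=5, scoring="roc_auc"
).mean()

# More complex learner: the full predictor set + a random forest
candidate_auc = cross_val_score(
    RandomForestClassifier(n_estimators=300, random_state=0), X, y, cv=5, scoring="roc_auc"
).mean()

print(f"benchmark AUC: {benchmark_auc:.3f}, random forest AUC: {candidate_auc:.3f}")
```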

  2. To prioritize explainable models, Lasso may be a good choice. Lasso can handle a much larger predictor set than logistic regression, including highly correlated predictors, but it results in a model that is similar in interpretability to logistic regression. Naive Bayes is probably easier to explain than other ML approaches. Random Forest is also easier to explain than most, though it has more steps to walk through (how trees are grown, then how the forest is assembled). Boosted trees are more complicated still. SVM is probably the least transparent of the methods we have learned.
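
For concreteness, here is a minimal sketch of a Lasso-style (L1-penalized) logistic regression, again assuming Python and scikit-learn with synthetic data; the point is that many coefficients are set exactly to zero, which is what keeps the fitted model readable:

```python
# L1-penalized logistic regression: most coefficients shrink to exactly zero.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=30, n_informative=5, random_state=0)

lasso_logit = make_pipeline(
    StandardScaler(),  # the L1 penalty is sensitive to predictor scale
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
)
lasso_logit.fit(X, y)

coefs = lasso_logit.named_steps["logisticregression"].coef_.ravel()
print("non-zero coefficients:", np.sum(coefs != 0), "of", coefs.size)
```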

  3. If you want your method to be open to nonlinear functional forms, try ML algorithms other than logistic regression or Lasso. Because Lasso is an extension of regression, it assumes linearity. All of the other ML algorithms we have reviewed on the previous pages allow for nonlinear relationships between predictors and the outcome.
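
A minimal illustration of the point, under the same Python/scikit-learn assumption: the outcome below depends on a single predictor only through a U-shaped pattern, so a logistic regression that is linear in that predictor does poorly while a tree-based learner does not:

```python
# Nonlinear (U-shaped) signal: linear-in-x logistic regression vs. a random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(5000, 1))
y = (x.ravel() ** 2 + rng.normal(scale=1.0, size=5000) > 3).astype(int)  # U-shaped signal

logit_auc = cross_val_score(LogisticRegression(), x, y, cv=5, scoring="roc_auc").mean()
forest_auc = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=0), x, y, cv=5, scoring="roc_auc"
).mean()
print(f"logistic regression AUC: {logit_auc:.2f}, random forest AUC: {forest_auc:.2f}")
```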

  4. If you want to allow for potential interactions, try algorithms that are built on decision trees (Random Forest and XGBoost). SVM can also capture interactions with the use of the polynomial or RBF kernel. Logistic regression, Lasso, and Naive Bayes do not capture interactions in a data-driven way, but the data scientist can pass in variables (features) that are created as interactions between two or more variables, based on subject-matter knowledge.
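
A minimal sketch of the same idea, assuming Python and scikit-learn (scikit-learn's GradientBoostingClassifier is used here as a stand-in for XGBoost): the outcome depends purely on the product of two predictors, so logistic regression needs the hand-built interaction term while boosted trees find it on their own:

```python
# Interaction-driven outcome: hand-built interaction feature vs. data-driven discovery.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # outcome driven purely by an interaction

# Logistic regression on the raw features vs. with an explicit interaction term
X_with_interaction = np.column_stack([X, X[:, 0] * X[:, 1]])
print("logit, raw features:      ",
      cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="roc_auc").mean())
print("logit + interaction term: ",
      cross_val_score(LogisticRegression(max_iter=1000), X_with_interaction, y, cv=5,
                      scoring="roc_auc").mean())

# A boosted-tree learner finds the interaction without help
print("boosted trees, raw features:",
      cross_val_score(GradientBoostingClassifier(random_state=0), X, y, cv=5,
                      scoring="roc_auc").mean())
```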

  5. Some modeling methods perform better when continuous predictors are standardized or normalized. When a variable is standardized, it has mean 0 and variance 1; this is achieved by subtracting the mean and dividing by the standard deviation. When a variable is normalized, it has min=0 and max=1; this is achieved by subtracting the variable’s min and dividing by (max-min). Here is guidance for when these transformations are valuable (a short code sketch appears after this list):

    • Lasso can benefit from standardizing or normalizing continuous predictors, both because the penalty treats all coefficients on a common scale and to help with convergence.
    • Methods built on decision trees (Random Forest and XGBoost) are not sensitive to the scale of the data so standardizing and normalizing are not required.
    • SVM is sensitive to scaling. It is necessary to standardize or normalize continuous predictors.
    • For categorical variables, turn all categories into dummies. This process is sometimes called “one-hot encoding.” Some implementations require this instead of keeping the variables as categorical (factor).

Note that standardizing/normalizing generally will not hurt model performance compared to not doing so, but it may make model interpretation more challenging because coefficients are then on the transformed scale rather than in the variables’ original units.
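
Here is a minimal preprocessing sketch, assuming Python, pandas, and scikit-learn; the column names ("age", "region", "outcome") and the toy data are hypothetical:

```python
# Standardize a continuous predictor and one-hot encode a categorical one
# before a scale-sensitive learner such as SVM.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Hypothetical toy data
df = pd.DataFrame({
    "age": [23, 45, 31, 60, 52, 37],
    "region": ["north", "south", "south", "east", "north", "east"],
    "outcome": [0, 1, 0, 1, 1, 0],
})

preprocess = ColumnTransformer([
    # StandardScaler gives mean 0, variance 1; MinMaxScaler could be swapped in
    # for min-max normalization instead.
    ("standardize", StandardScaler(), ["age"]),
    ("one_hot", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])

svm_model = make_pipeline(preprocess, SVC(kernel="rbf"))
svm_model.fit(df[["age", "region"]], df["outcome"])
```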

  6. “Ensemble” approaches can harness multiple ML algorithms together. While beyond the scope of this class and not implemented in the templates, data scientists sometimes harness the strengths of multiple ML algorithms by combining them in various ways. Optionally, you can read more about this concept here, here and here. The main motivation for ensembling different algorithms is that we cannot know the underlying patterns of the data, and it is therefore hard to know which ML algorithm is going to excel when or where within the data. As we have seen, each algorithm has different strengths, and ensembling can capture all of them. However, ensembling certainly adds complexity and diminishes transparency!
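
As an optional illustration, here is a minimal stacking-ensemble sketch, assuming Python and scikit-learn (this is not something the course templates implement):

```python
# Stack several of the algorithms discussed above; a logistic regression
# "final estimator" learns how to weight the base learners' predictions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=15, random_state=0)

ensemble = StackingClassifier(
    estimators=[
        ("forest", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("nb", GaussianNB()),
        ("svm", make_pipeline(StandardScaler(), SVC(probability=True))),
    ],
    final_estimator=LogisticRegression(),
)
print("stacked ensemble AUC:",
      cross_val_score(ensemble, X, y, cv=5, scoring="roc_auc").mean())
```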

  7. Without ensembling, try a mix of ML strategies and predictor sets. As was discussed in the section on setting up a predictive analytics proof-of-concept (which would be good to review now), I recommend specifying learners that incrementally add complexity. One way to add complexity is to add predictors (larger predictor sets). The other is to match those predictor sets with increasingly complex ML algorithms.
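
A minimal sketch of this kind of grid, assuming Python and scikit-learn, with synthetic data standing in for the real predictor sets:

```python
# Cross increasingly large predictor sets with increasingly flexible algorithms,
# so each added layer of complexity can be judged on out-of-sample performance.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=30, n_informative=10, random_state=0)

predictor_sets = {"small": X[:, :5], "medium": X[:, :15], "large": X}
algorithms = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "boosted trees": GradientBoostingClassifier(random_state=0),
}

for set_name, X_subset in predictor_sets.items():
    for algo_name, model in algorithms.items():
        auc = cross_val_score(model, X_subset, y, cv=5, scoring="roc_auc").mean()
        print(f"{set_name:6s} predictor set + {algo_name:20s} AUC = {auc:.3f}")
```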


Footnotes

  1. A note about combining the goals of predictive analytics and variable importance: Variable importance refers to the objective of measuring the predictive value (i.e., importance) of a variable in predicting the outcome, that is, the strength of the relationship between predictors and the outcome. Variable importance involves considerations that differ from those of predictive analytics. There are many ways to define and measure variable importance, and there are many different methods for estimating it. Some methods are agnostic to the ML algorithm used for prediction; others are tied directly to a particular algorithm. We will spend a little time talking about variable importance towards the end of the course. For now, I want to point out that sometimes data scientists select a particular machine learning algorithm because it allows them to simultaneously satisfy both goals of predictive analytics and variable importance. More on this later.↩︎