Overview

In this section, we focus on modeling methods. We begin with a short discussion of how logistic regression can be used to predict binary outcomes. Then, we take a quick tour of some widely used machine learning (ML) algorithms.

Defining machine learning

Machine learning refers to models that are built by iterative algorithms (that is, a series of steps that repeatedly adjust the model to improve predictive performance) rather than from a functional form specified by an analyst. A functional form is a specification of which independent variables enter the model and what types of relationships they have with the outcome.

Note that machine learning is divided into two primary categories. These categories are distinguished by the type of data they handle and their objectives.

Supervised learning: Here, algorithms are trained using labeled data. Labeled data means that each unit’s record includes its outcome alongside its other data points. The primary objective is to train a model that will then make informed predictions on previously unseen data. This is our goal in this course. Supervised learning covers two kinds of tasks:
- Classification: This approach is employed when the outcome is categorical. The prediction might be a ‘yes’ or ‘no’, ‘0’ or ‘1’. Though we will focus on binary classification, classification can also include multi-category outcomes, such as predicting what product a user will buy next or what movie a user might like.
- Regression: This approach is used for continuous outcomes (not the focus of this course). However, for binary outcomes, which are our primary concern, regression can estimate the probability of one outcome over the other (e.g., the probability of the outcome being ‘1’).
Unsupervised learning: Here, algorithms work with unlabeled data (data without an output or outcome), aiming to find hidden structures or relationships within the data. A classic application is clustering, where data points are grouped based on inherent similarities, as in customer segmentation. Unsupervised techniques can complement supervised ones, for instance by clustering raw predictors to reduce their dimensionality. However, topics related to unsupervised learning are beyond the scope of this course.
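
To make the two supervised tasks above concrete, here is a minimal base R sketch. It is only an illustration under stated assumptions: the data are simulated, the names (`labeled`, `train`, `unseen`, `prob_1`, `pred_class`) are hypothetical, and the 0.5 cutoff is an arbitrary choice rather than a recommendation. A logistic regression first estimates the probability that the outcome is ‘1’, and a cutoff then turns each probability into a ‘0’ or ‘1’ prediction.

```r
# Simulated labeled data (hypothetical): one predictor x and a binary outcome y
set.seed(42)
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * x))
labeled <- data.frame(x = x, y = y)

# Train on the first 150 units; hold out the rest as "previously unseen" data
train  <- labeled[1:150, ]
unseen <- labeled[151:200, ]

# Regression step: logistic regression estimates P(y = 1 | x)
fit <- glm(y ~ x, data = train, family = binomial)
prob_1 <- predict(fit, newdata = unseen, type = "response")

# Classification step: convert probabilities to 0/1 predictions
# (0.5 is an assumed cutoff chosen only for illustration)
pred_class <- as.integer(prob_1 >= 0.5)

# Look at the first few predictions next to the predictor
head(data.frame(x = unseen$x, prob_1 = prob_1, pred_class = pred_class))
```

The outcome column `y` is what makes this supervised: the model is trained on units whose outcome is known and is then applied to units it has not seen.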

To understand the difference between supervised and unsupervised learning, consider an example of classifying images that contain either cats or dogs. In supervised learning, the algorithm is given a series of images, each labeled as containing a cat or a dog. A model is trained to predict whether future images contain a cat or a dog. In unsupervised learning, the algorithm is given the same series of images but no information about what each image contains. The algorithm clusters the images into two groups, with the goal of producing one cluster of cats and one cluster of dogs, but the two groups are not characterized or labeled.
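
The cat and dog contrast can be sketched in base R as follows. This is a toy illustration, not an image model: two simulated numeric features (`f1`, `f2`) stand in for image content, and all object names are hypothetical. The supervised model is given the labels, while k-means (`kmeans()`, one common clustering method) is given only the features.

```r
set.seed(7)
n <- 100

# Two hypothetical numeric features standing in for image content
features <- rbind(
  matrix(rnorm(2 * n, mean = 0), ncol = 2),  # first n rows: "cat" images
  matrix(rnorm(2 * n, mean = 2), ncol = 2)   # next n rows: "dog" images
)
colnames(features) <- c("f1", "f2")
label <- rep(c("cat", "dog"), each = n)

# Supervised: the labels are part of the training data
train <- data.frame(features, is_dog = as.integer(label == "dog"))
sup_fit <- glm(is_dog ~ f1 + f2, data = train, family = binomial)
sup_pred <- ifelse(fitted(sup_fit) >= 0.5, "dog", "cat")

# Unsupervised: k-means sees only the features, never the labels
km <- kmeans(features, centers = 2, nstart = 10)

# Cluster ids (1 and 2) are arbitrary; they can only be matched to
# "cat" or "dog" by comparing them with the labels after the fact
table(cluster = km$cluster, label = label)
```

The table shows how closely the two clusters line up with the true labels, but nothing in the k-means output says which cluster is the cats and which is the dogs.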

Understanding machine learning algorithms

As you will read in the pages that follow, machine learning algorithms can vary substantially in their underlying approaches. At the same time, some have very similar underlying approaches but vary in their details. Machine learning algorithms also vary in how easy they are to understand and explain, both in terms of their empirical approach and in terms of the models they produce. Many are difficult to understand and explain on both counts, and this lack of transparency is why they are often described as “black box” modeling methods.

The pages that follow give concise overviews of select ML algorithms designed for predictive tasks (supervised learning). I don’t go into much detail, but instead try to provide the basic intuition for how each one works. If certain aspects seem complex, I’d advise concentrating on the listed advantages and disadvantages of each method.

Note that the code templates implement all of the machine learning algorithms for you. This means you won’t need to grapple with specific R packages or functions that have been developed for specific algorithms. However, at the end of each algorithm’s summary, I provide a corresponding R package and function, in case you are curious about diving deeper. All of the machine learning algorithms I mention can also be implemented with tidymodels, and there are many tutorials available online.