Logistic regression
For some starting context: with multiple linear regression, we predict a continuous outcome from multiple predictors. The model looks like this:
\(Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_kX_k + \epsilon\)
where
- \(Y\) is the outcome variable,
- \(X_1\), \(X_2\), …, \(X_k\) are the predictor variables,
- \(\beta_0\) is the \(Y\)-intercept,
- \(\beta_1\), \(\beta_2\), …, \(\beta_k\) are coefficients corresponding to the predictor variables, and
- \(\epsilon\) is the error term.
The coefficients \(\beta_0\), \(\beta_1\), …, \(\beta_k\) tell us the strength of the relationship between the predictors and the outcome.
Logistic regression can be used to predict the probability of a particular category of a binary outcome.
If we use a linear regression model to predict probabilities, we can get predicted values that are outside the [0,1] range, which doesn’t make sense for probabilities. To ensure predictions stay in the [0,1] range, logistic regression uses the logistic function:
\(P(Y=1) = \frac{e^{\beta_0 + \beta_1X_1 + ... + \beta_kX_k}}{1 + e^{\beta_0 + \beta_1X_1 + ... + \beta_kX_k}}\)
where
- \(P(Y=1)\) is the probability that the outcome equals 1,
- \(X_1\), \(X_2\), …, \(X_k\) are the predictor variables,
- \(\beta_0\) is the \(Y\)-intercept,
- \(\beta_1\), \(\beta_2\), …, \(\beta_k\) are coefficients corresponding to the predictor variables, and
- \(e\) is the base of natural logarithms (approximately equal to 2.71828).
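To see this in action, here is a minimal sketch in R (the coefficient and predictor values below are made up purely for illustration): whatever value the linear combination takes, the logistic function maps it into (0, 1).

```r
# Hypothetical coefficients and predictor value, chosen only for illustration
beta0 <- -1.5                      # intercept
beta1 <-  0.8                      # coefficient for a single predictor
x     <-  2.0                      # a predictor value

eta <- beta0 + beta1 * x           # the linear combination (the log odds)
p   <- exp(eta) / (1 + exp(eta))   # the logistic function
p                                  # about 0.525, always strictly between 0 and 1

plogis(eta)                        # Base R's logistic function gives the same value
```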
This equation can also be written as follows:
\(\log\left(\frac{P(Y=1)}{P(Y=0)}\right) = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_kX_k\)
Here, the log odds (or the logarithm of the odds) of the outcome being 1 (versus 0) is expressed as a linear combination of predictor variables. As we plug in values for the predictors, the linear combination gives us the log odds, which can then be transformed to get the actual probabilities, \(P(Y=1)\).
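Continuing the hypothetical numbers from the sketch above, we can check that the two forms of the model agree: applying the log odds transformation to the predicted probability recovers the linear combination.

```r
# Continuing the sketch above: a log odds of 0.1 corresponds to p = plogis(0.1)
p <- plogis(0.1)
log(p / (1 - p))   # recovers 0.1, the linear combination (log odds)
qlogis(p)          # Base R's logit function computes the same quantity
```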
Unlike many machine learning algorithms, logistic regression requires the data scientist to specify the functional form of the relationship between the predictors and the outcome: the log odds is assumed to be a linear function of the predictors.
Advantages of Logistic Regression
- Simple: Logistic regression models are easy to interpret and explain.
- No tuning: There are no hyperparameters to tune.
- Fast: The models take a short time to train.
Disadvantages of Logistic Regression
- Not flexible: If the linearity assumption is wrong, the model can be a poor fit.
- Interactions must be specified: The user must manually add interaction terms if they are of interest (see the sketch after this list).
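As a sketch of what specifying an interaction manually looks like (y, x1, and x2 are placeholder names, not variables from any particular dataset), R's formula syntax requires the interaction to be written out:

```r
# Placeholder formula sketch: y, x1, and x2 are hypothetical names
y ~ x1 + x2           # main effects only; no interaction is included for you
y ~ x1 + x2 + x1:x2   # the x1-by-x2 interaction, added by hand
y ~ x1 * x2           # shorthand for the previous line
```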
Implementation of Logistic Regression in R
In R, we implement logistic regression with glm() in Base R, setting family = binomial(link = "logit"). The code templates will do this for you.
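As a self-contained sketch (the data below are simulated, and the variable names are made up rather than taken from the course templates), a logistic regression fit in Base R looks like this:

```r
# Simulate a small dataset: y is binary, with P(Y = 1) following a logistic model
set.seed(42)
dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
dat$y <- rbinom(200, size = 1, prob = plogis(-0.5 + 1.2 * dat$x1 - 0.7 * dat$x2))

# family = binomial(link = "logit") is what makes glm() fit a logistic regression
fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = dat)
summary(fit)   # coefficients are on the log odds scale

# Predicted probabilities P(Y = 1); type = "response" applies the logistic function
head(predict(fit, type = "response"))
```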