Naive Bayes
Naive Bayes is based on Bayes' theorem from probability theory. It’s called “naive” because it makes the naive assumption that each predictor in your data is independent of the others, given the outcome. Although this assumption is typically false, in practice the algorithm still works fairly well.
The Naive Bayes algorithm estimates the likelihood of the predictors given each class of the binary outcome (yes and no). It then combines these likelihoods with some prior knowledge about the general frequency of the outcome (an initial estimate based on the data, before considering the predictors) to make a prediction.
So first, the algorithm checks the frequency of your predictors given the outcome. For instance, when predicting on-time high school graduation, it may ask: How many students were chronically absent among students who graduated on time? And: How many students were chronically absent among students who did not graduate on time?1
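To make this step concrete, the sketch below tallies these conditional frequencies for a small, made-up dataset (the variables graduated and chronically_absent are purely illustrative):

```r
# Hypothetical data: on-time graduation and chronic absenteeism
grad_data <- data.frame(
  graduated          = c("yes", "yes", "yes", "no", "no", "yes", "no", "yes"),
  chronically_absent = c("no",  "no",  "yes", "yes", "no", "no",  "yes", "no")
)

# Counts of chronic absenteeism within each graduation outcome
table(grad_data$graduated, grad_data$chronically_absent)

# Conditional likelihoods, e.g., P(chronically absent | graduated = "yes"),
# obtained by converting each row of counts into proportions
prop.table(table(grad_data$graduated, grad_data$chronically_absent), margin = 1)
```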
For continuous predictors (e.g., attendance rate), one approach is to “bin” or discretize the variable’s values into categories, but this approach can be limiting. Therefore, the Naive Bayes algorithm can instead rely on “kernels,” which provide a way to estimate likelihoods without discretizing the data.2
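As an illustration of the kernel idea, the sketch below uses R’s density() function (a Gaussian kernel by default) to estimate the class-conditional likelihood of a hypothetical attendance rate without binning it; the data and the value 0.85 are made up:

```r
# Hypothetical continuous predictor: attendance rate, split by outcome
attendance <- c(0.95, 0.88, 0.97, 0.60, 0.72, 0.91, 0.65, 0.93)
graduated  <- c("yes", "yes", "yes", "no", "no", "yes", "no", "yes")

# Kernel density estimate of attendance within each class
dens_yes <- density(attendance[graduated == "yes"])
dens_no  <- density(attendance[graduated == "no"])

# Approximate likelihood of an attendance rate of 0.85 under each class
approx(dens_yes$x, dens_yes$y, xout = 0.85)$y
approx(dens_no$x,  dens_no$y,  xout = 0.85)$y
```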
The next step is based on Bayes’ theorem:
\[\begin{equation} P(Y=y|X=x) = \frac{P(Y=y)P(X=x|Y=y)}{P(X=x)} \end{equation}\]
The Naive Bayes algorithm therefore combines the prior with the likelihoods to produce a “posterior” probability for each class of the outcome: the probability of yes and the probability of no, given the predictors in your predictor set. When classification is the goal, the outcome with the higher posterior probability is the algorithm’s prediction. Otherwise, the computed posterior probability can be used directly as the predicted probability of the outcome.
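To illustrate, the sketch below combines an assumed prior with assumed class-conditional likelihoods for two hypothetical predictors (all numbers are made up) and normalizes the result into posterior probabilities:

```r
# Priors: the general frequency of each outcome in the data
prior_yes <- 0.80   # P(graduated on time)
prior_no  <- 0.20   # P(did not graduate on time)

# Class-conditional likelihoods for one hypothetical student; under the
# naive assumption, the per-predictor likelihoods are simply multiplied
lik_yes <- 0.10 * 0.15
lik_no  <- 0.55 * 0.60

# Unnormalized posteriors, then normalized so they sum to 1
unnorm    <- c(yes = prior_yes * lik_yes, no = prior_no * lik_no)
posterior <- unnorm / sum(unnorm)
posterior

# The class with the larger posterior probability is the prediction
names(which.max(posterior))
```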
Advantages of Naive Bayes
Simplicity and speed: Naive Bayes is computationally efficient and easy to implement. For large datasets, especially those with a high number of predictors, Naive Bayes can be faster than more complex methods.
Text classification: Naive Bayes is famously effective for certain text classification tasks, like spam detection, sentiment analysis, and topic assignment. This effectiveness is in part because the “naive” assumption that all features (words) are independent given the class often works surprisingly well in this context.
Multiclass problems: Naive Bayes naturally extends to multiclass classification problems (e.g. predicting an outcome with more than two classes) without needing any special modifications.
Disadvantages of Naive Bayes
Naive independence assumption: The main “naive” assumption of this method is that predictors are conditionally independent given the class. This assumption is rarely true in real-world applications.
Can struggle with continuous predictors: Naive Bayes requires modifications to handle continuous predictors (e.g., using Gaussian distributions or kernel density estimation). Even with these modifications, it might not perform as well as regression models or tree-based algorithms on some continuous datasets.
Does not capture interactions: Because Naive Bayes treats each feature independently, it doesn’t capture interactions between predictors. Algorithms that rely on decision trees can often better handle data where interactions are critical.
Estimation of probabilities: While Naive Bayes can give calibrated probabilities for well-represented classes and predictors, the probability estimates can be unreliable for underrepresented classes, even with smoothing.
Implementing Naive Bayes in R
The Naive Bayes algorithm can be implemented using the naiveBayes() function in the e1071 package.
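A minimal sketch of fitting and predicting with this function on the kind of made-up data used above (variable names are illustrative) might look like the following:

```r
library(e1071)

# Hypothetical data with one categorical and one continuous predictor
grad_data <- data.frame(
  graduated          = factor(c("yes", "yes", "yes", "no", "no", "yes", "no", "yes")),
  chronically_absent = factor(c("no",  "no",  "yes", "yes", "no", "no",  "yes", "no")),
  attendance_rate    = c(0.95, 0.88, 0.97, 0.60, 0.72, 0.91, 0.65, 0.93)
)

# Fit the model; laplace = 1 applies the correction described in the footnotes
nb_fit <- naiveBayes(graduated ~ ., data = grad_data, laplace = 1)

# Predicted classes and posterior probabilities for the training data
predict(nb_fit, newdata = grad_data)                 # class labels
predict(nb_fit, newdata = grad_data, type = "raw")   # posterior probabilities
```

Note that e1071’s naiveBayes() models continuous predictors with a Gaussian (normal) distribution rather than a kernel density estimate, so its results for continuous predictors may differ from a kernel-based implementation.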
Footnotes
Note that when certain categories have zero counts, a Laplace correction can be applied. It adds a small count (typically 1) to each cell of the frequency table so that no conditional probability is estimated as exactly zero.↩︎
Essentially, a kernel is a function that “smears” or “spreads out” the observed data points over the continuous space, allowing for a smooth estimate of the likelihood of observing a particular value. That is, instead of relying on a histogram-style blocky representation of a predictor’s distribution, the data is represented using a smooth curve. Commonly used kernels are the Gaussian kernel, Epanechnikov kernel, and the Rectangular kernel, among others.↩︎