Introduction to Naive Bayes Classification Algorithm in Python and R

Rashmi Jain

Author

10 mins read

February 2, 2017

Naive Bayes is a machine learning algorithm for classification problems. It is based on Bayes’ probability theorem. It is primarily used for text classification, which involves high-dimensional training data sets. A few examples are spam filtering, sentiment analysis, and classifying news articles.

It is known not only for its simplicity but also for its effectiveness. Building models and making predictions with the Naive Bayes algorithm is fast. It is often the first algorithm to consider for a text classification problem, so it is worth learning thoroughly.

Basics of Naive Bayes
The mathematics of the Naive Bayes
Variations of Naive Bayes
Advantages and Disadvantages
Python and R implementation
Applications of Naive Bayes

What is Naive Bayes algorithm?

The Naive Bayes algorithm learns the probability that an object with certain features belongs to a particular group/class. In short, it is a probabilistic classifier. You may be wondering where the name comes from.

The Naive Bayes algorithm is called “naive” because it makes the assumption that the occurrence of a certain feature is independent of the occurrence of other features.

For instance, if you are trying to identify a fruit based on its color, shape, and taste, then an orange colored, spherical, and tangy fruit would most likely be an orange. Even if these features depend on each other or on the presence of the other features, all of these properties individually contribute to the probability that this fruit is an orange and that is why it is known as “naive.”

As for the “Bayes” part, it refers to the statistician and philosopher Thomas Bayes and the theorem named after him, Bayes’ theorem, which is the basis of the Naive Bayes algorithm.

The Mathematics of the Naive Bayes Algorithm

As mentioned, the basis of the Naive Bayes algorithm is Bayes’ theorem, also known as Bayes’ rule or Bayes’ law. It gives us a method to calculate conditional probability, i.e., the probability of an event given prior knowledge of related events. More formally, Bayes’ theorem is stated as the following equation:

\(\displaystyle P(A|B)=\frac{P(B|A)P(A)}{P(B)}\)

Let us understand the statement first and then we will look at the proof of the statement. The components of the above statement are:

\(P(A|B)\): Probability (conditional probability) of occurrence of event \(A\) given the event \(B\) is true
\(P(A)\) and \(P(B)\): Probabilities of the occurrence of event \(A\) and \(B\) respectively
\(P(B|A)\): Probability of the occurrence of event \(B\) given the event \(A\) is true

The terminology in the Bayesian method of probability (more commonly used) is as follows:

\(A\) is called the proposition and \(B\) is called the evidence.
\(P(A)\) is called the prior probability of proposition and \(P(B)\) is called the prior probability of evidence.
\(P(A|B)\) is called the posterior.
\(P(B|A)\) is the likelihood.

This summarizes Bayes’ theorem as

\(\mbox{Posterior}=\frac{\mbox{(Likelihood)}.\mbox{(Proposition prior probability)}}{\mbox{Evidence prior probability}}\)

Let us take an example to better understand Bayes’ theorem.

Suppose you have to draw a single card from a standard deck of 52 cards. Now the probability that the card is a Queen is \(P\left(\mbox{Queen}\right)=\frac{4}{52}=\frac{1}{13}\). If you are given evidence that the card that you have picked is a face card, the posterior probability \(P(\mbox{Queen}|\mbox{Face})\) can be calculated using Bayes’ Theorem as follows:

\(\displaystyle P(\mbox{Queen}|\mbox{Face})=\frac{P(\mbox{Face}|\mbox{Queen})}{P(\mbox{Face})}.P(\mbox{Queen})\)

Now \(P(\mbox{Face}|\mbox{Queen})=1\) because, given that the card is a Queen, it is definitely a face card. We have already calculated \(P\left(\mbox{Queen}\right)\). The only value left to calculate is \(P\left(\mbox{Face}\right)\), which is equal to \(\frac{12}{52}=\frac{3}{13}\) as there are three face cards (Jack, Queen, and King) in each of the four suits. Therefore,

\(\displaystyle P(\mbox{Queen}|\mbox{Face})=\frac{1}{13}.\frac{13}{3}=\frac{1}{3}\)
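
The same calculation can be checked with a few lines of Python. This is only a minimal sketch: the variable names are chosen for illustration, and exact fractions are used so that the result is not affected by rounding.

```python
from fractions import Fraction

# Counts in a standard 52-card deck (Jack, Queen, and King are the face cards).
deck_size = 52
queens = 4
face_cards = 12

p_queen = Fraction(queens, deck_size)      # prior P(Queen) = 1/13
p_face = Fraction(face_cards, deck_size)   # evidence P(Face) = 3/13
p_face_given_queen = Fraction(1)           # likelihood: every Queen is a face card

# Bayes' theorem: posterior = likelihood * prior / evidence
p_queen_given_face = p_face_given_queen * p_queen / p_face
print(p_queen_given_face)  # 1/3
```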

Derivation of Bayes’ Theorem

For two events \(A\) and \(B\) with joint probability \(P(A\cap B)\), the conditional probability of \(A\) given \(B\) is

\(\displaystyle P(A|B)=\frac{P(A\cap B)}{P(B)}\)

Similarly,

\(\displaystyle P(B|A)=\frac{P(B\cap A)}{P(A)}\)

Since \(P(A\cap B)=P(B\cap A)\), equating the two expressions for the joint probability gives

\(\displaystyle P(B|A).P(A)=P(A|B).P(B)\implies P(A|B)=\frac{P(B|A).P(A)}{P(B)}\)

Bayes’ Theorem for Naive Bayes Algorithm

In a machine learning classification problem, there are multiple features and classes, say, \(C_1, C_2, \ldots, C_k \). The main aim of the Naive Bayes algorithm is to calculate the conditional probability that an object with a feature vector \(x_1, x_2,\ldots, x_n\) belongs to a particular class \(C_i\),

\(\displaystyle P(C_i|x_1, x_2,\ldots, x_n)=\frac{P(x_1, x_2,\ldots, x_n|C_i).P(C_i)}{P(x_1, x_2,\ldots, x_n)}\) for \(1\leq i\leq k\)

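Now, the numerator of the fraction on the right-hand side of the equation above is the joint probability \(\displaystyle P(x_1, x_2,\ldots, x_n|C_i).P(C_i)=P(x_1, x_2,\ldots, x_n, C_i)\)

By the chain rule of probability, this joint probability expands to

\(\displaystyle P(x_1, x_2,\ldots, x_n, C_i)=P(x_1|x_2,\ldots, x_n, C_i).P(x_2|x_3,\ldots, x_n, C_i)\ldots P(x_n|C_i).P(C_i)\)

Each conditional probability term \(P(x_j|x_{j+1},\ldots, x_n, C_i)\) becomes \(P(x_j|C_i)\) because of the assumption that the features are independent given the class.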

From the calculation above and the independence assumption, the Bayes theorem boils down to the following easy expression:

\(\displaystyle P(C_i|x_1, x_2,\ldots, x_n)=\left(\prod_{j=1}^{j=n}P(x_j|C_i)\right).\frac{P(C_i)}{P(x_1, x_2,\ldots, x_n)}\) for \(1\leq i\leq k\)

Since the expression \(P(x_1, x_2,\ldots, x_n)\) is constant for all the classes, we can simply say that

\(\displaystyle P(C_i|x_1, x_2,\ldots, x_n)\propto\left(\prod_{j=1}^{j=n}P(x_j|C_i)\right).P(C_i)\) for \(1\leq i\leq k\)
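
The proportional form above can be sketched in Python as follows. The function names, class labels, and probability values here are hypothetical and serve only as an illustration. The sketch sums logarithms instead of multiplying probabilities, a common way to avoid numerical underflow when there are many features; this is an implementation choice, not part of the derivation.

```python
import math

def unnormalized_log_posterior(prior, likelihoods):
    """Return log(P(C_i) * prod_j P(x_j | C_i)), or -inf if any factor is zero."""
    factors = [prior, *likelihoods]
    if any(f == 0 for f in factors):
        return float("-inf")
    return sum(math.log(f) for f in factors)

def predict(priors, likelihoods_by_class):
    """Pick the class with the largest unnormalized (log) posterior."""
    scores = {
        label: unnormalized_log_posterior(priors[label], likelihoods_by_class[label])
        for label in priors
    }
    return max(scores, key=scores.get), scores

# Hypothetical two-class problem with three observed features.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {"spam": [0.8, 0.5, 0.9], "ham": [0.1, 0.5, 0.3]}

label, scores = predict(priors, likelihoods)
print(label)
```

Because \(P(x_1, x_2,\ldots, x_n)\) is the same for every class, it never needs to be computed to find the most probable class.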

How does the Naive Bayes Algorithm work?

So far, we have learned what the Naive Bayes algorithm is, how Bayes’ theorem is related to it, and what the expression of Bayes’ theorem for this algorithm is. Let us take a simple example to understand how the algorithm works. Suppose we have a training data set of 1200 fruits. The features in the data set are these: is the fruit yellow or not, is the fruit long or not, and is the fruit sweet or not. There are three different classes: mango, banana, and others.

Step 1: Create a frequency table for all the features against the different classes.

Name	Yellow	Sweet	Long	Total
Mango	350	450	0	650
Banana	400	300	350	400
Others	50	100	50	150
Total	800	850	400	1200

What can we conclude from the above table?

Out of 1200 fruits, 650 are mangoes, 400 are bananas, and 150 are others.
350 of the total 650 mangoes are yellow and the rest are not and so on.
800 fruits are yellow, 850 are sweet and 400 are long from a total of 1200 fruits.

Let’s say you are given a fruit that is yellow, sweet, and long, and you have to determine the class to which it belongs.

Step 2: Draw the likelihood table for the features against the classes.

Name	Yellow	Sweet	Long	Total
Mango	350/800=P(Mango|Yellow)	450/850	0/400	650/1200=P(Mango)
Banana	400/800	300/850	350/400	400/1200
Others	50/800	100/850	50/400	150/1200
Total	800/1200=P(Yellow)	850/1200	400/1200	1200

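Step 3: Calculate the conditional probabilities for all the classes. The likelihood of a feature given a class follows from Bayes’ theorem, e.g., \(P(\mbox{Yellow}|\mbox{Mango})=\frac{P(\mbox{Mango}|\mbox{Yellow}).P(\mbox{Yellow})}{P(\mbox{Mango})}=\frac{350}{650}\), which is simply the count of yellow mangoes divided by the total number of mangoes. In our example:

\(\displaystyle P(\mbox{Mango}|\mbox{Yellow, Sweet, Long})\propto\frac{350}{650}.\frac{450}{650}.\frac{0}{650}.\frac{650}{1200}=0\)
\(\displaystyle P(\mbox{Banana}|\mbox{Yellow, Sweet, Long})\propto\frac{400}{400}.\frac{300}{400}.\frac{350}{400}.\frac{400}{1200}\approx 0.219\)
\(\displaystyle P(\mbox{Others}|\mbox{Yellow, Sweet, Long})\propto\frac{50}{150}.\frac{100}{150}.\frac{50}{150}.\frac{150}{1200}\approx 0.009\)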

Step 4: Calculate \(\displaystyle\max_{i}{P(C_i|x_1, x_2,\ldots, x_n)}\). In our example, the maximum probability is for the class banana. Therefore, the fruit that is long, sweet, and yellow is classified as a banana by the Naive Bayes algorithm.

In a nutshell, a new element is assigned to the class that has the maximum conditional probability described above.
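
The four steps can be reproduced with a short Python sketch that uses the counts from the frequency table in Step 1. The dictionary layout and variable names are assumptions made for illustration. Likelihoods are computed as the feature count divided by the class total, i.e., \(P(x_j|C_i)\).

```python
from fractions import Fraction

# Frequency table from Step 1: per-class feature counts and class totals.
counts = {
    "Mango":  {"Yellow": 350, "Sweet": 450, "Long": 0,   "Total": 650},
    "Banana": {"Yellow": 400, "Sweet": 300, "Long": 350, "Total": 400},
    "Others": {"Yellow": 50,  "Sweet": 100, "Long": 50,  "Total": 150},
}
n_fruits = sum(row["Total"] for row in counts.values())  # 1200
observed = ["Yellow", "Sweet", "Long"]

scores = {}
for label, row in counts.items():
    score = Fraction(row["Total"], n_fruits)           # prior P(C_i)
    for feature in observed:
        score *= Fraction(row[feature], row["Total"])  # likelihood P(x_j | C_i)
    scores[label] = score

# Dividing by the sum of the scores normalizes them so that they add up to 1.
evidence = sum(scores.values())
posteriors = {label: float(s / evidence) for label, s in scores.items()}

best = max(scores, key=scores.get)
print(best, posteriors)
```

The score for mango is exactly 0 because no mango in the training data is long, which shows how a single zero count can rule out a class.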

Variations of the Naive Bayes algorithm

There are multiple variations of the Naive Bayes algorithm depending on the distribution of \(P(x_j|C_i)\). Three of the commonly used variations are

Gaussian: The Gaussian Naive Bayes algorithm assumes that the values of each feature within a class follow a Gaussian (normal) distribution, i.e.,
\(\displaystyle P(x_j|C_i)=\frac{1}{\sqrt{2\pi\sigma_{C_i}^2}}\exp{\left(-\frac{(x_j-\mu_{C_i})^2}{2\sigma_{C_i}^2}\right)}\)
where \(\mu_{C_i}\) and \(\sigma_{C_i}^2\) are the mean and variance of feature \(x_j\) within class \(C_i\).
Multinomial: The Multinomial Naive Bayes algorithm is used when the features are counts that follow a multinomial distribution, such as word frequencies in a document, so multiple occurrences of a feature matter.
Bernoulli: The Bernoulli Naive Bayes algorithm is used when the features in the data set are binary-valued. It is helpful in spam filtration and adult content detection techniques.
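
As an illustration of the Gaussian variant, the sketch below evaluates the normal density used for \(P(x_j|C_i)\). The measurements are hypothetical values for one feature of one class, and using the population variance as the estimate is an assumption of this sketch.

```python
import math
import statistics

def gaussian_likelihood(x, mean, variance):
    """Normal density used as P(x_j | C_i) in Gaussian Naive Bayes."""
    exponent = -((x - mean) ** 2) / (2 * variance)
    return math.exp(exponent) / math.sqrt(2 * math.pi * variance)

# Hypothetical measurements of a single feature for a single class.
values = [4.9, 5.1, 5.0, 5.4, 4.6]
mu = statistics.mean(values)
var = statistics.pvariance(values)  # population variance of the sample

# Density of a new observation under this class's fitted distribution.
print(gaussian_likelihood(5.0, mu, var))
```

For continuous features this value is a density rather than a probability, so it can exceed 1; only the comparison between classes matters.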

Pros and Cons of Naive Bayes algorithm

Every coin has two sides. So does the Naive Bayes algorithm. It has advantages as well as disadvantages, and they are listed below:

Pros

It is a relatively easy algorithm to build and understand.
It is faster to predict classes using this algorithm than many other classification algorithms.
It can be easily trained using a small data set.

Cons

If a feature value never occurs together with a given class in the training data, the conditional probability estimate for that combination will come out as 0. This problem is known as the “Zero Conditional Probability Problem.” This is a problem because the probabilities are multiplied, so a single zero wipes out all the information in the other probabilities too. There are several sample correction techniques to fix this problem, such as “Laplacian Correction.”
Another disadvantage is the very strong assumption that the features are independent of each other given the class. It is nearly impossible to find such data sets in real life.
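
The zero conditional probability problem and the Laplacian correction can be illustrated with the fruit example, where no mango is long. This is a minimal sketch of additive smoothing; the function name and its parameters `n_values` (the number of values the feature can take, 2 for a yes/no feature) and `alpha` (the pseudo-count) are assumptions made for illustration.

```python
from fractions import Fraction

def smoothed_likelihood(feature_count, class_count, n_values=2, alpha=1):
    """Estimate P(x_j | C_i) with additive (Laplace) smoothing."""
    return Fraction(feature_count + alpha, class_count + alpha * n_values)

# From the fruit table: 0 of the 650 mangoes are long.
unsmoothed = Fraction(0, 650)
smoothed = smoothed_likelihood(0, 650)  # (0 + 1) / (650 + 2)

print(unsmoothed, smoothed)  # 0 1/652
```

With smoothing, the estimate is small but non-zero, so the other features can still contribute to the score for the mango class.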

Naive Bayes with Python and R

Let us see how we can build the basic model using the Naive Bayes algorithm in R and in Python.

R Code

To start training a Naive Bayes classifier in R, we need to load the e1071 package.

library(e1071)

To split the data set into training and test data we will use the caTools package.

library(caTools)

The predefined function used for the implementation of Naive Bayes in R is called naiveBayes(). There are only a few parameters that are of use:

naiveBayes(formula, data, laplace = 0, subset, na.action = na.pass)

formula: The traditional formula \(Y\sim X_1+X_2+\ldots+X_n\)
data: The data frame containing numeric or factor variables
laplace: A positive number that controls Laplace smoothing; the default of 0 disables smoothing
subset: Selects the rows of the data to be used for training, based on some Boolean filter
na.action: Determines what is to be done when a missing value in the data set is encountered

Let us take the example of the iris data set.

> library(e1071)
> library(caTools)

> data(iris)

> # sample.split() returns a vector of TRUE and FALSE values. Passing the label column keeps the
> # class proportions similar in both subsets. By setting SplitRatio to 0.7, we split the
> # original Iris dataset of 150 rows into roughly 70% training and 30% testing data.
> iris$spl=sample.split(iris$Species,SplitRatio=0.7)
> train=subset(iris, iris$spl==TRUE) # the subset of iris dataset for which spl==TRUE
> test=subset(iris, iris$spl==FALSE)

> nB_model <- naiveBayes(train[,1:4], train[,5]) 

> table(predict(nB_model, test[,1:4]), test[,5]) #returns the confusion matrix; counts vary with the random split
                setosa versicolor virginica
  setosa         17          0         0
  versicolor      0         17         2
  virginica       0          0        14

Python Code

We will use the Python library scikit-learn to build Naive Bayes models.

>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn import datasets
>>> from sklearn.metrics import confusion_matrix
>>> from sklearn.model_selection import train_test_split

>>> iris = datasets.load_iris()
>>> X = iris.data
>>> y = iris.target

# Split the data into a training set and a test set
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
>>> gnb = GaussianNB()
>>> mnb = MultinomialNB()

>>> y_pred_gnb = gnb.fit(X_train, y_train).predict(X_test)
>>> cnf_matrix_gnb = confusion_matrix(y_test, y_pred_gnb)

>>> print(cnf_matrix_gnb)
[[16 0 0]
 [ 0 18 0]
 [ 0 0 11]]

>>> y_pred_mnb = mnb.fit(X_train, y_train).predict(X_test)
>>> cnf_matrix_mnb = confusion_matrix(y_test, y_pred_mnb)

>>> print(cnf_matrix_mnb)
[[16 0 0]
 [ 0 0 18]
 [ 0 0 11]]

Applications

The Naive Bayes algorithm is used in multiple real-life scenarios such as

Text classification: It is used as a probabilistic learning method for text classification. The Naive Bayes classifier is one of the most successful known algorithms when it comes to the classification of text documents, i.e., whether a text document belongs to one or more categories (classes).
Spam filtration: It is an example of text classification. This has become a popular mechanism to distinguish spam email from legitimate email. Several modern email services implement Bayesian spam filtering.
Many server-side email filters, such as DSPAM, SpamBayes, SpamAssassin, Bogofilter, and ASSP, use this technique.
Sentiment Analysis: It can be used to analyze the tone of tweets, comments, and reviews—whether they are negative, positive or neutral.
Recommendation System: The Naive Bayes algorithm in combination with collaborative filtering is used to build hybrid recommendation systems which help in predicting if a user would like a given resource or not.
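
To make the spam filtration use case concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier written with the Python standard library only. The four training messages and the function name `classify` are hypothetical and invented for illustration; a real filter would be trained on a much larger corpus. The sketch assumes whitespace tokenization, Laplace smoothing with a pseudo-count of 1, and ignores words that were never seen in training.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (message, label) pairs invented for illustration.
training = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting schedule for monday", "ham"),
    ("please review the project schedule", "ham"),
]

word_counts = {"spam": Counter(), "ham": Counter()}
doc_counts = Counter()
for text, label in training:
    word_counts[label].update(text.split())
    doc_counts[label] += 1

vocabulary = {word for counts in word_counts.values() for word in counts}

def classify(text, alpha=1):
    """Return the label with the highest log posterior (multinomial model)."""
    scores = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        score = math.log(doc_counts[label] / len(training))  # log prior
        for word in text.split():
            if word not in vocabulary:
                continue  # skip words never seen in training
            # Laplace-smoothed log likelihood of the word given the class
            score += math.log((counts[word] + alpha) / (total + alpha * len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("claim your free money"))
print(classify("project meeting on monday"))
```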

Conclusion

This article is a simple explanation of the Naive Bayes Classification algorithm, with an easy-to-understand example and a few technicalities.

Despite all the complicated math, the implementation of the Naive Bayes algorithm involves simply counting the number of objects with specific features and classes. Once these numbers are obtained, it is very simple to calculate probabilities and arrive at a conclusion.

Hopefully, you are now familiar with this machine learning concept, which you have most likely heard of before.
