
Introduction to Naive Bayes Classification Algorithm in Python and R


Naive Bayes is a machine learning algorithm for classification problems. It is based on Bayes’ probability theorem. It is primarily used for text classification, which involves high-dimensional training data sets. A few examples are spam filtering, sentiment analysis, and classifying news articles.

It is known not only for its simplicity but also for its effectiveness. It is fast to build models and make predictions with the Naive Bayes algorithm. Naive Bayes is the first algorithm that should be considered for solving a text classification problem. Hence, you should learn this algorithm thoroughly.

Table of Contents

  1. Basics of Naive Bayes
  2. The mathematics of the Naive Bayes
  3. Variations of Naive Bayes
  4. Advantages and Disadvantages
  5. Python and R implementation
  6. Applications of Naive Bayes

What is the Naive Bayes algorithm?

The Naive Bayes algorithm is an algorithm that learns the probability of an object with certain features belonging to a particular group/class. In short, it is a probabilistic classifier. You must be wondering why it is called so.

The Naive Bayes algorithm is called “naive” because it makes the assumption that the occurrence of a certain feature is independent of the occurrence of other features.

For instance, if you are trying to identify a fruit based on its color, shape, and taste, then an orange-colored, spherical, and tangy fruit would most likely be an orange. Even if these features depend on each other or on the presence of the other features, each of these properties individually contributes to the probability that this fruit is an orange, and that is why it is known as “naive.”

As for the “Bayes” part, it refers to the statistician and philosopher Thomas Bayes and the theorem named after him, Bayes’ theorem, which is the basis of the Naive Bayes algorithm.

The Mathematics of the Naive Bayes Algorithm

As already said, the basis of the Naive Bayes algorithm is Bayes’ theorem, alternatively known as Bayes’ rule or Bayes’ law. It gives us a method to calculate the conditional probability, i.e., the probability of an event based on previous knowledge of related events. More formally, Bayes’ theorem is stated as the following equation:

\displaystyle P(A|B)=\frac{P(B|A)P(A)}{P(B)}

Let us understand the statement first, and then we will look at its proof. The components of the above statement are:

  • P(A|B): Probability (conditional probability) of the occurrence of event A given that event B is true
  • P(A) and P(B): Probabilities of the occurrence of events A and B respectively
  • P(B|A): Probability of the occurrence of event B given that event A is true

The terminology commonly used in the Bayesian interpretation of probability is as follows:

  • A is called the proposition and B is called the evidence.
  • P(A) is called the prior probability of the proposition and P(B) is called the prior probability of the evidence.
  • P(A|B) is called the posterior.
  • P(B|A) is the likelihood.

This summarizes Bayes’ theorem as

\mbox{Posterior}=\frac{\mbox{(Likelihood)}.\mbox{(Proposition prior probability)}}{\mbox{Evidence prior probability}}

Let us take an example to better understand Bayes’ theorem.

Suppose you have to draw a single card from a standard deck of 52 cards. Now the probability that the card is a Queen is P\left(\mbox{Queen}\right)=\frac{4}{52}=\frac{1}{13}. If you are given evidence that the card that you have picked is a face card, the posterior probability P(\mbox{Queen}|\mbox{Face}) can be calculated using Bayes’ Theorem as follows:

\displaystyle P(\mbox{Queen}|\mbox{Face})=\frac{P(\mbox{Face}|\mbox{Queen})}{P(\mbox{Face})}.P(\mbox{Queen})

Now P(\mbox{Face}|\mbox{Queen})=1 because, given that the card is a Queen, it is definitely a face card. We have already calculated P\left(\mbox{Queen}\right). The only value left to calculate is P\left(\mbox{Face}\right), which is equal to \frac{3}{13} as there are three face cards in every suit, i.e., 12 face cards in the 52-card deck. Therefore,

\displaystyle P(\mbox{Queen}|\mbox{Face})=\frac{1}{13}.\frac{13}{3}=\frac{1}{3}
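
The same calculation can be reproduced in a few lines of Python. This is just a numeric check of the formula above (plain Python, no libraries involved):

p_queen = 4 / 52          # P(Queen): 4 queens in a 52-card deck
p_face = 12 / 52          # P(Face): 12 face cards (J, Q, K in each of the 4 suits)
p_face_given_queen = 1.0  # P(Face|Queen): every queen is a face card

# Bayes' theorem: P(Queen|Face) = P(Face|Queen) * P(Queen) / P(Face)
p_queen_given_face = p_face_given_queen * p_queen / p_face
print(p_queen_given_face)  # 0.333... = 1/3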

Derivation of Bayes’ Theorem


For two events A and B with joint probability P(A\cap B), the conditional probability is

\displaystyle P(A|B)=\frac{P(A\cap B)}{P(B)}

Similarly,

\displaystyle P(B|A)=\frac{P(B\cap A)}{P(A)}

Therefore,

\displaystyle P(B|A).P(A)=P(A|B).P(B)\implies P(A|B)=\frac{P(B|A).P(A)}{P(B)}

Bayes’ Theorem for Naive Bayes Algorithm

In a machine learning classification problem, there are multiple features and classes, say, C_1, C_2, \ldots, C_k. The main aim of the Naive Bayes algorithm is to calculate the conditional probability that an object with a feature vector x_1, x_2,\ldots, x_n belongs to a particular class C_i,

\displaystyle P(C_i|x_1, x_2,\ldots, x_n)=\frac{P(x_1, x_2,\ldots, x_n|C_i).P(C_i)}{P(x_1, x_2,\ldots, x_n)} for 1\leq i\leq k

Now, the numerator of the fraction on the right-hand side of the equation above is \displaystyle P(x_1, x_2,\ldots, x_n|C_i).P(C_i)=P(x_1, x_2,\ldots, x_n, C_i)

By the chain rule of probability, this joint probability can be expanded as

\displaystyle P(x_1, x_2,\ldots, x_n, C_i)=P(x_1|x_2,\ldots, x_n, C_i).P(x_2|x_3,\ldots, x_n, C_i)\ldots P(x_n|C_i).P(C_i)

Each conditional probability term P(x_j|x_{j+1},\ldots, x_n, C_i) becomes P(x_j|C_i) because of the assumption that the features are independent given the class.

From the calculation above and the independence assumption, the Bayes theorem boils down to the following easy expression:

\displaystyle P(C_i|x_1, x_2,\ldots, x_n)=\left(\prod_{j=1}^{n}P(x_j|C_i)\right).\frac{P(C_i)}{P(x_1, x_2,\ldots, x_n)} for 1\leq i\leq k

Since the expression P(x_1, x_2,\ldots, x_n) is constant for all the classes, we can simply say that

\displaystyle P(C_i|x_1, x_2,\ldots, x_n)\propto\left(\prod_{j=1}^{n}P(x_j|C_i)\right).P(C_i) for 1\leq i\leq k
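
Since the denominator P(x_1, x_2,\ldots, x_n) is the same for every class, classification amounts to picking the class with the largest value of this product, i.e., the maximum a posteriori (MAP) decision rule:

\displaystyle \hat{C}=\arg\max_{1\leq i\leq k}\ P(C_i)\prod_{j=1}^{n}P(x_j|C_i)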

How does the Naive Bayes Algorithm work?

So far, we have learned what the Naive Bayes algorithm is, how Bayes’ theorem is related to it, and what the expression of Bayes’ theorem for this algorithm looks like. Let us take a simple example to understand how the algorithm works. Suppose we have a training data set of 1200 fruits. The features in the data set are: whether the fruit is yellow or not, whether the fruit is long or not, and whether the fruit is sweet or not. There are three different classes: mango, banana, and others.

Step 1: Create a frequency table for all the features against the different classes.

Name      Yellow   Sweet   Long   Total
Mango        350     450      0     650
Banana       400     300    350     400
Others        50     100     50     150
Total        800     850    400    1200

What can we conclude from the above table?

  • Out of 1200 fruits, 650 are mangoes, 400 are bananas, and 150 are others.
  • 350 of the total 650 mangoes are yellow and the rest are not and so on.
  • Of the total 1200 fruits, 800 are yellow, 850 are sweet, and 400 are long.

Let’s say you are given a fruit which is yellow, sweet, and long, and you have to determine the class to which it belongs.

Step 2: Draw the likelihood table for the features against the classes.

Name      Yellow                      Sweet       Long       Total
Mango     350/650 = P(Yellow|Mango)   450/650     0/650      650/1200 = P(Mango)
Banana    400/400                     300/400     350/400    400/1200
Others     50/150                     100/150      50/150    150/1200
Total     800/1200 = P(Yellow)        850/1200    400/1200   1200/1200

Each cell holds the likelihood of a feature given a class, i.e., the frequency from Step 1 divided by the class total; the last column holds the class priors P(C_i), and the last row holds the marginal probabilities of the features.

Step 3: Calculate the conditional probabilities for all the classes using the likelihoods and priors from the table above, i.e., the following in our example:

\displaystyle P(\mbox{Banana}|\mbox{Yellow, Sweet, Long})\propto P(\mbox{Yellow}|\mbox{Banana}).P(\mbox{Sweet}|\mbox{Banana}).P(\mbox{Long}|\mbox{Banana}).P(\mbox{Banana})=\frac{400}{400}.\frac{300}{400}.\frac{350}{400}.\frac{400}{1200}\approx 0.219

\displaystyle P(\mbox{Mango}|\mbox{Yellow, Sweet, Long})\propto\frac{350}{650}.\frac{450}{650}.\frac{0}{650}.\frac{650}{1200}=0

\displaystyle P(\mbox{Others}|\mbox{Yellow, Sweet, Long})\propto\frac{50}{150}.\frac{100}{150}.\frac{50}{150}.\frac{150}{1200}\approx 0.009

Step 4: Calculate \displaystyle\max_{i}{P(C_i|x_1, x_2,\ldots, x_n)}. In our example, the maximum probability is for the class banana; therefore, the fruit which is yellow, sweet, and long is a banana according to the Naive Bayes algorithm.

In a nutshell, we say that a new element will belong to the class which will have the maximum conditional probability described above.
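
To make the counting concrete, here is a minimal sketch in plain Python that reproduces the fruit example above. The dictionary layout and variable names are purely illustrative and not part of any library:

# Counts taken from the frequency table above: class -> (yellow, sweet, long, class total)
counts = {
    "Mango":  (350, 450,   0, 650),
    "Banana": (400, 300, 350, 400),
    "Others": ( 50, 100,  50, 150),
}
total_fruits = 1200

scores = {}
for fruit, (n_yellow, n_sweet, n_long, n_total) in counts.items():
    prior = n_total / total_fruits                                   # P(C_i)
    likelihood = (n_yellow / n_total) * (n_sweet / n_total) * (n_long / n_total)  # product of P(x_j|C_i)
    scores[fruit] = prior * likelihood                               # proportional to P(C_i | yellow, sweet, long)

print(max(scores, key=scores.get))  # Banana (score ~0.219; Mango is 0, Others ~0.009)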

Variations of the Naive Bayes algorithm

There are multiple variations of the Naive Bayes algorithm, depending on the distribution assumed for P(x_j|C_i). Three of the commonly used variations are

  1. Gaussian: The Gaussian Naive Bayes algorithm assumes the distribution of a continuous feature within each class to be Gaussian (normal), i.e.,
    \displaystyle P(x_j|C_i)=\frac{1}{\sqrt{2\pi\sigma_{C_i}^2}}\exp{\left(-\frac{(x_j-\mu_{C_i})^2}{2\sigma_{C_i}^2}\right)}
    A small sketch of this density calculation follows this list.
  2. Multinomial: The Multinomial Naive Bayes algorithm is used when the data is multinomially distributed, i.e., when the number of occurrences of each feature (such as word counts) matters.
  3. Bernoulli: The Bernoulli Naive Bayes algorithm is used when the features in the data set are binary-valued. It is helpful in spam filtering and adult content detection techniques.
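
To make the Gaussian variant concrete, here is a minimal sketch (plain Python with NumPy, not a library API) that estimates the per-class mean and variance of a single continuous feature and evaluates the density formula from item 1 above. The feature values are made up purely for illustration:

import numpy as np

# Training values of one continuous feature, grouped by class (made-up numbers)
feature_by_class = {
    "C1": np.array([4.9, 5.1, 5.0, 4.8]),
    "C2": np.array([6.3, 6.7, 6.5, 6.6]),
}

def gaussian_likelihood(x, values):
    """P(x_j | C_i) under the Gaussian Naive Bayes assumption."""
    mu = values.mean()   # class mean, mu_{C_i}
    var = values.var()   # class variance, sigma_{C_i}^2
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x_new = 5.0
for cls, values in feature_by_class.items():
    print(cls, gaussian_likelihood(x_new, values))  # likelihood of x_new under each class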

Pros and Cons of Naive Bayes algorithm

Every coin has two sides. So does the Naive Bayes algorithm. It has advantages as well as disadvantages, and they are listed below:

Pros

  • It is a relatively easy algorithm to build and understand.
  • It predicts classes faster than many other classification algorithms.
  • It can be easily trained using a small data set.

Cons

  • If a given class and feature value never occur together in the training data (0 frequency), then the conditional probability estimate for that combination comes out as 0. This is known as the “Zero Conditional Probability Problem.” It is a problem because it wipes out all the information in the other probabilities too: the whole product becomes zero. There are several sample-correction techniques to fix this problem, such as Laplace correction (a short sketch follows this list).
  • Another disadvantage is the very strong assumption that the features are independent given the class. It is close to impossible to find data sets with truly independent features in real life.
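
To illustrate the zero-frequency problem and the Laplace correction mentioned above, here is a minimal sketch; the helper function and the value of alpha are illustrative, not taken from any library:

def smoothed_likelihood(feature_count, class_count, n_feature_values, alpha=1):
    """Laplace-smoothed (add-one) estimate of P(feature | class)."""
    return (feature_count + alpha) / (class_count + alpha * n_feature_values)

# P(Long | Mango) from the fruit example: the raw estimate is 0/650 = 0,
# which would wipe out the whole product for the Mango class.
print(0 / 650)                                          # 0.0 (unsmoothed)
print(smoothed_likelihood(0, 650, n_feature_values=2))  # ~0.0015 (smoothed, never exactly zero)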

Naive Bayes with Python and R

Let us see how we can build the basic model using the Naive Bayes algorithm in R and in Python.

R Code

To start training a Naive Bayes classifier in R, we need to load the e1071 package.

library(e1071)

To split the data set into training and test data we will use the caTools package.

library(caTools)

The predefined function used for the implementation of Naive Bayes in R is called naiveBayes(). There are only a few parameters that are of use:

naiveBayes(formula, data, laplace = 0, subset, na.action = na.pass)
  • formula: The traditional formula Y\sim X_1+X_2+\ldots+X_n
  • data: The data frame containing numeric or factor variables
  • laplace: Provides a smoothing effect (Laplace correction)
  • subset: Helps in using only a selected subset of the data, based on some Boolean filter
  • na.action: Determines what is to be done when a missing value is encountered in the data set

Let us take the example of the iris data set.

> library(e1071)
> library(caTools)

> data(iris)

> iris$spl=sample.split(iris$Species, SplitRatio=0.7)
# sample.split() creates a vector of TRUE and FALSE values; by setting SplitRatio to 0.7,
# we split the original iris data set of 150 rows into 70% training and 30% testing data.
> train=subset(iris, iris$spl==TRUE)  # the subset of the iris data set for which spl==TRUE
> test=subset(iris, iris$spl==FALSE)

> nB_model <- naiveBayes(train[,1:4], train[,5]) 

> table(predict(nB_model, test[,-5]), test[,5]) #returns the confusion matrix
                setosa versicolor virginica
  setosa         17          0         0
  versicolor      0         17         2
  virginica       0          0        14

Python Code

We will use the Python library scikit-learn to build the Naive Bayes models.

>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn import datasets
>>> from sklearn.metrics import confusion_matrix
>>> from sklearn.model_selection import train_test_split

>>> iris = datasets.load_iris()
>>> X = iris.data
>>> y = iris.target

# Split the data into a training set and a test set
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
>>> gnb = GaussianNB()
>>> mnb = MultinomialNB()

>>> y_pred_gnb = gnb.fit(X_train, y_train).predict(X_test)
>>> cnf_matrix_gnb = confusion_matrix(y_test, y_pred_gnb)

>>> print(cnf_matrix_gnb)
[[16 0 0]
 [ 0 18 0]
 [ 0 0 11]]

>>> y_pred_mnb = mnb.fit(X_train, y_train).predict(X_test)
>>> cnf_matrix_mnb = confusion_matrix(y_test, y_pred_mnb)

>>> print(cnf_matrix_mnb)
[[16 0 0]
 [ 0 0 18]
 [ 0 0 11]]
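
To compare the two models with a single number, one can also compute the accuracy from the same predictions; equivalently, the accuracy is the sum of the diagonal of each confusion matrix divided by the number of test samples:

>>> from sklearn.metrics import accuracy_score
>>> print(accuracy_score(y_test, y_pred_gnb))  # fraction of correctly classified test samples
>>> print(accuracy_score(y_test, y_pred_mnb))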

Applications

The Naive Bayes algorithm is used in multiple real-life scenarios such as

  1. Text classification: It is used as a probabilistic learning method for text classification. The Naive Bayes classifier is one of the most successful known algorithms when it comes to the classification of text documents, i.e., whether a text document belongs to one or more categories (classes).
  2. Spam filtering: It is an example of text classification. This has become a popular mechanism to distinguish spam email from legitimate email. Several modern email services implement Bayesian spam filtering.
    Many server-side email filters, such as DSPAM, SpamBayes, SpamAssassin, Bogofilter, and ASSP, use this technique.
  3. Sentiment Analysis: It can be used to analyze the tone of tweets, comments, and reviews—whether they are negative, positive or neutral.
  4. Recommendation System: The Naive Bayes algorithm, in combination with collaborative filtering, is used to build hybrid recommendation systems which help in predicting whether a user would like a given resource or not.

Conclusion

This article is a simple explanation of the Naive Bayes classification algorithm, with an easy-to-understand example and a few technicalities.

Despite all the complicated math, the implementation of the Naive Bayes algorithm involves simply counting the number of objects with specific features and classes. Once these counts are obtained, it is very simple to calculate the probabilities and arrive at a conclusion.

We hope you are now familiar with this machine learning concept, which you most likely would have heard of before.
