Naive Bayes is a machine learning algorithm for classification problems. It is based on Bayes’ probability theorem. It is primarily used for text classification, which involves high-dimensional training data sets. A few examples are spam filtering, sentiment analysis, and classifying news articles.
It is known not only for its simplicity but also for its effectiveness. Building models and making predictions with the Naive Bayes algorithm is fast. It is often the first algorithm worth considering for a text classification problem, so it pays to learn it thoroughly.
The Naive Bayes algorithm learns the probability that an object with certain features belongs to a particular group/class. In short, it is a probabilistic classifier. You may be wondering why it is called “naive.”
The Naive Bayes algorithm is called “naive” because it makes the assumption that the occurrence of a certain feature is independent of the occurrence of other features.
For instance, if you are trying to identify a fruit based on its color, shape, and taste, then an orange-colored, spherical, and tangy fruit would most likely be an orange. Even if these features depend on each other or on the presence of the other features, the algorithm treats each of them as contributing individually to the probability that this fruit is an orange. That is why it is known as “naive.”
As for the “Bayes” part, it refers to the statistician and philosopher Thomas Bayes and the theorem named after him, Bayes’ theorem, which is the basis of the Naive Bayes algorithm.
As noted above, the basis of the Naive Bayes algorithm is Bayes’ theorem, also known as Bayes’ rule or Bayes’ law. It gives us a method to calculate conditional probability, i.e., the probability of an event given prior knowledge of related events. More formally, Bayes’ theorem is stated as the following equation:
\(\displaystyle P(A|B)=\frac{P(B|A)P(A)}{P(B)}\)
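Let us understand the statement first and then we will look at the proof of the statement. The components of the above statement are \(P(A|B)\), the probability of event \(A\) given that event \(B\) has occurred; \(P(B|A)\), the probability of event \(B\) given that event \(A\) has occurred; and \(P(A)\) and \(P(B)\), the probabilities of \(A\) and \(B\) on their own.
In the terminology of the Bayesian method of probability (more commonly used), \(A\) is called the proposition and \(B\) the evidence. \(P(A)\) is the prior probability of the proposition, \(P(B)\) is the prior probability of the evidence, \(P(A|B)\) is the posterior, and \(P(B|A)\) is the likelihood.
This sums up Bayes’ theorem as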
\(\mbox{Posterior}=\frac{\mbox{(Likelihood)}.\mbox{(Proposition prior probability)}}{\mbox{Evidence prior probability}}\)
Let us take an example to better understand Bayes’ theorem.
Suppose you have to draw a single card from a standard deck of 52 cards. Now the probability that the card is a Queen is \(P\left(\mbox{Queen}\right)=\frac{4}{52}=\frac{1}{13}\). If you are given evidence that the card that you have picked is a face card, the posterior probability \(P(\mbox{Queen}|\mbox{Face})\) can be calculated using Bayes’ Theorem as follows:
\(\displaystyle P(\mbox{Queen}|\mbox{Face})=\frac{P(\mbox{Face}|\mbox{Queen})}{P(\mbox{Face})}.P(\mbox{Queen})\)
Now \(P(\mbox{Face}|\mbox{Queen})=1\) because, given that the card is a Queen, it is definitely a face card. We have already calculated \(P\left(\mbox{Queen}\right)\). The only value left to calculate is \(P\left(\mbox{Face}\right)\), which is equal to \(\frac{12}{52}=\frac{3}{13}\) as there are three face cards (Jack, Queen, and King) in each of the four suits in a deck. Therefore,
\(\displaystyle P(\mbox{Queen}|\mbox{Face})=\frac{1}{13}.\frac{13}{3}=\frac{1}{3}\)
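The card example can be checked by brute-force counting. The sketch below is only an illustration: the deck representation and the helper names (`probability`, `is_queen`, `is_face`) are assumptions made for this example, not part of any library.

```python
from fractions import Fraction

# Build a standard 52-card deck as (rank, suit) pairs.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(rank, suit) for rank in ranks for suit in suits]


def probability(event, cards):
    """Fraction of the given cards for which event(card) is true."""
    return Fraction(sum(1 for card in cards if event(card)), len(cards))


def is_queen(card):
    return card[0] == "Q"


def is_face(card):
    return card[0] in {"J", "Q", "K"}


p_queen = probability(is_queen, deck)  # prior, 4/52
p_face = probability(is_face, deck)    # evidence, 12/52

# Likelihood P(Face | Queen): restrict the deck to queens, then count faces.
queens = [card for card in deck if is_queen(card)]
p_face_given_queen = probability(is_face, queens)

# Bayes' theorem: posterior = likelihood * prior / evidence.
p_queen_given_face = p_face_given_queen * p_queen / p_face
print(p_queen_given_face)  # 1/3
```

Using `Fraction` keeps the arithmetic exact, so the result matches the hand calculation rather than a rounded float.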
To prove the theorem, consider two events \(A\) and \(B\) with joint probability \(P(A\cap B)\). By definition, the conditional probability is
\(\displaystyle P(A|B)=\frac{P(A\cap B)}{P(B)}\)
Similarly,
\(\displaystyle P(B|A)=\frac{P(B\cap A)}{P(A)}\)
Since \(P(A\cap B)=P(B\cap A)\), it follows that
\(\displaystyle P(B|A).P(A)=P(A|B).P(B)\implies P(A|B)=\frac{P(B|A).P(A)}{P(B)}\)
In a machine learning classification problem, there are multiple features and multiple classes, say, \(C_1, C_2, \ldots, C_k \). The main aim of the Naive Bayes algorithm is to calculate the conditional probability that an object with a feature vector \(x_1, x_2,\ldots, x_n\) belongs to a particular class \(C_i\),
\(\displaystyle P(C_i|x_1, x_2,\ldots, x_n)=\frac{P(x_1, x_2,\ldots, x_n|C_i).P(C_i)}{P(x_1, x_2,\ldots, x_n)}\) for \(1\leq i\leq k\)
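Now, the numerator of the fraction on the right-hand side of the equation above is \(\displaystyle P(x_1, x_2,\ldots, x_n|C_i).P(C_i)=P(x_1, x_2,\ldots, x_n, C_i)\)
Applying the definition of conditional probability repeatedly (the chain rule), this joint probability expands to
\(\displaystyle P(x_1, x_2,\ldots, x_n, C_i)=P(x_1|x_2,\ldots, x_n, C_i).P(x_2|x_3,\ldots, x_n, C_i)\ldots P(x_{n-1}|x_n, C_i).P(x_n|C_i).P(C_i)\)
The conditional probability term, \(P(x_j|x_{j+1},\ldots, x_n, C_i)\) becomes \(P(x_j|C_i)\) because of the assumption that the features are independent of one another given the class.
From the calculation above and the independence assumption, Bayes’ theorem boils down to the following simple expression: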
\(\displaystyle P(C_i|x_1, x_2,\ldots, x_n)=\left(\prod_{j=1}^{j=n}P(x_j|C_i)\right).\frac{P(C_i)}{P(x_1, x_2,\ldots, x_n)}\) for \(1\leq i\leq k\)
Since the expression \(P(x_1, x_2,\ldots, x_n)\) is constant for all the classes, we can simply say that
\(\displaystyle P(C_i|x_1, x_2,\ldots, x_n)\propto\left(\prod_{j=1}^{j=n}P(x_j|C_i)\right).P(C_i)\) for \(1\leq i\leq k\)
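The proportionality above translates almost directly into code. The following is a minimal sketch, not a definitive implementation: the function names and the two-class numbers (`spam`, `ham`, and their probabilities) are hypothetical values chosen only to show the arithmetic.

```python
import math


def unnormalized_posterior(prior, likelihoods):
    """P(C_i) multiplied by the product of P(x_j | C_i) over all features."""
    return prior * math.prod(likelihoods)


def classify(priors, likelihoods_by_class):
    """Return the highest-scoring class and the normalized posteriors."""
    scores = {
        label: unnormalized_posterior(priors[label], likelihoods_by_class[label])
        for label in priors
    }
    # Dividing by the sum plays the role of the constant P(x_1, ..., x_n).
    total = sum(scores.values())
    posteriors = {label: s / total for label, s in scores.items()} if total > 0 else scores
    best = max(scores, key=scores.get)
    return best, posteriors


# Hypothetical numbers: two classes, two observed features.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {"spam": [0.8, 0.5], "ham": [0.1, 0.5]}

best, posteriors = classify(priors, likelihoods)
print(best, posteriors)
```

Because the denominator is the same for every class, the predicted class can be read off the unnormalized scores; normalization is only needed when actual probabilities are wanted.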
So far, we have learned what the Naive Bayes algorithm is, how Bayes’ theorem is related to it, and what form Bayes’ theorem takes for this algorithm. Let us take a simple example to understand how the algorithm works. Suppose we have a training data set of 1200 fruits. The features in the data set are these: is the fruit yellow or not, is the fruit long or not, and is the fruit sweet or not. There are three different classes: mango, banana, and others.
Step 1: Create a frequency table for all the features against the different classes.
Name | Yellow | Sweet | Long | Total |
Mango | 350 | 450 | 0 | 650 |
Banana | 400 | 300 | 350 | 400 |
Others | 50 | 100 | 50 | 150 |
Total | 800 | 850 | 400 | 1200 |
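What can we conclude from the above table? Out of 1200 fruits, 650 are mangoes, 400 are bananas, and 150 are others. Of the 650 mangoes, 350 are yellow, 450 are sweet, and none is long; the rows for bananas and others read the same way. In total, 800 fruits are yellow, 850 are sweet, and 400 are long.
Let’s say you are given a fruit which is yellow, sweet, and long and you have to check the class to which it belongs.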
Step 2: Draw the likelihood table for the features against the classes.
Name | Yellow | Sweet | Long | Total |
Mango | 350/800=P(Mango|Yellow) | 450/850 | 0/400 | 650/1200=P(Mango) |
Banana | 400/800 | 300/850 | 350/400 | 400/1200 |
Others | 50/800 | 100/850 | 50/400 | 150/1200 |
Total | 800/1200=P(Yellow) | 850/1200 | 400/1200 | 1200 |
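Step 3: Calculate the conditional probabilities for all the classes. For each class, multiply the probability of each observed feature within that class, e.g., \(P(\mbox{Yellow}|\mbox{Mango})=\frac{350}{650}\), by the prior probability of the class. In our example:
\(\displaystyle P(\mbox{Mango}|\mbox{Yellow, Sweet, Long})\propto\frac{350}{650}.\frac{450}{650}.\frac{0}{650}.\frac{650}{1200}=0\)
\(\displaystyle P(\mbox{Banana}|\mbox{Yellow, Sweet, Long})\propto\frac{400}{400}.\frac{300}{400}.\frac{350}{400}.\frac{400}{1200}=\frac{7}{32}\approx 0.219\)
\(\displaystyle P(\mbox{Others}|\mbox{Yellow, Sweet, Long})\propto\frac{50}{150}.\frac{100}{150}.\frac{50}{150}.\frac{150}{1200}=\frac{1}{108}\approx 0.009\)
Step 4: Calculate \(\displaystyle\max_{i}{P(C_i|x_1, x_2,\ldots, x_n)}\). In our example, the maximum probability is for the class banana; therefore, the Naive Bayes algorithm classifies the fruit which is long, sweet, and yellow as a banana.
In a nutshell, a new element is assigned to the class that has the maximum conditional probability described above.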
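The four steps can be reproduced from the frequency table with a few lines of Python. This is a sketch under the assumption that each likelihood \(P(x_j|C_i)\) is the feature count divided by the class total; the variable names are chosen for illustration.

```python
from fractions import Fraction

# Frequency table from Step 1: per-class counts of fruits with each feature.
counts = {
    "Mango":  {"Yellow": 350, "Sweet": 450, "Long": 0,   "Total": 650},
    "Banana": {"Yellow": 400, "Sweet": 300, "Long": 350, "Total": 400},
    "Others": {"Yellow": 50,  "Sweet": 100, "Long": 50,  "Total": 150},
}
n_fruits = sum(row["Total"] for row in counts.values())  # 1200
observed = ["Yellow", "Sweet", "Long"]  # features of the new fruit

scores = {}
for name, row in counts.items():
    score = Fraction(row["Total"], n_fruits)  # prior P(C_i)
    for feature in observed:
        score *= Fraction(row[feature], row["Total"])  # likelihood P(x_j | C_i)
    scores[name] = score

for name, score in scores.items():
    print(name, score)

# Step 4: pick the class with the largest score.
prediction = max(scores, key=scores.get)
print(prediction)  # Banana
```

The score for mango is exactly zero because no mango in the training data is long, which is what rules that class out for this fruit.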
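There are multiple variations of the Naive Bayes algorithm depending on the distribution of \(P(x_j|C_i)\). Three of the commonly used variations are Gaussian Naive Bayes (continuous features assumed to be normally distributed within each class), Multinomial Naive Bayes (count features such as word frequencies), and Bernoulli Naive Bayes (binary features).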
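As a rough sketch of how the choice of distribution changes \(P(x_j|C_i)\), the functions below compute a per-feature likelihood under a Gaussian, a Bernoulli, and a multinomial model. The Gaussian and multinomial cases correspond to the scikit-learn classes used later in this article; the function names and example numbers here are illustrative assumptions.

```python
import math


def gaussian_likelihood(x, mean, variance):
    """Density of a continuous feature value x under N(mean, variance)."""
    exponent = -((x - mean) ** 2) / (2 * variance)
    return math.exp(exponent) / math.sqrt(2 * math.pi * variance)


def bernoulli_likelihood(x, p):
    """Probability of a binary feature x (0 or 1) when P(x = 1 | class) = p."""
    return p if x == 1 else 1 - p


def multinomial_likelihood(word_counts, word_probs):
    """Product of per-word probabilities raised to the observed counts.

    The multinomial coefficient is left out because it is the same for
    every class and does not affect which class scores highest.
    """
    return math.prod(word_probs[word] ** n for word, n in word_counts.items())


# Hypothetical parameters, as if already estimated for a single class.
print(gaussian_likelihood(0.0, mean=0.0, variance=1.0))
print(bernoulli_likelihood(0, p=0.3))
print(multinomial_likelihood({"free": 2, "hello": 1}, {"free": 0.5, "hello": 0.1}))
```

In each case the per-feature values are multiplied together and by the class prior, exactly as in the general expression above; only the way a single \(P(x_j|C_i)\) is computed differs.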
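Every coin has two sides, and so does the Naive Bayes algorithm. Its main advantages are the ones noted earlier: it is simple, it is fast to train and to predict with, and it copes well with high-dimensional data such as text. Its main disadvantages are that the independence assumption rarely holds exactly in real data, and that a feature value never seen with a class in the training data (such as a long mango in the example above) forces the probability of that class to zero unless smoothing is applied.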
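The fruit example shows one practical weakness: because none of the 650 mangoes is long, a single zero count wipes out the whole product for that class. A common remedy is Laplace (add-one) smoothing, which is what the `laplace` argument of the R function `naiveBayes()` shown below controls. The sketch here assumes binary yes/no features; the function name and the parameters `alpha` and `n_values` are illustrative.

```python
from fractions import Fraction


def smoothed_likelihood(feature_count, class_total, alpha=1, n_values=2):
    """Laplace-smoothed estimate of P(feature | class).

    alpha is the pseudo-count added to each of the n_values possible
    values of the feature (2 here, since each fruit feature is yes/no).
    """
    return Fraction(feature_count + alpha, class_total + alpha * n_values)


# P(Long | Mango) from the frequency table: 0 long mangoes out of 650.
unsmoothed = smoothed_likelihood(0, 650, alpha=0)
smoothed = smoothed_likelihood(0, 650)
print(unsmoothed, smoothed)  # 0 1/652
```

With smoothing, a yellow, sweet, and long fruit still scores very low for mango, but the score is no longer forced to exactly zero by one unseen combination.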
Let us see how we can build the basic model using the Naive Bayes algorithm in R and in Python.
To start training a Naive Bayes classifier in R, we need to load the e1071 package.
library(e1071)
To split the data set into training and test data we will use the caTools package.
library(caTools)
The predefined function used for the implementation of Naive Bayes in R is called naiveBayes(). There are only a few parameters that are of use:
naiveBayes(formula, data, laplace = 0, subset, na.action = na.pass)
Let us take the example of the iris data set.
> library(e1071)
> library(caTools)
> data(iris)
> # sample.split() creates a vector with values TRUE and FALSE; by setting
> # the SplitRatio to 0.7, we split the original iris data set of 150 rows
> # into 70% training and 30% testing data.
> iris$spl=sample.split(iris,SplitRatio=0.7)
> train=subset(iris, iris$spl==TRUE) # the subset of the iris data set for which spl==TRUE
> test=subset(iris, iris$spl==FALSE)
> nB_model <- naiveBayes(train[,1:4], train[,5])
> table(predict(nB_model, test[,-5]), test[,5]) # returns the confusion matrix
             setosa versicolor virginica
  setosa         17          0         0
  versicolor      0         17         2
  virginica       0          0        14
We will use the Python library scikit-learn to build Naive Bayes models.
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn import datasets
>>> from sklearn.metrics import confusion_matrix
>>> from sklearn.model_selection import train_test_split
>>> iris = datasets.load_iris()
>>> X = iris.data
>>> y = iris.target
>>> # Split the data into a training set and a test set
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
>>> gnb = GaussianNB()
>>> mnb = MultinomialNB()
>>> y_pred_gnb = gnb.fit(X_train, y_train).predict(X_test)
>>> cnf_matrix_gnb = confusion_matrix(y_test, y_pred_gnb)
>>> print(cnf_matrix_gnb)
[[16  0  0]
 [ 0 18  0]
 [ 0  0 11]]
>>> y_pred_mnb = mnb.fit(X_train, y_train).predict(X_test)
>>> cnf_matrix_mnb = confusion_matrix(y_test, y_pred_mnb)
>>> print(cnf_matrix_mnb)
[[16  0  0]
 [ 0  0 18]
 [ 0  0 11]]
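To compare the two models at a glance, the confusion matrices printed above can be reduced to a single accuracy figure: the share of test samples on the diagonal. The helper below is a small plain-Python sketch; the name `accuracy` is an assumption for this example, and the matrices are copied from the output shown above.

```python
def accuracy(confusion):
    """Share of test samples on the diagonal, i.e. correctly classified."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Confusion matrices copied from the scikit-learn session above.
cnf_matrix_gnb = [[16, 0, 0], [0, 18, 0], [0, 0, 11]]
cnf_matrix_mnb = [[16, 0, 0], [0, 0, 18], [0, 0, 11]]

print(accuracy(cnf_matrix_gnb))  # 1.0
print(accuracy(cnf_matrix_mnb))  # 0.6
```

On this split the Gaussian model classifies all 45 test flowers correctly, while the multinomial model assigns every flower of the second class to the third. That fits the data: the iris measurements are continuous, which is the case the Gaussian variant is designed for.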
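The Naive Bayes algorithm is used in multiple real-life scenarios such as text classification (for example, classifying news articles), spam filtering, and sentiment analysis.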
This article is a simple explanation of the Naive Bayes Classification algorithm, with an easy-to-understand example and a few technicalities.
Despite all the complicated math, the implementation of the Naive Bayes algorithm involves simply counting the number of objects with specific features and classes. Once these numbers are obtained, it is very simple to calculate probabilities and arrive at a conclusion.
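To make the "simply counting" point concrete, here is a minimal from-scratch sketch of a categorical Naive Bayes classifier built on counters. The class name `CountingNaiveBayes` and the tiny fruit data set are hypothetical, and the sketch deliberately omits smoothing to stay close to the worked example.

```python
from collections import Counter, defaultdict


class CountingNaiveBayes:
    """Categorical Naive Bayes built from plain counts (no smoothing)."""

    def fit(self, rows, labels):
        # Number of training rows per class.
        self.class_counts = Counter(labels)
        # feature_counts[label][j][value] = number of rows of that class
        # whose j-th feature equals value.
        self.feature_counts = defaultdict(lambda: defaultdict(Counter))
        for row, label in zip(rows, labels):
            for j, value in enumerate(row):
                self.feature_counts[label][j][value] += 1
        return self

    def predict(self, row):
        n = sum(self.class_counts.values())
        scores = {}
        for label, class_total in self.class_counts.items():
            score = class_total / n  # prior P(C_i)
            for j, value in enumerate(row):
                # likelihood P(x_j | C_i) as a ratio of counts
                score *= self.feature_counts[label][j][value] / class_total
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical toy data: each row is (is_yellow, is_long).
rows = [("yes", "yes"), ("yes", "yes"), ("yes", "no"), ("yes", "no"), ("no", "no")]
labels = ["banana", "banana", "mango", "mango", "mango"]

model = CountingNaiveBayes().fit(rows, labels)
print(model.predict(("yes", "yes")))  # banana
print(model.predict(("no", "no")))    # mango
```

Training is a single pass that increments counters, and prediction is a handful of divisions and multiplications per class, which is why the algorithm is fast to build and to apply.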
Hopefully you are now familiar with this machine learning concept, which you had most likely heard of before.