Data classification is a very important task in machine learning. Support Vector Machines (SVMs) are widely applied in the field of pattern classifications and nonlinear regressions. The original form of the SVM algorithm was introduced by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963. Since then, SVMs have been transformed tremendously to be used successfully in many real-world problems such as text (and hypertext) categorization, image classification, bioinformatics (Protein classification, Cancer classification), handwritten character recognition, etc.
A Support Vector Machine is a supervised machine learning algorithm which can be used for both classification and regression problems. It follows a technique called the kernel trick to transform the data and based on these transformations, it finds an optimal boundary between the possible outputs.
In simple words, it does some extremely complex data transformations to figure out how to separate the data based on the labels or outputs defined.We will be looking only at the SVM classification algorithm in this article.
The main idea is to identify the optimal separating hyperplane which maximizes the margin of the training data. Let us understand this objective term by term.
We can see that it is possible to separate the data given in the plot above. For instance, we can draw a line in which all the points above the line are green and the ones below the line are red. Such a line is said to be a separating hyperplane.
Now the obvious confusion, why is it called a hyperplane if it is a line?
In the diagram above, we have considered the simplest of examples, i.e., the dataset lies in the 2-dimensional plane(\(\mathbb{R}^2\)). But the support vector machine can work for a general n-dimensional dataset too. And in the case of higher dimensions, the hyperplane is the generalization of a plane.
More formally, it is an n-1 dimensional subspace of an n-dimensional Euclidean space. So for a
We have said that the objective of an SVM is to find the optimal separating hyperplane. When is a separating hyperplane said to be optimal?
The fact that there exists a hyperplane separating the dataset doesn’t mean that it is the best one.
Let us understand the optimal hyperplane through a set of diagrams.
Therefore, maximizing the distance between the nearest points of each class and the hyperplane would result in an optimal separating hyperplane. This distance is called the margin.
The goal of SVMs is to find the optimal hyperplane because it not only classifies the existing dataset but also helps predict the class of the unseen data. And the optimal hyperplane is the one which has the biggest margin.
Now that we have understood the basic setup of this algorithm, let us dive straight into the mathematical technicalities of SVMs.
I will be assuming you are familiar with basic mathematical concepts such as vectors, vector arithmetic(addition, subtraction, dot product) and the orthogonal projection. Some of these concepts can also be found in the article, Prerequisites of linear algebra for machine learning.
You must have come across the equation of a straight line as \(y=mx+c\), where \(m\) is the slope and \(c\) is the y-intercept of the line.
The generalized equation of a hyperplane is as follows:
\(\displaystyle w^Tx=0\)
Here \(w\) and \(x\) are the vectors and \(w^Tx\) represents the dot product of the two vectors. The vector \(w\) is often called as the weight vector.
Consider the equation of the line as \(y-mx-c=0\). In this case,
\(w=\begin{pmatrix}-c\\-m\\1 \end{pmatrix}\) and \(x=\begin{pmatrix}1\\x\\y \end{pmatrix}\)
\(\displaystyle w^Tx=-c\times 1-m\times x+y=y-mx-c=0\)It is just two different ways of representing the same thing. So why do we use \(w^Tx=0\)? Simply because it is easier to deal with this representation in the case of higher dimensional dataset and \(w\) represents the vector which is normal to the hyperplane. This property will be useful once we start computing the distance from a point to the hyperplane.
The training data in our classification problem is of the form \(\{(x_1,y_1),(x_2,y_2),…,(x_n,y_n)\} \in \mathbb{R}^n \times {-1,1}\). This means that the training dataset is a pair of \(x_i\), an n-dimensional feature vector and \(y_i\), the label of \(x_i\). When \(y_i=1\) implies that the sample with the feature vector \(x_i\) belongs to class 1 and if \(y_i=-1\) implies that the sample belongs to class -1.
In a classification problem, we thus try to find out a function, \( y=f(x): \mathbb{R}^n \longrightarrow \{-1,1\}\). \(f(x)\) learns from the training data set and then applies its knowledge to classify the unseen data.
There are an infinite number of functions, \(f(x)\) that can exist, so we have to restrict the class of functions that we are dealing with. In the case of SVM’s, this class of functions is that of the hyperplane represented as \(w^Tx=0\).
It can also be represented as \(\vec{w}.\vec{x}+b=0; \vec{w}\in \mathbb{R}^n \mbox{ and } b \in \mathbb{R}\)
This divides the input space into two parts, one containing vectors of class ?1 and the other containing vectors of class +1.
For the rest of this article, we will consider 2-dimensional vectors. Let \(\mathcal{H}_0\) be a hyperplane separating the dataset and satisfying the following:
\(\displaystyle \vec{w}.\vec{x}+b=0\)Along with \(\mathcal{H}_0\), we can select two others hyperplanes \(\mathcal{H}_1\) and \(\mathcal{H}_2\) such that they also separate the data and have the following equations:
\(\vec{w}.\vec{x}+b=\delta\) and \(\vec{w}.\vec{x}+b=\mbox{-}\delta\)
This makes \(\mathcal{H}_o\) equidistant from \(\mathcal{H}_1\) as well as \(\mathcal{H}_2\).
The variable ? is not necessary so we can set ?=1 to simplify the problem as \(\vec{w}.\vec{x}+b=1\) and \(\vec{w}.\vec{x}+b = \mbox{-}1\)
Next, we want to ensure that there is no point between them. So for this, we will select only those hyperplanes which satisfy the following constraints:
For every vector \(x_i\) either:
Both the constraints stated above can be combined into a single constraint.
Constraint 1:
For \(x_i\) having the class -1, \(\vec{w}.\vec{x}+b\leq \mbox{-}1\)
Multiplying both sides by \(y_i\) (which is always -1 for this equation)
\(y_i\left(\vec{w}.\vec{x}+b\right)\geq y_i(-1)\) which implies \(y_i\left(\vec{w}.\vec{x}+b\right) \geq 1\) for \(x_i\) having the class?1.
Constraint 2: \(y_i=1\)
\(y_i\left(\vec{w}.\vec{x}+b\right) \geq 1\) for \(x_i\) having the class 1
Combining both the above equations, we get \(y_i\left(\vec{w}.\vec{x}+b\right)\geq 1 \mbox{ for all }1\leq i\leq n\)
This leads to a unique constraint instead of two which are mathematically equivalent. The combined new constraint also has the same effect, i.e., no points between the two hyperplanes.
For the sake of simplicity, we will skip the derivation of the formula for calculating the margin, \(m\) which is
\(\displaystyle m=\frac{2}{||\vec{w}||}\)
The only variable in this formula is \(w\), which is indirectly proportional to \(m\), hence to maximize the margin we will have to minimize \(||\vec{w}||\). This leads to the following optimization problem:
Minimize in \(\displaystyle (\vec{w},b) \{ \frac{||\vec{w}||^2}{2}\) subject to \( y_i\left(\vec{w}.\vec{x}+b\right) \geq 1 \mbox{ for any } i=1,\dots, n \)
The above is the case when our data is linearly separable. There are many cases where the data can not be perfectly classified through linear separation. In such cases, Support Vector Machine looks for the hyperplane that maximizes the margin and minimizes the misclassifications.
For this, we introduce the slack variable,\(\zeta_i\) which allows some objects to fall off the margin but it penalizes them.
In this scenario, the algorithm tries to maintain the slack variable to zero while maximizing the margin. However, it minimizes the sum of distances of the misclassification from the margin hyperplanes and not the number of misclassifications.
Constraints now changes to \(y_i(\vec{w}.\vec{x_i}+b)\geq 1-\zeta_i \mbox{ for all } 1\leq i\leq n, \zeta_i\geq 0\)
and the optimization problem changes to
Minimize in \(\displaystyle (\vec{w},b) \{ \frac{||\vec{w}||^2}{2}+C\sum_i\zeta_i\) subject to \( y_i\left(\vec{w}.\vec{x}+b\right) \geq 1-\zeta_i \mbox{ for any } i=1,\dots, n \)
Here, the parameter \(C\) is the regularization parameter that controls the trade-off between the slack variable penalty (misclassifications) and width of the margin.
The easiest way to separate two classes of data is a line in case of 2D data and a plane in case of 3D data. But it is not always possible to use lines or planes and one requires a nonlinear region to separate these classes. Support Vector Machines handle such situations by using a kernel function which maps the data to a different space where a linear hyperplane can be used to separate classes. This is known as the kernel trick where the kernel function transforms the data into the higher dimensional feature space so that a linear separation is possible.
If \( \phi \) is the kernel function which maps \(x_i \)to \(\phi(x_i)\), the constraints change to \(y_i(\vec{w}.\phi(x_i)+b)\geq 1-\zeta_i \mbox{ for all } 1\leq i \leq n, \zeta_i\geq 0\)
And the optimization problem is
Minimize in \(\displaystyle (\vec{w},b) \{ \frac{||\vec{w}||^2}{2} + C\sum_i\zeta_i \) subject to \(y_i(\vec{w}.\phi(x_i)+b)\geq 1-\zeta_i\) \( \mbox{ for all } 1\leq i \leq n, \zeta_i\geq 0\)
We will not get into the solution of these optimization problems. The most common method used to solve these optimization problems is Convex Optimization.
Every classification algorithm has its own advantages and disadvantages that are come into play according to the dataset being analyzed. Some of the advantages of SVMs are as follows:
Disadvantages of SVMs are as follows:
Let us look at the libraries and functions used to implement SVM in Python and R.
The most widely used library for implementing machine learning algorithms in Python is scikit-learn. The class used for SVM classification in scikit-learn is svm.SVC()
sklearn.svm.SVC (C=1.0, kernel=’rbf’, degree=3, gamma=’auto’)
Parameters are as follows:
There are many advanced parameters too which I have not discussed here. You can check them out here.
https://gist.github.com/HackerEarthBlog/07492b3da67a2eb0ee8308da60bf40d9
One can tune the SVM by changing the parameters \(C, \gamma\) and the kernel function. The function for tuning the parameters available in scikit-learn is called gridSearchCV().
sklearn.model_selection.GridSearchCV(estimator, param_grid)
Parameters of this function are defined as:
To know more about other parameters of GridSearch.CV(), click here.
https://gist.github.com/HackerEarthBlog/a84a446810494d4ca0c178e864ab2391
In the above code, the parameters we have considered for tuning are kernel, C, and gamma. The values from which the best value is to be are the ones written in the bracket. Here, we have only given a few values to be considered but a whole range of values can be given for tuning but it will take a longer time for execution.
The package that we will use for implementing SVM algorithm in R is e1071. The function used will be svm().
https://gist.github.com/HackerEarthBlog/0336338c5d93dc3d724a8edb67ad0a05
In this article, I have gone through a very basic explanation of SVM classification algorithm. I have left out a few mathematical complications such as calculating distances and solving the optimization problem. But I hope this gives you enough know-how about how a machine learning algorithm, that is, SVM, can be modified based on the type of dataset provided.
Defining a Recruitment Management System In today's competitive talent landscape, attracting and retaining top performers…
Understanding the Recruitment Funnel Imagine a broad opening at the top, gradually narrowing down to…
With the growing demand for highly skilled professionals, traditional hiring methods such as reviewing resumes…
Finding the perfect fit for your team can feel like searching for a unicorn. But…
In today's competitive job market, attracting and keeping top talent is crucial for small businesses…
The tech industry thrives on innovation, and at the heart of that innovation lies a…