Support Vector Machine Explained

Theory, Implementation, and Visualization

Support Vector Machine (SVM) is probably one of the most popular ML algorithms used by data scientists. SVM is powerful, easy to explain, and generalizes well in many cases. In this article, I'll explain the rationale behind SVM and show the implementation in Python. For simplicity, I'll focus on binary classification problems. However, SVM also supports multi-class classification.

1.1 General Ideas Behind SVM
SVM seeks the decision boundary that separates two classes with the highest generalization ability (why focus on generalization? I'll explain it later in the kernel section). The first question to ask is: how does SVM define optimality?

[Figure: Which decision boundary should we pick? (from MIT)]

Unlike logistic regression, which defines optimality by overall probability, SVM wants the smallest distance between the data points and the decision boundary to be as large as possible. In other words, if you imagine the decision boundary as the central line of a street, SVM prefers an 8-lane highway to a country road. The width of the street is called the margin.

[Figure: from Vapnik, Support-Vector Networks]

1.2 Math
Define Margin

Okay, the idea is pretty intuitive. Let's find an expression for the margin. Before we get started, please keep in mind the symbols used in this article:
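Roughly, the notation is the standard one:

x_n: the n-th training point (a feature vector)
y_n ∈ {−1, +1}: the label of x_n
W, b: the weights and intercept of the decision boundary
α_n: the Lagrange multipliers that show up in Section 1.4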

If we know the weights and intercepts of the decision boundary, the boundary can be expressed by the following equation:
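With the notation above, this is the usual hyperplane equation:

W^T x + b = 0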

This is also the equation of the broken line in the figure above. The distance from a data point to the boundary is:
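In the standard point-to-hyperplane form:

d(x_n) = \frac{|W^T x_n + b|}{\lVert W \rVert}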

[Figure: from Lecture 14, Support Vector Machines, Caltech]

The margin is the distance from the closest point to the boundary:
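That is, taking the minimum over all training points:

\text{margin}(W, b) = \min_n \frac{|W^T x_n + b|}{\lVert W \rVert}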

Final Objective

Now that we know how to calculate the margin, let's formalize the problem that needs to be optimized:
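Roughly, the standard formulation maximizes the margin after normalizing the closest point:

\max_{W, b} \; \frac{1}{\lVert W \rVert} \quad \text{subject to} \quad \min_n \, y_n (W^T x_n + b) = 1

which is equivalent to the more common quadratic form

\min_{W, b} \; \tfrac{1}{2} \lVert W \rVert^2 \quad \text{subject to} \quad y_n (W^T x_n + b) \ge 1 \;\; \text{for all } n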

This is our final objective. I have to admit it is a weird-looking function. Just remember this: like RMSE for linear regression or cross-entropy for logistic regression, it is simply the function that needs to be optimized.

1.3 Python Implementation
Now that we understand how SVM works, let's try the svm.SVC class offered by Scikit-Learn.
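Here is a minimal sketch of how the call looks (the toy dataset below is made up purely for illustration):

```python
from sklearn import svm
from sklearn.datasets import make_blobs

# A small, (nearly) linearly separable toy dataset, purely illustrative
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# A linear SVM; the C and kernel parameters are explained in the next sections
clf = svm.SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_.shape)  # the support vectors found by the solver
print(clf.predict(X[:5]))          # predicted classes for the first 5 points
```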

[Figure: the svm.SVC signature, from scikit-learn.org]

Wait a minute, what are these parameters? There's no C or 'kernel' in the objective. Are you sure it is the right function? Unfortunately yes, we just need to dig a little deeper into the math behind SVM.

1.4 Theory Again…
Now we need to talk about how to optimize the objective function. Recall that the function is:
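In the quadratic form from Section 1.2:

\min_{W, b} \; \tfrac{1}{2} \lVert W \rVert^2 \quad \text{subject to} \quad y_n (W^T x_n + b) \ge 1 \;\; \text{for all } n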

How on earth do we optimize this? We can't just take derivatives and be done with it, since the final W and b may not satisfy the constraints. Luckily, a name from high school math class comes to the rescue: Lagrange.

Eliminate the constraints

> The method of Lagrange multipliers is a strategy for finding the local maxima and minima of a function subject to equality constraints (i.e., subject to the condition that one or more equations have to be satisfied exactly by the chosen values of the variables).[1] The basic idea is to convert a constrained problem into a form such that the derivative test of an unconstrained problem can still be applied — Wikipedia

[Figure: from MIT]

With the help of Lagrange multipliers, we can find the solution to the problem above by solving this function:
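In the standard construction, each constraint gets a multiplier α_n ≥ 0 and we build the Lagrangian:

L(W, b, \alpha) = \tfrac{1}{2} \lVert W \rVert^2 - \sum_n \alpha_n \left[ y_n (W^T x_n + b) - 1 \right]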

This is the unconstrained objective function.

Figure out W and b

Right now, W, b, and alpha are all unknown. Oh God, what do we do? Luckily, Vladimir Vapnik figured out this part for us:

Minimize L with respect to W and b first (pretend we know alpha)
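Setting the derivatives of L with respect to W and b to zero gives the well-known conditions:

\frac{\partial L}{\partial W} = 0 \;\Rightarrow\; W = \sum_n \alpha_n y_n x_n, \qquad \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_n \alpha_n y_n = 0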

W and b are easy to find as long as we know alpha.

Maximize L w.r.t. the alphas, using the W and b we just found.

Wait, why does this all of a sudden become a maximization? This is called Lagrangian duality.

> The Lagrangian Dual Problem: instead of minimizing over w, b, subject to constraints involving alphas, we can maximize over alpha (the dual variable) subject to the relations obtained previously for w and b — An Idiot’s Guide to SVM
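Substituting W = Σ_n α_n y_n x_n back into L, the dual problem takes the standard form:

\max_{\alpha} \; \sum_n \alpha_n - \tfrac{1}{2} \sum_n \sum_m \alpha_n \alpha_m y_n y_m \, x_n^T x_m \quad \text{subject to} \quad \alpha_n \ge 0, \;\; \sum_n \alpha_n y_n = 0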

The above function has a nice form and can be solved by quadratic programming software. Just pass this function and the constraints to any commercial QP solver, and it'll return a list of alphas. With the alphas solved, W and b can be calculated. Done!

1.5 What is the parameter ‘C’
In the above definition, we assume that all points must be on the border of the margin or beyond it (so-called hard-margin SVM). In practice, this is usually not the case. We need to define a soft-margin SVM that allows minor violations of the rule, with some penalty:
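In the usual soft-margin formulation, each point gets a slack variable ξ_n measuring how badly it violates the margin, and the slacks are penalized in the objective:

\min_{W, b, \xi} \; \tfrac{1}{2} \lVert W \rVert^2 + C \sum_n \xi_n \quad \text{subject to} \quad y_n (W^T x_n + b) \ge 1 - \xi_n, \;\; \xi_n \ge 0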

[Figure: violation of the constraints, from Caltech]

As you can see, C determines how serious the SVM is about violations. If C is 0, then SVM doesn't care about violations at all, since the penalty term is gone. If C is very large, even minor violations lead to a huge increase in the objective function. The solution to this problem can be derived with the same methodology explained above.

1.6 What is the parameter ‘kernel’
Kernel tricks are powerful transformation techniques that project data from its original space into a transformed space, hoping to make the data more linearly separable.

[Figure: radial kernel, from An Idiot's Guide to SVM]

More rigorously, we want a transform function:
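\phi : \mathbb{R}^d \rightarrow \mathbb{R}^D, \qquad x \mapsto \phi(x)

with D typically much larger than d (possibly infinite, as we will see below).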

Recall the objective function we derived earlier:
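namely, the dual problem from Section 1.4:

\max_{\alpha} \; \sum_n \alpha_n - \tfrac{1}{2} \sum_n \sum_m \alpha_n \alpha_m y_n y_m \, x_n^T x_m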

In the new space, we will have to compute this function:
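that is, the same dual with each dot product taken in the transformed space:

\max_{\alpha} \; \sum_n \alpha_n - \tfrac{1}{2} \sum_n \sum_m \alpha_n \alpha_m y_n y_m \, \phi(x_n)^T \phi(x_m)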

With the help of kernel transformations, we are able to map the data into much higher dimensions and have a better chance of separating the classes with hyperplanes.

[Figure: decision boundaries obtained with kernel methods, 3 classes, from Scikit-Learn]

Wait a minute, why are we getting non-linear boundaries? SVM should only find linear boundaries. Yes, you are partially right! The decision boundaries you see are projections from higher-dimensional spaces. After we've done the classification in the transformed space, we can map the hyperplane back, and the boundary may become nonlinear!

[Figure: transformation of the space, from Sebastian Raschka]

Minor issues with the current transformation

The transformation function projects data points into a different space. When the projection space becomes more complex, this process gets very time-consuming. In some cases, we don't even know the transformation equation. There should be no way to map a data point into an infinite-dimensional space. Or is there?

If we take another look at the transformed objective function, we can modify it slightly:
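All the data ever enters the dual through dot products, so give that dot product a name:

K(x_i, x_j) = \phi(x_i)^T \phi(x_j)

and the dual becomes

\max_{\alpha} \; \sum_n \alpha_n - \tfrac{1}{2} \sum_n \sum_m \alpha_n \alpha_m y_n y_m \, K(x_n, x_m)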

You would probably say: okay, another definition, so what? But behold, this is the genius part. As long as we know what K(xi, xj) is, we don't need the transformation function at all. This means we can choose the desired space without worrying about the nitty-gritty details in between. We can define what dot products in the transformed space look like without ever writing down a transformation function Phi. The space can even be infinite-dimensional.

There are a few famous and well-tested kernels that you may want to know:
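Using scikit-learn's parameterization (γ, r, and d are kernel hyperparameters), the most common choices are roughly:

linear: K(x_i, x_j) = x_i^T x_j
polynomial: K(x_i, x_j) = (\gamma \, x_i^T x_j + r)^d
RBF (Gaussian): K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)
sigmoid: K(x_i, x_j) = \tanh(\gamma \, x_i^T x_j + r)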

This time we understand what those parameters mean. Let's see the effects of each parameter in action.

Effects of C
C controls how much attention is paid to misclassified points. A large C forces the model to sacrifice margin in favor of correct predictions (which may lead to overfitting).
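A quick way to see this is to sweep C on a noisy toy dataset and watch the number of support vectors and the training accuracy (a rough sketch; the dataset is made up and not related to the figures above):

```python
from sklearn import svm
from sklearn.datasets import make_blobs

# Two overlapping blobs, so some margin violations are unavoidable (illustrative only)
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = svm.SVC(kernel="linear", C=C).fit(X, y)
    # Small C: wide margin, many support vectors; large C: fewer violations tolerated
    print(f"C={C:>6}: {len(clf.support_)} support vectors, "
          f"training accuracy = {clf.score(X, y):.2f}")
```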

Effects of Kernel
[Figure: effects of different kernels, from Scikit-Learn.org]

SVM seeks a balance between the margin of the decision boundary and the number of misclassified points. Kernel tricks enable SVM to incorporate powerful nonlinearity without adding local minima to the objective function. Now that you understand how SVM works, it is time to try it in real projects!
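As a closing sketch, here is how you might compare kernels on a toy dataset that is not linearly separable in its original 2-D space (the dataset and numbers are illustrative, not taken from the figures above):

```python
from sklearn import svm
from sklearn.datasets import make_circles

# Concentric circles: no straight line can separate the two classes in 2-D
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = svm.SVC(kernel=kernel, C=1.0).fit(X, y)
    # The linear kernel struggles here, while poly and rbf can bend the boundary
    print(f"{kernel:>6} kernel: training accuracy = {clf.score(X, y):.2f}")
```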