These days I have indulged in the learning of SVM (support vector machine), a widely used tool in machine learning algorithms. After reading some papers and tutorials, I have been equipped with thorough concepts and method about it. Below I would like to share my naive understanding of the linear SVM.

Introduction

SVM is a widely used supervised learning tool in statistical learning problems like classification. The goal of a classification problem is try to categorize one sample to one of multiple classes as accurate as possible, by delving into the sample’s  features. There are many kinds of SVM, linear-separable, linear non-separble, non-linear, multi-classes etc. But they are all the descendants of the basic 2-class SVM in linear-serparable case.

The Intuition

Suppose we have a set of samples with N features each. If we plot those samples in a N-dimension Euclidian space,  we can imagine they are randomly distributed as dots in such space,  where the value of N features determines the position of each sample.  What SVM intends to do is find out a super-plane in the space to divide the it into 2 parts,  by which the points are classified. So the question is how to find the super-plane to discriminate the samples as clear-cut as possible (the linear separable case assumes such a super plane can always be found and is called the optimal plane).

The Primal Optimization Problem and Dual

So what does clear-cut super-plane mean here? Of course we can define it in a number of ways. In linear separable SVM, however, it is defined as a super-plane that has the max margin. Therefore, the optimization problem turns into the maximization of the distance between the nearest correctly classified sample and the super-plane . By some math operations, the problem can be transformed into a quadratic optimization problem as follows,

$\min_{ \mathbf{w},b} \quad \frac{1}{2} {\|\mathbf{w}\|}^{2}$

$s.t.\quad y_{i}({\mathbf{w}^{T}}\mathbf{x_{i}}+b) \leq 1, \forall i = 1, 2, ..., N$

Now let us go a step further. By adding a slack variable $\zeta_{i}$, we get the optimization problem for linear non-separable cases. This model allows classification errors for some training data, with respect to the portion of $\zeta_{i}$, in order to find a a solution.  Obviously, the non-separable case degenerates to the separable when $\zeta_{i} = 0$ for all examples. Therefore,  it is the generation of linear SVM.

To efficiently solve the problem, we utilize the concept of Lagranian dual function, and transform the original problem into the dual problem as below,

$\min_{w,b,\alpha}~~\frac{1}{2} \sum_{i}\sum_{j} \alpha_{i}\alpha_{j}y_{i}y_{j} <\mathbf{x_{i}} \cdot \mathbf{x_{j}}> - \sum_{i} \alpha_{i}$

$s.t. : \quad \alpha_{i} \geq 0, i = 1, 2, ..., N$

$\quad\quad \sum_{i=1}^{N}\alpha_{i}y_{i} = 0$

Note that a beautiful feature of the dual optimization problem is that all the samples are calculated in a inner product way, which provides great convenience when extending linear SVM into non-linear cases.  Also,  the dual problem gives the analytical solutions of w and b, and we will discussed them later.

From the Perspective of Loss Function

It is time for us to look back at the process again.  We have deduced the primal and dual optimization problems from the geometric intuition of SVM, and hence we can calculate the solution of SVM accordingly. But one may would like to find its link to the optimization of other classification problems where the loss function is minimized.  Fortunately, by re-writting the original problem it is natural to arrive at following expression,

$\min_{\mathbf{w},b,\mathbf{\alpha}}\quad \sum_{i=1}^{N}[0, 1-y_{i}(\mathbf{w}^{T}\mathbf{x_{i}}+b)]_{+}+ \lambda \|\mathbf{w}\|^{2}$

where $[0, x]_{+}$ symbolizes the max between 0 and x.  From the loss function perspective, we see what SVM actually does is to minimize the sum of the hinge loss function (the tightest convex upper bound of 0-1 loss) on all samples plus a L-2 norm penalty.  This is an SRM expression we familiar with in most classification problems.

Inference form SVM Solutions

From the dual optimization process, we know the following expression of w (linear case),

$\mathbf w = \sum^{N}_{i=1} \alpha^{*}_{i} y_{i} ~\mathbf{x_{i}}$.

If we notice that the solution is decided only by for part of samples with $\alpha_{i} \neq 0$, then we can have a new understanding of the solution of the SVM. In other word, the position of the optimal super plane is not influenced by samples far away from the class boundary, provided they are correctly classified.  The solution is only determined by samples nearby the decision bound, and they are hereby called ‘supporting vectors’. This is a substantial distinction between SVM and other classification methods, which enhances the robustness of SVM when there are far-away outliers .

Furthermore, support vectors with distinct $\alpha_{i}$ value contribute differently weight to the solution. In the non-separable case,  $0 < \alpha_{i} < C$ correspond to those right at the margin of the super-plane. (Here I am still confused how come SV have different $alpha$ value and what does the difference mean? Any body knows why? ) $\alpha_{i} = C$ stands for samples located between the margin and the plane or right at the plane. Because they are nearer than the SVs to the plane, so they have larger weight and exert greater impact to the solution.

Probing SVM Classification form the Engineering Background

The solution to an SVM has some interesting explanations if we try to look at it from an engineering angle.  Previous section has already told us the solution is only determined by SVs, so the output with test sample x can be written as:

$y = \sum^{N}_{i=1} \alpha^{*}_{i} y_{i} <\mathbf{x_{i}} \cdot \mathbf{x_{j}}>$

where $N_{s}$ means the number of SVs. It can be inferred that the decision of SVM is composed by a series of comparisons between the input sample (vector) and supporting vectors. For example, for an unknown feature in a computer vision system, the SVM compares the feature with the linear combination of supporting features and then output based on those comparisons, regardless other non-supporting features. Therefore, SVM’s computation complexity only rely on the number of SVs on the margin, rather than increasing with the size of training data.

Methods for Solving SVM

Plenty of methods have been proposed to solve the SVM. For SVM with small data set, we can expect it be solved in an analytical way, since the matrix computation is available on most desktops. When SVM is faced with large data set, however, it can only be solved in numerical way. For the SVM learning process is a QP (quadratic optimization) problem, traditional QP solver in optimization is suitable for the SVM, such as quasi-newton method, gradient descent, interior method etc. Moreover, there are other methods based on SVM like SMO (one of my recent post on SMO is here) and chunck method.

Kernel Trick: from Linear to Non-Linear Cases

Kernel trick is a method used to construct a non-linear super plane to classify two kinds of samples in the space. The kernel function is a mapping from the low-dimension input space to a hign-dimension feature space. Here I use an analogy to explain how the kernel trick works. We all know from the plane geometry that two lines must intersect if they are not parallel. That is true in a 2-D space. However, if we elevate one of the line to another plane parallel to each other, the two lines are separated and do not intersect. That is the basic idea underlying the kernel method. Original samples that are non-linear separable in the input space can become linear separable in high dimension space. But how to find an appropriate kernel function is a topic out of the scope of this post.

Features of SVM

SVM is one of the most popular classification algo since its convenience and efficiency. Particularly, it has three distinctive features that are key to its popularity. Here we list them as follows,

• Feature selection and over-fitting avoidance : Since SVM can be solved as an optimization problem, so the features are automatically decided by the solution, where the value of $\alpha$ shows the weight of each feature in the model. The explanation on the over-fitting avoidance is needed. Please share with me your ideas.
• Robust to far-away outliners: since the optimal superplane is only decided by points near the plane, so the outliners that far way from the plane will not affect the result, and make the solution more robust.

REFERENCE

1. Burges, Christopher JC. “A Tutorial on Support Vector Machines for Pattern Recognition.” Data mining and knowledge discovery 2, no. 2 (1998): 121-67.

2. Platt, John. “Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines.”  (1998).

3. Joachims, Thorsten. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Springer, 1998.

4. Cortes, Corinna, and Vladimir Vapnik. “Support-Vector Networks.” Machine learning 20, no. 3 (1995): 273-97.

5. Boser, Bernhard E., Isabelle M. Guyon, and Vladimir N. Vapnik. “A Training Algorithm for Optimal Margin Classifiers.” In Proceedings of the fifth annual workshop on Computational learning theory, 144-52. Pittsburgh, Pennsylvania, USA: ACM, 1992.