A Primer
There are many ways to perform classification, and one of the foundational techniques is the support vector machine (SVM). SVMs extend separating hyperplanes to classes that cannot be perfectly linearly separated.
For any nonzero $\beta \in \mathbb{R}^p$, we can represent a hyperplane through the origin in $\mathbb{R}^p$ as the set of vectors satisfying:
\[\{\mathbf{x} \mid \mathbf{x}^\top \beta = 0 \}\]We can then consider the hyperplane $L$ defined by $\beta_L = (\beta_0, \beta^\top)^\top$:
\[L = \{ \mathbf{x} \mid \mathbf{x}^\top \beta + \beta_0 = 0 \}\]Note that, by definition, $\mathbf{x}^\top \beta = -\beta_0$ for any $\mathbf{x} \in L$. Furthermore, for any $\mathbf{x}_1, \mathbf{x}_2 \in L$, we have that:
\[(\mathbf{x}_1 - \mathbf{x}_2)^\top \beta = \mathbf{x}_1^\top \beta - \mathbf{x}_2^\top \beta = - \beta_0 + \beta_0 = 0\]This implies that $\beta$ is orthogonal to the surface of $L$, so $\beta^* = \frac{\beta}{\rvert \rvert \beta \rvert \rvert_2}$ is the unit vector normal to $L$.
Let $\mathbf{x}$ be an arbitrary point in $L$, and now consider the (signed) distance between $\mathbf{x}_0 \in \mathbb{R}^{p}$ and $L$:
\[\begin{aligned} \frac{\beta^* \cdot (-(\mathbf{x} - \mathbf{x}_0))}{\rvert \rvert \beta^* \rvert \rvert_2} &= \frac{1}{\rvert \rvert \frac{\beta}{\rvert \rvert \beta \rvert \rvert_2} \rvert \rvert_2} (\beta^*)^\top (\mathbf{x}_0 - \mathbf{x}) \\ &= (\beta^*)^\top \mathbf{x}_0 - (\beta^*)^\top \mathbf{x} \\ &= \frac{1}{\rvert \rvert \beta \rvert \rvert_2} \left[ \beta^\top (\mathbf{x}_0 - \mathbf{x}) \right]\\ &= \frac{1}{\rvert \rvert \beta \rvert \rvert_2} \left[ \beta^\top \mathbf{x}_0 - (- \beta_0) \right]\\ &= \frac{1}{\rvert \rvert \beta \rvert \rvert_2} \left[ \beta^\top \mathbf{x}_0 + \beta_0 \right] \end{aligned}\]This shows that the function $f(\mathbf{x}) = \beta^\top \mathbf{x} + \beta_0$, whose zero set defines $L$ above, is proportional to the signed distance between $\mathbf{x}$ and $L$.
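As a quick numerical sanity check of this formula, here is a minimal numpy sketch; the hyperplane, the point $\mathbf{x}_0$, and the point chosen on $L$ are all arbitrary illustrative choices.

```python
import numpy as np

# Illustrative hyperplane in R^2: beta^T x + beta_0 = 0
beta = np.array([3.0, 4.0])
beta_0 = -5.0
x0 = np.array([2.0, 1.0])

# Signed distance via the formula derived above
signed_dist = (beta @ x0 + beta_0) / np.linalg.norm(beta)

# Direct check: project (x0 - x) onto the unit normal, for some x on L
x_on_L = np.array([0.0, 1.25])           # satisfies beta^T x + beta_0 = 0, since 4 * 1.25 = 5
beta_star = beta / np.linalg.norm(beta)  # unit normal to L
direct = beta_star @ (x0 - x_on_L)

print(signed_dist, direct)  # both equal 1.0
```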
Now, we’ll discuss the support vector classifier (SVC), which constructs the optimal separating hyperplane between two perfectly linearly separable classes. By optimal, we mean the hyperplane whose distance to the closest training point (the margin) is maximized.
Let’s say we have a training set of $N$ observations with class labels $y_i \in \{ -1, 1 \}$ and $p$-dimensional predictors $\mathbf{x}_i \in \mathbb{R}^{p}$. Let $\mathbf{X}$ be the $N \times p$ matrix with rows equal to the $\mathbf{x}_i$ vectors.
Suppose we have the hyperplane $L$ again, and we classify points with positive signed distance to $L$ into one class and those with negative signed distance into the other. We can represent this binary classifier as:
\[g(\mathbf{x}) = \text{sign}\left(\mathbf{x}^\top \beta + \beta_0 \right) = \begin{cases} 1 & \text{ if } \mathbf{x}^\top \beta + \beta_0 \geq 0 \\ -1 & \text{ else } \end{cases}\]To find the optimal hyperplane, we want to maximize the distance between the closest point and $L$. Notice that if $\mathbf{x}_i$ is correctly classified, then:
\[y_i(\mathbf{x}_i^\top \beta + \beta_0) \geq 0\]Thus, finding the optimal hyperplane amounts to solving:
\[\begin{equation} \label{eq:first-opt} \begin{aligned} &\underset{\beta, \beta_0, \rvert \rvert \beta \rvert \rvert_2 = 1}{\max} \left\{ M \right \} \\ &\text{ subject to } y_i(\mathbf{x}_i^\top \beta + \beta_0) \geq M \text{ } \forall i = 1, \dots, N \end{aligned} \end{equation}\]The constraints are unchanged if we rescale $(\beta, \beta_0)$ by a positive constant, so we can drop the restriction $\rvert \rvert \beta \rvert \rvert_2 = 1$ and instead set $\rvert \rvert \beta \rvert \rvert_2 = 1/M$; maximizing $M$ is then the same as minimizing $\rvert \rvert \beta \rvert \rvert_2$, and the problem is equivalent to:
\[\begin{aligned} &\underset{\beta, \beta_0}{\min} \left\{ \frac{1}{2}\rvert \rvert \beta \rvert \rvert_2^2 \right \} \\ &\text{ subject to } 0 \geq 1 - y_i(\mathbf{x}_i^\top \beta + \beta_0) \text{ } \forall i = 1, \dots, N \\ \end{aligned}\]Note that this is a convex optimization problem (quadratic objective with linear inequality constraints), so we can use Lagrangian duality to rewrite it as:
\[\begin{equation} \label{eq:lagrange-dual} \begin{aligned} &\underset{\lambda}{\max} \left\{ \underset{\beta, \beta_0}{\inf} \left\{ \frac{1}{2} \rvert \rvert\beta \rvert \rvert_2^2 + \sum_{i = 1}^N \lambda_i \left(1 - y_i (\mathbf{x}_i^\top \beta + \beta_0) \right) \right\} \right\} \\ &\text{ subject to } \lambda_i \geq 0 \text{ } \forall i = 1, \dots, N \end{aligned} \end{equation}\]Since the objective and all of the constraints are convex and continuously differentiable, the inner minimization is solved by setting the gradients of the Lagrangian with respect to $\beta$ and $\beta_0$ to zero. We have:
\[\begin{aligned} \frac{\partial }{\partial \beta} \left[ \frac{1}{2} \rvert \rvert\beta \rvert \rvert_2^2 + \sum_{i = 1}^N \lambda_i \left(1 - y_i (\mathbf{x}_i^\top \beta + \beta_0) \right) \right] &= \beta - \sum_{i= 1}^N \lambda_i y_i \mathbf{x}_i \\ \implies \beta &= \sum_{i = 1}^N \lambda_i y_i \mathbf{x}_i \\ \frac{\partial}{\partial \beta_0} \left[ \frac{1}{2} \rvert \rvert\beta \rvert \rvert_2^2 + \sum_{i = 1}^N \lambda_i \left(1 - y_i (\mathbf{x}_i^\top \beta + \beta_0) \right) \right] &= - \sum_{i = 1}^N \lambda_i y_i \\ \implies 0 &= \sum_{i = 1}^N \lambda_i y_i \end{aligned}\]We substitute these into the objective function in Eq. \eqref{eq:lagrange-dual} to get:
\[\begin{equation} \label{eq:wolfe-dual} \begin{aligned} &\underset{\lambda}{\max} \left\{ \sum_{i = 1}^N \lambda_i - \frac{1}{2} \sum_{i = 1}^N \sum_{j = 1}^N \lambda_i \lambda_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j \right\} \\ &\text{ subject to } \sum_{i = 1}^N \lambda_i y_i = 0 \text{ and } \lambda_i \geq 0 \text{ } \forall i = 1, \dots, N \end{aligned} \end{equation}\]This is a quadratic optimization problem with a linear constraint and a cone (the non-negative orthant) constraint, and it can be solved with standard packages (e.g. cvxopt).
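To make this concrete, here is a minimal sketch of how the dual in Eq. \eqref{eq:wolfe-dual} could be handed to cvxopt. The toy data, the tiny ridge added to $P$ for numerical stability, and all variable names are illustrative choices. cvxopt's QP solver minimizes $\frac{1}{2} \lambda^\top P \lambda + q^\top \lambda$ subject to $G \lambda \leq h$ and $A \lambda = b$, so we simply negate the objective.

```python
import numpy as np
from cvxopt import matrix, solvers

solvers.options["show_progress"] = False  # silence solver output

# Toy, linearly separable data (purely illustrative)
X = np.array([[1.0, 1.0], [2.0, 2.5], [0.0, 2.0],
              [-1.0, -1.0], [-2.0, -1.5], [0.0, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
N = X.shape[0]

# Maximizing the dual is minimizing its negation:
#   min (1/2) lambda^T P lambda + q^T lambda
P = matrix(np.outer(y, y) * (X @ X.T) + 1e-8 * np.eye(N))  # P_ij = y_i y_j x_i^T x_j (+ tiny ridge)
q = matrix(-np.ones(N))                                    # corresponds to -sum_i lambda_i
G = matrix(-np.eye(N))                                     # -lambda_i <= 0, i.e. lambda_i >= 0
h = matrix(np.zeros(N))
A = matrix(y.reshape(1, -1))                               # sum_i lambda_i y_i = 0
b = matrix(0.0)

sol = solvers.qp(P, q, G, h, A, b)
lam = np.ravel(sol["x"])
print(np.round(lam, 4))
```

For this toy data, only two of the $\lambda_i$ should come out numerically nonzero, which previews the discussion of support vectors below.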
In order to be optimal, the solution must satisfy the Karush-Kuhn-Tucker (KKT) conditions: stationarity (the gradient conditions above), primal feasibility ($y_i(\mathbf{x}_i^\top \beta + \beta_0) - 1 \geq 0$), dual feasibility ($\lambda_i \geq 0$), and complementary slackness ($\lambda_i \left[ y_i(\mathbf{x}_i^\top \beta + \beta_0) - 1 \right] = 0$).
The stationarity condition states that:
\[\beta = \sum_{i = 1}^N \lambda_i y_i \mathbf{x}_i\]which shows that the optimal $\beta$ must be a linear combination of the $\mathbf{x}_i$ vectors. The complementary slackness condition implies that if $\lambda_i > 0$, then $y_i(\mathbf{x}_i^\top \beta + \beta_0) = 1$. On the other hand, if $\lambda_i = 0$, then $\mathbf{x}_i$ does not contribute to $\beta$. Thus, the optimal $\beta$ is a linear combination of only some of the $\mathbf{x}_i$ vectors; these are called the support vectors.
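Continuing the cvxopt sketch above, stationarity gives $\hat{\beta}$ directly from the dual solution, and the support vectors are the points whose $\lambda_i$ are numerically nonzero (the tolerance below is an arbitrary choice):

```python
tol = 1e-6                # arbitrary numerical tolerance
support = lam > tol       # boolean mask of the support vectors
beta_hat = (lam * y) @ X  # stationarity: beta = sum_i lambda_i y_i x_i
print(np.where(support)[0], np.round(beta_hat, 4))
```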
Thus, the optimal separating hyperplane (i.e. the support vector classifier) is the one represented by:
\[\begin{aligned} G(\mathbf{x}) &= \text{sign}\left(\mathbf{x}^\top \hat{\beta} + \hat{\beta}_0 \right) \\ &= \text{sign}\left( \mathbf{x}^\top \sum_{i = 1}^N \lambda_i y_i \mathbf{x}_i + \hat{\beta}_0 \right) \\ &= \text{sign}\left( \sum_{i = 1}^N \lambda_i y_i \mathbf{x}^\top \mathbf{x}_i + \hat{\beta}_0 \right) \\ \end{aligned}\]where the $\lambda_i$ solve the dual problem in Eq. \eqref{eq:wolfe-dual}, $\hat{\beta}$ follows from the stationarity condition, and $\hat{\beta}_0$ is recovered from complementary slackness applied to any support vector.
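To finish the sketch, $\hat{\beta}_0$ can be recovered from complementary slackness ($y_i(\mathbf{x}_i^\top \hat{\beta} + \hat{\beta}_0) = 1$ on the support vectors); averaging over all support vectors rather than using a single one is a common choice for numerical stability.

```python
# For support vectors, y_i (x_i^T beta + beta_0) = 1, so beta_0 = y_i - x_i^T beta
# (using y_i in {-1, 1}); average over the support vectors for stability.
beta_0_hat = np.mean(y[support] - X[support] @ beta_hat)

def G(x_new):
    """Support vector classifier: sign(x^T beta_hat + beta_0_hat)."""
    return np.sign(x_new @ beta_hat + beta_0_hat)

print(G(np.array([1.5, 1.0])), G(np.array([-1.0, -2.0])))  # expected: 1.0 -1.0
```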
The support vector machine extends the support vector classifier to classes that are not linearly separable (in the feature space). In this case, we must adjust the optimization problem because there is no longer a separating hyperplane (let alone an optimal one).
To get around this fact, we introduce slack variables $\xi = (\xi_1, \xi_2, \dots, \xi_N)^\top$. We adjust the constraint to include these slack variables:
\[\begin{equation} \label{eq:svm-1} y_i(\mathbf{x}_i^\top \beta + \beta_0) \geq M(1 - \xi_i) \end{equation}\]where $\xi_i \geq 0$ for all $i = 1, \dots, N$ and $\sum_{i = 1}^N \xi_i \leq c$ for some constant $c$. By incorporating $\xi$ in this way, we permit each observation to cross its margin by an amount proportional to $\xi_i$ (namely, by $M \xi_i$). The constraint that the slack variables sum to at most $c$ bounds the total amount of margin violation. Since a misclassification occurs only if $\xi_i > 1$, the number of misclassifications cannot exceed $c$.
Re-expressing the bound on $\sum_{i = 1}^N \xi_i$ as a cost $C$ in the objective and passing to the Wolfe dual exactly as before (the multipliers on the slack constraints drop out, leaving only a box constraint on $\lambda$), we can write the optimization problem as:
\[\begin{equation} \label{eq:svm-2} \begin{aligned} &\underset{\lambda}{\max} \left\{ \sum_{i = 1}^N \lambda_i -\frac{1}{2} \sum_{i = 1}^N \sum_{j = 1}^N \lambda_i \lambda_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j \right\} \\ &\text{ subject to } \sum_{i = 1}^N \lambda_i y_i = 0 \text{ and } 0 \leq \lambda_i \leq C \text{ } \forall i = 1, \dots, N \end{aligned} \end{equation}\]The KKT conditions are analogous to those above; in particular, complementary slackness now requires $\lambda_i \left[ y_i(\mathbf{x}_i^\top \beta + \beta_0) - (1 - \xi_i) \right] = 0$ and $\mu_i \xi_i = 0$, where $\mu_i = C - \lambda_i \geq 0$ is the multiplier on the constraint $\xi_i \geq 0$.
By a similar argument to the SVC case, the optimal $\beta$ is given by:
\[\hat{\beta} = \sum_{i = 1}^N \lambda_i y_i \mathbf{x}_i\]By the complementary slackness requirement, $\lambda_i > 0$ only for those points $\mathbf{x}_i$ that satisfy:
\[y_i(\mathbf{x}_i^\top \beta + \beta_0) = 1 - \xi_i\]These are called the support vectors. Those with $\xi_i = 0$ lie exactly on the margin; those with $\xi_i > 0$ have $\lambda_i = C$ and cross their margin, remaining correctly classified if $\xi_i < 1$ and being misclassified if $\xi_i > 1$.
The support vector machine is the classifier:
\[\begin{aligned} G(\mathbf{x}) &= \text{sign} \left( \mathbf{x}^\top \hat{\beta} + \hat{\beta}_0 \right) \\ &= \text{sign} \left( \mathbf{x}^\top \sum_{i = 1}^N \lambda_i y_i\mathbf{x}_i + \hat{\beta}_0 \right) \\ &= \text{sign} \left( \sum_{i = 1}^N \lambda_i y_i \mathbf{x}^\top \mathbf{x}_i + \hat{\beta}_0 \right) \\ \end{aligned}\]where $C$ is left as a tuning parameter, the $\lambda_i$ solve Eq. \eqref{eq:svm-2}, and $\hat{\beta}$ and $\hat{\beta}_0$ are recovered from the KKT conditions as before.
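In the cvxopt sketch from earlier, the only change needed for the soft-margin problem in Eq. \eqref{eq:svm-2} is the box constraint $0 \leq \lambda_i \leq C$; the value of $C$ below is an arbitrary illustrative choice.

```python
C = 1.0  # cost/tuning parameter (arbitrary choice here)

# Stack -lambda_i <= 0 and lambda_i <= C into a single G lam <= h block
G_soft = matrix(np.vstack([-np.eye(N), np.eye(N)]))
h_soft = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))

sol_soft = solvers.qp(P, q, G_soft, h_soft, A, b)
lam_soft = np.ravel(sol_soft["x"])
print(np.round(lam_soft, 4))
```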
The SVM can be extended beyond linear decision boundaries by replacing the Euclidean inner product $\mathbf{x}^\top \mathbf{x}_i$ in the classifier above with a kernel evaluation, known as the kernel trick. The classifier then becomes:
\[G(\mathbf{x}) = \text{sign} \left( \sum_{i = 1}^N \lambda_i y_i \mathcal{K}(\mathbf{x}, \mathbf{x}_i) + \hat{\beta}_0 \right)\]for some choice of symmetric positive semi-definite kernel function $\mathcal{K}(\cdot, \cdot)$, which implicitly computes inner products $\mathcal{K}(\mathbf{x}, \mathbf{x}') = \langle h(\mathbf{x}), h(\mathbf{x}') \rangle$ in an enlarged feature space given by some transformation $h(\cdot)$.
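Continuing the soft-margin sketch, the kernel trick only changes the Gram matrix entering $P$ and the decision function. Below is a minimal sketch with a radial basis function kernel; the bandwidth $\gamma$ is an arbitrary choice, and $\hat{\beta}_0$ is averaged over the margin support vectors ($0 < \lambda_i < C$), which assumes at least one such point exists.

```python
def rbf_kernel(X1, X2, gamma=1.0):
    """K(u, v) = exp(-gamma * ||u - v||^2) for all pairs of rows of X1 and X2."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

K = rbf_kernel(X, X)
P_kern = matrix(np.outer(y, y) * K + 1e-8 * np.eye(N))  # x_i^T x_j replaced by K(x_i, x_j)
sol_kern = solvers.qp(P_kern, q, G_soft, h_soft, A, b)
lam_kern = np.ravel(sol_kern["x"])

# beta_0: average complementary slackness over margin support vectors (0 < lambda_i < C)
margin = (lam_kern > tol) & (lam_kern < C - tol)
beta_0_kern = np.mean(y[margin] - (lam_kern * y) @ K[:, margin])

def G_kernel(x_new):
    """Kernel SVM: sign(sum_i lambda_i y_i K(x_new, x_i) + beta_0)."""
    k = rbf_kernel(x_new[None, :], X).ravel()
    return np.sign((lam_kern * y) @ k + beta_0_kern)

print(G_kernel(np.array([1.5, 1.0])), G_kernel(np.array([-1.0, -2.0])))
```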