The primal SVM problem is:

\[\min_{\boldsymbol w,b,\boldsymbol\xi} \frac{1}{2}\boldsymbol w^\top\boldsymbol w+C\sum_{i=1}^m\xi_i\,;\quad\text{s.t. }y_i f(\boldsymbol x_i) \ge 1-\xi_i,\ \xi_i \ge 0\,,\tag{1}\]

where the decision function $f(\boldsymbol x) \equiv \boldsymbol w^\top\phi(\boldsymbol x) + b$.
The Lagrangian is:

\[L(\boldsymbol w,b,\boldsymbol\xi,\boldsymbol\alpha,\boldsymbol\mu) = \frac{1}{2}\boldsymbol w^\top\boldsymbol w + C\sum_{i=1}^m\xi_i + \sum_{i=1}^m\alpha_i\big(1-\xi_i-y_i f(\boldsymbol x_i)\big) - \sum_{i=1}^m\mu_i\xi_i\,,\]

where $\alpha_i \ge 0$, $\mu_i \ge 0$.
The Karush-Kuhn-Tucker (KKT) conditions are:

\[\begin{cases}
\boldsymbol w=\sum_{i=1}^m\alpha_i y_i \phi(\boldsymbol x_i) &\text{(stationarity)}\\
0=\sum_{i=1}^m\alpha_i y_i &\text{(stationarity)}\\
C=\alpha_i+\mu_i &\text{(stationarity)}\\
0=\alpha_i(y_i f(\boldsymbol x_i)-1+\xi_i) &\text{(complementarity)}\\
0=\mu_i\xi_i &\text{(complementarity)}\\
y_i f(\boldsymbol x_i)-1+\xi_i \ge 0 &\text{(primal feasibility)}\\
\xi_i \ge 0 &\text{(primal feasibility)}\\
\alpha_i \ge 0 &\text{(dual feasibility)}\\
\mu_i \ge 0 &\text{(dual feasibility)}\\
\end{cases}\,.\]

Thus, combining the complementarity conditions with $C=\alpha_i+\mu_i$ (if $\alpha_i=0$ then $\mu_i>0$ and hence $\xi_i=0$; if $\alpha_i=C$ then $y_i f(\boldsymbol x_i)=1-\xi_i$ with $\xi_i\ge 0$; if $0<\alpha_i<C$ then $\xi_i=0$ and $y_i f(\boldsymbol x_i)=1-\xi_i$), we have

\[\begin{cases}
y_i f(\boldsymbol x_i) \ge 1 &(\alpha_i=0)\\
y_i f(\boldsymbol x_i) \le 1 &(\alpha_i=C)\\
y_i f(\boldsymbol x_i) = 1 &(\text{otherwise})\\
\end{cases}\,.\tag{2}\]

When $S=\{j \mid 0 < \alpha_j < C\} \neq \varnothing$, for each such $j$,

\[\begin{aligned}
y_j (\boldsymbol w^\top\phi(\boldsymbol x_j)+b) &= 1\,,\\
b &= y_j - \boldsymbol w^\top\phi(\boldsymbol x_j)\,;
\end{aligned}\]

the second equality holds since $y_j = \pm 1$, so that $1/y_j = y_j$.
For numerical stability, we take the mean of all the $b$'s as the final value of the intercept:

\[b = \frac{1}{|S|}\sum_{j \in S} (y_j-\boldsymbol w^\top\phi(\boldsymbol x_j))\,.\]

When $S=\varnothing$, taking the first two cases of Equation $(2)$, it follows that

\[\begin{cases}
f(\boldsymbol x_i) \ge 1 &(\alpha_i=0,\ y_i=1)\\
f(\boldsymbol x_i) \le -1 &(\alpha_i=0,\ y_i=-1)\\
f(\boldsymbol x_i) \le 1 &(\alpha_i=C,\ y_i=1)\\
f(\boldsymbol x_i) \ge -1 &(\alpha_i=C,\ y_i=-1)\\
\end{cases}\,.\]

Equivalently, we have

\[\max_{j \in T_1}\{y_j - \boldsymbol w^\top\phi(\boldsymbol x_j)\} \le b \le \min_{j \in T_2}\{y_j - \boldsymbol w^\top\phi(\boldsymbol x_j)\}\,,\]

where

\[\begin{cases}
T_1 = \{j \mid \alpha_j=0,\ y_j=1\text{ or }\alpha_j=C,\ y_j=-1\}\\
T_2 = \{j \mid \alpha_j=0,\ y_j=-1\text{ or }\alpha_j=C,\ y_j=1\}\\
\end{cases}\,.\]

The intercept is taken as the mean of the lower and upper bounds.

To compute $\boldsymbol w^\top\phi(\boldsymbol x)$ in the above equations, simply plug in $\boldsymbol w=\sum_{i=1}^m\alpha_i y_i \phi(\boldsymbol x_i)$ and evaluate each inner product $\phi(\boldsymbol x_i)^\top\phi(\boldsymbol x)$ with the underlying kernel function $\kappa(\boldsymbol x_i,\boldsymbol x)$.

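As a minimal illustrative sketch (not how any particular library does it), the averaged estimate above can be computed directly from kernel values. The names below are hypothetical: `alpha` holds all $m$ dual variables, `y` the labels in $\{-1,+1\}$, `K` the $m \times m$ kernel matrix with $K_{ij}=\kappa(\boldsymbol x_i,\boldsymbol x_j)$, and `C` the box constraint.

```python
import numpy as np

def intercept_primal(alpha, y, K, C, tol=1e-8):
    """Estimate b = mean_{j in S} (y_j - w^T phi(x_j)) using only kernel values.

    alpha : (m,) dual coefficients, y : (m,) labels in {-1, +1},
    K : (m, m) kernel matrix, C : box constraint.
    (Hypothetical helper; assumes the free set S is non-empty.)
    """
    # w^T phi(x_j) = sum_i alpha_i y_i K(x_i, x_j), for every sample j.
    wphi = K.T @ (alpha * y)                     # shape (m,)
    # Free support vectors: 0 < alpha_j < C (up to numerical tolerance).
    S = (alpha > tol) & (alpha < C - tol)
    # Average the per-sample estimates y_j - w^T phi(x_j) over S.
    return np.mean(y[S] - wphi[S])
```

For a linear kernel, `K` is simply `X @ X.T`.
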

Reference: Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1–27, 2011.

The dual SVM problem is:

\[\min_{\boldsymbol\alpha}\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_j y_i y_j \phi(\boldsymbol x_i)^\top\phi(\boldsymbol x_j)-\sum_{i=1}^m\alpha_i\,;\quad\text{s.t. }\sum_{i=1}^m\alpha_i y_i=0,\ 0 \le \alpha_i \le C\,.\tag{3}\]

The Lagrangian is:

\[\hat L(\boldsymbol\alpha,\beta,\boldsymbol\lambda,\boldsymbol\nu) = \frac{1}{2}\boldsymbol\alpha^\top\mathbf Q\boldsymbol\alpha - \boldsymbol\alpha^\top\mathbf 1+\beta\boldsymbol\alpha^\top\boldsymbol y-\boldsymbol\alpha^\top\boldsymbol\lambda+(\boldsymbol\alpha-C\mathbf 1)^\top\boldsymbol\nu\,,\tag{4}\]

where $\lambda_i \ge 0$, $\nu_i \ge 0$, $\mathbf 1$ is an all-ones vector, $\mathbf Q$ is the $m \times m$ matrix with entries $Q_{ij} = y_i\phi(\boldsymbol x_i)^\top\phi(\boldsymbol x_j)y_j$, and $\beta$ turns out to be the intercept.
We will take this for granted for now and justify it at the end.
The KKT conditions are:

\[\begin{cases}
\mathbf Q\boldsymbol\alpha=\mathbf 1-\beta\boldsymbol y+\boldsymbol\lambda-\boldsymbol\nu &\text{(stationarity)}\\
\lambda_i\alpha_i = 0 &\text{(complementarity)}\\
\nu_i(C-\alpha_i) = 0 &\text{(complementarity)}\\
\boldsymbol\alpha^\top\boldsymbol y = 0 &\text{(primal feasibility)}\\
0 \le \alpha_i \le C &\text{(primal feasibility)}\\
\lambda_i \ge 0 &\text{(dual feasibility)}\\
\nu_i \ge 0 &\text{(dual feasibility)}\\
\end{cases}\,.\tag{5}\]

Thus, using the complementarity conditions (if $\alpha_i=0$ then $\nu_i=0$ while $\lambda_i\ge 0$; if $\alpha_i=C$ then $\lambda_i=0$ while $\nu_i\ge 0$; otherwise both vanish), we have

\[\begin{cases}
(\mathbf Q\boldsymbol\alpha)_i \ge 1 - \beta y_i &(\alpha_i=0)\\
(\mathbf Q\boldsymbol\alpha)_i \le 1 - \beta y_i &(\alpha_i=C)\\
(\mathbf Q\boldsymbol\alpha)_i = 1 - \beta y_i &(\text{otherwise})\\
\end{cases}\,,\tag{6}\]

where $(\mathbf Q\boldsymbol\alpha)_i$ is the $i$th element of the vector $\mathbf Q\boldsymbol\alpha$.
When $S=\{j \mid 0 < \alpha_j < C\} \neq \varnothing$, for each such $j$,

\[\beta = y_j(1 - (\mathbf Q\boldsymbol\alpha)_j)\,,\]

which holds since $y_j = \pm 1$.
For numerical stability, we take the mean of all the $\beta$'s as the final value of the intercept:

\[\beta = \frac{1}{|S|}\sum_{j \in S} y_j (1 - (\mathbf Q\boldsymbol\alpha)_j)\,.\]

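A matching sketch for the dual-formulation estimate, with the same hypothetical `alpha`, `y`, `K`, `C` as before; here $\mathbf Q$ is formed explicitly as $\operatorname{diag}(\boldsymbol y)\,\mathbf K\,\operatorname{diag}(\boldsymbol y)$.

```python
import numpy as np

def intercept_dual(alpha, y, K, C, tol=1e-8):
    """Estimate beta = mean_{j in S} y_j * (1 - (Q alpha)_j).

    Q_ij = y_i * K_ij * y_j, so (Q alpha)_j = y_j * sum_i alpha_i y_i K_ij.
    (Hypothetical helper; assumes the free set S is non-empty.)
    """
    Q = (y[:, None] * K) * y[None, :]    # Q = diag(y) K diag(y)
    Q_alpha = Q @ alpha                  # shape (m,)
    S = (alpha > tol) & (alpha < C - tol)
    return np.mean(y[S] * (1.0 - Q_alpha[S]))
```

Since $y_j^2=1$, each per-sample estimate here coincides with the corresponding primal estimate $y_j-\boldsymbol w^\top\phi(\boldsymbol x_j)$, as shown in the equivalence section below.
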

When $S=\varnothing$, taking the first two cases of Equation $(6)$, it follows that

\[\begin{cases}
\beta \ge 1-(\mathbf Q\boldsymbol\alpha)_i &(\alpha_i=0,\ y_i=1)\\
\beta \le -(1-(\mathbf Q\boldsymbol\alpha)_i) &(\alpha_i=0,\ y_i=-1)\\
\beta \le 1-(\mathbf Q\boldsymbol\alpha)_i &(\alpha_i=C,\ y_i=1)\\
\beta \ge -(1-(\mathbf Q\boldsymbol\alpha)_i) &(\alpha_i=C,\ y_i=-1)\\
\end{cases}\,.\]

Equivalently, we have

\[\max_{j \in T_1}\{y_j(1-(\mathbf Q\boldsymbol\alpha)_j)\} \le \beta \le \min_{j \in T_2}\{y_j(1-(\mathbf Q\boldsymbol\alpha)_j)\}\,,\]

where

\[\begin{cases}
T_1 = \{j \mid \alpha_j=0,\ y_j=1\text{ or }\alpha_j=C,\ y_j=-1\}\\
T_2 = \{j \mid \alpha_j=0,\ y_j=-1\text{ or }\alpha_j=C,\ y_j=1\}\\
\end{cases}\,.\]

The intercept is taken as the mean of the lower and upper bounds.

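This fallback is the edge case that the experiment at the end of the post does not exercise (see the caveat there). Purely as a hedged sketch with the same hypothetical `alpha`, `y`, `K`, `C`, it might look like:

```python
import numpy as np

def intercept_dual_no_free(alpha, y, K, C, tol=1e-8):
    """Fallback estimate of beta when no alpha lies strictly between 0 and C.

    (Hypothetical helper; assumes both T1 and T2 are non-empty.)
    """
    Q_alpha = y * (K @ (alpha * y))     # (Q alpha)_j = y_j * sum_i alpha_i y_i K_ij
    g = y * (1.0 - Q_alpha)             # per-sample bound value y_j (1 - (Q alpha)_j)
    at_zero = alpha <= tol
    at_C = alpha >= C - tol
    T1 = (at_zero & (y == 1)) | (at_C & (y == -1))   # lower bounds on beta
    T2 = (at_zero & (y == -1)) | (at_C & (y == 1))   # upper bounds on beta
    return 0.5 * (np.max(g[T1]) + np.min(g[T2]))
```
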

$\beta$ is the intercept

To show that $\beta$ is in fact the intercept of the primal problem, we go further from Equation $(4)$: plugging in the stationarity condition of Equation $(5)$, it follows that

\[\hat L(\boldsymbol\alpha,\beta,\boldsymbol\lambda,\boldsymbol\nu) = -\frac{1}{2}\boldsymbol\alpha^\top\mathbf Q\boldsymbol\alpha-C\mathbf 1^\top\boldsymbol\nu\,,\]

where

\[\boldsymbol\alpha=\mathbf Q^{-1}(\mathbf 1-\beta\boldsymbol y+\boldsymbol\lambda-\boldsymbol\nu)\,,\]

assuming that the inverse of $\mathbf Q$ exists.
Due to the structure of $\mathbf Q$, there exists a matrix $\mathbf Q^\frac{1}{2}$,

\[\mathbf Q^\frac{1}{2} =
\begin{pmatrix}
y_1\phi(\boldsymbol x_1) & \dots & y_m\phi(\boldsymbol x_m)\\
\end{pmatrix},\]

such that $\mathbf Q=(\mathbf Q^\frac{1}{2})^\top\mathbf Q^\frac{1}{2}$.
Let $\boldsymbol w \triangleq \mathbf Q^\frac{1}{2}\boldsymbol\alpha=\sum_{i=1}^m\alpha_i y_i \phi(\boldsymbol x_i)$.
The stationarity condition of Equation $(5)$ can then be rewritten as:

\[\begin{aligned}
\mathbf Q\boldsymbol\alpha &= \mathbf 1-\beta\boldsymbol y+\boldsymbol\lambda-\boldsymbol\nu\\
(\mathbf Q^\frac{1}{2})^\top\boldsymbol w &= \mathbf 1-\beta\boldsymbol y+\boldsymbol\lambda-\boldsymbol\nu\\
y_i\phi(\boldsymbol x_i)^\top\boldsymbol w+\beta y_i &\ge 1-\nu_i\quad\forall 1 \le i \le m\ \text{(since }\lambda_i \ge 0\text{)}\\
y_i(\phi(\boldsymbol x_i)^\top\boldsymbol w+\beta) &\ge 1-\nu_i\\
\end{aligned}\]

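To make the factorization concrete, here is a small numerical check using made-up data and a linear kernel (so $\phi(\boldsymbol x)=\boldsymbol x$); it verifies both $\mathbf Q=(\mathbf Q^{\frac12})^\top\mathbf Q^{\frac12}$ and $\mathbf Q^{\frac12}\boldsymbol\alpha=\sum_i\alpha_i y_i\phi(\boldsymbol x_i)$. All arrays below are illustration data, not part of the derivation itself.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 6, 3
X = rng.normal(size=(m, d))            # rows are the samples x_i
y = rng.choice([-1.0, 1.0], size=m)    # labels in {-1, +1}
alpha = rng.uniform(0.0, 1.0, size=m)  # some dual coefficients

Q_half = y * X.T                       # column i is y_i * phi(x_i), shape (d, m)
Q = np.diag(y) @ (X @ X.T) @ np.diag(y)  # Q_ij = y_i <x_i, x_j> y_j

# Q factors as (Q^{1/2})^T Q^{1/2}.
assert np.allclose(Q, Q_half.T @ Q_half)
# Q^{1/2} alpha equals sum_i alpha_i y_i phi(x_i), i.e. the primal w.
assert np.allclose(Q_half @ alpha,
                   np.sum(alpha[:, None] * y[:, None] * X, axis=0))
```
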

Therefore, the dual of the dual problem is:

\[\max_{\boldsymbol w,\beta,\boldsymbol\nu}-\frac{1}{2}\boldsymbol w^\top\boldsymbol w-C\mathbf 1^\top\boldsymbol\nu\,;\quad\text{s.t. }y_i(\phi(\boldsymbol x_i)^\top\boldsymbol w+\beta) \ge 1-\nu_i,\ \nu_i \ge 0\,.\]

Clearly, $\beta$ is the intercept, and $\nu_i$ plays the role of the slack variable $\xi_i$ attached to each sample in the dataset.

Show that the two approaches are equivalent

Recall that in the primal and dual formulations,

\[\begin{aligned}
b &= y_j - \sum_{i=1}^m\alpha_i y_i \phi(\boldsymbol x_i)^\top\phi(\boldsymbol x_j) &\text{(primal formulation)}\\
b &= y_j (1-(\mathbf Q\boldsymbol\alpha)_j) &\text{(dual formulation)}\\
\end{aligned}\]

If we plug the definitions of $\boldsymbol w$ and $\mathbf Q$ into these, it follows that

\[\begin{aligned}
b &= y_j - \sum_{i=1}^m \alpha_i y_i \phi(\boldsymbol x_i)^\top\phi(\boldsymbol x_j)\,,\\
b &= y_j \Big(1 - y_j\sum_{i=1}^m \alpha_i y_i \phi(\boldsymbol x_i)^\top\phi(\boldsymbol x_j)\Big) = y_j - y_j^2\sum_{i=1}^m \alpha_i y_i \phi(\boldsymbol x_i)^\top\phi(\boldsymbol x_j)\,.\\
\end{aligned}\]

Since $y_j^2=1$, the two expressions are identical, so the two formulations yield the same intercept.

Verify the conclusion by experiment

We will need `numpy` and `scikit-learn` to perform the experiment.

Get to know the `SVC` class in scikit-learn's documentation.
In summary, given a classifier `clf = SVC(...).fit(X, y)`,


- `clf.dual_coef_` holds the product $y_i \alpha_i$ for each $\alpha_i > 0$;
- `clf.support_vectors_` holds the support vectors, of shape `(n_SV, n_features)` where `n_SV` is the number of support vectors;
- `clf.intercept_` holds the intercept term.

In addition,

- `clf.coef_` holds the $\boldsymbol w$ of the primal problem. We will use it for convenience below (it is available since we use a linear kernel).

Code:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
# Restrict the classification problem to two classes;
# otherwise, the problem becomes unnecessarily complex.
i = (y == 0) | (y == 2)
X, y = X[i], y[i]
# Make y take values in {0, 1} rather than {0, 2}.
y //= 2

clf = SVC(kernel='linear', random_state=123)
clf.fit(X, y)
# The labels of the support vectors.
# The `* 2 - 1` operation maps them to {-1, 1} rather than {0, 1}.
y_supp = y[clf.support_] * 2 - 1
# The filter that removes upper-bounded alphas (C defaults to 1 for SVC).
S = np.ravel(np.abs(clf.dual_coef_)) < 1

# Verify that `clf.coef_` is indeed computed from `clf.dual_coef_`.
# We'll use `clf.coef_` for convenience below.
assert np.allclose(
    np.ravel(clf.coef_),
    np.sum(np.ravel(clf.dual_coef_) * clf.support_vectors_.T, axis=1))
# The intercept estimates in the primal formulation. Only support vectors are
# required, since the dual coefficients of the other samples are zero and do
# not contribute.
b_estimates_primal = y_supp[S] - np.dot(clf.support_vectors_[S], np.ravel(clf.coef_))
### Verify that the mean of the estimates is indeed the intercept. ###
assert np.allclose(np.mean(b_estimates_primal), clf.intercept_)

# The kernel matrix restricted to support vectors.
K = np.dot(clf.support_vectors_, clf.support_vectors_.T)
# Q times alpha. Note that when computing Q alpha, only support vectors are
# required, for the same reason as above.
Q_alpha = np.sum(np.ravel(clf.dual_coef_)[:, np.newaxis] * K, axis=0) * y_supp
# The intercept estimates in the dual formulation.
b_estimates_dual = y_supp[S] * (1 - Q_alpha[S])
### Verify that the mean of the estimates is indeed the intercept. ###
assert np.allclose(clf.intercept_, np.mean(b_estimates_dual))
```

The following has already been mentioned in the comments above, but it is worth restating formally here.
Recall that $\boldsymbol w = \sum_{i=1}^m\alpha_i y_i \phi(\boldsymbol x_i)$, and that all $m$ $\alpha$'s appear when computing $\mathbf Q\boldsymbol\alpha$.
In fact, only the indices $i$ with $\alpha_i > 0$ (corresponding to the support vectors) are necessary.
That's why we are able to compute $\boldsymbol w$ and $\mathbf Q\boldsymbol\alpha$ even though scikit-learn stores only the data related to support vectors.


Caveat:
I find it quite hard to construct an example in which there are no free $\alpha$'s (i.e. no $\alpha_i$ such that $0 < \alpha_i < C$) at all.
So, strictly speaking, that edge case is not verified empirically in this post.