LD = Σ_{i=1}^{n} λi − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} λi λj yi yj (Xi · Xj).    (10.50)
The dual problem maximizes LD subject to the constraints λi ≥ 0 and Σ_{i=1}^{n} λi yi = 0.
Note that LD is expressed only in terms of λi, the class labels, and the pairwise dot products Xi · Xj between training data points. Therefore, solving for the Lagrangian multipliers requires knowledge of only the class labels and the dot products between training instances; it does not require direct knowledge of the feature values Xi. The dot products between training data points can be viewed as a kind of similarity between the points, which can easily be defined for data types beyond numeric domains. This observation is important for generalizing linear SVMs to nonlinear decision boundaries and arbitrary data types with the kernel trick.
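To make this concrete, the following sketch evaluates LD of Eq. 10.50 from nothing more than the multipliers, the labels, and a precomputed matrix of pairwise dot products. The function name dual_objective and the NumPy representation are assumptions made for this illustration, not part of the text.

```python
import numpy as np

def dual_objective(lam, y, K):
    """Evaluate L_D of Eq. 10.50.

    lam : (n,) array of Lagrangian multipliers lambda_i
    y   : (n,) array of class labels in {-1, +1}
    K   : (n, n) matrix of pairwise dot products, K[i, j] = X_i . X_j

    Only K and y are needed; the feature vectors X_i never appear.
    """
    v = lam * y                          # elementwise lambda_i * y_i
    return lam.sum() - 0.5 * v @ K @ v   # sum_i lambda_i - (1/2) sum_ij v_i K_ij v_j
```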
The value of b can be derived from the constraints in the original SVM formulation, for which the Lagrangian multipliers λr are strictly positive. For these training points, the margin constraint yr(W · Xr + b) = +1 is satisfied exactly according to the Kuhn–Tucker conditions. The value of b can be derived from any such training point (Xr, yr) as follows:
yr (W · Xr + b) = +1    ∀r : λr > 0    (10.51)

yr ( Σ_{i=1}^{n} λi yi (Xi · Xr) + b ) = +1    ∀r : λr > 0.    (10.52)
The second relationship is derived by substituting the expression for W in terms of the Lagrangian multipliers according to Eq. 10.49. Note that this relationship is expressed only in terms of Lagrangian multipliers, class labels, and dot products between training instances. The value of b can then be obtained from this equation. To reduce numerical error, the value of b may be averaged over all the support vectors with λr > 0.
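As a concrete illustration, the sketch below recovers b from Eq. 10.52 and averages it over all support vectors. The helper name bias_from_support_vectors, the tolerance parameter tol, and the use of a precomputed dot-product matrix K are assumptions of this example.

```python
import numpy as np

def bias_from_support_vectors(lam, y, K, tol=1e-8):
    """Solve Eq. 10.52 for b and average over the support vectors.

    For any r with lambda_r > 0, Eq. 10.52 gives
        b = y_r - sum_i lambda_i * y_i * (X_i . X_r),
    since 1 / y_r = y_r when y_r is +1 or -1.
    """
    sv = np.flatnonzero(lam > tol)                      # indices with lambda_r > 0
    b_values = [y[r] - (lam * y) @ K[:, r] for r in sv]
    return float(np.mean(b_values))                     # averaging reduces numerical error
```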
For a test instance Z, its class label F(Z) is defined by the decision boundary obtained by substituting for W in terms of the Lagrangian multipliers (Eq. 10.49):

F(Z) = sign{W · Z + b} = sign{ Σ_{i=1}^{n} λi yi (Xi · Z) + b }.    (10.53)
It is interesting to note that F(Z) can be fully expressed in terms of the dot product between training instances and test instances, class labels, Lagrangian multipliers, and bias b. Because the Lagrangian multipliers λi and b can also be expressed in terms of the dot products between training instances, it follows that the classification can be fully performed using knowledge of only the dot product between different instances (training and test), without knowing the exact feature values of either the training or the test instances.
The observations about dot products are crucial in generalizing SVM methods to nonlinear decision boundaries and arbitrary data types with the use of a technique known as the kernel trick. This technique simply substitutes dot products with kernel similarities (cf. Sect. 10.6.4).
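A minimal sketch of this prediction rule is shown below, assuming a Python/NumPy setting; the kernel argument defaults to the ordinary dot product, and swapping in any other similarity function is exactly the substitution described above.

```python
import numpy as np

def predict(Z, X, y, lam, b, kernel=np.dot):
    """Classify a test instance Z with Eq. 10.53.

    Only pairwise similarities kernel(X[i], Z) are used, so replacing the
    dot product with another kernel changes the decision boundary without
    touching the rest of the code.
    """
    score = sum(lam[i] * y[i] * kernel(X[i], Z) for i in range(len(y))) + b
    return 1 if score >= 0 else -1
```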
It is noteworthy from the derivation of W (see Eq. 10.49) and the aforementioned derivation of b that only training data points that are support vectors (with λr > 0) are used to define the solution W and b in SVM optimization. As discussed in Chap. 11, this observation is leveraged by scalable SVM classifiers, such as SVMLight. Such classifiers shrink the size of the problem by discarding irrelevant training data points that are easily identified to be far away from the separating hyperplanes.
10.6.1.1 Solving the Lagrangian Dual
The Lagrangian dual LD may be optimized by using the gradient ascent technique in terms of the n-dimensional parameter vector λ.
∂LD/∂λi = 1 − yi Σ_{j=1}^{n} yj λj (Xi · Xj)    (10.54)
Therefore, as in logistic regression, the corresponding gradient-based update equation is as follows:
(λ1 . . . λn) ← (λ1 . . . λn) + α (∂LD/∂λ1 . . . ∂LD/∂λn).    (10.55)
The step size α may be chosen to maximize the improvement in the objective function. The initial solution can be chosen to be the vector of zeros, which is also a feasible solution for λ.
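In a vectorized form, one unconstrained update of Eq. 10.55 might look as follows. The fixed step size alpha and the precomputed dot-product matrix K are assumptions of this sketch, and the constraints are deliberately ignored here because their handling is discussed next.

```python
import numpy as np

def gradient_ascent_step(lam, y, K, alpha):
    """One update of Eq. 10.55, ignoring the constraints for the moment.

    grad[i] = 1 - y_i * sum_j y_j * lambda_j * (X_i . X_j)   (Eq. 10.54)
    """
    grad = 1.0 - y * (K @ (lam * y))
    return lam + alpha * grad

# The zero vector is a feasible starting point for lambda, e.g.:
# lam = np.zeros(len(y))
```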
One problem with this update is that the constraints λi ≥ 0 and Σ_{i=1}^{n} λi yi = 0 may be violated after an update. Therefore, the gradient vector is projected along the hyperplane Σ_{i=1}^{n} λi yi = 0 before the update to create a modified gradient vector. Note that the projection of the gradient ∇LD along the normal to this hyperplane is simply H = (y · ∇LD) y, where y is the unit vector (1/√n)(y1 . . . yn). This component is subtracted from ∇LD to create a modified gradient vector G = ∇LD − H. Because of the projection, updating along the modified gradient vector G will not violate the constraint Σ_{i=1}^{n} λi yi = 0. In addition, any negative values of λi after an update are reset to 0.
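Putting the projection and the nonnegativity reset together, one step of this constrained ascent could be sketched as follows; as before, the precomputed dot-product matrix K and the function name are illustrative assumptions.

```python
import numpy as np

def projected_gradient_step(lam, y, K, alpha):
    """One constrained update of the dual variables.

    The gradient of Eq. 10.54 is projected onto the hyperplane
    sum_i lambda_i * y_i = 0 by removing its component along the unit
    vector y / sqrt(n), and negative multipliers are reset to 0.
    """
    n = len(y)
    grad = 1.0 - y * (K @ (lam * y))          # gradient of L_D (Eq. 10.54)
    y_unit = y / np.sqrt(n)                   # unit normal of the constraint hyperplane
    H = (y_unit @ grad) * y_unit              # component of the gradient along the normal
    G = grad - H                              # modified gradient vector
    return np.maximum(lam + alpha * G, 0.0)   # update and reset negative values to 0
```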
Note that the constraint Σ_{i=1}^{n} λi yi = 0 is derived by setting the gradient of LP with respect to b to 0. In some alternative formulations of SVMs, the bias b can be included within W by adding a synthetic dimension to the data with a constant value of 1. In such cases, the gradient vector update is simplified to Eq. 10.55 because one no longer needs to worry about the constraint Σ_{i=1}^{n} λi yi = 0.