[Figure 10.7: Hard and soft SVMs. (a) Hard separation, showing classes A and B, hyperplane 1, hyperplane 2, support vectors, margins (dashed lines), and a test instance. (b) Soft separation, showing margin violation with penalty-based slack variables.]
Consider a hyperplane that cleanly separates two linearly separable classes. The margin of the hyperplane is defined as the sum of its distances to the closest training points belonging to each of the two classes on opposite sides of the hyperplane. A further assumption is that the distance of the separating hyperplane to its closest training point of either class is the same. With respect to the separating hyperplane, it is possible to construct parallel hyperplanes that touch the training data of opposite classes on either side, and have no data point between them. The training data points on these hyperplanes are referred to as the support vectors, and the distance between the two hyperplanes is the margin. The separating hyperplane, or decision boundary, is precisely in the middle of these two hyperplanes in order to achieve the most accurate classification. The margins for hyperplane 1 and hyperplane 2 are illustrated in Fig. 10.7a by dashed lines. It is evident that the margin for hyperplane 1 is larger than that for hyperplane 2. Therefore, the former hyperplane provides better generalization power for unseen test instances in the "difficult" uncertain region separating the two classes where classification errors are most likely. This is also consistent with our earlier example-based observation about the more accurate classification with hyperplane 1.
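As an illustration of this definition, the following sketch computes the margin of a given hyperplane as the sum of its perpendicular distances to the closest training point of each class; the toy data and the hyperplane coefficients are hypothetical values chosen only for illustration.

    import numpy as np

    # Hypothetical 2-dimensional training points and labels in {-1, +1}.
    X = np.array([[3.0, 3.0], [4.0, 2.5], [-2.0, -1.0], [-3.0, -2.0]])
    y = np.array([1, 1, -1, -1])

    # Hypothetical hyperplane W . X + b = 0.
    W = np.array([1.0, 1.0])
    b = -1.0

    # Perpendicular distance of each point to the hyperplane: |W . X + b| / ||W||.
    dist = np.abs(X @ W + b) / np.linalg.norm(W)

    # Margin: sum of the distances to the closest point of each class.
    margin = dist[y == 1].min() + dist[y == -1].min()
    print(margin)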
How do we determine the maximum margin hyperplane? The way to do this is to set up a nonlinear programming optimization formulation that maximizes the margin by expressing it as a function of the coefficients of the separating hyperplane. The optimal coefficients can be determined by solving this optimization problem. Let the n data points in the training set
be denoted by (X1, y1) . . . (Xn, yn), where Xi is a d-dimensional row vector corresponding to the ith data point, and yi ∈ {−1, +1} is the binary class variable of the ith data point. Then, the separating hyperplane is of the following form:

W · X + b = 0    (10.35)
Here, W = (w1 . . . wd) is the d-dimensional row vector representing the normal direction to the hyperplane, and b is a scalar, also known as the bias. The vector W regulates the orientation of the hyperplane and the bias b regulates the distance of the hyperplane from the origin. The (d + 1) coefficients corresponding to W and b need to be learned from the training data to maximize the margin of separation between the two classes. Because it
is assumed that the classes are linearly separable, such a hyperplane can also be assumed to exist. All data points Xi with yi = +1 will lie on one side of the hyperplane satisfying W · Xi + b ≥ 0. Similarly, all points with yi = −1 will lie on the other side of the hyperplane satisfying W · Xi + b ≤ 0.
W · Xi + b ≥ 0    ∀i : yi = +1    (10.36)

W · Xi + b ≤ 0    ∀i : yi = −1    (10.37)
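To make these side constraints concrete, the sketch below checks whether a candidate hyperplane satisfies Eqs. 10.36 and 10.37 on a training set; the arrays X, y, W, and b are hypothetical placeholders.

    import numpy as np

    # Hypothetical training data, labels in {-1, +1}, and candidate hyperplane (W, b).
    X = np.array([[3.0, 3.0], [4.0, 2.5], [-2.0, -1.0], [-3.0, -2.0]])
    y = np.array([1, 1, -1, -1])
    W = np.array([1.0, 1.0])
    b = -1.0

    scores = X @ W + b  # W . Xi + b for every training point

    # Eq. 10.36 for the positive class, Eq. 10.37 for the negative class.
    separated = np.all(scores[y == 1] >= 0) and np.all(scores[y == -1] <= 0)
    print(separated)  # True if the hyperplane places each class on its own side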
These constraints do not yet incorporate the margin requirements on the data points. A stronger set of constraints is defined using these margin requirements. It may be assumed that the separating hyperplane W · X + b = 0 is located in the center of the two margin-defining hyperplanes. Therefore, the two symmetric hyperplanes touching the support vectors can be expressed by introducing another parameter c that regulates the distance between them.
W · X + b = +c    (10.38)

W · X + b = −c    (10.39)
It is possible to assume, without loss of generality, that the variables W and b are appropriately scaled, so that the value of c can be set to 1. This is because dividing both sides of Eqs. 10.38 and 10.39 by the positive value c simply rescales W and b to W/c and b/c without changing the underlying hyperplanes. Therefore, the two separating hyperplanes can be expressed in the following form:
W · X + b = +1    (10.40)

W · X + b = −1.    (10.41)
These constraints are referred to as margin constraints. The two hyperplanes segment the data space into three regions. It is assumed that no training data points lie in the uncertain decision boundary region between these two hyperplanes, and all training data points for each class are mapped to one of the two remaining (extreme) regions. This can be expressed as pointwise constraints on the training data points as follows:
W · Xi + b ≥ +1    ∀i : yi = +1    (10.42)

W · Xi + b ≤ −1    ∀i : yi = −1.    (10.43)
Note that the constraints for both the positive and negative classes can be written in the following succinct and algebraically convenient, but rather cryptic, form:
yi(W · Xi + b) ≥ +1    ∀i.    (10.44)
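The convenience of this form is that the two class-specific conditions of Eqs. 10.42 and 10.43 collapse into a single elementwise check, as in the brief sketch below; the data and the candidate hyperplane are again hypothetical.

    import numpy as np

    # Hypothetical training data, labels in {-1, +1}, and candidate hyperplane (W, b).
    X = np.array([[3.0, 3.0], [4.0, 2.5], [-2.0, -1.0], [-3.0, -2.0]])
    y = np.array([1, 1, -1, -1])
    W = np.array([0.4, 0.4])
    b = -0.4

    # Eq. 10.44: yi (W . Xi + b) >= 1 for every i, covering both classes at once.
    feasible = np.all(y * (X @ W + b) >= 1)
    print(feasible)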
The distance between the two hyperplanes for the positive and negative instances is also referred to as the margin. As discussed earlier, the goal is to maximize this margin. What is the distance (or margin) between these two parallel hyperplanes? One can use linear algebra to show that the distance between two parallel hyperplanes is the normalized difference between their constant terms, where the normalization factor is the L2-norm
||W|| = √(w1² + . . . + wd²) of the coefficients. Because the difference between the constant terms of the two aforementioned hyperplanes is 2, it follows that the distance between them is 2/||W||. This is the margin that needs to be maximized with respect to the aforementioned constraints. This form of the objective function is inconvenient because it incorporates a square root in the denominator of the objective function. However, maximizing 2/||W|| is the same as minimizing ||W||²/2. This is a convex quadratic programming problem, because the quadratic objective function ||W||²/2 needs to be minimized subject to a set of linear constraints.
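As a concrete illustration, the following sketch solves this hard-margin formulation directly as a quadratic program. It assumes the NumPy and cvxpy libraries are available, and the toy training data are hypothetical.

    import numpy as np
    import cvxpy as cp

    # Hypothetical linearly separable training data with labels in {-1, +1}.
    X = np.array([[3.0, 3.0], [4.0, 2.5], [-2.0, -1.0], [-3.0, -2.0]])
    y = np.array([1, 1, -1, -1])
    n, d = X.shape

    W = cp.Variable(d)  # normal vector of the separating hyperplane
    b = cp.Variable()   # bias term

    # Minimize ||W||^2 / 2 subject to the margin constraints yi (W . Xi + b) >= 1.
    objective = cp.Minimize(0.5 * cp.sum_squares(W))
    constraints = [cp.multiply(y, X @ W + b) >= 1]
    cp.Problem(objective, constraints).solve()

    print("W =", W.value, "b =", b.value, "margin =", 2 / np.linalg.norm(W.value))

The reported margin 2/||W|| is exactly the quantity that the optimization maximizes, so larger values indicate a wider separation between the two margin-defining hyperplanes on this toy data.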