Data Mining: The Textbook




O = \frac{||W||^2}{2} + C \sum_{i=1}^{n} \xi_i. \qquad (10.56)

As before, this is a convex quadratic optimization problem that can be solved using Lagrangian methods. A similar approach is used to set up the Lagrangian relaxation of the problem with penalty terms and additional multipliers βi ≥ 0 for the slack constraints ξi ≥ 0:

L_P = \frac{||W||^2}{2} + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \lambda_i \left[ y_i (W \cdot X_i + b) - 1 + \xi_i \right] - \sum_{i=1}^{n} \beta_i \xi_i. \qquad (10.57)
A similar approach to the hard SVM case can be used to eliminate the minimization variables W , ξi, and b from the optimization formulation and create a purely maximization dual formulation. This is achieved by setting the gradient of LP with respect to these variables to 0. By setting the gradients of LP with respect to W and b to 0, it can be respectively shown that the value of W is identical to the hard-margin case (Eq. 10.49), and the same





multiplier constraint Σ_{i=1}^{n} λiyi = 0 is satisfied. This is because the additional slack terms in LP involving ξi do not affect the respective gradients with respect to W and b. Furthermore, it can be shown that the objective function LD of the Lagrangian dual in the soft-margin case is identical to that of the hard-margin case, according to Eq. 10.50, because the linear terms involving each ξi evaluate⁵ to 0. The only change to the dual optimization problem is that the nonnegative Lagrangian multipliers satisfy additional constraints of the form C − λi = βi ≥ 0. This constraint is derived by setting the partial derivative of LP with respect to ξi to 0. One way of viewing this additional constraint λi ≤ C is that the influence of any training data point Xi on the weight vector W = Σ_{i=1}^{n} λiyiXi is capped by C because of the softness of the margin. The dual problem in soft SVMs maximizes LD (Eq. 10.50) subject to the constraints 0 ≤ λi ≤ C and Σ_{i=1}^{n} λiyi = 0.
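For reference, the stationarity conditions used in this paragraph follow by differentiating Eq. 10.57 with respect to W, b, and ξi (a routine computation that the text leaves implicit):

\frac{\partial L_P}{\partial W} = W - \sum_{i=1}^{n} \lambda_i y_i X_i = 0, \qquad \frac{\partial L_P}{\partial b} = -\sum_{i=1}^{n} \lambda_i y_i = 0, \qquad \frac{\partial L_P}{\partial \xi_i} = C - \lambda_i - \beta_i = 0.

The first condition recovers Eq. 10.49, the second yields the multiplier constraint, and the third gives βi = C − λi, from which λi ≤ C follows because βi ≥ 0.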



















The Kuhn–Tucker optimality conditions for the slack nonnegativity constraints are βiξi = 0. Because we have already derived βi = C − λi, we obtain (C − λi)ξi = 0. In other words, training points Xi with λi < C correspond to zero slack ξi, and they might lie either on the margin or on the correct side of the margin. However, in this case, the support vectors are defined as data points that satisfy the soft SVM constraints exactly, and some of them might have nonzero slack. Such points might lie on the margin, inside the margin, or on the wrong side of the decision boundary. Points that satisfy λi > 0 are always support vectors. The support vectors that lie on the margin will therefore satisfy 0 < λi < C. These points are very useful in solving for b. Consider any such support vector Xr with zero slack, which satisfies 0 < λr < C. The value of b may be obtained as before:



y_r \left( \sum_{i=1}^{n} \lambda_i y_i X_i \cdot X_r + b \right) = +1. \qquad (10.58)

Note that this expression is the same as for the case of hard SVMs, except that the relevant training points are identified by using the condition 0 < λr < C. The gradient-ascent update is also identical to the separable case (cf. Sect. 10.6.1.1), except that any multiplier λi exceeding C because of an update needs to be reset to C. The classification of a test instance also uses Eq. 10.53 in terms of Lagrangian multipliers because the relationship between the weight vector and the Lagrangian multipliers is the same in this case. Thus, the soft SVM formulation with hinge loss is strikingly similar to the hard SVM formulation. This similarity is less pronounced for other slack penalty functions such as quadratic loss.
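To make this concrete, here is a minimal sketch (our own illustration, not the book's code) of the kind of dual update described above for a linear kernel: each λi is nudged along the gradient of LD and then reset into the box [0, C], and b is recovered afterward from the margin support vectors via Eq. 10.58. For simplicity the sketch does not enforce the equality constraint Σ_{i=1}^{n} λiyi = 0; the names soft_svm_dual, lr, and n_iters are our own.

```python
import numpy as np

def soft_svm_dual(X, y, C=1.0, lr=0.001, n_iters=2000):
    """Illustrative soft-margin SVM trained in the dual with a linear kernel.

    X: (n, d) array of training points; y: (n,) array of labels in {-1, +1}.
    Gradient ascent on L_D with each lambda_i clipped to the box [0, C];
    the equality constraint sum_i lambda_i * y_i = 0 is not enforced here.
    """
    n = X.shape[0]
    K = X @ X.T                      # linear kernel matrix K_ij = X_i . X_j
    lam = np.zeros(n)                # Lagrangian multipliers lambda_i

    for _ in range(n_iters):
        # dL_D/dlambda_k = 1 - y_k * sum_i lambda_i y_i K_ik
        grad = 1.0 - y * (K @ (lam * y))
        # Ascent step, then reset multipliers that leave [0, C] back to the boundary.
        lam = np.clip(lam + lr * grad, 0.0, C)

    w = (lam * y) @ X                # W = sum_i lambda_i y_i X_i  (Eq. 10.49)

    # Solve for b using margin support vectors with 0 < lambda_r < C (Eq. 10.58):
    # y_r (W . X_r + b) = +1  =>  b = y_r - W . X_r, since y_r is in {-1, +1}.
    on_margin = (lam > 1e-6) & (lam < C - 1e-6)
    b = np.mean(y[on_margin] - X[on_margin] @ w) if on_margin.any() else 0.0
    return w, b, lam
```

A test instance Z is then classified as sign(W · Z + b), which coincides with the multiplier-based rule of Eq. 10.53 in the case of a linear kernel.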


The soft version of SVMs also allows an unconstrained primal formulation by eliminating the margin constraints and slack variables simultaneously. This is achieved by substituting ξi = max{0, 1 − yi[W · Xi + b]} in the primal objective function of Eq. 10.56. This results in an unconstrained optimization (minimization) problem purely in terms of W and b:























O = \frac{||W||^2}{2} + C \sum_{i=1}^{n} \max\left\{ 0,\, 1 - y_i \left[ W \cdot X_i + b \right] \right\}. \qquad (10.59)

One can use a gradient descent approach, which is analogous to the gradient ascent method used in logistic regression. The partial derivatives of the nondifferentiable function O with respect to w1, . . . , wd and b are approximated on a casewise basis, depending on whether or not the term inside the maximum function evaluates to a positive quantity. The precise derivation of the gradient descent steps is left as an exercise for the reader. While the dual approach is more popular, the primal approach is intuitively simpler, and it is often more efficient when an approximate solution is desired.
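One way to fill in that exercise is sketched below (an illustrative implementation under our own assumptions: full-batch updates, a fixed step size, and hypothetical names such as soft_svm_primal). The casewise handling of the maximum function follows the description above: a training point contributes to the gradient only when the term inside max{·} is positive.

```python
import numpy as np

def soft_svm_primal(X, y, C=1.0, lr=0.01, n_iters=1000):
    """Illustrative (sub)gradient descent on the unconstrained primal of Eq. 10.59:
    O = ||W||^2 / 2 + C * sum_i max{0, 1 - y_i (W . X_i + b)}.

    X: (n, d) array of training points; y: (n,) labels in {-1, +1}.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0

    for _ in range(n_iters):
        margins = y * (X @ w + b)
        violated = margins < 1.0          # points where the max{} term is positive

        # Casewise subgradient: each violating point contributes -C * y_i * X_i to
        # grad_w (and -C * y_i to grad_b); the margin term ||W||^2 / 2 contributes w.
        grad_w = w - C * (y[violated] @ X[violated])
        grad_b = -C * np.sum(y[violated])

        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

In practice, a decaying step size or mini-batches of training points tend to behave better than the fixed full-batch step used in this sketch.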








⁵ The additional term in LP involving ξi is (C − βi − λi)ξi. This term evaluates to 0 because the partial derivative of LP with respect to ξi is (C − βi − λi). This partial derivative must evaluate to 0 for optimality of LP.




10.6.2.1 Comparison with Other Linear Models


The normal vector to a linear separating hyperplane can be viewed as a direction along which the data points of the two classes are best separated. Fisher’s linear discriminant also achieves this goal by maximizing the ratio of the between-class scatter to the within-class scatter along an optimally chosen vector. However, an important distinguishing feature of SVMs is that they focus extensively on the decision boundary region between the two classes because this is the most uncertain region, which is prone to classification error. Fisher’s discriminant focuses on the global separation between the two classes and may not necessarily provide the best separation in the uncertain boundary region. This is the reason that SVMs often have better generalization behavior for noisy data sets that are prone to overfitting.


It is instructive to express logistic regression as a minimization problem by using the negative of the log-likelihood function and then comparing it with SVMs. The coefficients (θ0, . . . , θd) in logistic regression are analogous to the coefficients (b, W) in SVMs. SVMs have a margin component to increase the generalization power of the classifier, just as logistic regression uses regularization. Interestingly, the margin component ||W||²/2 in SVMs has an identical form to the regularization term Σ_{i=1}^{d} θi²/2 in logistic regression. SVMs have




slack penalties just as logistic regression implicitly penalizes the probability of mistakes in the log-likelihood function. However, the slack is computed using margin violations in SVMs, whereas the penalties in logistic regression are computed as a smooth function of the distances from the decision boundary. Specifically, the log-likelihood function in logistic regression creates a smooth loss function of the form log(1 + e^{−yi[θ0 + θ·Xi]}), whereas the hinge loss max{0, 1 − yi[W · Xi + b]} in SVMs is not a smooth function. The nature of the misclassification penalty is the only difference between the two models. Therefore, there are several conceptual similarities among these models, but they emphasize different aspects of optimization. SVMs and regularized logistic regression show similar performance in many practical settings with poorly separable classes. However, SVMs and Fisher’s discriminant generally perform better than logistic regression for the special case of well-separated classes. All these methods can also be extended to nonlinear decision boundaries in similar ways.
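The difference between the two penalties is easy to see numerically; the snippet below (our own illustration) evaluates both losses at a few signed margins yi[W · Xi + b].

```python
import numpy as np

# Signed margin m = y * (W . X + b) for a few hypothetical points:
# m > 1 is comfortably correct, 0 < m < 1 is correct but inside the margin,
# and m < 0 is misclassified.
margins = np.array([2.0, 1.0, 0.5, 0.0, -1.0])

hinge_loss = np.maximum(0.0, 1.0 - margins)        # SVM penalty (non-smooth at m = 1)
logistic_loss = np.log(1.0 + np.exp(-margins))     # logistic regression penalty (smooth)

for m, h, l in zip(margins, hinge_loss, logistic_loss):
    print(f"margin {m:+.1f}: hinge {h:.3f}, logistic {l:.3f}")
```

Both penalties grow roughly linearly for badly misclassified points, but only the hinge loss is exactly zero beyond the margin, which is why the SVM solution depends only on its support vectors.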

10.6.3 Nonlinear Support Vector Machines


In many cases, linear solvers are not appropriate for problems in which the decision boundary is not linear. To understand this point, consider the data distribution illustrated in Fig. 10.8. It is evident that no linear separating hyperplane can delineate the two classes. This is because the two classes are separated by the following decision boundary:





8(x_1 - 1)^2 + 50(x_2 - 2)^2 = 1. \qquad (10.60)

Now, if one already had some insight about the nature of the decision boundary, one might transform the training data into the new 4-dimensional space as follows:


z_1 = x_1^2, \quad z_2 = x_1, \quad z_3 = x_2^2, \quad z_4 = x_2.
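The following sketch (our own illustration, with hypothetical variable names) labels points according to the elliptical boundary of Eq. 10.60, applies the transformation above, and checks that the boundary becomes a linear condition in the transformed space.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(low=[0.7, 1.7], high=[1.3, 2.3], size=(500, 2))   # points around (1, 2)

# Class label from the elliptical boundary of Eq. 10.60.
y = np.where(8 * (x[:, 0] - 1) ** 2 + 50 * (x[:, 1] - 2) ** 2 <= 1.0, 1, -1)

# Transformation to the 4-dimensional space (z1, z2, z3, z4) = (x1^2, x1, x2^2, x2).
z = np.column_stack([x[:, 0] ** 2, x[:, 0], x[:, 1] ** 2, x[:, 1]])

# Expanding Eq. 10.60 gives 8*z1 - 16*z2 + 50*z3 - 200*z4 + 207 = 0,
# i.e., a hyperplane W . Z + b = 0 in the transformed space.
w, b = np.array([8.0, -16.0, 50.0, -200.0]), 207.0
predicted = np.where(z @ w + b <= 0.0, 1, -1)
print("agreement with class labels:", np.mean(predicted == y))
```

Because the classes are now separated by a hyperplane in (z1, z2, z3, z4), any of the linear methods of this section, including the soft-margin SVM above, can be applied to the transformed data. In practice, such transformations are rarely constructed by hand; kernel methods achieve a similar effect implicitly.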









[Figure 10.8: data distribution with the elliptical class boundary of Eq. 10.60; vertical axis FEATURE Y, ticks 1.7–2.5.]