Data Mining: The Textbook




O = \frac{||W||^2}{2} + C \sum_{i=1}^{n} \xi_i. \qquad (10.56)

As before, this is a convex quadratic optimization problem that can be solved using Lagrangian methods. A similar approach is used to set up the Lagrangian relaxation of the problem with penalty terms and additional multipliers βi ≥ 0 for the slack constraints ξi ≥ 0:

L_P = \frac{||W||^2}{2} + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \lambda_i \left[ y_i (W \cdot X_i + b) - 1 + \xi_i \right] - \sum_{i=1}^{n} \beta_i \xi_i. \qquad (10.57)
A similar approach to the hard SVM case can be used to eliminate the minimization variables W , ξi, and b from the optimization formulation and create a purely maximization dual formulation. This is achieved by setting the gradient of LP with respect to these variables to 0. By setting the gradients of LP with respect to W and b to 0, it can be respectively shown that the value of W is identical to the hard-margin case (Eq. 10.49), and the same





multiplier constraint Σ_{i=1}^{n} λiyi = 0 is satisfied. This is because the additional slack terms in LP involving ξi do not affect the respective gradients with respect to W and b. Furthermore, it can be shown that the objective function LD of the Lagrangian dual in the soft-margin case is identical to that of the hard-margin case, according to Eq. 10.50, because the linear terms involving each ξi evaluate⁵ to 0. The only change to the dual optimization problem is that the nonnegative Lagrangian multipliers satisfy additional constraints of the form C − λi = βi ≥ 0. This constraint is derived by setting the partial derivative of LP with respect to ξi to 0. One way of viewing this additional constraint λi ≤ C is that the influence of any training data point Xi on the weight vector W = Σ_{i=1}^{n} λiyiXi is capped by C because of the softness of the margin. The dual problem in soft SVMs maximizes LD (Eq. 10.50) subject to the constraints 0 ≤ λi ≤ C and Σ_{i=1}^{n} λiyi = 0.
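For reference, the stationarity conditions used in this paragraph follow by differentiating Eq. 10.57 with respect to W, b, and ξi (a routine computation that the text leaves implicit):

\frac{\partial L_P}{\partial W} = W - \sum_{i=1}^{n} \lambda_i y_i X_i = 0, \qquad \frac{\partial L_P}{\partial b} = -\sum_{i=1}^{n} \lambda_i y_i = 0, \qquad \frac{\partial L_P}{\partial \xi_i} = C - \lambda_i - \beta_i = 0.

The first condition recovers Eq. 10.49, the second yields the multiplier constraint, and the third gives βi = C − λi, from which λi ≤ C follows because βi ≥ 0.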



















The Kuhn–Tucker optimality conditions for the slack nonnegativity constraints are βiξi = 0. Because we have already derived βi = C − λi, we obtain (C − λi)ξi = 0. In other words, training points Xi with λi < C correspond to zero slack ξi, and they might lie either on the margin or on the correct side of the margin. However, in this case, the support vectors are defined as data points that satisfy the soft SVM constraints exactly, and some of them might have nonzero slack. Such points might lie on the margin, inside the margin, or on the wrong side of the decision boundary. Points that satisfy λi > 0 are always support vectors. The support vectors that lie on the margin will therefore satisfy 0 < λi < C. These points are very useful in solving for b. Consider any such support vector Xr with zero slack, which satisfies 0 < λr < C. The value of b may be obtained as before:



y_r \left( \sum_{i=1}^{n} \lambda_i y_i X_i \cdot X_r + b \right) = +1. \qquad (10.58)

Note that this expression is the same as for the case of hard SVMs, except that the relevant training points are identified by using the condition 0 < λr < C. The gradient-ascent update is also identical to the separable case (cf. Sect. 10.6.1.1), except that any multiplier λi exceeding C because of an update needs to be reset to C. The classification of a test instance also uses Eq. 10.53 in terms of Lagrangian multipliers because the relationship between the weight vector and the Lagrangian multipliers is the same in this case. Thus, the soft SVM formulation with hinge loss is strikingly similar to the hard SVM formulation. This similarity is less pronounced for other slack penalty functions such as quadratic loss.
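To make this concrete, here is a minimal sketch (our own illustration, not the book's code) of the kind of dual update described above for a linear kernel: each λi is nudged along the gradient of LD and then reset into the box [0, C], and b is recovered afterward from the margin support vectors via Eq. 10.58. For simplicity the sketch does not enforce the equality constraint Σ_{i=1}^{n} λiyi = 0; the names soft_svm_dual, lr, and n_iters are our own.

```python
import numpy as np

def soft_svm_dual(X, y, C=1.0, lr=0.001, n_iters=2000):
    """Illustrative soft-margin SVM trained in the dual with a linear kernel.

    X: (n, d) array of training points; y: (n,) array of labels in {-1, +1}.
    Gradient ascent on L_D with each lambda_i clipped to the box [0, C];
    the equality constraint sum_i lambda_i * y_i = 0 is not enforced here.
    """
    n = X.shape[0]
    K = X @ X.T                      # linear kernel matrix K_ij = X_i . X_j
    lam = np.zeros(n)                # Lagrangian multipliers lambda_i

    for _ in range(n_iters):
        # dL_D/dlambda_k = 1 - y_k * sum_i lambda_i y_i K_ik
        grad = 1.0 - y * (K @ (lam * y))
        # Ascent step, then reset multipliers that leave [0, C] back to the boundary.
        lam = np.clip(lam + lr * grad, 0.0, C)

    w = (lam * y) @ X                # W = sum_i lambda_i y_i X_i  (Eq. 10.49)

    # Solve for b using margin support vectors with 0 < lambda_r < C (Eq. 10.58):
    # y_r (W . X_r + b) = +1  =>  b = y_r - W . X_r, since y_r is in {-1, +1}.
    on_margin = (lam > 1e-6) & (lam < C - 1e-6)
    b = np.mean(y[on_margin] - X[on_margin] @ w) if on_margin.any() else 0.0
    return w, b, lam
```

A test instance Z is then classified as sign(W · Z + b), which coincides with the multiplier-based rule of Eq. 10.53 in the case of a linear kernel.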


The soft version of SVMs also allows an unconstrained primal formulation by eliminating the margin constraints and slack variables simultaneously. This is achieved by substituting ξi = max{0, 1 − yi[W · Xi + b]} in the primal objective function of Eq. 10.56. This results in an unconstrained optimization (minimization) problem purely in terms of W and b:























O = \frac{||W||^2}{2} + C \sum_{i=1}^{n} \max\left\{ 0,\, 1 - y_i \left[ W \cdot X_i + b \right] \right\}. \qquad (10.59)

One can use a gradient descent approach, which is analogous to the gradient ascent method used in logistic regression. The partial derivatives of the nondifferentiable function O with respect to w1, . . . , wd and b are approximated on a casewise basis, depending on whether or not the term inside the maximum function evaluates to a positive quantity. The precise derivation of the gradient descent steps is left as an exercise for the reader. While the dual approach is more popular, the primal approach is intuitively simpler, and it is often more efficient when an approximate solution is desired.
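One way to fill in that exercise is sketched below (an illustrative implementation under our own assumptions: full-batch updates, a fixed step size, and hypothetical names such as soft_svm_primal). The casewise handling of the maximum function follows the description above: a training point contributes to the gradient only when the term inside max{·} is positive.

```python
import numpy as np

def soft_svm_primal(X, y, C=1.0, lr=0.01, n_iters=1000):
    """Illustrative (sub)gradient descent on the unconstrained primal of Eq. 10.59:
    O = ||W||^2 / 2 + C * sum_i max{0, 1 - y_i (W . X_i + b)}.

    X: (n, d) array of training points; y: (n,) labels in {-1, +1}.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0

    for _ in range(n_iters):
        margins = y * (X @ w + b)
        violated = margins < 1.0          # points where the max{} term is positive

        # Casewise subgradient: each violating point contributes -C * y_i * X_i to
        # grad_w (and -C * y_i to grad_b); the margin term ||W||^2 / 2 contributes w.
        grad_w = w - C * (y[violated] @ X[violated])
        grad_b = -C * np.sum(y[violated])

        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

In practice, a decaying step size or mini-batches of training points tend to behave better than the fixed full-batch step used in this sketch.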








⁵ The additional term in LP involving ξi is (C − βi − λi)ξi. This term evaluates to 0 because the partial derivative of LP with respect to ξi is (C − βi − λi). This partial derivative must evaluate to 0 for optimality of LP.




10.6.2.1 Comparison with Other Linear Models


The normal vector to a linear separating hyperplane can be viewed as a direction along which the data points of the two classes are best separated. Fisher’s linear discriminant also achieves this goal by maximizing the ratio of the between-class scatter to the within-class scatter along an optimally chosen vector. However, an important distinguishing feature of SVMs is that they focus extensively on the decision boundary region between the two classes because this is the most uncertain region, which is prone to classification error. Fisher’s discriminant focuses on the global separation between the two classes and may not necessarily provide the best separation in the uncertain boundary region. This is the reason that SVMs often have better generalization behavior for noisy data sets that are prone to overfitting.


It is instructive to express logistic regression as a minimization problem by using the negative of the log-likelihood function and then comparing it with SVMs. The coefficients (θ0, . . . , θd) in logistic regression are analogous to the coefficients (b, W) in SVMs. SVMs have a margin component to increase the generalization power of the classifier, just as logistic regression uses regularization. Interestingly, the margin component ||W||²/2 in SVMs has an identical form to the regularization term Σ_{i=1}^{d} θi²/2 in logistic regression. SVMs have




slack penalties just as logistic regression implicitly penalizes the probability of mistakes in the log-likelihood function. However, the slack is computed using margin violations in SVMs, whereas the penalties in logistic regression are computed as a smooth function of the distances from the decision boundary. Specifically, the log-likelihood function in logistic regression creates a smooth loss function of the form log(1 + e^{−yi[θ0 + θ·Xi]}), whereas the hinge loss max{0, 1 − yi[W · Xi + b]} in SVMs is not a smooth function. The nature of the misclassification penalty is the only difference between the two models. Therefore, there are several conceptual similarities among these models, but they emphasize different aspects of optimization. SVMs and regularized logistic regression show similar performance in many practical settings with poorly separable classes. However, SVMs and Fisher’s discriminant generally perform better than logistic regression for the special case of well-separated classes. All these methods can also be extended to nonlinear decision boundaries in similar ways.
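The difference between the two penalties is easy to see numerically; the snippet below (our own illustration) evaluates both losses at a few signed margins yi[W · Xi + b].

```python
import numpy as np

# Signed margin m = y * (W . X + b) for a few hypothetical points:
# m > 1 is comfortably correct, 0 < m < 1 is correct but inside the margin,
# and m < 0 is misclassified.
margins = np.array([2.0, 1.0, 0.5, 0.0, -1.0])

hinge_loss = np.maximum(0.0, 1.0 - margins)        # SVM penalty (non-smooth at m = 1)
logistic_loss = np.log(1.0 + np.exp(-margins))     # logistic regression penalty (smooth)

for m, h, l in zip(margins, hinge_loss, logistic_loss):
    print(f"margin {m:+.1f}: hinge {h:.3f}, logistic {l:.3f}")
```

Both penalties grow roughly linearly for badly misclassified points, but only the hinge loss is exactly zero beyond the margin, which is why the SVM solution depends only on its support vectors.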

10.6.3 Nonlinear Support Vector Machines


In many cases, linear solvers are not appropriate for problems in which the decision boundary is not linear. To understand this point, consider the data distribution illustrated in Fig. 10.8. It is evident that no linear separating hyperplane can delineate the two classes. This is because the two classes are separated by the following decision boundary:





8(x_1 - 1)^2 + 50(x_2 - 2)^2 = 1. \qquad (10.60)

Now, if one already had some insight about the nature of the decision boundary, one might transform the training data into the new 4-dimensional space as follows:


z_1 = x_1^2, \quad z_2 = x_1, \quad z_3 = x_2^2, \quad z_4 = x_2.
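The following sketch (our own illustration, with hypothetical variable names) labels points according to the elliptical boundary of Eq. 10.60, applies the transformation above, and checks that the boundary becomes a linear condition in the transformed space.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(low=[0.7, 1.7], high=[1.3, 2.3], size=(500, 2))   # points around (1, 2)

# Class label from the elliptical boundary of Eq. 10.60.
y = np.where(8 * (x[:, 0] - 1) ** 2 + 50 * (x[:, 1] - 2) ** 2 <= 1.0, 1, -1)

# Transformation to the 4-dimensional space (z1, z2, z3, z4) = (x1^2, x1, x2^2, x2).
z = np.column_stack([x[:, 0] ** 2, x[:, 0], x[:, 1] ** 2, x[:, 1]])

# Expanding Eq. 10.60 gives 8*z1 - 16*z2 + 50*z3 - 200*z4 + 207 = 0,
# i.e., a hyperplane W . Z + b = 0 in the transformed space.
w, b = np.array([8.0, -16.0, 50.0, -200.0]), 207.0
predicted = np.where(z @ w + b <= 0.0, 1, -1)
print("agreement with class labels:", np.mean(predicted == y))
```

Because the classes are now separated by a hyperplane in (z1, z2, z3, z4), any of the linear methods of this section, including the soft-margin SVM above, can be applied to the transformed data. In practice, such transformations are rarely constructed by hand; kernel methods achieve a similar effect implicitly.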









[Figure 10.8: data distribution with the elliptical class boundary of Eq. 10.60; vertical axis FEATURE Y, ticks 1.7–2.5.]