Data Mining: The Textbook




10.6.4 The Kernel Trick


The kernel trick leverages the important observation that the SVM formulation can be fully solved in terms of dot products (or similarities) between pairs of data points; the individual feature values are never needed. Therefore, the key is to define the pairwise dot product (or similarity function) directly in the transformed representation Φ(X) with the use of a kernel function K(Xi, Xj):

K(Xi, Xj) = Φ(Xi) · Φ(Xj)                                            (10.61)
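As a minimal sketch of Eq. 10.61 (not part of the text), any explicit transformation Φ induces a kernel by taking dot products of transformed points; the feature map below is an arbitrary illustration.

```python
import numpy as np

def kernel_from_map(phi):
    """Turn an explicit transformation phi into a kernel K(x, z) = phi(x) . phi(z)."""
    return lambda x, z: float(np.dot(phi(x), phi(z)))

# Illustrative feature map: append squared coordinates to the original ones.
phi = lambda x: np.concatenate([np.asarray(x, dtype=float), np.asarray(x, dtype=float) ** 2])
K = kernel_from_map(phi)
# K(np.array([1.0, 2.0]), np.array([0.5, -1.0])) evaluates Phi(x) . Phi(z)
```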
To effectively solve the SVM, recall that the transformed feature values Φ(X) need not be explicitly computed, as long as the dot product (or kernel similarity) K(Xi, Xj) is known. This implies that the term Xi · Xj may be replaced by the transformed-space dot product K(Xi, Xj) in Eq. 10.50, and the term Xi · Z in Eq. 10.53 can be replaced by K(Xi, Z) to perform SVM classification.

LD = Σ_{i=1}^{n} λi − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} λi λj yi yj K(Xi, Xj)        (10.62)

F(Z) = sign{ (Σ_{i=1}^{n} λi yi K(Xi, Z)) + b }                                  (10.63)
Note that the bias b is also expressed in terms of dot products according to Eq. 10.58. These modifications are carried over to the update equations discussed in Sect. 10.6.1.1, all of which are expressed in terms of dot products.
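To make Eqs. 10.62 and 10.63 concrete, the following sketch (in Python with NumPy, not from the text) evaluates the kernelized dual objective and the kernelized decision function, assuming the Lagrange multipliers, labels, and bias have already been obtained from the dual optimization; all names are illustrative.

```python
import numpy as np

def dual_objective(lambdas, y, K):
    """Kernelized dual objective of Eq. 10.62.

    lambdas : (n,) Lagrange multipliers
    y       : (n,) class labels in {-1, +1}
    K       : (n, n) precomputed kernel matrix with K[i, j] = K(Xi, Xj)
    """
    weighted = lambdas * y
    return lambdas.sum() - 0.5 * weighted @ K @ weighted

def predict(Z, X, y, lambdas, b, kernel):
    """Kernelized decision function of Eq. 10.63 for a single test point Z."""
    total = sum(lam * yi * kernel(xi, Z) for lam, yi, xi in zip(lambdas, y, X))
    return np.sign(total + b)
```

In practice, only the support vectors have nonzero λi, so the sum in the decision function effectively runs over a small subset of the training points.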


Thus, all computations are performed in the original space, and the actual transformation Φ(·) does not need to be known as long as the kernel similarity function K(·, ·) is known. By using kernel-based similarity with carefully chosen kernels, arbitrary nonlinear decision boundaries can be approximated. There are different ways of modeling similarity between Xi and Xj. Some common choices of the kernel function are as follows:





Function                        Form
Gaussian radial basis kernel    K(Xi, Xj) = e^(−||Xi − Xj||² / (2σ²))
Polynomial kernel               K(Xi, Xj) = (Xi · Xj + c)^h
Sigmoid kernel                  K(Xi, Xj) = tanh(κ (Xi · Xj) − δ)
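As a concrete illustration (not from the text), the three kernels in the table can be written directly as functions of the original feature vectors; the parameter defaults below are arbitrary placeholders and would be tuned as described next.

```python
import numpy as np

def gaussian_rbf_kernel(xi, xj, sigma=1.0):
    # K(Xi, Xj) = exp(-||Xi - Xj||^2 / (2 * sigma^2))
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def polynomial_kernel(xi, xj, c=1.0, h=2):
    # K(Xi, Xj) = (Xi . Xj + c)^h
    return (np.dot(xi, xj) + c) ** h

def sigmoid_kernel(xi, xj, kappa=1.0, delta=0.0):
    # K(Xi, Xj) = tanh(kappa * (Xi . Xj) - delta)
    return np.tanh(kappa * np.dot(xi, xj) - delta)
```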
Many of these kernel functions have parameters associated with them. In general, these parameters may need to be tuned by holding out a portion of the training data and using it to test the accuracy of different parameter choices. Many other kernels are possible beyond the ones listed in the table above. Kernels need to satisfy a property known as Mercer's theorem to be considered valid. This condition ensures that the n × n kernel similarity matrix S = [K(Xi, Xj)] is positive semidefinite, and that similarities can be expressed as dot products in some transformed space. Why must the kernel similarity matrix always be positive semidefinite for similarities to be expressed as dot products? Note that if the n × n kernel similarity matrix S can be expressed as the n × n dot-product matrix AA^T of some n × r transformed representation A of the points, then for any n-dimensional column vector V, we have V^T S V = (A^T V)^T (A^T V) ≥ 0. In other words, S is positive semidefinite. Conversely, if the kernel matrix S is positive semidefinite, then it can be expressed as a dot product with the eigendecomposition S = QΣ²Q^T = (QΣ)(QΣ)^T, where Σ² is an n × n diagonal matrix of nonnegative eigenvalues and Q is an n × n matrix containing the eigenvectors of S in its columns. The matrix QΣ is the n-dimensional transformed representation of the points, and it is also sometimes referred to as the data-specific Mercer kernel map. This map is specific to the data set at hand, and it is used in many nonlinear dimensionality reduction methods, such as kernel PCA.
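The argument above can be checked numerically. The sketch below (an illustration, with an arbitrary kernel and arbitrary data) builds the kernel matrix S, takes its eigendecomposition, and recovers a data-specific Mercer kernel map QΣ whose pairwise dot products reproduce S.

```python
import numpy as np

def mercer_kernel_map(X, kernel):
    """Return the kernel matrix S and the data-specific Mercer kernel map Q*Sigma."""
    n = len(X)
    # Build the n x n kernel similarity matrix S = [K(Xi, Xj)].
    S = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    # Eigendecomposition S = Q diag(w) Q^T; for a valid (Mercer) kernel, w >= 0.
    w, Q = np.linalg.eigh(S)
    w = np.clip(w, 0.0, None)           # clip tiny negative values due to round-off
    embedding = Q * np.sqrt(w)          # rows are the transformed points (Q * Sigma)
    return S, embedding

# Example with a Gaussian kernel on random 2-d points:
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)
S, emb = mercer_kernel_map(X, rbf)
# The dot products of the embedded points reproduce the kernel matrix:
assert np.allclose(emb @ emb.T, S)
```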


What kind of kernel function works best for the example of Fig. 10.8? In general, there are no predefined rules for selecting kernels. Ideally, if the similarity values K(Xi, Xj) were defined so that there exists a space in which points with this similarity structure are linearly separable, then a linear SVM in the transformed space Φ(·) will work well.


To explain this point, we will revisit the example of Fig. 10.8. Let Xi² and Xj² be the d-dimensional vectors derived by squaring each coordinate of Xi and Xj, respectively. In the case of Fig. 10.8, consider the transformation (z1, z2, z3, z4) in the previous section. It can be shown that the dot product between two transformed data points can be captured by the following kernel function:

Transformed-Dot-Product(Xi, Xj) = Xi · Xj + Xi² · Xj²                (10.64)
This is easy to verify by expanding the aforementioned expression in terms of the transformed variables z1 . . . z4 of the two data points. The kernel function Transformed-Dot-Product(Xi, Xj) would obtain the same Lagrangian multipliers and decision boundary as obtained with the explicit transformation z1 . . . z4. Interestingly, this kernel is closely related to the second-order polynomial kernel:

K(Xi, Xj) = (0.5 + Xi · Xj)²                                         (10.65)
Expanding the second-order polynomial kernel results in a superset of the additive terms in Transformed-Dot-Product(Xi, Xj). The additional terms include a constant term of 0.25 and some inter-dimensional products. These terms provide further modeling flexibility. In the case of the 2-dimensional example of Fig. 10.8, the use of the second-order polynomial kernel is equivalent to using an extra transformed variable z5 = √2·x1·x2 representing the product of the values on the two dimensions and a constant dimension z6 = 0.5. These variables are in addition to the original four variables (z1, z2, z3, z4). Since these additional variables are redundant in this case, they will not affect the ability to discover the correct decision boundary, although they might cause some overfitting. On the other hand, a variable such as z5 = √2·x1·x2 would have come in handy if the ellipse of Fig. 10.8 had been arbitrarily oriented with respect to the axis system. A full separation of the classes would not have been possible with a linear classifier on the original four variables (z1, z2, z3, z4). Therefore, the second-order polynomial kernel can discover more general decision boundaries than the transformation of the previous section. Using even higher-order polynomial kernels can model increasingly complex boundaries, but at a greater risk of overfitting.
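The equivalence discussed above is easy to verify numerically for 2-dimensional points. The sketch below (illustrative only) compares the kernel of Eq. 10.65 with the dot product of explicit transformed representations containing x1, x2, x1², x2², the product term √2·x1·x2, and the constant dimension 0.5.

```python
import numpy as np

def poly_kernel(x, y):
    # Second-order polynomial kernel of Eq. 10.65: K(X, Y) = (0.5 + X . Y)^2
    return (0.5 + np.dot(x, y)) ** 2

def explicit_map(x):
    # Explicit transformation implied by the kernel for 2-d input (x1, x2):
    # (z1, z2, z3, z4) = (x1, x2, x1^2, x2^2), plus z5 = sqrt(2)*x1*x2 and z6 = 0.5.
    x1, x2 = x
    return np.array([x1, x2, x1 ** 2, x2 ** 2, np.sqrt(2.0) * x1 * x2, 0.5])

rng = np.random.default_rng(1)
a, b = rng.normal(size=2), rng.normal(size=2)
# The kernel value equals the dot product in the explicitly transformed space.
assert np.isclose(poly_kernel(a, b), explicit_map(a) @ explicit_map(b))
```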


In general, different kernels have different levels of flexibility. For example, a transformed feature space that is implied by the Gaussian kernel of width σ can be shown to have an infinite number of dimensions by using the polynomial expansion of the exponential term. The parameter σ controls the relative scaling of various dimensions. A smaller value of σ results in a greater ability to model complex boundaries, but it may also cause overfitting. Smaller data sets are more prone to overfitting. Therefore, the optimal values of kernel parameters depend not only on the shape of the decision boundary but also on the size of the training data set. Parameter tuning is important in kernel methods. With proper tuning, many kernel functions can model complex decision boundaries. Furthermore, kernels provide
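As a hedged sketch of the tuning procedure mentioned above, the following code holds out a portion of the training data and selects the Gaussian kernel width σ by validation accuracy. It assumes scikit-learn is available; SVC's gamma parameter is taken to play the role of 1/(2σ²), and the candidate grid, test fraction, and C value are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def select_rbf_width(X, y, sigma_grid=(0.1, 0.5, 1.0, 2.0, 5.0), seed=0):
    """Pick the Gaussian kernel width sigma with the best held-out accuracy."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    best_sigma, best_acc = None, -1.0
    for sigma in sigma_grid:
        # gamma = 1 / (2 * sigma^2) translates the width sigma into SVC's parameter.
        model = SVC(kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2), C=1.0)
        model.fit(X_train, y_train)
        acc = model.score(X_val, y_val)
        if acc > best_acc:
            best_sigma, best_acc = sigma, acc
    return best_sigma, best_acc
```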



