Data Mining: The Textbook



$$L_D = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j \, X_i \cdot X_j \qquad (10.50)$$
The dual problem maximizes LD subject to the constraints λi ≥ 0 and $\sum_{i=1}^{n} \lambda_i y_i = 0$.
Note that LD is expressed only in terms of λi, the class labels, and the pairwise dot products Xi · Xj between training data points. Therefore, solving for the Lagrangian multipliers requires knowledge of only the class variables and the dot products between training instances, but not direct knowledge of the feature values Xi. The dot products between training data points can be viewed as a kind of similarity between the points, which can easily be defined for data types beyond numeric domains. This observation is important for generalizing linear SVMs to nonlinear decision boundaries and arbitrary data types with the kernel trick.
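To make this concrete, here is a minimal sketch (not from the book; it assumes NumPy, labels in {−1, +1}, and a precomputed matrix K of pairwise dot products) that evaluates the dual objective of Eq. 10.50 using only the multipliers, the labels, and those dot products:

```python
import numpy as np

def dual_objective(lam, y, K):
    """Evaluate L_D of Eq. 10.50.

    lam : (n,) Lagrangian multipliers
    y   : (n,) class labels in {-1, +1}
    K   : (n, n) matrix of dot products, K[i, j] = X_i . X_j
    """
    v = lam * y                      # elementwise lam_i * y_i
    return lam.sum() - 0.5 * (v @ K @ v)
```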



1. The value of b can be derived from the constraints in the original SVM formulation, for which the Lagrangian multipliers λr are strictly positive. For these training points, the margin constraint yr(W · Xr + b) = +1 is satisfied exactly according to the Kuhn–Tucker conditions. The value of b can be derived from any such training point (Xr, yr) as follows:

$$y_r \left( W \cdot X_r + b \right) = +1 \qquad \forall r : \lambda_r > 0 \qquad (10.51)$$

$$y_r \left( \sum_{i=1}^{n} \lambda_i y_i \, X_i \cdot X_r + b \right) = +1 \qquad \forall r : \lambda_r > 0 \qquad (10.52)$$

The second relationship is derived by substituting the expression for W in terms of the Lagrangian multipliers according to Eq. 10.49. Note that this relationship is expressed only in terms of Lagrangian multipliers, class labels, and dot products between training instances. The value of b can be solved from this equation. To reduce numerical error, the value of b may be averaged over all the support vectors with λr > 0.
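A small illustrative sketch of this step (a hypothetical helper, assuming NumPy and the same dot-product matrix K as above) recovers b from every support vector and averages the results:

```python
import numpy as np

def bias_from_support_vectors(lam, y, K, tol=1e-8):
    """Average b over all support vectors (lambda_r > 0), per Eq. 10.52.

    From y_r(sum_i lam_i y_i X_i . X_r + b) = +1 and y_r in {-1, +1},
    it follows that b = y_r - sum_i lam_i y_i K[i, r].
    """
    v = lam * y
    support = np.where(lam > tol)[0]          # indices with lambda_r > 0
    b_estimates = [y[r] - v @ K[:, r] for r in support]
    return float(np.mean(b_estimates))        # averaging reduces numerical error
```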





2. For a test instance Z, its class label F(Z) is defined by the decision boundary obtained by substituting for W in terms of the Lagrangian multipliers (Eq. 10.49):

$$F(Z) = \operatorname{sign}\{ W \cdot Z + b \} = \operatorname{sign}\left\{ \left( \sum_{i=1}^{n} \lambda_i y_i \, X_i \cdot Z \right) + b \right\} \qquad (10.53)$$


It is interesting to note that F(Z) can be fully expressed in terms of the dot products between training instances and the test instance, the class labels, the Lagrangian multipliers, and the bias b. Because the Lagrangian multipliers λi and b can also be expressed in terms of the dot products between training instances, it follows that the classification can be fully performed using knowledge of only the dot products between different instances (training and test), without knowing the exact feature values of either the training or the test instances.
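As an illustration, a hypothetical prediction routine (assuming NumPy; the training data enter only through their dot products with the test instance) might look as follows:

```python
import numpy as np

def predict(lam, y, X_train, b, Z):
    """Classify a test instance Z with Eq. 10.53.

    X_train : (n, d) training points, used only via the dot products X_i . Z
    Z       : (d,) test instance
    """
    dots = X_train @ Z                        # X_i . Z for all i
    score = np.sum(lam * y * dots) + b
    return 1 if score >= 0 else -1            # sign of the decision function
```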


The observations about dot products are crucial in generalizing SVM methods to nonlinear decision boundaries and arbitrary data types with the use of a technique known as the kernel trick. This technique simply substitutes dot products with kernel similarities (cf. Sect. 10.6.4).
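For instance, one common choice (shown purely as an illustrative sketch; the Gaussian/RBF kernel and the parameter gamma are assumptions, not something prescribed by this passage) replaces the matrix of dot products with a matrix of kernel similarities:

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """Kernel similarities K[i, j] = exp(-gamma * ||X_i - X_j||^2).

    Substituting this matrix for the dot-product matrix in the dual
    objective, the bias computation, and the prediction rule is the
    kernel trick referred to above.
    """
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))
```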


It is noteworthy from the derivation of W (see Eq. 10.49) and the aforementioned derivation of b that only training data points that are support vectors (with λr > 0) are used to define the solution W and b in SVM optimization. As discussed in Chap. 11, this observation is leveraged by scalable SVM classifiers, such as SVMLight. Such classifiers shrink the size of the problem by discarding irrelevant training data points that are easily identified to be far away from the separating hyperplanes.


10.6.1.1 Solving the Lagrangian Dual


The Lagrangian dual LD may be optimized by using the gradient ascent technique in terms of the n-dimensional parameter vector λ.





$$\frac{\partial L_D}{\partial \lambda_i} = 1 - y_i \sum_{j=1}^{n} y_j \lambda_j \, X_i \cdot X_j \qquad (10.54)$$


Therefore, as in logistic regression, the corresponding gradient-based update equation is as follows:



$$(\lambda_1 \ldots \lambda_n) \Leftarrow (\lambda_1 \ldots \lambda_n) + \alpha \left( \frac{\partial L_D}{\partial \lambda_1} \ldots \frac{\partial L_D}{\partial \lambda_n} \right) \qquad (10.55)$$
The step size α may be chosen to maximize the improvement in the objective function. The initial solution can be chosen to be the vector of zeros, which is also a feasible solution for λ.



One problem with this update is that the constraints λi ≥ 0 and $\sum_{i=1}^{n} \lambda_i y_i = 0$ may be violated after an update. Therefore, the gradient vector is projected along the hyperplane $\sum_{i=1}^{n} \lambda_i y_i = 0$ before the update to create a modified gradient vector. Note that the projection of the gradient ∇LD along the normal to this hyperplane is simply H = (ȳ · ∇LD) ȳ, where ȳ is the unit vector (1/√n)(y1 . . . yn). This component is subtracted from ∇LD to create a modified gradient vector G = ∇LD − H. Because of the projection, updating along the modified gradient vector G will not violate the constraint $\sum_{i=1}^{n} \lambda_i y_i = 0$. In addition, any negative values of λi after an update are reset to 0.
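The following sketch (hypothetical; it assumes NumPy, a fixed step size α rather than one chosen to maximize the improvement, and a fixed iteration count) puts Eqs. 10.54 and 10.55 together with the projection and reset steps described above:

```python
import numpy as np

def dual_gradient_ascent(y, K, alpha=0.01, num_iters=1000):
    """Projected gradient ascent on the Lagrangian dual L_D.

    y : (n,) class labels in {-1, +1}
    K : (n, n) matrix of dot products (or kernel similarities)
    """
    n = len(y)
    lam = np.zeros(n)                      # feasible initial solution
    y_unit = y / np.sqrt(n)                # unit vector (1/sqrt(n))(y_1 ... y_n)
    for _ in range(num_iters):
        grad = 1.0 - y * (K @ (lam * y))   # Eq. 10.54 for all i at once
        H = (y_unit @ grad) * y_unit       # component along the hyperplane normal
        G = grad - H                       # modified (projected) gradient
        lam = lam + alpha * G              # Eq. 10.55 with the modified gradient
        lam = np.maximum(lam, 0.0)         # reset negative multipliers to 0
    return lam
```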





































Note that the constraint $\sum_{i=1}^{n} \lambda_i y_i = 0$ is derived by setting the gradient of LP with respect to b to 0. In some alternative formulations of SVMs, the bias b can be included within W by adding a synthetic dimension to the data with a constant value of 1. In such cases, the gradient vector update is simplified to Eq. 10.55 because one no longer needs to worry about the constraint $\sum_{i=1}^{n} \lambda_i y_i = 0$.
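A minimal sketch of this alternative setup (a hypothetical helper that simply appends the synthetic constant feature) is shown below; after this transformation, b is absorbed into W and only the constraints λi ≥ 0 remain:

```python
import numpy as np

def add_bias_dimension(X):
    """Append a constant feature of value 1 so that b is folded into W."""
    return np.hstack([X, np.ones((X.shape[0], 1))])
```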

