Data Mining: The Textbook




O = Σ_{i=1}^{n} (W · X_i − y_i)^2 = ||D W^T − y||^2.                (11.3)
Using² matrix calculus, the gradient of O with respect to W can be shown to be the d-dimensional vector 2D^T(D W^T − y). Setting the gradient to 0 yields the following d-dimensional vector of optimization conditions:

D^T D W^T = D^T y.                (11.4)
If the symmetric matrix D^T D is invertible, then the solution for W can be derived from the aforementioned condition as W^T = (D^T D)^{-1} D^T y. The numerical class value of a previously unseen test instance T can then be predicted as the dot product between W and T.

It is noteworthy that the matrix (D^T D)^{-1} D^T is also referred to as the Moore–Penrose pseudoinverse D^+ of the matrix D. Therefore, the solution to linear regression can also be expressed as D^+ y. The pseudoinverse is more generally defined even for the case where D^T D is not invertible:





D^+ = lim_{δ→0} (D^T D + δ^2 I)^{-1} D^T.                (11.5)




² Excluding constant terms, the objective function O = (D W^T − y)^T (D W^T − y) can be expanded into the two additive terms W D^T D W^T and −(W D^T y + y^T D W^T) = −2 W D^T y. The gradients of these terms with respect to W are 2 D^T D W^T and −2 D^T y, respectively. In the event that the Tikhonov regularization term λ||W||^2 is added to the objective function, an additional term of 2λW^T will appear in the gradient.
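As a concrete sketch of Eqs. 11.3–11.5 (not taken from the book; the synthetic data and variable names are purely illustrative), ordinary least-squares regression can be solved in closed form with NumPy:

import numpy as np

# Synthetic training data: n = 100 points in d = 5 dimensions (illustrative only).
rng = np.random.default_rng(0)
n, d = 100, 5
D = rng.standard_normal((n, d))                  # n x d data matrix D
true_w = np.array([1.5, -2.0, 0.0, 0.7, 3.0])    # made-up coefficients
y = D @ true_w + 0.1 * rng.standard_normal(n)    # numeric responses with small noise

# Solve the optimization conditions of Eq. 11.4: (D^T D) W^T = D^T y.
W_normal_eq = np.linalg.solve(D.T @ D, D.T @ y)

# Equivalent solution via the Moore-Penrose pseudoinverse D^+ (Eq. 11.5);
# np.linalg.pinv remains well defined even when D^T D is singular.
W_pinv = np.linalg.pinv(D) @ y
print(np.allclose(W_normal_eq, W_pinv))          # True when D^T D is invertible

# The numeric class value of an unseen test instance T is the dot product W . T.
T = rng.standard_normal(d)
prediction = W_pinv @ T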





In Eq. 11.5, I is a d × d identity matrix. When the number of training data points is small, all the training examples might lie on a hyperplane with dimensionality less than d. As a result, the d × d matrix D^T D is not of full rank and therefore not invertible. In other words, the system of equations D^T D W^T = D^T y is underdetermined and has infinitely many solutions. In this case, the general definition of the Moore–Penrose pseudoinverse in Eq. 11.5 is useful. Even though the inverse of D^T D does not exist, the limit of Eq. 11.5 can still be computed. It is possible to compute D^+ using the SVD of D (cf. Sect. 2.4.3.4 of Chap. 2). More efficient computation methods are possible with the use of the following matrix identity (see Exercise 15):





D^+ = (D^T D)^+ D^T = D^T (D D^T)^+.                (11.6)

This identity is useful when d ≪ n or n ≪ d. Here, we will show only the case where d ≪ n because the other case is very similar. The first step is to diagonalize the d × d symmetric matrix D^T D:

D^T D = P Λ P^T.                (11.7)




The columns of P are the orthonormal eigenvectors of D^T D and Λ is a diagonal matrix containing the eigenvalues. When the matrix D^T D is of rank k < d, the pseudoinverse (D^T D)^+ of D^T D is computed as follows:





(D^T D)^+ = P Λ^+ P^T.                (11.8)

Λ^+ is derived from Λ by setting Λ^+_{ii} to 1/Λ_{ii} for the k nonzero entries, and to 0 otherwise.








W^T = (D^T D)^+ D^T y.                (11.9)
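For a rank-deficient D^T D, the eigendecomposition route of Eqs. 11.7–11.9 can be sketched as follows (again an illustrative NumPy sketch with made-up data, checked against the library's SVD-based pseudoinverse):

import numpy as np

# Underdetermined setting: fewer training points than dimensions, so D^T D is rank deficient.
rng = np.random.default_rng(1)
n, d = 5, 8
D = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Eq. 11.7: D^T D = P Lambda P^T (symmetric matrix, so eigh is appropriate).
lam, P = np.linalg.eigh(D.T @ D)

# Eq. 11.8: invert only the (numerically) nonzero eigenvalues to obtain Lambda^+.
tol = 1e-10 * lam.max()
lam_plus = np.zeros_like(lam)
lam_plus[lam > tol] = 1.0 / lam[lam > tol]
DtD_pinv = P @ np.diag(lam_plus) @ P.T

# Eq. 11.9: W^T = (D^T D)^+ D^T y.
W = DtD_pinv @ D.T @ y

# Agrees with the SVD-based pseudoinverse of Eq. 11.6.
print(np.allclose(W, np.linalg.pinv(D) @ y))     # expected: True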




Even though the underdetermined system of equations D^T D W^T = D^T y has infinitely many solutions, the pseudoinverse always provides a solution W with the smallest L2-norm ||W|| among the alternatives. Smaller coefficients are desirable because they reduce overfitting. Overfitting is a significant problem in general, and especially so when the matrix D^T D is not of full rank. A more effective approach is to use Tikhonov regularization or Lasso. In Tikhonov regularization, also known as ridge regression, a penalty term λ||W||^2 is added to the objective function O of Eq. 11.3, where λ > 0 is a regularization parameter. In that case, the solution for W^T becomes (D^T D + λI)^{-1} D^T y, where I is a d × d identity matrix. The matrix (D^T D + λI) can be shown to be always positive definite and therefore invertible. The compact solution provided by the Moore–Penrose pseudoinverse is a special case of Tikhonov regularization in which λ is infinitesimally small (i.e., λ → 0). In general, the value of λ should be selected adaptively by cross-validation. In Lasso, an L1-penalty λ Σ_{i=1}^{d} |w_i| is used instead of the L2-penalty term. The resulting problem does not have a closed-form solution, and it is solved using iterative techniques such as proximal gradient methods and coordinate descent [256]. Lasso has the tendency to select sparse solutions (i.e., few nonzero components) for W, and it is particularly effective for high-dimensional data with many irrelevant features. Lasso can also be viewed as an embedded model (cf. Sect. 10.2 of Chap. 10) for feature selection because features with zero coefficients are effectively discarded. The main advantage of Lasso over ridge regression is not necessarily one of performance, but rather its highly interpretable feature selection.
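The closed-form ridge solution and an iterative Lasso fit can be sketched as follows; the data is synthetic, and the Lasso call assumes scikit-learn's coordinate-descent implementation is available (any comparable solver would do):

import numpy as np
from sklearn.linear_model import Lasso   # assumes scikit-learn is installed

rng = np.random.default_rng(2)
n, d = 15, 30                            # n < d, so D^T D is not invertible
D = rng.standard_normal((n, d))
true_w = np.zeros(d)
true_w[:3] = [2.0, -1.0, 0.5]            # only three relevant features (illustrative)
y = D @ true_w + 0.1 * rng.standard_normal(n)

# Ridge regression (Tikhonov regularization): W^T = (D^T D + lambda I)^{-1} D^T y.
lam = 1.0
W_ridge = np.linalg.solve(D.T @ D + lam * np.eye(d), D.T @ y)

# As lambda -> 0, the ridge solution approaches the pseudoinverse (minimum-norm) solution.
W_tiny_lam = np.linalg.solve(D.T @ D + 1e-6 * np.eye(d), D.T @ y)
print(np.allclose(W_tiny_lam, np.linalg.pinv(D) @ y, atol=1e-4))   # expected: True

# Lasso has no closed form; coordinate descent typically yields a sparse W.
lasso = Lasso(alpha=0.1, fit_intercept=False, max_iter=10000).fit(D, y)
print(int(np.sum(lasso.coef_ != 0)), "nonzero coefficients out of", d)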

Although the use of a penalty term for regularization might seem arbitrary, it creates stability by discouraging very large coefficients. Because all regression coefficients are penalized, noisy features are often deemphasized to a greater degree. A common manifestation of overfitting in linear regression is that the additive contribution of a large coefficient to W · X may frequently be canceled by another large coefficient in a small training data set. Such features might be noisy. This situation can lead to inaccurate predictions on unseen test instances because the response predictions are very sensitive to small perturbations in feature values. Regularization prevents this situation by penalizing large coefficients. Bayesian interpretations also exist for these regularization methods. For example, Tikhonov regularization assumes Gaussian priors on the parameters W and the class variable. Such assumptions are helpful in obtaining a unique and probabilistically interpretable solution when the available training data is limited.
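As a small, hedged illustration of this instability (the data and setup are made up for this sketch): two nearly collinear features invite large, mutually canceling coefficients, which ridge regression suppresses. The ridge solution's L2-norm is never larger than that of the unregularized least-squares solution.

import numpy as np

rng = np.random.default_rng(4)
n = 20
x1 = rng.standard_normal(n)
x2 = x1 + 1e-3 * rng.standard_normal(n)      # nearly collinear copy of x1
D = np.column_stack([x1, x2])
y = 2.0 * x1 + 0.1 * rng.standard_normal(n)

# Unregularized least squares: the two coefficients can become large and
# nearly cancel each other, making predictions sensitive to small perturbations.
W_ols = np.linalg.lstsq(D, y, rcond=None)[0]

# Ridge regression with a modest penalty keeps the coefficients small.
lam = 1.0
W_ridge = np.linalg.solve(D.T @ D + lam * np.eye(2), D.T @ y)

print(np.linalg.norm(W_ols), np.linalg.norm(W_ridge))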


11.5.1.1 Relationship with Fisher’s Linear Discriminant


Fisher’s linear discriminant for binary classes (cf. Sect. 10.2.1.4 of Chap. 10) can be shown to be a special case of least-squares regression. Consider a problem with two classes, in which the two classes 0 and 1 contain fractions p_0 and p_1, respectively, of the n data points. Assume that the d-dimensional mean vectors of the two classes are μ_0 and μ_1, and the covariance matrices are Σ_0 and Σ_1, respectively. Furthermore, it is assumed that the data matrix D is mean-centered. The response variables y are set to −1/p_0 for class 0 and +1/p_1 for class 1. Note that the response variables are also mean-centered as a result. Let us now examine the solution for W obtained by least-squares regression. The term D^T y is proportional to μ_1^T − μ_0^T, because the value of y is −1/p_0 for the fraction p_0 of the data records belonging to class 0, and it is equal to +1/p_1 for the fraction p_1 of the data records belonging to class 1. In other words, we have:

















(D^T D) W^T = D^T y ∝ (μ_1 − μ_0)^T.
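To make the connection concrete, the following NumPy sketch (not from the book; the two synthetic Gaussian classes are made up for illustration) fits least-squares regression with the above response encoding and checks that the resulting W is parallel to the classical Fisher direction S_W^{-1}(μ_1 − μ_0), where S_W denotes the within-class scatter matrix:

import numpy as np

rng = np.random.default_rng(3)

# Two synthetic Gaussian classes in d = 4 dimensions (illustrative data only).
d, n0, n1 = 4, 60, 40
X0 = rng.standard_normal((n0, d)) + np.array([0.0, 0.0, 1.0, 2.0])
X1 = rng.standard_normal((n1, d)) @ np.diag([1.0, 2.0, 0.5, 1.0]) + np.array([2.0, 1.0, 0.0, 0.0])
X = np.vstack([X0, X1])
n = n0 + n1
p0, p1 = n0 / n, n1 / n

# Mean-center the data matrix and encode the responses as -1/p0 and +1/p1.
D = X - X.mean(axis=0)
y = np.concatenate([np.full(n0, -1.0 / p0), np.full(n1, +1.0 / p1)])

# Least-squares solution W^T = D^+ y.
W = np.linalg.pinv(D) @ y

# Fisher direction: within-class scatter inverse times the mean difference.
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S_w = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
w_fisher = np.linalg.solve(S_w, mu1 - mu0)

# The two directions should be parallel: cosine similarity of 1.0 up to numerical error.
cos = W @ w_fisher / (np.linalg.norm(W) * np.linalg.norm(w_fisher))
print(round(float(cos), 6))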



























