O = Σ_{i=1}^{n} (W · X_i − y_i)^2 = ||D W^T − y||^2.        (11.3)
Using matrix calculus, the gradient of O with respect to W can be shown to be the d-dimensional vector 2D^T(DW^T − y). Setting the gradient to 0 yields the following d-dimensional vector of optimization conditions:

D^T D W^T = D^T y.        (11.4)
If the symmetric matrix D^T D is invertible, then the solution for W can be derived from the aforementioned condition as W^T = (D^T D)^{-1} D^T y. The numerical class value of a previously unseen test instance T can then be predicted as the dot product between W and T.
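To make the closed-form solution concrete, the following NumPy sketch solves Eq. 11.4 on synthetic data and predicts the response of a test instance; the data matrix D, response vector y, and test instance T are placeholder assumptions rather than anything from the text.

```python
import numpy as np

# Synthetic placeholders: n training points in d dimensions
rng = np.random.default_rng(0)
n, d = 100, 5
D = rng.normal(size=(n, d))                # n x d data matrix
w_true = rng.normal(size=d)                # hidden coefficients used only to generate y
y = D @ w_true + 0.1 * rng.normal(size=n)  # numeric class (response) values

# Closed-form least squares: W^T = (D^T D)^{-1} D^T y
# (np.linalg.solve is used instead of explicitly forming the inverse)
W = np.linalg.solve(D.T @ D, D.T @ y)

# Prediction for a previously unseen test instance T is the dot product W . T
T = rng.normal(size=d)
print(W @ T)
```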
It is noteworthy that the matrix (D^T D)^{-1} D^T is also referred to as the Moore–Penrose pseudoinverse D^+ of the matrix D. Therefore, the solution to linear regression can also be expressed as D^+ y. The pseudoinverse is more generally defined even for the case where D^T D is not invertible:

D^+ = lim_{δ→0} (D^T D + δ^2 I)^{-1} D^T.        (11.5)
The gradient of O stated earlier can be verified by expanding the objective function. Excluding constant terms, the objective function O = (DW^T − y)^T(DW^T − y) can be expanded to the two additive terms W D^T D W^T and −(W D^T y + y^T D W^T) = −2W D^T y. The gradients of these terms are 2D^T D W^T and −2D^T y, respectively. In the event that the Tikhonov regularization term λ||W||^2 is added to the objective function, an additional term of 2λW^T will appear in the gradient.
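The gradient expressions above lend themselves to a quick numerical sanity check. The sketch below compares the analytic gradient 2D^T(DW^T − y) + 2λW^T of the regularized objective with a finite-difference approximation; the data and the value of λ are arbitrary assumptions made only for the check.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 50, 4, 0.5
D = rng.normal(size=(n, d))
y = rng.normal(size=n)
W = rng.normal(size=d)

def objective(w):
    # O = ||D w - y||^2 + lambda * ||w||^2 (Tikhonov-regularized least squares)
    r = D @ w - y
    return r @ r + lam * (w @ w)

# Analytic gradient: 2 D^T (D W^T - y) + 2 lambda W^T
grad_analytic = 2 * D.T @ (D @ W - y) + 2 * lam * W

# Central finite differences along each coordinate direction
eps = 1e-6
grad_numeric = np.array([(objective(W + eps * e) - objective(W - eps * e)) / (2 * eps)
                         for e in np.eye(d)])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))  # expected: True
```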
In Eq. 11.5, I is a d × d identity matrix. When the number of training data points is small, all the training examples might lie on a hyperplane with dimensionality less than d. As a result, the d × d matrix D^T D is not of full rank and therefore not invertible. In other words, the system of equations D^T D W^T = D^T y (Eq. 11.4) is underdetermined and has infinitely many solutions. In this case, the general definition of the Moore–Penrose pseudoinverse in Eq. 11.5 is useful. Even though the inverse of D^T D does not exist, the limit of Eq. 11.5 can still be computed. It is possible to compute D^+ using the SVD of D (cf. Sect. 2.4.3.4 of Chap. 2). More efficient computation methods are possible with the use of the following matrix identity (see Exercise 15):
D^+ = (D^T D)^+ D^T = D^T (D D^T)^+.        (11.6)
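Both the limit definition of Eq. 11.5 and the identity of Eq. 11.6 can be checked numerically against a library pseudoinverse, as in the sketch below; the data dimensions, the small value of δ, and the truncation threshold passed to the library routine are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 20, 50                        # d >> n, so the d x d matrix D^T D is rank-deficient
D = rng.normal(size=(n, d))
D_pinv = np.linalg.pinv(D)           # Moore-Penrose pseudoinverse D^+

# Eq. 11.5: D^+ = lim_{delta -> 0} (D^T D + delta^2 I)^{-1} D^T
delta = 1e-3
limit_form = np.linalg.solve(D.T @ D + delta**2 * np.eye(d), D.T)
print(np.allclose(D_pinv, limit_form, atol=1e-6))        # expected to print True

# Eq. 11.6: D^+ = (D^T D)^+ D^T = D^T (D D^T)^+
# (rcond truncates the numerically zero eigenvalues of the rank-deficient Gram matrix)
left = np.linalg.pinv(D.T @ D, rcond=1e-10) @ D.T
right = D.T @ np.linalg.pinv(D @ D.T)    # uses only a small n x n pseudoinverse when d >> n
print(np.allclose(D_pinv, left), np.allclose(D_pinv, right))
```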
The identity of Eq. 11.6 is useful when d ≫ n or n ≫ d. Here, we will show only the case where d ≫ n because the other case is very similar. The first step is to diagonalize the d × d symmetric matrix D^T D:

D^T D = P Λ P^T.        (11.7)
The columns of P are the orthonormal eigenvectors of D^T D, and Λ is a diagonal matrix containing the eigenvalues. When the matrix D^T D is of rank k < d, the pseudoinverse (D^T D)^+ of D^T D is computed as follows:

(D^T D)^+ = P Λ^+ P^T.        (11.8)

Here, Λ^+ is derived from Λ by setting Λ^+_{ii} = 1/Λ_{ii} for the k nonzero entries, and Λ^+_{ii} = 0 otherwise. The least-squares solution of Eq. 11.4 can then be written as follows:
W^T = (D^T D)^+ D^T y.        (11.9)
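A small sketch of the eigendecomposition route of Eqs. 11.7–11.9 is given below; the rank-deficient data matrix is generated synthetically, and the eigenvalue threshold used to decide which entries of Λ are "nonzero" is an illustrative numerical choice.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 30, 10, 4
D = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))   # rank-k data, so D^T D is singular
y = rng.normal(size=n)

# Eq. 11.7: diagonalize the d x d symmetric matrix D^T D = P Lambda P^T
lam, P = np.linalg.eigh(D.T @ D)

# Lambda^+ : invert the k nonzero eigenvalues, keep the numerically zero ones at 0
tol = 1e-10 * lam.max()
lam_plus = np.zeros_like(lam)
lam_plus[lam > tol] = 1.0 / lam[lam > tol]

# Eqs. 11.8 and 11.9: W^T = (D^T D)^+ D^T y = P Lambda^+ P^T D^T y
W = P @ (lam_plus * (P.T @ (D.T @ y)))

# Agrees with the SVD-based pseudoinverse solution (the minimum-norm least-squares solution)
print(np.allclose(W, np.linalg.pinv(D, rcond=1e-10) @ y))
```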
Even though the underdetermined system of equations D^T D W^T = D^T y has infinitely many solutions, the pseudoinverse always provides a solution W with the smallest L2-norm ||W|| among the alternatives. Smaller coefficients are desirable because they reduce overfitting. Overfitting is a significant problem in general, and especially so when the matrix D^T D is not of full rank. A more effective approach is to use Tikhonov regularization or Lasso. In Tikhonov regularization, also known as ridge regression, a penalty term λ||W||^2 is added to the objective function O of Eq. 11.3, where λ > 0 is a regularization parameter. In that case, the solution for W^T becomes (D^T D + λI)^{-1} D^T y, where I is a d × d identity matrix. The matrix (D^T D + λI) can be shown to be always positive-definite and therefore invertible. The compact solution provided by the Moore–Penrose pseudoinverse is a special case of Tikhonov regularization in which λ is infinitesimally small (i.e., λ → 0). In general, the value of λ should be selected adaptively by cross-validation. In Lasso, an L1-penalty, λ Σ_{i=1}^{d} |w_i|, is used instead of the L2-penalty term. The resulting problem does not have a closed-form solution, and it is solved using iterative techniques such as proximal gradient methods and coordinate descent [256]. Lasso has the tendency to select sparse solutions (i.e., few nonzero components) for W, and it is particularly effective for high-dimensional data with many irrelevant features. Lasso can also be viewed as an embedded model (cf. Sect. 10.2 of Chap. 10) for feature selection because features with zero coefficients are effectively discarded. The main advantage of Lasso over ridge regression is not necessarily one of performance, but rather its highly interpretable feature selection.
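As a concrete illustration of the two penalties, the sketch below fits ridge regression and Lasso on synthetic data in which most features are irrelevant, using scikit-learn's implementations (an external-library choice, not something prescribed by the text); the penalty strengths are arbitrary, and scikit-learn's alpha corresponds to the λ above only up to the library's internal scaling of the loss.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(4)
n, d = 100, 50
D = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:5] = rng.normal(size=5)            # only the first 5 features are relevant
y = D @ w_true + 0.1 * rng.normal(size=n)

ridge = Ridge(alpha=1.0).fit(D, y)         # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(D, y)         # L1 penalty: drives many coefficients to exactly zero

print("nonzero ridge coefficients:", int(np.sum(np.abs(ridge.coef_) > 1e-8)))
print("nonzero lasso coefficients:", int(np.sum(np.abs(lasso.coef_) > 1e-8)))
```

Scikit-learn's Lasso solver is itself based on coordinate descent, one of the iterative techniques mentioned above, and the coefficients it sets to zero correspond to the discarded features in the embedded feature-selection view.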
Although the use of a penalty term for regularization might seem arbitrary, it creates stability by discouraging very large coefficients. By penalizing all regression coefficients, noisy features are often deemphasized to a greater degree. A common manifestation of overfitting
in linear regression is that the additive contribution of a large coefficient to W · X may be frequently canceled by another large coefficient in a small training data set. Such features might be noisy. This situation can lead to inaccurate predictions over unseen test instances because the response predictions are very sensitive to small perturbations in feature values. Regularization prevents this situation by penalizing large coefficients. Bayesian interpretations also exist for these regularization methods. For example, Tikhonov regularization assumes Gaussian priors on the parameters W and class variable. Such assumptions are helpful in obtaining a unique and probabilistically interpretable solution when the available training data is limited.
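The cancellation effect described above is easy to reproduce with two nearly identical features; the following sketch, based on an assumed toy data-generation process, contrasts unregularized least-squares coefficients with Tikhonov-regularized ones.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30
x = rng.normal(size=n)
# Two almost perfectly correlated features
D = np.column_stack([x, x + 1e-3 * rng.normal(size=n)])
y = x + 0.1 * rng.normal(size=n)

# Unregularized least squares on nearly collinear features
w_ols, *_ = np.linalg.lstsq(D, y, rcond=None)

# Tikhonov (ridge) solution: (D^T D + lambda I)^{-1} D^T y
lam = 1.0
w_ridge = np.linalg.solve(D.T @ D + lam * np.eye(2), D.T @ y)

print("least squares:", w_ols)     # the two coefficients tend to be large and of opposite sign
print("ridge        :", w_ridge)   # two moderate coefficients that share the signal
```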
11.5.1.1 Relationship with Fisher’s Linear Discriminant
Fisher’s linear discriminant for binary classes (cf. Sect. 10.2.1.4 of Chap. 10) can be shown to be a special case of least-squares regression. Consider a problem with two classes, in which the two classes 0 and 1 contain a fraction p_0 and p_1, respectively, of the n data points. Assume that the d-dimensional mean vectors of the two classes are μ_0 and μ_1, and the covariance matrices are Σ_0 and Σ_1, respectively. Furthermore, it is assumed that the data matrix D is mean-centered. The response variables y are set to −1/p_0 for class 0 and +1/p_1 for class 1. Note that the response variables are also mean-centered as a result. Let us now examine the solution for W obtained by least-squares regression. The term D^T y is proportional to μ_1^T − μ_0^T, because the value of y is −1/p_0 for a fraction p_0 of the data records belonging to class 0, and it is equal to 1/p_1 for a fraction p_1 of the data records belonging to class 1. In other words, we have: