D^T D W^T ∝ (μ_1^T − μ_0^T).
For mean-centered data, D^T D / n is equal to the covariance matrix. It can be shown using some simple algebra (see Exercise 21 of Chap. 10) that the covariance matrix is equal to S_w + p_0 p_1 S_b, where S_w = (p_0 Σ_0 + p_1 Σ_1) and S_b = (μ_1 − μ_0)^T (μ_1 − μ_0) are the (scaled) d × d within-class and between-class scatter matrices, respectively. Therefore, we have:
(S_w + p_0 p_1 S_b) W^T ∝ (μ_1^T − μ_0^T)                (11.10)
Furthermore, the vector S_b W^T always points in the direction μ_1^T − μ_0^T because S_b W^T = (μ_1^T − μ_0^T) (μ_1 − μ_0) W^T, and (μ_1 − μ_0) W^T is simply a scalar. This implies that we can drop the term involving S_b from Eq. 11.10 without affecting the constant of proportionality:
S_w W^T ∝ (μ_1^T − μ_0^T)
(p_0 Σ_0 + p_1 Σ_1) W^T ∝ (μ_1^T − μ_0^T)
It is easy to see that the vector W is the same as Fisher's linear discriminant of Sect. 10.2.1.4 in Chap. 10.
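This equivalence is easy to verify numerically. The short sketch below is not part of the text; it assumes NumPy and synthetic two-class Gaussian data, fits least-squares coefficients on a mean-centered data matrix with a numerically coded binary class, and checks that the resulting direction is parallel to S_w^{-1}(μ_1 − μ_0):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data with different means and spreads
n0, n1, d = 120, 80, 5
X0 = rng.normal(loc=0.0, scale=1.0, size=(n0, d))
X1 = rng.normal(loc=1.0, scale=2.0, size=(n1, d))
D = np.vstack([X0, X1])
y = np.concatenate([-np.ones(n0), np.ones(n1)])   # binary class coded numerically

D = D - D.mean(axis=0)                            # mean-center the data matrix

# Least-squares regression coefficients W^T = (D^T D)^{-1} D^T y
w_ls, *_ = np.linalg.lstsq(D, y, rcond=None)

# Fisher direction S_w^{-1} (mu_1 - mu_0), with S_w = p_0 Sigma_0 + p_1 Sigma_1
p0, p1 = n0 / (n0 + n1), n1 / (n0 + n1)
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
Sw = p0 * np.cov(X0, rowvar=False, bias=True) + p1 * np.cov(X1, rowvar=False, bias=True)
w_fisher = np.linalg.solve(Sw, mu1 - mu0)

# The two directions coincide up to scaling, so the cosine similarity is ~1.0
cos = w_ls @ w_fisher / (np.linalg.norm(w_ls) * np.linalg.norm(w_fisher))
print(round(float(cos), 6))
```

Because the decomposition D^T D / n = S_w + p_0 p_1 S_b holds exactly for the sample estimates, the reported cosine similarity equals 1 up to floating-point error.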
11.5.2 Principal Component Regression
Because overfitting is caused by the large number of parameters in W , a natural approach is to work with a reduced dimensionality data matrix. In principal component regression,
the largest k ≪ d principal components of the input data matrix D (cf. Sect. 2.4.3.1 of Chap. 2) with nonzero eigenvalues are determined. These principal components are the top-k eigenvectors of the d × d covariance matrix of D. Let the top-k eigenvectors be arranged in matrix form as the orthonormal columns of the d × k matrix P_k. The original n × d data matrix D is transformed to a new n × k data matrix R = D P_k. The rows Z_1 . . . Z_n of R, which form a new derived set of k-dimensional input variables, are used as training data to learn a reduced k-dimensional set of coefficients W by minimizing the least-squares error ||R W^T − y||^2.
In this case, the k-dimensional vector of regression coefficients W can be expressed in terms of R as (R^T R)^{-1} R^T y. This solution is identical to the previous case, except that a smaller and full-rank k × k matrix R^T R is inverted. Prediction on a test instance T is performed after transforming it to the new k-dimensional space as T P_k. The dot product between T P_k and W provides the numerical prediction for the test instance. The effectiveness of principal component regression stems from discarding the low-variance dimensions, which are either redundant directions (zero eigenvalues) or noisy directions (very small eigenvalues). If all directions are retained after the PCA-based axis rotation (i.e., k = d), then the approach yields the same results as linear regression on the original data. It is common to standardize the data matrix D to zero mean and unit variance before performing PCA. In such cases, the test instances also need to be scaled and translated in an identical way.
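A minimal sketch of this procedure is shown below. It assumes NumPy, uses a plain eigendecomposition of the covariance matrix in place of any preferred PCA routine, and the function names and synthetic data are illustrative rather than from the text:

```python
import numpy as np

def principal_component_regression(D, y, k):
    """Project onto the top-k eigenvectors of the covariance matrix,
    then solve least squares in the reduced k-dimensional space."""
    mean = D.mean(axis=0)
    Dc = D - mean                              # mean-center (full standardization optional)
    cov = Dc.T @ Dc / len(Dc)                  # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    Pk = eigvecs[:, ::-1][:, :k]               # d x k matrix of top-k eigenvectors
    R = Dc @ Pk                                # n x k reduced data matrix
    W = np.linalg.solve(R.T @ R, R.T @ y)      # (R^T R)^{-1} R^T y
    return mean, Pk, W

def predict(T, mean, Pk, W):
    """Translate the test instance as the training data was, map it to the
    k-dimensional space as T P_k, and take the dot product with W."""
    return (T - mean) @ Pk @ W

# Illustrative usage with random data
rng = np.random.default_rng(1)
D = rng.normal(size=(200, 10))
y = D @ rng.normal(size=10) + 0.1 * rng.normal(size=200)
mean, Pk, W = principal_component_regression(D, y, k=4)
print(predict(rng.normal(size=10), mean, Pk, W))
```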
11.5.3 Generalized Linear Models
The implicit assumption in linear models is that a constant change in the ith feature variable leads to a constant change in the response variable, which is proportional to w_i. However, such assumptions are inappropriate in many settings. For example, if the response variable is the height of a person, and the feature variable is the age, the height is not expected to vary linearly with age. Furthermore, the model needs to account for the fact that such variables can never be negative. In other cases, such as customer ratings, the response variables might take on integer values from a bounded range. Nevertheless, the elegant simplicity of linear models can still be leveraged in these settings. In generalized linear models (GLM), each response variable y_i is modeled as an outcome of a (typically exponential) probability distribution with mean f(W · X_i) as follows:
y_i ∼ probability distribution with mean f(W · X_i)   for each i ∈ {1 . . . n}
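As one concrete instance of this framework (an illustrative assumption rather than the text's own example), the sketch below fits a Poisson GLM, in which the response is a nonnegative count and the mean function is f(W · X_i) = exp(W · X_i), by gradient ascent on the log-likelihood:

```python
import numpy as np

def fit_poisson_glm(X, y, lr=0.01, n_iter=2000):
    """Fit a Poisson GLM with log link: each y_i has mean exp(W . X_i).
    Gradient ascent on the log-likelihood sum_i [y_i (W . X_i) - exp(W . X_i)]."""
    n, d = X.shape
    W = np.zeros(d)
    for _ in range(n_iter):
        mu = np.exp(X @ W)             # predicted (always nonnegative) conditional means
        grad = X.T @ (y - mu) / n      # gradient of the average log-likelihood
        W += lr * grad
    return W

# Illustrative usage: count-valued responses generated from a Poisson model
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
W_true = np.array([0.5, -0.3, 0.2])
y = rng.poisson(np.exp(X @ W_true))
print(fit_poisson_glm(X, y))           # should be close to W_true
```

The exponential mean function keeps every predicted response nonnegative, which addresses the constraint discussed above for variables such as height or counts.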