c_{ij} = \frac{\sum_{k=1}^{n} x_i^k x_j^k}{n} - \mu_i \mu_j \quad \forall \, i, j \in \{1 \ldots d\} \qquad (2.6)
Let μ = (μ_1 . . . μ_d) be the d-dimensional row vector representing the means along the different dimensions. Then, the aforementioned d × d computations of Eq. 2.6 for different values of i and j can be expressed compactly in d × d matrix form as follows:

C = \frac{D^T D}{n} - \mu^T \mu \qquad (2.7)
Note that the d diagonal entries of the matrix C correspond to the d variances. The covariance matrix C is positive semi-definite, because it can be shown that for any d-dimensional column vector v, the value of v^T C v is equal to the variance of the 1-dimensional projection Dv of the data set D on v.
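As an illustration, the covariance matrix of Eq. 2.7 can be computed directly from the raw data matrix with a few lines of NumPy. The sketch below is a minimal illustration rather than a reference implementation; the function name covariance_matrix and the random example data are hypothetical, and the population (divide-by-n) convention of Eq. 2.7 is assumed.

```python
import numpy as np

def covariance_matrix(D):
    """Covariance matrix of Eq. 2.7: C = (D^T D)/n - mu^T mu, for an n x d data matrix D."""
    n = D.shape[0]
    mu = D.mean(axis=0)                     # d-dimensional row vector of means
    return (D.T @ D) / n - np.outer(mu, mu)

# Example usage: the diagonal entries of C are the per-dimension variances.
D = np.random.rand(100, 5)
C = covariance_matrix(D)
assert np.allclose(np.diag(C), D.var(axis=0))
```

Note that NumPy's built-in np.cov defaults to the sample (n − 1) normalization, which differs slightly from the divide-by-n convention used here.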
v^T C v = \frac{(Dv)^T (Dv)}{n} - (\mu v)^2 = \text{Variance of 1-dimensional points in } Dv \geq 0 \qquad (2.8)
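This identity is also easy to check numerically. The short sketch below, which assumes NumPy and reuses the divide-by-n covariance of Eq. 2.7, verifies that v^T C v equals the variance of the projected points Dv for an arbitrary unit direction v; all variable names are illustrative.

```python
import numpy as np

D = np.random.rand(100, 5)
n, mu = D.shape[0], D.mean(axis=0)
C = (D.T @ D) / n - np.outer(mu, mu)       # covariance matrix of Eq. 2.7

v = np.random.rand(5)
v = v / np.linalg.norm(v)                  # an arbitrary unit direction

projection = D @ v                         # 1-dimensional projection Dv
lhs = v @ C @ v                            # quadratic form v^T C v
assert np.isclose(lhs, projection.var()) and lhs >= 0   # Eq. 2.8
```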
In fact, the goal of PCA is to successively determine orthonormal vectors v maximizing v^T C v. How can one determine such directions? Because the covariance matrix is symmetric and positive semi-definite, it can be diagonalized as follows:

C = P \Lambda P^T \qquad (2.9)
The columns of the matrix P contain the orthonormal eigenvectors of C, and Λ is a diagonal matrix containing the nonnegative eigenvalues. The entry Λ_ii is the eigenvalue corresponding to the ith eigenvector (or column) of the matrix P. These eigenvectors represent successive orthogonal solutions to the aforementioned optimization model maximizing the variance v^T C v along the unit direction v.
Setting the gradient of the Lagrangian relaxation v^T C v − λ(||v||^2 − 1) to 0 is equivalent to the eigenvector condition Cv − λv = 0. The variance along an eigenvector is v^T C v = v^T λv = λ. Therefore, the orthonormal eigenvectors should be included in decreasing order of eigenvalue λ in order to maximize the variance preserved in the reduced subspace.
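This recipe maps directly onto a symmetric eigensolver: diagonalize C and sort the eigenvector columns by decreasing eigenvalue. The following sketch assumes NumPy; the function name top_principal_components is hypothetical, and the covariance is again computed with the divide-by-n convention of Eq. 2.7.

```python
import numpy as np

def top_principal_components(D, k):
    """Return the k leading eigenvectors (as columns) and eigenvalues of the covariance matrix."""
    n = D.shape[0]
    mu = D.mean(axis=0)
    C = (D.T @ D) / n - np.outer(mu, mu)    # covariance matrix (Eq. 2.7)
    eigenvalues, P = np.linalg.eigh(C)      # symmetric eigensolver (ascending eigenvalues)
    order = np.argsort(eigenvalues)[::-1]   # sort by decreasing eigenvalue
    return P[:, order[:k]], eigenvalues[order[:k]]

# Example: the variance along each retained eigenvector equals its eigenvalue.
D = np.random.rand(100, 5)
P_k, lambdas = top_principal_components(D, k=2)
for v, lam in zip(P_k.T, lambdas):
    assert np.isclose((D @ v).var(), lam)
```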
An interesting property of this diagonalization is that both the eigenvectors and eigenvalues have a geometric interpretation in terms of the underlying data distribution. Specifically, if the axis system of data representation is rotated to the orthonormal set of eigenvectors in the columns of P, then it can be shown that all d(d − 1)/2 pairwise covariances of the newly transformed feature values are zero. In other words, the greatest variance-preserving directions are also the correlation-removing directions. Furthermore, the eigenvalues represent the variances of the data along the corresponding eigenvectors. In fact, the diagonal matrix Λ is the new covariance matrix after axis rotation. Therefore, eigenvectors with large eigenvalues preserve greater variance, and are also referred to as principal components. Because of the nature of the optimization formulation used to derive this transformation, a new axis system containing only the eigenvectors with the largest eigenvalues is optimized to retain the maximum variance in a fixed number of dimensions. For example, the scatter plot of Fig. 2.2 illustrates the various eigenvectors, and it is evident that the eigenvector with the largest variance is all that is needed to create a variance-preserving representation. It generally suffices to retain only a small number of eigenvectors with large eigenvalues.
Without loss of generality, it can be assumed that the columns of P (and the corresponding diagonal entries of Λ) are arranged from left to right in such a way that they correspond to decreasing eigenvalues. Then, the transformed data matrix D' in the new coordinate system after axis rotation to the orthonormal columns of P can be algebraically computed as the following linear transformation:
D' = D P \qquad (2.10)
While the transformed data matrix D' is also of size n × d, only its first (leftmost) k ≪ d columns will show significant variation in values. Each of the remaining (d − k) columns of D' will be approximately equal to the mean of the data in the rotated axis system. For mean-centered data, the values of these (d − k) columns will be almost 0. Therefore, the dimensionality of the data can be reduced, and only the first k columns of the transformed data matrix D' may need to be retained for representation purposes. Furthermore, it can be confirmed that the covariance matrix of the transformed data D' = DP is the diagonal matrix Λ by applying the covariance definition of Eq. 2.7 to DP (transformed data) and μP (transformed mean) instead of D and μ, respectively. The resulting covariance matrix can be expressed in terms of the original covariance matrix C as P^T C P. Substituting C = P Λ P^T from Eq. 2.9 shows the equivalence because P^T P = P P^T = I. In other words, correlations have been removed from the transformed data because Λ is diagonal.
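The following sketch makes the rotation and truncation concrete: it rotates a mean-centered data matrix onto the eigenvector basis of Eq. 2.10, retains the k leading columns, and confirms that the covariance matrix of the transformed data is (numerically) the diagonal matrix Λ. NumPy and mean-centered data are assumed, and the variable names are illustrative.

```python
import numpy as np

D = np.random.rand(100, 5)
D = D - D.mean(axis=0)                     # mean-center so the discarded columns are ~0

C = (D.T @ D) / D.shape[0]                 # covariance matrix (mu = 0 after centering)
eigenvalues, P = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]      # columns of P in decreasing eigenvalue order
P, Lambda = P[:, order], np.diag(eigenvalues[order])

D_prime = D @ P                            # axis rotation of Eq. 2.10: D' = DP
k = 2
D_reduced = D_prime[:, :k]                 # retain only the first k columns

C_prime = (D_prime.T @ D_prime) / D.shape[0]
assert np.allclose(C_prime, Lambda)        # covariance of D' is diagonal: correlations removed
```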
The variance of the data set defined by projections along the top-k eigenvectors is equal to the sum of the k corresponding eigenvalues. In many applications, the eigenvalues show a precipitous drop-off after the first few values. For example, the behavior of the eigenvalues for the 279-dimensional Arrythmia data set from the UCI Machine Learning Repository [213] is illustrated in Fig. 2.3. Figure 2.3a shows the absolute magnitude of the eigenvalues in increasing order, whereas Fig. 2.3b shows the total amount of variance retained in the top-k eigenvalues. Figure 2.3b can be derived by using the cumulative sum of the smallest eigenvalues in Fig. 2.3a. It is interesting to note that the 215 smallest eigenvalues contain less than 1% of the total variance in the data and can therefore be removed with little change to the results of similarity-based applications. Note that the Arrythmia data set is not a very strongly correlated data set along many pairs of dimensions. Yet, the dimensionality reduction is drastic because of the cumulative effect of the correlations across many dimensions.
The means of the remaining columns also need to be stored if the data set is not mean centered.
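The curve in Fig. 2.3b is simply the cumulative sum of the sorted eigenvalues divided by their total, and the same computation can be used to pick the number of retained dimensions. A minimal sketch is shown below, assuming NumPy; the helper name choose_k, the 99% threshold, and the toy eigenvalue list are all illustrative rather than taken from the Arrythmia experiment.

```python
import numpy as np

def choose_k(eigenvalues, variance_fraction=0.99):
    """Smallest k such that the top-k eigenvalues retain the desired fraction of the total variance."""
    lambdas = np.sort(np.asarray(eigenvalues))[::-1]   # decreasing order
    retained = np.cumsum(lambdas) / lambdas.sum()      # cumulative fraction of variance
    return int(np.searchsorted(retained, variance_fraction) + 1)

# Example: eigenvalues with a sharp drop-off require only a few dimensions.
eigenvalues = [50.0, 30.0, 15.0, 3.0, 1.0, 0.5, 0.3, 0.2]
print(choose_k(eigenvalues))   # number of dimensions preserving at least 99% of the variance
```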
[Plots omitted: (a) magnitude of eigenvalues (increasing index): Arrythmia; (b) variance in smallest k eigenvalues: Arrythmia. The x-axis of both panels is the increasing index of the eigenvalue, and the y-axis of panel (b) is the total cumulative variance.]
Figure 2.3: Variance retained with increasing number of eigenvalues for the Arrythmia data set
The eigenvectors of the matrix C may be determined by using any numerical method discussed in [295] or by an off-the-shelf eigenvector solver. PCA can be extended to discovering nonlinear embeddings with the use of a method known as the kernel trick. Refer to Sect. 10.6.4.1 of Chap. 10 for a brief description of kernel PCA.
2.4.3.2 Singular Value Decomposition
Singular value decomposition (SVD) is closely related to principal component analysis (PCA). However, the two are distinct methods that are sometimes confused with one another because of this close relationship. Before beginning the discussion of SVD, we state how it is related to PCA. SVD is more general than PCA because it provides two sets of basis vectors instead of one: SVD provides basis vectors of both the rows and the columns of the data matrix, whereas PCA only provides basis vectors of the rows of the data matrix. Furthermore, SVD provides the same basis as PCA for the rows of the data matrix in certain special cases: