SVD provides the same basis vectors and data transformation as PCA for data sets in which the mean of each attribute is 0.
The basis vectors of PCA are invariant to mean-translation, whereas those of SVD are not. When the data are not mean centered, the basis vectors of SVD and PCA will not be the same, and qualitatively different results may be obtained. SVD is often applied without mean centering to sparse nonnegative data, such as document-term matrices. A formal way of defining SVD is as a decomposable product of (or factorization into) three matrices:

D = Q Σ P^T
Here, Q is an n × n matrix with orthonormal columns, which are the left singular vectors.
Σ is an n × d diagonal matrix containing the singular values, which are always nonnegative and, by convention, arranged in nonincreasing order. Furthermore, P is a d × d matrix with orthonormal columns, which are the right singular vectors. Note that the diagonal matrix Σ is rectangular rather than square, but it is referred to as diagonal because only entries of the
form Σ_ii are nonzero. It is a fundamental fact of linear algebra that such a decomposition always exists, and a proof may be found in [480]. The number of nonzero diagonal entries of Σ is equal to the rank of the matrix D, which is at most min{n, d}. Furthermore, because of the orthonormality of the singular vectors, both P^T P and Q^T Q are identity matrices. We make the following observations:
1. The columns of matrix Q, which are also the left singular vectors, are the orthonormal eigenvectors of DD^T. This is because DD^T = QΣ(P^T P)Σ^T Q^T = QΣΣ^T Q^T. Therefore, the squared nonzero singular values, which are the diagonal entries of the n × n diagonal matrix ΣΣ^T, represent the nonzero eigenvalues of DD^T.
2. The columns of matrix P, which are also the right singular vectors, are the orthonormal eigenvectors of D^T D, because D^T D = PΣ^T (Q^T Q)ΣP^T = PΣ^T ΣP^T. The squared nonzero singular values, which are the diagonal entries of the d × d diagonal matrix Σ^T Σ, are the nonzero eigenvalues of D^T D. Note that the nonzero eigenvalues of DD^T and D^T D are the same. The matrix P is particularly important because it provides the basis vectors, which are analogous to the eigenvectors of the covariance matrix in PCA.
3. Because the covariance matrix of mean-centered data is D^T D / n (cf. Eq. 2.7) and the right singular vectors of SVD are eigenvectors of D^T D, it follows that the eigenvectors of PCA are the same as the right singular vectors of SVD for mean-centered data. Furthermore, the squared singular values in SVD are n times the eigenvalues of PCA. This equivalence shows why SVD and PCA can provide the same transformation for mean-centered data.
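The following is a small NumPy sketch, not from the text, that checks these observations numerically. The random data matrix D, the variable names, and the sizes are illustrative assumptions; the checks mirror the factorization D = QΣP^T and observations 2 and 3 for mean-centered data.

import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 3
D = rng.normal(size=(n, d))
D = D - D.mean(axis=0)                  # mean-center so observation 3 applies

# Full SVD: D = Q @ Sigma @ P.T with Q (n x n), Sigma (n x d), P (d x d)
Q, s, Pt = np.linalg.svd(D, full_matrices=True)
Sigma = np.zeros((n, d))
Sigma[:d, :d] = np.diag(s)              # singular values on the diagonal
P = Pt.T
assert np.allclose(D, Q @ Sigma @ P.T)

# Observation 2: the columns of P are eigenvectors of D^T D, and the
# squared singular values are the corresponding eigenvalues.
eigvals = np.linalg.eigvalsh(D.T @ D)   # returned in ascending order
assert np.allclose(eigvals[::-1], s ** 2)

# Observation 3: for mean-centered data the covariance matrix is D^T D / n,
# so the squared singular values equal n times the PCA eigenvalues.
pca_eigvals = np.linalg.eigvalsh(D.T @ D / n)[::-1]
assert np.allclose(s ** 2, n * pca_eigvals)
print("squared singular values:", s ** 2)
print("n * PCA eigenvalues:    ", n * pca_eigvals)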
Without loss of generality, it can be assumed that the diagonal entries of Σ are arranged in nonincreasing order, and that the columns of matrices P and Q are ordered accordingly. Let P_k and Q_k be the truncated d × k and n × k matrices obtained by selecting the first k columns of P and Q, respectively, and let Σ_k be the k × k square diagonal matrix containing the top k singular values. Then, the SVD factorization yields an approximate d-dimensional representation of the original data set D:

D ≈ Q_k Σ_k P_k^T
The columns of P_k represent a k-dimensional basis system for a reduced representation of the data set. The dimensionality-reduced data set in this k-dimensional basis system is given by the n × k matrix D_k = DP_k = Q_k Σ_k, as in Eq. 2.10 of PCA. Each of the n rows of D_k contains the k coordinates of the corresponding transformed data point in this new axis system. Typically, the value of k is much smaller than both n and d. Furthermore, unlike PCA, the rightmost (d − k) columns of the full d-dimensional transformed data matrix DP will be approximately 0 (rather than the data mean), whether the data are mean centered or not. In general, PCA projects the data on a low-dimensional hyperplane passing through the data mean, whereas SVD projects the data on a low-dimensional hyperplane passing through the origin. PCA captures as much of the variance (i.e., the squared Euclidean distance about the mean) of the data as possible, whereas SVD captures as much of the aggregate squared Euclidean distance about the origin as possible. This method of approximating a data matrix is referred to as
truncated SVD.
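Below is a minimal NumPy sketch of truncated SVD; the synthetic, deliberately non-centered data and all variable names are illustrative assumptions rather than material from the text. It verifies that the reduced representation D_k = DP_k equals Q_k Σ_k, and contrasts the rank-k SVD reconstruction (a hyperplane through the origin) with the PCA projection (a hyperplane through the data mean).

import numpy as np

rng = np.random.default_rng(1)
n, d, k = 100, 5, 2
D = rng.normal(size=(n, d)) + 3.0        # deliberately NOT mean centered

# Truncated SVD: keep only the top-k singular values and vectors.
Q, s, Pt = np.linalg.svd(D, full_matrices=False)
Qk, sk, Pk = Q[:, :k], s[:k], Pt[:k, :].T

# Reduced n x k representation: D_k = D P_k = Q_k Sigma_k
Dk = D @ Pk
assert np.allclose(Dk, Qk * sk)          # Qk * sk broadcasts as Qk @ diag(sk)

# Rank-k SVD approximation: a projection onto a hyperplane through the origin.
D_svd = (Qk * sk) @ Pk.T

# PCA instead projects onto a hyperplane through the data mean.
mean = D.mean(axis=0)
_, _, Vt = np.linalg.svd(D - mean, full_matrices=False)
Wk = Vt[:k, :].T                          # top-k PCA basis vectors
D_pca = mean + (D - mean) @ Wk @ Wk.T

# With a nonzero mean, the two projections can differ noticeably, since SVD
# measures squared distance about the origin and PCA about the mean.
print("SVD aggregate squared error:", np.sum((D - D_svd) ** 2))
print("PCA aggregate squared error:", np.sum((D - D_pca) ** 2))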
In the following, we will show that truncated SVD maximizes the aggregate squared Euclidean distances (or energy) of the transformed data points about the origin. Let v be a