Figure 2.4: Complementary basis properties of matrix factorization in SVD
Let $v$ be a $d$-dimensional column vector and $Dv$ be the projection of the data set $D$ on $v$. Consider the problem of determining the unit vector $v$ such that the sum of squared Euclidean distances $(Dv)^T(Dv)$ of the projected data points from the origin is maximized. Setting the gradient of the Lagrangian relaxation $v^T D^T D v - \lambda(\|v\|^2 - 1)$ to 0 is equivalent to the eigenvector condition $D^T D v - \lambda v = 0$. Because the right singular vectors are eigenvectors of $D^T D$, it follows that the eigenvectors (right singular vectors) with the $k$ largest eigenvalues (squared singular values) provide a basis that maximizes the preserved energy in the transformed and reduced data matrix $D_k = D P_k = Q_k \Sigma_k$. Because the energy, which is the sum of squared Euclidean distances from the origin, is invariant to axis rotation, the energy in $D_k$ is the same as that in $D_k P_k^T = Q_k \Sigma_k P_k^T$. Therefore, $k$-rank SVD is a maximum energy-preserving factorization. This result is known as the Eckart–Young theorem.
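As a sanity check on these identities, the following minimal NumPy sketch verifies numerically that $D_k = DP_k = Q_k\Sigma_k$ and that no other $k$-dimensional orthonormal basis preserves more energy than the top-$k$ right singular vectors. The random data matrix, its dimensions, and the variable names are illustrative assumptions, not part of the text.

```python
import numpy as np

# Illustrative setup: a random matrix stands in for the data set D.
rng = np.random.default_rng(0)
n, d, k = 100, 6, 2
D = rng.standard_normal((n, d))

# Full SVD: D = Q Sigma P^T, with right singular vectors as rows of Pt.
Q, sigma, Pt = np.linalg.svd(D, full_matrices=False)

Pk = Pt[:k, :].T            # top-k right singular vectors (basis P_k)
Qk = Q[:, :k]               # top-k left singular vectors Q_k
Sk = np.diag(sigma[:k])     # top-k singular values Sigma_k

# The reduced representation can be computed either way: D_k = D P_k = Q_k Sigma_k.
Dk = D @ Pk
assert np.allclose(Dk, Qk @ Sk)

# Energy preserved by the top-k basis equals the sum of the top-k squared singular values.
preserved = np.sum(Dk ** 2)
assert np.isclose(preserved, np.sum(sigma[:k] ** 2))

# A random k-dimensional orthonormal basis never preserves more energy.
B, _ = np.linalg.qr(rng.standard_normal((d, k)))
assert np.sum((D @ B) ** 2) <= preserved + 1e-9
```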
The total preserved energy of the projection $Dv$ of the data set $D$ along a unit right singular vector $v$ with singular value $\sigma$ is given by $(Dv)^T(Dv)$, which can be simplified as follows:

$$(Dv)^T (Dv) = v^T (D^T D v) = v^T (\sigma^2 v) = \sigma^2$$
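For concreteness, the short sketch below checks this identity for each right singular vector of a small random matrix; the data and variable names are assumptions made for illustration only.

```python
import numpy as np

# Illustrative data; rows of Pt are the unit right singular vectors of D.
rng = np.random.default_rng(1)
D = rng.standard_normal((50, 4))
Q, sigma, Pt = np.linalg.svd(D, full_matrices=False)

# For each unit right singular vector v with singular value s,
# the energy of the projection Dv is (Dv)^T (Dv) = s^2.
for s, v in zip(sigma, Pt):
    energy = (D @ v) @ (D @ v)
    assert np.isclose(energy, s ** 2)
```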
Because the energy is defined as a linearly separable sum along orthonormal directions, the preserved energy in the data projection along the top-$k$ singular vectors is equal to the sum of the squares of the top-$k$ singular values. Note that the total energy in the data set $D$ is always equal to the sum of the squares of all the nonzero singular values. It can be shown that maximizing the preserved energy is the same as minimizing the squared error (or lost energy) of the $k$-rank approximation. This is because the sum of the energy in the preserved subspace and the lost energy in the complementary (discarded) subspace is always a constant, which is equal to the energy in the original data set $D$.
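The following sketch illustrates this decomposition: the total energy splits exactly into the energy preserved by the top-$k$ singular directions and the squared error of the rank-$k$ approximation. Again, the random data and the chosen dimensions are illustrative assumptions.

```python
import numpy as np

# Illustrative data set and rank.
rng = np.random.default_rng(2)
D = rng.standard_normal((80, 5))
k = 2
Q, sigma, Pt = np.linalg.svd(D, full_matrices=False)

total_energy = np.sum(D ** 2)                        # ||D||_F^2
assert np.isclose(total_energy, np.sum(sigma ** 2))  # sum of all squared singular values

# Rank-k approximation Q_k Sigma_k P_k^T of the original data matrix.
Dk_approx = Q[:, :k] @ np.diag(sigma[:k]) @ Pt[:k, :]

preserved = np.sum(sigma[:k] ** 2)                   # energy kept by the top-k basis
lost = np.sum((D - Dk_approx) ** 2)                  # squared error of the approximation
assert np.isclose(preserved + lost, total_energy)    # preserved + lost = total energy
assert np.isclose(lost, np.sum(sigma[k:] ** 2))      # lost energy = discarded sigma_i^2
```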
When viewed purely in terms of eigenvector analysis, SVD provides two different perspectives for understanding the transformed and reduced data. The transformed data matrix can either be viewed as the projection $DP_k$ of the data matrix $D$ on the top-$k$ basis eigenvectors $P_k$ of the $d \times d$ scatter matrix $D^T D$, or it can directly be viewed as the scaled eigenvectors $Q_k\Sigma_k = DP_k$ of the $n \times n$ dot-product similarity matrix $DD^T$. While it is generally computationally expensive to extract the eigenvectors of an $n \times n$ similarity matrix, such an approach also generalizes to nonlinear dimensionality reduction methods where notions of linear basis vectors do not exist in the original space. In such cases, the dot-product similarity matrix is replaced with a more complex similarity matrix in order to extract a nonlinear embedding (cf. Table 2.3).
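The sketch below compares the two views on a small random example: projecting on the top-$k$ eigenvectors of $D^T D$ versus scaling the top-$k$ eigenvectors of $DD^T$ by the corresponding singular values. The setup is an assumption for illustration; the comparison is made up to the inherent sign ambiguity of eigenvectors.

```python
import numpy as np

# Illustrative data set and rank.
rng = np.random.default_rng(3)
D = rng.standard_normal((30, 5))
k = 2
Q, sigma, Pt = np.linalg.svd(D, full_matrices=False)

# View 1: project D on the top-k eigenvectors of the d x d scatter matrix D^T D.
view1 = D @ Pt[:k, :].T

# View 2: scale the top-k eigenvectors of the n x n similarity matrix D D^T
# by the corresponding singular values (square roots of its eigenvalues).
evals, evecs = np.linalg.eigh(D @ D.T)   # eigenvalues in ascending order
order = np.argsort(evals)[::-1][:k]      # indices of the top-k eigenvalues
view2 = evecs[:, order] * np.sqrt(evals[order])

# The two views agree column by column, up to sign.
for j in range(k):
    assert np.allclose(view1[:, j], view2[:, j]) or np.allclose(view1[:, j], -view2[:, j])
```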
SVD is more general than PCA and can be used to simultaneously determine a subset of $k$ basis vectors for the data matrix and its transpose with the maximum energy. The latter can be useful in understanding complementary transformation properties of $D^T$.
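One way to see this simultaneity is that transposing $D = Q\Sigma P^T$ gives $D^T = P\Sigma Q^T$, so a single SVD supplies the top-$k$ basis $Q_k$ for the transpose as well. The brief sketch below (illustrative random data; the setup is an assumption) checks this relationship.

```python
import numpy as np

# Illustrative data; one SVD of D serves both D and D^T.
rng = np.random.default_rng(4)
D = rng.standard_normal((40, 6))
k = 2
Q, sigma, Pt = np.linalg.svd(D, full_matrices=False)

# Transposing D = Q Sigma P^T gives D^T = P Sigma Q^T: the left and right
# singular vectors swap roles, while the singular values are shared.
assert np.allclose(D.T, Pt.T @ np.diag(sigma) @ Q.T)

# Hence the top-k basis for the rows of D^T is Q_k, and the reduced
# representation of D^T is D^T Q_k = P_k Sigma_k.
assert np.allclose(D.T @ Q[:, :k], Pt[:k, :].T @ np.diag(sigma[:k]))
```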