Data Mining: The Textbook




2.4. DATA REDUCTION AND TRANSFORMATION


The orthonormal columns of Q_k provide a k-dimensional basis system for (approximately) transforming "data points" corresponding to the rows of D^T, and the matrix D^T Q_k = P_k Σ_k contains the corresponding coordinates. For example, in a user-item ratings matrix, one may wish to determine either a reduced representation of the users or a reduced representation of the items. SVD provides the basis vectors for both reductions. Truncated SVD expresses the data in terms of k dominant latent components. The ith latent component is expressed in terms of the ith basis vectors of both D and D^T, and its relative importance in the data is defined by the ith singular value σ_i. By decomposing the matrix product Q_k Σ_k P_k^T into the column vectors of Q_k and P_k (i.e., the dominant basis vectors of D^T and D, respectively), the following additive sum of the k latent components is obtained:





\[
Q_k \Sigma_k P_k^T \;=\; \sum_{i=1}^{k} \sigma_i \, q_i \, p_i^T \;=\; \sum_{i=1}^{k} \sigma_i \left( q_i \, p_i^T \right) \qquad (2.13)
\]
Here q_i is the ith column of Q, p_i is the ith column of P, and σ_i is the ith diagonal entry of Σ. Each latent component σ_i (q_i p_i^T) is an n × d matrix with rank 1 and energy σ_i^2. This decomposition is referred to as spectral decomposition. The relationships of the reduced basis vectors to the SVD matrix factorization are illustrated in Fig. 2.4.
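
As an illustration of how the truncated factors and the spectral decomposition relate to one another, the following Python sketch (assuming numpy is available, and using a small random matrix as a stand-in for D) computes a rank-k SVD and checks that the sum of the k rank-1 latent components equals Q_k Σ_k P_k^T:

import numpy as np

# Stand-in data matrix D with n = 6 rows and d = 6 columns.
rng = np.random.default_rng(0)
D = rng.random((6, 6))
k = 2  # number of retained latent components

# numpy returns D = Q diag(sigma) P^T, with P^T given directly as the third factor.
Q, sigma, PT = np.linalg.svd(D, full_matrices=False)
Qk, Sigmak, PkT = Q[:, :k], np.diag(sigma[:k]), PT[:k, :]

# Rank-k approximation as a single matrix product ...
Dk = Qk @ Sigmak @ PkT

# ... and as the sum of k rank-1 latent components sigma_i * (q_i p_i^T), as in Eq. 2.13.
Dk_spectral = sum(sigma[i] * np.outer(Q[:, i], PT[i, :]) for i in range(k))
assert np.allclose(Dk, Dk_spectral)

# Reduced coordinates of the rows of D and of D^T (e.g., users and items).
row_coords = D @ PkT.T    # equals Qk @ Sigmak
col_coords = D.T @ Qk     # equals PkT.T @ Sigmak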


An example of a rank-2 truncated SVD of a toy 6 × 6 matrix is illustrated below:
































\[
D \;=\;
\begin{pmatrix}
2 & 2 & 1 & 2 & 0 & 0\\
2 & 3 & 3 & 3 & 0 & 0\\
1 & 1 & 1 & 1 & 0 & 0\\
2 & 2 & 2 & 3 & 1 & 1\\
0 & 0 & 0 & 1 & 1 & 1\\
0 & 0 & 0 & 2 & 1 & 2
\end{pmatrix}
\;\approx\; Q \, \Sigma \, P^T \;=\;
\begin{pmatrix}
0.41 & 0.17\\
0.65 & 0.31\\
0.23 & 0.13\\
0.56 & -0.20\\
0.10 & -0.46\\
0.19 & -0.78
\end{pmatrix}
\begin{pmatrix}
8.4 & 0\\
0 & 3.3
\end{pmatrix}
\begin{pmatrix}
0.41 & 0.49 & 0.44 & 0.61 & 0.10 & 0.12\\
0.21 & 0.31 & 0.26 & -0.37 & -0.44 & -0.68
\end{pmatrix}
\]
\[
=\;
\begin{pmatrix}
1.55 & 1.87 & \underline{1.67} & 1.91 & 0.10 & 0.04\\
2.46 & 2.98 & 2.66 & 2.95 & 0.10 & 0.03\\
0.89 & 1.08 & 0.96 & 1.04 & 0.01 & 0.04\\
1.81 & 2.11 & 1.91 & 3.14 & 0.77 & 1.03\\
0.02 & 0.05 & 0.02 & 1.06 & 0.74 & 1.11\\
0.10 & 0.02 & 0.04 & 1.89 & 1.28 & 1.92
\end{pmatrix}
\]
Note that the rank-2 matrix is a good approximation of the original matrix. The entry with the largest error is underlined in the final approximated matrix. Interestingly, this entry is also inconsistent with the structure of the remaining matrix in the original data (why?). Truncated SVD often tries to correct inconsistent entries, and this property is sometimes leveraged for noise reduction in error-prone data sets.
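
The toy example can be reproduced, up to rounding and up to the sign convention of the singular vectors, with a few lines of numpy. This is a minimal sketch rather than part of the original example; a given SVD routine may flip the signs of individual singular-vector pairs, but the rank-2 approximation itself is unaffected:

import numpy as np

# The 6 x 6 user-item matrix from the example above.
D = np.array([[2, 2, 1, 2, 0, 0],
              [2, 3, 3, 3, 0, 0],
              [1, 1, 1, 1, 0, 0],
              [2, 2, 2, 3, 1, 1],
              [0, 0, 0, 1, 1, 1],
              [0, 0, 0, 2, 1, 2]], dtype=float)

Q, sigma, PT = np.linalg.svd(D, full_matrices=False)
D2 = Q[:, :2] @ np.diag(sigma[:2]) @ PT[:2, :]   # rank-2 approximation

print(np.round(sigma[:2], 1))   # two dominant singular values (roughly 8.4 and 3.3)
print(np.round(D2, 2))          # compare with the approximated matrix shown above
# Index of the entry with the largest absolute error (row 1, column 3 of D).
print(np.unravel_index(np.argmax(np.abs(D - D2)), D.shape))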


2.4.3.3 Latent Semantic Analysis


Latent semantic analysis (LSA) is an application of the SVD method to the text domain. In this case, the data matrix D is an n × d document-term matrix containing normalized word frequencies in the n documents, where d is the size of the lexicon. No mean centering is used, but the results are approximately the same as PCA because of the sparsity of D. The sparsity of D implies that most of the entries in D are 0, and the mean values of each column are much smaller than the nonzero values. In such scenarios, it can be shown that the covariance matrix is approximately proportional to D^T D. The sparsity of the data set also results in a low intrinsic dimensionality. Therefore, in the text domain, the reduction in dimensionality from LSA is rather drastic. For example, it is not uncommon to be able to represent a corpus drawn on a lexicon of 100,000 dimensions in fewer than 300 dimensions.
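
A small, purely illustrative sketch of LSA using scikit-learn is shown below; the toy corpus, the tf-idf weighting, and the choice of two components are assumptions made for the example rather than requirements of the method:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Tiny corpus standing in for a large document collection.
docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "stock markets fell sharply",
        "investors sold shares as markets fell"]

# Sparse n x d document-term matrix with normalized word frequencies (no mean-centering).
vectorizer = TfidfVectorizer()
D = vectorizer.fit_transform(docs)

# LSA: truncated SVD of the document-term matrix, here with k = 2 latent components.
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_coords = lsa.fit_transform(D)    # n x k reduced representation of the documents

print(doc_coords.shape)              # (4, 2)
print(lsa.components_.shape)         # (2, d): latent basis vectors over the lexicon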




