Data Mining: The Textbook
LSA is a classical example of how the "loss" of information from discarding some dimensions can actually result in an improvement in the quality of the data representation. The text domain suffers from two main problems corresponding to synonymy and polysemy. Synonymy refers to the fact that two words may have the same meaning. For example, the words "comical" and "hilarious" mean approximately the same thing. Polysemy refers to the fact that the same word may mean two different things. For example, the word "jaguar" could refer to a car or a cat. Typically, the significance of a word can only be understood in the context of other words in the document. This is a problem for similarity-based applications because the computation of similarity with the use of word frequencies may not be completely accurate. For example, two documents containing the words "comical" and "hilarious," respectively, may not be deemed sufficiently similar in the original representation space. The two aforementioned issues are a direct result of synonymy and polysemy effects. The truncated representation after LSA typically removes the noise effects of synonymy and polysemy because the (high-energy) singular vectors represent the directions of correlation in the data, and the appropriate context of the word is implicitly represented along these directions. The variations because of individual differences in usage are implicitly encoded in the low-energy directions, which are truncated anyway. It has been observed that significant qualitative improvements [184, 416] for text applications may be achieved with the use of LSA. The improvement is generally greater in terms of synonymy effects than polysemy, because concepts that are not predominantly present in the collection are ignored by truncation; alternative meanings reflecting infrequent concepts are therefore lost, and while this has a robust effect on the average, it may not always yield the correct or complete disambiguation of polysemous words. This noise-removing behavior of SVD has also been demonstrated in general multidimensional data sets [25].
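As a concrete illustration, the following minimal sketch applies rank truncation to a hypothetical toy term-document matrix (the word counts and matrix are illustrative assumptions, not from the text). Two documents that use the synonyms "comical" and "hilarious" look dissimilar under raw word frequencies, but become nearly identical in the truncated concept space:

    import numpy as np

    # Hypothetical toy term-document matrix (rows: terms, columns: documents).
    # Documents 0 and 1 use the synonyms "comical" and "hilarious" respectively;
    # document 2 is about the car brand "jaguar".
    D = np.array([[2., 0., 0.],   # comical
                  [0., 2., 0.],   # hilarious
                  [1., 1., 0.],   # movie
                  [0., 0., 3.]])  # jaguar

    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    # In the original space, documents 0 and 1 share only the word "movie".
    print(cosine(D[:, 0], D[:, 1]))            # 0.2

    # Rank-2 LSA: keep the two highest-energy singular vectors.
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    k = 2
    docs_k = np.diag(s[:k]) @ Vt[:k]           # documents in the 2-d concept space

    # The truncated low-energy direction separated the two synonym usages;
    # with it gone, the two documents align almost perfectly.
    print(cosine(docs_k[:, 0], docs_k[:, 1]))  # approximately 1.0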
2.4.3.4 Applications of PCA and SVD

Although PCA and SVD are primarily used for data reduction and compression, they have many other applications in data mining. Some examples are as follows:





  1. Noise reduction: While removal of the smaller eigenvectors/singular vectors in PCA and SVD can lead to information loss, it can also lead to improvement in the quality of data representation in surprisingly many cases. The main reason is that the variations along the small eigenvectors are often the result of noise, and their removal is generally beneficial. An example is the application of LSA in the text domain, where the removal of the smaller components leads to the enhancement of the semantic characteristics of text. SVD is also used for deblurring noisy images. These text- and image-specific results have also been shown to hold in arbitrary data domains [25]. Therefore, the data reduction is not just space efficient but actually provides qualitative benefits in many cases (see the first sketch after this list).










  2. Data imputation: SVD and PCA can be used for data imputation applications [23], such as collaborative filtering, because the reduced matrices Q_k, Σ_k, and P_k can be estimated for small values of k even from incomplete data matrices. Therefore, the entire matrix can be approximately reconstructed as Q_k Σ_k P_k^T (see the second sketch after this list). This application is discussed in Sect. 18.5 of Chap. 18.
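The following minimal sketch illustrates the noise-reduction point on synthetic data (the dimensions, noise level, and variable names are illustrative assumptions): the truncated SVD reconstruction of a noisy low-rank matrix is closer to the underlying ground truth than the noisy matrix itself.

    import numpy as np

    rng = np.random.default_rng(0)

    # Ground truth: a rank-3 matrix of 200 points in 20 dimensions.
    truth = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 20))
    noisy = truth + 0.1 * rng.standard_normal(truth.shape)

    # Keep only the top k singular vectors of the noisy matrix.
    U, s, Vt = np.linalg.svd(noisy, full_matrices=False)
    k = 3
    denoised = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

    # The discarded low-energy directions carried mostly noise, so the
    # truncated reconstruction is closer to the ground truth.
    print(np.linalg.norm(noisy - truth))     # error before truncation
    print(np.linalg.norm(denoised - truth))  # smaller error after truncation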
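And a minimal sketch of the imputation idea, assuming a simple iterate-and-truncate scheme (an illustrative variant, not necessarily the exact algorithm of [23]): missing entries are repeatedly refreshed from the current rank-k reconstruction while observed entries stay fixed.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical low-rank "ratings" matrix with 30% of entries missing.
    truth = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 40))
    observed = rng.random(truth.shape) < 0.7   # True where an entry is known
    X = np.where(observed, truth, 0.0)         # missing entries start at 0

    k = 2
    filled = X.copy()
    for _ in range(50):
        # Truncated SVD of the current completed matrix: Q_k Sigma_k P_k^T.
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
        # Keep the observed entries fixed; refresh only the missing ones.
        filled = np.where(observed, X, approx)

    # Mean absolute error on the held-out (missing) entries.
    print(np.abs((filled - truth)[~observed]).mean())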




