Data Mining: The Textbook




Figure 13.4: Matrix factorization of PLSA. (Figure: the n × d scaled document-term matrix D = [P(Xi, wj)], with the n documents as rows and the d words as columns, is factorized into the n × k matrix Qk = [P(Xi|Gm)], whose columns are the k dominant basis vectors of the inverted lists, a k × k diagonal matrix Σk containing the prior probabilities P(Gm) of the topics Gm, and the k × d matrix PkT = [P(wj|Gm)], whose rows correspond to the k dominant basis vectors of the documents.)












Figure 13.5: An example of PLSA (Revisiting Fig. 6.22 of Chap. 6). (Figure: a toy 6 × 6 document-term matrix D over the words LION, TIGER, CHEETAH, JAGUAR, PORSCHE, and FERRARI is factorized into a document-aspect matrix Qk, a diagonal matrix Σk of aspect priors, and an aspect-word matrix PkT for the two aspects CATS and CARS; the documents X1 to X3 are about cats, X5 and X6 are about cars, and X4 discusses both.)

An important advantage of PLSA over LSA is its probabilistic interpretability. By examining the probability values in each column of Pk, one can immediately infer the topical words of the corresponding aspect. This is not possible in LSA, where the entries of the corresponding matrix Pk have no clear probabilistic significance and may even be negative. Conversely, one advantage of LSA is that the transformation can be interpreted as a rotation of an orthonormal axis system: in LSA, the columns of Pk are a set of orthonormal vectors representing this rotated basis, which is not the case in PLSA. The orthogonality of the basis system in LSA enables straightforward projection of out-of-sample documents (i.e., documents not included in D) onto the new rotated axis system.
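To make the interpretability point concrete, the short Python sketch below (not from the book) prints the highest-probability words of each aspect from a word-topic matrix Pk with entries P(wj|Gm). The matrix, the vocabulary, and all probability values are hypothetical, chosen in the spirit of the cats/cars example of Fig. 13.5.

import numpy as np

def topical_words(P_k, vocabulary, top=3):
    # P_k is assumed to be a d x k matrix with entries P(w_j | G_m),
    # so each column sums to 1 over the vocabulary.
    d, k = P_k.shape
    for m in range(k):
        best = np.argsort(-P_k[:, m])[:top]      # indices of the largest P(w_j | G_m)
        summary = ", ".join(f"{vocabulary[j]} ({P_k[j, m]:.2f})" for j in best)
        print(f"Aspect G{m + 1}: {summary}")

# Hypothetical word-topic probabilities for a cats aspect and a cars aspect.
vocab = ["lion", "tiger", "cheetah", "jaguar", "porsche", "ferrari"]
P_k = np.array([[0.30, 0.00],
                [0.30, 0.00],
                [0.25, 0.00],
                [0.15, 0.20],
                [0.00, 0.40],
                [0.00, 0.40]])
topical_words(P_k, vocab)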


Interestingly, as in SVD/LSA, PLSA also reveals the latent properties of the transpose of the document matrix. Each row of PkΣk can be viewed as the transformed coordinates of the vertical, or inverted list, representation (rows of the transpose) of the document matrix D in the basis space defined by the columns of Qk. These complementary properties are illustrated in Fig. 13.4. PLSA can also be viewed as a kind of nonnegative matrix factorization method (cf. Sect. 6.8 of Chap. 6) in which the matrix elements are interpreted as probabilities, and the likelihood of a generative model is maximized rather than the Frobenius norm of the error matrix being minimized.
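Because the factorization is defined purely in terms of probabilities, it can be estimated with a standard EM procedure. The following is a minimal Python sketch of such an estimation, included only as an illustration and not as the book's algorithm statement; it assumes a small n × d matrix of raw word counts, materializes an n × d × k responsibility array, and is therefore suitable only for toy data. All names are illustrative.

import numpy as np

def plsa(counts, k, iters=100, seed=0):
    # Returns (prior, Q, P) with prior[m] = P(G_m), Q[i, m] = P(X_i | G_m),
    # and P[j, m] = P(w_j | G_m), so counts / counts.sum() is approximated
    # by Q @ np.diag(prior) @ P.T, mirroring the factorization of Fig. 13.4.
    rng = np.random.default_rng(seed)
    n, d = counts.shape
    prior = np.full(k, 1.0 / k)
    Q = rng.random((n, k))
    Q /= Q.sum(axis=0)                      # each column sums to 1 over documents
    P = rng.random((d, k))
    P /= P.sum(axis=0)                      # each column sums to 1 over words
    for _ in range(iters):
        # E-step: responsibility of each aspect for every (document, word) pair,
        # proportional to P(G_m) * P(X_i | G_m) * P(w_j | G_m).
        resp = prior[None, None, :] * Q[:, None, :] * P[None, :, :]
        resp /= resp.sum(axis=2, keepdims=True) + 1e-12
        # M-step: re-estimate the parameters from the expected counts.
        expected = counts[:, :, None] * resp
        Q = expected.sum(axis=1)
        Q /= Q.sum(axis=0, keepdims=True)
        P = expected.sum(axis=0)
        P /= P.sum(axis=0, keepdims=True)
        prior = expected.sum(axis=(0, 1))
        prior /= prior.sum()
    return prior, Q, P

# Toy 6 x 6 count matrix in the spirit of Fig. 13.5 (the values are made up).
counts = np.array([[2, 2, 1, 2, 0, 0],
                   [2, 3, 3, 3, 0, 0],
                   [1, 1, 1, 1, 0, 0],
                   [2, 2, 2, 3, 1, 1],
                   [0, 0, 0, 1, 1, 1],
                   [0, 0, 0, 2, 1, 2]], dtype=float)
prior, Q, P = plsa(counts, k=2)

In this sketch, prior, Q, and P play the roles of the diagonal of Σk, of Qk, and of Pk, respectively; maximizing the likelihood in this way is what replaces the Frobenius-norm objective of NMF.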


An approximately optimal PLSA matrix factorization of a toy 6 × 6 example, with 6 documents and 6 words, is illustrated in Fig. 13.5. This example is the same (see Fig. 6.22) as the one used for nonnegative matrix factorization (NMF) in Chap. 6. Note that the factorizations in the two cases are very similar, except that all basis vectors are normalized to sum to 1 in PLSA, and the dominance of the basis vectors is reflected in a separate diagonal matrix containing the prior probabilities. Although the factorization presented here for PLSA is shown as identical to that of NMF for ease of intuitive understanding, the two factorizations will usually be slightly different because of the difference in objective functions in the two cases; for this reason, the factorization presented here is approximately, but not exactly, optimal. Also, most of the entries in the factorized matrices will not be exactly 0 in a real example, although many of them might be quite small.


As in LSA, the problems of synonymy and polysemy are addressed by PLSA. For example, if an aspect G1 explains the topic of cats, then two documents X and Y containing the words "cat" and "kitten," respectively, will both have positive values of the transformed coordinate for aspect G1. Therefore, similarity computations between these documents will be improved in the transformed space. A word with multiple meanings (a polysemous word) may have positive components in different aspects. For example, a word such as "jaguar" can refer either to a cat or to a car. If G1 is an aspect that explains the topic of cats, and G2 is an aspect that explains the topic of cars, then both P("jaguar"|G1) and P("jaguar"|G2) may be highly positive. However, the other words in a document provide the context necessary to reinforce one of these two aspects. A document X that is mostly about cats will have a high value of P(X|G1), whereas a document Y that is mostly about cars will have a high value of P(Y|G2). This is reflected in the matrix Qk = [P(Xi|Gm)]n×k and the new transformed coordinate representation QkΣk. Therefore, the computations will also be robust in terms of adjusting for polysemy effects. In general, semantic concepts are amplified in the transformed representation QkΣk. Therefore, many data mining applications will perform more robustly on the n × k transformed representation QkΣk than on the original n × d document-term matrix.
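As a small illustration of the synonymy effect, the hypothetical Python fragment below compares the cosine similarity of two cat-related documents in the original word space, where they share no terms, and in an assumed transformed space QkΣk, where both load on the same aspect. All vectors and probability values are made up for the example.

import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Raw bag-of-words rows for two documents that use disjoint synonyms.
X = np.array([1.0, 0.0, 0.0])      # uses "cat"
Y = np.array([0.0, 1.0, 0.0])      # uses "kitten"
print(cosine(X, Y))                # 0.0 in the original space

# Assumed PLSA coordinates: rows of Q_k are P(X_i | G_m), and Sigma_k
# holds the aspect priors P(G_m) for a cats aspect and a cars aspect.
Q_k = np.array([[0.5, 0.0],
                [0.5, 0.0]])
Sigma_k = np.diag([0.6, 0.4])
transformed = Q_k @ Sigma_k        # the new coordinates Q_k Sigma_k
print(cosine(transformed[0], transformed[1]))   # 1.0: the synonymy is resolved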


13.4.2 Use in Clustering and Comparison with Probabilistic Clustering


The estimated parameters have intuitive interpretations in terms of clustering. In the Bayes model for clustering (Fig. 13.3a), the generative process is optimized for clustering documents, whereas the generative process in topic modeling (Fig. 13.3b) is optimized for discovering the latent semantic components. The latter can be shown to cluster document–word pairs, which is different from clustering documents. Therefore, although the same parameter sets P(wj|Gm) and P(Xi|Gm) are estimated in the two cases, qualitatively different results will be obtained. The model of Fig. 13.3a generates a document from a single hidden component (cluster), and the final soft clustering is a result of uncertainty in estimation from the observed data. On the other hand, in the probabilistic latent semantic model, different parts of the same document may be generated by different aspects, even at the generative modeling level. Thus, documents are not generated by individual mixture components, but by a combination of mixture components. In this sense, PLSA provides a more realistic model, because the diverse words of an unusual document discussing both cats and cars (see Fig. 13.5) can be generated by distinct aspects. In Bayes clustering, even though such a document is generated in its entirety by one of the mixture components, it may have similar assignment (posterior) probabilities with respect to two or more clusters because of estimation uncertainty. This difference arises because PLSA was originally intended as a data transformation and dimensionality reduction method, rather than as a clustering method. Nevertheless, good document clusters can usually be derived from PLSA as well. The value P(Gm|Xi) provides an assignment probability of the document Xi to the aspect (or "cluster") Gm and can be derived from the parameters estimated in the M-step using the Bayes rule as follows:
P(Gm|Xi) = [P(Gm) · P(Xi|Gm)] / [Σ(r=1 to k) P(Gr) · P(Xi|Gr)]          (13.16)

Thus, the PLSA approach can also be viewed as a soft clustering method that provides assignment probabilities of documents to clusters. In addition, the quantity P(wj|Gm), which is estimated in the M-step, provides information about the probabilistic affinity of different words to aspects (or topics). The terms with the highest probability values for a specific aspect Gm can be viewed as a cluster digest for that topic.
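A minimal sketch of this soft assignment, assuming a prior vector with entries P(Gm) and a matrix Q with entries P(Xi|Gm) from a fitted PLSA model (for example, the EM sketch given earlier), is shown below; the numerical values are hypothetical.

import numpy as np

def document_posteriors(prior, Q):
    # Apply Eq. 13.16 row by row: P(G_m | X_i) is proportional to P(G_m) * P(X_i | G_m).
    unnormalized = Q * prior[None, :]
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)

prior = np.array([0.6, 0.4])
Q = np.array([[0.5, 0.1],      # document mostly about the first aspect
              [0.3, 0.3],      # document related to both aspects
              [0.2, 0.6]])     # document mostly about the second aspect
print(document_posteriors(prior, Q))

Each row of the result sums to 1 and can be read directly as the soft cluster membership of the corresponding document.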


As the PLSA approach also provides a multidimensional n × k coordinate representation QkΣk of the documents, a different way of performing the clustering is to represent the documents in this new space and apply a k-means algorithm to the transformed corpus. Because the noise effects of synonymy and polysemy are reduced by PLSA, the k-means approach will generally be more effective on the reduced representation than on the original corpus.
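A brief sketch of this two-stage approach, using hypothetical reduced coordinates QkΣk and the k-means implementation from scikit-learn, is given below; the coordinate values are illustrative only.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical reduced n x k coordinates Q_k Sigma_k for six documents and
# two aspects, loosely in the spirit of Fig. 13.5.
reduced = np.array([[0.20, 0.00],
                    [0.18, 0.00],
                    [0.16, 0.01],
                    [0.10, 0.10],
                    [0.00, 0.19],
                    [0.00, 0.20]])

# Cluster in the k-dimensional topic space rather than the original
# d-dimensional word space.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
print(labels)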


13.4.3 Limitations of PLSA

Although PLSA is an intuitively sound probabilistic model, it has a number of practical drawbacks. The number of parameters grows linearly with the number of documents. Therefore, the approach can be slow and may overfit the training data because of the large number of estimated parameters. Furthermore, while PLSA provides a generative model of document–word pairs in the training data, it cannot easily assign probabilities to previously unseen documents. Most of the other EM mixture models discussed in this book, such as the probabilistic Bayes model, are much better at assigning probabilities to previously unseen documents. To address these issues, Latent Dirichlet Allocation (LDA) was developed. This model uses Dirichlet priors on the topics and generalizes relatively easily to new documents. In this sense, LDA is a fully generative model. The bibliographic notes contain pointers to this model.
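As a point of comparison, the scikit-learn sketch below fits an LDA model on a toy count matrix and then assigns topic proportions to a previously unseen document, which is exactly the step that is awkward for PLSA; the data values are made up for the illustration.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-term count matrix (rows are documents, columns are words).
counts = np.array([[3, 2, 0, 0],
                   [2, 3, 1, 0],
                   [0, 0, 3, 2],
                   [0, 1, 2, 3]])

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# A fitted LDA model can infer topic proportions for a new document
# without re-estimating document-specific parameters.
new_doc = np.array([[1, 2, 0, 0]])
print(lda.transform(new_doc))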


13.5 Specialized Classification Methods for Text

As in clustering, classification algorithms are affected by the nonnegative, sparse, and high-dimensional nature of text data. An important effect of sparsity is that the presence of a word in a document is more informative than the absence of the word. This observation has implications for classification methods, such as the Bernoulli model used for Bayes classification, that treat the presence and absence of a word in a symmetric way.


Popular techniques in the text domain include instance-based methods, the Bayes classifier, and the SVM classifier. The Bayes classifier is very popular because Web text is often combined with other types of features, such as URLs or side information, and it is relatively easy to incorporate these features into the Bayes classifier. The sparse, high-dimensional nature of text also necessitates the design of more refined multinomial Bayes models for the text domain. SVM classifiers are also extremely popular for text data because of their high accuracy. The major issue with the use of the SVM classifier is that the high-dimensional nature of text necessitates performance enhancements to such classifiers. In the following, some of these algorithms will be discussed.
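As a minimal illustration of two of these options, the scikit-learn sketch below trains a multinomial Bayes classifier on raw term counts and a linear SVM on tf-idf features; the tiny corpus and its labels are invented for the example and are not from the book.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["the lion and the tiger are big cats",
         "a cheetah is the fastest cat",
         "the porsche and the ferrari are fast cars",
         "a jaguar is a luxury car"]
labels = ["cats", "cats", "cars", "cars"]

# Multinomial Bayes on raw counts: only the words present in a document
# contribute to its class-conditional probability.
nb = make_pipeline(CountVectorizer(), MultinomialNB()).fit(texts, labels)

# Linear SVM on tf-idf features, a common high-accuracy choice for text.
svm = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(texts, labels)

print(nb.predict(["a tiger chased the cheetah"]))
print(svm.predict(["the jaguar is a fast car"]))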



