Data Mining: The Textbook




P(w_j \mid G_m) = \frac{\sum_{X} P(G_m \mid X) \cdot I(X, w_j)}{\sum_{X} P(G_m \mid X)}        (13.5)
Here, I(X, wj) is an indicator variable that takes the value 1 if the word wj is present in X, and 0 otherwise. As in the Bayes classification method, the same Laplacian smoothing approach may be incorporated to reduce overfitting. The prior probability P (Gm) of each cluster may also be estimated by computing the average assignment probability of all documents to Gm. This completes the description of the M-step of the EM algorithm. The next E-step uses these modified values of P (wj |Gm) and the prior probabilities to derive the posterior Bayes probability with a standard Bayes classifier. Therefore, the following two iterative steps are repeated to convergence:



  1. (E-step) Estimate the posterior probability of membership of each document in each cluster with the Bayes rule:




P(G_m \mid X) \propto P(G_m) \prod_{w_j \in X} P(w_j \mid G_m) \prod_{w_j \notin X} \left(1 - P(w_j \mid G_m)\right)        (13.6)

The aforementioned Bayes rule assumes a Bernoulli generative model. Note that Eq. 13.6 is identical to naive Bayes posterior probability estimation for classification. The multinomial model, which is discussed later in this chapter, may also be used. In such a case, the aforementioned posterior probability definition of Eq. 13.6 is replaced by the multinomial Bayes classifier.





  2. (M-step) Estimate the conditional distribution P (wj |Gm) of words (Eq. 13.5) and the prior probabilities P (Gm) of the different clusters using the posterior probabilities estimated in the E-step. A compact code sketch of the full E-step/M-step loop appears below.

At the end of the process, the estimated value of P (Gm|X) provides a cluster assignment probability and the estimated value of P (wj |Gm) provides the term distribution of each cluster. This can be viewed as a probabilistic variant of the notion of cluster digest discussed earlier. Therefore, the probabilistic method provides dual insights about cluster membership and the words relevant to each cluster.
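The full procedure can be written as a compact EM loop. The following is a minimal sketch of the Bernoulli-model EM clustering described above, under assumed names (D_binary for the 0/1 document-term matrix, k for the number of clusters, alpha for the Laplacian smoothing parameter); working in log space in the E-step is an implementation choice made to avoid numerical underflow and is not part of the textbook's description.

```python
import numpy as np

def bernoulli_em_cluster(D_binary, k, n_iters=50, alpha=1.0, seed=0):
    """EM clustering of documents under the Bernoulli model (Eqs. 13.5 and 13.6).

    D_binary : (n, d) 0/1 document-term matrix; entry (i, j) equals I(X_i, w_j)
    k        : number of clusters G_1 ... G_k (assumed to be chosen by the user)
    Returns (post, word_probs, priors), where post[i, m] estimates P(G_m | X_i).
    """
    rng = np.random.default_rng(seed)
    n, d = D_binary.shape

    # Soft random initialization of the posterior assignment matrix
    post = rng.random((n, k))
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # M-step (Eq. 13.5): smoothed word probabilities and cluster priors
        word_probs = (post.T @ D_binary + alpha) / (post.sum(axis=0)[:, None] + 2 * alpha)
        priors = post.mean(axis=0)   # average assignment probability per cluster

        # E-step (Eq. 13.6): Bernoulli posterior, computed in log space for stability
        log_p = np.log(word_probs)          # log P(w_j | G_m)
        log_q = np.log(1.0 - word_probs)    # log (1 - P(w_j | G_m))
        log_post = (np.log(priors)[None, :]
                    + D_binary @ log_p.T          # words present in X
                    + (1 - D_binary) @ log_q.T)   # words absent from X
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

    return post, word_probs, priors
```

The returned post matrix corresponds to the soft cluster assignments P (Gm|X), and word_probs gives the probabilistic cluster digest P (wj |Gm) of each cluster.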




13.3.3 Simultaneous Document and Word Cluster Discovery


The probabilistic algorithm discussed in the previous section can simultaneously discover document and word clusters. As discussed in Sect. 7.4 of Chap. 7 on high-dimensional clustering methods, this is important in the high-dimensional case because clusters are best characterized in terms of both rows and columns simultaneously. In the text domain, the additional advantage of such methods is that the topical words of a cluster provide semantic insights about that cluster. Another example is the nonnegative matrix factorization method, discussed in Sect. 6.8 of Chap. 6. This approach is very popular in the text domain because the factorized matrices have a natural interpretation for text data. It can simultaneously discover word clusters and document clusters, which are represented by the columns of the two factorized matrices. This is also closely related to the concept of co-clustering.
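As a concrete illustration of the factorization view, the following sketch applies scikit-learn's NMF implementation to a small document-term matrix and reads off the dominant factor of each document (a document clustering) together with the top-weighted words of each factor (a word clustering, i.e., a cluster digest). The toy corpus, vectorizer settings, number of factors, and the "assign to dominant factor" heuristic are illustrative assumptions rather than prescriptions from the text.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpus (assumed for demonstration purposes only)
docs = [
    "stock market trading prices rose",
    "investors traded stocks on the market",
    "the team won the football match",
    "football players scored in the match",
]

vectorizer = TfidfVectorizer()
D = vectorizer.fit_transform(docs)          # n x d document-term matrix

k = 2                                       # number of clusters/factors (assumed)
nmf = NMF(n_components=k, init="nndsvd", random_state=0)
U = nmf.fit_transform(D)                    # n x k: document-to-factor weights
V = nmf.components_                         # k x d: factor-to-word weights

doc_clusters = U.argmax(axis=1)             # dominant factor = document cluster
terms = vectorizer.get_feature_names_out()
for m in range(k):
    top_words = [terms[j] for j in V[m].argsort()[::-1][:3]]
    print(f"Cluster {m}: docs {np.where(doc_clusters == m)[0]}, top words {top_words}")
```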

13.3.3.1 Co-clustering


Co-clustering is most effective for nonnegative matrices in which many entries have zero values. In other words, the matrix is sparsely populated. This is the case for text data. Co-clustering methods can also be generalized to dense matrices, although these techniques are not relevant to the text domain. Co-clustering is also sometimes referred to as bi-clustering or two-mode clustering because of its exploitation of both “modes” (words and documents). While the co-clustering method is presented here in the context of text data, the broader approach is also used in the biological domain with some modifications.


The idea in co-clustering is to rearrange the rows and columns in the data matrix so that most of the nonzero entries become arranged into blocks. In the context of text data, this matrix is the n × d document-term matrix D, where rows correspond to documents and columns correspond to words. Thus, the ith cluster is associated with a set of rows Ri (documents) and a set of columns Vi (words). The rows Ri are disjoint from one another over different values of i, and the columns Vi are also disjoint from one another over different values of i. Thus, the co-clustering method simultaneously leads to document clusters and word clusters. From an intuitive perspective, the words representing the columns of Vi are the most relevant (or topical) words for cluster Ri. The set Vi therefore defines a cluster digest of Ri.
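To make the block structure concrete, the following toy sketch (with an assumed matrix and assumed cluster labels) permutes the rows and columns of a small 0/1 document-term matrix by their cluster labels; the nonzero entries then line up in blocks, with the rows Ri and columns Vi of the same cluster forming one block.

```python
import numpy as np

# A tiny 0/1 document-term matrix with two hidden co-clusters (assumed example):
# documents 0 and 2 use words 1 and 3; documents 1 and 3 use words 0 and 2.
D = np.array([
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
])

# Hypothetical co-clustering output: row labels define R_i, column labels define V_i
row_labels = np.array([0, 1, 0, 1])   # cluster of each document
col_labels = np.array([1, 0, 1, 0])   # cluster of each word

# Rearranging rows and columns by cluster label exposes the block structure
reordered = D[np.argsort(row_labels), :][:, np.argsort(col_labels)]
print(reordered)
# [[1 1 0 0]
#  [1 1 0 0]
#  [0 0 1 1]
#  [0 0 1 1]]
```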


In the context of text data, word clusters are just as important as document clusters because they provide insights about the topics of the underlying collection. Most of the methods discussed in this book for document clustering, such as the scatter/gather method, probabilistic methods, and nonnegative matrix factorization (see Sect. 6.8 of Chap. 6), produce word clusters (or cluster digests) in addition to document clusters. However, the word clusters produced by these algorithms may overlap, whereas the document clusters are nonoverlapping in all of these algorithms except for the probabilistic (soft) EM method. In co-clustering, the word clusters and document clusters are both nonoverlapping. Each document and word is strictly associated with a particular cluster. One nice characteristic of co-clustering is that it explicitly explores the duality between word clusters and document clusters. Coherent word clusters can be shown to induce coherent document clusters, and vice versa. For example, if meaningful word clusters were already available, then one might be able to cluster documents by assigning each document to the word cluster with which it has the most words in common. In co-clustering, the goal is to do this simultaneously, so that word clusters and document clusters depend on each other in an optimal way.
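One concrete co-clustering algorithm for sparse nonnegative matrices is the spectral co-clustering method of Dhillon, which is available in scikit-learn as SpectralCoclustering. The sketch below uses it purely to illustrate nonoverlapping row (document) and column (word) clusters; it is not necessarily the exact method developed in the remainder of this section, and the toy corpus and parameter choices are assumptions.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering
from sklearn.feature_extraction.text import CountVectorizer

# Tiny illustrative corpus (assumed); rows = documents, columns = words
docs = [
    "stock market trading prices",
    "investors traded stocks market",
    "football team won match",
    "football players scored match",
]
vectorizer = CountVectorizer()
D = vectorizer.fit_transform(docs)      # sparse n x d document-term matrix

model = SpectralCoclustering(n_clusters=2, random_state=0)
model.fit(D)

terms = vectorizer.get_feature_names_out()
for m in range(2):
    doc_ids = np.where(model.row_labels_ == m)[0]      # rows R_m (documents)
    words = terms[model.column_labels_ == m]           # columns V_m (words)
    print(f"Co-cluster {m}: documents {doc_ids.tolist()}, words {words.tolist()}")
```

Each document and each word receives exactly one label, which matches the nonoverlapping property of co-clustering described above.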





