Data Mining: The Textbook




P(w_j \mid G_m) = \frac{\sum_{X} P(G_m \mid X) \cdot I(X, w_j)}{\sum_{X} P(G_m \mid X)}        (13.5)
Here, I(X, wj) is an indicator variable that takes the value 1 if the word wj is present in X, and 0 otherwise. As in the Bayes classification method, the same Laplacian smoothing approach may be incorporated to reduce overfitting. The prior probability P (Gm) of each cluster may also be estimated by computing the average assignment probability of all documents to Gm. This completes the description of the M-step of the EM algorithm. The next E-step uses these modified values of P (wj |Gm) and the prior probabilities to derive the posterior Bayes probability with a standard Bayes classifier. Therefore, the following two iterative steps are repeated to convergence:



  1. (E-step) Estimate the posterior probability of membership of each document in each cluster with the Bayes rule:




P(G_m \mid X) \propto P(G_m) \prod_{w_j \in X} P(w_j \mid G_m) \prod_{w_j \notin X} \left(1 - P(w_j \mid G_m)\right)        (13.6)

The aforementioned Bayes rule assumes a Bernoulli generative model. Note that Eq. 13.6 is identical to naive Bayes posterior probability estimation for classification. The multinomial model, which is discussed later in this chapter, may also be used. In such a case, the aforementioned posterior probability definition of Eq. 13.6 is replaced by the multinomial Bayes classifier.





  2. (M-step) Estimate the conditional distribution P (wj |Gm) of words (Eq. 13.5) and the prior probabilities P (Gm) of the different clusters using the posterior probabilities estimated in the E-step. A compact code sketch of the full E-step/M-step loop appears below.

At the end of the process, the estimated value of P (Gm|X) provides a cluster assignment probability and the estimated value of P (wj |Gm) provides the term distribution of each cluster. This can be viewed as a probabilistic variant of the notion of cluster digest discussed earlier. Therefore, the probabilistic method provides dual insights about cluster membership and the words relevant to each cluster.
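The full procedure can be written as a compact EM loop. The following is a minimal sketch of the Bernoulli-model EM clustering described above, under assumed names (D_binary for the 0/1 document-term matrix, k for the number of clusters, alpha for the Laplacian smoothing parameter); working in log space in the E-step is an implementation choice made to avoid numerical underflow and is not part of the textbook's description.

```python
import numpy as np

def bernoulli_em_cluster(D_binary, k, n_iters=50, alpha=1.0, seed=0):
    """EM clustering of documents under the Bernoulli model (Eqs. 13.5 and 13.6).

    D_binary : (n, d) 0/1 document-term matrix; entry (i, j) equals I(X_i, w_j)
    k        : number of clusters G_1 ... G_k (assumed to be chosen by the user)
    Returns (post, word_probs, priors), where post[i, m] estimates P(G_m | X_i).
    """
    rng = np.random.default_rng(seed)
    n, d = D_binary.shape

    # Soft random initialization of the posterior assignment matrix
    post = rng.random((n, k))
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # M-step (Eq. 13.5): smoothed word probabilities and cluster priors
        word_probs = (post.T @ D_binary + alpha) / (post.sum(axis=0)[:, None] + 2 * alpha)
        priors = post.mean(axis=0)   # average assignment probability per cluster

        # E-step (Eq. 13.6): Bernoulli posterior, computed in log space for stability
        log_p = np.log(word_probs)          # log P(w_j | G_m)
        log_q = np.log(1.0 - word_probs)    # log (1 - P(w_j | G_m))
        log_post = (np.log(priors)[None, :]
                    + D_binary @ log_p.T          # words present in X
                    + (1 - D_binary) @ log_q.T)   # words absent from X
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

    return post, word_probs, priors
```

The returned post matrix corresponds to the soft cluster assignments P (Gm|X), and word_probs gives the probabilistic cluster digest P (wj |Gm) of each cluster.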




13.3.3 Simultaneous Document and Word Cluster Discovery


The probabilistic algorithm discussed in the previous section can simultaneously discover document and word clusters. As discussed in Sect. 7.4 of Chap. 7 on high-dimensional clustering methods, this is important in the high-dimensional case because clusters are best characterized in terms of both rows and columns simultaneously. In the text domain, the additional advantage of such methods is that the topical words of a cluster provide semantic insights about that cluster. Another example is the nonnegative matrix factorization method, discussed in Sect. 6.8 of Chap. 6. This approach is very popular in the text domain because the factorized matrices have a natural interpretation for text data. It can simultaneously discover word clusters and document clusters, which are represented by the columns of the two factorized matrices. This is also closely related to the concept of co-clustering.
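As a concrete illustration of the factorization view, the following sketch applies scikit-learn's NMF implementation to a small document-term matrix and reads off the dominant factor of each document (a document clustering) together with the top-weighted words of each factor (a word clustering, i.e., a cluster digest). The toy corpus, vectorizer settings, number of factors, and the "assign to dominant factor" heuristic are illustrative assumptions rather than prescriptions from the text.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpus (assumed for demonstration purposes only)
docs = [
    "stock market trading prices rose",
    "investors traded stocks on the market",
    "the team won the football match",
    "football players scored in the match",
]

vectorizer = TfidfVectorizer()
D = vectorizer.fit_transform(docs)          # n x d document-term matrix

k = 2                                       # number of clusters/factors (assumed)
nmf = NMF(n_components=k, init="nndsvd", random_state=0)
U = nmf.fit_transform(D)                    # n x k: document-to-factor weights
V = nmf.components_                         # k x d: factor-to-word weights

doc_clusters = U.argmax(axis=1)             # dominant factor = document cluster
terms = vectorizer.get_feature_names_out()
for m in range(k):
    top_words = [terms[j] for j in V[m].argsort()[::-1][:3]]
    print(f"Cluster {m}: docs {np.where(doc_clusters == m)[0]}, top words {top_words}")
```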

13.3.3.1 Co-clustering


Co-clustering is most effective for nonnegative matrices in which many entries have zero values. In other words, the matrix is sparsely populated. This is the case for text data. Co-clustering methods can also be generalized to dense matrices, although these techniques are not relevant to the text domain. Co-clustering is also sometimes referred to as bi-clustering or two-mode clustering because of its exploitation of both “modes” (words and documents). While the co-clustering method is presented here in the context of text data, the broader approach is also used in the biological domain with some modifications.


The idea in co-clustering is to rearrange the rows and columns in the data matrix so that most of the nonzero entries become arranged into blocks. In the context of text data, this matrix is the n × d document-term matrix D, where rows correspond to documents and columns correspond to words. Thus, the ith cluster is associated with a set of rows Ri (documents) and a set of columns Vi (words). The rows Ri are disjoint from one another over different values of i, and the columns Vi are also disjoint from one another over different values of i. Thus, the co-clustering method simultaneously leads to document clusters and word clusters. From an intuitive perspective, the words representing the columns of Vi are the most relevant (or topical) words for cluster Ri. The set Vi therefore defines a cluster digest of Ri.
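To make the block structure concrete, the following toy sketch (with an assumed matrix and assumed cluster labels) permutes the rows and columns of a small 0/1 document-term matrix by their cluster labels; the nonzero entries then line up in blocks, with the rows Ri and columns Vi of the same cluster forming one block.

```python
import numpy as np

# A tiny 0/1 document-term matrix with two hidden co-clusters (assumed example):
# documents 0 and 2 use words 1 and 3; documents 1 and 3 use words 0 and 2.
D = np.array([
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
])

# Hypothetical co-clustering output: row labels define R_i, column labels define V_i
row_labels = np.array([0, 1, 0, 1])   # cluster of each document
col_labels = np.array([1, 0, 1, 0])   # cluster of each word

# Rearranging rows and columns by cluster label exposes the block structure
reordered = D[np.argsort(row_labels), :][:, np.argsort(col_labels)]
print(reordered)
# [[1 1 0 0]
#  [1 1 0 0]
#  [0 0 1 1]
#  [0 0 1 1]]
```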


In the context of text data, word clusters are just as important as document clusters because they provide insights about the topics of the underlying collection. Most of the methods discussed in this book for document clustering, such as the scatter/gather method, probabilistic methods, and nonnegative matrix factorization (see Sect. 6.8 of Chap. 6), produce word clusters (or cluster digests) in addition to document clusters. However, the word clusters produced by these algorithms may overlap, whereas the document clusters are nonoverlapping in all of these algorithms except for the probabilistic (soft) EM method. In co-clustering, the word clusters and document clusters are both nonoverlapping. Each document and word is strictly associated with a particular cluster. One nice characteristic of co-clustering is that it explicitly explores the duality between word clusters and document clusters. Coherent word clusters can be shown to induce coherent document clusters, and vice versa. For example, if meaningful word clusters were already available, then one might be able to cluster documents by assigning each document to the word cluster with which it has the most words in common. In co-clustering, the goal is to do this simultaneously, so that word clusters and document clusters depend on each other in an optimal way.
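One concrete co-clustering algorithm for sparse nonnegative matrices is the spectral co-clustering method of Dhillon, which is available in scikit-learn as SpectralCoclustering. The sketch below uses it purely to illustrate nonoverlapping row (document) and column (word) clusters; it is not necessarily the exact method developed in the remainder of this section, and the toy corpus and parameter choices are assumptions.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering
from sklearn.feature_extraction.text import CountVectorizer

# Tiny illustrative corpus (assumed); rows = documents, columns = words
docs = [
    "stock market trading prices",
    "investors traded stocks market",
    "football team won match",
    "football players scored match",
]
vectorizer = CountVectorizer()
D = vectorizer.fit_transform(docs)      # sparse n x d document-term matrix

model = SpectralCoclustering(n_clusters=2, random_state=0)
model.fit(D)

terms = vectorizer.get_feature_names_out()
for m in range(2):
    doc_ids = np.where(model.row_labels_ == m)[0]      # rows R_m (documents)
    words = terms[model.column_labels_ == m]           # columns V_m (words)
    print(f"Co-cluster {m}: documents {doc_ids.tolist()}, words {words.tolist()}")
```

Each document and each word receives exactly one label, which matches the nonoverlapping property of co-clustering described above.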





