Split operation: The process of splitting can be used to further refine the clusters into groups of finer granularity. This is achieved by applying the buckshot procedure to the individual documents of a cluster with k = 2 and then reclustering around these centers. This procedure requires O(k · n_i) time for a cluster containing n_i documents, and therefore splitting all the groups requires O(k · n) time. However, it is not necessary to split all the groups; only a subset need be split, namely the groups that are not very coherent and contain documents of a disparate nature. To measure the coherence of a group, the self-similarity of the documents in the cluster is computed, either as the average similarity of the documents to the cluster centroid or as the average pairwise similarity of the documents to one another. The split criterion can then be applied selectively, only to those clusters with low self-similarity, which helps in creating more coherent clusters. A sketch of this criterion, together with the join criterion below, follows the description of both operations.
Join operation: The join operation merges similar clusters into a single cluster. To perform the merging, the topical words of each cluster are computed as the most frequent words in its centroid. Two clusters are considered similar when there is significant overlap between their topical words.
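As a concrete illustration of both operations, the following sketch selects low-coherence clusters for splitting and tests cluster pairs for joining. It assumes documents are rows of a unit-normalized tf-idf matrix; the function names, the thresholds, the number of topical words, and the use of plain 2-means in place of the k = 2 buckshot-and-recluster step are illustrative assumptions, not specifics from the text.

```python
import numpy as np
from sklearn.cluster import KMeans

def self_similarity(docs):
    """Coherence of a cluster: average cosine similarity of its
    unit-normalized document vectors to the cluster centroid."""
    centroid = docs.mean(axis=0)
    centroid /= np.linalg.norm(centroid) + 1e-12
    return float((docs @ centroid).mean())

def split_incoherent(clusters, threshold=0.3):
    """Split every cluster whose self-similarity falls below `threshold`
    into two subclusters (plain 2-means stands in here for the k = 2
    buckshot-and-recluster step described in the text)."""
    refined = []
    for docs in clusters:
        if len(docs) > 2 and self_similarity(docs) < threshold:
            labels = KMeans(n_clusters=2, n_init=10).fit_predict(docs)
            refined.extend([docs[labels == 0], docs[labels == 1]])
        else:
            refined.append(docs)
    return refined

def should_join(docs_a, docs_b, top=20, min_overlap=0.5):
    """Join test: two clusters are merged when the sets of their `top`
    highest-weight centroid terms (the topical words) overlap in at
    least a fraction `min_overlap` of positions."""
    topical = lambda docs: set(np.argsort(docs.mean(axis=0))[-top:])
    overlap = topical(docs_a) & topical(docs_b)
    return len(overlap) / top >= min_overlap
```

With this structure, a scatter/gather iteration can first apply should_join to candidate pairs and then split_incoherent to the result, so that granularity is adjusted in both directions.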
The scatter/gather approach is effective because it combines the robustness of hierarchical clustering, used on a small sample to find good seeds, with the speed of k-means, used to cluster the full collection around those seeds.
13.3.2 Probabilistic Algorithms
Probabilistic text clustering can be considered an unsupervised version of the naive Bayes classification method discussed in Sect. 10.5.1 of Chap. 10. It is assumed that the documents need to be assigned to one of k clusters G_1 . . . G_k. The basic idea is to use the following generative process:
1. Select a cluster G_m, where m ∈ {1 . . . k}.
2. Generate the term distribution of G_m based on a generative model. Examples of such models for text include the Bernoulli model and the multinomial model; a minimal sampler for the Bernoulli case is sketched after this list.
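The following sketch makes the two steps concrete for the Bernoulli model, in which a document is a binary vector of word presences and absences. The prior vector, the per-cluster word probabilities, and the function name are illustrative assumptions, not values prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(priors, word_probs):
    """One pass of the generative process for the Bernoulli model:
    step 1 selects a cluster G_m with probability P(G_m); step 2 draws
    each word's presence as an independent Bernoulli trial with
    probability P(w_j | G_m)."""
    m = rng.choice(len(priors), p=priors)                    # step 1
    doc = rng.random(word_probs.shape[1]) < word_probs[m]    # step 2
    return m, doc.astype(int)

# Illustrative parameters: k = 2 clusters over a 5-word lexicon.
priors = np.array([0.6, 0.4])                        # P(G_m)
word_probs = np.array([[0.9, 0.8, 0.1, 0.1, 0.2],    # P(w_j | G_1)
                       [0.1, 0.2, 0.9, 0.8, 0.3]])   # P(w_j | G_2)
cluster, document = generate_document(priors, word_probs)
```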
The observed data are then used to estimate the parameters of the Bernoulli or multinomial distributions in the generative process. This section will discuss the Bernoulli model.
The clustering is done in an iterative way with the EM algorithm, where cluster assignments of documents are determined from conditional word distributions in the E-step with the Bayes rule, and the conditional word distributions are inferred from cluster assignments in the M-step. For initialization, the documents are randomly assigned to clusters. The initial prior probabilities P(G_m) and conditional feature distributions P(w_j|G_m) are estimated from the statistical distribution of this random assignment. A Bayes classifier is used to estimate the posterior probability P(G_m|X) in the E-step. The Bayes classifier commonly uses either the Bernoulli model or the multinomial model discussed later in this chapter. The posterior probability P(G_m|X) of the Bayes classifier can be viewed as a soft assignment probability of document X to the mth mixture component G_m. The conditional feature distribution P(w_j|G_m) for word w_j is estimated from these posterior probabilities in the M-step as follows:

P(w_j|G_m) = Σ_i P(G_m|X_i) · I(X_i, w_j) / Σ_i P(G_m|X_i)

Here, I(X_i, w_j) is an indicator variable that takes the value 1 when word w_j occurs in document X_i and 0 otherwise, and the sums range over all documents. The prior probabilities P(G_m) are re-estimated analogously as the average posterior probability of the documents with respect to G_m, and the E-step and M-step are iterated to convergence.
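A compact sketch of the whole EM loop for the Bernoulli model follows, assuming a binary document-term matrix as input; the function name, the Laplacian smoothing constants, and the fixed iteration count are illustrative choices rather than prescriptions from the text.

```python
import numpy as np

def bernoulli_em(X, k, iters=50, seed=0):
    """EM for a mixture of Bernoulli models over a binary document-term
    matrix X of shape (n, d). Returns the priors P(G_m), the word
    probabilities P(w_j | G_m), and the soft assignments P(G_m | X_i)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialization: random hard assignment of documents to clusters,
    # stored as one-hot rows of the responsibility matrix.
    resp = rng.multinomial(1, np.full(k, 1.0 / k), size=n).astype(float)
    for _ in range(iters):
        # M-step: estimate parameters from the current (soft) assignments,
        # with Laplacian smoothing to keep all probabilities nonzero.
        weight = resp.sum(axis=0)                           # soft cluster sizes
        priors = (weight + 1.0) / (n + k)                   # P(G_m)
        word_probs = (resp.T @ X + 1.0) / (weight[:, None] + 2.0)  # P(w_j|G_m)
        # E-step: Bayes rule in log space for numerical stability.
        log_post = (X @ np.log(word_probs).T
                    + (1.0 - X) @ np.log(1.0 - word_probs).T
                    + np.log(priors))
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)             # P(G_m | X_i)
    return priors, word_probs, resp
```

Running bernoulli_em on a binarized term-document matrix yields soft assignments that can be hardened, if desired, by assigning each document to its arg-max cluster.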