Split operation: The process of splitting can be used to further refine the clusters into groups of finer granularity. This is achieved by applying the buckshot procedure to the individual documents of a cluster with k = 2 and then reclustering around these centers. This procedure requires O(k · n_i) time for a cluster containing n_i documents, and therefore splitting all the groups requires O(k · n) time. However, it is not necessary to split all the groups; only a subset need be split, namely the groups that are not very coherent and contain documents of a disparate nature. To measure the coherence of a group, the self-similarity of the documents in the cluster is computed, either as the average similarity of the documents to the cluster centroid or as the average pairwise similarity of the documents to one another. The split criterion can then be applied selectively, only to those clusters with low self-similarity, which helps in creating more coherent clusters. A sketch of this criterion, together with the join criterion below, follows the description of both operations.
Join operation: The join operation merges similar clusters into a single cluster. To perform the merging, the topical words of each cluster are computed as the most frequent words in its centroid. Two clusters are considered similar when there is significant overlap between their topical words.
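As a concrete illustration of both operations, the following sketch selects low-coherence clusters for splitting and tests cluster pairs for joining. It assumes documents are rows of a unit-normalized tf-idf matrix; the function names, the thresholds, the number of topical words, and the use of plain 2-means in place of the k = 2 buckshot-and-recluster step are illustrative assumptions, not specifics from the text.

```python
import numpy as np
from sklearn.cluster import KMeans

def self_similarity(docs):
    """Coherence of a cluster: average cosine similarity of its
    unit-normalized document vectors to the cluster centroid."""
    centroid = docs.mean(axis=0)
    centroid /= np.linalg.norm(centroid) + 1e-12
    return float((docs @ centroid).mean())

def split_incoherent(clusters, threshold=0.3):
    """Split every cluster whose self-similarity falls below `threshold`
    into two subclusters (plain 2-means stands in here for the k = 2
    buckshot-and-recluster step described in the text)."""
    refined = []
    for docs in clusters:
        if len(docs) > 2 and self_similarity(docs) < threshold:
            labels = KMeans(n_clusters=2, n_init=10).fit_predict(docs)
            refined.extend([docs[labels == 0], docs[labels == 1]])
        else:
            refined.append(docs)
    return refined

def should_join(docs_a, docs_b, top=20, min_overlap=0.5):
    """Join test: two clusters are merged when the sets of their `top`
    highest-weight centroid terms (the topical words) overlap in at
    least a fraction `min_overlap` of positions."""
    topical = lambda docs: set(np.argsort(docs.mean(axis=0))[-top:])
    overlap = topical(docs_a) & topical(docs_b)
    return len(overlap) / top >= min_overlap
```

With this structure, a scatter/gather iteration can first apply should_join to candidate pairs and then split_incoherent to the result, so that granularity is adjusted in both directions.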
The scatter/gather approach is effective because it combines the robustness of hierarchical clustering, used on a small sample to find good seeds, with the speed of k-means, used to cluster the full collection around those seeds.
13.3.2 Probabilistic Algorithms
Probabilistic text clustering can be considered an unsupervised version of the naive Bayes classification method discussed in Sect. 10.5.1 of Chap. 10. It is assumed that the documents need to be assigned to one of k clusters G_1 . . . G_k. The basic idea is to use the following generative process:
1. Select a cluster G_m, where m ∈ {1 . . . k}.
2. Generate the term distribution of G_m based on a generative model. Examples of such models for text include the Bernoulli model and the multinomial model; a minimal sampler for the Bernoulli case is sketched after this list.
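The following sketch makes the two steps concrete for the Bernoulli model, in which a document is a binary vector of word presences and absences. The prior vector, the per-cluster word probabilities, and the function name are illustrative assumptions, not values prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(priors, word_probs):
    """One pass of the generative process for the Bernoulli model:
    step 1 selects a cluster G_m with probability P(G_m); step 2 draws
    each word's presence as an independent Bernoulli trial with
    probability P(w_j | G_m)."""
    m = rng.choice(len(priors), p=priors)                    # step 1
    doc = rng.random(word_probs.shape[1]) < word_probs[m]    # step 2
    return m, doc.astype(int)

# Illustrative parameters: k = 2 clusters over a 5-word lexicon.
priors = np.array([0.6, 0.4])                        # P(G_m)
word_probs = np.array([[0.9, 0.8, 0.1, 0.1, 0.2],    # P(w_j | G_1)
                       [0.1, 0.2, 0.9, 0.8, 0.3]])   # P(w_j | G_2)
cluster, document = generate_document(priors, word_probs)
```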
The observed data are then used to estimate the parameters of the Bernoulli or multinomial distributions in the generative process. This section will discuss the Bernoulli model.
The clustering is done in an iterative way with the EM algorithm, where cluster assignments of documents are determined from conditional word distributions in the E-step with the Bayes rule, and the conditional word distributions are inferred from cluster assignments in the M-step. For initialization, the documents are randomly assigned to clusters. The initial prior probabilities P(G_m) and conditional feature distributions P(w_j|G_m) are estimated from the statistical distribution of this random assignment. A Bayes classifier is used to estimate the posterior probability P(G_m|X) in the E-step. The Bayes classifier commonly uses either the Bernoulli model or the multinomial model discussed later in this chapter. The posterior probability P(G_m|X) of the Bayes classifier can be viewed as a soft assignment probability of document X to the mth mixture component G_m. The conditional feature distribution P(w_j|G_m) for word w_j is estimated from these posterior probabilities in the M-step as follows:

P(w_j|G_m) = Σ_i P(G_m|X_i) · I(X_i, w_j) / Σ_i P(G_m|X_i)

Here, I(X_i, w_j) is an indicator variable that takes the value 1 when word w_j occurs in document X_i and 0 otherwise, and the sums range over all documents. The prior probabilities P(G_m) are re-estimated analogously as the average posterior probability of the documents with respect to G_m, and the E-step and M-step are iterated to convergence.
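A compact sketch of the whole EM loop for the Bernoulli model follows, assuming a binary document-term matrix as input; the function name, the Laplacian smoothing constants, and the fixed iteration count are illustrative choices rather than prescriptions from the text.

```python
import numpy as np

def bernoulli_em(X, k, iters=50, seed=0):
    """EM for a mixture of Bernoulli models over a binary document-term
    matrix X of shape (n, d). Returns the priors P(G_m), the word
    probabilities P(w_j | G_m), and the soft assignments P(G_m | X_i)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialization: random hard assignment of documents to clusters,
    # stored as one-hot rows of the responsibility matrix.
    resp = rng.multinomial(1, np.full(k, 1.0 / k), size=n).astype(float)
    for _ in range(iters):
        # M-step: estimate parameters from the current (soft) assignments,
        # with Laplacian smoothing to keep all probabilities nonzero.
        weight = resp.sum(axis=0)                           # soft cluster sizes
        priors = (weight + 1.0) / (n + k)                   # P(G_m)
        word_probs = (resp.T @ X + 1.0) / (weight[:, None] + 2.0)  # P(w_j|G_m)
        # E-step: Bayes rule in log space for numerical stability.
        log_post = (X @ np.log(word_probs).T
                    + (1.0 - X) @ np.log(1.0 - word_probs).T
                    + np.log(priors))
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)             # P(G_m | X_i)
    return priors, word_probs, resp
```

Running bernoulli_em on a binarized term-document matrix yields soft assignments that can be hardened, if desired, by assigning each document to its arg-max cluster.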