Data Mining: The Textbook



13.3. SPECIALIZED CLUSTERING METHODS FOR TEXT


This incoherence can sometimes be inherited by later iterations, as a result of which the quality of the final results will be poor.


To address this issue, the scatter/gather approach uses a combination of hierarchical partitioning and k-means clustering in a two-phase approach. An efficient and simplified form of hierarchical clustering is applied to a sample of the corpus, to yield a robust set of seeds in the first phase. This is achieved by using either of two possible procedures that are referred to as buckshot and fractionation, respectively. Both these procedures are different types of hierarchical procedures. In the second phase, the robust seeds generated in the first phase are used as the starting point of a k-means algorithm, as adapted to text data. The size of the sample in the first phase is carefully selected to balance the time required by the first phase and the second phase. Thus, the overall approach may be described as follows:





  1. Apply either the buckshot or fractionation procedure to create a robust set of initial seeds.

  2. Apply a k-means approach on the resulting set of seeds to generate the final clusters. Additional refinements may be used to improve the clustering quality.

Next, the buckshot and fractionation procedures will be described. These are two alternatives for the first phase with a similar running time. The fractionation method is the more robust one, but the buckshot method is faster in many practical settings.


  • Buckshot: Let k be the number of clusters to be found and n be the number of documents in the corpus. The buckshot method selects a seed superset of size √(k · n) and then agglomerates it to k seeds. Straightforward agglomerative hierarchical clustering algorithms (requiring¹ quadratic time) are applied to this initial sample of √(k · n) seeds. Because quadratically scalable algorithms are applied to a sample of size √(k · n), this approach requires O(k · n) time. This seed set is more robust than a naive data sample of k seeds because it represents the summarization of a larger sample of the corpus.
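The buckshot phase can be sketched as follows. This is a minimal illustration, not the book's implementation: the function names, the naive pairwise-merge loop, and the user-supplied similarity callback are all assumptions made for the example.

```python
import math
import random

def buckshot_seeds(docs, k, similarity):
    # Sketch of buckshot initialization (names are illustrative):
    # sample about sqrt(k * n) documents, then agglomerate the
    # sample down to k seed clusters.
    n = len(docs)
    sample = random.sample(docs, min(n, int(math.sqrt(k * n))))
    clusters = [[d] for d in sample]  # start from singleton clusters
    # Naive agglomeration: repeatedly merge the most similar pair.
    # As written this is cubic in the sample size; a practical
    # implementation caches pairwise similarities to stay quadratic.
    while len(clusters) > k:
        i, j = max(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: similarity(clusters[ab[0]], clusters[ab[1]]))
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters
```

Because only √(k · n) documents enter the quadratic-time agglomeration, the whole phase stays within O(k · n), matching a single k-means iteration in the second phase.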



  • Fractionation: Unlike the buckshot method, which uses a sample of √(k · n) documents, the fractionation method works with all the documents in the corpus. The fractionation algorithm initially breaks up the corpus into n/m buckets, each of size m > k documents. An agglomerative algorithm is applied to each of these buckets to reduce them by a factor ν ∈ (0, 1). This step creates ν · m agglomerated documents in each bucket, and therefore ν · n agglomerated documents over all buckets. An “agglomerated document” is defined as the concatenation of the documents in a cluster. The process is repeated by treating each of these agglomerated documents as a single document. The approach terminates when a total of k seeds remains.

It remains to be explained how the documents are partitioned into buckets. One possibility is to use a random partitioning of the documents. However, a more carefully designed procedure can achieve more effective results. One such procedure sorts the documents by the index of the jth most common word in each document. Here, j is chosen to be a small number, such as 3, so that it corresponds to medium-frequency words in the documents. Contiguous groups of m documents in this sort order are mapped to buckets. This approach ensures that the documents in each bucket share at least a few common words and are therefore not grouped completely at random. This can sometimes help in improving the quality of the centers.
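The fractionation procedure, including the bucketing heuristic above, can be sketched as follows. The parameter names, the word-based sort key, and the simplified within-bucket merge (concatenating adjacent documents rather than running a full hierarchical clustering) are assumptions made for the illustration.

```python
import math
from collections import Counter

def sort_key(doc, j=3):
    # The j-th most common word of the document, used as an
    # illustrative stand-in for "the index of the j-th most
    # common word" from the text.
    ranked = [w for w, _ in Counter(doc.split()).most_common()]
    return ranked[min(j - 1, len(ranked) - 1)]

def fractionation(docs, k, m, nu=0.25):
    # Sketch of fractionation (parameter names are illustrative).
    # Repeatedly bucket the current documents into groups of size m
    # and shrink each bucket by a factor nu, where a merge produces
    # the concatenation of its documents, until k remain.
    current = sorted(docs, key=sort_key)
    while len(current) > k:
        nxt = []
        for start in range(0, len(current), m):
            bucket = current[start:start + m]
            target = max(1, int(nu * len(bucket)))
            step = math.ceil(len(bucket) / target)
            # Simplified agglomeration: merge runs of adjacent
            # documents (the real algorithm would run hierarchical
            # clustering inside each bucket).
            for i in range(0, len(bucket), step):
                nxt.append(" ".join(bucket[i:i + step]))
        current = nxt
    return current
```

Each returned string is an "agglomerated document" in the sense defined above: the concatenation of all the original documents merged into that seed.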








¹As discussed in Chap. 6, standard agglomerative algorithms require more than quadratic time, though some simpler variants of single-linkage clustering [469] can be implemented in approximately quadratic time.


The agglomerative clustering of m documents in the first iteration of the fractionation algorithm requires O(m²) time for each group, and sums to O(n · m) over the n/m different groups. As the number of individuals reduces geometrically by a factor of ν in each iteration, the total running time over all iterations is O(n · m · (1 + ν + ν² + . . .)). For ν < 1, the running time over all iterations is still O(n · m). By selecting m = O(k), one still ensures a running time of O(n · k) for the initialization procedure.
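The geometric-series argument can be checked numerically. The constants below are arbitrary illustrative values, not figures from the text:

```python
# Numerical check of the running-time bound: the i-th iteration of
# fractionation processes about nu**i * n documents at O(m) cost per
# document, so the total is n*m*(1 + nu + nu**2 + ...), which the
# geometric series bounds by n*m/(1 - nu).
n, m, nu = 10_000, 50, 0.25   # illustrative values
total, count = 0.0, float(n)
while count >= 1:
    total += count * m        # cost of this iteration
    count *= nu               # population shrinks geometrically
bound = n * m / (1 - nu)
assert n * m <= total <= bound
```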


The buckshot and fractionation procedures require O(k · n) time. This is equivalent to the running time of a single iteration of the k-means algorithm. As discussed below, this is important in (asymptotically) balancing the running time of the two phases of the algorithm.


When the initial cluster centers have been determined with the use of the buckshot or fractionation algorithms, one can apply the k-means algorithm with the seeds obtained in the first step. Each document is assigned to the nearest of the k cluster centers. The centroid of each such cluster is determined as the concatenation of the documents in that cluster. Furthermore, the less frequent words of each centroid are removed. These centroids replace the seeds from the previous iteration. This process can be iteratively repeated to refine the cluster centers. Typically, only a small constant number of iterations is required because the greatest improvements occur only in the first few iterations. This ensures that the overall running time of each of the first and second phases is O(k · n).
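The second phase can be sketched as follows. The text only says each document is assigned to the "nearest" center; the cosine similarity on term frequencies used here, the `keep` pruning parameter, and the fixed iteration count are assumptions made for the example.

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two term-frequency Counters.
    dot = sum(v * b[w] for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def text_kmeans(docs, seeds, iters=3, keep=200):
    # Sketch of the second phase (parameter names illustrative).
    # Each document (a list of words) is assigned to the most
    # similar centroid; a centroid is then rebuilt as the
    # concatenation of its cluster's documents, with only the
    # `keep` most frequent words retained.
    centroids = [Counter(s) for s in seeds]
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for doc in docs:
            tf = Counter(doc)
            best = max(range(len(centroids)),
                       key=lambda i: cosine(tf, centroids[i]))
            clusters[best].append(doc)
        for c, members in enumerate(clusters):
            if members:
                merged = Counter()
                for doc in members:
                    merged.update(doc)   # concatenate the documents
                # Prune the less frequent words of the centroid.
                centroids[c] = Counter(dict(merged.most_common(keep)))
    return clusters
```

Pruning the centroid to its most frequent words keeps each iteration's assignment step cheap, which is what allows the overall second phase to stay within O(k · n) for a constant number of iterations.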


It is also possible to use a number of enhancements after the second clustering phase. These enhancements are as follows:



