Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə307/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   303   304   305   306   307   308   309   310   ...   423
1-Data Mining tarjima

Cluster addition: The number of clusters added is k · f , where k is the number of clusters at the end of the last iteration. The value of f is in the range (0, 1), and is computed as follows. Let ka be the number of clusters added in the previous iteration, and let kr be the number of clusters removed because of elimination of overlapping clusters in the previous iteration. Then, the value of f is computed as follows:




f =

max{ka − kr, 0}

(15.2)







ka













The rationale for this is that when the algorithm reaches its “natural” number of clusters, eliminations will dominate. In such cases, f will be small or 0, and few new clusters need to be added. On the other hand, in cases where the current number of clusters is significantly lower than the “natural” number of clusters in the data, the value of f should be close to 1. In earlier iterations, the number of added clusters is much larger than the number of removed clusters, which results in rapid growth.


The new clusters created are singleton clusters. The sequences that are as different as possible from both the existing clusters and each other are selected. Therefore pair-wise similarity needs to be computed between each unclustered sequence and other clusters/unclustered sequences. Because it can be expensive to compute pairwise sim-ilarity between the clusters and all unclustered sequences, a sample of unclustered sequences is used to restrict the scope of new seed selection. The approach for com-puting similarity will be described later.





  1. Sequence assignment to clusters: Sequences are assigned to clusters for which the sim-ilarity to the cluster is larger than a user-specified threshold t. The original CLUSEQ algorithm provides a way to adjust the threshold t as well, though the description in this chapter provides only a simplified version of the algorithm, where t is fixed and specified by the user. A given sequence may be assigned to either multiple clusters or may remain unassigned to any cluster. The actual similarity computation is performed using a Markovian similarity measure. This measure will be described later.




  1. Cluster elimination: Many clusters are highly overlapping because of assignment of sequences to multiple clusters. It is desired to restrict this overlap to reduce redun-dancy in the clustering. If the number of sequences that are unique to a particular cluster is less than a predefined threshold, then such a cluster is eliminated.

The only step that remains to be described is the computation of the Markovian similarity measure between sequences and clusters. The idea is that if a sequence of symbols S = s1s2 . . . sn is similar to a cluster Ci, then it should be “easy” to generate S using the





506 CHAPTER 15. MINING DISCRETE SEQUENCES

conditional distribution of the symbols inside the cluster. Then, the probability P (S|Ci) is defined as follows:






Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   303   304   305   306   307   308   309   310   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin