Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə306/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   302   303   304   305   306   307   308   309   ...   423
1-Data Mining tarjima

k = f = 1;

Let C1 be a singleton cluster with randomly chosen sequence; repeat


Add ka = k · f new singleton clusters containing sequences

that are as different as possible from existing clusters/each other;



  • = k + ka;

Assign (if possible) each sequence in D to each cluster in C1 . . . Ck for which the similarity is at least t;

Eliminate the kr clusters containing less than minthresh sequences uniquely assigned to them;


k = k − kr;
= max{ka−kr ,0} ;


f
ka
until no change in clustering result;
return clusters C1 . . . Ck;
end

Figure 15.3: The simplified CLUSEQ Algorithm


construction and the choice of the text-clustering algorithm. The CONTOUR method [505] uses a two-level hierarchical clustering, where fine-grained microclusters are generated in the first step. Then, these microclusters are agglomerated into higher-level clusters. The bibliographic notes contain pointers to specific instantiations of this framework.

15.3.4 Probabilistic Clustering


Probabilistic clustering methods are based on the generative principle, that a symbol in a given sequence is generated with a probability defined by statistical correlations with the symbols before it. This is based on the general principle of Markovian Models. Therefore, the similarity between a sequence and a cluster is computed using the generative probability of the symbols within that cluster. After a similarity function has been defined between a cluster and a sequence, it can be used to create a distance-based algorithm. The CLUSEQ algorithm is based on this principle.


15.3.4.1 Markovian Similarity-Based Algorithm: CLUSEQ


The CLUstering SEQuences (CLUSEQ) algorithm is based on the broader principle of Markovian Models. Markovian models are used to define a similarity function between a sequence and cluster. The CLUSEQ algorithm can otherwise be considered a similarity-based iterative partitioning algorithm. While traditional partitioning algorithms fix the number of clusters over multiple iterations, this is not the case in CLUSEQ. The CLUSEQ algorithm starts with only a single cluster. A carefully controlled number of new clusters containing individual sequences are added in each iteration, and older ones are removed when they are deemed to be too similar to existing clusters. The initial growth in the number of clusters is rapid but it slows down over the course of the algorithm. It is even possible for the number of clusters to shrink in later iterations. One advantage of this approach is that the algorithm can automatically determine the natural number of clusters.





15.3. SEQUENCE CLUSTERING

505

Instead of using the number of clusters as an input parameter, the CLUSEQ algorithm works with a similarity threshold t. A sequence is assigned to a cluster, if its similarity to the cluster exceeds the threshold t. Sequences may be assigned to any number of clusters (or no cluster) as long as the similarity is greater than t. The CLUSEQ algorithm has three main steps corresponding to addition of new clusters, assignment of sequences to clusters, and elimination of clusters. These steps are repeated iteratively until there is no change in the clustering result. A simplified version2 of the CLUSEQ algorithm is described in Fig. 15.3. A detailed description of the individual steps is provided below.






  1. Yüklə 17,13 Mb.

    Dostları ilə paylaş:
1   ...   302   303   304   305   306   307   308   309   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin