k = f = 1;
Let C1 be a singleton cluster with randomly chosen sequence; repeat
Add ka = k · f new singleton clusters containing sequences
that are as different as possible from existing clusters/each other;
Assign (if possible) each sequence in D to each cluster in C1 . . . Ck for which the similarity is at least t;
Eliminate the kr clusters containing less than minthresh sequences uniquely assigned to them;
k = k − kr;
= max{ka−kr ,0} ;
f
ka
until no change in clustering result;
return clusters C1 . . . Ck;
end
Figure 15.3: The simplified CLUSEQ Algorithm
construction and the choice of the text-clustering algorithm. The CONTOUR method [505] uses a two-level hierarchical clustering, where fine-grained microclusters are generated in the first step. Then, these microclusters are agglomerated into higher-level clusters. The bibliographic notes contain pointers to specific instantiations of this framework.
15.3.4 Probabilistic Clustering
Probabilistic clustering methods are based on the generative principle, that a symbol in a given sequence is generated with a probability defined by statistical correlations with the symbols before it. This is based on the general principle of Markovian Models. Therefore, the similarity between a sequence and a cluster is computed using the generative probability of the symbols within that cluster. After a similarity function has been defined between a cluster and a sequence, it can be used to create a distance-based algorithm. The CLUSEQ algorithm is based on this principle.
15.3.4.1 Markovian Similarity-Based Algorithm: CLUSEQ
The CLUstering SEQuences (CLUSEQ) algorithm is based on the broader principle of Markovian Models. Markovian models are used to define a similarity function between a sequence and cluster. The CLUSEQ algorithm can otherwise be considered a similarity-based iterative partitioning algorithm. While traditional partitioning algorithms fix the number of clusters over multiple iterations, this is not the case in CLUSEQ. The CLUSEQ algorithm starts with only a single cluster. A carefully controlled number of new clusters containing individual sequences are added in each iteration, and older ones are removed when they are deemed to be too similar to existing clusters. The initial growth in the number of clusters is rapid but it slows down over the course of the algorithm. It is even possible for the number of clusters to shrink in later iterations. One advantage of this approach is that the algorithm can automatically determine the natural number of clusters.
15.3. SEQUENCE CLUSTERING
|
505
|
Instead of using the number of clusters as an input parameter, the CLUSEQ algorithm works with a similarity threshold t. A sequence is assigned to a cluster, if its similarity to the cluster exceeds the threshold t. Sequences may be assigned to any number of clusters (or no cluster) as long as the similarity is greater than t. The CLUSEQ algorithm has three main steps corresponding to addition of new clusters, assignment of sequences to clusters, and elimination of clusters. These steps are repeated iteratively until there is no change in the clustering result. A simplified version2 of the CLUSEQ algorithm is described in Fig. 15.3. A detailed description of the individual steps is provided below.
Dostları ilə paylaş: |