Data Mining: The Textbook






Kernel-based similarity: Kernel-based similarity is particularly useful for SVM classification. Some examples of kernel-based similarity are discussed in detail in Sect. 15.6.4.

The different measures are used in different application-specific scenarios. Many of these scenarios will be discussed in this chapter.


15.3.1 Distance-Based Methods


When a distance or similarity function has been defined, the k-medoids method can be generalized very simply to sequence data. The k-medoids method is agnostic to the choice of data type and similarity function because it is designed as a generic hill-climbing approach. In fact, the CLARANS algorithm, discussed in Sect. 7.3.1 of Chap. 7, can be easily generalized to work with any data type. The algorithm selects k representative sequences from the data and assigns each data point to its closest representative, using the selected distance (similarity) function. The quality of the representative sequences is improved iteratively with the use of a hill-climbing algorithm.
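The hill-climbing procedure above can be sketched as follows. This is a minimal k-medoids sketch rather than the full CLARANS algorithm (which randomizes and restricts the swaps examined); the function names, the toy string data, and the Hamming distance plugged in at the end are illustrative assumptions, not part of the text.

```python
import random

def k_medoids(objects, k, dist, seed=0):
    """Generic k-medoids: works for any data type, given a distance function."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(objects)), k)

    def cost(meds):
        # Total distance of every object to its closest medoid.
        return sum(min(dist(o, objects[m]) for m in meds) for o in objects)

    best = cost(medoids)
    improved = True
    while improved:
        improved = False
        # Hill climbing: try swapping each medoid with each non-medoid,
        # keeping any swap that lowers the clustering cost.
        for i in range(k):
            for j in range(len(objects)):
                if j in medoids:
                    continue
                trial = medoids[:i] + [j] + medoids[i + 1:]
                c = cost(trial)
                if c < best:
                    medoids, best, improved = trial, c, True
    # Assign each object to its closest representative.
    labels = [min(medoids, key=lambda m: dist(o, objects[m])) for o in objects]
    return medoids, labels

# Toy example: four short sequences under a simple Hamming distance.
seqs = ["AAB", "AAC", "TTG", "TTC"]
hamming = lambda a, b: sum(x != y for x, y in zip(a, b))
medoids, labels = k_medoids(seqs, 2, hamming)
```

Because only the `dist` callable touches the data, the same routine clusters strings, time series, or any other type, which is exactly the agnosticism the text attributes to k-medoids.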


The hierarchical methods, discussed in Sect. 6.4 of Chap. 6, can also be generalized to any data type because they work with pairwise distances between the different data objects. The main challenge of using hierarchical methods is that they require O(n²) pairwise distance computations between n objects. Distance function computations on sequence data are generally expensive because they require expensive dynamic programming methods. Therefore, the applicability of hierarchical methods is restricted to smaller data sets.
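The "expensive dynamic programming methods" mentioned above include alignment-style measures such as the edit (Levenshtein) distance, which costs O(|s|·|t|) per pair; hierarchical clustering then needs n(n−1)/2 such computations. A minimal sketch, assuming unit costs for insertion, deletion, and substitution:

```python
def edit_distance(s, t):
    """Levenshtein distance via dynamic programming, O(len(s) * len(t)) time.

    Only two rows of the DP table are kept, so memory is O(len(t)).
    """
    m, n = len(s), len(t)
    # prev[j] = distance between the empty prefix of s and t[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            curr[j] = min(prev[j] + 1,                           # delete from s
                          curr[j - 1] + 1,                       # insert into s
                          prev[j - 1] + (s[i - 1] != t[j - 1]))  # substitute/match
        prev = curr
    return prev[n]
```

Each pairwise call is quadratic in sequence length, which is why the text restricts hierarchical methods to smaller data sets.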


15.3.2 Graph-Based Methods


Graph-based methods are a general approach to clustering that is agnostic to the underlying data type. The transformation of different data types to similarity graphs is described in Sect. 2.2.2.9 of Chap. 2. The broader approach in graph-based methods is as follows:





  1. Construct a graph in which each node corresponds to a data object. Each node is connected to its k-nearest neighbors, with a weight equal to the similarity between the corresponding pairs of data objects. In cases where a distance function is used, it is converted to a similarity function as follows:




wij = e^(-d(Oi, Oj)^2 / t^2)

(15.1)

Here, d(Oi, Oj) represents the distance between the objects Oi and Oj, and t is a parameter.







  2. The edges are assumed to be undirected, and any parallel edges are removed by dropping one of the edges. Because the distance functions are assumed to be symmetric, the parallel edges will have the same weight.




  3. Any of the clustering or community detection algorithms discussed in Sect. 19.3 of Chap. 19 may be used for clustering nodes of the newly created graph.

After the nodes have been clustered, these clusters can be mapped back to clusters of data objects by using the correspondence between nodes and data objects.
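The graph-construction part of this pipeline (steps 1 and 2) can be sketched as follows; any node-clustering method can then be applied to the returned edge set. The function name and toy data are illustrative assumptions; keying edges on unordered index pairs is one simple way to realize the "drop parallel edges" step.

```python
import math

def similarity_graph(objects, dist, k=2, t=1.0):
    """Build an undirected k-nearest-neighbor similarity graph.

    Nodes are object indices. Edge weights follow Eq. 15.1:
    w_ij = exp(-d(O_i, O_j)^2 / t^2), so smaller distances give
    weights closer to 1. Parallel edges (i->j and j->i) collapse
    into one undirected edge keyed on the unordered pair.
    """
    n = len(objects)
    edges = {}
    for i in range(n):
        # Rank all other objects by distance from object i.
        ranked = sorted((dist(objects[i], objects[j]), j)
                        for j in range(n) if j != i)
        for dij, j in ranked[:k]:
            key = (min(i, j), max(i, j))  # undirected: drop parallel edges
            edges[key] = math.exp(-dij ** 2 / t ** 2)
    return edges

# Toy example under a simple Hamming distance.
objects = ["AAA", "AAB", "TTT"]
hamming = lambda a, b: sum(x != y for x, y in zip(a, b))
edges = similarity_graph(objects, hamming, k=1, t=1.0)
```

Since the distance is symmetric, both directed copies of an edge carry the same weight, so dropping one of them loses no information, as the text notes in step 2.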


15.3.3 Subsequence-Based Clustering


The major problem with the aforementioned methods is that they are based on similarity functions that use global alignment between the sequences. For longer sequences, global alignment becomes increasingly ineffective because of the noise effects of computing similarity between pairs of long sequences. Many local portions of sequences are noisy and irrelevant to similarity computations even when large portions of two sequences are similar. One possibility is to design local alignment similarity functions or use the keyword-based similarity method discussed earlier.


A more direct approach is to use frequent subsequence-based clustering methods. Some related approaches also use k-grams extracted from the sequence instead of frequent subsequences. However, k-grams are generally more sensitive to noise than frequent subsequences. The idea is that the frequent subsequences represent the key structural characteristics that are common across different sequences. After the frequent subsequences have been determined, the original sequences can be transformed into this new feature space, and a “bag-of-words” representation can be created in terms of these new features. Then, the sequence objects can be clustered in the same way as any text-clustering algorithm. The overall approach can be described as follows:





  1. Determine the frequent subsequences F from the sequence database D using any frequent sequential pattern mining algorithm. Different applications may vary in the specific constraints imposed on the sequences, such as a minimum or maximum length of the determined sequences.




  2. Determine a subset FS from the frequent subsequences F based on an appropriate selection criterion. Typically, a subset of frequent subsequences should be selected, so as to maximize coverage and minimize redundancy. The idea is to use only a modest number of relevant features for clustering. For example, the notion of Frequent Summarized Subsequences (FSS) is used to determine condensed groups of sequences [505]. The bibliographic notes contain specific pointers to these methods.




  3. Represent each sequence in the database as a “bag of frequent subsequences” from FS. In other words, the transformed representation of a sequence contains all frequent subsequences from FS that it contains.




  4. Apply any text-clustering algorithm on this new representation of the database of sequences. Text-clustering algorithms are discussed in Chap. 13. The tf-idf weighting may be applied to the different features, as discussed in Chap. 13.
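Steps 3 and 4 of this transformation can be sketched as follows, assuming the mined feature set FS is already available (a hand-picked set stands in for it here, and all names are hypothetical). Because each feature from FS is recorded at most once per sequence, the term-frequency component is binary and the tf-idf weight reduces to the idf factor.

```python
import math
from collections import Counter

def is_subsequence(pat, seq):
    """True if pat occurs in seq as a (not necessarily contiguous) subsequence."""
    it = iter(seq)
    return all(c in it for c in pat)  # each 'in' check consumes the iterator

def bag_of_subsequences(sequences, fs):
    """Step 3: represent each sequence by the features from FS it contains."""
    return [[f for f in fs if is_subsequence(f, s)] for s in sequences]

def tfidf_vectors(bags):
    """Step 4 (weighting): idf weights over the transformed 'documents'.

    tf is 0/1 here, so each present feature just gets log(n / df).
    """
    n = len(bags)
    df = Counter(f for bag in bags for f in set(bag))  # document frequency
    return [{f: math.log(n / df[f]) for f in bag} for bag in bags]

# Toy example: FS stands in for the output of a sequential pattern miner.
sequences = ["abcd", "abce", "xyz"]
fs = ["abc", "ce", "xy"]
bags = bag_of_subsequences(sequences, fs)
vecs = tfidf_vectors(bags)
```

The resulting sparse vectors can be fed to any text-clustering algorithm from Chap. 13; features shared by many sequences (like "abc" above) receive lower weights than rare, discriminative ones.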

The aforementioned discussion is a broad overview of frequent subsequence-based clustering, although the individual steps are implemented in different ways by different methods. The key differences among the different algorithms are in terms of methodology for feature


CHAPTER 15. MINING DISCRETE SEQUENCES


Algorithm CLUSEQ(Sequence Database: D, Similarity Threshold: t)


begin

