Data Mining: The Textbook






Kernel-based similarity: Kernel-based similarity is particularly useful for SVM classification. Some examples of kernel-based similarity are discussed in detail in Sect. 15.6.4.

The different measures are used in different application-specific scenarios. Many of these scenarios will be discussed in this chapter.


15.3.1 Distance-Based Methods


When a distance or similarity function has been defined, the k-medoids method can be generalized very simply to sequence data. The k-medoids method is agnostic to the choice of data type and similarity function because it is designed as a generic hill-climbing approach. In fact, the CLARANS algorithm, discussed in Sect. 7.3.1 of Chap. 7, can be easily generalized to work with any data type. The algorithm selects k representative sequences from the data and assigns each data point to its closest representative, using the selected distance (similarity) function. The quality of the representative sequences is improved iteratively with the use of a hill-climbing algorithm.
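The hill-climbing procedure above can be sketched as follows. This is a minimal k-medoids sketch rather than the full CLARANS algorithm (which randomizes and restricts the swaps examined); the function names, the toy string data, and the Hamming distance plugged in at the end are illustrative assumptions, not part of the text.

```python
import random

def k_medoids(objects, k, dist, seed=0):
    """Generic k-medoids: works for any data type, given a distance function."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(objects)), k)

    def cost(meds):
        # Total distance of every object to its closest medoid.
        return sum(min(dist(o, objects[m]) for m in meds) for o in objects)

    best = cost(medoids)
    improved = True
    while improved:
        improved = False
        # Hill climbing: try swapping each medoid with each non-medoid,
        # keeping any swap that lowers the clustering cost.
        for i in range(k):
            for j in range(len(objects)):
                if j in medoids:
                    continue
                trial = medoids[:i] + [j] + medoids[i + 1:]
                c = cost(trial)
                if c < best:
                    medoids, best, improved = trial, c, True
    # Assign each object to its closest representative.
    labels = [min(medoids, key=lambda m: dist(o, objects[m])) for o in objects]
    return medoids, labels

# Toy example: four short sequences under a simple Hamming distance.
seqs = ["AAB", "AAC", "TTG", "TTC"]
hamming = lambda a, b: sum(x != y for x, y in zip(a, b))
medoids, labels = k_medoids(seqs, 2, hamming)
```

Because only the `dist` callable touches the data, the same routine clusters strings, time series, or any other type, which is exactly the agnosticism the text attributes to k-medoids.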


The hierarchical methods, discussed in Sect. 6.4 of Chap. 6, can also be generalized to any data type because they work with pairwise distances between the different data objects. The main challenge of using hierarchical methods is that they require O(n²) pairwise distance computations between n objects. Distance function computations on sequence data are generally expensive because they require expensive dynamic programming methods. Therefore, the applicability of hierarchical methods is restricted to smaller data sets.
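The "expensive dynamic programming methods" mentioned above include alignment-style measures such as the edit (Levenshtein) distance, which costs O(|s|·|t|) per pair; hierarchical clustering then needs n(n−1)/2 such computations. A minimal sketch, assuming unit costs for insertion, deletion, and substitution:

```python
def edit_distance(s, t):
    """Levenshtein distance via dynamic programming, O(len(s) * len(t)) time.

    Only two rows of the DP table are kept, so memory is O(len(t)).
    """
    m, n = len(s), len(t)
    # prev[j] = distance between the empty prefix of s and t[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            curr[j] = min(prev[j] + 1,                           # delete from s
                          curr[j - 1] + 1,                       # insert into s
                          prev[j - 1] + (s[i - 1] != t[j - 1]))  # substitute/match
        prev = curr
    return prev[n]
```

Each pairwise call is quadratic in sequence length, which is why the text restricts hierarchical methods to smaller data sets.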


15.3.2 Graph-Based Methods


Graph-based methods are a general approach to clustering that is agnostic to the underlying data type. The transformation of different data types to similarity graphs is described in Sect. 2.2.2.9 of Chap. 2. The broader approach in graph-based methods is as follows:





  1. Construct a graph in which each node corresponds to a data object. Each node is connected to its k-nearest neighbors, with a weight equal to the similarity between the corresponding pairs of data objects. In cases where a distance function is used, it is converted to a similarity function as follows:




wij = e^(-d(Oi, Oj)^2 / t^2)

(15.1)

Here, d(Oi, Oj) represents the distance between the objects Oi and Oj, and t is a parameter.







  2. The edges are assumed to be undirected, and any parallel edges are removed by dropping one of the edges. Because the distance functions are assumed to be symmetric, the parallel edges will have the same weight.




  3. Any of the clustering or community detection algorithms discussed in Sect. 19.3 of Chap. 19 may be used for clustering nodes of the newly created graph.

After the nodes have been clustered, these clusters can be mapped back to clusters of data objects by using the correspondence between nodes and data objects.
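The graph-construction part of this pipeline (steps 1 and 2) can be sketched as follows; any node-clustering method can then be applied to the returned edge set. The function name and toy data are illustrative assumptions; keying edges on unordered index pairs is one simple way to realize the "drop parallel edges" step.

```python
import math

def similarity_graph(objects, dist, k=2, t=1.0):
    """Build an undirected k-nearest-neighbor similarity graph.

    Nodes are object indices. Edge weights follow Eq. 15.1:
    w_ij = exp(-d(O_i, O_j)^2 / t^2), so smaller distances give
    weights closer to 1. Parallel edges (i->j and j->i) collapse
    into one undirected edge keyed on the unordered pair.
    """
    n = len(objects)
    edges = {}
    for i in range(n):
        # Rank all other objects by distance from object i.
        ranked = sorted((dist(objects[i], objects[j]), j)
                        for j in range(n) if j != i)
        for dij, j in ranked[:k]:
            key = (min(i, j), max(i, j))  # undirected: drop parallel edges
            edges[key] = math.exp(-dij ** 2 / t ** 2)
    return edges

# Toy example under a simple Hamming distance.
objects = ["AAA", "AAB", "TTT"]
hamming = lambda a, b: sum(x != y for x, y in zip(a, b))
edges = similarity_graph(objects, hamming, k=1, t=1.0)
```

Since the distance is symmetric, both directed copies of an edge carry the same weight, so dropping one of them loses no information, as the text notes in step 2.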


15.3.3 Subsequence-Based Clustering


The major problem with the aforementioned methods is that they are based on similarity functions that use global alignment between the sequences. For longer sequences, global alignment becomes increasingly ineffective because of the noise effects of computing similarity between pairs of long sequences. Many local portions of sequences are noisy and irrelevant to similarity computations even when large portions of two sequences are similar. One possibility is to design local alignment similarity functions or use the keyword-based similarity method discussed earlier.


A more direct approach is to use frequent subsequence-based clustering methods. Some related approaches also use k-grams extracted from the sequence instead of frequent subsequences. However, k-grams are generally more sensitive to noise than frequent subsequences. The idea is that the frequent subsequences represent the key structural characteristics that are common across different sequences. After the frequent subsequences have been determined, the original sequences can be transformed into this new feature space, and a “bag-of-words” representation can be created in terms of these new features. Then, the sequence objects can be clustered in the same way as any text-clustering algorithm. The overall approach can be described as follows:





  1. Determine the frequent subsequences F from the sequence database D using any frequent sequential pattern mining algorithm. Different applications may vary in the specific constraints imposed on the sequences, such as a minimum or maximum length of the determined sequences.




  2. Determine a subset FS from the frequent subsequences F based on an appropriate selection criterion. Typically, a subset of frequent subsequences should be selected, so as to maximize coverage and minimize redundancy. The idea is to use only a modest number of relevant features for clustering. For example, the notion of Frequent Summarized Subsequences (FSS) is used to determine condensed groups of sequences [505]. The bibliographic notes contain specific pointers to these methods.




  3. Represent each sequence in the database as a “bag of frequent subsequences” from FS. In other words, the transformed representation of a sequence contains all frequent subsequences from FS that it contains.




  4. Apply any text-clustering algorithm on this new representation of the database of sequences. Text-clustering algorithms are discussed in Chap. 13. The tf-idf weighting may be applied to the different features, as discussed in Chap. 13.
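Steps 3 and 4 of this transformation can be sketched as follows, assuming the mined feature set FS is already available (a hand-picked set stands in for it here, and all names are hypothetical). Because each feature from FS is recorded at most once per sequence, the term-frequency component is binary and the tf-idf weight reduces to the idf factor.

```python
import math
from collections import Counter

def is_subsequence(pat, seq):
    """True if pat occurs in seq as a (not necessarily contiguous) subsequence."""
    it = iter(seq)
    return all(c in it for c in pat)  # each 'in' check consumes the iterator

def bag_of_subsequences(sequences, fs):
    """Step 3: represent each sequence by the features from FS it contains."""
    return [[f for f in fs if is_subsequence(f, s)] for s in sequences]

def tfidf_vectors(bags):
    """Step 4 (weighting): idf weights over the transformed 'documents'.

    tf is 0/1 here, so each present feature just gets log(n / df).
    """
    n = len(bags)
    df = Counter(f for bag in bags for f in set(bag))  # document frequency
    return [{f: math.log(n / df[f]) for f in bag} for bag in bags]

# Toy example: FS stands in for the output of a sequential pattern miner.
sequences = ["abcd", "abce", "xyz"]
fs = ["abc", "ce", "xy"]
bags = bag_of_subsequences(sequences, fs)
vecs = tfidf_vectors(bags)
```

The resulting sparse vectors can be fed to any text-clustering algorithm from Chap. 13; features shared by many sequences (like "abc" above) receive lower weights than rare, discriminative ones.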

The aforementioned discussion is a broad overview of frequent subsequence-based clustering, although the individual steps are implemented in different ways by different methods. The key differences among the different algorithms are in terms of methodology for feature


CHAPTER 15. MINING DISCRETE SEQUENCES


Algorithm CLUSEQ(Sequence Database: D, Similarity Threshold: t)


begin

