Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə304/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   300   301   302   303   304   305   306   307   ...   423
1-Data Mining tarjima

15.3. SEQUENCE CLUSTERING

501



Apriori pruning cannot be applied. However, the sequence obtained by dropping items from the first or last elements of a frequent sequence will always be frequent. Therefore, the spe-cific join-based approach discussed in this chapter can still be used to exhaustively generate all candidates without losing any frequent patterns. A modified pruning rule is used. While checking (k − 1)- subsequences of a candidate for pruning, only subsequences containing the same number of elements as the candidate are checked. In other words, elements con-taining 1 item cannot be removed from the candidate for checking its subsequences. Such subsequences are referred to as contiguous subsequences. A noteworthy special case is one in which maxgap = 0. This case is often used for determining timeseries motifs after a time series has been converted to a discrete sequence using methods discussed in Sect. 14.4 of Chap. 14.

Another constraint of interest is the minimum gap constraint, or mingap constraint between successive elements. A minimum gap constraint between successive elements will always satisfy the downward closure property. Therefore, the GSP approach can be used with very minor modifications. The only modification is that this constraint is checked during support counting. This generates a smaller set of subsequences Fk . The join and pruning steps remain unchanged. The bibliographic notes contain pointers to many other interesting constraints such as the window-size constraint.


15.3 Sequence Clustering


As in the case of timeseries data, the clustering of sequences is heavily dependent on the definition of similarity. When a similarity function has been defined, many of the tradi-tional multidimensional methods such as k-medoids and graph-based methods can be easily adapted to sequence data. It should be pointed out that these two methods can be used for virtually any data type, and are dependent only on the choice of distance function.


Similarity measures for sequence data have been defined in Chap. 3. The most common similarity functions used for sequence data are as follows:





  1. Match-based measure: This measure is equal to the number of matching positions between the two sequences. This can be meaningfully computed only when the two sequences are of equal length, and a one-to-one correspondence exists between the positions.




  1. Dynamic time warping (DTW): In this case, the number of nonmatches between the two sequences can be used with dynamic time warping. The dynamic time warping (DTW) method is discussed in detail in Sect. 3.4.1.3 of Chap. 3. The idea is to stretch and shrink the time dimension dynamically to account for the varying speeds of data generation for different series.




  1. Longest common subsequence (LCSS): As the name of this measure suggests, the longest matching subsequence between the two sequences is computed. This is then used to measure the similarity between the two sequences. The LCSS method is dis-cussed in detail in Sect. 3.4.2.2 of Chap. 3.




  1. Edit distance: This is defined as the cost of edit operations required to transform one sequence into another. The edit distance measure is described in Sect. 3.4.2.1 of Chap. 3. A number of alignment methods, such as BLAST, are specifically designed for biological sequences. Pointers to these methods may be found in the bibliographic notes.

502 CHAPTER 15. MINING DISCRETE SEQUENCES





  1. Keyword-based similarity: In this case, a k-gram representation is used, in which each sequence is represented by a bag of segments of length k. These k-grams are extracted from the original data sequences by using a sliding window of length k on the sequences. Each such k-gram represents a new “keyword,” and a tf-idf representa-tion can be used in terms of these keywords. If desired, the infrequent k-grams can be dropped. Any text-based, vector-space similarity measure, discussed in Chap. 13, may be used. Since the ordering of the segments is no longer used after the transformation, such an approach allows the use of a wider range of data mining algorithms. In fact, any text mining algorithm can be used on this transformation.





  1. Yüklə 17,13 Mb.

    Dostları ilə paylaş:
1   ...   300   301   302   303   304   305   306   307   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin