Data Mining: The Textbook
15.8 Bibliographic Notes The problem of sequence mining has been studied extensively in computational biology. The classical book by Gusfield [244 ] provides an excellent introduction of sequence min-ing algorithms from the perspective of computational biology. This book also contains an excellent survey on most of the other important similarity measures for strings, trees, and graphs. String indexing is discussed in detail in this work. The use of transformation rules for timeseries similarity has been studied in [283, 432]. Such rules can be used to create edit distance- like measures for continuous time series. Methods for the string edit distance are proposed in [438]. It has been shown in [141], how the Lp-norm may be combined with the edit distance. Algorithms for the longest common subsequence problem may be found in [77 , 92, 270, 280]. A survey of these algorithms is available in [92]. Numerous other mea-sures for timeseries and sequence similarity may be found in [32]. Timeseries and discrete sequence similarity measures are discussed in detail in Chap. 3 of this book, and in an earlier tutorial by Gunopulos and Das [241]. In the context of biological data, the BLAST system [73] is one of the most popular alignment tools. The problem of mining sequential patterns was first proposed in [59]. The GSP algorithm was also proposed in the same work. The GSP algorithm is a straightforward modification of the Apriori algorithm. Most frequent pattern mining problems can be extended to sequen-tial pattern mining because of the relationship between the two models. Subsequently, most of the algorithms discussed in Chap. 4 on frequent pattern mining have been generalized to sequential pattern mining. Savasere et al.’s vertical data structures [446] have been general-ized to SPADE [535], and the FP-growth algorithm has been generalized to PrefixSpan [421]. The TreeProjection algorithm has also been generalized to sequential pattern mining [243]. Both PrefixSpan and the TreeProjection-based methods are based on combining database projection with exploration of the candidate search space with the use of an enumeration tree. The description of this chapter is a simplified and generalized description of these two related works [243, 421]. Methods for finding constraint-based sequences are discussed in [224, 346]. A recent survey on sequential pattern mining may be found in [392]. The problem of sequence data clustering has been studied extensively. A detailed survey on clustering sequence data, in the context of the biological domain, may be found in [32]. The CLUSEQ algorithm is described in detail in [523]. The Probabilistic Suffix Trees, used by CLUSEQ, are discussed in the same work. The earliest frequent sequence-based approach for clustering was proposed in [242]. The CONTOUR method for frequent sequence mining was proposed in [505]. This method uses a combination of frequent sequence mining and microclustering to create clusters from the sequences. The use of Hidden Markov Models for discrete sequence clustering is discussed in [474]. A significant amount of work has been done on the problem of temporal outlier detec-tion in general, and discrete sequences in particular. A general survey on temporal outlier detection may be found in [237 ]. The book [5] contains chapters on temporal and dis-crete sequence outlier detection. A survey on anomaly detection in discrete sequences was presented in [132]. Two well-known techniques that use Markovian techniques for finding position outliers are discussed in [387, 525]. Combination outliers typically use windowing techniques in which comparison units are extracted from the sequence for the purposes of analysis [211, 274]. The information-theoretic measures for compression-based similarity were proposed in [311]. The frequency -based approach for determining the surprise level of comparison units is discussed in [310]. The TARZAN algorithm, proposed in this work, uses suffix trees for efficient computation. A general survey on Hidden Markov Models may be found in [327]. 528 CHAPTER 15. MINING DISCRETE SEQUENCES The problem of sequence classification is addressed in detail in the surveys [33, 516]. The use of wavelet methods for sequence classification was proposed in [51]. A description of a variety of string kernels for SVM classification is provided in [85]. The use of Hidden Markov Models for string classification is discussed in [327]. 15.9 Exercises Consider the sequence ABCDDCBA, defined on the alphabet Σ = {A, B, C, D}. Compute the vector-space representation for all k-mers of length 1, and the vector-space representation of all k-mers of length 2. Repeat the process for the sequence Yüklə 17,13 Mb. Dostları ilə paylaş: |