Data Mining: The Textbook




$$K(Y_i, Y_j) = \sum_{s=1}^{k} \beta_s \sum_{r=1}^{n-s+1} I\bigl(\mathrm{KMER}(Y_i, r, s) = \mathrm{KMER}(Y_j, r, s)\bigr) \tag{15.9}$$
Here, I(·) is an indicator function that takes the value 1 in the case of a match, and 0 otherwise. One drawback of the weighted degree kernel relative to the spectrum kernel is that it requires the two strings Y_i and Y_j to be of equal length. This can be partially addressed by allowing shifts in the matching process. Pointers to these enhancements may be found in the bibliographic notes.
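The following is a minimal Python sketch of the kernel in Eq. 15.9 for two equal-length strings. The weighting scheme for β_s is an assumption of this sketch; one common (unnormalized) choice gives more weight to shorter k-mers.

```python
# A minimal sketch of the weighted degree kernel of Eq. 15.9.
# The weighting beta_s = k - s + 1 is an assumed choice that favors
# shorter k-mers; it is not mandated by the text.
def weighted_degree_kernel(y1, y2, k=3):
    if len(y1) != len(y2):
        raise ValueError("the weighted degree kernel requires equal-length strings")
    n = len(y1)
    total = 0.0
    for s in range(1, k + 1):
        beta = k - s + 1  # assumed weighting scheme
        # I(KMER(y1, r, s) == KMER(y2, r, s)) summed over all start positions r
        matches = sum(1 for r in range(n - s + 1) if y1[r:r + s] == y2[r:r + s])
        total += beta * matches
    return total
```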


15.6.5 Probabilistic Methods: Hidden Markov Models

Hidden Markov Models (HMMs) are an important tool that is utilized in a wide variety of sequence analysis tasks. It has already been shown earlier in this chapter, in Sects. 15.3.4.2 and 15.5, how Hidden Markov Models can be utilized for both clustering and outlier detection. This section discusses the use of Hidden Markov Models for sequence classification. In fact, the most common use of HMMs is for the classification problem. HMMs are very popular in computational biology, where they are used for protein classification.




The basic approach for using HMMs for classification is to create a separate HMM for each of the classes in the data. Therefore, if there are a total of k classes, this will result in k different Hidden Markov Models. The Baum–Welch algorithm, described in Sect. 15.5.4, is used to train the HMMs for each class. For a given test sequence, the fit of each of the k models to the test sequence is determined using the approach described in Sect. 15.5.2. The best matching class is reported as the relevant one. The overall approach for training and testing with HMMs may be described as follows:





  1. (Training) Use the Baum–Welch algorithm of Sect. 15.5.4 to construct a separate HMM for each of the k classes.




  2. (Testing) For a given test sequence Y, determine the fit probability of the sequence to each of the k Hidden Markov Models, using the evaluation procedure discussed in Sect. 15.5.2. Report the class whose HMM has the highest fit probability to the test sequence (see the sketch after this list).
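As a concrete illustration, the sketch below implements this train-and-score loop, assuming the open-source hmmlearn library, whose CategoricalHMM trains with Baum–Welch and whose score() evaluates fit via the forward algorithm. The function names, number of states, and integer coding of symbols are assumptions of this sketch.

```python
# A minimal sketch of HMM-based sequence classification with hmmlearn.
# Sequences are assumed to be integer-coded discrete symbols.
import numpy as np
from hmmlearn.hmm import CategoricalHMM

def train_class_hmms(sequences_by_class, n_states=4, n_iter=50):
    """Train one HMM per class on that class's training sequences (Baum-Welch)."""
    models = {}
    for label, seqs in sequences_by_class.items():
        # hmmlearn expects one concatenated column vector of symbols
        # plus the lengths of the individual sequences within it
        X = np.concatenate([np.asarray(s).reshape(-1, 1) for s in seqs])
        lengths = [len(s) for s in seqs]
        model = CategoricalHMM(n_components=n_states, n_iter=n_iter)
        model.fit(X, lengths)
        models[label] = model
    return models

def classify(models, test_seq):
    """Report the class whose HMM yields the highest log-likelihood fit."""
    X = np.asarray(test_seq).reshape(-1, 1)
    return max(models, key=lambda label: models[label].score(X))
```

The number of states n_states is a model-selection choice; older versions of hmmlearn expose the same categorical-emission behavior under the name MultinomialHMM.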

Many variations of this basic approach have been used to achieve different trade-offs between effectiveness and efficiency. The bibliographic notes contain pointers to some of these methods.


15.7 Summary

Discrete sequence mining is closely related to time series data mining, just as categorical data mining is closely related to numeric data mining. Therefore, many algorithms are very similar across the two domains. The work on discrete sequence mining originated in the field of computational biology, where DNA strands are encoded as strings.


The problem of sequential pattern mining discovers frequent subsequences in a database of sequences. The GSP algorithm for frequent sequence mining is closely based on the Apriori method. Because of the close relationship between the two problems, most algorithms for frequent pattern mining can be generalized to discrete sequence mining in a relatively straightforward way.
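For intuition, the core primitive shared by such algorithms is support counting: a candidate is frequent if it occurs as a (not necessarily contiguous) subsequence of enough database sequences. The sketch below is a simplified single-symbol-per-event version of that primitive, not the full GSP algorithm, which more generally handles itemsets per event.

```python
# A simplified sketch of support counting for sequential pattern mining.
def occurs_in(pattern, sequence):
    """True if pattern appears in sequence as a subsequence (gaps allowed)."""
    it = iter(sequence)
    # each symbol must be found in what remains of the sequence, in order
    return all(symbol in it for symbol in pattern)

def support(pattern, database):
    """Fraction of database sequences that contain the pattern."""
    return sum(occurs_in(pattern, seq) for seq in database) / len(database)
```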


Many of the methods for multidimensional clustering can be generalized to sequence clustering, as long as an effective similarity function can be defined between the sequences. Examples include the k-medoids method, hierarchical methods, and graph-based methods. An interesting line of work converts sequences to bags of k-grams. Text-clustering algorithms are applied to this representation. In addition, a number of specialized methods, such as CLUSEQ, have been developed. Probabilistic methods use a mixture of Hidden Markov Models for sequence clustering.
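As an illustration of the k-gram idea, the sketch below converts a sequence into a bag (multiset) of its k-grams, after which text-clustering or multidimensional methods can be applied to the resulting sparse vectors; the choice of k and the use of tuples as keys are assumptions of this sketch.

```python
from collections import Counter

# A minimal sketch of the bag-of-k-grams representation mentioned above.
def kgram_bag(sequence, k=3):
    """Return the multiset of all contiguous k-grams of the sequence."""
    return Counter(tuple(sequence[i:i + k]) for i in range(len(sequence) - k + 1))
```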


Outlier analysis for sequence data is similar to that for time series data. Position outliers are determined using Markovian models for probabilistic prediction. Combination outliers can be determined using distance-based methods, frequency-based methods, or Hidden Markov Models. Hidden Markov Models are a very general tool for sequence analysis and are used frequently for a wide variety of data mining tasks. HMMs can be viewed as mixture models in which each state of the mixture is sequentially dependent on the previous states.
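As an illustration of position outlier detection, the following sketch estimates order-k Markovian transition probabilities from training sequences and scores each position of a test sequence by the conditional probability of its symbol given the preceding k symbols; low probabilities flag position outliers. The absence of smoothing and the parameter names are assumptions of this sketch.

```python
from collections import Counter, defaultdict

# A minimal sketch of Markovian position outlier scoring; unseen contexts
# receive probability 0.0 here, whereas a practical implementation would
# apply smoothing (an assumption of this sketch).
def position_outlier_scores(train_seqs, test_seq, k=2):
    counts = defaultdict(Counter)
    for seq in train_seqs:
        for i in range(k, len(seq)):
            counts[tuple(seq[i - k:i])][seq[i]] += 1
    scores = []
    for i in range(k, len(test_seq)):
        context = tuple(test_seq[i - k:i])
        total = sum(counts[context].values())
        prob = counts[context][test_seq[i]] / total if total else 0.0
        scores.append((i, prob))  # low prob => likely position outlier
    return scores
```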

Numerous techniques from multidimensional classification can be adapted to discrete sequence classification. These include nearest neighbor methods, graph-based methods, rule-based methods, Hidden Markov Models, and Kernel Support Vector Machines. Numerous string kernels have been designed for more effective sequence classification.



