Data Mining: The Textbook




In the M-step, the transition probabilities $p_{ij}$ and the symbol-emission probabilities $\theta^i(\sigma_k)$ are estimated as follows:

$$p_{ij} = \frac{\sum_{r=1}^{m-1} \psi_r(T, s_i, s_j)}{\sum_{r=1}^{m-1} \gamma_r(T, s_i)}$$

$$\theta^i(\sigma_k) = \frac{\sum_{r=1}^{m} I(a_r, \sigma_k) \cdot \gamma_r(T, s_i)}{\sum_{r=1}^{m} \gamma_r(T, s_i)}$$

Here, $I(a_r, \sigma_k)$ is an indicator function that takes on the value of 1 when $a_r = \sigma_k$, and 0 otherwise.
The precise derivations of these estimates, on the basis of expectation-maximization principles, may be found in [327]. This completes the description of the M-step.
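As an illustration, the following is a minimal Python sketch of these M-step updates for a single sequence. The arrays psi and gamma are assumed to have already been computed in the E-step, and correspond to ψ_r(T, s_i, s_j) and γ_r(T, s_i); all names here are illustrative rather than drawn from the book.

    import numpy as np

    def m_step(psi, gamma, obs, n_symbols):
        # psi:   (m-1, n, n) array; psi[r, i, j] corresponds to psi_{r+1}(T, s_i, s_j)
        # gamma: (m, n) array;      gamma[r, i]  corresponds to gamma_{r+1}(T, s_i)
        # obs:   length-m integer array of observed symbol indices a_1 ... a_m
        # Transition probabilities: sums run over positions 1 .. m-1
        p = psi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        # Emission probabilities: the indicator I(a_r, sigma_k) selects the
        # positions at which symbol sigma_k was observed
        n_states = gamma.shape[1]
        theta = np.zeros((n_states, n_symbols))
        for k in range(n_symbols):
            theta[:, k] = gamma[obs == k].sum(axis=0)
        theta /= gamma.sum(axis=0)[:, None]
        return p, theta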


As in all EM methods, the procedure is applied iteratively until convergence. The approach can be generalized easily to N sequences by applying the steps to each of the sequences and averaging the corresponding model parameters in each step.
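A sketch of this N-sequence generalization, reusing the illustrative m_step function above: each sequence contributes its own E-step statistics, and the resulting parameter estimates are averaged.

    import numpy as np

    def m_step_averaged(stats, n_symbols):
        # stats: list of (psi, gamma, obs) triples, one per training sequence
        estimates = [m_step(psi, gamma, obs, n_symbols)
                     for psi, gamma, obs in stats]
        p_list, theta_list = zip(*estimates)
        return np.mean(p_list, axis=0), np.mean(theta_list, axis=0)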


15.5.5 Applications


Hidden Markov Models can be used for a wide variety of sequence mining problems, such as clustering, classification, and anomaly detection. The application of HMMs to clustering has already been described in Sect. 15.3.4.2 of this chapter, and the application to classification will be discussed in Sect. 15.6.5. Therefore, this section will focus on the problem of anomaly detection.


In theory, it is possible to compute anomaly scores directly for the test sequence V, once the training model has been constructed from the sequence database D = T1 . . . TN. However, as the length of the test sequence increases, the robustness of such a model diminishes because of the increasing noise resulting from the curse of dimensionality. Therefore, comparison units (either extracted from the test sequence or specified by a domain expert) are used to compute anomaly scores for windows of the sequence. The anomaly scores of the different windows can then be combined by using a simple function, such as counting the number of anomalous window units in the sequence.
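The window-based scoring can be sketched as follows. The forward-algorithm scorer, the per-symbol length normalization, and the fixed threshold are illustrative assumptions rather than part of the book's specification.

    import numpy as np

    def forward_log_likelihood(obs, start, p, theta):
        # Scaled forward algorithm: log P(observation sequence | model)
        alpha = start * theta[:, obs[0]]
        log_lik = np.log(alpha.sum())
        alpha = alpha / alpha.sum()
        for a in obs[1:]:
            alpha = (alpha @ p) * theta[:, a]
            s = alpha.sum()
            log_lik += np.log(s)
            alpha = alpha / s
        return log_lik

    def count_anomalous_windows(seq, w, start, p, theta, threshold):
        # Slide a window of length w and count windows whose per-symbol
        # log-likelihood falls below the (assumed) threshold
        count = 0
        for i in range(len(seq) - w + 1):
            ll = forward_log_likelihood(seq[i:i + w], start, p, theta)
            if ll / w < threshold:
                count += 1
        return count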


Some methods also use the Viterbi algorithm on the test sequence to determine the most likely state sequence. In some domains, it is easier to determine anomalies in terms of the state sequence rather than the observable sequence. Furthermore, low transition probabilities on portions of the state sequence indicate anomalous localities of the observable sequence. The downside is that the most likely state sequence may have a very low probability of matching the observed sequence. Therefore, when an estimated state sequence is used for anomaly detection, the estimated anomalies may not reflect the true anomalies in the data. The real utility of the Viterbi algorithm lies in providing an explanation of the anomalous behavior of sequences in terms of intuitively understandable states, rather than in anomaly score quantification.
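A minimal log-space sketch of the Viterbi algorithm for recovering the most likely state sequence is shown below; variable names are illustrative.

    import numpy as np

    def viterbi(obs, start, p, theta):
        # delta[i]: best log-probability of any state path ending in state i
        m, n = len(obs), len(start)
        log_p, log_theta = np.log(p), np.log(theta)
        delta = np.log(start) + log_theta[:, obs[0]]
        back = np.zeros((m, n), dtype=int)
        for r in range(1, m):
            scores = delta[:, None] + log_p      # scores[i, j] for edge i -> j
            back[r] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + log_theta[:, obs[r]]
        # Trace back the optimal state path
        path = [int(delta.argmax())]
        for r in range(m - 1, 0, -1):
            path.append(int(back[r, path[-1]]))
        return path[::-1]

Low values of log_p along the decoded path can then be used to flag anomalous localities of the observable sequence, as discussed above.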


15.6 Sequence Classification



It is assumed that a set of N sequences, denoted by S1 . . . SN, is available for building the training model. Each of these sequences is annotated with a class label drawn from {1 . . . k}. This training data is used to construct a model that can predict the label of unknown test sequences. Many modeling techniques, such as nearest-neighbor classifiers, rule-based methods, and graph-based methods, are common to both time series and discrete sequence classification because of the temporal nature of the two data types.


15.6.1 Nearest Neighbor Classifier


The nearest neighbor classifier is used frequently for different data types, including discrete sequence data. The basic description of the nearest neighbor classifier for multidimensional data may be found in Sect. 10.8 of Chap. 10. For discrete sequence data, the main difference is in the similarity function used for nearest neighbor classification. Similarity functions for discrete sequences are discussed in Sects. 3.4.1 and 3.4.2 of Chap. 3. The basic approach is the same as in multidimensional data. For any test instance, its k-nearest neighbors in the training data are determined. The dominant label from these k-nearest neighbors is reported as the relevant one for the test instance. The optimal value of k may be determined by using leave-one-out cross-validation. The effectiveness of the approach is highly sensitive to the choice of the distance function. The main problem is that noisy portions of the sequences can throw off global similarity functions. A common approach is to use keyword-based similarity, in which n-grams are extracted from the string to create a vector-space representation. The nearest-neighbor (or any other) classifier can then be constructed with this representation.
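A minimal sketch of this n-gram vector-space approach with a k-nearest-neighbor classifier follows; the choice of cosine similarity and the default n = 3 are illustrative assumptions.

    from collections import Counter

    def ngram_vector(seq, n=3):
        # Bag-of-n-grams representation of a discrete sequence
        return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

    def cosine(u, v):
        dot = sum(c * v[g] for g, c in u.items())
        norm_u = sum(c * c for c in u.values()) ** 0.5
        norm_v = sum(c * c for c in v.values()) ** 0.5
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    def knn_classify(test_seq, training_data, k=5, n=3):
        # training_data: list of (sequence, label) pairs
        tv = ngram_vector(test_seq, n)
        sims = sorted(((cosine(tv, ngram_vector(s, n)), label)
                       for s, label in training_data), reverse=True)
        top_labels = [label for _, label in sims[:k]]
        return max(set(top_labels), key=top_labels.count)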


15.6.2 Graph-Based Methods

This approach is a semisupervised algorithm because it combines the knowledge in the training and test instances for classification. Furthermore, the approach is transductive because out-of-sample classification of test instances is generally not possible; training and test instances must be specified at the same time. The use of similarity graphs for semisupervised classification was introduced in Sect. 11.6.3 of Chap. 11. Graph-based methods can be viewed as general semisupervised meta-algorithms that can be used for any data type. The basic approach constructs a similarity graph from both the training and test instances. A graph G = (V, A) is constructed, in which a node in V corresponds to each of the training and test instances. The subset of nodes corresponding to instances in the training data is labeled, whereas the nodes corresponding to instances in the test data are unlabeled. Each node in V is connected to its k-nearest neighbors with an undirected edge in A. The similarity is computed using any of the distance functions discussed in Sects. 3.4.1 and 3.4.2 of Chap. 3. The specified labels of nodes in V are then used to predict labels for the nodes where they are unknown. This problem is referred to as collective classification. Numerous methods for collective classification are discussed in Sect. 19.4 of Chap. 19. The derived labels on the nodes are then mapped back to the data objects. As in the case of nearest-neighbor classification, the effectiveness of the approach is sensitive to the choice of the distance function used for constructing the graph.
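The following is a minimal sketch of one collective classification method, iterative label propagation, on a precomputed k-nearest-neighbor similarity graph; this particular propagation scheme is an illustrative choice among the methods of Sect. 19.4, not a prescription from the book.

    import numpy as np

    def label_propagation(W, labels, n_classes, n_iter=50):
        # W: (N, N) symmetric nonnegative similarity matrix of the k-NN
        #    graph (zero where there is no edge)
        # labels: length-N integer array; class index for training nodes,
        #    -1 for unlabeled test nodes
        N = len(labels)
        labeled = labels >= 0
        F = np.zeros((N, n_classes))
        F[labeled, labels[labeled]] = 1.0
        # Row-normalize similarities into propagation weights
        P = W / W.sum(axis=1, keepdims=True).clip(min=1e-12)
        for _ in range(n_iter):
            F = P @ F
            F[labeled] = 0.0
            F[labeled, labels[labeled]] = 1.0    # clamp known training labels
        return np.where(labeled, labels, F.argmax(axis=1))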



