Data Mining: The Textbook




[Figure omitted. Legend: P = placed on shelf, R = removed from shelf, C = checkout, E = exit store. The first-order Markov model contains the states P, R, C, and E, with transitions R (1.0) out of P; P (0.61), C (0.38), and E (0.01) out of R; and E (1.0) out of C. The transition E (0.01) out of R, which corresponds to exiting the store without checkout, is the anomalous (shoplifting) event. The second-order Markov model contains pair states such as PR, RP, RC, RE, and CE, with the corresponding transition probabilities.]

Figure 15.4: Markov model for the RFID-based shoplifting anomaly


An observation from Fig. 15.4 is that the number of states in the second-order model is larger than that in the first-order model. This is no coincidence: as many as |Σ|^k states may exist in an order-k model, although this is only an upper bound. Many of the subsequences corresponding to these states may either not occur in the training data, or may be invalid in a particular application. For example, a PP state would be invalid in the example of Fig. 15.4 because the same item cannot be placed on the shelf twice in succession without being removed at least once. Higher-order models represent complex systems more accurately, at least in theory. However, choosing a higher order degrades efficiency and may also result in overfitting.
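To make this observation concrete, the following minimal Python sketch (not from the book; the helper name and the toy sequences are illustrative assumptions) materializes only the order-k contexts that actually occur in a set of training sequences. In practice, the number of such contexts is usually far smaller than the |Σ|^k upper bound.

```python
from collections import Counter, defaultdict

def build_order_k_model(sequences, k):
    """Map each observed length-k context to a Counter of the
    symbols that follow it in the training sequences."""
    model = defaultdict(Counter)
    for seq in sequences:
        for i in range(k, len(seq)):
            model[tuple(seq[i - k:i])][seq[i]] += 1
    return model

# Illustrative training data over the event alphabet of Fig. 15.4.
train = [list("PRPRCE"), list("PRCE"), list("PRPRPRCE")]
model = build_order_k_model(train, k=2)
# Only observed contexts (e.g., PR, RP, RC) are materialized;
# invalid ones such as PP never appear.
print(sorted("".join(ctx) for ctx in model))
```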

15.4.1.1 Efficiency Issues: Probabilistic Suffix Trees


It is evident from the discussion in the previous sections that the Markovian and rule-based models are equivalent, with the latter being a simpler, easier-to-understand heuristic approximation of the former. Nevertheless, in both cases, the challenge is that the number of possible antecedents of length k can be as large as |Σ|^k. This can make these methods slow when a lookup for a test subsequence ai−k . . . ai−1 is required to determine the probability P(ai|ai−k . . . ai−1). It is expensive either to compute these values on the fly, or even to retrieve precomputed values, if they are not organized properly. The Probabilistic Suffix Tree (PST) provides an efficient structure for the retrieval of such precomputed values. The utility of probabilistic suffix trees is not restricted to outlier detection; it extends to clustering and classification as well. For example, the CLUSEQ algorithm, discussed in Sect. 15.3.4.1, uses PSTs to retrieve these prestored probability values.


Suffix trees are a classical data structure that stores all subsequences of a given database. Probabilistic suffix trees represent a generalization of this structure that also stores the conditional probabilities of generation of the next symbol for a given sequence database. For order-k Markov models, a suffix tree of depth at most k will store all the required conditional probability values for the order-k Markovian models, including the conditionals for all lower-order Markov models. Therefore, such a structure encodes all the information required for variable-order Markov models as well. A key challenge is that the number of nodes in such a suffix tree can be as large as 1 + |Σ| + |Σ|^2 + · · · + |Σ|^k, an issue that needs to be addressed with selective pruning.

[Figure omitted. It shows a probabilistic suffix tree over the symbol set Σ = {X, Y}: the root stores (P(X), P(Y)); its depth-1 children X and Y store (P(X|X), P(Y|X)) and (P(X|Y), P(Y|Y)); the depth-2 nodes XX, YX, XY, and YY and the depth-3 nodes XXX through YYY store the analogous conditional probability pairs for their suffixes.]

Figure 15.5: Probabilistic suffix tree

A probabilistic suffix tree is a hierarchical data structure representing the different suffixes of a sequence. A node in the tree with depth k represents a suffix of length k and is, therefore, labeled with a sequence of length k. The parent of a node ai−k . . . ai corresponds to the sequence ai−k+1 . . . ai. The latter is obtained by removing the first symbol from the former. Each edge is labeled with the symbol that needs to be removed to derive the sequence at the parent node. Thus, a path in the tree corresponds to suffixes of the same sequence. Each node also maintains a vector of |Σ| probabilities that correspond to the conditional probability of the generation of each symbol from Σ = {σ1 . . . σ|Σ|} after that sequence. Therefore, for a node corresponding to the sequence ai−k . . . ai, and for each j ∈ {1 . . . |Σ|}, the value of P(σj|ai−k . . . ai) is maintained. As discussed earlier, this corresponds to the conditional probability that σj appears immediately after ai−k . . . ai, once the latter sequence has already been observed. This provides the generative probability crucial to the determination of position outliers. Note that this generative probability is also useful for other algorithms, such as the CLUSEQ algorithm discussed earlier in this chapter.

An example of a probabilistic suffix tree with the symbol set Σ = {X, Y} is illustrated in Fig. 15.5. The two symbol-generation probabilities at each node, corresponding to the symbols X and Y, are placed next to that node. It is also evident that a probabilistic suffix tree of depth k encodes all the transition probabilities for Markovian models up to order k. Therefore, such an approach can be used for higher-order Markovian models.
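As a rough illustration of this structure, the following Python sketch (an assumption-laden simplification, not the book's implementation) builds a depth-bounded probabilistic suffix tree in which each node stores next-symbol counts; pruning and smoothing are deliberately omitted here.

```python
from collections import Counter

class PSTNode:
    """A node of the probabilistic suffix tree, labeled implicitly by
    the suffix spelled along the path from the root (read right to left)."""
    def __init__(self):
        self.children = {}            # prepended symbol -> child node
        self.next_counts = Counter()  # counts estimating P(sigma | suffix)

def build_pst(sequences, max_depth):
    """Insert every suffix of length <= max_depth, together with the
    symbol that follows it in the training data, into the tree."""
    root = PSTNode()
    for seq in sequences:
        for i in range(len(seq) - 1):
            nxt = seq[i + 1]            # symbol generated after position i
            root.next_counts[nxt] += 1  # the empty suffix at the root
            node = root
            for d in range(1, max_depth + 1):
                if i - d + 1 < 0:
                    break
                sym = seq[i - d + 1]    # extend the suffix to the left
                node = node.children.setdefault(sym, PSTNode())
                node.next_counts[nxt] += 1
    return root
```

Normalizing next_counts at any node yields that node's stored vector of conditional symbol-generation probabilities.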


The probabilistic suffix tree is pruned significantly to improve its compactness. For example, suffixes that correspond to very low counts in the original data can be pruned from consideration. Furthermore, nodes with low generative probabilities of their underlying sequences can be pruned from consideration. The generative probability of a sequence a1 . . . an is approximated as follows:





P(a1 . . . an) = P(a1) · P(a2|a1) · · · P(an|a1 . . . an−1)    (15.6)

For Markovian models of order k < n, the value of P(ar|a1 . . . ar−1) in the equation above is approximated by P(ar|ar−k . . . ar−1) for any value of k less than r. To create Markovian models of order k or less, it is not necessary to keep portions of the tree with depth greater than k.
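A possible rendering of this approximation in Python is sketched below, reusing the hypothetical build_order_k_model from earlier. The additive smoothing and the omission of the unconditional P(a1) term are illustrative assumptions rather than part of the book's method.

```python
def sequence_prob(model, seq, k, alphabet_size, alpha=1.0):
    """Approximate Eq. 15.6 as a product of order-k conditionals,
    with additive smoothing so unseen contexts do not yield zero.
    The leading P(a1) term is omitted for simplicity."""
    p = 1.0
    for i in range(1, len(seq)):
        context = tuple(seq[max(0, i - k):i])  # shorter near the start
        counts = model.get(context, {})
        total = sum(counts.values())
        p *= (counts.get(seq[i], 0) + alpha) / (total + alpha * alphabet_size)
    return p
```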

Consider the sequence a1 . . . ai . . . an, in which it is desired to test whether position ai is a position outlier. Then, it is necessary to determine P(ai|a1 . . . ai−1). It is possible that the suffix a1 . . . ai−1 may not be present in the suffix tree because it has been pruned from consideration. In such cases, the short memory property is used to determine the longest suffix aj . . . ai−1 present in the suffix tree, and the corresponding probability is estimated by P(ai|aj . . . ai−1). Thus, the probabilistic suffix tree provides an efficient way to store and retrieve the relevant probabilities. The length of the longest path in the suffix tree that contains a nonzero probability estimate of P(ai|aj . . . ai−1) also provides an idea of the rarity of this particular sequence of events. Positions that are preceded by only short paths in the suffix tree are more likely to be outliers. Thus, outlier scores may be defined from the suffix tree in multiple ways (a sketch of the longest-suffix lookup appears after the list below):




1. If only short path lengths exist in the (pruned) suffix tree corresponding to a position ai and its preceding history, then that position is more likely to be an outlier.




2. For the paths of lengths 1 . . . r that do exist in the suffix tree for position ai, a combination score may be used, based on the models of different orders. In some cases, only lower-order scores are combined. In general, the use of lower-order scores is preferable because they are usually more robustly represented in the training data.
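The lookup promised above might look as follows, building on the hypothetical build_pst sketch. It returns both the probability estimate and the matched suffix depth, which can serve as the path-length signal of the first scoring option.

```python
def conditional_prob(root, history, symbol):
    """Estimate P(symbol | history) with the short memory property:
    descend along the longest suffix of history stored in the tree.
    Also return the matched suffix length as a rarity signal."""
    node, deepest, depth = root, root, 0
    for d, sym in enumerate(reversed(history), start=1):
        node = node.children.get(sym)
        if node is None:
            break
        deepest, depth = node, d
    total = sum(deepest.next_counts.values())
    prob = deepest.next_counts[symbol] / total if total > 0 else 0.0
    return prob, depth
```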

15.4.2 Combination Outliers


In combination outliers, the goal is to determine unusual combinations of symbols in the sequences. Consider a setting where a set of training sequences is provided, together with a test sequence. It is desired to determine whether the test sequence is an anomaly, based on the “normal” patterns in the training sequences. In many cases, the test sequences may be quite long. Therefore, the combination of symbols in the full sequence may be unique with respect to the training sequences, which makes it hard to characterize “normal” sequences on the basis of full sequences. Therefore, small windows are extracted from the training and test sequences for the purpose of comparison. Typically, all windows (including overlapping ones) are extracted from the sequences, though it is also possible to work with nonoverlapping windows. These windows are referred to as comparison units. The anomaly scores are defined with respect to these comparison units, so that unusual windows in the sequences are reported. The following discussion will focus exclusively on determining such unusual windows.
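A minimal sketch of the window-extraction step follows; the window length w is an analyst-supplied assumption, not a value prescribed by the text.

```python
def extract_windows(seq, w):
    """All overlapping comparison units (windows) of length w."""
    return [tuple(seq[i:i + w]) for i in range(len(seq) - w + 1)]

# e.g. ('P','R','P'), ('R','P','R'), ('P','R','C'), ('R','C','E')
units = extract_windows(list("PRPRCE"), w=3)
```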


Some notations and definitions will be used to distinguish between the training database, test sequence, and the comparison units.


1. The training database is denoted by D, and contains sequences denoted by T1 . . . TN .







2. The test sequence is denoted by V .




3. The comparison units are denoted by U1 . . . Ur. Typically, each Ui is derived from small, contiguous windows of V . In domain-dependent cases, U1 . . . Ur may be provided by the user.

The model may be distance-based, frequency-based, or a Hidden Markov Model. Each of these is discussed in the following sections. Because Hidden Markov Models are general constructs that are used for several problems, such as clustering, classification, and outlier detection, they are discussed in a section of their own, immediately following this one.


15.4.2.1 Distance-Based Models

In distance-based models, the absolute distance (or similarity) of the comparison unit to equivalent windows of the training sequences is computed. The distance of the k-th nearest neighbor window in the training sequences is used to determine the anomaly score. In the context of sequence data, many proximity functions are similarity functions rather than distance functions; in the former case, higher values indicate greater proximity. Some common methods for computing the similarity between a pair of sequences are as follows (sketches of several of these measures appear after the list):





1. Simple matching coefficient: This is the simplest possible function, and it determines the number of matching positions between two sequences of equal length. It is closely related to the Hamming distance between a pair of sequences, which counts the number of mismatching positions.




2. Normalized longest common subsequence: The longest common subsequence can be considered the sequential analog of the cosine similarity between two ordered sets. Let T1 and T2 be two sequences, and let the length of the (unnormalized) longest common subsequence between T1 and T2 be denoted by L(T1, T2). The unnormalized longest common subsequence can be computed using the methods discussed in Sect. 3.4.2 of Chap. 3. Then, the value NL(T1, T2) of the normalized longest common subsequence is computed by normalizing L(T1, T2) with the underlying sequence lengths, in a manner similar to the cosine computation between unordered sets:




NL(T1, T2) = L(T1, T2) / √(|T1| · |T2|)    (15.7)
The advantage of this approach is that it can match two sequences of unequal lengths. The drawback is that the computation process is relatively slow.





3. Edit distance: The edit distance is one of the most common similarity functions used for sequence matching. This function, discussed in Chap. 3, measures the distance between two sequences by the minimum number of edits required to transform one sequence into the other. The computation of the edit distance can be computationally very expensive.




4. Compression-based dissimilarity: This measure is based on the principles of information theory. Let W be a window of the training data, and let W ⊕ Ui be the string representing the concatenation of W and Ui. Let DL(S) < |S| be the description length of any string S after applying a standard compression algorithm to it. Then, the compression-based dissimilarity CD(W, Ui) is defined as follows:




CD(W, Ui) = DL(W ⊕ Ui) / (DL(W) + DL(Ui))    (15.8)



This measure always lies in the range (0, 1), and lower values indicate greater similarity. The intuition behind this approach is that when the two sequences are very similar, the description length of the combined sequence will be much smaller than the sum of the individual description lengths. On the other hand, when the sequences are very different, the description length of the combined string will be almost the same as the sum of the individual description lengths.
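The following Python sketches illustrate three of these measures. Using zlib as the "standard compression algorithm" is one possible choice rather than one prescribed by the text, and the compression-based measure assumes string inputs.

```python
import zlib

def matching_coefficient(s1, s2):
    """Simple matching coefficient: fraction of positions that agree
    (the complementary mismatch count is the Hamming distance)."""
    assert len(s1) == len(s2)
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)

def normalized_lcs(t1, t2):
    """NL(T1, T2) of Eq. 15.7, via the standard LCS dynamic program."""
    m, n = len(t1), len(t2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if t1[i] == t2[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / ((m * n) ** 0.5)

def compression_dissimilarity(w, u):
    """CD(W, Ui) of Eq. 15.8, with zlib standing in for the compressor."""
    dl = lambda s: len(zlib.compress(s.encode()))
    return dl(w + u) / (dl(w) + dl(u))
```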


To compute the anomaly score for a comparison unit Ui with respect to the training sequences T1 . . . TN, the first step is to extract from T1 . . . TN all windows of the same size as the comparison unit. The k-th nearest neighbor distance of Ui over these windows is used as its anomaly score. The unusual windows may be reported directly, or the scores from different windows may be consolidated into a single anomaly score for the test sequence.
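Putting the pieces together, a sketch of this scoring loop is shown below, reusing the hypothetical extract_windows and matching_coefficient helpers from earlier; any of the other similarity measures could be substituted, with the similarity converted to a distance in some suitable way.

```python
def knn_anomaly_score(unit, training_sequences, k=1):
    """Anomaly score of a comparison unit: its k-th nearest neighbor
    distance over all equal-length training windows (higher = rarer)."""
    w = len(unit)
    dists = sorted(
        1.0 - matching_coefficient(win, unit)  # similarity -> distance
        for t in training_sequences
        for win in extract_windows(t, w)
    )
    return dists[k - 1] if len(dists) >= k else float("inf")
```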


15.4.2.2 Frequency-Based Models


Frequency-based models are typically used with domain-specific comparison units specified by the user. In this case, the relative frequency of the comparison unit is measured in both the training sequences and the test sequence, and the level of surprise is determined accordingly.


When the comparison units are specified by the user, a natural way of determining the anomaly score is to compare the frequency of the comparison unit Uj in the training and test sequences. For example, when a sequence contains a hacking attempt, such as a repeated sequence of Login and Password events, this pattern will have a much higher frequency in the test sequence than in the training sequences. The specification of such relevant comparison units by a user provides very useful domain knowledge to an outlier analysis application.


Let f(T, Uj) denote the number of times that the comparison unit Uj occurs in the sequence T. Since the frequency f(T, Uj) depends on the length of T, the normalized frequency f̂(T, Uj) may be obtained by dividing the frequency by the length of the sequence:

f̂(T, Uj) = f(T, Uj) / |T|
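A minimal sketch of this normalization follows; counting possibly overlapping occurrences of the unit is an assumption.

```python
def normalized_frequency(seq, unit):
    """f_hat(T, Uj): (possibly overlapping) occurrences of unit in seq,
    divided by the sequence length."""
    w = len(unit)
    hits = sum(1 for i in range(len(seq) - w + 1)
               if tuple(seq[i:i + w]) == tuple(unit))
    return hits / len(seq)
```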

