Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	303/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 299 300 301 302 303 304 305 306 ... 423

1-Data Mining tarjima

CHAPTER 15.

MINING DISCRETE SEQUENCES

< >

LEVEL 0

<{a}>

<{b}>

<{c}>

<{d}>

LEVEL 1

<{a,b}> <{a,c}>

<{a}{c}>

<{b}{a}>

<{b}{b}>

<{b,c}>

< {b}{c} > <{c}{a}> <{c}{b}>

LEVEL 2

<{a,b,c}>

<{a,b}{c}> <{b}{a,b}>

<{b}{a}{b}>

<{b,c}{a}> <{b,c}{b}>

<{c}{a,b}>

LEVEL 3

Figure 15.2: The equivalent of an enumeration tree for sequential pattern mining

generation in GSP is defined in terms of two separate cases. The two cases correspond to temporal extensions and set-wise extensions of candidates.

As discussed in Chap. 4, the Apriori algorithm can be viewed as an enumeration tree algorithm. It is also possible to define an analogous candidate tree in sequential pattern mining, albeit with a somewhat diﬀerent structure than the enumeration tree in frequent pattern mining. The key diﬀerences in join-based candidate generation between Apriori and GSP algorithms translate to diﬀerences in the structure and growth of the candidate tree in sequential pattern mining. In general, candidate trees for sequential pattern mining are more complex because they need to accommodate both temporal and set-wise growth of sequences. Therefore, the definition of candidate extensions of a tree- node needs to be changed. A node for sequence S can be extended to a lower-level node in one of two ways:

Set-wise extension: In this case, an item is added to the last element in the sequence S to create a candidate pattern. Therefore, the number of elements does not increase. For an item to be added to the last element of S, it must satisfy two properties; (a) the item successfully extends the parent sequence of S in the candidate tree with either a set-wise or temporal extension to another frequent sequence, and (b) the item must be lexicographically later than all items in the last element of S. As in frequent pattern mining, a lexicographic ordering of items needs to be fixed up front.

Temporal extension: A new element with a single item is added to the end of the current sequence S. As in the previous case of set-wise extensions, any frequent item extension of the parent of S may be used to extend S (condition (a)). However, the added item need not be lexicographically later than the items in the last element of sequence S.

These two kinds of extensions can be shown to be equivalent to the two kinds of joins in the GSP algorithm. As in frequent pattern mining, the candidate extensions of a node are a subset of the corresponding frequent extensions of its parent node. An example of the frequent portion of the candidate tree for sequential pattern mining is illustrated in Fig. 15.2. Note the greater complexity of the tree, because of set-wise and temporal addition of items at each level of the tree. The set-wise extensions are marked as “S”, and the temporal extensions are marked as “T” on the corresponding tree edges. A particularly illuminating discussion on this topic may be found in [243], on which this example is based.

15.2. SEQUENTIAL PATTERN MINING

499

It is possible to convert any enumeration tree algorithm for frequent pattern mining to a sequential pattern mining algorithm by systematically making appropriate modifications. These changes account for the diﬀerent structure of the candidate tree in sequential pattern mining compared to that in frequent pattern mining. This candidate tree is implicitly gen-erated by all sequential pattern mining algorithms, such as GSP and PrefixSpan. Because the enumeration tree1 is the generic framework describing all frequent pattern mining algo-rithms, it implies that virtually all frequent pattern mining algorithms can be modified to the sequential pattern mining scenario. For example, the work in [243] generalizes TreeP-rojection, and the PrefixSpan algorithm generalizes FP-growth. Savasere et al.’s vertical format [446] has also been generalized to sequential pattern mining algorithms. The main diﬀerence between these algorithms is the diﬀerent eﬃciency of counting with the use of a variety of data structures, projection-based reuse tricks, and diﬀerent candidate-tree explo-ration strategies such as breadth-first or depth-first, rather than a fundamental diﬀerence in search space size. The size of the candidate tree is fixed, although pruning strategies such as the Apriori-style pruning can reduce its size.

The notion of projection-based reuse can also be extended to sequential pattern min-ing. The projected representation T (P) of the sequence database T is associated with a sequential pattern P in the candidate enumeration tree (as modified for sequential pattern mining). Then, each sequence Y ∈ T in the database is projected at P according to the following rules:

The sequential pattern P needs to be a subsequence of Y for the projection of Y to be included in the projected database T (P).

All items that are either not in the last element of P, or are not successful frequent extensions (either temporal or set-wise) of the parent of P are not included in the projection of Y because they are irrelevant for counting frequent extensions of P.
The earliest temporal occurrence of P in Y, as a subsequence, is determined. Let the last element P_r in P be matched to the element Y_k in Y according to this subsequence matching. Then, the first element of the projected representation of Y is equal to the set of items in Y_k − P_r that are lexicographically later than all items in P_r. If

the resulting element Q is null, then it is not included in the projection of Y. This first element, if non-null, is special because it may only be used for counting set-wise extensions of element P_r in the enumeration tree and is, therefore, denoted as Q with an underscore in front of it.

4. The remaining elements in the projected sequence after Q correspond to the elements in Y occurring temporally after the last matched element Y_k in Y. All these elements are included in the projection of Y after removing the irrelevant items discussed in step 2. These remaining elements can be used for counting either set-wise extensions of the last element P_r in P, or temporal extensions of P. For any of these remaining elements (other than Q) to be used for counting the set-wise extensions of P_r, the element would already need to contain P_r.

The projected database T (P) can be used to count the frequent extensions of P more eﬃ-ciently and determine the frequent ones. As in the frequent pattern mining, this projection can be performed successively in top-down fashion during the construction an enumeration-tree-like candidate structure. The projected database at a node can be generated by recur-sively projecting the database at its parent. The basic approach is exactly analogous to

See discussion in Sect. 4.4.4.5 of Chap. 4. A similar argument applies to sequential pattern mining.

500 CHAPTER 15. MINING DISCRETE SEQUENCES

projection-based frequent pattern mining algorithms discussed in Chap. 4. The algorithm starts with a candidate tree ET which is the null node with the entire sequence database

at that node. This tree is extended repeatedly using the following step until no nodes remain in ET for further extension.

Select a node (P, T (P)) in ET for extension;

Generate the candidate temporal and set-wise extensions of P;

Determine the frequent extensions of P using support counting on T (P );

Extend ET with frequent extensions and their recursively projected databases;

The final candidate tree ET contains all the frequent sequential patterns. Diﬀerent strategies for selecting the node P can lead to the generation of the candidate tree in a diﬀerent order such as breadth-first or depth-first order. This simplified and generalized description is roughly based on the frameworks independently proposed in [243] and Pre-fixSpan, which are closely related. The reader is referred to the bibliographic notes for discussion of the specific algorithms.

15.2.2 Constrained Sequential Pattern Mining

In many cases, additional constraints are imposed on the sequential patterns, such as con-straints on gaps between successive elements of the sequence. One solution is to use the unconstrained GSP algorithm, and then, as a postprocessing step, remove all subsequences not satisfying the constraint. However, this brute-force approach is a very ineﬃcient solu-tion because the number of constrained patterns may be orders of magnitude smaller than the unconstrained patterns. Therefore, the incorporation of the constraints directly into the GSP algorithm, during the candidate-generation or support-counting step, is significantly more eﬃcient. Depending on the nature of the constraints, the changes required to the GSP algorithm may be minor, or significant. In all cases, the support-counting procedure for F_k needs to be modified. The constraints are explicitly checked during the support count-ing. This reduces the number of frequent patterns generated, and makes the process more eﬃcient than the brute-force method. However, the incorporation of such constraints may sometimes result in invalidation of the downward closure property of the mined patterns. In such cases, appropriate changes may need to be made to the GSP algorithm. In cases where the downward closure property is not violated, the GSP algorithm can be used with very minor modifications for constraint checking during support counting.

An important constraint that does not violate the downward closure property is the maxspan constraint. This constraint specifies that the time diﬀerence between the first and last elements of a subsequence must be no larger than maxspan. Therefore, the GSP algorithm can be used directly, with the modification that the constraint is checked during support counting. Thus, the approach works with a much smaller set of subsequences in each step and is generally more eﬃcient than the brute-force method.

Another common constraint in sequential pattern mining is the maximum gap constraint. This is also referred to as the maxgap constraint. Note that all (k −1)-subsequences of a par-ticular frequent k-sequence may not be valid because of the maximum gap constraint. This is somewhat problematic because the Apriori principle cannot be used eﬀectively. For exam-ple, the subsequence a₁a₅ is not supported by the transaction database sequence a₁a₂a₃a₄a₅ under the maxgap value of 1, because the gap value between a₁ and a₅ is 3. However, the subsequence a₁a₃a₅ is supported by the same transaction database sequence under this maxgap value. It is, therefore, possible for a₁a₅ to have lower support than a₁a₃a₅. Thus,

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 299 300 301 302 303 304 305 306 ... 423