Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə303/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   299   300   301   302   303   304   305   306   ...   423
1-Data Mining tarjima

CHAPTER 15.

MINING DISCRETE SEQUENCES
















< >













LEVEL 0








































T

T




T




T
















<{a}>

<{b}>




<{c}>

<{d}>




LEVEL 1




S

S

T

T

T

S

T

T

T











































<{a,b}> <{a,c}>

<{a}{c}>

<{b}{a}>

<{b}{b}>




<{b,c}>

< {b}{c} > <{c}{a}> <{c}{b}>

LEVEL 2




S

T




S

T




T

T

S







<{a,b,c}>

<{a,b}{c}> <{b}{a,b}>

<{b}{a}{b}>

<{b,c}{a}> <{b,c}{b}>

<{c}{a,b}>

LEVEL 3




Figure 15.2: The equivalent of an enumeration tree for sequential pattern mining


generation in GSP is defined in terms of two separate cases. The two cases correspond to temporal extensions and set-wise extensions of candidates.

As discussed in Chap. 4, the Apriori algorithm can be viewed as an enumeration tree algorithm. It is also possible to define an analogous candidate tree in sequential pattern mining, albeit with a somewhat different structure than the enumeration tree in frequent pattern mining. The key differences in join-based candidate generation between Apriori and GSP algorithms translate to differences in the structure and growth of the candidate tree in sequential pattern mining. In general, candidate trees for sequential pattern mining are more complex because they need to accommodate both temporal and set-wise growth of sequences. Therefore, the definition of candidate extensions of a tree- node needs to be changed. A node for sequence S can be extended to a lower-level node in one of two ways:





  1. Set-wise extension: In this case, an item is added to the last element in the sequence S to create a candidate pattern. Therefore, the number of elements does not increase. For an item to be added to the last element of S, it must satisfy two properties; (a) the item successfully extends the parent sequence of S in the candidate tree with either a set-wise or temporal extension to another frequent sequence, and (b) the item must be lexicographically later than all items in the last element of S. As in frequent pattern mining, a lexicographic ordering of items needs to be fixed up front.




  1. Temporal extension: A new element with a single item is added to the end of the current sequence S. As in the previous case of set-wise extensions, any frequent item extension of the parent of S may be used to extend S (condition (a)). However, the added item need not be lexicographically later than the items in the last element of sequence S.

These two kinds of extensions can be shown to be equivalent to the two kinds of joins in the GSP algorithm. As in frequent pattern mining, the candidate extensions of a node are a subset of the corresponding frequent extensions of its parent node. An example of the frequent portion of the candidate tree for sequential pattern mining is illustrated in Fig. 15.2. Note the greater complexity of the tree, because of set-wise and temporal addition of items at each level of the tree. The set-wise extensions are marked as “S”, and the temporal extensions are marked as “T” on the corresponding tree edges. A particularly illuminating discussion on this topic may be found in [243], on which this example is based.





15.2. SEQUENTIAL PATTERN MINING

499

It is possible to convert any enumeration tree algorithm for frequent pattern mining to a sequential pattern mining algorithm by systematically making appropriate modifications. These changes account for the different structure of the candidate tree in sequential pattern mining compared to that in frequent pattern mining. This candidate tree is implicitly gen-erated by all sequential pattern mining algorithms, such as GSP and PrefixSpan. Because the enumeration tree1 is the generic framework describing all frequent pattern mining algo-rithms, it implies that virtually all frequent pattern mining algorithms can be modified to the sequential pattern mining scenario. For example, the work in [243] generalizes TreeP-rojection, and the PrefixSpan algorithm generalizes FP-growth. Savasere et al.’s vertical format [446] has also been generalized to sequential pattern mining algorithms. The main difference between these algorithms is the different efficiency of counting with the use of a variety of data structures, projection-based reuse tricks, and different candidate-tree explo-ration strategies such as breadth-first or depth-first, rather than a fundamental difference in search space size. The size of the candidate tree is fixed, although pruning strategies such as the Apriori-style pruning can reduce its size.


The notion of projection-based reuse can also be extended to sequential pattern min-ing. The projected representation T (P) of the sequence database T is associated with a sequential pattern P in the candidate enumeration tree (as modified for sequential pattern mining). Then, each sequence Y ∈ T in the database is projected at P according to the following rules:





  1. The sequential pattern P needs to be a subsequence of Y for the projection of Y to be included in the projected database T (P).




  1. All items that are either not in the last element of P, or are not successful frequent extensions (either temporal or set-wise) of the parent of P are not included in the projection of Y because they are irrelevant for counting frequent extensions of P.

  2. The earliest temporal occurrence of P in Y, as a subsequence, is determined. Let the last element Pr in P be matched to the element Yk in Y according to this subsequence matching. Then, the first element of the projected representation of Y is equal to the set of items in Yk − Pr that are lexicographically later than all items in Pr. If

the resulting element Q is null, then it is not included in the projection of Y. This first element, if non-null, is special because it may only be used for counting set-wise extensions of element Pr in the enumeration tree and is, therefore, denoted as Q with an underscore in front of it.


4. The remaining elements in the projected sequence after Q correspond to the elements in Y occurring temporally after the last matched element Yk in Y. All these elements are included in the projection of Y after removing the irrelevant items discussed in step 2. These remaining elements can be used for counting either set-wise extensions of the last element Pr in P, or temporal extensions of P. For any of these remaining elements (other than Q) to be used for counting the set-wise extensions of Pr, the element would already need to contain Pr.


The projected database T (P) can be used to count the frequent extensions of P more effi-ciently and determine the frequent ones. As in the frequent pattern mining, this projection can be performed successively in top-down fashion during the construction an enumeration-tree-like candidate structure. The projected database at a node can be generated by recur-sively projecting the database at its parent. The basic approach is exactly analogous to








  • See discussion in Sect. 4.4.4.5 of Chap. 4. A similar argument applies to sequential pattern mining.

500 CHAPTER 15. MINING DISCRETE SEQUENCES

projection-based frequent pattern mining algorithms discussed in Chap. 4. The algorithm starts with a candidate tree ET which is the null node with the entire sequence database



  • at that node. This tree is extended repeatedly using the following step until no nodes remain in ET for further extension.

Select a node (P, T (P)) in ET for extension;


Generate the candidate temporal and set-wise extensions of P;


Determine the frequent extensions of P using support counting on T (P );


Extend ET with frequent extensions and their recursively projected databases;


The final candidate tree ET contains all the frequent sequential patterns. Different strategies for selecting the node P can lead to the generation of the candidate tree in a different order such as breadth-first or depth-first order. This simplified and generalized description is roughly based on the frameworks independently proposed in [243] and Pre-fixSpan, which are closely related. The reader is referred to the bibliographic notes for discussion of the specific algorithms.


15.2.2 Constrained Sequential Pattern Mining

In many cases, additional constraints are imposed on the sequential patterns, such as con-straints on gaps between successive elements of the sequence. One solution is to use the unconstrained GSP algorithm, and then, as a postprocessing step, remove all subsequences not satisfying the constraint. However, this brute-force approach is a very inefficient solu-tion because the number of constrained patterns may be orders of magnitude smaller than the unconstrained patterns. Therefore, the incorporation of the constraints directly into the GSP algorithm, during the candidate-generation or support-counting step, is significantly more efficient. Depending on the nature of the constraints, the changes required to the GSP algorithm may be minor, or significant. In all cases, the support-counting procedure for Fk needs to be modified. The constraints are explicitly checked during the support count-ing. This reduces the number of frequent patterns generated, and makes the process more efficient than the brute-force method. However, the incorporation of such constraints may sometimes result in invalidation of the downward closure property of the mined patterns. In such cases, appropriate changes may need to be made to the GSP algorithm. In cases where the downward closure property is not violated, the GSP algorithm can be used with very minor modifications for constraint checking during support counting.


An important constraint that does not violate the downward closure property is the maxspan constraint. This constraint specifies that the time difference between the first and last elements of a subsequence must be no larger than maxspan. Therefore, the GSP algorithm can be used directly, with the modification that the constraint is checked during support counting. Thus, the approach works with a much smaller set of subsequences in each step and is generally more efficient than the brute-force method.


Another common constraint in sequential pattern mining is the maximum gap constraint. This is also referred to as the maxgap constraint. Note that all (k −1)-subsequences of a par-ticular frequent k-sequence may not be valid because of the maximum gap constraint. This is somewhat problematic because the Apriori principle cannot be used effectively. For exam-ple, the subsequence a1a5 is not supported by the transaction database sequence a1a2a3a4a5 under the maxgap value of 1, because the gap value between a1 and a5 is 3. However, the subsequence a1a3a5 is supported by the same transaction database sequence under this maxgap value. It is, therefore, possible for a1a5 to have lower support than a1a3a5. Thus,




Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   299   300   301   302   303   304   305   306   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin