elements and 4 items, which is identical to S1. However, the nature of the join will be somewhat different in these cases. In general, cases where the last element of S2 is a 1-itemset need to be treated specially. The following rules can be used to execute the join:
If the last element of S2 is a 1-itemset, then the joined candidate may be obtained by appending the last element of S2 to S1 as a separate element. For example, consider the following two sequences:
S1 = ⟨{Bread, Butter, Cheese}, {Cheese, Eggs}⟩
S2 = ⟨{Bread, Butter}, {Cheese, Eggs}, {Milk}⟩
The join of the two sequences is ⟨{Bread, Butter, Cheese}, {Cheese, Eggs}, {Milk}⟩.
If the last element of S2 is not a 1-itemset, but a superset of the last element of S1, then the joined candidate may be obtained by replacing the last element of S1 with the last element of S2. For example, consider the following two sequences:
S1 = ⟨{Bread, Butter, Cheese}, {Cheese, Eggs}⟩
S2 = ⟨{Bread, Butter}, {Milk, Cheese, Eggs}⟩
The join of the two sequences is ⟨{Bread, Butter, Cheese}, {Milk, Cheese, Eggs}⟩.
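The two join rules can be captured in a short sketch. The following is a minimal illustration rather than the pseudocode of Fig. 15.1; it assumes that a sequence is represented as a Python list of frozensets and that the input pair already satisfies the join condition, and the name gsp_join is introduced here only for illustration.

```python
def gsp_join(s1, s2):
    """Join two frequent k-sequences s1 and s2 into a (k+1)-candidate,
    assuming the join condition already holds (removing the first item
    of s1 and the last item of s2 yields the same subsequence)."""
    last = s2[-1]
    if len(last) == 1:
        # Rule 1: the last element of s2 is a 1-itemset, so it is appended
        # to s1 as a separate element.
        return s1 + [last]
    # Rule 2: the last element of s2 is a superset of the last element of
    # s1, so it replaces the last element of s1.
    return s1[:-1] + [last]

# The two worked examples from the text:
s1 = [frozenset({"Bread", "Butter", "Cheese"}), frozenset({"Cheese", "Eggs"})]
s2a = [frozenset({"Bread", "Butter"}), frozenset({"Cheese", "Eggs"}), frozenset({"Milk"})]
s2b = [frozenset({"Bread", "Butter"}), frozenset({"Milk", "Cheese", "Eggs"})]

print(gsp_join(s1, s2a))  # {Milk} appended as a new element
print(gsp_join(s1, s2b))  # last element replaced by {Milk, Cheese, Eggs}
```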
These key differences from Apriori joins are a result of the temporal complexity and the set-based elements in sequential patterns. Alternative methods exist for performing the joins. For example, an alternative approach is to remove one item from the last elements of both S1 and S2 and check whether the resulting sequences are identical. However, in this case multiple candidates might be generated from the same pair. For example, ⟨{a}, {b}, {c}⟩ and ⟨{a}, {b}, {d}⟩ can join to any of ⟨{a}, {b}, {c}, {d}⟩, ⟨{a}, {b}, {d}, {c}⟩, and ⟨{a}, {b}, {c, d}⟩. The join rule of removing the first item from S1 and the last item from S2 has the merit of producing a unique join result, as illustrated in the sketch below. For any specific join rule, it is important to ensure exhaustive and nonrepetitive generation of candidates. As we will see later, a similar notion to the frequent-pattern enumeration tree can be introduced in sequential pattern mining to ensure exhaustive and nonrepetitive candidate generation.
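The non-uniqueness of the alternative rule can be checked directly. In the small sketch below (illustrative only, using the same list-of-frozensets representation as before), trimming the singleton last elements of ⟨{a}, {b}, {c}⟩ and ⟨{a}, {b}, {d}⟩ leaves the same sequence ⟨{a}, {b}⟩, so the pair qualifies for joining, yet three distinct candidates are consistent with it.

```python
# Sequences as lists of frozensets; dropping the single-item last element of
# each sequence leaves the identical prefix <{a}, {b}>, so the pair matches
# under the "remove one item from the last element" rule.
S1 = [frozenset("a"), frozenset("b"), frozenset("c")]
S2 = [frozenset("a"), frozenset("b"), frozenset("d")]
assert S1[:-1] == S2[:-1]

# Three different candidates are compatible with this single pair.
candidates = [
    S1 + [frozenset("d")],        # <{a}, {b}, {c}, {d}>
    S2 + [frozenset("c")],        # <{a}, {b}, {d}, {c}>
    S1[:-1] + [frozenset("cd")],  # <{a}, {b}, {c, d}>
]
for cand in candidates:
    print([set(e) for e in cand])
```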
The Apriori trick is then used to prune sequences that violate downward closure. The idea is to check whether each k-subsequence of a candidate in Ck+1 is present in Fk, and to remove from the candidate set those candidates that do not satisfy this closure property. The surviving candidate (k + 1)-sequences in Ck+1 are then checked against the sequence database T, and their support is counted. Support counting is performed according to the notion of a subsequence, as presented in Definition 15.2.1. All frequent candidates in Ck+1 are retained in Fk+1. The algorithm terminates when no frequent sequences are generated in Fk+1 in a particular iteration. All the frequent sequences generated during the different iterations of the levelwise approach are returned by the algorithm. The pseudocode of the GSP algorithm is illustrated in Fig. 15.1.
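For concreteness, the following sketch outlines the levelwise loop just described. It is not the pseudocode of Fig. 15.1; it assumes sequences are lists of frozensets, that a candidate-generation routine such as the join sketched above is supplied, and that support is counted with the subsequence notion of Definition 15.2.1. The names gsp, is_subsequence, and k_subsequences are illustrative.

```python
def is_subsequence(s, t):
    """True if every element of s is contained in some element of t,
    with the matched elements of t appearing in the same order."""
    pos = 0
    for element in s:
        while pos < len(t) and not element <= t[pos]:
            pos += 1
        if pos == len(t):
            return False
        pos += 1
    return True

def k_subsequences(candidate):
    """All subsequences obtained by deleting exactly one item; used for
    the downward-closure (Apriori) pruning check."""
    subs = []
    for i, element in enumerate(candidate):
        for item in element:
            reduced = element - {item}
            subs.append(candidate[:i] + ([reduced] if reduced else []) + candidate[i + 1:])
    return subs

def gsp(database, minsup, generate_candidates):
    """Levelwise GSP skeleton: database is a list of sequences, minsup an
    absolute support threshold, and generate_candidates a join routine
    that maps the frequent k-sequences F_k to the candidate set C_{k+1}."""
    # F_1: frequent 1-sequences (single items).
    items = {item for seq in database for element in seq for item in element}
    frequent = [[frozenset({i})] for i in items
                if sum(is_subsequence([frozenset({i})], t) for t in database) >= minsup]
    all_frequent = list(frequent)
    while frequent:
        candidates = generate_candidates(frequent)
        # Apriori trick: drop candidates with an infrequent k-subsequence.
        candidates = [c for c in candidates
                      if all(sub in frequent for sub in k_subsequences(c))]
        # Support counting against the sequence database.
        frequent = [c for c in candidates
                    if sum(is_subsequence(c, t) for t in database) >= minsup]
        all_frequent.extend(frequent)
    return all_frequent
```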
15.2.1 Frequent Patterns to Frequent Sequences
It is easy to see that the Apriori and GSP algorithms are structurally similar. This is not a coincidence. The basic structure of the frequent pattern and sequential pattern mining problems is similar. Aside from the differences in the support counting approach, the main difference between GSP and Apriori is in terms of how candidates are generated. The join