Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	79/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 75 76 77 78 79 80 81 82 ... 423

1-Data Mining tarjima

FP-growth is a recursive algorithm that extends suﬃxes of frequent patterns. Any recur-sive approach has a tree-structure associated with it that is referred to as its recursion tree, and a dynamic recursion stack that stores the recursion variables on the current path of the recursion tree during execution. Therefore, it is instructive to examine the suﬃx-based recursion tree created by the FP -growth algorithm, and compare it with the classical prefix-based enumeration tree used by enumeration-tree algorithms.

In Fig. 4.13a, the enumeration tree from the earlier example of Fig. 4.3 has been repli-cated. This tree of frequent patterns is counted by all enumeration-tree algorithms along with a single layer of infrequent candidate extensions of this tree corresponding to failed candidate tests. Each call of FP- growth discovers the set of frequent patterns extending a particular suﬃx of items, just as each branch of an enumeration tree explores the itemsets for a particular prefix. So, what is the hierarchical recursive relationship among the suﬃxes whose conditional pattern bases are explored? First, we need to decide on an ordering of items. Because the recursion is performed on suﬃxes and enumeration trees are constructed on prefixes, the opposite ordering {f, e, d, c, b, a} is assumed to adjust for the diﬀerent con-vention in the two methods. Indeed, most enumeration-tree methods order items from the least frequent to the most frequent, whereas FP-growth does the reverse. The corresponding recursion tree of FP-growth, when the 1-itemsets are ordered from left to right in dictio-nary order, is illustrated in Fig. 4.13b. The trees in Figs 4.13a and 4.13b are identical, with the only diﬀerence being that they are drawn diﬀerently, to account for the opposite lexicographic ordering. The FP-growth recursion tree on the reverse lexicographic ordering has an identical structure to the traditional enumeration tree on the prefixes. During any given recursive call of FP-growth, the current (recursion) stack of suﬃx items is the path in the enumeration tree that is currently being explored. This enumeration tree is explored in depth-first order by FP-growth because of its recursive nature.

Traditional enumeration-tree methods typically count the support of a single layer of infrequent extensions of the frequent patterns in the enumeration-tree, as (failed) candidates, to rule them out. Therefore, it is instructive to explore whether FP-growth avoids counting these infrequent candidates. Note that when conditional transaction databases F PT _i are

4.4. FREQUENT ITEMSET MINING ALGORITHMS

121

created (see Fig. 4.12), infrequent items must be removed from them. This requires the counting of the support of these (implicitly failed) candidate extensions. In a traditional candidate generate-and-test algorithm, the frequent candidate extensions would be reported immediately after the counting step as a successful candidate test. However, in FP-growth, these frequent extensions are encoded back into the conditional transaction database F PT _i, and the reporting is delayed to the next level recursive call. In the next level recursive call, these frequent extensions are then extracted from F PT _i and reported. The counting and removal of infrequent items from conditional transaction sets is an implicit candidate evaluation and testing step. The number of such failed candidate tests4 in FP -growth is exactly equal to that of enumeration-tree algorithms, such as Apriori (without the level-wise pruning step). This equality follows directly from the relationship of all these algorithms to how they explore the enumeration tree and rule out infrequent portions. All pattern- growth methods, including FP-growth, should be considered enumeration-tree methods, as should Apriori. Whereas traditional enumeration trees are constructed on prefixes, the (implicit) FP-growth enumeration trees are constructed using suﬃxes. This is a diﬀerence only in the item-ordering convention.

The depth-first strategy is the approach of choice in database projection methods because it is more memory-eﬃcient to maintain the conditional transaction sets along a (relatively small) depth of the enumeration (recursion) tree rather than along the (much larger) breadth of the enumeration tree. As discussed in the previous section, memory man-agement becomes a problem even with the depth- first strategy beyond a certain database size. However, the specific strategy used for tree exploration does not have any impact on the size of the enumeration tree (or candidates) explored over the course of the entire algorithm execution. The only diﬀerence is that breadth-first methods process candidates in large batches based on pattern size, whereas depth-first methods process candidates in smaller batches of immediate siblings in the enumeration tree. From this perspective, FP-growth cannot avoid the exponential candidate search space exploration required by enumeration-tree methods, such as Apriori.

Whereas methods such as Apriori can also be interpreted as counting methods on an enumeration-tree of exactly the same size as the recursion tree of FP-growth, the counting work done at the higher levels of the enumeration tree is lost. This loss is because the counting is done from scratch at each level in Apriori with the entire transaction database rather than a projected database that remembers and reuses the work done at the higher levels of the tree. Projection-based reuse is also utilized by Savasere et al.’s vertical count-ing methods [446] and DepthProject. The use of a pointer-trie combination data structure for projected transaction representation is the primary diﬀerence of FP-growth from other projection-based methods. In the context of depth-first exploration, these methods can be understood either as divide-and-conquer strategies or as projection-based reuse strategies. The notion of projection-based reuse is more general because it applies to both the breadth-first and depth-first versions of the algorithm, and it provides a clearer picture of how compu-tational savings are achieved by avoiding wasteful and repetitive counting. Projection-based reuse enables the eﬃcient testing of candidate item extensions in a restricted portion of the database rather than the testing of candidate itemsets in the full database. Therefore, the eﬃciencies in FP-growth are a result of more eﬃcient counting per candidate and not because of fewer candidates. The only diﬀerences in search space size between various methods are

An ad hoc pruning optimization in FP-growth terminates the recursion when all nodes in the FP-Tree lie on a single path. This pruning optimization reduces the number of successful candidate tests but not the number of failed candidate tests. Failed candidate tests often dominate successful candidate tests in real data sets.

122 CHAPTER 4. ASSOCIATION PATTERN MINING

the result of ad hoc pruning optimizations, such as level-wise pruning in Apriori, bucketing in the DepthProject algorithm, and the single-path boundary condition of FP-growth.

The bookkeeping of the projected transaction sets can be done diﬀerently with the use of diﬀerent data structures, such as arrays, pointers, or a pointer-trie combination. Many diﬀerent data structure variations are explored in diﬀerent projection algorithms, such as TreeProjection, DepthProject, FP-growth, and H-Mine [419]. Each data structure is associated with a diﬀerent set of eﬃciencies and overheads.

In conclusion, the enumeration tree5 is the most general framework to describe all previ-ous frequent pattern mining algorithms. This is because the enumeration tree is a subgraph of the lattice (candidate space) and it provides a way to explore the candidate patterns in a systematic and non-redundant way. The support testing of the frequent portion of the enumeration tree along with a single layer of infrequent candidate extensions of these nodes is fundamental to all frequent itemset mining algorithms for ruling in and ruling out possible (or candidate) frequent patterns. Any algorithm, such as FP-growth, which uses the enumeration tree to rule in and rule out possible extensions of frequent patterns with support counting, is a candidate generate-and-test algorithm.

4.5 Alternative Models: Interesting Patterns

The traditional model for frequent itemset generation has found widespread popularity and acceptance because of its simplicity. The simplicity of using raw frequency counts for the support, and that of using the conditional probabilities for the confidence is very appealing. Furthermore, the downward closure property of frequent itemsets enables the design of eﬃcient algorithms for frequent itemset mining. This algorithmic convenience does not, however, mean that the patterns found are always significant from an application-specific perspective. Raw frequencies of itemsets do not always correspond to the most interesting patterns.

For example, consider the transaction database illustrated in Fig. 4.1. In this database, all the transactions contain the item M ilk. Therefore, the item M ilk can be appended to any set of items, without changing its frequency. However, this does not mean that M ilk is truly associated with any set of items. Furthermore, for any set of items X, the association rule X ⇒ {M ilk } has 100% confidence. However, it would not make sense for the supermarket merchant to assume that the basket of items X is discriminatively indicative of M ilk. Herein lies the limitation of the traditional support-confidence model.

Sometimes, it is also desirable to design measures that can adjust to the skew in the individual item support values. This adjustment is especially important for negative pattern mining. For example, the support of the pair of items {M ilk, Butter } is very diﬀerent from that of {¬M ilk, ¬Butter}. Here, ¬ indicates negation. On the other hand, it can be argued that the statistical coeﬃcient of correlation is exactly the same in both cases. Therefore, the measure should quantify the association between both pairs in exactly the same way. Clearly, such measures are important for negative pattern mining. Measures that satisfy this property are said to satisfy the bit symmetric property because values of 0 in the binary matrix are treated in a similar way to values of 1.

⁵FP-growth has been presented in a separate section from enumeration tree methods only because it uses a diﬀerent convention of constructing suﬃx-based enumeration trees. It is not necessary to distinguish “pattern growth” methods from “candidate-based” methods to meaningfully categorize various frequent pattern mining methods. Enumeration tree methods are best categorized on the basis of their (i) tree exploration strategy, (ii) projection-based reuse properties, and (iii) relevant data structures.

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 75 76 77 78 79 80 81 82 ... 423