TreeProjection is a family of methods that uses recursive projections of the transactions down the enumeration tree structure. The goal of these recursive projections is to reuse the counting work that has already been done at a given node of the enumeration tree at its descendent nodes. This reduces the overall counting effort by orders of magnitude. TreeProjection is a general framework that shows how to use database projection in the context of a variety of different strategies for construction of the enumeration tree, such as breadth-first, depth-first, or a combination of the two. The DepthProject approach is a specific instantiation of this framework with the depth-first strategy. Different strategies have different trade-offs between the memory requirements and disk-access costs.
The main observation in projection-based methods is that if a transaction does not con-tain the itemset corresponding to an enumeration-tree node, then this transaction will not be relevant for counting at any descendent (superset itemset) of that node. Therefore, when counting is done at an enumeration-tree node, the information about irrelevant transac-tions should somehow be preserved for counting at its descendent nodes. This is achieved with the notion of projected databases . Each projected transaction database is specific to an enumeration-tree node. Transactions that do not contain the itemset P are not included in the projected databases at node P and its descendants. This results in a significant reduc-tion in the number of projected transactions. Furthermore, only the candidate extension items of P , denoted by C(P ), are relevant for counting at any of the subtrees rooted at node P . Therefore, the projected database at node P can be expressed only in terms of the items in C(P ). The size of C(P ) is much smaller than the universe of items, and therefore the projected database contains a smaller number of items per transaction with increasing size of P . We denote the projected database at node P by T (P ). For example, consider the node P = ab in Fig. 4.3, in which the candidate items for extending ab are C(P ) = {c, d, f }. Then, the transaction abcf g maps to the projected transaction cf in T (P ). On the other hand, the transaction acf g is not even present in T (P ) because P = ab is not a subset of acf g. The special case T (N ull) = T corresponds to the top level of the enumeration tree and is equal to the full transaction database. In fact, the subproblem at node P with transaction database T (P ) is structurally identical to the top-level problem, except that it is a much smaller problem focused on determining frequent patterns with a prefix of P . Therefore, the frequent node P in the enumeration tree can be extended further by count-ing the support of individual items in C(P ) using the relatively small database T (P ). This
4.4. FREQUENT ITEMSET MINING ALGORITHMS
|
107
|
Algorithm ProjectedEnumerationTree(Transactions: T ,
Minimum Support: minsup)
begin
Initialize enumeration tree ET to a single (N ull, T ) root node; while any node in ET has not been examined do begin
Select an unexamined node (P, T (P )) from ET for examination;
Generate candidates item extensions C(P ) of node (P, T (P ));
Determine frequent item extensions F (P ) ⊆ C(P ) by support counting
of individual items in smaller projected database T (P ); Remove infrequent items in T (P );
for each frequent item extension i ∈ F (P ) do begin Generate T (P ∪ {i}) from T (P );
Add (P ∪ {i}, T (P ∪ {i})) as child of P in ET ;
end
end
return enumeration tree ET ;
end
Figure 4.5: Generic enumeration-tree growth with unspecified growth strategy and database projections
results in a simplified and efficient counting process of candidate 1-item extensions rather than itemsets.
The enumeration tree can be grown with a variety of strategies such as the breadth-first or depth-first strategies. At each node, the counting is performed with the use of the projected database rather than the entire transaction database, and a further reduced and projected transaction database is propagated to the children of P . At each level of the hierarchical projection down the enumeration tree, the number of items and the number of transactions in the projected database are reduced. The basic idea is that T (P ) contains the minimal portion of the transaction database that is relevant for counting the subtree rooted at P , based on the removal of irrelevant transactions and items by the counting process that has already been performed at higher levels of the tree. By recursively projecting the transaction database down the enumeration tree, this counting work is reused. We refer to this approach as projection-based reuse of counting effort.
The generic enumeration-tree algorithm with hierarchical projections is illustrated in Fig. 4.5. This generic algorithm does not assume any specific exploration strategy, and is quite similar to the generic enumeration-tree pseudocode shown in Fig. 4.4. There are two differences between the pseudocodes.
For simplicity of notation, we have shown the exploration of a single node P at one time in Fig. 4.5, rather than a group of nodes P (as in Fig. 4.4). However, the pseudocode shown in Fig. 4.5 can easily be rewritten for a group of nodes P. Therefore, this is not a significant difference.
The key difference is that the projected database T (P ) is used to count support at node P . Each node in the enumeration tree is now represented by the itemset
and projected database pair (P, T (P )). This is a very important difference because T (P ) is much smaller than the original database. Therefore, a significant amount of information gained by counting the supports of ancestors of node P , is preserved in T (P ). Furthermore, one only needs to count the support of single item extensions of node P in T (P ) (rather than entire itemsets) in order to grow the subtree at P further.
108 CHAPTER 4. ASSOCIATION PATTERN MINING
The enumeration tree can be constructed in many different ways depending on the lexico-graphic ordering of items. How should the items be ordered? The structure of the enumer-ation tree has a built-in bias towards creating unbalanced trees in which the lexicograph-ically smaller items have more descendants. For example, in Fig. 4.3, node a has many more descendants than node f . Therefore, ordering the items from least support to greatest support ensures that the computationally heavier branches of the enumeration tree have fewer relevant transactions. This is helpful in maximizing the selectivity of projections and ensuring better efficiency.
The strategy used for selection of the node P defines the order in which the nodes of the enumeration tree are materialized. This strategy has a direct impact on memory man-agement because projected databases, which are no longer required for future computation, can be deleted. In depth-first strategies, the lexicographically smallest unexamined node
is selected for extension. In this case, one only needs to maintain projected databases along the current path of the enumeration tree being explored. In breadth-first strategies, an entire group of nodes P corresponding to all patterns of a particular size are grown first. In such cases, the projected databases need to be simultaneously maintained along the full breadth of the enumeration tree ET at the two current levels involved in the growth process. Although it may be possible to perform the projection on such a large number of nodes for smaller transaction databases, some modifications to the basic framework of Fig. 4.5 are needed for the general case of larger databases.
In particular, breadth-first variations of the TreeProjection framework perform hierarchi-cal projections on the fly during counting from their ancestor nodes. The depth-first varia-tions of TreeProjection, such as DepthProject, achieve full projection-based reuse because the projected transactions can be consistently maintained at each materialized node along the relatively small path of the enumeration tree from the root to the current node. The breadth-first variations do have the merit that they can optimize disk-access costs for arbitrarily large databases at the expense of losing some of the power of projection-based reuse. As will be discussed later, all (full) projection-based reuse methods face memory-management chal-lenges with increasing database size. These additional memory requirements can be viewed as the price for persistently storing the relevant work done in earlier iterations in the indi-rect form of projected databases. There is usually a different trade-off between disk-access costs and memory/computational requirements in various strategies, which is exploited by the TreeProjection framework. The bibliographic notes contain pointers to specific details of these optimized variations of TreeProjection.
Dostları ilə paylaş: |