Figure 4.3: The lexicographic or enumeration tree of frequent itemsets
4.4.3 Enumeration-Tree Algorithms
These algorithms are based on set enumeration concepts, in which the different candidate itemsets are generated in a tree-like structure known as the enumeration tree, which is a subgraph of the lattice of itemsets introduced in Fig. 4.1. This tree-like structure is also referred to as a lexicographic tree because it is dependent on an upfront lexicographic order-ing among the items. The candidate patterns are generated by growing this lexicographic tree. This tree can be grown in a wide variety of different strategies to achieve different trade-offs between storage, disk access costs, and computational efficiency. Because most of the discussion in this section will use this structure as a base for algorithmic development, this concept will be discussed in detail here. The main characteristic of the enumeration tree (or lexicographic tree) is that it provides an abstract hierarchical representation of the itemsets. This representation is leveraged by frequent pattern mining algorithms for sys-tematic exploration of the candidate patterns in a non-repetitive way. The final output of these algorithms can also be viewed as an enumeration tree structure that is defined only on the frequent itemsets. The enumeration tree is defined on the frequent itemsets in the following way:
A node exists in the tree corresponding to each frequent itemset. The root of the tree corresponds to the null itemset.
Let I = {i1, . . . ik} be a frequent itemset, where i1, i2 . . . ik are listed in lexicographic order. The parent of the node I is the itemset {i1, . . . ik−1}. Thus, the child of a node can only be extended with items occurring lexicographically after all items occur-ring in that node. The enumeration tree can also be viewed as a prefix tree on the lexicographically ordered string representation of the itemsets.
This definition of an ancestral relationship naturally creates a tree structure on the nodes, which is rooted at the null node. An example of the frequent portion of the enumeration tree is illustrated in Fig. 4.3. An item that is used to extend a node to its (frequent) child in the enumeration tree is referred to as a frequent tree extension, or simply a tree extension. In the example of Fig. 4.3, the frequent tree extensions of node a are b, c, d, and f , because
104 CHAPTER 4. ASSOCIATION PATTERN MINING
these items extend node a to the frequent itemsets ab , ac, ad, and af , respectively. The lattice provides many paths to extend the null itemset to a node, whereas an enumeration tree provides only one path. For example, itemset ab can be extended either in the order a → ab, or in the order b → ab in the lattice. However, only the former is possible in the enumeration tree after the lexicographic ordering has been fixed. Thus, the lexicographic ordering imposes a strictly hierarchical structure on the itemsets. This hierarchical structure enables systematic and non-redundant exploration of the itemset search space by algorithms that generate candidates by extending frequent itemsets with one item at a time. The enumeration tree can be constructed in many ways with different lexicographic orderings of items. The impact of this ordering will be discussed later.
Most of the enumeration tree algorithms work by growing this enumeration tree of frequent itemsets with a predefined strategy. First, the root node of the tree is extended by finding the frequent 1-items. Then, these nodes may be extended to create candidates. These are checked against the transaction database to determine the ones that are frequent. The enumeration tree framework provides an order and structure to the frequent itemset discovery, which can be leveraged to improve the counting and pruning process of candidates. In the following discussion, the terms “node” and “itemset” will be used interchangeably. Therefore, the notation P will be used to denote both an itemset, and its corresponding node in the enumeration tree.
So, how can candidates nodes be generated in a systematic way from the frequent nodes in the enumeration tree that have already been discovered? For an item i to be considered a candidate for extending a frequent node P to P ∪ {i}, it must also be a frequent extension of the parent Q of P . This is because of the downward closure property, and it can be used to systematically define the candidate extensions of a node P after the frequent extensions of its parent Q have been determined. Let F (Q) represent the frequent lexicographic tree extensions of node Q. Let i ∈ F (Q) be the frequent extension item that extends frequent node Q to frequent node P = Q ∪ {i}. Let C(P ) denote the subset of items from F (Q) occurring lexicographically after the item i used to extend node Q to node P . The set C(P ) defines the candidate extension items of node P , which are defined as items that can be appended at the end of P to create candidate itemsets. This provides a systematic methodology to generate candidate children of node P . As we will see in Sect. 4.4.3.1, the resulting candidates are identical to those generated by Apriori joins. Note that the relationship F (P ) ⊆ C(P ) ⊂ F (Q) is always true. The value of F (P ) in Fig. 4.3, when
= ab, is {c, d}. The value of C(P ) for P = ab is {c, d, f } because these are frequent extensions of parent itemset Q = {a} of P occurring lexicographically after the item b. Note that the set of candidate extensions C(ab) also contains the (infrequent) item f that the set of frequent extensions F (ab) does not. Such infrequent item extensions correspond to failed candidate tests in all enumeration tree algorithms. Note that the infrequent itemset abf is not included in the frequent itemset tree of Fig. 4.3. It is also possible to create an enumeration tree structure on the candidate itemsets, which contains an additional layer of infrequent candidate extensions of the nodes in Fig. 4.3. Such a tree would contain abf .
Enumeration tree algorithms iteratively grow the enumeration tree ET of frequent pat-terns. A very generic description of this iterative step, which is executed repeatedly to extend the enumeration tree ET , is as follows:
Select one or more nodes P in ET ;
Determine candidate extensions C(P ) for each such node P ∈ P;
Count support of generated candidates;
Add frequent candidates to ET (tree growth);
Dostları ilə paylaş: |