Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə70/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   66   67   68   69   70   71   72   73   ...   423
1-Data Mining tarjima

4.4. FREQUENT ITEMSET MINING ALGORITHMS

105

Algorithm GenericEnumerationTree(Transactions: T ,


Minimum Support: minsup)


begin
Initialize enumeration tree ET to single N ull node; while any node in ET has not been examined do begin


Select one of more unexamined nodes P from ET for examination; Generate candidates extensions C(P ) of each node P ∈ P;


Determine frequent extensions F (P ) ⊆ C(P ) for each P ∈ P with support counting; Extend each node P ∈ P in ET with its frequent extensions in F (P );
end

return enumeration tree ET ;


end

Figure 4.4: Generic enumeration-tree growth with unspecified growth strategy and counting method

This approach is continued until none of the nodes can be extended any further. At this point, the algorithm terminates. A more detailed description is provided in Fig. 4.4. Interestingly, almost all frequent pattern mining algorithms can be viewed as variations and extensions of this simple enumeration-tree framework. Within this broader framework, a wide variability exists both in terms of the growth strategy of the tree and the specific data structures used for support counting. Therefore, the description of Fig. 4.4 is very generic because none of these aspects are specified. The different choices of growth strategy and counting methodology provide different trade-offs between efficiency, space-requirements, and disk access costs. For example, in breadth-first strategies, the node set P selected in an iteration of Fig. 4.4 corresponds to all nodes at one level of the tree. This approach may be more relevant for disk-resident databases because all nodes at a single level of the tree can be extended during one counting pass on the transaction database. Depth-first strategies select a single node at the deepest level to create P. These strategies may have better ability to explore the tree deeply and discover long frequent patterns early. The early discovery of longer patterns is especially useful for computational efficiency in maximal pattern mining and for better memory management in certain classes of projection-based algorithms.


Because the counting approach is the most expensive part, the different techniques attempt to use growth strategies that optimize the work done during counting. Further-more, it is crucial for the counting data structures to be efficient. This section will explore some of the common algorithms, data structures, and pruning strategies that leverage the enumeration -tree structure in the counting process. Interestingly, the enumeration- tree framework is so general that even the Apriori algorithm can be interpreted within this framework, although the concept of an enumeration tree was not used when Apriori was proposed.


4.4.3.1 Enumeration-Tree-Based Interpretation of Apriori

The Apriori algorithm can be viewed as the level-wise construction of the enumeration tree in breadth-first manner. The Apriori join for generating candidate (k + 1)-itemsets is performed in a non-redundant way by using only the first (k − 1) items from two frequent k-itemsets. This is equivalent to joining all pairs of immediate siblings at the kth level of the enumeration tree. For example, the children of ab in Fig. 4.3 may be obtained by joining


106 CHAPTER 4. ASSOCIATION PATTERN MINING




ab with all its frequent siblings (other children of node a) that occur lexicographically later than it. In other words, the join operation of node P with its lexicographically later frequent siblings produces the candidates corresponding to the extension of P with each of its candidate tree-extensions C(P ). In fact, the candidate extensions C(P ) for all nodes P at a given level of the tree can be exhaustively and non-repetitively generated by using joins between all pairs of frequent siblings at that level. The Apriori pruning trick then discards some of the enumeration tree nodes because they are guaranteed not to be frequent. A single pass over the transaction database is used to count the support of these candidate extensions, and generate the frequent extensions F (P ) ⊆ C(P ) for each node P in the level being extended. The approach terminates when the tree cannot be grown further in a particular pass over the database. Thus, the join operation of Apriori has a direct interpretation in terms of the enumeration tree, and the Apriori algorithm implicitly extends the enumeration tree in a level-wise fashion with the use of joins.
4.4.3.2 TreeProjection and DepthProject



Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   66   67   68   69   70   71   72   73   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin