Data Mining: The Textbook



4.4. FREQUENT ITEMSET MINING ALGORITHMS


Algorithm Apriori(Transactions: T, Minimum Support: minsup)
begin
  k = 1;
  F1 = { All frequent 1-itemsets };
  while Fk is not empty do begin
    Generate Ck+1 by joining itemset-pairs in Fk;
    Prune itemsets from Ck+1 that violate downward closure;
    Determine Fk+1 by support counting on (Ck+1, T) and retaining
      itemsets from Ck+1 with support at least minsup;
    k = k + 1;
  end;
  return(F1 ∪ F2 ∪ · · · ∪ Fk);
end

Figure 4.2: The Apriori algorithm

For example, the two frequent 3-itemsets abc and abd, joined together on the two common items a and b, will yield the candidate 4-itemset abcd. Of course, it is possible to join other frequent patterns to create the same candidate. One might also join abc and bcd to achieve the same result. Suppose that all four of the 3-subsets of abcd are present in the set of frequent 3-itemsets. Then, one can create the candidate 4-itemset in C(4, 2) = 6 different ways, one for each pair of 3-subsets. To avoid redundancy in candidate generation, the convention is to impose a lexicographic ordering on the items and use the first (k − 1) items of the itemset for the join. Thus, in this case, the only way to generate abcd would be to join using the first two items a and b. Therefore, the itemsets abc and abd would need to be joined to create abcd. Note that if either abc or abd is not frequent, then abcd will not be generated as a candidate using this join approach. Furthermore, in such a case, it is assured that abcd is not frequent because of the downward closure property of frequent itemsets. Thus, the downward closure property ensures that the candidate set generated using this approach does not miss any itemset that is truly frequent. As we will see later, this nonrepetitive and exhaustive way of generating candidates can be interpreted in the context of a conceptual hierarchy of the patterns known as the enumeration tree. Another point to note is that the joins can usually be performed very efficiently. This efficiency is because, if the set Fk is sorted in lexicographic (dictionary) order, all itemsets with a common set of items in the first (k − 1) positions will appear contiguously, allowing them to be located easily.
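As a concrete sketch of this prefix-based join (illustrative Python, not the book's reference code; each frequent k-itemset is assumed to be stored as a lexicographically sorted tuple):

    def join_candidates(Fk, k):
        # Fk: set of frequent k-itemsets, each a lexicographically sorted tuple.
        # Two k-itemsets are joined only when they agree on their first (k - 1)
        # items, so each (k + 1)-candidate is generated exactly once.
        itemsets = sorted(Fk)
        candidates = set()
        for i in range(len(itemsets)):
            for j in range(i + 1, len(itemsets)):
                a, b = itemsets[i], itemsets[j]
                if a[:k - 1] != b[:k - 1]:
                    break  # sorted order: later itemsets cannot share this prefix
                candidates.add(a + (b[-1],))
        return candidates

For instance, join_candidates({('a','b','c'), ('a','b','d')}, 3) yields {('a','b','c','d')}. Sorting Fk makes all itemsets with a common (k − 1)-prefix contiguous, which is exactly the efficiency argument made above.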





A level-wise pruning trick can be used to further reduce the size of the (k + 1)-candidate set Ck+1. All the k-subsets (i.e., subsets of cardinality k) of an itemset I ∈ Ck+1 need to be present in Fk because of the downward closure property; otherwise, the itemset I is guaranteed not to be frequent. Therefore, it is checked whether all k-subsets of each itemset I ∈ Ck+1 are present in Fk. If this is not the case, then such itemsets I are removed from Ck+1.
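This check is a short set-membership test over the k-subsets of each candidate; a minimal sketch, again assuming sorted-tuple itemsets:

    from itertools import combinations

    def prune_candidates(candidates, Fk, k):
        # Keep a (k + 1)-candidate only if all of its (k + 1) subsets of
        # cardinality k appear among the frequent k-itemsets in Fk.
        return {c for c in candidates
                if all(s in Fk for s in combinations(c, k))}

Because the candidates are sorted tuples, itertools.combinations produces each k-subset in sorted order, so the membership test against Fk is direct.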

After the candidate itemsets Ck+1 of size (k + 1) have been generated, their support can be determined by counting the number of occurrences of each candidate in the transaction database T. Only the candidate itemsets that have the required minimum support are retained to create the set of frequent (k + 1)-itemsets Fk+1 ⊆ Ck+1. In the event that the set Fk+1 is empty, the algorithm terminates. At termination, the union F1 ∪ F2 ∪ · · · ∪ Fk of the frequent patterns of different sizes is reported as the final output of the algorithm.
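Putting the pieces together, a minimal end-to-end sketch of the loop in Fig. 4.2 might read as follows. It reuses the hypothetical join_candidates and prune_candidates sketches above, and it uses the naive subset-check counting that Sect. 4.4.2.1 improves upon:

    def count_support(candidates, transactions):
        # Naive support counting: test each candidate against each transaction.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            items = set(t)
            for c in candidates:
                if items.issuperset(c):
                    counts[c] += 1
        return counts

    def apriori(transactions, minsup):
        # F1: frequent 1-itemsets, represented as sorted 1-tuples.
        counts = {}
        for t in transactions:
            for item in t:
                counts[(item,)] = counts.get((item,), 0) + 1
        Fk = {c for c, n in counts.items() if n >= minsup}
        frequent, k = set(Fk), 1
        while Fk:
            Ck1 = prune_candidates(join_candidates(Fk, k), Fk, k)
            Fk = {c for c, n in count_support(Ck1, transactions).items()
                  if n >= minsup}
            frequent |= Fk
            k += 1
        return frequent  # the union F1 ∪ ... ∪ Fk

For example, apriori([{'a','b','c'}, {'a','b','d'}, {'a','b'}], minsup=2) returns {('a',), ('b',), ('a','b')}.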


The overall algorithm is illustrated in Fig. 4.2. The heart of the algorithm is an iterative loop that generates (k + 1)-candidates from frequent k-patterns for successively higher values of k and counts them. The three main operations of the algorithm are candidate generation, pruning, and support counting. Of these, the support counting process is the most expensive one because it depends on the size of the transaction database T. The level-wise approach ensures that the algorithm is relatively efficient, at least from a disk-access cost perspective. This is because each set of candidates in Ck+1 can be counted in a single pass over the data without the need for random disk accesses. The number of passes over the data is, therefore, equal to the cardinality of the longest frequent itemset in the data. Nevertheless, the counting procedure is still quite expensive, especially if one were to use the naive approach of checking whether each itemset is a subset of a transaction. Therefore, efficient support counting procedures are necessary.


4.4.2.1 Efficient Support Counting


To perform support counting, Apriori needs to examine efficiently whether each candidate itemset is present in a transaction. This is achieved with the use of a data structure known as the hash tree. The hash tree is used to carefully organize the candidate patterns in Ck+1 for more efficient counting. Assume that the items in the transactions and the candidate itemsets are sorted lexicographically. A hash tree is a tree in which the internal nodes have a fixed degree. Each internal node is associated with a random hash function that maps to the index of the different children of that node in the tree. A leaf node of the hash tree contains a list of lexicographically sorted itemsets, whereas an interior node contains a hash table. Every itemset in Ck+1 is contained in exactly one leaf node of the hash tree. The hash functions in the interior nodes are used to decide which candidate itemset belongs to which leaf node with the use of the methodology described below.


It may be assumed that all interior nodes use the same hash function f(·) that maps to [0 . . . h − 1]. The value of h is also the branching degree of the hash tree. A candidate itemset in Ck+1 is mapped to a leaf node of the tree by defining a path from the root to the leaf node with the use of these hash functions at the internal nodes. Assume that the root of the hash tree is at level 1, and all successive levels below it increase by 1. As before, assume that the items in the candidates and transactions are arranged in lexicographically sorted order. At an interior node at level i, the hash function is applied to the ith item of a candidate itemset to decide which branch of the hash tree to follow for the candidate itemset. The tree is constructed recursively in top-down fashion, and a minimum threshold is imposed on the number of candidates in the leaf node to decide where to terminate the hash tree extension. The candidate itemsets in the leaf node are stored in sorted order.
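One plausible way to realize this construction in code is sketched below. The class layout, the modulus hash, and the leaf_capacity parameter are illustrative assumptions rather than prescriptions from the text; items are taken to be integers so that f(item) = item mod h is well defined:

    class HashTreeNode:
        def __init__(self, h, leaf_capacity, level=1):
            self.h = h                      # branching degree = hash range
            self.leaf_capacity = leaf_capacity
            self.level = level              # the root is at level 1
            self.children = None            # hash table once the node is interior
            self.itemsets = []              # sorted candidate list while a leaf

        def _hash(self, item):
            return item % self.h            # illustrative f(.) into [0 .. h-1]

        def insert(self, itemset):
            if self.children is None:       # currently a leaf
                self.itemsets.append(itemset)
                # Split into an interior node once the leaf overflows, as long
                # as there is an item left to hash on at this level.
                if (len(self.itemsets) > self.leaf_capacity
                        and self.level <= len(itemset)):
                    self.children, pending = {}, self.itemsets
                    self.itemsets = []
                    for s in pending:
                        self._route(s)
                else:
                    self.itemsets.sort()
            else:
                self._route(itemset)

        def _route(self, itemset):
            # Hash the item at position `level` (1-indexed) to pick the branch.
            branch = self._hash(itemset[self.level - 1])
            child = self.children.setdefault(
                branch, HashTreeNode(self.h, self.leaf_capacity, self.level + 1))
            child.insert(itemset)

Building the tree for Ck+1 then reduces to a loop such as root = HashTreeNode(h=5, leaf_capacity=3), followed by root.insert(c) for each candidate c.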


To perform the counting, all candidate (k + 1)-itemsets in Ck+1 that are subsets of a transaction Tj ∈ T are discovered in a single exploration of the hash tree. To achieve this goal, all possible paths in the hash tree whose leaves might contain subsets of the transaction Tj are discovered using a recursive traversal. The selection of the relevant leaf nodes is performed as follows. At the root node, all branches are followed to which any of the items in the transaction Tj hash. At a given interior node, if the ith item of the transaction Tj was last hashed (at the parent node), then all items following it in the transaction are hashed to determine the possible children to follow. Thus, by following all these paths, the relevant leaf nodes in the tree are determined. The candidates in a leaf node are stored in sorted order and can be compared efficiently to the transaction Tj to determine whether they are subsets of it. This process is repeated for each transaction to determine the final support count of each itemset in Ck+1.
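Continuing the hypothetical HashTreeNode sketch above, the recursive traversal for a single transaction might look as follows. Matches are collected into a set so that a candidate reached along more than one path is still counted only once per transaction; this deduplication is an assumption of the sketch rather than a detail spelled out in the text:

    def matching_candidates(node, t, start, matched):
        # t: lexicographically sorted tuple of transaction items; start: index
        # of the first item that may be hashed at this node's level.
        if node.children is None:                 # leaf: verify each candidate
            items = set(t)
            for c in node.itemsets:
                if items.issuperset(c):
                    matched.add(c)
            return
        for i in range(start, len(t)):            # hash every remaining item
            child = node.children.get(node._hash(t[i]))
            if child is not None:
                matching_candidates(child, t, i + 1, matched)

    def count_with_hash_tree(root, transactions):
        counts = {}
        for t in transactions:
            matched = set()
            matching_candidates(root, tuple(sorted(t)), 0, matched)
            for c in matched:
                counts[c] = counts.get(c, 0) + 1
        return counts

The final subset check at the leaf is what the text refers to as comparing the sorted candidates against the transaction to determine whether they are relevant.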

