Data Mining: The Textbook



4.4. FREQUENT ITEMSET MINING ALGORITHMS


Algorithm Apriori(Transactions: T, Minimum Support: minsup)
begin
  k = 1;
  F1 = { All frequent 1-itemsets };
  while Fk is not empty do begin
    Generate Ck+1 by joining itemset-pairs in Fk;
    Prune itemsets from Ck+1 that violate downward closure;
    Determine Fk+1 by support counting on (Ck+1, T) and retaining
      itemsets from Ck+1 with support at least minsup;
    k = k + 1;
  end;
  return(F1 ∪ F2 ∪ · · · ∪ Fk);
end

Figure 4.2: The Apriori algorithm

For example, the two frequent 3-itemsets abc and abd, joined together on the two common items a and b, will yield the candidate 4-itemset abcd. Of course, it is possible to join other frequent patterns to create the same candidate. One might also join abc and bcd to achieve the same result. Suppose that all four of the 3-subsets of abcd are present in the set of frequent 3-itemsets. Then, one can create the candidate 4-itemset in C(4, 2) = 6 different ways, one for each pair of 3-subsets. To avoid redundancy in candidate generation, the convention is to impose a lexicographic ordering on the items and use the first (k − 1) items of the itemset for the join. Thus, in this case, the only way to generate abcd would be to join using the first two items a and b. Therefore, the itemsets abc and abd would need to be joined to create abcd. Note that if either abc or abd is not frequent, then abcd will not be generated as a candidate using this join approach. Furthermore, in such a case, it is assured that abcd is not frequent because of the downward closure property of frequent itemsets. Thus, the downward closure property ensures that the candidate set generated using this approach does not miss any itemset that is truly frequent. As we will see later, this nonrepetitive and exhaustive way of generating candidates can be interpreted in the context of a conceptual hierarchy of the patterns known as the enumeration tree. Another point to note is that the joins can usually be performed very efficiently. This efficiency is because, if the set Fk is sorted in lexicographic (dictionary) order, all itemsets with a common set of items in the first (k − 1) positions will appear contiguously, allowing them to be located easily.
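As a concrete sketch of this prefix-based join (illustrative Python, not the book's reference code; each frequent k-itemset is assumed to be stored as a lexicographically sorted tuple):

    def join_candidates(Fk, k):
        # Fk: set of frequent k-itemsets, each a lexicographically sorted tuple.
        # Two k-itemsets are joined only when they agree on their first (k - 1)
        # items, so each (k + 1)-candidate is generated exactly once.
        itemsets = sorted(Fk)
        candidates = set()
        for i in range(len(itemsets)):
            for j in range(i + 1, len(itemsets)):
                a, b = itemsets[i], itemsets[j]
                if a[:k - 1] != b[:k - 1]:
                    break  # sorted order: later itemsets cannot share this prefix
                candidates.add(a + (b[-1],))
        return candidates

For instance, join_candidates({('a','b','c'), ('a','b','d')}, 3) yields {('a','b','c','d')}. Sorting Fk makes all itemsets with a common (k − 1)-prefix contiguous, which is exactly the efficiency argument made above.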





A level-wise pruning trick can be used to further reduce the size of the (k + 1)-candidate set Ck+1. All the k-subsets (i.e., subsets of cardinality k) of an itemset I ∈ Ck+1 need to be present in Fk because of the downward closure property; otherwise, the itemset I is guaranteed not to be frequent. Therefore, it is checked whether all k-subsets of each itemset I ∈ Ck+1 are present in Fk. If this is not the case, then such itemsets I are removed from Ck+1.
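This check is a short set-membership test over the k-subsets of each candidate; a minimal sketch, again assuming sorted-tuple itemsets:

    from itertools import combinations

    def prune_candidates(candidates, Fk, k):
        # Keep a (k + 1)-candidate only if all of its (k + 1) subsets of
        # cardinality k appear among the frequent k-itemsets in Fk.
        return {c for c in candidates
                if all(s in Fk for s in combinations(c, k))}

Because the candidates are sorted tuples, itertools.combinations produces each k-subset in sorted order, so the membership test against Fk is direct.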

After the candidate itemsets Ck+1 of size (k + 1) have been generated, their support can be determined by counting the number of occurrences of each candidate in the transaction database T. Only the candidate itemsets that have the required minimum support are retained to create the set of frequent (k + 1)-itemsets Fk+1 ⊆ Ck+1. In the event that the set Fk+1 is empty, the algorithm terminates. At termination, the union F1 ∪ F2 ∪ · · · ∪ Fk of the frequent patterns of different sizes is reported as the final output of the algorithm.
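Putting the pieces together, a minimal end-to-end sketch of the loop in Fig. 4.2 might read as follows. It reuses the hypothetical join_candidates and prune_candidates sketches above, and it uses the naive subset-check counting that Sect. 4.4.2.1 improves upon:

    def count_support(candidates, transactions):
        # Naive support counting: test each candidate against each transaction.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            items = set(t)
            for c in candidates:
                if items.issuperset(c):
                    counts[c] += 1
        return counts

    def apriori(transactions, minsup):
        # F1: frequent 1-itemsets, represented as sorted 1-tuples.
        counts = {}
        for t in transactions:
            for item in t:
                counts[(item,)] = counts.get((item,), 0) + 1
        Fk = {c for c, n in counts.items() if n >= minsup}
        frequent, k = set(Fk), 1
        while Fk:
            Ck1 = prune_candidates(join_candidates(Fk, k), Fk, k)
            Fk = {c for c, n in count_support(Ck1, transactions).items()
                  if n >= minsup}
            frequent |= Fk
            k += 1
        return frequent  # the union F1 ∪ ... ∪ Fk

For example, apriori([{'a','b','c'}, {'a','b','d'}, {'a','b'}], minsup=2) returns {('a',), ('b',), ('a','b')}.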


The overall algorithm is illustrated in Fig. 4.2. The heart of the algorithm is an iterative loop that generates (k + 1)-candidates from frequent k-patterns for successively higher values of k and counts them. The three main operations of the algorithm are candidate generation, pruning, and support counting. Of these, the support counting process is the most expensive one because it depends on the size of the transaction database T. The level-wise approach ensures that the algorithm is relatively efficient, at least from a disk-access cost perspective. This is because each set of candidates in Ck+1 can be counted in a single pass over the data without the need for random disk accesses. The number of passes over the data is, therefore, equal to the cardinality of the longest frequent itemset in the data. Nevertheless, the counting procedure is still quite expensive, especially if one were to use the naive approach of checking whether each itemset is a subset of a transaction. Therefore, efficient support counting procedures are necessary.


4.4.2.1 Efficient Support Counting


To perform support counting, Apriori needs to examine efficiently whether each candidate itemset is present in a transaction. This is achieved with the use of a data structure known as the hash tree. The hash tree is used to carefully organize the candidate patterns in Ck+1 for more efficient counting. Assume that the items in the transactions and the candidate itemsets are sorted lexicographically. A hash tree is a tree in which the internal nodes have a fixed degree. Each internal node is associated with a random hash function that maps to the index of the different children of that node in the tree. A leaf node of the hash tree contains a list of lexicographically sorted itemsets, whereas an interior node contains a hash table. Every itemset in Ck+1 is contained in exactly one leaf node of the hash tree. The hash functions in the interior nodes are used to decide which candidate itemset belongs to which leaf node with the use of the methodology described below.


It may be assumed that all interior nodes use the same hash function f(·) that maps to [0 . . . h − 1]. The value of h is also the branching degree of the hash tree. A candidate itemset in Ck+1 is mapped to a leaf node of the tree by defining a path from the root to the leaf node with the use of these hash functions at the internal nodes. Assume that the root of the hash tree is at level 1, and all successive levels below it increase by 1. As before, assume that the items in the candidates and transactions are arranged in lexicographically sorted order. At an interior node at level i, the hash function is applied to the ith item of a candidate itemset to decide which branch of the hash tree to follow for the candidate itemset. The tree is constructed recursively in top-down fashion, and a minimum threshold is imposed on the number of candidates in the leaf node to decide where to terminate the hash tree extension. The candidate itemsets in the leaf node are stored in sorted order.
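One plausible way to realize this construction in code is sketched below. The class layout, the modulus hash, and the leaf_capacity parameter are illustrative assumptions rather than prescriptions from the text; items are taken to be integers so that f(item) = item mod h is well defined:

    class HashTreeNode:
        def __init__(self, h, leaf_capacity, level=1):
            self.h = h                      # branching degree = hash range
            self.leaf_capacity = leaf_capacity
            self.level = level              # the root is at level 1
            self.children = None            # hash table once the node is interior
            self.itemsets = []              # sorted candidate list while a leaf

        def _hash(self, item):
            return item % self.h            # illustrative f(.) into [0 .. h-1]

        def insert(self, itemset):
            if self.children is None:       # currently a leaf
                self.itemsets.append(itemset)
                # Split into an interior node once the leaf overflows, as long
                # as there is an item left to hash on at this level.
                if (len(self.itemsets) > self.leaf_capacity
                        and self.level <= len(itemset)):
                    self.children, pending = {}, self.itemsets
                    self.itemsets = []
                    for s in pending:
                        self._route(s)
                else:
                    self.itemsets.sort()
            else:
                self._route(itemset)

        def _route(self, itemset):
            # Hash the item at position `level` (1-indexed) to pick the branch.
            branch = self._hash(itemset[self.level - 1])
            child = self.children.setdefault(
                branch, HashTreeNode(self.h, self.leaf_capacity, self.level + 1))
            child.insert(itemset)

Building the tree for Ck+1 then reduces to a loop such as root = HashTreeNode(h=5, leaf_capacity=3), followed by root.insert(c) for each candidate c.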


To perform the counting, all candidate (k + 1)-itemsets in Ck+1 that are subsets of a transaction Tj ∈ T are discovered in a single exploration of the hash tree. To achieve this goal, all possible paths in the hash tree whose leaves might contain subsets of the transaction Tj are discovered using a recursive traversal. The selection of the relevant leaf nodes is performed as follows. At the root node, all branches are followed to which any of the items in the transaction Tj hash. At a given interior node, if the ith item of the transaction Tj was last hashed (at the parent node), then all items following it in the transaction are hashed to determine the possible children to follow. Thus, by following all these paths, the relevant leaf nodes in the tree are determined. The candidates in a leaf node are stored in sorted order and can be compared efficiently to the transaction Tj to determine whether they are subsets of it. This process is repeated for each transaction to determine the final support count of each itemset in Ck+1.
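Continuing the hypothetical HashTreeNode sketch above, the recursive traversal for a single transaction might look as follows. Matches are collected into a set so that a candidate reached along more than one path is still counted only once per transaction; this deduplication is an assumption of the sketch rather than a detail spelled out in the text:

    def matching_candidates(node, t, start, matched):
        # t: lexicographically sorted tuple of transaction items; start: index
        # of the first item that may be hashed at this node's level.
        if node.children is None:                 # leaf: verify each candidate
            items = set(t)
            for c in node.itemsets:
                if items.issuperset(c):
                    matched.add(c)
            return
        for i in range(start, len(t)):            # hash every remaining item
            child = node.children.get(node._hash(t[i]))
            if child is not None:
                matching_candidates(child, t, i + 1, matched)

    def count_with_hash_tree(root, transactions):
        counts = {}
        for t in transactions:
            matched = set()
            matching_candidates(root, tuple(sorted(t)), 0, matched)
            for c in matched:
                counts[c] = counts.get(c, 0) + 1
        return counts

The final subset check at the leaf is what the text refers to as comparing the sorted candidates against the transaction to determine whether they are relevant.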

