Data Mining: The Textbook




Optimized counting at deeper-level nodes: The projection-based approach enables specialized counting techniques at deeper-level nodes near the leaves of the enumeration tree. These specialized counting methods can provide the counts of all the itemsets in a lower-level subtree in the time required to scan the projected database. Because such nodes are more numerous, this can lead to large computational improvements.
What is the point at which such counting methods can be used? When the number of frequent extensions F(P) of a node P falls below a threshold t such that 2^t buckets fit in memory, an approach known as bucketing can be used. To obtain the best computational results, the value of t used should be such that 2^t is much smaller than the number of transactions in the projected database. This can occur only when there are many repeated transactions in the projected database.

A two-phase approach is used. In the first phase, the count of each distinct transaction in the projected database is determined. This can be accomplished easily by maintaining 2^|F(P)| buckets (counters), scanning the transactions one by one, and adding counts to the buckets. This phase can be completed in a single scan of the small (projected) database of transactions. Of course, this process provides only transaction counts and not itemset counts.
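The first phase amounts to hashing each projected transaction, viewed as a bitstring over F(P), into one of 2^|F(P)| counters. A minimal sketch in Python, with transactions represented as integer bitmasks (the function and variable names here are illustrative, not from the text):

```python
def count_distinct_transactions(transactions, num_items):
    """Phase 1 of bucketing: a single pass over the projected database.

    Each transaction is an integer bitmask over the num_items = |F(P)|
    frequent extensions of node P.  On return, bucket[i] holds the
    number of transactions whose bitmask equals i.
    """
    bucket = [0] * (1 << num_items)  # 2^|F(P)| counters
    for t in transactions:
        bucket[t] += 1
    return bucket
```

For example, with four frequent extensions and the repeated transaction 0b1010 appearing twice, bucket[0b1010] ends up as 2 after one scan.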


In the second phase, the transaction frequency counts can be further aggregated in a systematic way to create itemset frequency counts. Conceptually, the process of aggregating projected transaction counts is similar to arranging all the 2^|F(P)| possibilities in the form of a lattice, as illustrated in Fig. 4.1. The counts of the lattice nodes, which are computed in the first phase, are aggregated up the lattice structure by adding the count of immediate supersets to their subsets. For small values of |F(P)|, such as 10, this phase is not the limiting computational factor, and the overall time is dominated by that required to scan the projected database in the first phase. An efficient implementation of the second phase is discussed in detail below.


Consider a string composed of 0, 1, and * that refers to an itemset in which the positions with 0 and 1 are fixed to those values (corresponding to presence or absence of items), whereas a position with a * is a "don't care." Thus, all transactions can be expressed in terms of 0 and 1 in their binary representation. On the other hand, all itemsets can be expressed in terms of 1 and * because itemsets are traditionally defined with respect to the presence of items and are ambiguous with respect to absence. Consider, for example, the case when |F(P)| = 4, and there are four items, numbered {1, 2, 3, 4}. An itemset containing items 2 and 4 is denoted by *1*1. We start with the information on the 2^4 = 16 bitstrings that are composed of 0 and 1. These represent all possible distinct transactions. The algorithm aggregates the counts in |F(P)| iterations. The count for a string with a "*" in a particular position may be obtained by adding the counts for the strings with a 0 and a 1 in that position. For example, the count for the string *1*1 may be expressed as the sum of the counts of the strings 01*1 and 11*1. The positions may be processed in any order, although the simplest approach is to aggregate them from the least significant to the most significant.
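The 0/1/* notation can be made concrete with a small sketch. A string is encoded by two bitmasks: one marking which positions are fixed (not "don't cares") and one giving the required bits there. The function below is a naive specification of the count of such a string, not the fast aggregation; its name and encoding are my own for illustration:

```python
def pattern_count(bucket, fixed_mask, values):
    """Count transactions matching a 0/1/* string.

    fixed_mask has a 1 at every position fixed to 0 or 1 (i.e., not a
    "don't care"), and values gives the required bit at those positions.
    bucket[t] holds the number of transactions whose bitstring is the
    integer t.
    """
    return sum(cnt for t, cnt in enumerate(bucket)
               if (t & fixed_mask) == values)
```

With this encoding, the identity from the text, count(*1*1) = count(01*1) + count(11*1), can be checked directly: the two strings on the right additionally fix the most significant position to 0 and 1, respectively.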


A simple pseudocode to perform the aggregation is described below. In this pseudocode, k denotes |F(P)|, and the initial value of bucket[i] is equal to the count of the transaction corresponding to the bitstring representation of integer i. The final value of bucket[i] is one in which the transaction count has been converted to an itemset count by successive aggregation. In other words, the 0s in the bitstring are replaced by "don't cares."


for i := 1 to k do begin
    for j := 0 to 2^k − 1 do begin
        if the ith bit of the bitstring representation
            of j is 0 then bucket[j] := bucket[j] + bucket[j + 2^(i−1)];
    endfor
endfor
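The pseudocode above translates directly into a few lines of Python. The sketch below (the names are mine, not the book's) performs the second-phase aggregation in k · 2^k operations; the final bucket[j] is the support of the itemset whose items are the 1-bits of j:

```python
def bucketing_aggregate(bucket, k):
    """Phase 2 of bucketing: turn transaction counts into itemset counts.

    On entry, bucket[j] is the number of transactions whose bitstring
    equals j.  After processing bit position i, a 0 in position i of a
    bucket index behaves as a "don't care", so on return bucket[j] is
    the total count of all transactions containing the items of j.
    """
    bucket = list(bucket)            # work on a copy
    for i in range(k):               # one bit position per iteration
        for j in range(1 << k):
            if not (j >> i) & 1:     # ith bit of j is 0
                bucket[j] += bucket[j | (1 << i)]
    return bucket
```

This is the standard superset-sum transform over the lattice of Fig. 4.1: each pass adds the count of the immediate superset (the index with the 0-bit flipped to 1) into its subset.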

An example of bucketing for |F(P)| = 4 is illustrated in Fig. 4.6. The bucketing trick is commonly performed at the lower nodes of the tree because the value of |F(P)| falls drastically at the lower levels. Because the nodes at the lower levels dominate the total number of nodes in the enumeration-tree structure, the impact of bucketing can be very significant.


Figure 4.6: Performing the second phase of bucketing

Optimizations for maximal pattern mining: The DepthProject method, which is a depth-first variant of the approach, is particularly adaptable for maximal pattern discovery. In this case, the enumeration tree is explored in depth-first order to maximize the advantages of pruning the search space of regions containing only non-maximal patterns. The order of construction of the enumeration tree is important in the particular case of maximal frequent pattern mining because certain kinds of non-maximal search-space pruning are optimized with the depth-first order. The notion of lookaheads is one such optimization.
Let C(P ) be the set of candidate item extensions of node P . Before support counting, it is tested whether P ∪ C(P ) is a subset of a frequent pattern that has already been found. If such is indeed the case, then the pattern P ∪ C(P ) is a non-maximal frequent pattern, and the entire subtree (of the enumeration tree) rooted at P can be pruned. This kind of pruning is referred to as superset-based pruning. When P cannot be pruned, the supports of its candidate extensions need to be determined. During this support counting, the support of P ∪ C(P ) is counted along with the individual item extensions of P . If P ∪ C(P ) is found to be frequent, then it eliminates any further work of counting the support of (non-maximal) nodes in the subtree rooted at node P .
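The superset-based pruning test admits a very short sketch. The function below (an illustrative fragment, not the book's DepthProject code) decides whether the subtree rooted at P can be pruned, given the frequent patterns already found:

```python
def lookahead_prune(P, C_P, frequent_patterns):
    """Superset-based pruning test for maximal pattern mining.

    The subtree rooted at node P can be pruned when P ∪ C(P) is a
    subset of some already-discovered frequent pattern, because every
    pattern in that subtree is then non-maximal.
    """
    full = frozenset(P) | frozenset(C_P)
    return any(full <= frozenset(f) for f in frequent_patterns)
```

For instance, if {1, 2, 3, 4, 5} has already been found frequent, then the node P = {1, 2} with candidate extensions C(P) = {3, 4} is pruned without any support counting.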

While lookaheads can also be used with breadth-first algorithms, they are more effective with a depth-first strategy. In depth-first methods, longer patterns tend to be found first and are, therefore, already available in the frequent set for superset-based pruning. For example, consider a frequent pattern of length 20 with 2^20 subsets. In a depth-first strategy, it can be shown that the pattern of length 20 will be discovered after exploring only 19 of its immediate prefixes. On the other hand, a breadth-first method may remain trapped by the discovery of shorter patterns. Therefore, the longer patterns become available very early in depth-first methods such as DepthProject to prune large portions of the enumeration tree with superset-based pruning.


4.4.3.3 Vertical Counting Methods

The

