Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	67/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 63 64 65 66 67 68 69 70 ... 423

1-Data Mining tarjima

i

length than |U | in sparse transaction databases, this approach is orders of magnitude faster.

However, the resulting computational complexity is still not satisfactory for large values of

U . For example, when \|U \| = 1000 and l = 10, the value of	10		is of the order of 10	23	.
	i=1 i
		\|U\|

This value is still quite large and outside reasonable computational capabilities available today.

One observation is that even a very minor and rather blunt application of the downward closure property made the algorithm hundreds of orders of magnitude faster. Many of the fast algorithms for itemset generation use the downward closure property in a more refined way, both to generate the candidates and to prune them before counting. Algorithms for

100 CHAPTER 4. ASSOCIATION PATTERN MINING

frequent pattern mining search the lattice of possibilities (or candidates) for frequent pat-terns (see Fig. 4.1) and use the transaction database to count the support of candidates in this lattice. Better eﬃciencies can be achieved in a frequent pattern mining algorithm by using one or more of the following approaches:

Reducing the size of the explored search space (lattice of Fig. 4.1) by pruning candidate itemsets (lattice nodes) using tricks, such as the downward closure property.

Counting the support of each candidate more eﬃciently by pruning transactions that are known to be irrelevant for counting a candidate itemset.

Using compact data structures to represent either candidates or transaction databases that support eﬃcient counting.

The first algorithm that used an eﬀective pruning of the search space with the use of the downward closure property was the Apriori algorithm.

4.4.2 The Apriori Algorithm

The Apriori algorithm uses the downward closure property in order to prune the candidate search space. The downward closure property imposes a clear structure on the set of frequent patterns. In particular, information about the infrequency of itemsets can be leveraged to generate the superset candidates more carefully. Thus, if an itemset is infrequent, there is little point in counting the support of its superset candidates. This is useful for avoiding wasteful counting of support levels of itemsets that are known not to be frequent. The Apriori algorithm generates candidates with smaller length k first and counts their supports before generating candidates of length (k + 1). The resulting frequent k -itemsets are used to restrict the number of (k + 1)-candidates with the downward closure property. Candidate generation and support counting of patterns with increasing length is interleaved in Apriori. Because the counting of candidate supports is the most expensive part of the frequent pattern generation process, it is extremely important to keep the number of candidates low.

For ease in description of the algorithm, it will be assumed that the items in U have a lexicographic ordering, and therefore an itemset {a, b, c, d} can be treated as a (lexicograph-ically ordered) string abcd of items. This can be used to impose an ordering among itemsets (patterns), which is the same as the order in which the corresponding strings would appear in a dictionary.

The Apriori algorithm starts by counting the supports of the individual items to generate the frequent 1-itemsets. The 1-itemsets are combined to create candidate 2-itemsets, whose support is counted. The frequent 2 -itemsets are retained. In general, the frequent itemsets of length k are used to generate the candidates of length (k + 1) for increasing values of k. Algorithms that count the support of candidates with increasing length are referred to as level-wise algorithms. Let F_k denote the set of frequent k-itemsets, and C_k denote the set of candidate k-itemsets. The core of the approach is to iteratively generate the (k + 1)-candidates C_k₊₁ from frequent k-itemsets in F_k already found by the algorithm. The frequencies of these (k + 1)-candidates are counted with respect to the transaction database. While generating the (k + 1)-candidates, the search space may be pruned by checking whether all k-subsets of C_k₊₁ are included in F_k. So, how does one generate the relevant (k + 1)-candidates in C_k₊₁ from frequent k-patterns in F_k?

If a pair of itemsets X and Y in F_k have (k − 1) items in common, then a join between them using the (k − 1) common items will create a candidate itemset of size (k + 1). For example, the two 3-itemsets {a, b, c} (or abc for short) and {a, b, d} (or abd for short), when

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 63 64 65 66 67 68 69 70 ... 423