conf (X2 ⇒ I − X2) ≥ conf (X1 ⇒ I − X1) (4.3)
This property follows directly from the definition of confidence and the property of support monotonicity. Consider the rules {Bread} ⇒ {Butter, Milk} and {Bread, Butter} ⇒ {Milk}. The second rule is redundant with respect to the first because it has the same support, but a confidence that is no less than that of the first. Because of confidence monotonicity, it is possible to report only the non-redundant rules. This issue is discussed in detail in the next chapter.
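The confidence monotonicity of Eq. 4.3 can be checked directly on a small example. The following sketch uses a hypothetical toy transaction database (the transactions and the minimum-support-free computation are illustrative assumptions, not data from the text) to compute the confidences of the two rules above:

```python
# A hypothetical toy transaction database for illustration only.
transactions = [
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter"},
    {"Bread", "Milk"},
    {"Bread"},
    {"Butter", "Milk"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(antecedent, consequent, db):
    """conf(X => Y) = sup(X union Y) / sup(X)."""
    return support(antecedent | consequent, db) / support(antecedent, db)

I = {"Bread", "Butter", "Milk"}
# X1 = {Bread} is a subset of X2 = {Bread, Butter}, as in Eq. 4.3.
c1 = confidence({"Bread"}, I - {"Bread"}, transactions)
c2 = confidence({"Bread", "Butter"}, I - {"Bread", "Butter"}, transactions)

print(c1, c2)  # c2 >= c1, as Eq. 4.3 guarantees
```

Both rules share the support of {Bread, Butter, Milk}; only the denominator changes, and it can only shrink as the antecedent grows, which is exactly why the confidence cannot decrease.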
4.4 Frequent Itemset Mining Algorithms
In this section, a number of popular algorithms for frequent itemset generation will be discussed. Because there are a large number of frequent itemset mining algorithms, the focus of the chapter will be to discuss specific algorithms in detail to introduce the reader to the key tricks in algorithmic design. These tricks are often reusable across different algorithms because the same enumeration tree framework is used by virtually all frequent pattern mining algorithms.
4.4.1 Brute Force Algorithms
For a universe of items U, there are a total of 2^|U| − 1 distinct subsets, excluding the empty set. All 2^5 − 1 = 31 subsets for a universe of 5 items are illustrated in Fig. 4.1. Therefore, one possibility would be to generate all these candidate itemsets, and count their support against the transaction database T. In the frequent itemset mining literature, the term candidate itemsets is commonly used to refer to itemsets that might possibly be frequent (or candidates for being frequent). These candidates need to be verified against the transaction database by support counting. To count the support of an itemset, we would need to check whether a given itemset I is a subset of each transaction Ti ∈ T. Such an exhaustive approach is likely to be impractical when the universe of items U is large. Consider the case where d = |U| = 1000. In that case, there are a total of 2^1000 > 10^300 candidates. To put this number in perspective, if the fastest computer available today were somehow able to process one candidate in one elementary machine cycle, then the time required to process all candidates would be hundreds of orders of magnitude greater than the age of the universe. Therefore, this is not a practical solution.
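For small universes, the brute-force scheme is easy to state in code: enumerate every one of the 2^|U| − 1 candidates and count its support by a subset check against each transaction. The sketch below uses a hypothetical five-transaction database and an assumed minimum support threshold of 0.4; both are illustrative choices, not values from the text.

```python
from itertools import combinations

# Hypothetical toy database; item names are illustrative only.
transactions = [
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter"},
    {"Bread", "Milk"},
    {"Bread"},
    {"Butter", "Milk"},
]
min_sup = 0.4  # assumed relative minimum support threshold

universe = sorted(set().union(*transactions))

# Generate all 2^|U| - 1 candidate itemsets and count the support of each
# by checking containment in every transaction.
frequent = {}
for k in range(1, len(universe) + 1):
    for candidate in combinations(universe, k):
        sup = sum(1 for t in transactions if set(candidate) <= t) / len(transactions)
        if sup >= min_sup:
            frequent[candidate] = sup

print(frequent)
```

The inner subset check is exactly the support-counting step described above; the exponential blow-up comes entirely from the outer enumeration over candidates, which is why the approach fails for large |U|.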
Of course, one can make the brute-force approach faster by observing that no (k + 1)-patterns are frequent if no k-patterns are frequent. This observation follows directly from the downward closure property. Therefore, one can enumerate and count the support of all the patterns with increasing length. In other words, one can enumerate and count the support of all patterns containing one item, two items, and so on, until, for a certain length l, none of the candidates of length l turn out to be frequent. At this point, one can terminate. For sparse transaction databases, the value of l is typically very small compared to |U|. This is a significant improvement over the previous approach because it requires the enumeration
of only ∑_{i=1}^{l} (|U| choose i) ≪ 2^|U| candidates. Because the longest frequent itemset is of much smaller length than |U| in sparse transaction databases, this approach can be orders of magnitude faster.
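The level-wise refinement described above can be sketched as follows. The database and threshold are hypothetical, and the function name is an illustrative choice; the point is only the early-termination logic justified by downward closure.

```python
from itertools import combinations

def levelwise_brute_force(db, min_sup):
    """Enumerate candidates of length 1, 2, ... and stop at the first
    length l at which no candidate is frequent; by downward closure,
    no longer pattern can be frequent either. A sketch, not an
    optimized frequent pattern miner."""
    universe = sorted(set().union(*db))
    threshold = min_sup * len(db)  # absolute support count
    frequent = {}
    for k in range(1, len(universe) + 1):
        level = {}
        for candidate in combinations(universe, k):
            count = sum(1 for t in db if set(candidate) <= t)
            if count >= threshold:
                level[candidate] = count
        if not level:  # no frequent k-patterns: terminate early
            break
        frequent.update(level)
    return frequent

# Hypothetical example database and threshold.
db = [
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter"},
    {"Bread", "Milk"},
    {"Bread"},
    {"Butter", "Milk"},
]
print(levelwise_brute_force(db, 0.4))
```

Only candidates of length at most l are ever generated, which matches the ∑_{i=1}^{l} (|U| choose i) bound on the number of enumerated candidates.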