length than |U| in sparse transaction databases, this approach is orders of magnitude faster. However, the resulting computational complexity is still not satisfactory for large values of |U|. For example, when |U| = 1000 and l = 10, the value of $\sum_{i=1}^{10} \binom{|U|}{i}$ is of the order of $10^{23}$.
This value is still quite large and outside reasonable computational capabilities available today.
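As a quick sanity check on this figure, the sum can be evaluated directly (a throwaway computation, not part of any mining algorithm):

```python
import math

# Number of candidate itemsets of length at most 10 over |U| = 1000 items.
total = sum(math.comb(1000, i) for i in range(1, 11))
print(f"{total:.2e}")  # about 2.66e+23, i.e., of the order of 10^23
```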
One observation is that even a very minor and rather blunt application of the downward closure property made the algorithm hundreds of orders of magnitude faster. Many of the fast algorithms for itemset generation use the downward closure property in a more refined way, both to generate the candidates and to prune them before counting. Algorithms for
frequent pattern mining search the lattice of possibilities (or candidates) for frequent patterns (see Fig. 4.1) and use the transaction database to count the support of candidates in this lattice. Better efficiencies can be achieved in a frequent pattern mining algorithm by using one or more of the following approaches:
• Reducing the size of the explored search space (the lattice of Fig. 4.1) by pruning candidate itemsets (lattice nodes) using tricks such as the downward closure property.
• Counting the support of each candidate more efficiently by pruning transactions that are known to be irrelevant for counting a candidate itemset.
• Using compact data structures to represent either candidates or transaction databases that support efficient counting.
The first algorithm to achieve effective pruning of the search space with the downward closure property was the Apriori algorithm.
4.4.2 The Apriori Algorithm
The Apriori algorithm uses the downward closure property to prune the candidate search space. The downward closure property imposes a clear structure on the set of frequent patterns. In particular, information about the infrequency of itemsets can be leveraged to generate superset candidates more carefully. Thus, if an itemset is infrequent, there is little point in counting the support of its superset candidates. This avoids wasteful counting of the supports of itemsets that are known not to be frequent. The Apriori algorithm generates candidates of smaller length k first and counts their supports before generating candidates of length (k + 1). The resulting frequent k-itemsets are used to restrict the number of (k + 1)-candidates with the downward closure property. Candidate generation and support counting of patterns with increasing length are interleaved in Apriori. Because the counting of candidate supports is the most expensive part of the frequent pattern generation process, it is extremely important to keep the number of candidates low.
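The following is a minimal, self-contained Python sketch of this level-wise structure (names such as apriori and minsup are illustrative, not taken from the text). It interleaves candidate generation, downward-closure pruning, and support counting in the order just described:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise Apriori sketch: returns frequent itemsets with their counts."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Count supports of individual items to obtain the frequent 1-itemsets F1.
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    frequent = {s: c for s, c in counts.items() if c / n >= minsup}
    result = dict(frequent)
    k = 1
    while frequent:
        # Join frequent k-itemsets sharing (k - 1) items into (k + 1)-candidates,
        # pruning any candidate with an infrequent k-subset (downward closure).
        fk = list(frequent)
        candidates = set()
        for i in range(len(fk)):
            for j in range(i + 1, len(fk)):
                union = fk[i] | fk[j]
                if len(union) == k + 1 and all(
                    frozenset(s) in frequent for s in combinations(union, k)
                ):
                    candidates.add(union)
        # Support counting: the expensive step that the pruning is meant to contain.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c / n >= minsup}
        result.update(frequent)
        k += 1
    return result

# Example: with minimum support 0.5, {a, b, c} survives all three levels.
print(apriori([{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "b", "c"}], 0.5))
```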
For ease in description of the algorithm, it will be assumed that the items in U have a lexicographic ordering, and therefore an itemset {a, b, c, d} can be treated as a (lexicographically ordered) string abcd of items. This can be used to impose an ordering among itemsets (patterns), which is the same as the order in which the corresponding strings would appear in a dictionary.
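As a small illustration (not from the text), representing each itemset as a sorted tuple makes Python's built-in tuple comparison coincide with this dictionary order:

```python
# Sorted tuples compare element by element, i.e., like strings in a dictionary.
itemsets = [("a", "b", "d"), ("a", "c", "d"), ("a", "b", "c")]
print(sorted(itemsets))  # [('a', 'b', 'c'), ('a', 'b', 'd'), ('a', 'c', 'd')]
```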
The Apriori algorithm starts by counting the supports of the individual items to generate the frequent 1-itemsets. The frequent 1-itemsets are combined to create candidate 2-itemsets, whose support is counted. The frequent 2-itemsets are retained. In general, the frequent itemsets of length k are used to generate the candidates of length (k + 1) for increasing values of k. Algorithms that count the support of candidates with increasing length are referred to as level-wise algorithms. Let Fk denote the set of frequent k-itemsets, and Ck denote the set of candidate k-itemsets. The core of the approach is to iteratively generate the (k + 1)-candidates in Ck+1 from the frequent k-itemsets in Fk already found by the algorithm. The frequencies of these (k + 1)-candidates are counted with respect to the transaction database. While generating the (k + 1)-candidates, the search space may be pruned by checking whether all k-subsets of each candidate in Ck+1 are included in Fk. So, how does one generate the relevant (k + 1)-candidates in Ck+1 from the frequent k-patterns in Fk?
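A sketch of this pruning check, assuming Fk is available as a set of frozensets (the function and variable names are illustrative):

```python
from itertools import combinations

def prune(candidates, frequent_k, k):
    """Keep only (k + 1)-candidates all of whose k-subsets lie in Fk."""
    return [c for c in candidates
            if all(frozenset(s) in frequent_k for s in combinations(c, k))]

# With F2 = {ab, ac, bc, bd}, candidate abc survives, but abd is pruned
# because its 2-subset ad is missing from F2.
F2 = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]}
print(prune([frozenset("abc"), frozenset("abd")], F2, 2))
```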
If a pair of itemsets X and Y in Fk have (k − 1) items in common, then a join between them using the (k − 1) common items will create a candidate itemset of size (k + 1). For example, the two 3-itemsets {a, b, c} (or abc for short) and {a, b, d} (or abd for short), when joined on the two common items a and b, yield the candidate 4-itemset abcd.
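In the classic formulation, the two itemsets are joined when they agree on their first (k − 1) items in lexicographic order; here is a minimal sketch under that assumption:

```python
def join(x, y):
    """Join two lexicographically ordered k-itemsets (tuples) that agree on
    their first (k - 1) items, producing a (k + 1)-candidate."""
    if x[:-1] != y[:-1]:
        return None  # not joinable: the first (k - 1) items differ
    return x[:-1] + tuple(sorted((x[-1], y[-1])))

print(join(("a", "b", "c"), ("a", "b", "d")))  # ('a', 'b', 'c', 'd')
```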