Data Mining: The Textbook


Partition-1, but small enough to be processed in core by



Yüklə 17,13 Mb.
səhifə75/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   71   72   73   74   75   76   77   78   ...   423
1-Data Mining tarjima

Partition-1, but small enough to be processed in core by Eclat. In such cases, Eclat is faster than Partition. Note that the number of computational operations for support counting in Partition-1 is fundamentally no different from that of Eclat because the tid list intersections between any pair of itemsets remain the same. Furthermore, Eclat implicitly assumes an upper bound on the database size. This is because it assumes that multiple tid lists, each of size at least a fraction minsup of the number of database records, fit in main memory. The cumulative memory overhead of the multiple tid lists always scales proportionally with database size, whereas the memory overhead of the ensemble-based Partition algorithm is independent of database size.
4.4.4 Recursive Suffix-Based Pattern Growth Methods

Enumeration trees are constructed by extending prefixes of itemsets that are expressed in a lexicographic order. It is also possible to express some classes of itemset exploration meth-ods recursively with suffix- based exploration. Although recursive pattern-growth is often understood as a completely different class of methods, it can be viewed as a special case of the generic enumeration-tree algorithm presented in the previous section. This relationship between recursive pattern-growth methods and enumeration-tree methods will be explored in greater detail in Sect. 4.4.4.5.



4.4. FREQUENT ITEMSET MINING ALGORITHMS

113

Recursive suffix-based pattern growth methods are generally understood in the context of the well-known FP-Tree data structure. While the FP-Tree provides a space- and time-efficient way to implement the recursive pattern exploration, these methods can also be implemented with the use of arrays and pointers. This section will present the recursive pattern growth approach in a simple way without introducing any specific data structure. We also present a number of simplified implementations3 with various data structures to facilitate better understanding. The idea is to move from the simple to the complex by providing a top-down data structure-agnostic presentation, rather than a tightly integrated presentation with the commonly used FP-Tree data structure. This approach provides a clear understanding of how the search space of patterns is explored and the relational with conventional enumeration tree algorithms.


Consider the transaction database T which is expressed in terms of only frequent 1-items. It is assumed that a counting pass has already been performed on T to remove the infrequent items and count the supports of the items. Therefore, the input to the recursive procedure described here is slightly different from the other algorithms discussed in this chapter in which this database pass has not been performed. The items in the database are ordered with decreasing support. This lexicographic ordering is used to define the ordering of items within itemsets and transactions. This ordering is also used to define the notion of prefixes and suffixes of itemsets and transactions. The input to the algorithm is the transaction database T (expressed in terms of frequent 1-items), a current frequent itemset suffix P , and the minimum support minsup. The goal of a recursive call to the algorithm is to determine all the frequent patterns that have the suffix P . Therefore, at the top-level recursive call of the algorithm, the suffix P is empty. At deeper-level recursive calls, the suffix P is not empty. The assumption for deeper-level calls is that T contains only those transactions from the original database that include the itemset P . Furthermore, each transaction in T is represented using only those frequent extension items of P that are lexicographically smaller than all items of P . Therefore T is a conditional transaction set, or projected database with respect to suffix P . This suffix-based projection is similar to the prefix-based projection in TreeProjection and DepthProject.


In any given recursive call, the first step is to construct the itemset Pi = {i} ∪ P by concatenating each item i in the transaction database T to the beginning of suffix P , and reporting it as frequent. The itemset Pi is frequent because T is defined in terms of frequent items of the projected database of suffix P . For each item i, it is desired to further extend Pi by using a recursive call with the projected database of the (newly extended) frequent suffix Pi . The projected database for extended suffix Pi is denoted by Ti , and it is created as follows. The first step is to extract all transactions from T that contain the item i. Because it is desired to extend the suffix Pi backwards, all items that are lexicographically greater than or equal to i are removed from the extracted transactions in Ti. In other words, the part of the transaction occurring lexicographically after (and including) i is not relevant for counting frequent patterns ending in Pi. The frequency of each item in Ti is counted, and the infrequent items are removed.


It is easy to see that the transaction set Ti is sufficient to generate all the frequent patterns with Pi as a suffix. The problem of finding all frequent patterns ending in Pi using the transaction set Ti is an identical but smaller problem than the original one on T . Therefore, the original procedure is called recursively with the smaller projected database Ti and extended suffix Pi. This procedure is repeated for each item i in T .








  • Variations of these strategies are actually used in some implementations of these methods. We stress that the simplified versions are not optimized for efficiency but are provided for clarity.

114 CHAPTER 4. ASSOCIATION PATTERN MINING

Algorithm RecursiveSuffixGrowth(Transactions in terms of frequent 1-items: T ,


Minimum Support: minsup, Current Suffix: P )


begin
for each item i in T do begin


report itemset Pi = {i} ∪ P as frequent;
Extract all transactions Ti from T containing item i;
Remove all items from Ti that are lexicographically ≥ i;
Remove all infrequent items from Ti;

if (Ti = φ) then RecursiveSuffixGrowth(Ti, minsup, Pi); end


end

Figure 4.8: Generic recursive suffix growth on transaction database expressed in terms of frequent 1-items

The projected transaction set Ti will become successively smaller at deeper levels of the recursion in terms of the number of items and the number of transactions. As the number of transactions reduces, all items in it will eventually fall below the minimum support, and the resulting projected database (constructed on only the frequent items) will be empty. In such cases, a recursive call with Ti is not initiated; therefore, this branch of the recursion is not explored. For some data structures, such as the FP-Tree, it is possible to impose stronger boundary conditions to terminate the recursion even earlier. This boundary condition will be discussed in a later section.


The overall recursive approach is presented in Fig. 4.8. While the parameter minsup has always been assumed to be a (relative) fractional value in this chapter, it is assumed to be an absolute integer support value in this section and in Fig. 4.8. This deviation from the usual convention ensures consistency of the minimum support value across different recursive calls in which the size of the conditional transaction database reduces.


4.4.4.1 Implementation with Arrays but No Pointers


So, how can the projected database T be decomposed into the conditional transaction sets T1 . . . T d, corresponding to d different 1-item suffixes? The simplest solution is to use arrays. In this solution, the original transaction database T and the conditional transaction sets T1 . . . Td can be represented in arrays. The transaction database T may be scanned within the “for” loop of Fig. 4.8, and the set Ti is created from T . The infrequent items from Ti are removed within the loop. However, it is expensive and wasteful to repeatedly scan the database T inside a “for” loop. One alternative is to extract all projections Ti of T corresponding to the different suffix items simultaneously in a single scan of the database just before the “for” loop is initiated. On the other hand, the simultaneous creation of many such item-specific projected data sets can be memory-intensive. One way of obtaining an excellent trade-off between computational and storage requirements is by using pointers. This approach is discussed in the next section.


4.4.4.2 Implementation with Pointers but No FP-Tree


The array-based solution either needs to repeatedly scan the database T or simultaneously create many smaller item-specific databases in a single pass. Typically, the latter achieves




Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   71   72   73   74   75   76   77   78   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin