Figure 4.9: Illustration of recursive pattern growth with pointers and no FP-Tree
better efficiency but is more memory-intensive. One simple solution to this dilemma is to set up a data structure in the form of pointers in the first pass, which implicitly stores the decomposition of T into different item-specific data sets at a lower memory cost. This data structure is set up at the time that infrequent items are removed from the transaction database T , and then utilized for extracting different conditional transaction sets Ti from T . For each item i in T , a pointer threads through the transactions containing that item in lexicographically sorted (dictionary) order. In other words, after arranging the database T in lexicographically sorted order, each item i in each transaction has a pointer to the same item i in the next transaction that contains it. Because a pointer is required at each item in each transaction, the storage overhead in this case is proportional to that of the original transaction database T . An additional optimization is to consolidate repeated transactions and store them with their counts. An example of a sample database with nine transactions on the five items {a, b, c, d, e} is illustrated in Fig. 4.9. It is clear from the figure that there are five sets of pointers, one for each item in the database.
After the pointers have been set up, Ti is extracted by just “chasing” the pointer thread for item i. The time for doing this is proportional to the number of transactions in Ti. The infrequent items in Ti are removed, and the pointers for the conditional transaction data need to be reconstructed to create a conditional pointer base which is basically the conditional transaction set augmented with pointers. The modified pseudocode with the use of pointers is illustrated in Fig. 4.10. Note that the only difference between the pseudocode of Figs. 4.8 and 4.10 is the setting up of pointers after extraction of conditional transaction sets and the use of these pointers to efficiently extract the conditional transaction data sets Ti. A recursive call is initiated at the next level with the extended suffix Pi = {i} ∪ P , and conditional database Ti.
To illustrate how Ti can be extracted, an example of a transaction database with 5 items and 9 transactions is illustrated in Fig. 4.9. For simplicity, we use a (raw) minimum support value of 1. The transactions corresponding to the item c are extracted, and the irrelevant suffix including and after item c are removed for further recursive calls. Note that this leads to shorter transactions, some of which are repeated. As a result, the conditional database
116 CHAPTER 4. ASSOCIATION PATTERN MINING
Algorithm RecursiveGrowthPointers(Transactions in terms of frequent 1-items: T ,
Minimum Support: minsup, Current Suffix: P )
begin
for each item i in T do begin
report itemset Pi = {i} ∪ P as frequent;
Use pointers to extract all transactions Ti
from T containing item i;
Remove all items from Ti that are lexicographically ≥ i;
Remove all infrequent items from Ti;
Set up pointers for Ti;
if (Ti = φ) then RecursiveGrowthPointers(Ti, minsup, Pi);
end
end
Figure 4.10: Generic recursive suffix growth with pointers
for Ti contains only two distinct transactions after consolidation. The infrequent items from this conditional database need to be removed. No items are removed at a minimum support of 1. Note that if the minimum support had been 3, then the item b would have been removed. The pointers for the new conditional transaction set do need to be set up again because they will be different for the conditional transaction database than in the original transactions. Unlike the pseudocode of Fig. 4.8, an additional step of setting up pointers is included in the pseudocode of Fig. 4.10.
The pointers provide an efficient way to extract the conditional transaction database. Of course, the price for this is that the pointers are a space overhead, with size exactly proportional to the original transaction database T . Consolidating repeated transactions does save some space. The FP-Tree, which will be discussed in the next section, takes this approach one step further by consolidating not only repeated transactions, but also repeated prefixes of transactions with the use of a trie data structure. This representation reduces the space-overhead by consolidating prefixes of the transaction database.
4.4.4.3 Implementation with Pointers and FP-Tree
The FP-Tree is designed with the primary goal of space efficiency of the projected database. The FP-Tree is a trie data structure representation of the conditional transaction database by consolidating the prefixes. This trie replaces the array-based implementation of the previous sections, but it retains the pointers. The path from the root to the leaf in the trie represents a (possibly repeated) transaction in the database. The path from the root to an internal node may represent either a transaction or the prefix of a transaction in the database. Each internal node is associated with a count representing the number of transactions in the original database that contain the prefix corresponding to the path from the root to that node. The count on a leaf represents the number of repeated instances of the transaction defined by the path from the root to that leaf. Thus, the FP-Tree maintains all counts of all the repeated transactions as well as their prefixes in the database. As in a standard trie data-structure, the prefixes are sorted in dictionary order. The lexicographic ordering of items is from the most frequent to the least frequent to maximize the advantages of prefix-based compression. This ordering also provides excellent selectivity in reducing the size of various conditional transaction sets in a balanced way. An example of the FP-Tree
Dostları ilə paylaş: |