Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə77/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   73   74   75   76   77   78   79   80   ...   423
1-Data Mining tarjima

4.4. FREQUENT ITEMSET MINING ALGORITHMS

117

Figure 4.11: Illustration of recursive pattern growth with pointers and FP-Tree


data structure for the same database (as the previous example of Fig. 4.9) is shown in Fig. 4.11. In the example, the number “2” associated with the leftmost item c in the FP-Tree, represents the count of prefix path abc, as illustrated in Fig. 4.11.

The initial FP-Tree F PT can be constructed as follows. First the infrequent items in the database are removed. The resulting transactions are then successively inserted into the trie. The counts on the overlapping nodes are incremented by 1 when the prefix of the inserted transaction overlaps with an existing path in the trie. For the non-overlapping portion of the transaction, a new path needs to be created containing this portion. The newly created nodes are assigned a count of 1. This process of insertion is identical to that of trie creation, except that counts are also associated with nodes. The resulting tree is a compressed representation because common items in the prefixes of multiple transactions are represented by a single node.


The pointers can be constructed in an analogous way to the simpler array data structure of the previous section. The pointer for each item points to the next occurrence of the same item in the trie. Because a trie stores the transactions in dictionary order, it is easy to create pointers threading each of the items. However, the number of pointers is smaller, because many nodes have been consolidated. As an illustrative example, one can examine the relationship between the array-based data structure of Fig. 4.9, and the FP-Tree in Fig. 4.11. The difference is that the prefixes of the arrays in Fig. 4.9 are consolidated and compressed into a trie in Fig. 4.11.


The conditional FP-Tree F PT i (representing the conditional database Ti) needs to be extracted and reorganized for each item i ∈ F PT . This extraction is required to initiate recursive calls with conditional FP-Trees. As in the case of the simple pointer-based struc-ture of the previous section, it is possible to use the pointers of an item to extract the subset of the projected database containing that item. The following steps need to be performed for extraction of the conditional FP-Tree of item i:


118 CHAPTER 4. ASSOCIATION PATTERN MINING





  1. The pointers for item i are chased to extract the tree of conditional prefix paths for the item. These are the paths from the item to the root. The remaining branches are pruned.




  1. The counts of the nodes in the tree of prefix-paths are adjusted to account for the pruned branches. The counts can be adjusted by aggregating the counts on the leaves upwards.




  1. The frequency of each item is counted by aggregating the counts over all occurrences of that item in the tree of prefix paths. The items that do not meet the minimum support requirement are removed from the prefix paths. Furthermore, the last item i is also removed from each prefix path. The resulting conditional FP-Tree might have a completely different organization than the extracted tree of prefix-paths because of the removal of infrequent items. Therefore, the conditional FP-Tree may need to be recreated by reinserting the conditional prefix paths obtained after removing infre-quent items. The pointers for the conditional FP-Tree need to be reconstructed as well.

Consider the example in Fig. 4.11 which is the same data set as in Fig. 4.9. As in Fig. 4.9, it is possible to follow the pointers for item c in Fig. 4.11 to extract a tree of conditional prefix paths (shown in Fig. 4.11). The counts on many nodes in the tree of conditional prefix paths need to be reduced because many branches from the original FP-Tree (that do not contain the item c) are not included. These reduced counts can be determined by aggregating the counts on the leaves upwards. After removing the item c and infrequent items, two frequency-annotated conditional prefix paths ab(2) and a(2) are obtained, which are identical to the two projected and consolidated transactions of Fig. 4.9. The conditional FP-tree is then constructed for item c by reinserting these two conditional prefix paths into a new conditional FP-Tree. Again, this conditional FP- Tree is a trie representation of the conditional pointer base of Fig. 4.9. In this case, there are no infrequent items because a minimum support of 1 is used. If a minimum support of 3 had been used, then the item b would have to be removed. The resulting conditional FP-Tree is used in the next level recursive call. After extracting the conditional FP-Tree F PT i, it is checked whether it is empty. An empty conditional FP-Tree could occur when there are no frequent items in the extracted tree of conditional prefix paths. If the tree is not empty, then the next level recursive call is initiated with suffix Pi = {i} ∪ P , and the conditional FP-Tree F PT i.


The use of the FP-Tree allows an additional optimization in the form of a boundary condition for quickly extracting frequent patterns at deeper levels of the recursion. In par-ticular, it is checked whether all the nodes of the FP-Tree lie on a single path. In such a case, the frequent patterns can be directly extracted from this path by extracting all com-binations of nodes on this path together with the aggregated support counts. For example, in the case of Fig. 4.11, all nodes on the conditional FP-Tree lie on a single path. Therefore, in the next recursive call, the bottom of the recursion will be reached. The pseudocode for FP-growth is illustrated in Fig. 4.12. This pseudocode is similar to the pointer-based pseudocode of Fig. 4.10, except that a compressed FP-Tree is used.


4.4.4.4 Trade-offs with Different Data Structures


The main advantage of an FP-Tree over pointer-based implementation is one of space com-pression. The FP-Tree requires less space than pointer-based implementation because of trie-based compression, although it might require more space than an array-based imple-mentation because of the pointer overhead. The precise space requirements depend on the




Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   73   74   75   76   77   78   79   80   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin