Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə73/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   69   70   71   72   73   74   75   76   ...   423
1-Data Mining tarjima

Partition [446] and Monet [273] methods pioneered the concept of vertical database representations of the transaction database T . In the vertical representation, each item is associated with a list of its transaction identifiers (tids). It can also be thought of as using the transpose of the binary transaction data matrix representing the transactions so that columns are transformed to rows. These rows are used as the new “records.” Each item, thus, has a tid list of identifiers of transactions containing it. For example, the vertical representation of the database of Table 4.1 is illustrated in Table 4.2. Note that the binary matrix in Table 4.2 is the transpose of that in Table 4.1.

The intersection of two item tid lists yields a new tid list whose length is equal to the support of that 2-itemset. Further intersection of the resulting tid list with that of another item yields the support of 3-itemsets. For example, the intersection of the tid lists of Milk and Yogurt yields {2, 4, 5} with length 3. Further intersection of the tid list of {M ilk, Y ogurt} with that of Eggs yields the tid list {2, 4} of length 2. This means that the support of



4.4. FREQUENT ITEMSET MINING ALGORITHMS

111

Table 4.2: Vertical representation of market basket data set











































Item

Set of tids

Binary representation














































Bread

{1, 3}

10100










Butter

{1}

10000










Cheese

{3, 5}

00101










Eggs

{2, 3, 4}

01110










M ilk

{1, 2, 3, 4, 5}

11111










Y ogurt

{2, 4, 5}

01011

























{M ilk, Y ogurt} is 3/5 = 0.6 and that of {M ilk, Eggs, Y ogurt} is 2/5 = 0.4. Note that one can also intersect the smaller tid lists of {M ilk, Y ogurt} and {M ilk, Eggs } to achieve the same result. For a pair of k-itemsets that join to create a (k + 1)-itemset, it is possible to intersect the tid lists of the k-itemset pair to obtain the tid-list of the resulting (k + 1)-itemset. Intersecting tid lists of k-itemsets is preferable to intersecting tid lists of 1-itemsets because the tid lists of k -itemsets are typically smaller than those of 1-itemsets, which makes intersection faster. Such an approach is referred to as recursive tid list intersection. This insightful notion of recursive tid list intersection was introduced2 by the Monet [273] and Partition [446] algorithms. The Partition framework [446] proposed a vertical version of the Apriori algorithm with tid list intersection. The pseudocode of this vertical version of the Apriori algorithm is illustrated in Fig. 4.7. The only difference from the horizontal Apriori algorithm is the use of recursive tid list intersections for counting. While the vertical Apriori algorithm is computationally more efficient than horizontal Apriori, it is memory-intensive because of the need to store tid lists with each itemset. Memory requirements can be reduced with the use of a partitioned ensemble in which the database is divided into smaller chunks which are independently processed. This approach reduces the memory requirements at the expense of running-time overheads in terms of postprocessing, and it is discussed in Sect. 4.6.2. For smaller databases, no partitioning needs to be applied. In such cases, the vertical Apriori algorithm of Fig. 4.7 is also referred to as Partition-1, and it is the progenitor of all modern vertical pattern mining algorithms.

The vertical database representation can, in fact, be used in almost any enumeration-tree algorithm with a growth strategy that is different from the breadth-first method. As in the case of the vertical Apriori algorithm, the tid lists can be stored with the itemsets (nodes) during the growth of the tree. If the tid list of any node P is known, it can be intersected with the tid list of a sibling node to determine the support count (and tid list) of the corresponding extension of P . This provides an efficient way of performing the counting. By varying the strategy of growing the tree, the memory overhead of storing the tid lists can be reduced but not the number of operations. For example, while both breadth-first and depth-first strategies will require exactly the same tid list intersections for a particular pair of nodes, the depth-first strategy will have a smaller memory footprint because the tid lists need to be stored only at the nodes on the tree-path being explored and their immediate siblings. Reducing the memory footprint is, nevertheless, important because it increases the size of the database that can be processed entirely in core.


Subsequently, many algorithms, such as Eclat and VIPER, adopted


Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   69   70   71   72   73   74   75   76   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin