Data Mining: The Textbook




The collective strength C(I) of an itemset I is defined in terms of its violation rate v(I) and the expected violation rate E[v(I)] under the assumption of statistical independence:

C(I) = \frac{1 - v(I)}{1 - E[v(I)]} \cdot \frac{E[v(I)]}{v(I)} \qquad (4.11)
The collective strength is a number between 0 and ∞. A value of 0 indicates a perfect negative correlation, whereas a value of ∞ indicates a perfect positive correlation. The value of 1 is the break-even point. The expected value of v(I) is calculated assuming statistical independence of the individual items. No violation occurs when either all items in I are included in the transaction, or when no items in I are included in the transaction. Therefore, if p_i is the fraction of transactions in which the item i occurs, we have:





E[v(I)] = 1 - \prod_{i \in I} p_i - \prod_{i \in I} (1 - p_i) \qquad (4.12)




Intuitively, if the violation of an itemset in a transaction is a “bad event” from the perspective of trying to establish a high correlation among items, then v(I) is the fraction of bad events, and (1 − v(I)) is the fraction of “good events.” Therefore, collective strength may be understood as follows:





C(I) = \frac{\text{Good Events}}{E[\text{Good Events}]} \cdot \frac{E[\text{Bad Events}]}{\text{Bad Events}} \qquad (4.13)
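As a concrete illustration of Equations 4.12 and 4.13, the following sketch computes the violation rate v(I), its expected value E[v(I)], and the collective strength C(I) directly from a small transaction database. The function name and the toy transactions are illustrative choices, not part of the text; a transaction that contains some, but not all, of the items in I counts as a violation.

def collective_strength(itemset, transactions):
    """Compute the collective strength C(I) of an itemset over a list of
    transactions, where each transaction is a set of items."""
    itemset = set(itemset)
    n = len(transactions)

    # v(I): fraction of transactions containing some, but not all, items of I.
    violations = sum(1 for t in transactions if itemset & t and not itemset <= t)
    v = violations / n

    # p_i: fraction of transactions containing item i.
    p = {i: sum(1 for t in transactions if i in t) / n for i in itemset}

    # E[v(I)] under the independence assumption (Equation 4.12).
    prod_p = prod_q = 1.0
    for i in itemset:
        prod_p *= p[i]
        prod_q *= 1 - p[i]
    ev = 1 - prod_p - prod_q

    # Collective strength: ((1 - v) / (1 - E[v])) * (E[v] / v).
    if v == 0:
        return float('inf')   # perfectly positive correlation
    if ev >= 1:
        return 0.0
    return ((1 - v) / (1 - ev)) * (ev / v)

# Toy database (illustrative only).
db = [{'Milk', 'Bread'}, {'Milk', 'Bread', 'Beer'}, {'Milk'},
      {'Beer', 'Diaper'}, {'Milk', 'Bread', 'Diaper'}]
print(collective_strength({'Milk', 'Bread'}, db))   # about 3.14, i.e., > 1

A value well above the break-even point of 1 indicates that Milk and Bread are positively correlated in this toy database.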




The concept of collective strength may be strengthened to that of strongly collective itemsets.


Definition 4.5.1 An itemset I is said to be strongly collective at level s, if it satisfies the following properties:





  1. The collective strength C(I) of the itemset I is at least s.




  2. Closure property: The collective strength C(J) of every subset J of I is at least s.

It is necessary to force the closure property to ensure that unrelated items may not be present in an itemset. Consider, for example, the case where itemset I1 is {Milk, Bread} and itemset I2 is {Diaper, Beer}. If I1 and I2 each have a high collective strength, then the itemset I1 ∪ I2 may often also have a high collective strength, even though items such as milk and beer may be independent. Because of the closure property of this definition, it is possible to design an Apriori-like algorithm for the problem.
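To make the level-wise idea concrete, the sketch below enumerates strongly collective itemsets with Apriori-style candidate generation, pruning any candidate whose (k − 1)-subsets did not survive the previous level. It reuses the collective_strength function and the toy database db from the earlier sketch; the threshold s = 2.0 and the size cap are arbitrary illustrative choices, not the book's reference implementation.

from itertools import combinations

def strongly_collective_itemsets(transactions, s, max_size=4):
    """Level-wise enumeration of itemsets that are strongly collective at
    level s (Definition 4.5.1): the itemset and all of its subsets must
    have collective strength at least s."""
    items = sorted(set().union(*transactions))

    # Level 1: a singleton can never be violated, so it survives trivially.
    current = [frozenset([i]) for i in items]
    result = []
    k = 2
    while current and k <= max_size:
        prev = set(current)
        # Candidate generation: join pairs of surviving (k-1)-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Closure-based pruning: every (k-1)-subset must have survived.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in prev
                             for sub in combinations(c, k - 1))}
        # Keep candidates whose own collective strength reaches level s.
        current = [c for c in candidates
                   if collective_strength(c, transactions) >= s]
        result.extend(current)
        k += 1
    return result

print(strongly_collective_itemsets(db, s=2.0))   # e.g., [frozenset({'Bread', 'Milk'})]

Because pruning uses only previously accepted itemsets, the closure property of Definition 4.5.1 is enforced by construction.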


4.5.8 Relationship to Negative Pattern Mining


In many applications, it is desirable to determine patterns that relate the presence of some items to the absence of others. Negative pattern mining requires the use of bit-symmetric measures that treat the presence and absence of an item in a symmetric way. The traditional support-confidence measure is not designed for finding such patterns. Measures such as the statistical coefficient of correlation, the χ2 measure, and collective strength are better suited to finding such positive or negative correlations between items. However, many of these measures are hard to use in practice because they do not satisfy the downward closure property. The multiway Jaccard coefficient and collective strength are among the few such measures that do satisfy the downward closure property.
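For a pair of items, one simple bit-symmetric measure is the φ coefficient, i.e., the correlation between the two presence/absence indicators computed from their 2 × 2 contingency table. The helper below is an illustrative sketch (it reuses the toy database db from the first sketch); a strongly negative value suggests a negative pattern such as "Milk present, Beer absent."

import math

def phi_coefficient(item_a, item_b, transactions):
    """Correlation between the presence indicators of two items.
    Ranges from -1 (perfect negative) to +1 (perfect positive) and treats
    presence and absence symmetrically, unlike support and confidence."""
    n = len(transactions)
    n11 = sum(1 for t in transactions if item_a in t and item_b in t)
    n10 = sum(1 for t in transactions if item_a in t and item_b not in t)
    n01 = sum(1 for t in transactions if item_a not in t and item_b in t)
    n00 = n - n11 - n10 - n01
    numerator = n11 * n00 - n10 * n01
    denominator = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return numerator / denominator if denominator else 0.0

print(phi_coefficient('Milk', 'Beer', db))   # about -0.61: negatively correlated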


4.6 Useful Meta-algorithms


A number of meta-algorithms can be used to obtain different insights from pattern mining. A meta-algorithm is defined as an algorithm that uses a particular algorithm as a subroutine, either to make the original algorithm more efficient (e.g., by sampling) or to gain new insights. Two types of meta-algorithms are most common in pattern mining. The first type uses sampling to improve the efficiency of association pattern mining algorithms. The second uses preprocessing and postprocessing subroutines to apply the algorithm to other scenarios. For example, after using these wrappers, standard frequent pattern mining algorithms can be applied to quantitative or categorical data.
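As a minimal sketch of the preprocessing-wrapper idea, the snippet below discretizes quantitative attributes into range-based pseudo-items so that an ordinary transaction-based frequent pattern mining algorithm can be applied downstream. The attribute names and bin boundaries are hypothetical choices made only for the illustration.

def binarize_record(record, bins):
    """Map each quantitative attribute value to a pseudo-item that names
    the range it falls in, producing an ordinary set-valued transaction."""
    items = set()
    for attr, value in record.items():
        for low, high in bins[attr]:
            if low <= value < high:
                items.add(f"{attr}=[{low},{high})")
                break
    return items

# Hypothetical discretization ranges, chosen only for illustration.
bins = {'Age': [(0, 30), (30, 60), (60, 120)],
        'Salary': [(0, 50000), (50000, 100000), (100000, 10**9)]}

records = [{'Age': 25, 'Salary': 40000},
           {'Age': 45, 'Salary': 90000},
           {'Age': 47, 'Salary': 95000}]

# Each record becomes a transaction of categorical pseudo-items, so any
# standard frequent itemset miner can be applied without modification.
quantitative_transactions = [binarize_record(r, bins) for r in records]
print(quantitative_transactions)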




4.6.1 Sampling Methods


When the transaction database is very large, it cannot be stored in main memory. This makes the application of frequent pattern mining algorithms more challenging, because such databases are typically stored on disk, and only level-wise algorithms may be used. Many depth-first algorithms on the enumeration tree are challenged by these scenarios because they require random access to the transactions, which is inefficient for disk-resident data. As discussed earlier, such depth-first algorithms are usually the most efficient for memory-resident data. By sampling, it is possible to apply many of these algorithms in an efficient way, with only limited loss in accuracy. When a standard itemset mining algorithm is applied to the sampled data, it will encounter two main challenges:
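One common form of this meta-algorithm is to mine a memory-resident random sample at a slightly lowered support threshold and then verify the reported itemsets with a single counting pass over the full database. The sketch below illustrates that workflow; mine_frequent_itemsets is a placeholder for any standard frequent itemset miner, and the sampling fraction and slack factor are illustrative assumptions.

import random

def sample_and_verify(transactions, minsup, mine_frequent_itemsets,
                      sample_frac=0.1, slack=0.9, seed=0):
    """Sampling meta-algorithm sketch: mine a random sample at a reduced
    support threshold (to limit missed patterns), then verify every
    reported itemset against the full database to discard spurious ones."""
    rng = random.Random(seed)
    sample = [t for t in transactions if rng.random() < sample_frac]

    # Any standard frequent itemset mining subroutine can be plugged in here.
    candidates = mine_frequent_itemsets(sample, minsup * slack)

    n = len(transactions)
    verified = []
    for itemset in candidates:
        support = sum(1 for t in transactions if set(itemset) <= t) / n
        if support >= minsup:
            verified.append((itemset, support))
    return verified

The lowered threshold reduces the chance that a pattern frequent in the full database is missed in the sample, while the verification pass removes patterns that appear frequent only because of sampling variance.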





