Data Mining: The Textbook




The collective strength C(I) of an itemset I is defined in terms of its violation rate v(I) and the expected violation rate E[v(I)] under the assumption of statistical independence:

C(I) = \frac{1 - v(I)}{1 - E[v(I)]} \cdot \frac{E[v(I)]}{v(I)} \qquad (4.11)
The collective strength is a number between 0 and ∞. A value of 0 indicates a perfect negative correlation, whereas a value of ∞ indicates a perfect positive correlation. The value of 1 is the break-even point. The expected value of v(I) is calculated assuming statistical independence of the individual items. No violation occurs when either all items in I are included in the transaction, or when no items in I are included in the transaction. Therefore, if p_i is the fraction of transactions in which the item i occurs, we have:





E[v(I)] = 1 - \prod_{i \in I} p_i - \prod_{i \in I} (1 - p_i) \qquad (4.12)




Intuitively, if the violation of an itemset in a transaction is a “bad event” from the perspective of trying to establish a high correlation among items, then v(I) is the fraction of bad events, and (1 − v(I)) is the fraction of “good events.” Therefore, collective strength may be understood as follows:





C(I) = \frac{\text{Good Events}}{E[\text{Good Events}]} \cdot \frac{E[\text{Bad Events}]}{\text{Bad Events}} \qquad (4.13)
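As a concrete illustration of Equations 4.12 and 4.13, the following sketch computes the violation rate v(I), its expected value E[v(I)], and the collective strength C(I) directly from a small transaction database. The function name and the toy transactions are illustrative choices, not part of the text; a transaction that contains some, but not all, of the items in I counts as a violation.

def collective_strength(itemset, transactions):
    """Compute the collective strength C(I) of an itemset over a list of
    transactions, where each transaction is a set of items."""
    itemset = set(itemset)
    n = len(transactions)

    # v(I): fraction of transactions containing some, but not all, items of I.
    violations = sum(1 for t in transactions if itemset & t and not itemset <= t)
    v = violations / n

    # p_i: fraction of transactions containing item i.
    p = {i: sum(1 for t in transactions if i in t) / n for i in itemset}

    # E[v(I)] under the independence assumption (Equation 4.12).
    prod_p = prod_q = 1.0
    for i in itemset:
        prod_p *= p[i]
        prod_q *= 1 - p[i]
    ev = 1 - prod_p - prod_q

    # Collective strength: ((1 - v) / (1 - E[v])) * (E[v] / v).
    if v == 0:
        return float('inf')   # perfectly positive correlation
    if ev >= 1:
        return 0.0
    return ((1 - v) / (1 - ev)) * (ev / v)

# Toy database (illustrative only).
db = [{'Milk', 'Bread'}, {'Milk', 'Bread', 'Beer'}, {'Milk'},
      {'Beer', 'Diaper'}, {'Milk', 'Bread', 'Diaper'}]
print(collective_strength({'Milk', 'Bread'}, db))   # about 3.14, i.e., > 1

A value well above the break-even point of 1 indicates that Milk and Bread are positively correlated in this toy database.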




The concept of collective strength may be strengthened to that of strongly collective itemsets.


Definition 4.5.1 An itemset I is said to be strongly collective at level s, if it satisfies the following properties:





  1. The collective strength C(I) of the itemset I is at least s.




  2. Closure property: The collective strength C(J) of every subset J of I is at least s.

It is necessary to force the closure property to ensure that unrelated items may not be present in an itemset. Consider, for example, the case where itemset I1 is {Milk, Bread} and itemset I2 is {Diaper, Beer}. If I1 and I2 each have a high collective strength, then the itemset I1 ∪ I2 may often also have a high collective strength, even though items such as milk and beer may be independent. Because of the closure property of this definition, it is possible to design an Apriori-like algorithm for the problem.
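To make the level-wise idea concrete, the sketch below enumerates strongly collective itemsets with Apriori-style candidate generation, pruning any candidate whose (k − 1)-subsets did not survive the previous level. It reuses the collective_strength function and the toy database db from the earlier sketch; the threshold s = 2.0 and the size cap are arbitrary illustrative choices, not the book's reference implementation.

from itertools import combinations

def strongly_collective_itemsets(transactions, s, max_size=4):
    """Level-wise enumeration of itemsets that are strongly collective at
    level s (Definition 4.5.1): the itemset and all of its subsets must
    have collective strength at least s."""
    items = sorted(set().union(*transactions))

    # Level 1: a singleton can never be violated, so it survives trivially.
    current = [frozenset([i]) for i in items]
    result = []
    k = 2
    while current and k <= max_size:
        prev = set(current)
        # Candidate generation: join pairs of surviving (k-1)-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Closure-based pruning: every (k-1)-subset must have survived.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in prev
                             for sub in combinations(c, k - 1))}
        # Keep candidates whose own collective strength reaches level s.
        current = [c for c in candidates
                   if collective_strength(c, transactions) >= s]
        result.extend(current)
        k += 1
    return result

print(strongly_collective_itemsets(db, s=2.0))   # e.g., [frozenset({'Bread', 'Milk'})]

Because pruning uses only previously accepted itemsets, the closure property of Definition 4.5.1 is enforced by construction.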


4.5.8 Relationship to Negative Pattern Mining


In many applications, it is desirable to determine patterns that relate the presence of some items to the absence of others. Negative pattern mining requires the use of bit-symmetric measures that treat the presence and absence of an item in a symmetric way. The traditional support-confidence measure is not designed for finding such patterns. Measures such as the statistical coefficient of correlation, the χ2 measure, and collective strength are better suited to finding such positive or negative correlations between items. However, many of these measures are hard to use in practice because they do not satisfy the downward closure property. The multiway Jaccard coefficient and collective strength are among the few such measures that do satisfy the downward closure property.
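For a pair of items, one simple bit-symmetric measure is the φ coefficient, i.e., the correlation between the two presence/absence indicators computed from their 2 × 2 contingency table. The helper below is an illustrative sketch (it reuses the toy database db from the first sketch); a strongly negative value suggests a negative pattern such as "Milk present, Beer absent."

import math

def phi_coefficient(item_a, item_b, transactions):
    """Correlation between the presence indicators of two items.
    Ranges from -1 (perfect negative) to +1 (perfect positive) and treats
    presence and absence symmetrically, unlike support and confidence."""
    n = len(transactions)
    n11 = sum(1 for t in transactions if item_a in t and item_b in t)
    n10 = sum(1 for t in transactions if item_a in t and item_b not in t)
    n01 = sum(1 for t in transactions if item_a not in t and item_b in t)
    n00 = n - n11 - n10 - n01
    numerator = n11 * n00 - n10 * n01
    denominator = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return numerator / denominator if denominator else 0.0

print(phi_coefficient('Milk', 'Beer', db))   # about -0.61: negatively correlated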


4.6 Useful Meta-algorithms


A number of meta-algorithms can be used to obtain different insights from pattern mining. A meta-algorithm is defined as an algorithm that uses a particular algorithm as a subroutine, either to make the original algorithm more efficient (e.g., by sampling) or to gain new insights. Two types of meta-algorithms are most common in pattern mining. The first type uses sampling to improve the efficiency of association pattern mining algorithms. The second uses preprocessing and postprocessing subroutines to apply the algorithm to other scenarios. For example, after using these wrappers, standard frequent pattern mining algorithms can be applied to quantitative or categorical data.
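As a minimal sketch of the preprocessing-wrapper idea, the snippet below discretizes quantitative attributes into range-based pseudo-items so that an ordinary transaction-based frequent pattern mining algorithm can be applied downstream. The attribute names and bin boundaries are hypothetical choices made only for the illustration.

def binarize_record(record, bins):
    """Map each quantitative attribute value to a pseudo-item that names
    the range it falls in, producing an ordinary set-valued transaction."""
    items = set()
    for attr, value in record.items():
        for low, high in bins[attr]:
            if low <= value < high:
                items.add(f"{attr}=[{low},{high})")
                break
    return items

# Hypothetical discretization ranges, chosen only for illustration.
bins = {'Age': [(0, 30), (30, 60), (60, 120)],
        'Salary': [(0, 50000), (50000, 100000), (100000, 10**9)]}

records = [{'Age': 25, 'Salary': 40000},
           {'Age': 45, 'Salary': 90000},
           {'Age': 47, 'Salary': 95000}]

# Each record becomes a transaction of categorical pseudo-items, so any
# standard frequent itemset miner can be applied without modification.
quantitative_transactions = [binarize_record(r, bins) for r in records]
print(quantitative_transactions)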




4.6.1 Sampling Methods


When the transaction database is very large, it cannot be stored in main memory. This makes the application of frequent pattern mining algorithms more challenging, because such databases are typically stored on disk, and only level-wise algorithms may be used. Many depth-first algorithms on the enumeration tree are challenged by these scenarios because they require random access to the transactions, which is inefficient for disk-resident data. As discussed earlier, such depth-first algorithms are usually the most efficient for memory-resident data. By sampling, it is possible to apply many of these algorithms in an efficient way, with only limited loss in accuracy. When a standard itemset mining algorithm is applied to the sampled data, it will encounter two main challenges:
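One common form of this meta-algorithm is to mine a memory-resident random sample at a slightly lowered support threshold and then verify the reported itemsets with a single counting pass over the full database. The sketch below illustrates that workflow; mine_frequent_itemsets is a placeholder for any standard frequent itemset miner, and the sampling fraction and slack factor are illustrative assumptions.

import random

def sample_and_verify(transactions, minsup, mine_frequent_itemsets,
                      sample_frac=0.1, slack=0.9, seed=0):
    """Sampling meta-algorithm sketch: mine a random sample at a reduced
    support threshold (to limit missed patterns), then verify every
    reported itemset against the full database to discard spurious ones."""
    rng = random.Random(seed)
    sample = [t for t in transactions if rng.random() < sample_frac]

    # Any standard frequent itemset mining subroutine can be plugged in here.
    candidates = mine_frequent_itemsets(sample, minsup * slack)

    n = len(transactions)
    verified = []
    for itemset in candidates:
        support = sum(1 for t in transactions if set(itemset) <= t) / n
        if support >= minsup:
            verified.append((itemset, support))
    return verified

The lowered threshold reduces the chance that a pattern frequent in the full database is missed in the sample, while the verification pass removes patterns that appear frequent only because of sampling variance.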





