Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	81/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 77 78 79 80 81 82 83 84 ... 423

1-Data Mining tarjima

χ²(X) =	₂\|X\|	(O_i − E_i)²	.	(4.6)


E_i

	i=1

124

CHAPTER 4. ASSOCIATION PATTERN MINING

For example, when

{

Bread, Butter

}

, one would need to perform the summation

in Eq. 4.6 over the 2

= 4 states corresponding to {Bread, Butter}, {Bread, ¬Butter},

{¬Bread, Butter}, and {¬Bread, ¬Butter}. A value that is close to 0 indicates statistical independence among the items. Larger values of this quantity indicate greater dependence between the variables. However, large χ2 values do not reveal whether the dependence between items is positive or negative. This is because the χ2 test measures dependence between variables, rather than the nature of the correlation between the specific states of these variables.

The χ2 measure is bit-symmetric because it treats the presence and absence of items in a similar way. The χ²-test satisfies the upward closure property because of which an eﬃcient algorithm can be devised for discovering interesting k-patterns. On the other hand, the computational complexity of the measure in Eq. 4.6 increases exponentially with |X|.

4.5.3 Interest Ratio

The interest ratio is a simple and intuitively interpretable measure. The interest ratio of a set of items {i₁ . . . i_k} is denoted as I({i₁, . . . i_k}), and is defined as follows:

I(	i	1	. . . i	k^}	) =	sup({i₁ . . . i_k})	.	(4.7)

	{					k
						_j₌₁^sup(ⁱj )

When the items are statistically independent, the joint support in the numerator will be equal to the product of the supports in the denominator. Therefore, an interest ratio of 1 is the break-even point. A value greater than 1 indicates that the variables are positively correlated, whereas a ratio of less than 1 is indicative of negative correlation.

When some items are extremely rare, the interest ratio can be misleading. For example, if an item occurs in only a single transaction in a large transaction database, each item that co-occurs with it in that transaction can be paired with it to create a 2-itemset with a very high interest ratio. This is statistically misleading. Furthermore, because the interest ratio does not satisfy the downward closure property, it is diﬃcult to design eﬃcient algorithms for computing it.

4.5.4 Symmetric Confidence Measures

The traditional confidence measure is asymmetric between the antecedent and consequent. However, the support measure is symmetric. Symmetric confidence measures can be used to replace the support -confidence framework with a single measure. Let X and Y be two 1-itemsets. Symmetric confidence measures can be derived as a function of the confidence of X ⇒ Y and the confidence of Y ⇒ X. The various symmetric confidence measures can be any one of the minimum, average, or maximum of these two confidence values. The min-imum is not desirable when either X or Y is very infrequent, causing the combined measure to be too low. The maximum is not desirable when either X or Y is very frequent, causing the combined measure to be too high. The average provides the most robust trade-oﬀ in many scenarios. The measures can be generalized to k-itemsets by using all k possible indi-vidual items in the consequent for computation. Interestingly, the geometric mean of the two confidences evaluates to the cosine measure, which is discussed below. The computa-tional problem with symmetric confidence measures is that the relevant itemsets satisfying a specific threshold on the measure do not satisfy the downward closure property.

4.5. ALTERNATIVE MODELS: INTERESTING PATTERNS

125

4.5.5 Cosine Coeﬃcient on Columns

The cosine coeﬃcient is usually applied to the rows to determine the similarity among trans-actions. However, it can also be applied to the columns, to determine the similarity between items. The cosine coeﬃcient is best computed using the vertical tid list representation on the corresponding binary vectors. The cosine value on the binary vectors computes to the following:

cosine(i, j) = _sup₍_i₎_· _sup₍_j₎. (4.8)

The numerator can be evaluated as the length of the intersection of the tid lists of items i and j. The cosine measure can be viewed as the geometric mean of the confidences of the rules {i} ⇒ {j} and {j} ⇒ {i}. Therefore, the cosine is a kind of symmetric confidence measure.

4.5.6 Jaccard Coeﬃcient and the Min-hash Trick

The Jaccard coeﬃcient was introduced in Chap. 3 to measure similarity between sets. The tid lists on a column can be viewed as a set, and the Jaccard coeﬃcient between two tid lists can be used to compute the similarity. Let S₁ and S₂ be two sets. As discussed in Chap. 3, the Jaccard coeﬃcient J(S₁, S₂) between the two sets can be computed as follows:

J(S₁, S₂) =	\|S₁	∩ S₂	\|	.	(4.9)

	\|S₁	∪ S₂	\|

The Jaccard coeﬃcient can easily be generalized to multiway sets, as follows:

J(S₁ . . . S_k) =	\| ∩ S_i\|	.	(4.10)

	\| ∪ S_i\|

When the sets S₁ . . . S_k correspond to the tid lists of k items, the intersection and union of the tid lists can be used to determine the numerator and denominator of the aforementioned expression. This provides the Jaccard-based significance for that k-itemset. It is possible to use a minimum threshold on the Jaccard coeﬃcient to determine all the relevant itemsets.

nice property of Jaccard-based significance is that it satisfies the set-wise monotonicity property. The k-way Jaccard coeﬃcient J(S₁ . . . S_k) is always no smaller than the (k+1)-way Jaccard coeﬃcient J(S₁ . . . S_k₊₁). This is because the numerator of the Jaccard coeﬃcient is monotonically non-increasing with increasing values of k (similar to support), whereas the denominator is monotonically non-decreasing. Therefore, the Jaccard coeﬃcient cannot increase with increasing values of k. Therefore, when a minimum threshold is used on the Jaccard-based significance of an itemset, the resulting itemsets satisfy the downward closure property, as well. This means that most of the traditional algorithms, such as Apriori and enumeration tree methods, can be generalized to the Jaccard coeﬃcient quite easily.

It is possible to use sampling to speed up the computation of the Jaccard coeﬃcient further, and transform it to a standard frequent pattern mining problem. This kind of sampling uses hash functions to simulate sorted samples of the data. So, how can the Jaccard coeﬃcient be computed using sorted sampling? Let D be the n × d binary data matrix representing the n rows and d columns. Without loss of generality, consider the case when the Jaccard coeﬃcient needs to be computed on the first k columns. Suppose one were to sort the rows in D, and pick the first row in which at least one of the first k columns in this row has a value of 1 in this column. Then, it is easy to see that the probability of

126 CHAPTER 4. ASSOCIATION PATTERN MINING

the event that all the k columns have a value of 1 is equal to the k-way Jaccard coeﬃcient. If one were to sort the rows multiple times, it is possible to estimate this probability as the fraction of sorts over which the event of all k columns taking on unit values occurs. Of course, it is rather ineﬃcient to do it in this way because every sort requires a pass over the database. Furthermore, this approach can only estimate the Jaccard coeﬃcient for a

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 77 78 79 80 81 82 83 84 ... 423