χ²(X) = Σ_{i=1}^{2^|X|} (O_i − E_i)² / E_i.   (4.6)
CHAPTER 4. ASSOCIATION PATTERN MINING
For example, when X = {Bread, Butter}, one would need to perform the summation in Eq. 4.6 over the 2² = 4 states corresponding to {Bread, Butter}, {Bread, ¬Butter}, {¬Bread, Butter}, and {¬Bread, ¬Butter}. A value that is close to 0 indicates statistical independence among the items. Larger values of this quantity indicate greater dependence between the variables. However, large χ² values do not reveal whether the dependence between items is positive or negative. This is because the χ² test measures dependence between variables, rather than the nature of the correlation between the specific states of these variables.
The χ² measure is bit-symmetric because it treats the presence and absence of items in a similar way. The χ² test satisfies the upward closure property, which makes it possible to devise an efficient algorithm for discovering interesting k-patterns. On the other hand, the computational complexity of the measure in Eq. 4.6 increases exponentially with |X|.
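As a concrete illustration, the summation of Eq. 4.6 can be carried out directly over all 2^|X| presence/absence states. The following is a minimal Python sketch; the transaction database and item names are hypothetical, transactions are assumed to be represented as sets of items, and expected counts are derived from the marginal supports under the independence assumption:

```python
from itertools import product

def chi_squared(transactions, items):
    """Chi-squared statistic of Eq. 4.6 for an itemset, summed over
    all 2^|X| presence/absence states of its items."""
    n = len(transactions)
    # Marginal probability (support fraction) of each individual item.
    p = {i: sum(i in t for t in transactions) / n for i in items}
    chi2 = 0.0
    # Each state assigns present (True) or absent (False) to every item.
    for state in product([True, False], repeat=len(items)):
        observed = sum(
            all((item in t) == present for item, present in zip(items, state))
            for t in transactions
        )
        # Expected count under independence: n times the product of marginals.
        expected = n
        for item, present in zip(items, state):
            expected *= p[item] if present else (1.0 - p[item])
        if expected > 0:
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Hypothetical toy database.
db = [{"Bread", "Butter"}, {"Bread"}, {"Butter"}, {"Milk"},
      {"Bread", "Butter"}, {"Bread", "Butter", "Milk"}]
print(round(chi_squared(db, ["Bread", "Butter"]), 3))
```

Note that a perfectly correlated pair of items in a database of n transactions attains the maximum value of n, whereas exact independence yields 0.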
4.5.3 Interest Ratio
The interest ratio is a simple and intuitively interpretable measure. The interest ratio of a set of items {i1 . . . ik} is denoted by I({i1 . . . ik}), and is defined as follows:
I({i1 . . . ik}) = sup({i1 . . . ik}) / ∏_{j=1}^{k} sup(ij).   (4.7)
When the items are statistically independent, the joint support in the numerator will be equal to the product of the supports in the denominator. Therefore, an interest ratio of 1 is the break-even point. A value greater than 1 indicates that the variables are positively correlated, whereas a ratio of less than 1 is indicative of negative correlation.
When some items are extremely rare, the interest ratio can be misleading. For example, if an item occurs in only a single transaction in a large transaction database, each item that co-occurs with it in that transaction can be paired with it to create a 2-itemset with a very high interest ratio. This is statistically misleading. Furthermore, because the interest ratio does not satisfy the downward closure property, it is difficult to design efficient algorithms for computing it.
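The ratio of Eq. 4.7 can be sketched in a few lines of Python. The toy database below is hypothetical, and transactions are assumed to be represented as sets of items:

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def interest_ratio(transactions, items):
    """Eq. 4.7: joint support divided by the product of individual supports."""
    denom = 1.0
    for i in items:
        denom *= support(transactions, {i})
    return support(transactions, set(items)) / denom

# Hypothetical database: sup({Bread, Butter}) = 1/4 = sup(Bread) * sup(Butter),
# so the two items are at the break-even point of statistical independence.
db = [{"Bread", "Butter"}, {"Bread"}, {"Butter", "Milk"}, {"Milk"}]
print(interest_ratio(db, ["Bread", "Butter"]))  # 1.0
```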
4.5.4 Symmetric Confidence Measures
The traditional confidence measure is asymmetric between the antecedent and consequent. However, the support measure is symmetric. Symmetric confidence measures can be used to replace the support-confidence framework with a single measure. Let X and Y be two 1-itemsets. Symmetric confidence measures can be derived as a function of the confidence of X ⇒ Y and the confidence of Y ⇒ X. The various symmetric confidence measures can be any one of the minimum, average, or maximum of these two confidence values. The minimum is not desirable when either X or Y is very infrequent, causing the combined measure to be too low. The maximum is not desirable when either X or Y is very frequent, causing the combined measure to be too high. The average provides the most robust trade-off in many scenarios. The measures can be generalized to k-itemsets by using all k possible individual items in the consequent for computation. Interestingly, the geometric mean of the two confidences evaluates to the cosine measure, which is discussed below. The computational problem with symmetric confidence measures is that the relevant itemsets satisfying a specific threshold on the measure do not satisfy the downward closure property.
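For two 1-itemsets X and Y, the variants above can be sketched as follows; the support values passed in are hypothetical fractions:

```python
def symmetric_confidences(sup_x, sup_y, sup_xy):
    """Min, average, and max of conf(X => Y) and conf(Y => X);
    the geometric mean evaluates to the cosine measure."""
    conf_xy = sup_xy / sup_x   # confidence of X => Y
    conf_yx = sup_xy / sup_y   # confidence of Y => X
    return {
        "min": min(conf_xy, conf_yx),
        "avg": (conf_xy + conf_yx) / 2,
        "max": max(conf_xy, conf_yx),
        "geometric": (conf_xy * conf_yx) ** 0.5,  # equals the cosine measure
    }

# Hypothetical supports: X is frequent (0.5), Y is rare (0.2).
m = symmetric_confidences(sup_x=0.5, sup_y=0.2, sup_xy=0.1)
print(m)  # min 0.2, avg 0.35, max 0.5, geometric ~0.316
```

The example illustrates the trade-off discussed above: the minimum is dragged down by the low confidence toward the rare item, while the maximum is inflated.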
4.5. ALTERNATIVE MODELS: INTERESTING PATTERNS
4.5.5 Cosine Coefficient on Columns
The cosine coefficient is usually applied to the rows to determine the similarity among transactions. However, it can also be applied to the columns, to determine the similarity between items. The cosine coefficient is best computed using the vertical tid list representation on the corresponding binary vectors. The cosine value on the binary vectors computes to the following:
cosine(i, j) = sup({i, j}) / √(sup(i) · sup(j)).   (4.8)
The numerator can be evaluated as the length of the intersection of the tid lists of items i and j. The cosine measure can be viewed as the geometric mean of the confidences of the rules {i} ⇒ {j} and {j} ⇒ {i}. Therefore, the cosine is a kind of symmetric confidence measure.
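A minimal sketch of Eq. 4.8 on hypothetical tid lists, represented as Python sets, is shown below. Raw tid-list lengths can be used in place of support fractions because the normalization by the number of transactions cancels out:

```python
def cosine_items(tidlist_i, tidlist_j):
    """Eq. 4.8 on tid lists: |T(i) ∩ T(j)| / sqrt(|T(i)| * |T(j)|)."""
    inter = len(tidlist_i & tidlist_j)  # numerator: joint support count
    return inter / (len(tidlist_i) * len(tidlist_j)) ** 0.5

# Hypothetical tid lists: transaction identifiers containing each item.
bread = {1, 2, 4, 5}
butter = {2, 4, 5, 7}
print(cosine_items(bread, butter))  # 3 / sqrt(4 * 4) = 0.75
```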
4.5.6 Jaccard Coefficient and the Min-hash Trick
The Jaccard coefficient was introduced in Chap. 3 to measure similarity between sets. The tid lists on a column can be viewed as a set, and the Jaccard coefficient between two tid lists can be used to compute the similarity. Let S1 and S2 be two sets. As discussed in Chap. 3, the Jaccard coefficient J(S1, S2) between the two sets can be computed as follows:
J(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|.   (4.9)
The Jaccard coefficient can easily be generalized to multiway sets, as follows:
J(S1 . . . Sk) = |∩_{i=1}^{k} Si| / |∪_{i=1}^{k} Si|.   (4.10)
When the sets S1 . . . Sk correspond to the tid lists of k items, the intersection and union of the tid lists can be used to determine the numerator and denominator of the aforementioned expression. This provides the Jaccard-based significance for that k-itemset. It is possible to use a minimum threshold on the Jaccard coefficient to determine all the relevant itemsets.
A nice property of Jaccard-based significance is that it satisfies the set-wise monotonicity property. The k-way Jaccard coefficient J(S1 . . . Sk) is always no smaller than the (k+1)-way Jaccard coefficient J(S1 . . . Sk+1). This is because the numerator of the Jaccard coefficient is monotonically non-increasing with increasing values of k (similar to support), whereas the denominator is monotonically non-decreasing. Therefore, the Jaccard coefficient cannot increase with increasing values of k. Consequently, when a minimum threshold is used on the Jaccard-based significance of an itemset, the resulting itemsets satisfy the downward closure property as well. This means that most of the traditional algorithms, such as Apriori and enumeration tree methods, can be generalized to the Jaccard coefficient quite easily.
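The k-way coefficient of Eq. 4.10 and its non-increasing behavior can be sketched as follows, under the assumption that tid lists are represented as Python sets; the tid lists themselves are hypothetical:

```python
from functools import reduce

def jaccard(*tidlists):
    """Eq. 4.10: k-way Jaccard coefficient |∩ Si| / |∪ Si| of tid lists."""
    inter = reduce(set.intersection, tidlists)
    union = reduce(set.union, tidlists)
    return len(inter) / len(union)

# Hypothetical tid lists of three items.
a, b, c = {1, 2, 3, 4}, {2, 3, 4, 5}, {3, 4, 5, 6}
print(jaccard(a, b))     # 3/5 = 0.6
print(jaccard(a, b, c))  # 2/6: extending the itemset cannot increase the value
```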
It is possible to use sampling to speed up the computation of the Jaccard coefficient further, and to transform it into a standard frequent pattern mining problem. This kind of sampling uses hash functions to simulate sorted samples of the data. So, how can the Jaccard coefficient be computed using sorted sampling? Let D be the n × d binary data matrix representing the n rows and d columns. Without loss of generality, consider the case in which the Jaccard coefficient needs to be computed on the first k columns. Suppose one were to sort the rows of D randomly, and pick the first row in which at least one of the first k columns takes on the value of 1. Then, it is easy to see that the probability of
the event that all the k columns have a value of 1 is equal to the k-way Jaccard coefficient. If one were to sort the rows multiple times, it is possible to estimate this probability as the fraction of sorts over which the event of all k columns taking on unit values occurs. Of course, it is rather inefficient to do it in this way because every sort requires a pass over the database. Furthermore, this approach can only estimate the Jaccard coefficient for a
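This estimation procedure can be sketched as follows, with each random sort simulated by shuffling only the rows of the union of the tid lists; rows containing none of the k items can never be the first qualifying row, so they may be ignored. The tid lists below are hypothetical:

```python
import random

def jaccard_minhash(tidlists, trials=2000, seed=0):
    """Estimate the k-way Jaccard coefficient by repeated random sorting:
    the first row (after a random sort) containing at least one of the
    k items contains all of them with probability |∩ Si| / |∪ Si|."""
    rng = random.Random(seed)
    union = list(set().union(*tidlists))           # rows with at least one item
    inter = set.intersection(*map(set, tidlists))  # rows with all k items
    hits = 0
    for _ in range(trials):
        rng.shuffle(union)          # simulate one random sort of the rows
        hits += union[0] in inter   # does the first qualifying row hit all k?
    return hits / trials

# Hypothetical tid lists: the true Jaccard coefficient is 3/5 = 0.6.
estimate = jaccard_minhash([{1, 2, 3, 4}, {2, 3, 4, 5}], trials=5000)
print(round(estimate, 2))  # close to 0.6
```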