Data Mining: The Textbook




4.5. ALTERNATIVE MODELS: INTERESTING PATTERNS


Although it is possible to quantify the affinity of sets of items in ways that are statistically more robust than the support-confidence framework, the major computational problem faced by most such interestingness-based models is that the downward closure property is generally not satisfied. This makes algorithmic development rather difficult on the exponentially large search space of patterns. In some cases, the measure is defined only for the special case of 2-itemsets. In other cases, it is possible to design more efficient algorithms. The following contains a discussion of some of these models.


4.5.1 Statistical Coefficient of Correlation


A natural statistical measure is the Pearson coefficient of correlation between a pair of items. The Pearson coefficient of correlation between a pair of random variables X and Y is defined as follows:


$$\rho = \frac{E[X \cdot Y] - E[X] \cdot E[Y]}{\sigma(X) \cdot \sigma(Y)}. \tag{4.4}$$

In the case of market basket data, X and Y are binary variables whose values reflect presence or absence of items. The notation E[X] denotes the expectation of X, and σ(X) denotes the standard deviation of X. Then, if sup(i) and sup(j) are the relative supports of individual items, and sup({i, j}) is the relative support of itemset {i, j}, the overall correlation can be estimated from the data as follows:




$$\rho_{ij} = \frac{sup(\{i, j\}) - sup(i) \cdot sup(j)}{\sqrt{sup(i) \cdot sup(j) \cdot (1 - sup(i)) \cdot (1 - sup(j))}}. \tag{4.5}$$
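One way to see how Equation 4.5 follows from Equation 4.4 is to treat each item as a 0/1 indicator variable and estimate every expectation by the corresponding relative support. The steps below are a sketch under that assumption, not a derivation taken from the text:

```latex
\begin{align*}
E[X] &= sup(i), \qquad E[Y] = sup(j), \qquad E[X \cdot Y] = sup(\{i, j\}), \\
\sigma(X)^2 &= E[X^2] - E[X]^2 = sup(i) \cdot (1 - sup(i)) \quad \text{(since } X^2 = X \text{ for a 0/1 variable)}, \\
\sigma(Y)^2 &= sup(j) \cdot (1 - sup(j)).
\end{align*}
```

Substituting these estimates into Equation 4.4 gives sup({i, j}) − sup(i) · sup(j) in the numerator and the square root of sup(i) · sup(j) · (1 − sup(i)) · (1 − sup(j)) in the denominator.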

The coefficient of correlation always lies in the range [−1, 1], where the value of +1 indicates perfect positive correlation, and the value of −1 indicates perfect negative correlation. A value near 0 indicates weakly correlated data. This measure satisfies the bit-symmetric property. While the coefficient of correlation is statistically considered the most robust way of measuring correlations, it is often intuitively hard to interpret when dealing with items of varying but low support values.
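As a rough illustration of Equation 4.5, the following Python sketch estimates the correlation between two items directly from their relative supports. The function names `relative_support` and `item_correlation`, and the toy transactions, are hypothetical and chosen only for this example; transactions are assumed to be given as Python sets of item names.

```python
from math import sqrt


def relative_support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def item_correlation(transactions, i, j):
    """Estimate the Pearson correlation of items i and j from supports (Eq. 4.5)."""
    sup_i = relative_support(transactions, [i])
    sup_j = relative_support(transactions, [j])
    sup_ij = relative_support(transactions, [i, j])
    denom = sqrt(sup_i * sup_j * (1 - sup_i) * (1 - sup_j))
    if denom == 0:
        # An item present in all or none of the transactions has zero variance.
        raise ValueError("correlation is undefined for an item with support 0 or 1")
    return (sup_ij - sup_i * sup_j) / denom


# Toy data, invented for illustration only.
transactions = [
    {"Bread", "Butter"},
    {"Bread", "Butter"},
    {"Bread"},
    {"Milk"},
]
rho = item_correlation(transactions, "Bread", "Butter")
print(round(rho, 4))
```

The explicit zero-variance check is a design choice of this sketch: the measure is not defined when an item appears in every transaction or in none, so raising an error seems safer than returning an arbitrary value.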


4.5.2 χ² Measure

The χ² measure is another bit-symmetric measure that treats the presence and absence of items in a similar way. Note that for a set of k binary random variables (items), denoted by X, there are 2^k possible states representing presence or absence of different items of X in the transaction. For example, for k = 2 items {Bread, Butter}, the 2² = 4 states are {Bread, Butter}, {Bread, ¬Butter}, {¬Bread, Butter}, and {¬Bread, ¬Butter}. The expected fractional presence of each of these combinations can be quantified as the product of the supports of the states (presence or absence) of the individual items. For a given data set, the observed value of the support of a state may vary significantly from the expected value of the support. Let O_i and E_i be the observed and expected values of the absolute support of state i. For example, the expected support E_i of {Bread, ¬Butter} is given by the total number of transactions multiplied by each of the fractional supports of Bread and ¬Butter, respectively. Then, the χ²-measure for a set of items X is defined as follows:
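The observed and expected state counts described above can be sketched in Python as follows. The defining equation is not part of this excerpt, so the sketch assumes the standard Pearson form of the statistic, the sum over all 2^k states of (O_i − E_i)² / E_i. The function name `chi_square` and the toy transactions are hypothetical.

```python
from itertools import product


def chi_square(transactions, items):
    """Chi-square statistic over the 2^k presence/absence states of `items`.

    Assumes the standard form sum((O - E)^2 / E); expected counts come from
    treating the items as independent.
    """
    items = list(items)
    n = len(transactions)
    # Fractional support of the presence of each individual item.
    p = {item: sum(1 for t in transactions if item in t) / n for item in items}
    total = 0.0
    for state in product([True, False], repeat=len(items)):
        # Observed absolute support: transactions matching this exact state.
        observed = sum(
            1
            for t in transactions
            if all((item in t) == present for item, present in zip(items, state))
        )
        # Expected absolute support: n times the product of per-item fractions.
        expected = float(n)
        for item, present in zip(items, state):
            expected *= p[item] if present else 1 - p[item]
        if expected == 0:
            raise ValueError("statistic is undefined for an item with support 0 or 1")
        total += (observed - expected) ** 2 / expected
    return total


# Toy data, invented for illustration only.
transactions = [
    {"Bread", "Butter"},
    {"Bread", "Butter"},
    {"Bread"},
    {"Milk"},
]
chi = chi_square(transactions, ["Bread", "Butter"])
print(round(chi, 4))
```

For this toy data the expected count of {Bread, ¬Butter} is 4 · 0.75 · 0.5 = 1.5, matching the rule stated in the text. The loop enumerates all 2^k states explicitly, so the sketch is only practical for small k.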



