Age[85, 95] ⇒ Checkers.
This rule will have the required level of minimum support. In general, for quantitative association rule mining, the quantitative attributes are discretized into ranges and converted to binary form. Thus, the entire data set (including the item attributes) can be represented as a binary matrix, and a standard association rule mining algorithm may be applied to this representation. A challenge with such an approach is that the appropriate level of discretization is often hard to know a priori. Furthermore, rules on adjacent ranges can be merged to create summarized rules on larger ranges.
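As a brief illustration, the following sketch (with hypothetical data and range boundaries, not taken from the chapter) shows how a quantitative attribute such as Age can be discretized into ranges that are then treated as binary pseudo-items alongside the ordinary items:

```python
# Minimal sketch: discretize a quantitative attribute (Age) into ranges and
# convert each range to a binary pseudo-item, so that the resulting
# transactions can be fed to any standard association rule mining algorithm.
# The boundaries and records below are assumed for illustration only.

def discretize(value, boundaries):
    """Return the index of the range [boundaries[i], boundaries[i+1]) containing value."""
    for i in range(len(boundaries) - 1):
        if boundaries[i] <= value < boundaries[i + 1]:
            return i
    return len(boundaries) - 2  # clamp values at or beyond the last boundary

age_boundaries = [0, 20, 40, 60, 80, 100]           # assumed discretization granularity
records = [{"Age": 87, "items": {"Checkers"}},       # toy records
           {"Age": 25, "items": {"Basketball"}}]

transactions = []
for rec in records:
    r = discretize(rec["Age"], age_boundaries)
    lo, hi = age_boundaries[r], age_boundaries[r + 1]
    # Each age range becomes a binary pseudo-item alongside the ordinary items.
    transactions.append({f"Age[{lo},{hi})"} | rec["items"])

print(transactions)
# e.g. [{'Age[80,100)', 'Checkers'}, {'Age[20,40)', 'Basketball'}]
```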
4.6.3.2 Categorical Data
Categorical data is common in many application domains. For example, attributes such as gender and ZIP code are typically categorical. In other cases, the quantitative and categorical attributes may be mixed. An example of a rule with mixed attributes is as follows:
(Gender = Male), Age[20, 30] ⇒ Basketball.
Categorical data can be transformed to binary values with the use of the binarization approach discussed in Chap. 2. For each categorical attribute value, a single binary value is used to indicate the presence or absence of the item. This representation can then be used to determine the association rules. In some cases, when domain knowledge is available, clusters of categorical values may be used as binary attributes. For example, the ZIP codes may be clustered by geography into k clusters, and then these k clusters may be treated as binary attributes.
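The following sketch (with hypothetical attribute values and an assumed ZIP-code-to-cluster mapping) illustrates how categorical values, and clusters of such values, can be converted into binary pseudo-items:

```python
# Minimal sketch: binarize a categorical attribute by creating one pseudo-item
# per value, and map ZIP codes to k geographic clusters that are then treated
# as binary attributes. The cluster assignments below are assumed.

zip_to_cluster = {"10001": 0, "10002": 0, "94105": 1, "94110": 1}  # assumed clustering

def binarize(record):
    items = set()
    items.add(f"Gender={record['Gender']}")                    # one item per categorical value
    items.add(f"ZipCluster={zip_to_cluster[record['Zip']]}")   # cluster id as a binary attribute
    items |= record["items"]
    return items

record = {"Gender": "Male", "Zip": "94105", "items": {"Basketball"}}
print(binarize(record))
# e.g. {'Gender=Male', 'ZipCluster=1', 'Basketball'}
```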
4.7 Summary
Association rule mining is used to identify relationships between different attributes. Association rules are typically generated using a two-phase framework. In the first phase, all the patterns that satisfy the minimum support requirement are determined. In the second phase, rules that satisfy the minimum confidence requirement are generated from these patterns.
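As a simple illustration of the two-phase framework, the following sketch (on a toy transaction database with assumed thresholds, not the chapter's pseudocode) enumerates frequent itemsets by brute force and then generates the rules that meet the minimum confidence:

```python
# Minimal sketch of the two-phase framework on a toy transaction database.
# Phase 1 enumerates itemsets meeting the minimum support; phase 2 generates
# rules meeting the minimum confidence. Data and thresholds are assumed.
from itertools import combinations

transactions = [{"Bread", "Milk"}, {"Bread", "Milk", "Eggs"},
                {"Milk", "Eggs"}, {"Bread", "Eggs"}]
min_support, min_confidence = 0.5, 0.6

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Phase 1: all frequent itemsets (brute-force enumeration, fine for a toy example).
all_items = set().union(*transactions)
frequent = {frozenset(c): support(set(c))
            for k in range(1, len(all_items) + 1)
            for c in combinations(sorted(all_items), k)
            if support(set(c)) >= min_support}

# Phase 2: rules X => Y with conf(X => Y) = supp(X u Y) / supp(X) >= min_confidence.
for itemset, supp in frequent.items():
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            conf = supp / frequent[antecedent]
            if conf >= min_confidence:
                consequent = itemset - antecedent
                print(f"{set(antecedent)} => {set(consequent)} "
                      f"(support={supp:.2f}, confidence={conf:.2f})")
# prints rules such as {'Bread'} => {'Milk'} (support=0.50, confidence=0.67)
```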
The Apriori algorithm is one of the earliest and best-known methods for frequent pattern mining. In this algorithm, candidate patterns are generated with the use of joins between frequent patterns. Subsequently, a number of enumeration-tree algorithms were proposed for frequent pattern mining. Many of these methods use projections of the transaction database to count support more efficiently. The traditional support-confidence framework has the shortcoming that it is not based on robust statistical measures, and many of the patterns generated are not interesting. Therefore, a number of interest measures have been proposed for determining more relevant patterns.
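The join-based candidate generation step can be illustrated with the following sketch (a hypothetical helper, not the chapter's exact pseudocode), in which frequent (k − 1)-itemsets sharing their first k − 2 items are joined and the resulting candidates are pruned using the downward closure property:

```python
# Minimal sketch of Apriori-style candidate generation: frequent (k-1)-itemsets
# that agree on their first k-2 items are joined into k-candidates, which are
# then pruned if any (k-1)-subset is not frequent. Illustrative data only.
from itertools import combinations

def apriori_gen(frequent_k_minus_1):
    """frequent_k_minus_1: set of frozensets, all of the same size k-1."""
    ordered = [tuple(sorted(s)) for s in frequent_k_minus_1]
    candidates = set()
    for a, b in combinations(sorted(ordered), 2):
        if a[:-1] == b[:-1]:                        # join: first k-2 items agree
            candidate = frozenset(a) | frozenset(b)
            # prune: every (k-1)-subset must itself be frequent
            if all(frozenset(sub) in frequent_k_minus_1
                   for sub in combinations(candidate, len(candidate) - 1)):
                candidates.add(candidate)
    return candidates

frequent_2 = {frozenset(x) for x in [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]}
print(apriori_gen(frequent_2))
# {frozenset({'A', 'B', 'C'})}  -- {B, C, D} is pruned because {C, D} is not frequent
```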
A number of sampling methods have been designed for improving the efficiency of frequent pattern mining. Sampling methods result in both false positives and false negatives, though the former can be addressed by postprocessing. A partitioned sample ensemble is also able to avoid false negatives. Association rules can be determined in quantitative and categorical data with the use of type transformations.
4.8 Bibliographic Notes
The problem of frequent pattern mining was first proposed in [55]. The Apriori algorithm discussed in this chapter was first proposed in [56], and an enhanced variant of the approach was proposed in [57]. Maximal and non-maximal frequent pattern mining algorithms differ from one another primarily in terms of the additional pruning steps in the former. The MaxMiner algorithm used superset-based non-maximality pruning [82] for more efficient counting. However, the exploration is in breadth-first order to reduce the number of passes over the data. The DepthProject algorithm recognized that superset-based non-maximality pruning is more effective with a depth-first approach.
The FP-growth [252] and DepthProject [3, 4] methods independently proposed the notion of projection-based reuse in the horizontal database layout. A variety of different data structures are used by different projection-based reuse algorithms such as TreeProjection [3], DepthProject [4], FP-growth [252], and H-Mine [419]. A method, known as OpportuneProject [361], chooses opportunistically between array-based and tree-based structures to represent the projected transactions. The TreeProjection framework also recognized that breadth-first and depth-first strategies have different trade-offs. Breadth-first variations of TreeProjection sacrifice some of the power of projection-based reuse to enable fewer disk-based passes on arbitrarily large data sets. Depth-first variations of TreeProjection, such as DepthProject, achieve full projection-based reuse, but the projected databases need to be consistently maintained in main memory. A book and a survey on frequent pattern mining methods may be found in [34] and [253], respectively.
The use of the vertical representation for frequent pattern mining was independently pioneered by Holsheimer et al. [273] and Savasere et al. [446]. These works introduced the clever insight that recursive tid list intersections provide significant computational savings in support counting because k-itemsets have shorter tid lists than those of (k − 1)-itemsets or individual items. The vertical Apriori algorithm is based on an ensemble component of the Partition framework [446]. Although the use of vertical lists by this algorithm was mentioned [537, 534, 465] in the earliest vertical pattern mining papers, some of the contributions of the Partition algorithm and their relationship to the subsequent work seem to have remained unrecognized by the research community over the years. Savasere et al.'s Apriori-like algorithm, in fact, formed the basis for all vertical algorithms such as Eclat [534] and VIPER [465]. Eclat is described as a breadth-first algorithm in the book by Han et al. [250], and as a depth-first algorithm in the book by Zaki et al. [536]. A careful examination of the Eclat paper [537] reveals that it is a memory optimization of the breadth-first approach by Savasere et al. [446]. The main contribution of Eclat is a memory optimization of the individual ensemble component of Savasere et al.'s algorithm with lattice partitioning (instead of data partitioning), thereby increasing the maximum size of the databases that can be processed in memory without the computational overhead of data-partitioned postprocessing. The number of computational operations for support counting in a single component version of