False positives: These are patterns that meet the support threshold on the sample but not on the base data.
False negatives: These are patterns that do not meet the support threshold on the sample but do meet the threshold on the base data.
False positives are easier to address than false negatives because the former can be removed with a single scan of the disk-resident database. To address false negatives, however, one needs to reduce the support threshold. By reducing the support threshold, it is possible to probabilistically guarantee the level of loss for specific thresholds. Pointers to these probabilistic guarantees may be found in the bibliographic notes. Reducing the support threshold too much will lead to many spurious itemsets and increase the work in the postprocessing phase. Typically, the number of false positives increases rapidly with even small reductions in the support level.
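The sampling workflow above can be sketched in a few lines: mine the sample at a lowered threshold to limit false negatives, then make one pass over the full database to discard false positives. This is a minimal illustration, not the book's algorithm; the transaction data, the lowered threshold, and the naive `frequent_itemsets` enumeration are all assumptions for the example.

```python
from itertools import combinations
import random

def support(db, itemset):
    """Fraction of transactions in db containing every item of itemset."""
    s = set(itemset)
    return sum(1 for t in db if s <= t) / len(db)

def frequent_itemsets(db, minsup, max_size=3):
    """Naive enumeration of frequent itemsets up to max_size (illustrative only)."""
    items = sorted({i for t in db for i in t})
    return [c for k in range(1, max_size + 1)
            for c in combinations(items, k) if support(db, c) >= minsup]

# Hypothetical disk-resident database, simulated as an in-memory list.
db = [{'a', 'b'}, {'a', 'c'}, {'a', 'b', 'c'}, {'b'}, {'a', 'b'}] * 20
random.seed(0)
sample = random.sample(db, 20)   # main-memory sample

minsup = 0.4
lowered = 0.3  # lowered threshold on the sample reduces false negatives

candidates = frequent_itemsets(sample, lowered)
# One scan of the full database removes the false positives.
verified = [c for c in candidates if support(db, c) >= minsup]
```

Every itemset in `verified` is genuinely frequent; what the scheme cannot guarantee is that no frequent itemset was missed on the sample, which is why the threshold is lowered in the first place.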
4.6.2 Data Partitioned Ensembles
One approach that can guarantee no false positives and no false negatives is the use of partitioned ensembles, as in the Partition algorithm [446]. This approach may be used either to reduce disk-access costs or to reduce the memory requirements of projection-based algorithms. In partitioned ensembles, the transaction database is partitioned into k disjoint segments, each of which is main-memory resident. The frequent itemset mining algorithm is applied independently to each of these k segments with the required minimum support level. An important property is that every frequent pattern must appear in at least one of the segments; otherwise, its cumulative support across the different segments could not meet the minimum support requirement. Therefore, the union of the frequent itemsets generated from the different segments provides a superset of the frequent patterns. In other words, the union contains false positives but no false negatives. A postprocessing phase of support counting can be applied to this superset to remove the false positives.

This approach is particularly useful for memory-intensive projection-based algorithms when the projected databases do not fit in main memory. In the original Partition algorithm, the data structure used to perform projection-based reuse was the vertical tid list. While partitioning is almost always necessary for memory-based implementations of projection-based algorithms on databases of arbitrarily large size, the cost of the postprocessing overhead can sometimes be significant. Therefore, one should use the minimum number of partitions that the available memory allows. Although Partition is well known mostly for its ensemble approach, an even more significant but less recognized contribution of the method was to propose the notion of vertical tid lists. The approach is credited with recognizing the projection-based reuse properties of recursive tid list intersections.
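The partitioned-ensemble idea can be sketched as follows. This is a simplified illustration of the scheme, not the original Partition implementation (which works on vertical tid lists); the toy database, the segment count, and the naive miner are assumptions for the example.

```python
from itertools import combinations

def support(db, itemset):
    """Fraction of transactions in db containing every item of itemset."""
    s = set(itemset)
    return sum(1 for t in db if s <= t) / len(db)

def frequent_itemsets(db, minsup, max_size=3):
    """Naive enumeration of frequent itemsets up to max_size (illustrative only)."""
    items = sorted({i for t in db for i in t})
    return [c for k in range(1, max_size + 1)
            for c in combinations(items, k) if support(db, c) >= minsup]

def partition_mine(db, minsup, k=4):
    """Mine each of k disjoint segments independently, union the results,
    then verify with one counting pass over the full database."""
    n = len(db)
    bounds = [round(i * n / k) for i in range(k + 1)]
    segments = [db[bounds[i]:bounds[i + 1]] for i in range(k)]

    # The union of locally frequent itemsets is a superset of the globally
    # frequent ones: a pattern infrequent in every segment cannot reach
    # minsup overall, so there are no false negatives.
    candidates = set()
    for seg in segments:
        candidates |= set(frequent_itemsets(seg, minsup))

    # One counting pass over the full database removes the false positives.
    return {c for c in candidates if support(db, c) >= minsup}

# Hypothetical transaction database.
db = [{'bread', 'milk'}, {'bread'}, {'milk', 'eggs'},
      {'bread', 'milk', 'eggs'}] * 25
result = partition_mine(db, minsup=0.5)
```

Because each segment is mined at the full minimum support, the candidate union cannot miss a frequent pattern, and the single verification pass guarantees the final answer is exact.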
4.6.3 Generalization to Other Data Types
The generalization to other data types is quite straightforward with the use of type-transformation methods discussed in Chap. 2.
4.6.3.1 Quantitative Data
In many applications, it is desirable to discover quantitative association rules when some of the attributes take on quantitative values. Many online merchants collect profile information, such as age, that has numeric values. For example, in supermarket applications, it may be desirable to relate demographic information to item attributes in the data. An example of such a rule is as follows:
(Age = 90) ⇒ Checkers.
This rule may not have sufficient support if the transactions do not contain enough individuals of that age. However, the rule may be relevant to a broader age group. Therefore, one possibility is to create a rule that groups the different ages into one range:
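The required type transformation can be sketched as follows: the numeric Age attribute is mapped onto range-valued pseudo-items, after which standard frequent-pattern mining applies unchanged. The records, bin boundaries, and item names below are hypothetical.

```python
def discretize_age(records, bins):
    """Replace each record's numeric age with a range pseudo-item,
    yielding ordinary transactions of categorical items."""
    out = []
    for rec in records:
        items = set(rec['items'])
        for lo, hi in bins:
            if lo <= rec['age'] <= hi:
                items.add(f'Age[{lo}-{hi}]')  # e.g. 'Age[85-95]'
                break
        out.append(items)
    return out

# Hypothetical profile data: ages plus purchased items.
records = [
    {'age': 88, 'items': {'Checkers'}},
    {'age': 91, 'items': {'Checkers'}},
    {'age': 93, 'items': {'Checkers', 'Puzzles'}},
    {'age': 34, 'items': {'Videogame'}},
]
txns = discretize_age(records, bins=[(0, 49), (50, 84), (85, 95)])
```

Here no single age value such as 90 has support above 1/4, but the pseudo-item for the whole range co-occurs with Checkers in three of the four transactions, so a range-based rule can meet a support threshold that the exact-age rule cannot.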