Data Mining: The Textbook




tid   items
1     a, b, c, d
2     b, c, e, f
3     a, d, e, f
4     a, e, f
5     b, d, f

Determine the absolute support of the itemsets {a, e, f} and {d, f}. Convert the absolute support to the relative support.

2. For the database in Exercise 1, compute all frequent patterns at absolute minimum support values of 2, 3, and 4.

3. For the database in Exercise 1, determine all the maximal frequent patterns at absolute minimum support values of 2, 3, and 4.

4. Represent the database of Exercise 1 in vertical format.

5. Consider the transaction database in the table below:

tid   items
1     a, c, d, e
2     a, d, e, f
3     b, c, d, e, f
4     b, d, e, f
5     b, e, f
6     c, d, e
7     c, e, f
8     d, e, f

Determine all frequent patterns and maximal patterns at support levels of 3, 4, and 5.





6. Represent the transaction database of Exercise 5 in vertical format.




7. Determine the confidence of the rules {a} ⇒ {f} and {a, e} ⇒ {f} for the transaction database in Exercise 1.




8. Determine the confidence of the rules {a} ⇒ {f} and {a, e} ⇒ {f} for the transaction database in Exercise 5.




9. Show the candidate itemsets and the frequent itemsets in each level-wise pass of the Apriori algorithm for the database in Exercise 1. Assume an absolute minimum support level of 2.




10. Show the candidate itemsets and the frequent itemsets in each level-wise pass of the Apriori algorithm for the database in Exercise 5. Assume an absolute minimum support level of 3.




11. Show the prefix-based enumeration tree of frequent itemsets for the data set of Exercise 1 at an absolute minimum support level of 2. Assume a lexicographic ordering of a, b, c, d, e, f. Construct the tree for the reverse lexicographic ordering.





12. Show the prefix-based enumeration tree of frequent itemsets for the data set of Exercise 5 at an absolute minimum support of 3. Assume a lexicographic ordering of a, b, c, d, e, f. Construct the tree for the reverse lexicographic ordering.




13. Show the frequent suffixes generated in the recursion tree of the generic pattern growth method for the data set and support level in Exercise 9. Assume the lexicographic orderings a, b, c, d, e, f and f, e, d, c, b, a. How do these trees compare with those generated in Exercise 11?




14. Show the frequent suffixes generated in the recursion tree of the generic pattern growth method for the data set and support level in Exercise 10. Assume the lexicographic orderings a, b, c, d, e, f and f, e, d, c, b, a. How do these trees compare with those generated in Exercise 12?




15. Construct a prefix-based FP-Tree for the lexicographic ordering a, b, c, d, e, f for the data set in Exercise 1. Create the same tree for the reverse lexicographic ordering.




16. Construct a prefix-based FP-Tree for the lexicographic ordering a, b, c, d, e, f for the data set in Exercise 5. Create the same tree for the reverse lexicographic ordering.




17. The pruning approach in Apriori was inherently designed for a breadth-first strategy because all frequent k-itemsets are generated before (k + 1)-itemsets. Discuss how one might implement such a pruning strategy with a depth-first algorithm.




18. Implement the pattern growth algorithm with the use of (a) an array-based data structure, (b) a pointer-based data structure with no FP-Tree, and (c) a pointer-based data structure with an FP-Tree.




19. Implement Exercise 18(c) by growing patterns from prefixes and the FP-Tree on suffixes.




20. For the itemset {d, f} and the data set of Exercise 1, compute the (a) statistical correlation coefficient, (b) interest ratio, (c) cosine coefficient, and (d) Jaccard coefficient.




21. For the itemset {d, f} and the data set of Exercise 5, compute the (a) statistical correlation coefficient, (b) interest ratio, (c) cosine coefficient, and (d) Jaccard coefficient.




22. Discuss the similarities and differences between TreeProjection, DepthProject, VerticalApriori, and FP-growth.



Chapter 5


Association Pattern Mining: Advanced Concepts

“Each child is an adventure into a better life—an opportunity to change the old pattern and make it new.” —Hubert H. Humphrey


5.1 Introduction


Association pattern mining algorithms often discover a large number of patterns, and it is difficult to use this large output for application-specific tasks. One reason for this is that a vast majority of the discovered associations may be uninteresting or redundant for a specific application. This chapter discusses a number of advanced methods that are designed to make association pattern mining more application-sensitive:





1. Summarization: The output of association pattern mining is typically very large. For an end-user, a smaller set of discovered itemsets is much easier to understand and assimilate. This chapter will introduce a number of summarization methods such as finding maximal itemsets, closed itemsets, or nonredundant rules.




2. Querying: When a large number of itemsets are available, the users may wish to query them for smaller summaries. This chapter will discuss a number of specialized summarization methods that are query friendly. The idea is to use a two-phase approach in which the data is preprocessed to create a summary. This summary is then queried.




3. Constraint incorporation: In many real scenarios, one may wish to incorporate application-specific constraints into the itemset generation process. Although a constraint-based algorithm may not always provide online responses, it does allow for the use of much lower support levels for mining than a two-phase “preprocess-once query-many” approach.

These topics are all related to the extraction of interesting summary information from itemsets in different ways.










Table 5.1: Example of a snapshot of a market basket data set (Replicated from Table 4.1 of Chap. 4)





tid   Set of items
1     {Bread, Butter, Milk}
2     {Eggs, Milk, Yogurt}
3     {Bread, Cheese, Eggs, Milk}
4     {Eggs, Milk, Yogurt}
5     {Cheese, Milk, Yogurt}
For example, compressed representations of itemsets are very useful for querying. A query-friendly compression scheme is very different from a summarization scheme that is designed to assure nonredundancy. Similarly, there are fewer constrained itemsets than unconstrained itemsets. However, the shrinkage of the discovered itemsets is because of the constraints rather than a compression or summarization scheme. This chapter will also discuss a number of useful applications of association pattern mining.


This chapter is organized as follows. The problem of pattern summarization is addressed in Sect. 5.2. A discussion of querying methods for pattern mining is provided in Sect. 5.3. Section 5.4 discusses numerous applications of frequent pattern mining. The conclusions are discussed in Sect. 5.5.


5.2 Pattern Summarization


Frequent itemset mining algorithms often discover a large number of patterns. The size of the output creates challenges for users to assimilate the results and make meaningful inferences. An important observation is that the vast majority of the generated patterns are often redundant. This is because of the downward closure property, which ensures that all subsets of a frequent itemset are also frequent. There are different kinds of compact representations in frequent pattern mining that retain different levels of knowledge about the true set of frequent patterns and their support values. The most well-known representations are those of maximal frequent itemsets, closed frequent itemsets, and other approximate representations. These representations vary in the degree of information loss in the summarized representation. Closed representations are fully lossless with respect to the support and membership of itemsets. Maximal representations are lossy with respect to the support but lossless with respect to membership of itemsets. Approximate condensed representations are lossy with respect to both but often provide the best practical alternative in application-driven scenarios.


5.2.1 Maximal Patterns


The concept of maximal itemsets was discussed briefly in the previous chapter. For convenience, the definition of maximal itemsets is restated here:


Definition 5.2.1 (Maximal Frequent Itemset) A frequent itemset is maximal at a given minimum support level minsup if it is frequent and no superset of it is frequent.


For example, consider Table 5.1, which is replicated from Table 4.1 of the previous chapter. It is evident that the itemset {Eggs, Milk, Yogurt} is frequent at a minimum support level of 2 and is also maximal. The support of a proper subset of a maximal itemset is always equal to, or strictly larger than, that of the maximal itemset because of the support-monotonicity property. For example, the support of {Eggs, Milk}, which is a proper subset of the itemset {Eggs, Milk, Yogurt}, is 3. Therefore, one strategy for summarization is to mine only the maximal itemsets. The remaining itemsets are derived as subsets of the maximal itemsets.
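As a concrete illustration, the following Python sketch (not part of the book) enumerates all frequent itemsets of Table 5.1 by brute force and then filters out the nonmaximal ones; with minsup = 2 it reports {Eggs, Milk, Yogurt} among the maximal patterns.

from itertools import combinations

# Transactions of Table 5.1.
transactions = [
    {"Bread", "Butter", "Milk"},
    {"Eggs", "Milk", "Yogurt"},
    {"Bread", "Cheese", "Eggs", "Milk"},
    {"Eggs", "Milk", "Yogurt"},
    {"Cheese", "Milk", "Yogurt"},
]
minsup = 2  # absolute minimum support

def support(itemset):
    """Absolute support: number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Brute-force enumeration of frequent itemsets (acceptable for tiny examples).
items = sorted(set().union(*transactions))
frequent = [set(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if support(set(c)) >= minsup]

# A frequent itemset is maximal if no frequent proper superset exists.
maximal = [X for X in frequent if not any(X < Y for Y in frequent)]
print(maximal)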





5.2. PATTERN SUMMARIZATION

137

Although all the itemsets can be derived from the maximal itemsets with the subsetting approach, their support values cannot be derived. Therefore, maximal itemsets are lossy because they do not retain information about the support values. To provide a lossless representation in terms of the support values, the notion of closed itemset mining is used. This concept will be discussed in the next section.


A trivial way to find all the maximal itemsets would be to use any frequent itemset mining algorithm to find all frequent itemsets. Then, only the maximal ones can be retained in a postprocessing phase by examining itemsets in decreasing order of length, and removing proper subsets. This process is continued until all itemsets have either been examined or removed. The itemsets that have not been removed at termination are the maximal ones. However, this approach is an inefficient solution. When the itemsets are very long, the number of maximal frequent itemsets may be orders of magnitude smaller than the number of frequent itemsets. In such cases, it may make sense to design algorithms that can directly prune parts of the search space of patterns during frequent itemset discovery. Most of the tree-enumeration methods can be modified with the concept of lookaheads to prune the search space of patterns. This notion was discussed in the previous chapter in the context of the DepthProject algorithm.
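A minimal sketch of this postprocessing step is shown below; it assumes the frequent itemsets have already been computed (for example, by the brute-force routine sketched earlier).

def extract_maximal(frequent_itemsets):
    """Keep only maximal itemsets: examine itemsets in decreasing order of
    length and retain an itemset only if it is not a proper subset of an
    already-retained (longer) itemset."""
    ordered = sorted((frozenset(s) for s in frequent_itemsets),
                     key=len, reverse=True)
    maximal = []
    for X in ordered:
        if not any(X < M for M in maximal):
            maximal.append(X)
    return maximal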


Although the notion of lookaheads was described in Chap. 4, it is repeated here for completeness. Let P be a frequent pattern in the enumeration tree of itemsets, and F(P) represent the set of candidate extensions of P in the enumeration tree. If P ∪ F(P) is a subset of a frequent pattern that has already been found, then the entire enumeration tree rooted at P is frequent and can, therefore, be removed from further consideration. In the event that the subtree is not pruned, the candidate extensions of P need to be counted. During counting, the support of P ∪ F(P) is counted at the same time that the supports of single-item candidate extensions of P are counted. If P ∪ F(P) is frequent, then the subtree rooted at P can be pruned as well. The former kind of subset-based pruning approach is particularly effective with depth-first methods. This is because maximal patterns are found much earlier with a depth-first strategy than with a breadth-first strategy. For a maximal pattern of length k, the depth-first approach discovers it after exploring only (k − 1) of its prefixes, rather than the 2^k possibilities. This maximal pattern then becomes available for subset-based pruning. The remaining subtrees containing subsets of P ∪ F(P) are then pruned. The superior lookahead-pruning of depth-first methods was first noted in the context of the DepthProject algorithm.
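The sketch below is an illustrative simplification (it is not the DepthProject algorithm, and it recomputes supports naively): a depth-first enumeration over a fixed item ordering that applies both lookahead tests, pruning a subtree either when P ∪ F(P) is covered by an already-discovered pattern or when P ∪ F(P) is itself frequent.

def maximal_patterns_dfs(transactions, minsup):
    """Depth-first itemset enumeration with lookahead pruning. Collects maximal
    frequent itemsets; in general, pruning-based methods may leave a few
    nonmaximal patterns, which the subset-removal pass above would clean up."""
    items = sorted(set().union(*transactions))
    found = []

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    def dfs(prefix, candidates):
        full = prefix | set(candidates)      # P together with F(P)
        if any(full <= M for M in found):    # lookahead: covered by a known pattern
            return
        if support(full) >= minsup:          # lookahead: entire subtree is frequent
            found.append(full)
            return
        for i, item in enumerate(candidates):
            extension = prefix | {item}
            if support(extension) >= minsup:
                dfs(extension, candidates[i + 1:])
        if prefix and not any(prefix < M for M in found):
            found.append(prefix)             # no frequent superset discovered so far

    dfs(set(), items)
    return found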


The pruning approach provides a smaller set of patterns that includes all maximal patterns but may also include some nonmaximal patterns despite the pruning. Therefore, the postprocessing approach discussed earlier may be applied to remove these nonmaximal patterns. Refer to the bibliographic notes for pointers to various maximal frequent pattern mining algorithms.


5.2.2 Closed Patterns


A simple definition of a closed pattern, or closed itemset, is as follows:


Definition 5.2.2 (Closed Itemsets) An itemset X is closed, if none of its supersets have exactly the same support count as X.


Closed frequent pattern mining algorithms require itemsets to be closed in addition to being frequent. So why are closed itemsets important? Consider a closed itemset X, and the set S(X) of itemsets that are subsets of X and have the same support as X. The only itemset from S(X) that will be returned by a closed frequent itemset mining algorithm will be X.






The itemsets contained in S(X) may be referred to as the equi-support subsets of X.


An important observation is as follows:


Observation 5.2.1 Let X be a closed itemset, and S(X) be its equi-support subsets. For any itemset Y ∈ S(X), the set of transactions T(Y) containing Y is exactly the same as the set of transactions T(X) containing X. Furthermore, there is no itemset Z outside S(X) such that T(Z) is the same as T(X).


This observation follows from the downward closure property of frequent itemsets. For any proper subset Y of X, the set of transactions T(Y) is always a superset of T(X). However, if the support values of X and Y are the same, then T(X) and T(Y) are the same as well. Furthermore, if any itemset Z outside S(X) yields T(Z) = T(X), then the support of Z ∪ X must be the same as that of X. Because such a Z cannot be a subset of X (it would otherwise belong to S(X)), Z ∪ X must be a proper superset of X. This would lead to a contradiction with the assumption that X is closed.


It is important to understand that the itemset X encodes all the nonredundant counting information needed with respect to any itemset in S(X). Every itemset in S(X) describes the same set of transactions, and therefore, it suffices to keep the single representative itemset. The maximal itemset X from S(X) is retained. It should be pointed out that Definition 5.2.2 is a simplification of a more formal definition that is based on the use of a set-closure operator. The formal definition with the use of a set-closure operator is directly based on Observation 5.2.1 (which was derived here from the simplified definition). The informal approach used by this chapter is designed for better understanding. The frequent closed itemset mining problem is defined below.

Definition 5.2.3 (Closed Frequent Itemsets) An itemset X is a closed frequent itemset at minimum support minsup, if it is both closed and frequent.


The set of closed itemsets can be discovered in two ways:





1. The set of frequent itemsets at any given minimum support level may be determined, and the closed frequent itemsets can be derived from this set.




2. Algorithms can be designed to directly find the closed frequent patterns during the process of frequent pattern discovery.

While the second class of algorithms is beyond the scope of this book, a brief description of the first approach for finding all the closed itemsets will be provided here. The reader is referred to the bibliographic notes for algorithms of the second type.


A simple approach for finding frequent closed itemsets is to first partition all the frequent itemsets into equi-support groups. The maximal itemsets from each equi-support group may be reported. Consider a set of frequent patterns F, from which the closed frequent patterns need to be determined. The frequent patterns in F are processed in increasing order of support and either ruled in or ruled out, depending on whether or not they are closed. Note that an increasing support ordering also ensures that closed patterns are encountered earlier than their redundant subsets. Initially, all patterns are unmarked. When an unmarked pattern X ∈ F is processed (based on the increasing support order selection), it is added to the frequent closed set CF. The proper subsets of X with the same support cannot be closed. Therefore, all the proper subsets of X with the same support are marked. To achieve this goal, the subset of the itemset lattice representing F can be traversed in depth-first or breadth-first order starting at X, exploring subsets of X. Itemsets that are subsets of X are marked when they have the same support as X. The traversal process backtracks when an itemset is reached with strictly larger support, or when the itemset has already been marked by the current or a previous traversal.




After the traversal is complete, the next unmarked node is selected for further exploration and added to CF. The entire process of marking nodes is repeated, starting from the pattern newly added to CF. At the end of the process, the itemsets in CF represent the frequent closed patterns.
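The following sketch captures the same outcome in a simpler (less efficient) form: rather than traversing the lattice and marking equi-support subsets, it directly checks, for each frequent itemset, whether some frequent proper superset has the same support. This check is equivalent because any equal-support superset of a frequent itemset is itself frequent. The supports argument is an assumed dictionary from frozenset itemsets to absolute supports.

def closed_frequent_itemsets(supports):
    """supports: dict mapping each frequent itemset (frozenset) to its
    absolute support. Returns the closed frequent itemsets with supports."""
    closed = {}
    for X, sup_x in supports.items():
        # X is closed if no proper superset has exactly the same support.
        if not any(X < Y and sup_y == sup_x for Y, sup_y in supports.items()):
            closed[X] = sup_x
    return closed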


5.2.3 Approximate Frequent Patterns


Approximate frequent pattern mining schemes are almost always lossy schemes because they do not retain all the information about the itemsets. The approximation of the patterns may be performed in one of the following two ways:





1. Description in terms of transactions: The closure property provides a lossless description of the itemsets in terms of their membership in transactions. A generalization of this idea is to allow “almost” closures, where the closure property is not exactly satisfied but is approximately specified. Thus, a “play” is allowed in the support values of the closure definition.




2. Description in terms of itemsets themselves: In this case, the frequent itemsets are clustered, and representatives can be drawn from each cluster to provide a concise summary. Here, the “play” is allowed in terms of the distances between the representatives and the remaining itemsets.

These two types of descriptions yield different insights. One is defined in terms of transaction membership, whereas the other is defined in terms of the structure of the itemset. Note that among the subsets of a 10-itemset X, a 9-itemset may have a much higher support, but a 1-itemset may have exactly the same support as X. In the first definition, the 10-itemset and 1-itemset are “almost” redundant with respect to each other in terms of transaction membership. In the second definition, the 10-itemset and 9-itemset are almost redundant with respect to each other in terms of itemset structure. The following sections will introduce methods for discovering each of these kinds of itemsets.


5.2.3.1 Approximation in Terms of Transactions


The closure property describes itemsets in terms of transactions, and the equivalence of different itemsets with this criterion. The notion of “approximate closures” is a generalization of this criterion. There are multiple ways to define “approximate closure,” and a simpler definition is introduced here for ease in understanding.


In the earlier case of exact closures, one chooses the maximal supersets at a particular support value. In approximate closures, one does not necessarily choose the maximal supersets at a particular support value but allows a “play” δ within a range of supports. Therefore, all frequent itemsets F can be segmented into a disjoint set of k “almost equi-support” groups F1 . . . Fk, such that for any pair of itemsets X, Y within any group Fi, the value of |sup(X) − sup(Y)| is at most δ. From each group Fi, only the maximal frequent representatives are reported. Clearly, when δ is chosen to be 0, this is exactly the set of closed itemsets. If desired, the exact error value obtained by removing individual items from approximately closed itemsets is also stored. There is, of course, still some uncertainty in support values because the support values of itemsets obtained by removing two items cannot be exactly inferred from this additional data.


Note that the “almost equi-support” groups may be constructed in many different ways when δ > 0. This is because the ranges of the “almost equi-support” groups need not exactly be δ but can be less than δ.






Of course, a greedy way of choosing the ranges is to always pick the itemset with the lowest support and add δ to it to pick the upper end of the range. This process is repeated to construct all the ranges. Then, the frequent closed itemsets can be extracted on the basis of these ranges.
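One way this greedy range construction might be rendered in code (a sketch, assuming the frequent itemsets and their supports are available as a dictionary) is shown below; each group collects all itemsets whose support lies within δ of the lowest remaining support.

def almost_equi_support_groups(supports, delta):
    """Greedily partition frequent itemsets into 'almost equi-support' groups.
    supports: dict mapping itemsets (frozensets) to absolute support."""
    remaining = dict(supports)
    groups = []
    while remaining:
        low = min(remaining.values())            # lowest remaining support
        group = {X: s for X, s in remaining.items() if s <= low + delta}
        groups.append(group)
        for X in group:
            del remaining[X]
    return groups

From each group, only the maximal itemsets would then be reported, in direct analogy to the closed-pattern case (δ = 0).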


The algorithm for finding frequent “almost closed” itemsets is very similar to that of finding frequent closed itemsets. As in the previous case, one can partition the frequent itemsets into almost equi-support groups, and determine the maximal ones among them. A traversal algorithm in terms of the graph lattice is as follows.


The first step is to decide the different ranges of support for the “almost equi-support” groups. The itemsets in F are processed groupwise in increasing order of support ranges for the “almost equi-support” groups. Within a group, unmarked itemsets are processed in increasing order of support. When these nodes are examined they are added to the almost closed set AC. When a pattern X ∈ F is examined, all its proper subsets within the same group are marked, unless they have already been marked. To achieve this goal, the subset of the itemset lattice representing F can be traversed in the same way as discussed in the previous case of (exactly) closed sets. This process is repeated with the next unmarked node. At the end of the process, the set AC contains the frequent “almost closed” patterns. A variety of other ways of defining “almost closed” itemsets are available in the literature. The bibliographic notes contain pointers to these methods.


5.2.3.2 Approximation in Terms of Itemsets

The approximation in terms of itemsets can also be defined in many different ways and is closely related to clustering. Conceptually, the goal is to create clusters from the set of frequent itemsets F, and pick representatives J = J1 . . . Jk from the clusters. Because clusters are always defined with respect to a distance function Dist(X, Y) between itemsets X and Y, the notion of δ-approximate sets is also based on a distance function.

Definition 5.2.4 (δ-Approximate Sets) The set of representatives J = {J1 . . . Jk} is δ-approximate, if for each frequent pattern X ∈ F, and each Ji ∈ J, the following is true:

Dist(X, Ji) ≤ δ    (5.1)

Any distance function for set-valued data, such as the Jaccard coefficient, may be used. Note that the cardinality k of the representative set defines the level of compression. Therefore, the goal is to determine the smallest value of k for a particular level of compression δ. This objective is closely related to the partition-based formulation of clustering, in which the value of k is fixed, and the average distance of the individual objects to their representatives is optimized. Conceptually, this process also creates a clustering on the frequent itemsets. The frequent itemsets can either be strictly partitioned by assigning each to its closest representative, or they can be allowed to belong to multiple groups, one for each representative within a distance of at most δ.


So, how can the optimal size of the representative set be determined? It turns out that a simple greedy solution is very effective in most scenarios. Let C(J) ⊆ F denote the set of frequent itemsets covered by the representatives in J. An itemset in F is said to be covered by a representative in J, if it lies within a distance of at most δ from at least one representative of J. Clearly, it is desirable to determine J so that C(J) = F and the size of the set J is as small as possible.


The idea of the greedy algorithm is to start with J = {} and add the first element from F to J that covers the maximum number of itemsets in F. The covered itemsets are then removed from F. This process is repeated iteratively by greedily adding more elements to J to maximize coverage in the residual set F. The process terminates when the set F is empty. It can be shown that the function f(J) = |C(J)| satisfies the submodularity property with respect to the argument J. In such cases, greedy algorithms are generally effective in practice. In fact, in a minor variation of this problem in which |C(J)| is directly optimized for a fixed size of J, a theoretical bound can also be established on the quality achieved by the greedy algorithm. The reader is referred to the bibliographic notes for pointers on submodularity.
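A minimal sketch of this greedy covering procedure, assuming the frequent itemsets are given as a list of sets and using one common set distance (one minus the Jaccard coefficient), might look as follows.

def jaccard_distance(x, y):
    """Distance between two itemsets: 1 - |x ∩ y| / |x ∪ y|."""
    return 1.0 - len(x & y) / len(x | y)

def greedy_representatives(frequent, delta):
    """Greedily pick representatives until every frequent itemset lies within
    distance delta of at least one representative (i.e., C(J) = F)."""
    frequent = [frozenset(x) for x in frequent]
    uncovered = list(frequent)
    reps = []
    while uncovered:
        # Choose the itemset covering the most itemsets in the residual set.
        best = max(frequent,
                   key=lambda j: sum(jaccard_distance(j, x) <= delta
                                     for x in uncovered))
        reps.append(best)
        uncovered = [x for x in uncovered if jaccard_distance(best, x) > delta]
    return reps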

5.3 Pattern Querying

Although the compression approach provides a concise summary of the frequent itemsets, there may be scenarios in which users may wish to query the patterns with specific properties. The query responses provide the relevant sets of patterns in an application. This relevant set is usually much smaller than the full set of patterns. Some examples are as follows:





1. Report all the frequent patterns containing X that have a minimum support of minsup.




2. Report all the association rules containing X that have a minimum support of minsup and a minimum confidence of minconf.

One possibility is to exhaustively scan all the frequent itemsets and report the ones satisfying the user-specified constraints. This is, however, quite inefficient when the number of frequent patterns is large. There are two classes of methods that are frequently used for querying interesting subsets of patterns:





1. Preprocess-once query-many paradigm: The first approach is to mine all the itemsets at a low level of support and arrange them in the form of a hierarchical or lattice data structure. Because the first phase needs to be performed only once in offline fashion, sufficient computational resources may be available. Therefore, a low level of support is used to maximize the number of patterns preserved in the first phase. Many queries can be addressed in the second phase with the summary created in the first phase. A small illustrative sketch of this paradigm follows this list.




2. Constraint-based pattern mining: In this case, the user-specified constraints are pushed directly into the mining process. Although such an approach can be slower for each query, it allows the mining of patterns at much lower values of the support than is possible with the first approach. This is because the constraints can reduce the pattern sizes in the intermediate steps of the itemset discovery algorithm and can, therefore, enable the discovery of patterns at much lower values of the support than an (unconstrained) preprocessing phase.

In this section, both types of methods will be discussed.
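As a concrete illustration of the first paradigm (a hypothetical sketch; the class name and methods are not from the book), the itemsets mined once at a low support level can be stored together with their supports, after which repeated queries are answered against the stored summary rather than the raw data.

class ItemsetSummary:
    """Preprocess-once query-many: store itemsets mined offline at a low
    support threshold, then answer repeated queries against the summary."""

    def __init__(self, itemset_supports):
        # itemset_supports: dict mapping frozenset -> absolute support,
        # produced offline by any frequent pattern mining algorithm.
        self.itemset_supports = dict(itemset_supports)

    def patterns_containing(self, items, minsup):
        """All stored itemsets that contain `items` and have support >= minsup."""
        items = frozenset(items)
        return {X: s for X, s in self.itemset_supports.items()
                if items <= X and s >= minsup}

A query such as the first example above then reduces to a single scan of the stored summary, for instance summary.patterns_containing({"Milk"}, minsup=2).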


5.3.1 Preprocess-once Query-many Paradigm


This particular paradigm is very effective for the case of simpler queries. In such cases, the key is to first determine all the frequent patterns at a very low value of the support. The resulting itemsets can then be arranged in the form of a data structure for querying. The simplest data structure is the itemset lattice, which can be considered a graph data structure for querying. However, itemsets can also be queried with the use of data structures










yükləyin