Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	88/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 84 85 86 87 88 89 90 91 ... 423

1-Data Mining tarjima

INFREQUENT ITEMSETS abcde
Item Id 1 LIST OF ITEMSET IDENTIFIERS Item Id 2 LIST OF ITEMSET IDENTIFIERS Item Id
CONTAINING ITEMSETS

BORDER BETWEEN

Null

FREQUENT AND

FREQUENT ITEMSETS

INFREQUENT

ITEMSETS

abc

a bd

abe

acd

ace

ade

bcd

bce

bde

cde

abcd

abce

abde

acde

bcde

INFREQUENT ITEMSETS

abcde

Figure 5.1: The itemset lattice (replicated from Fig. 4.1 of Chap. 4)

adapted from the information retrieval literature that use the bag-of-words representation.

Both options will be explored in this chapter.

5.3.1.1 Leveraging the Itemset Lattice

As discussed in the previous chapter, the space of itemsets can be expressed as a lattice. For convenience, Fig. 4.1 of the previous chapter is replicated in Fig. 5.1. Itemsets above the dashed border are frequent, whereas itemsets below the border are infrequent.

In the preprocess-once query- many paradigm, the itemsets are mined at the lowest possible level of support s, so that a large frequent portion of the lattice (graph) of itemsets can be stored in main memory. This stage is a preprocessing phase; therefore, running time is not a primary consideration. The edges on the lattice are implemented as pointers for eﬃcient traversal. In addition, a hash table maps the itemsets to the nodes in the graph. The lattice has a number of important properties, such as downward closure, which enable the discovery of nonredundant association rules and patterns.

This structure can eﬀectively provide responses to many queries that are posed with support minsup ≥ s. Some examples are as follows:

To determine all itemsets containing a set X at a particular level of minsup, one uses the hash table to map to the itemset X. Then, the lattice is traversed to determine the relevant supersets of X and report them. A similar approach can be used to determine all the frequent itemsets contained in X by using a traversal in the opposite direction.

It is possible to determine maximal itemsets directly during the traversal by identi-fying nodes that do not have edges to their immediate supersets at the user-specified minimum support level minsup.

It is possible to identify nodes within a specific hamming distance of X and a specified minimum support, by traversing the lattice structure both upward and downward from X for a prespecified number of steps.

5.3. PATTERN QUERYING

Item Id₁ LIST OF ITEMSET IDENTIFIERS

Item Id₂ LIST OF ITEMSET IDENTIFIERS

Item Id₃ LIST OF ITEMSET IDENTIFIERS

LIST OF ITEMSET IDENTIFIERS

^Item^Idr LIST OF ITEMSET IDENTIFIERS

LIST OF ITEMSET IDENTIFIERS

LIST OF ITEMSET IDENTIFIERS

143

INDEXED BY

ITEMSET ID

SECONDARY

DATA

STRUCTURE

CONTAINING

ITEMSETS

Figure 5.2: Illustration of the inverted lists

It is also possible to determine nonredundant rules with the use of this approach. For example, for any itemset Y ⊆ Y , the rule X ⇒ Y has a confidence and support that is no greater than that of the rule X ⇒ Y . Therefore, the rule X ⇒ Y is redundant with respect to the rule X ⇒ Y . This is referred to as strict redundancy. Furthermore, for any itemset I, the rule I − Y ⇒ Y is redundant with respect to the rule I − Y ⇒ Y only in terms of the confidence. This is referred to as simple redundancy. The lattice structure provides an eﬃcient way to identify such nonredundant rules in terms of both simple redundancy and strict redundancy. The reader is referred to the bibliographic notes for specific search strategies on finding such rules.
5.3.1.2 Leveraging Data Structures for Querying

In some cases, it is desirable to use disk-resident representations for querying. In such cases, the memory-based lattice traversal process is likely to be ineﬃcient. The two most commonly used data structures are the inverted index and the signature table. The major drawback in using these data structures is that they do not allow an ordered exploration of the set of frequent patterns, as in the case of the lattice structure.

The data structures discussed in this section can be used for either transactions or item-sets. However, some of these data structures, such as signature tables, work particularly well for itemsets because they explicitly leverage correlations between itemsets for eﬃcient indexing. Note that correlations are more significant in the case of itemsets than raw trans-actions. Both these data structures are described in some detail below.

Inverted Index: The inverted index is a data structure that is used for retrieving sparse set-valued data, such as the bag-of-words representation of text. Because frequent patterns are also sparse sets drawn over a much larger universe of items, they can be retrieved eﬃciently with an inverted index.

Each itemset is assigned a unique itemset-id. This can easily be generated with a hash function. This itemset-id is similar to the tid that is used to represent transactions. The itemsets themselves may be stored in a secondary data structure that is indexed by the itemset-id. This secondary data structure can be a hash table that is based on the same hash function used to create the itemset-id.

The inverted list contains a list for each item. Each item points to a list of itemset-ids. This list may be stored on disk. An example of an inverted list is illustrated in Fig. 5.2. The inverted representation is particularly useful for inclusion queries over small sets of items. Consider a query for all itemsets containing X , where X is a small set of items. The inverted lists for each item in X is stored on the disk. The intersection of these lists is determined.

CHAPTER 5. ASSOCIATION PATTERN MINING: ADVANCED CONCEPTS

This provides the relevant itemset-ids but not the itemsets. If desired, the relevant itemsets can be accessed from disk and reported. To achieve this goal, the secondary data structure on disk needs to be accessed with the use of the recovered itemset-ids. This is an additional overhead of the inverted data structure because it may require random access to disk. For large query responses, such an approach may not be practical.

While inverted lists are eﬀective for inclusion queries over small sets of items, they are not quite as eﬀective for similarity queries over longer itemsets. One issue with the inverted index is that it treats each item independently, and it does not leverage the significant cor-relations between the items in the itemset. Furthermore, the retrieval of the full itemsets is more challenging than that of only itemset-ids. For such cases, the signature table is the data structure of choice.

Signature Tables: Signature tables were originally designed for indexing market basket transactions. Because itemsets have the same set-wise data structure as transactions, they can be used in the context of signature tables. Signature tables are particularly useful for sparse binary data in which there are significant correlations among the diﬀerent items. Because itemsets are inherently defined on the basis of correlations, and diﬀerent itemsets have large overlaps among them, signature tables are particularly useful in such scenarios.

A signature is a set of items. The set of items U in the original data is partitioned into sets of K signatures S₁ . . . S_K , such that U = ∪^K_i₌₁S_i. The value of K is referred to as the signature cardinality. An itemset X is said to activate a signature S_i at level r if and only if |S_i ∩ X| ≥ r. This level r is referred to as the activation threshold . In other words, the itemset needs to have a user-specified minimum number r of items in common with the signature to activate it.

The super-coordinate of an itemset exists in K-dimensional space, where K is the signa-ture cardinality. Each dimension of the super-coordinate has a unique correspondence with a particular signature and vice versa. The value of this dimension is 0–1, which indicates whether or not the corresponding signature is activated by that itemset. Thus, if the items are partitioned into K signatures {S₁, . . . S_K }, then there are 2K possible super-coordinates. Each itemset maps on to a unique super-coordinate, as defined by the set of signatures acti-vated by that itemset. If S_i₁ , S _i₂ , . . . S_i_l be the set of signatures which an itemset activates, then the super- coordinates of that itemset are defined by setting the l ≤ K dimensions {i₁, i₂, . . . i_l} in this super-coordinate to 1 and the remaining dimensions to 0. Thus, this approach creates a many-to-one mapping, in which multiple itemsets may map into the same super-coordinate. For highly correlated itemsets, only a small number of signatures will be activated by an itemset, provided that the partitioning of U into signatures is designed to ensure that each signature contains correlated items.

The signature table contains a set of 2K entries. One entry in the signature table corre-sponds to each possible super-coordinate. This creates a strict partitioning of the itemsets on the basis of the mapping of itemsets to super- coordinates. This partitioning can be used for similarity search. The signature table can be stored in main memory because the num-ber of distinct super-coordinates can be mapped to main memory when K is small. For example, when K is chosen to be 20, the number of super-coordinates is about a million. The actual itemsets that are indexed by each entry of the signature table are stored on disk. Each entry in the signature table points to a list of pages that contain the itemsets indexed by that super-coordinate. The signature table is illustrated in Fig. 5.3.

A signature can be understood as a small category of items from the universal set of items U . Thus, if the items in each signature are closely correlated, then an itemset is likely to activate a small number of signatures. These signatures provide an idea of the

5.3. PATTERN QUERYING	145

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 84 85 86 87 88 89 90 91 ... 423