4.2. THE FREQUENT PATTERN MINING MODEL
|
95
|
|
|
|
Table 4.1: Example of a snapshot of a market basket data set
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
tid
|
Set of items
|
Binary representation
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1
|
{Bread, Butter, M ilk}
|
|
110010
|
|
|
|
|
2
|
{Eggs, M ilk, Y ogurt}
|
|
000111
|
|
|
|
|
3
|
{Bread, Cheese, Eggs, M ilk}
|
|
101110
|
|
|
|
|
4
|
{Eggs, M ilk, Y ogurt}
|
|
000111
|
|
|
|
|
5
|
{Cheese, M ilk, Y ogurt}
|
|
001011
|
|
|
|
|
|
|
|
|
|
|
in T1 . . . Tn in which an itemset occurs as a subset provides a crisp quantification of its frequency. This frequency is also known as the support.
Definition 4.2.1 (Support) The support of an itemset I transactions in the database T = {T1 . . . Tn} that contain I
is defined as the fraction of the as a subset.
The support of an itemset I is denoted by sup( I). Clearly, items that are correlated will frequently occur together in transactions. Such itemsets will have high support. Therefore, the frequent pattern mining problem is that of determining itemsets that have the requisite level of minimum support.
Definition 4.2.2 (Frequent Itemset Mining) Given a set of transactions T =
{T1 . . . Tn}, where each transaction Ti is a subset of items from U , determine all item-sets I that occur as a subset of at least a predefined fraction minsup of the transactions in
The predefined fraction minsup is referred to as the minimum support. While the default convention in this book is to assume that minsup refers to a fractional relative value, it is also sometimes specified as an absolute integer value in terms of the raw number of transactions. This chapter will always assume the convention of a relative value, unless specified otherwise. Frequent patterns are also referred to as frequent itemsets, or large itemsets. This book will use these terms interchangeably.
The unique identifier of a transaction is referred to as a transaction identifier, or tid for short. The frequent itemset mining problem may also be stated more generally in set-wise form.
Definition 4.2.3 (Frequent Itemset Mining: Set-wise Definition) Given a set of sets T = {T1 . . . Tn }, where each element of the set Ti is drawn on the universe of ele-ments U , determine all sets I that occur as a subset of at least a predefined fraction minsup of the sets in T .
As discussed in Chap. 1, binary multidimensional data and set data are equivalent. This equivalence is because each multidimensional attribute can represent a set element (or item). A value of 1 for a multidimensional attribute corresponds to inclusion in the set (or transaction). Therefore, a transaction data set (or set of sets) can also be represented as a multidimensional binary database whose dimensionality is equal to the number of items.
Consider the transactions illustrated in Table 4.1. Each transaction is associated with a unique transaction identifier in the leftmost column, and contains a baskets of items that were bought together at the same time. The right column in Table 4.1 contains the binary multidimensional representation of the corresponding basket. The attributes of this binary representation are arranged in the order {Bread, Butter, Cheese, Eggs, M ilk, Y ogurt}. In
96 CHAPTER 4. ASSOCIATION PATTERN MINING
this database of 5 transactions, the support of {Bread, M ilk} is 2/5 = 0.4 because both items in this basket occur in 2 out of a total of 5 transactions. Similarly, the support of {Cheese, Y ogurt} is 0.2 because it appears in only the last transaction. Therefore, if the minimum support is set to 0.3, then the itemset {Bread, M ilk} will be reported but not the itemset {Cheese, Y ogurt}.
The number of frequent itemsets is generally very sensitive to the minimum support level. Consider the case where a minimum support level of 0.3 is used. Each of the items Bread, M ilk, Eggs, Cheese, and Y ogurt occur in more than 2 transactions, and can therefore be considered frequent items at a minimum support level of 0.3. These items are frequent 1-itemsets. In fact, the only item that is not frequent at a support level of 0.3 is Butter. Furthermore, the frequent 2-itemsets at a minimum support level of 0.3 are
Dostları ilə paylaş: |