Data Mining: The Textbook
The sequential pattern mining problem is defined on a set of N sequences. The ith sequence contains ni elements in a specific temporal order. Each element contains a set of items. The complex element is, therefore, a set, such as a basket of items bought by a customer. For example, consider the sequence: Bread, Butter}, {Butter, M ilk}, {Bread, Butter, Cheese}, {Eggs} Here, {Bread, Butter} is an element, and Bread is an item inside the element. A sub-sequence of this sequence is also a temporal ordering of sets, such that each element in the subsequence is a subset of an element in the base sequence in the same temporal order. For example, consider the following sequence: Bread, Butter}, {Bread, Butter}, {Eggs} The second sequence is a subsequence of the first because each element in the second sequence can be matched to a corresponding element in the first sequence by a subset relationship, so that the matching elements are in the same temporal order. Unlike trans-actions that are sets, note that sequences (and the mined subsequences) contain ordered (and possibly repeated) elements, each of which is itself like a transaction. For example, {Bread, Butter} is a repeated element in one of the aforementioned sequences, and it may correspond to two separate visits of a customer to the supermarket at different times. For-mally, a subsequence relationship is defined as follows: Definition 15.2.1 (Subsequence) Let Y = Y 1 . . . Yn and Z = Z 1 . . . Zk be two sequences, such that all the elements Yi and Zi in the sequences are sets. Then, the sequence Z is a subsequence of Y, if k elements Yi1 . . . Yik can be found in Y, such that ii < i2 < . . . < ik, and Zr ⊆ Yir for each r ∈ {1 . . . k}. Consider a sequence database T containing a set of N sequences Y1 . . . YN . The support of a subsequence Z with respect to database T is defined in an analogous way to frequent pattern mining. Definition 15.2.2 (Support) The support of a subsequence Z is defined as the fraction of sequences in the database T = {Y1 . . . YN }, that contain Z as a subsequence. The sequential pattern mining problem is that of identifying all subsequences that satisfy the required level of minimum support minsup. Definition 15.2.3 (Sequential Pattern Mining) Given a sequence database T = {Y1, . . . YN }, determine all subsequences whose support with respect to the database T is at least minsup. It is easy to see that this definition is very similar to that of the definition of association pattern mining in Chap. 4. The minimum support value minsup can be specified either as an absolute value, or as a relative support value. As in the case of frequent pattern mining, a relative value will be assumed, unless otherwise specified. An Apriori-like algorithm, known as Generalized Sequential Pattern Mining (GSP), was proposed as the first algorithm for sequential pattern mining. This algorithm is very similar to Apriori, in terms of how candidates are generated and counted. In fact, many frequent pattern mining algorithms, such as TreeProjection and FP-growth, have direct analogs in sequential pattern mining. This section describes only the GSP algorithm in detail. A later 496 CHAPTER 15. MINING DISCRETE SEQUENCES Algorithm GSP(Sequence Database: T , Minimum Support: minsup) begin
Yüklə 17,13 Mb. Dostları ilə paylaş: |