Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	300/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 296 297 298 299 300 301 302 303 ... 423

1-Data Mining tarjima

15.2. SEQUENTIAL PATTERN MINING

495

The sequential pattern mining problem is defined on a set of N sequences. The ith sequence contains n_i elements in a specific temporal order. Each element contains a set of items. The complex element is, therefore, a set, such as a basket of items bought by a customer. For example, consider the sequence:

Bread, Butter}, {Butter, M ilk}, {Bread, Butter, Cheese}, {Eggs}

Here, {Bread, Butter} is an element, and Bread is an item inside the element. A sub-sequence of this sequence is also a temporal ordering of sets, such that each element in the subsequence is a subset of an element in the base sequence in the same temporal order. For example, consider the following sequence:

Bread, Butter}, {Bread, Butter}, {Eggs}

The second sequence is a subsequence of the first because each element in the second sequence can be matched to a corresponding element in the first sequence by a subset relationship, so that the matching elements are in the same temporal order. Unlike trans-actions that are sets, note that sequences (and the mined subsequences) contain ordered (and possibly repeated) elements, each of which is itself like a transaction. For example, {Bread, Butter} is a repeated element in one of the aforementioned sequences, and it may correspond to two separate visits of a customer to the supermarket at diﬀerent times. For-mally, a subsequence relationship is defined as follows:

Definition 15.2.1 (Subsequence) Let Y = Y ₁ . . . Y_n and Z = Z ₁ . . . Z_k be two sequences, such that all the elements Y_i and Z_i in the sequences are sets. Then, the sequence Z is a subsequence of Y, if k elements Y_i₁ . . . Y_i_k can be found in Y, such that i_i < i₂ < . . . < i_k, and Z_r ⊆ Y_i_r for each r ∈ {1 . . . k}.

Consider a sequence database T containing a set of N sequences Y₁ . . . Y_N . The support of a subsequence Z with respect to database T is defined in an analogous way to frequent pattern mining.

Definition 15.2.2 (Support) The support of a subsequence Z is defined as the fraction of sequences in the database T = {Y₁ . . . Y_N }, that contain Z as a subsequence.

The sequential pattern mining problem is that of identifying all subsequences that satisfy the required level of minimum support minsup.

Definition 15.2.3 (Sequential Pattern Mining) Given a sequence database T =

{Y₁, . . . Y_N }, determine all subsequences whose support with respect to the database T is at least minsup.

It is easy to see that this definition is very similar to that of the definition of association pattern mining in Chap. 4. The minimum support value minsup can be specified either as an absolute value, or as a relative support value. As in the case of frequent pattern mining, a relative value will be assumed, unless otherwise specified.

An Apriori-like algorithm, known as Generalized Sequential Pattern Mining (GSP), was proposed as the first algorithm for sequential pattern mining. This algorithm is very similar to Apriori, in terms of how candidates are generated and counted. In fact, many frequent pattern mining algorithms, such as TreeProjection and FP-growth, have direct analogs in sequential pattern mining. This section describes only the GSP algorithm in detail. A later

496 CHAPTER 15. MINING DISCRETE SEQUENCES

Algorithm GSP(Sequence Database: T , Minimum Support: minsup)

begin

Yüklə 17,13 Mb.

Dostları ilə paylaş: