
k = 1;
Fk = { All frequent 1-item elements };
while Fk is not empty do begin
    Generate Ck+1 by joining pairs of sequences in Fk, such that
        removing an item from the first element of one sequence matches
        the sequence obtained by removing an item from the last element of the other;
    Prune sequences from Ck+1 that violate downward closure;
    Determine Fk+1 by support counting on (Ck+1, T) and retaining
        sequences from Ck+1 with support at least minsup;
    k = k + 1;
end;
return(F1 ∪ F2 ∪ . . . ∪ Fk);
Figure 15.1: The GSP algorithm is related to the Apriori algorithm. The reader is encouraged to compare this pseudocode with the Apriori algorithm described in Fig. 4.2 of Chap. 4.
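To make the support-counting step of Fig. 15.1 concrete, the following Python sketch tests whether a candidate sequence occurs in a data sequence: each element of the candidate must be a subset of a distinct element of the data sequence, and the matched elements must appear in the same order. The tuple-of-frozensets representation and the names occurs_in and support are illustrative assumptions, not part of the original algorithm specification.

from typing import FrozenSet, List, Tuple

Element = FrozenSet[str]
Seq = Tuple[Element, ...]   # a sequence is an ordered tuple of itemsets

def occurs_in(candidate: Seq, data_seq: Seq) -> bool:
    """True if each element of `candidate` is a subset of a distinct element
    of `data_seq`, with the matched elements appearing in the same order."""
    pos = 0
    for elem in candidate:
        # Scan forward for the next data element that contains this candidate element.
        while pos < len(data_seq) and not elem <= data_seq[pos]:
            pos += 1
        if pos == len(data_seq):
            return False
        pos += 1   # later candidate elements must match strictly later positions
    return True

def support(candidate: Seq, database: List[Seq]) -> int:
    """Support = number of data sequences that contain the candidate."""
    return sum(occurs_in(candidate, s) for s in database)

# Small illustration with two data sequences.
db = [
    (frozenset({"Bread", "Butter", "Cheese"}), frozenset({"Cheese", "Eggs"})),
    (frozenset({"Bread"}), frozenset({"Milk", "Cheese", "Eggs"})),
]
cand = (frozenset({"Bread"}), frozenset({"Cheese", "Eggs"}))
print(support(cand, db))   # 2: the candidate occurs in both data sequences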

The later part of this section provides a broad overview of how enumeration tree algorithms can be generalized to sequential pattern mining.


The GSP and Apriori algorithms are similar, except that the former needs to be designed for finding frequent sequences rather than sets. First, the notion of the length of candidates needs to be defined in sequential pattern mining. This notion needs to be defined more carefully, because the individual elements in the sequence are sets rather than items. The length of a candidate or a frequent sequence is equal to the number of items (not elements) in the candidate. In other words, a k-sequence ⟨Y1 . . . Yr⟩ is a sequence with length $\sum_{i=1}^{r} |Y_i| = k$.




Thus, ⟨{Bread, Butter, Cheese}, {Cheese, Eggs}⟩ is a 5-candidate, even though it contains only 2 elements. This is because this sequence contains 5 items in total, including a repetition of “Cheese” in two distinct elements. A (k − 1)-subsequence of a k-candidate can be generated by removing an item from any element in the k-sequence. The Apriori property continues to hold for sequences because any (k − 1)-subsequence of a k-sequence will have support at least equal to that of the latter. This sets the stage for a candidate generate-and-test approach, together with downward closure pruning, which is analogous to Apriori.
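To make these definitions concrete, the following Python sketch computes the length of a sequence as the total number of items over all of its elements, and enumerates the (k − 1)-subsequences obtained by deleting one item at a time. The data representation (tuples of frozensets) and the helper names are illustrative assumptions.

from typing import FrozenSet, List, Tuple

Element = FrozenSet[str]
Seq = Tuple[Element, ...]

def seq_length(seq: Seq) -> int:
    """Length of a sequence = total number of items over all of its elements."""
    return sum(len(element) for element in seq)

def k_minus_one_subsequences(seq: Seq) -> List[Seq]:
    """All (k-1)-subsequences obtained by deleting one item from one element,
    dropping an element entirely if it becomes empty."""
    result = []
    for i, element in enumerate(seq):
        for item in element:
            reduced = element - {item}
            if reduced:
                result.append(seq[:i] + (reduced,) + seq[i + 1:])
            else:
                result.append(seq[:i] + seq[i + 1:])
    return result

# The 5-candidate from the text: <{Bread, Butter, Cheese}, {Cheese, Eggs}>
s = (frozenset({"Bread", "Butter", "Cheese"}), frozenset({"Cheese", "Eggs"}))
print(seq_length(s))                                          # 5
print([seq_length(t) for t in k_minus_one_subsequences(s)])   # [4, 4, 4, 4, 4]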

The GSP algorithm starts by generating all frequent 1-item sequences by straightforward counting of individual items. This set of frequent 1-sequences is represented by F1. Subsequent iterations construct Ck+1 by joining pairs of sequence patterns in Fk. The join process is different from association pattern mining because of the greater complexity in the definition of sequences. Any pair of frequent k-sequences S1 and S2 can be joined if the sequence obtained by removing an item from the first element of S1 is identical to the sequence obtained by removing an item from the last element of S2. For example, the two 5-sequences S1 = ⟨{Bread, Butter, Cheese}, {Cheese, Eggs}⟩ and S2 = ⟨{Bread, Butter}, {Milk, Cheese, Eggs}⟩ can be joined because removing “Cheese” from the first element of S1 will result in an identical sequence to that obtained by removing “Milk” from the last element of S2. Note that if S2 were a 5-candidate with 3 elements corresponding to S2 = ⟨{Bread, Butter}, {Cheese, Eggs}, {Milk}⟩, then a join can also be performed. This is because removing the last item from S2 creates a sequence with 2 elements that is identical to the sequence obtained by removing “Cheese” from the first element of S1.
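The join condition described above can be expressed directly in code. The sketch below, which assumes the same tuple-of-frozensets representation as earlier, tests whether two k-sequences are joinable by trying every item of the first element of one sequence against every item of the last element of the other; the helper names are hypothetical, and the construction of the resulting (k + 1)-candidate is omitted.

from typing import FrozenSet, Tuple

Element = FrozenSet[str]
Seq = Tuple[Element, ...]

def remove_item(seq: Seq, idx: int, item: str) -> Seq:
    """Delete `item` from element `idx`, dropping the element if it becomes empty."""
    reduced = seq[idx] - {item}
    if reduced:
        return seq[:idx] + (reduced,) + seq[idx + 1:]
    return seq[:idx] + seq[idx + 1:]

def can_join(s1: Seq, s2: Seq) -> bool:
    """GSP join condition: removing some item from the first element of s1
    yields the same sequence as removing some item from the last element of s2."""
    for x in s1[0]:
        left = remove_item(s1, 0, x)
        for y in s2[-1]:
            if left == remove_item(s2, len(s2) - 1, y):
                return True
    return False

# The examples from the text.
s1 = (frozenset({"Bread", "Butter", "Cheese"}), frozenset({"Cheese", "Eggs"}))
s2 = (frozenset({"Bread", "Butter"}), frozenset({"Milk", "Cheese", "Eggs"}))
s2_alt = (frozenset({"Bread", "Butter"}), frozenset({"Cheese", "Eggs"}), frozenset({"Milk"}))
print(can_join(s1, s2))      # True: drop "Cheese" from s1, drop "Milk" from s2
print(can_join(s1, s2_alt))  # True: dropping "Milk" removes s2_alt's last element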



