Data Mining: The Textbook

P(S|Ci) = P(s1|Ci) · P(s2|s1, Ci) · · · P(sn|s1 . . . sn−1, Ci)    (15.3)

This is the generative probability of the sequence S for cluster Ci. Intuitively, the term P(sj|s1 . . . sj−1, Ci) represents the fraction of times that sj follows s1 . . . sj−1 in cluster Ci. This term can be estimated in a data-driven manner from the sequences in Ci. When a cluster is highly similar to a sequence, this value will be high. A relative similarity can be computed by comparing with a sequence generation model in which all symbols are generated randomly in proportion to their presence in the full data set. The probability of such a random generation is given by ∏_{j=1}^{n} P(sj), where P(sj) is estimated as the fraction of sequences containing symbol sj. Then, the similarity of S to cluster Ci is defined as follows:

sim(S, Ci) = P(S|Ci) / [∏_{j=1}^{n} P(sj)]    (15.4)
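To make Eqs. 15.3 and 15.4 concrete, the following minimal sketch estimates the conditional terms from the sequences in a cluster and computes the similarity ratio. One simplifying assumption is made for brevity: each term P(sj|s1 . . . sj−1, Ci) is approximated by an order-1 estimate P(sj|sj−1, Ci), whereas the actual algorithm conditions on variable-length histories via the probabilistic suffix trees discussed below. All function names are illustrative.

```python
from collections import Counter, defaultdict

def train_cluster_model(cluster_seqs):
    # Count symbol bigrams within the cluster. CLUSEQ conditions each symbol
    # on a variable-length history via a probabilistic suffix tree; an
    # order-1 (bigram) model is used here only to keep the sketch short.
    bigrams = defaultdict(Counter)
    unigrams = Counter()
    for seq in cluster_seqs:
        unigrams.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            bigrams[prev][cur] += 1
    return unigrams, bigrams

def generative_probability(seq, model, eps=1e-6):
    # P(S|Ci) of Eq. 15.3, with each factor P(s_j | s_1 ... s_{j-1}, Ci)
    # approximated by the bigram estimate P(s_j | s_{j-1}, Ci).
    unigrams, bigrams = model
    total = sum(unigrams.values())
    p = unigrams[seq[0]] / total if total else eps
    for prev, cur in zip(seq, seq[1:]):
        denom = sum(bigrams[prev].values())
        p *= bigrams[prev][cur] / denom if denom else eps
    return max(p, eps)

def background_probability(seq, symbol_fraction, eps=1e-6):
    # Denominator of Eq. 15.4: the product of P(s_j) over j, where P(s_j)
    # is the fraction of all sequences that contain the symbol s_j.
    p = 1.0
    for s in seq:
        p *= symbol_fraction.get(s, eps)
    return p

def sim(seq, model, symbol_fraction):
    # Eq. 15.4: generative probability relative to random generation.
    return generative_probability(seq, model) / background_probability(seq, symbol_fraction)
```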
One issue is that many parts of the sequence S may be noisy and may not match the cluster well. Therefore, the similarity is computed as the maximum similarity of any contiguous segment of S to Ci. In other words, if Skl is the contiguous segment of S from position k to position l, then the final similarity SIM(S, Ci) is computed as follows:





SIM(S, Ci) = max_{1≤k≤l≤n} sim(Skl, Ci)    (15.5)

The maximum similarity value can be obtained by evaluating sim(Skl, Ci) over all O(n²) pairs [k, l], as in the sketch below. This is the similarity value used for assigning sequences to their relevant clusters.
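Continuing the sketch above, Eq. 15.5 can be evaluated by a brute-force scan over all contiguous segments:

```python
def SIM(seq, model, symbol_fraction):
    # Eq. 15.5: maximize sim over all contiguous segments S_kl with k <= l.
    # A brute-force scan over all O(n^2) segments, using sim() from above.
    n = len(seq)
    return max(sim(seq[k:l + 1], model, symbol_fraction)
               for k in range(n) for l in range(k, n))
```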


One problematic issue is that the computation of each of the terms P(sj|s1 . . . sj−1, Ci) on the right-hand side of Eq. 15.3 may require the examination of all the sequences in cluster Ci for probability estimation purposes. Fortunately, these terms can be estimated efficiently using a data structure referred to as the Probabilistic Suffix Tree (PST). The CLUSEQ algorithm dynamically maintains the PSTs whenever new clusters are created or sequences are added to clusters. This data structure is described in detail in Sect. 15.4.1.1.
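Although the PST itself is described only in Sect. 15.4.1.1, the following hypothetical sketch illustrates the kind of lookup it supports: the conditional probability of a symbol is estimated from the deepest tree node whose stored history matches a suffix of the current context. A real PST construction typically also prunes low-support nodes and smooths the counts; such details are omitted here.

```python
from collections import Counter

class PSTNode:
    # One node of a probabilistic suffix tree. The path from the root to a
    # node spells a history (most recent symbol first), and next_counts
    # records how often each symbol followed that history in the cluster.
    def __init__(self):
        self.children = {}           # symbol -> PSTNode (one step deeper history)
        self.next_counts = Counter()

def conditional_probability(root, history, symbol, eps=1e-6):
    # Estimate P(symbol | history) by walking backward from the most recent
    # symbol of the history and keeping the deepest matching node.
    node = root
    deepest = root
    for s in reversed(history):
        node = node.children.get(s)
        if node is None:
            break
        deepest = node
    total = sum(deepest.next_counts.values())
    return deepest.next_counts[symbol] / total if total else eps
```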


15.3.4.2 Mixture of Hidden Markov Models

This approach can be considered the string analog of the probabilistic models discussed in Sect. 6.5 of Chap. 6 for clustering multidimensional data. Recall that a generative mixture model is used in that case, where each component of the mixture has a Gaussian distribution. A Gaussian distribution is, however, appropriate only for generating numerical data, and not for generating sequences. A good generative model for sequences is the Hidden Markov Model (HMM). The discussion in this section treats the HMM as a black box; the actual details of HMMs are discussed in Sect. 15.5. As we will see in Sect. 15.5, the HMM can itself be considered a kind of mixture model, in which the states represent dependent components of the mixture. Therefore, this approach can be considered a two-level mixture model. The discussion in this section should be combined with the description of HMMs in Sect. 15.5 to provide a complete picture of HMM-based clustering.


The broad principle of a mixture-based generative model is to assume that the data was generated from a mixture of k distributions with probability distributions G1 . . . Gk, where each Gi is a Hidden Markov Model. As in Sect. 6.5 of Chap. 6, the approach assumes the use of prior probabilities α1 . . . αk for the different components of the mixture. Therefore, the generative process is described as follows:





  1. Select one of the k probability distributions with probability αi where i ∈ {1 . . . k}. Let us assume that the rth one is selected.




  2. Generate a sequence from Gr, where Gr is a Hidden Markov Model (see the sketch below).
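A minimal sketch of this two-step sampling process, with the HMM treated as a black box (the sample_sequence method is an assumed interface, not a real library call):

```python
import random

def sample_from_mixture(priors, hmms, length):
    # Step 1: select the r-th component with probability alpha_r.
    r = random.choices(range(len(hmms)), weights=priors, k=1)[0]
    # Step 2: generate a sequence from G_r. The HMM is a black box here;
    # sample_sequence(length) is an assumed interface.
    return hmms[r].sample_sequence(length)
```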

One nice characteristic of mixture models is that a change in the data type and the corresponding mixture distribution does not affect the broader framework of the algorithm. Analogous steps can be applied in the case of sequence data, as they are applied to multidimensional data. Let Sj represent the jth sequence and Θ be the entire set of parameters to be estimated for the different HMMs. Then, the E-step and M-step are exactly analogous to those of the multidimensional mixture model; a schematic sketch of the overall EM loop appears after the two steps below.





  1. (E-step) Given the current state of the trained HMMs and the priors αi, determine the posterior probability P(Gi|Sj, Θ) of each sequence Sj using the HMM generative probabilities P(Sj|Gi, Θ) of Sj from the ith HMM and the priors α1 . . . αk, in conjunction with the Bayes rule. This is the posterior probability that the sequence Sj was generated by the ith HMM.




  2. (M-step) Given the current probabilities of assignment of data points to clusters, use the Baum–Welch algorithm on each HMM to learn its parameters. The assignment probabilities are used as weights when averaging the estimated parameters. The Baum–Welch algorithm is described in Sect. 15.5.4 of this chapter. The value of each αi is estimated to be proportional to the average assignment probability of all sequences to cluster i. Thus, the M-step results in the estimation of the entire set of parameters Θ.
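The two steps can be assembled into the schematic EM loop below. This is a sketch under stated assumptions only: log_likelihood and weighted_baum_welch are assumed black-box methods standing in for the HMM evaluation and the weighted Baum–Welch procedure of Sect. 15.5, not calls from any particular library.

```python
import numpy as np

def em_mixture_of_hmms(seqs, hmms, priors, n_iters=20):
    # hmms is a list of k black-box HMMs assumed to expose:
    #   log_likelihood(seq): returns log P(S_j | G_i, Theta)
    #   weighted_baum_welch(seqs, weights): re-estimates parameters with
    #     per-sequence weights (the assignment probabilities).
    k = len(hmms)
    priors = np.asarray(priors, dtype=float)
    for _ in range(n_iters):
        # E-step: posterior P(G_i | S_j, Theta) via the Bayes rule.
        log_joint = np.array([[np.log(priors[i]) + hmms[i].log_likelihood(s)
                               for i in range(k)] for s in seqs])
        log_joint -= log_joint.max(axis=1, keepdims=True)   # numerical stability
        post = np.exp(log_joint)
        post /= post.sum(axis=1, keepdims=True)              # rows sum to 1
        # M-step: weighted Baum-Welch per component; alpha_i is proportional
        # to the average assignment probability of all sequences to cluster i.
        for i in range(k):
            hmms[i].weighted_baum_welch(seqs, weights=post[:, i])
        priors = post.mean(axis=0)
    return hmms, priors
```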

Note that there is an almost exact correspondence between the steps used here and those used for mixture modeling in Sect. 6.5 of Chap. 6. The major drawback of this approach is that it can be rather slow, because the process of training each HMM is computationally expensive.


15.4 Outlier Detection in Sequences

Outlier detection in sequence data shares a number of similarities with outlier detection in time series data. The main difference between sequence data and time series data is that sequence data is discrete, whereas time series data is continuous. The discussion in the previous chapter showed that time series outliers can be either point outliers or shape outliers. Because sequence data is the discrete analog of time series data, an identical principle can be applied: sequence data outliers can be either position outliers or combination outliers.





