P(S|C_i) = P(s_1|C_i) \cdot P(s_2|s_1, C_i) \cdots P(s_n|s_1 \ldots s_{n-1}, C_i)   (15.3)
This is the generative probability of the sequence S for cluster Ci. Intuitively, the term P(sj|s1 . . . sj−1, Ci) represents the fraction of times that sj follows s1 . . . sj−1 in cluster Ci. This term can be estimated in a data-driven manner from the sequences in Ci. When a cluster is highly similar to a sequence, this value will be high. A relative similarity can be computed by comparing with a sequence generation model in which all symbols are generated randomly in proportion to their presence in the full data set. The probability of such a random generation is given by \prod_{j=1}^{n} P(s_j), where P(sj) is estimated as the fraction of sequences containing symbol sj. Then, the similarity of S to cluster Ci is defined as follows:

sim(S, C_i) = \frac{P(S|C_i)}{\prod_{j=1}^{n} P(s_j)}   (15.4)
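As a concrete illustration, the following Python sketch estimates the quantities in Eqs. 15.3 and 15.4 directly from the raw sequences. The function names (cond_prob, generative_probability, baseline_probability, sim) are illustrative rather than part of CLUSEQ, and the brute-force counting is used only for clarity; the actual algorithm estimates these terms efficiently with Probabilistic Suffix Trees, as discussed below.

```python
def cond_prob(symbol, prefix, cluster):
    """Estimate P(symbol | prefix, C_i): the fraction of times that `symbol`
    follows the subsequence `prefix` within the sequences of the cluster.
    Brute-force counting for clarity; CLUSEQ uses a probabilistic suffix tree."""
    follows, occurrences = 0, 0
    m = len(prefix)
    for seq in cluster:
        for start in range(len(seq) - m):
            if tuple(seq[start:start + m]) == tuple(prefix):
                occurrences += 1
                if seq[start + m] == symbol:
                    follows += 1
    return follows / occurrences if occurrences > 0 else 0.0


def generative_probability(S, cluster):
    """P(S | C_i) as in Eq. 15.3: a product of conditional probabilities."""
    p = 1.0
    for j, symbol in enumerate(S):
        p *= cond_prob(symbol, S[:j], cluster)
    return p


def baseline_probability(S, all_sequences):
    """Probability of S under random generation: the product of P(s_j),
    where P(s_j) is the fraction of sequences containing symbol s_j."""
    total = len(all_sequences)
    p = 1.0
    for symbol in S:
        p *= sum(1 for seq in all_sequences if symbol in seq) / total
    return p


def sim(S, cluster, all_sequences):
    """Relative similarity of Eq. 15.4."""
    denominator = baseline_probability(S, all_sequences)
    return generative_probability(S, cluster) / denominator if denominator > 0 else 0.0
```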
One issue is that many parts of the sequence S may be noisy and may not match the cluster well. Therefore, the similarity is computed as the maximum similarity of any contiguous segment of S to Ci. In other words, if Skl denotes the contiguous segment of S from position k to position l, then the final similarity SIM(S, Ci) is computed as follows:
SIM(S, C_i) = \max_{1 \le k \le l \le n} sim(S_{kl}, C_i)   (15.5)
The maximum similarity value can be obtained by evaluating sim(Skl, Ci) over all pairs [k, l]. This is the similarity value used for assigning sequences to their relevant clusters.
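Continuing the sketch above, the segment-based similarity of Eq. 15.5 can be obtained by enumerating all contiguous segments of S, of which there are O(n²). The SIM function below assumes the hypothetical sim() helper shown earlier.

```python
def SIM(S, cluster, all_sequences):
    """Eq. 15.5: maximum similarity of any contiguous segment S[k..l] to the cluster.
    Enumerates all O(n^2) contiguous segments and reuses the sim() sketch above."""
    best = 0.0
    n = len(S)
    for k in range(n):
        for l in range(k + 1, n + 1):
            best = max(best, sim(S[k:l], cluster, all_sequences))
    return best
```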
One problematic issue is that the computation of each of the terms P(sj|s1 . . . sj−1, Ci) on the right-hand side of Eq. 15.3 may require the examination of all the sequences in the cluster Ci for probability estimation purposes. Fortunately, these terms can be estimated efficiently using a data structure referred to as the Probabilistic Suffix Tree (PST). The CLUSEQ algorithm dynamically maintains these Probabilistic Suffix Trees as new clusters are created or sequences are added to clusters. This data structure will be described in detail in Sect. 15.4.1.1.
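Although the PST itself is described only in Sect. 15.4.1.1, the following simplified sketch conveys the underlying idea: the conditional probabilities are answered from precomputed counts of bounded-length contexts instead of rescanning the cluster for every query. A flat dictionary keyed by context stands in for the suffix-tree structure here, and the max_context parameter is an illustrative assumption.

```python
from collections import defaultdict


def build_context_counts(cluster, max_context=3):
    """Precompute counts of (context, next symbol) for all contexts of length
    up to max_context. A probabilistic suffix tree organizes the same counts
    in a trie; a flat dictionary is used here only for brevity."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in cluster:
        for pos, symbol in enumerate(seq):
            for length in range(min(max_context, pos) + 1):
                context = tuple(seq[pos - length:pos])
                counts[context][symbol] += 1
    return counts


def lookup_cond_prob(symbol, prefix, counts, max_context=3):
    """Approximate P(symbol | prefix, C_i) using only the last max_context
    symbols of the prefix, as a probabilistic suffix tree would."""
    context = tuple(prefix[-max_context:]) if max_context > 0 else ()
    dist = counts.get(context)
    if not dist:
        return 0.0
    return dist[symbol] / sum(dist.values())
```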
15.3.4.2 Mixture of Hidden Markov Models
This approach can be considered the string analog of the probabilistic models discussed in Sect. 6.5 of Chap. 6 for clustering multidimensional data. Recall that a generative mixture model is used in that case, where each component of the mixture has a Gaussian distribution. A Gaussian distribution is, however, appropriate only for generating numerical data, and not for generating sequences. A good generative model for sequences is the Hidden Markov Model (HMM). The discussion of this section will treat the HMM as a black box; the actual details of HMMs are discussed in Sect. 15.5. As we will see there, the HMM can itself be considered a kind of mixture model, in which states represent dependent components of the mixture. Therefore, this approach can be considered a two-level mixture model. The discussion in this section should be combined with the description of HMMs in Sect. 15.5 to provide a complete picture of HMM-based clustering.
The broad principle of a mixture-based generative model is to assume that the data was generated from a mixture of k distributions with the probability distributions G1 . . . Gk, where each Gi is a Hidden Markov Model. As in Sect. 6.5 of Chap. 6, the approach assumes
the use of prior probabilities α1 . . . αk for the different components of the mixture. Therefore, the generative process is described as follows:
Select one of the k probability distributions with prior probability αi, where i ∈ {1 . . . k}. Let us assume that the rth one is selected.
Generate a sequence from Gr, where Gr is a Hidden Markov Model.
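The following sketch illustrates this two-step generative process for a simple discrete HMM represented by numpy arrays: initial-state probabilities pi, a state-transition matrix A, and a symbol-emission matrix B. This representation and the helper names are assumptions made for illustration; any HMM implementation could be substituted.

```python
import numpy as np


def sample_hmm(hmm, length, rng):
    """Generate a symbol sequence of the given length from a discrete HMM
    described by initial probabilities pi, transition matrix A (states x states),
    and emission matrix B (states x symbols)."""
    pi, A, B = hmm["pi"], hmm["A"], hmm["B"]
    state = rng.choice(len(pi), p=pi)
    sequence = []
    for _ in range(length):
        sequence.append(int(rng.choice(B.shape[1], p=B[state])))
        state = rng.choice(A.shape[1], p=A[state])
    return sequence


def sample_from_mixture(alphas, hmms, length, seed=0):
    """Two-step generative process: pick component r with probability alpha_r,
    then generate a sequence from the r-th HMM G_r."""
    rng = np.random.default_rng(seed)
    r = rng.choice(len(alphas), p=alphas)
    return r, sample_hmm(hmms[r], length, rng)
```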
One nice characteristic of mixture models is that the change in the data type and corresponding mixture distribution does not affect the broader framework of the algorithm. The analogous steps can be applied in the case of sequence data, as they are applied in multidimensional data. Let Sj represent the jth sequence and Θ be the entire set of parameters to be estimated for the different HMMs. Then, the E-step and M-step are exactly analogous to the case of the multidimensional mixture model.
(E-step) Given the current state of the trained HMM and priors αi, determine the posterior probability P(Gi|Sj, Θ) of each sequence Sj using the HMM generative probabilities P(Sj|Gi, Θ) of Sj from the ith HMM, and priors α1 . . . αk in conjunction with the Bayes rule. This is the posterior probability that the sequence Sj was generated by the ith HMM.
(M-step) Given the current probabilities of assignments of data points to clusters, use the Baum–Welch algorithm on each HMM to learn its parameters. The assignment probabilities are used as weights for averaging the estimated parameters. The Baum–Welch algorithm is described in Sect. 15.5.4 of this chapter. The value of each αi is estimated to be proportional to the average assignment probability of all sequences to cluster i. Thus, the M-step results in the estimation of the entire set of parameters Θ.
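A minimal sketch of the resulting EM loop is shown below. In keeping with the black-box treatment of HMMs in this section, the sequence likelihood P(Sj|Gi, Θ) and the weighted Baum–Welch update are passed in as user-supplied functions; their names and signatures are assumptions made purely for illustration.

```python
import numpy as np


def em_mixture_of_hmms(sequences, hmms, alphas, likelihood, reestimate, n_iters=20):
    """EM skeleton for clustering sequences with a mixture of HMMs.
    likelihood(S, hmm) should return P(S | G_i, Theta), and
    reestimate(hmm, sequences, weights) should perform a weighted
    Baum-Welch update; both are treated as black boxes here."""
    k, m = len(hmms), len(sequences)
    alphas = np.asarray(alphas, dtype=float)
    posteriors = np.zeros((m, k))
    for _ in range(n_iters):
        # E-step: posterior P(G_i | S_j, Theta) for every sequence via the Bayes rule.
        for j, S in enumerate(sequences):
            joint = np.array([alphas[i] * likelihood(S, hmms[i]) for i in range(k)])
            posteriors[j] = joint / joint.sum()
        # M-step: weighted re-estimation of each HMM, with priors proportional
        # to the average assignment probability of all sequences to the cluster.
        for i in range(k):
            hmms[i] = reestimate(hmms[i], sequences, posteriors[:, i])
        alphas = posteriors.mean(axis=0)
    return hmms, alphas, posteriors
```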
Note that there is an almost exact correspondence between the steps used here and those used for mixture modeling in Sect. 6.5 of Chap. 6. The major drawback of this approach is that it can be rather slow, because the process of training each HMM is computationally expensive.
15.4 Outlier Detection in Sequences
Outlier detection in sequence data shares a number of similarities with outlier detection in time series data. The main difference between sequence data and time series data is that sequence data is discrete, whereas time series data is continuous. The discussion in the previous chapter showed that time series outliers can be either point outliers or shape outliers. Because sequence data is the discrete analog of time series data, an identical principle can be applied: sequence data outliers can be either position outliers or combination outliers.