[Figure: a four-state model with states Doer (grade distribution A=80%, B=20%), Slacker (A=20%, B=80%), Doer on break (A=40%, B=60%), and Slacker at work (A=60%, B=40%), connected by state-transition probabilities.]
Figure 15.7: Extending the model in Fig. 15.6 with two more states provides greater expressive power for modeling sequences
slackers may sometimes become temporarily inspired to be doers, but may eventually return to what they are best at. Such episodes will result in local portions of the sequence that are distinctive from the remaining sequence. These scenarios can be captured with the four-state Markov Model illustrated in Fig. 15.7. The larger the number of states, the more complex the scenarios that can be captured. Of course, more training data is required to learn the (larger number of) parameters of such a model; otherwise, overfitting may result. For smaller data sets, the transition probabilities and symbol-generation probabilities are not estimated accurately.
15.5.1 Formal Definition and Techniques for HMMs
In this section, Hidden Markov Models will be formally introduced along with the associated training methods. It is assumed that a Hidden Markov Model contains n states denoted by {s1 . . . sn}. The symbol set from which the observations are generated is denoted by Σ = {σ1 . . . σ|Σ|}. The symbols are generated from the model by a sequence of transitions from one state to another. Each visit to a state (including self-transitions) generates a symbol drawn from a categorical probability distribution on Σ. The symbol emission distribution is specific to each state. The probability P(σi|sj) that the symbol σi is generated from state sj is denoted by θj(σi). The probability of a transition from state si to sj is denoted by pij. The initial state probabilities are denoted by π1 . . . πn for the n different states. The topology of the model can be expressed as a network G = (M, A), in which M is the set of states {s1 . . . sn}. The set A represents the possible transitions between the states. In the most common scenario, where the architecture of the model is constructed with a domain-specific understanding, the set A is not the complete network. In cases where domain-specific knowledge is not available, the set A may correspond to the complete
network, including self-transitions. The goal of training the HMM is to learn the
initial state probabilities, transition probabilities, and the symbol emission probabilities from the training database {T1 . . . TN }. Three methodologies are commonly leveraged in creating and using a Hidden Markov Model:
• Training: Given a set of training sequences T1 . . . TN , estimate the model parameters, such as the initial probabilities, transition probabilities, and symbol emission probabilities, with an Expectation-Maximization algorithm. The Baum–Welch algorithm is used for this purpose.
• Evaluation: Given a test sequence V (or comparison unit Ui), determine the probability that it fits the HMM. This is used to determine the anomaly scores. A recursive forward algorithm is used to compute this.
• Explanation: Given a test sequence V , determine the most likely sequence of states that generated this test sequence. This is helpful for providing an understanding of why a sequence should be considered an anomaly (in outlier detection) or belong to a specific class (in data classification). The idea is that the states correspond to an intuitive understanding of the underlying system. In the example of Fig. 15.6, it would be useful to know that an observed sequence is an anomaly because of the unusual oscillation of a student between doer and slacker states. This can provide the intensional knowledge for understanding the state of a system. This most likely sequence of states is computed with the Viterbi algorithm.
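To make the notation of the formal definition concrete, the following is a minimal Python sketch of a container for the model parameters: the initial state probabilities πj, the transition probabilities pij, and the state-specific emission distributions θj(σk). All class and variable names are illustrative rather than taken from the text, and the numerical values only loosely echo the two-state doer/slacker model of Fig. 15.6.

```python
from dataclasses import dataclass

@dataclass
class HMMParameters:
    """Container mirroring the notation of the formal definition:
    pi_j (initial), p_ij (transition), and theta_j(sigma_k) (emission)."""
    symbols: list[str]              # symbol set Sigma = {sigma_1 ... sigma_|Sigma|}
    initial: list[float]            # initial[j]       = pi_j
    transition: list[list[float]]   # transition[i][j] = p_ij
    emission: list[list[float]]     # emission[j][k]   = theta_j(sigma_k)

    @property
    def n_states(self) -> int:
        return len(self.initial)


# Hypothetical two-state doer/slacker model in the spirit of Fig. 15.6;
# the numerical values are illustrative only.
model = HMMParameters(
    symbols=["A", "B"],
    initial=[0.5, 0.5],
    transition=[[0.99, 0.01],   # doer    -> doer, slacker
                [0.10, 0.90]],  # slacker -> doer, slacker
    emission=[[0.80, 0.20],     # doer generates A=80%, B=20%
              [0.20, 0.80]],    # slacker generates A=20%, B=80%
)
```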
Since the description of the training procedure relies on technical ideas developed for the evaluation method, we will deviate from the natural order of presentation and present the training algorithms last. The evaluation and explanation techniques will assume that the model parameters, such as the transition probabilities, are already available from the training phase.
15.5.2 Evaluation: Computing the Fit Probability for Observed Sequence
One approach for determining the fit probability of a sequence V = a1 . . . am would be to compute all the n^m possible sequences of states (paths) in the HMM, and compute the probability of each, based on the observed sequence, symbol-generation probabilities, and transition probabilities. The sum of these values can be reported as the fit probability. Obviously, such an approach is not practical because it requires the enumeration of an exponential number of possibilities.
This computation can be greatly reduced by recognizing that the fit probability of the first r symbols (and a fixed value of the rth state) can be recursively computed in terms of the corresponding fit probability of the first (r − 1) observable symbols (and a fixed (r − 1)th state). Specifically, let αr(V, sj) be the probability that the first r symbols in V are generated by the model, and the last state in the sequence is sj. Then, the recursive computation is as follows:
\[
\alpha_r(V, s_j) = \sum_{i=1}^{n} \alpha_{r-1}(V, s_i) \cdot p_{ij} \cdot \theta_j(a_r)
\]
This approach recursively sums up the probabilities over the n different paths corresponding to the different penultimate nodes. The aforementioned relationship is iteratively applied for r = 1 . . . m. The probability of the first symbol is computed as α1(V, sj ) = πj · θj (a1) for initializing the recursion.
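As a concrete illustration, the sketch below implements the recursion above, iterating r = 1 . . . m and reporting the sum of αm(V, sj) over all states as the fit probability. It assumes the hypothetical HMMParameters container from the earlier sketch; the function name and usage are illustrative, not part of the text.

```python
def forward_fit_probability(model: "HMMParameters", observed: list[str]) -> float:
    """Recursive forward algorithm: returns the fit probability of V = a_1 ... a_m."""
    idx = {sym: k for k, sym in enumerate(model.symbols)}
    n = model.n_states

    # Initialization: alpha_1(V, s_j) = pi_j * theta_j(a_1)
    a = idx[observed[0]]
    alpha = [model.initial[j] * model.emission[j][a] for j in range(n)]

    # Recursion: alpha_r(V, s_j) = sum_i alpha_{r-1}(V, s_i) * p_ij * theta_j(a_r)
    for r in range(1, len(observed)):
        a = idx[observed[r]]
        alpha = [
            sum(alpha[i] * model.transition[i][j] for i in range(n)) * model.emission[j][a]
            for j in range(n)
        ]

    # Fit probability of the full sequence: sum over all possible final states
    return sum(alpha)


# Usage with the hypothetical model from the earlier sketch:
# forward_fit_probability(model, ["A", "A", "B", "A"])
```

The running time is O(m · n^2), in contrast to the exponential cost of enumerating all n^m state paths.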