Data Mining: The Textbook




15.5. HIDDEN MARKOV MODELS











[Figure 15.7 (schematic): a four-state model in which the DOER state emits grades A = 80%, B = 20%; the SLACKER state emits A = 20%, B = 80%; the DOER ON BREAK state emits A = 40%, B = 60%; and the SLACKER AT WORK state emits A = 60%, B = 40%. The states are connected by transition probabilities (values such as 0.79, 0.70, 0.60, 0.40, 0.20, 0.10, and 0.01 appear on the edges).]
Figure 15.7: Extending the model in Fig. 15.6 with two more states provides greater expressive power for modeling sequences




slackers may sometimes become temporarily inspired to be doers, but may eventually return to what they are best at. Such episodes will result in local portions of the sequence that are distinctive from the remaining sequence. These scenarios can be captured with the four-state Markov Model illustrated in Fig. 15.7. The larger the number of states, the more complex the scenarios that can be captured. Of course, more training data is required to learn the larger number of parameters of such a model; otherwise, overfitting may result. For smaller data sets, the transition probabilities and symbol-generation probabilities cannot be estimated accurately.
15.5.1 Formal Definition and Techniques for HMMs

In this section, Hidden Markov Models will be formally introduced along with the associated training methods. It is assumed that a Hidden Markov Model contains n states denoted by {s1 . . . sn}. The symbol set from which the observations are generated is denoted by Σ = {σ1 . . . σ|Σ|}. The symbols are generated from the model by a sequence of transitions from one state to the other. Each visit to a state (including self-transitions) generates a symbol drawn from a categorical⁴ probability distribution on Σ. The symbol emission distribution is specific to each state. The probability P(σi|sj) that the symbol σi is generated from state sj is denoted by θj(σi). The probability of a transition from state si to sj is denoted by pij. The initial state probabilities are denoted by π1 . . . πn for the n different states. The topology of the model can be expressed as a network G = (M, A), in which M is the set of states {s1 . . . sn}. The set A represents the possible transitions between the states. In the most common scenario, where the architecture of the model is constructed with a domain-specific understanding, the set A is not the complete network. In cases where domain-specific knowledge is not available, the set A may correspond to the complete network, including self-transitions.




⁴HMMs can also generate continuous time series, though they are less commonly used in time-series analysis.




The goal of training the HMM is to learn the initial state probabilities, transition probabilities, and the symbol emission probabilities from the training database {T1 . . . TN}. Three methodologies are commonly leveraged in creating and using a Hidden Markov Model:



• Training: Given a set of training sequences T1 . . . TN , estimate the model parameters, such as the initial probabilities, transition probabilities, and symbol emission probabilities with an Expectation-Maximization algorithm. The Baum–Welch algorithm is used for this purpose.




• Evaluation: Given a test sequence V (or comparison unit Ui), determine the probability that it fits the HMM. This is used to determine the anomaly scores. A recursive forward algorithm is used to compute this.




  • Explanation: Given a test sequence V , determine the most likely sequence of states that generated this test sequence. This is helpful for providing an understanding of why a sequence should be considered an anomaly (in outlier detection) or belong to a specific class (in data classification). The idea is that the states correspond to an intuitive understanding of the underlying system. In the example of Fig. 15.6, it would be useful to know that an observed sequence is an anomaly because of the unusual oscillation of a student between doer and slacker states. This can provide the intensional knowledge for understanding the state of a system. This most likely sequence of states is computed with the Viterbi algorithm.

Since the description of the training procedure relies on technical ideas developed for the evaluation method, we will deviate from the natural order of presentation and present the training algorithms last. The evaluation and explanation techniques will assume that the model parameters, such as the transition probabilities, are already available from the training phase.
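To make the parameterization concrete, the sketch below shows one way a trained parameter set (the initial probabilities πi, the transition probabilities pij, and the state-specific emission distributions θj(σ)) might be stored in code. This is a minimal illustrative sketch, not the book's notation: the class name HMMParameters is an assumption, and the initial and transition values for the two-state doer/slacker model of Fig. 15.6 are placeholders; only the grade distributions (A = 80%/B = 20% for the doer, A = 20%/B = 80% for the slacker) match the values shown for those states in Fig. 15.7.

    import numpy as np

    class HMMParameters:
        """Minimal container for the parameters of a discrete-symbol HMM (illustrative sketch)."""

        def __init__(self, initial, transition, emission):
            self.initial = np.asarray(initial)        # pi_i: probability of starting in state i
            self.transition = np.asarray(transition)  # p_ij: probability of a transition from state i to state j
            self.emission = np.asarray(emission)      # theta_j(sigma): row j is the symbol distribution of state j
            # Each distribution must sum to 1.
            assert np.isclose(self.initial.sum(), 1.0)
            assert np.allclose(self.transition.sum(axis=1), 1.0)
            assert np.allclose(self.emission.sum(axis=1), 1.0)

    # Two-state doer/slacker model with symbols A (index 0) and B (index 1).
    # The initial and transition probabilities below are illustrative placeholders.
    doer_slacker = HMMParameters(
        initial=[0.5, 0.5],                  # equally likely to start as doer or slacker (assumption)
        transition=[[0.99, 0.01],            # doer: mostly remains a doer
                    [0.10, 0.90]],           # slacker: mostly remains a slacker
        emission=[[0.80, 0.20],              # doer: A = 80%, B = 20%
                  [0.20, 0.80]],             # slacker: A = 20%, B = 80%
    )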


15.5.2 Evaluation: Computing the Fit Probability for Observed Sequence

One approach for determining the fit probability of a sequence V = a1 . . . am would be to enumerate all the n^m possible sequences of states (paths) in the HMM, and compute the probability of each, based on the observed sequence, symbol-generation probabilities, and transition probabilities. The sum of these values can be reported as the fit probability. Obviously, such an approach is not practical because it requires the enumeration of an exponential number of possibilities.
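As a point of reference, the short sketch below (a hedged illustration, not taken from the book) implements this brute-force enumeration directly: it iterates over all n^m state paths with itertools.product and sums their probabilities. It is usable only for very small n and m, which is exactly why the recursive approach described next is needed. The parameter arrays are the same illustrative placeholders used earlier.

    import itertools

    import numpy as np

    def fit_probability_brute_force(initial, transition, emission, observed):
        """Sum the probability of the observed symbols over all n^m possible state paths."""
        n, m = len(initial), len(observed)
        total = 0.0
        for path in itertools.product(range(n), repeat=m):   # all n^m state sequences
            prob = initial[path[0]] * emission[path[0]][observed[0]]
            for r in range(1, m):
                prob *= transition[path[r - 1]][path[r]] * emission[path[r]][observed[r]]
            total += prob
        return total

    # Tiny example with the placeholder doer/slacker parameters; symbols A=0, B=1.
    initial = np.array([0.5, 0.5])
    transition = np.array([[0.99, 0.01], [0.10, 0.90]])
    emission = np.array([[0.80, 0.20], [0.20, 0.80]])
    print(fit_probability_brute_force(initial, transition, emission, observed=[0, 0, 1]))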


This computation can be greatly reduced by recognizing that the fit probability of the first r symbols (and a fixed value of the rth state) can be recursively computed in terms of the corresponding fit probability of the first (r − 1) observable symbols (and a fixed (r − 1)th state). Specifically, let αr(V, sj) be the probability that the first r symbols in V are generated by the model, and the last state in the sequence is sj. Then, the recursive computation is as follows:




αr(V, sj) = Σ_{i=1}^{n} αr−1(V, si) · pij · θj(ar)

This approach recursively sums up the probabilities for all the n different paths for different penultimate nodes. The aforementioned relationship is iteratively applied for r = 1 . . . m. The probability of the first symbol is computed as α1(V, sj) = πj · θj(a1) for initializing the recursion.
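A minimal sketch of this forward recursion is shown below, using the same illustrative array-based parameters as before (the names alpha, initial, transition, and emission are assumptions for this sketch, not the book's notation). Row r of alpha stores the values αr+1(V, sj) for every state, so the fit probability of the whole sequence is obtained by summing the final row; on the tiny example it matches the brute-force enumeration, but at a cost of O(n² · m) rather than O(n^m).

    import numpy as np

    def forward_fit_probability(initial, transition, emission, observed):
        """Compute the fit probability of an observed symbol sequence with the forward algorithm."""
        n, m = len(initial), len(observed)
        alpha = np.zeros((m, n))
        # Initialization: alpha_1(V, s_j) = pi_j * theta_j(a_1)
        alpha[0] = initial * emission[:, observed[0]]
        # Recursion: alpha_r(V, s_j) = sum_i alpha_{r-1}(V, s_i) * p_ij * theta_j(a_r)
        for r in range(1, m):
            alpha[r] = (alpha[r - 1] @ transition) * emission[:, observed[r]]
        # The fit probability sums alpha_m over all possible final states.
        return alpha[-1].sum()

    # Same placeholder doer/slacker parameters as before; symbols A=0, B=1.
    initial = np.array([0.5, 0.5])
    transition = np.array([[0.99, 0.01], [0.10, 0.90]])
    emission = np.array([[0.80, 0.20], [0.20, 0.80]])
    print(forward_fit_probability(initial, transition, emission, observed=[0, 0, 1]))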



