[Figure 15.6: Generating grade sequences from a Hidden Markov Model. The figure shows a two-state model with states doer and slacker. Grade distributions: the doer emits A = 80% and B = 20%; the slacker emits A = 20% and B = 80%. Transition probabilities: doer to doer = 0.99, doer to slacker = 0.01, slacker to slacker = 0.90, slacker to doer = 0.10. Some examples of student grade sequences: AAABAAAABAAAA, doer (very common); BBBBABBBABBBB, slacker (less common); AAABAAABBABBB, doer turns slacker (very rare); ABABABABABABA, unpredictable (extremely rare).]
For a student in the doer state, the grade A is emitted with 80% probability and the grade B with 20% probability; for a student in the slacker state, these values are reversed. Although these probabilities are explicitly specified here for illustrative purposes, they need to be learned or estimated from the observed grade sequences of the different students and are not known a priori. The precise status (state) of any student in a given week is not known to the analyst at any given time. These grade sequences are, in fact, the only observable outputs for the analyst. Therefore, from the perspective of the analyst, this is a Hidden Markov Model, which generates the sequences of grades from an unknown sequence of states, representing the state transitions of the students. The precise sequence of transitions between the states can only be estimated for a particular observed sequence.
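The generative process of Fig. 15.6 is straightforward to simulate. The following is a minimal sketch in Python; the names INITIAL, TRANSITION, EMISSION, and generate_grades are illustrative rather than from the text, and the initial state probabilities are assumed to equal the steady-state distribution of the transition matrix (10/11 for doer, 1/11 for slacker), consistent with the footnote at the end of this section.

    import random

    # Initial state probabilities (assumed steady-state values; see footnote).
    INITIAL = {"doer": 10 / 11, "slacker": 1 / 11}

    # Weekly state-transition probabilities from Fig. 15.6.
    TRANSITION = {
        "doer":    {"doer": 0.99, "slacker": 0.01},
        "slacker": {"doer": 0.10, "slacker": 0.90},
    }

    # Grade-emission probabilities from Fig. 15.6.
    EMISSION = {
        "doer":    {"A": 0.80, "B": 0.20},
        "slacker": {"A": 0.20, "B": 0.80},
    }

    def sample(dist):
        """Draw one key from a {key: probability} dictionary."""
        return random.choices(list(dist), weights=list(dist.values()))[0]

    def generate_grades(num_weeks=13):
        """Emit one grade per week; the state path remains hidden."""
        state = sample(INITIAL)
        grades = []
        for _ in range(num_weeks):
            grades.append(sample(EMISSION[state]))
            state = sample(TRANSITION[state])
        return "".join(grades)

    print(generate_grades())  # e.g., "AAABAAAABAAAA" for a persistent doer

Note that only the return value of generate_grades is visible to the analyst; the state variable inside the loop is exactly the hidden information that must be estimated.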
The two-state Hidden Markov Model for the aforementioned example is illustrated in Fig. 15.6. This model contains two states, denoted by doer and slacker, that represent the state of a student in a particular week. It is possible for a student to transition from one state to another each week, though the likelihood of this is rather low. It is assumed that a set of initial state probabilities governs the a priori distribution of doers and slackers. This distribution represents the a priori understanding about the students when they join the course. Some examples of typical sequences generated from this model, along with their rarity level, are illustrated in Fig. 15.6. For example, the sequence AAABAAAABAAAA is most likely generated by a student who is consistently in the doer state, and the sequence BBBBABBBABBBB is most likely generated by a student who is consistently in the slacker state. The second sequence is typically rarer than the first because the population mostly contains doers.³ The sequence AAABAAABBABBB corresponds to a doer who eventually transitions into a slacker. This case is even rarer because it requires a transition from the doer state to the slacker state, which has a very low probability. The sequence ABABABABABABA is extremely anomalous because it does not represent the temporally consistent doer or slacker behavior implied by the model. Correspondingly, such a sequence has a very low probability of fitting the model.
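The statement that a sequence has a low probability of fitting the model can be made precise with the forward algorithm, which sums the probability of the observed sequence over all possible hidden state paths. The sketch below reuses the INITIAL, TRANSITION, and EMISSION tables from the previous snippet; sequence_probability is again an illustrative name, not notation from the text.

    def sequence_probability(grades):
        """Forward algorithm: probability that the model emits `grades`."""
        # alpha[s] = probability of the observed prefix, ending in state s
        alpha = {s: INITIAL[s] * EMISSION[s][grades[0]] for s in INITIAL}
        for g in grades[1:]:
            alpha = {s: sum(alpha[r] * TRANSITION[r][s] for r in alpha)
                        * EMISSION[s][g]
                     for s in TRANSITION}
        return sum(alpha.values())

    # The consistent doer sequence fits the model far better than the
    # alternating sequence of the same length (about 3e-3 versus 1e-5).
    print(sequence_probability("AAABAAAABAAAA"))
    print(sequence_probability("ABABABABABABA"))

Under the probabilities of Fig. 15.6, the doer sequence AAABAAAABAAAA is more likely than the alternating sequence ABABABABABABA by roughly two orders of magnitude, which quantifies the rarity levels annotated in the figure.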
A larger number of states in the Markov Model can be used to encode more complex scenarios. It is possible to encode domain knowledge with the use of states that describe different generating scenarios. In the example discussed earlier, consider the case in which doers sometimes slack off for short periods and then return to their usual state. Alternatively,
³The assumption is that the initial set of state probabilities is approximately consistent with the steady-state behavior of the model for the particular set of transition probabilities shown in Fig. 15.6.