Data Mining: The Textbook




Position outliers: In position-based outliers, the values at specific positions are predicted by a model. The deviation of the observed value from the prediction is then used to flag specific positions as outliers. Typically, Markovian methods are used for predictive outlier detection. This is analogous to deviation-based outliers discovered in time-series data with the use of regression models. Unlike regression models, Markovian models are better suited to discrete data. Such outliers are referred to as contextual outliers because they are outliers in the context of their immediate temporal neighborhood.

CHAPTER 15. MINING DISCRETE SEQUENCES





Combination outliers: In combination outliers, an entire test sequence is deemed to be unusual because of the combination of symbols in it. This could be the case because this combination may rarely occur in a sequence database, or its distance (or similarity) to most other subsequences of similar size may be very large (or small). More complex models, such as Hidden Markov Models, can also be used to model the frequency of presence in terms of generative probabilities. For a longer test sequence, smaller subsequences are extracted from it for testing, and then the outlier score of the entire sequence is predicted as a combination of these values. This is analogous to the determination of unusual shapes in time-series data. Such outliers are referred to as collective outliers because they are defined by combining the patterns from multiple data items.

The following section will discuss these different types of outliers.
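As a rough illustration of the window-based idea described for combination outliers, the following Python sketch extracts short windows from a test sequence, looks up how often each window occurs in a training database, and combines the per-window values into one score. The function names, the window length k, the floor value, and the toy database are assumptions made for illustration. Averaging negative log-frequencies is only one possible way to combine the window scores, not a method prescribed by the text.

```python
from collections import Counter
import math

def windows(seq, k):
    """Return all contiguous length-k windows of seq as tuples."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]

def window_frequencies(database, k):
    """Fraction of training sequences that contain each length-k window."""
    counts = Counter()
    for seq in database:
        counts.update(set(windows(seq, k)))  # count a window once per sequence
    return {w: c / len(database) for w, c in counts.items()}

def combination_score(test_seq, freqs, k, floor=1e-3):
    """Average negative log-frequency of the windows in test_seq.

    Higher values mean the sequence is built from rarer windows.
    The floor (an assumed constant) avoids log(0) for unseen windows.
    """
    ws = windows(test_seq, k)
    if not ws:
        return 0.0
    return sum(-math.log(max(freqs.get(w, 0.0), floor)) for w in ws) / len(ws)

# Toy training database (made up for this sketch).
database = [list("ABABAB"), list("ABABBA"), list("BABABA"), list("ABBABA")]
freqs = window_frequencies(database, k=2)

normal = combination_score(list("ABABAB"), freqs, k=2)
odd = combination_score(list("AAAAAA"), freqs, k=2)
print(normal, odd)  # the second score is larger because "AA" never occurs in training
```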


15.4.1 Position Outliers


In the case of continuous time-series data discussed in the previous chapter, an important class of outliers was defined by determining significant deviations from expected values at timestamps. Thus, these methods intimately combine the problems of forecasting and deviation detection. A similar principle applies to discrete sequence data, in which the discrete values at specific positions can be predicted with the use of different models. When a position has a very low probability of matching its forecasted value, it is considered an outlier. For example, consider an RFID application in which event sequences are associated with product items in a superstore with the use of semantic extraction from RFID tags. A typical example of a normal event sequence is as follows:


PlacedOnShelf, RemovedFromShelf, CheckOut, ExitStore.


On the other hand, in a shoplifting scenario, the event sequence may be unusually different. An example of an event sequence in the shoplifting scenario is as follows:


PlacedOnShelf, RemovedFromShelf, ExitStore.


Clearly, the sequence symbol ExitStore is anomalous in the second case but not in the first case. This is because it does not match the expected or forecasted value for that position in the second case. It is desirable to detect such anomalous positions on the basis of expected values. Such anomalous positions may appear anywhere in the sequence and not necessarily at the last element, as in the aforementioned example. The basic problem definition for position outlier detection is as follows:


Definition 15.4.1 Given a set of $N$ training sequences $\mathcal{D} = T_1 \ldots T_N$, and a test sequence $V = a_1 \ldots a_n$, determine if the position $a_i$ in the test sequence should be considered an anomaly based on its expected value.

Some formulations do not explicitly distinguish between training and test sequences. This is because a sequence can be used for both model construction and outlier analysis when it is very long.
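To make the RFID example concrete, here is a minimal Python sketch that estimates a first-order Markov model from training sequences and flags positions whose symbol has a low estimated probability given the previous symbol. The helper names, the START marker, the threshold of 0.05, and the toy training set (ten copies of the normal sequence) are all assumptions for illustration. The estimates are raw relative frequencies without smoothing, so any unseen transition gets probability zero.

```python
from collections import Counter, defaultdict

START = "<start>"  # hypothetical marker for the beginning of a sequence

def fit_transitions(training):
    """Count previous-symbol -> symbol transitions over all training sequences."""
    counts = defaultdict(Counter)
    for seq in training:
        prev = START
        for sym in seq:
            counts[prev][sym] += 1
            prev = sym
    return counts

def position_probabilities(seq, counts):
    """Estimated P(symbol | previous symbol) for every position of seq."""
    probs = []
    prev = START
    for sym in seq:
        seen = counts.get(prev, Counter())
        total = sum(seen.values())
        probs.append(seen[sym] / total if total else 0.0)
        prev = sym
    return probs

def position_outliers(seq, counts, threshold=0.05):
    """Indices whose estimated probability falls below the (assumed) threshold."""
    probs = position_probabilities(seq, counts)
    return [i for i, p in enumerate(probs) if p < threshold]

normal = ["PlacedOnShelf", "RemovedFromShelf", "CheckOut", "ExitStore"]
shoplifting = ["PlacedOnShelf", "RemovedFromShelf", "ExitStore"]

counts = fit_transitions([normal] * 10)  # toy training data
flagged_normal = position_outliers(normal, counts)
flagged_shoplifting = position_outliers(shoplifting, counts)
print(flagged_normal, flagged_shoplifting)
```

In this toy setting, only the ExitStore position of the shoplifting sequence is flagged, because ExitStore never follows RemovedFromShelf in the training data.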


Typically, the position $a_i$ can be predicted in temporal domains only from the positions before $a_i$, whereas in other domains, such as biological data, both directions may be relevant. The discussion below will assume the temporal scenario, though generalization to the placement scenario (as in biological data) is straightforward by examining windows on both sides of the position.
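The difference between the temporal and the placement scenario can be sketched in Python by changing only how the context of a position is formed: the k preceding symbols alone, or the k symbols on each side. The function names, the window size k, and the two toy training strings are assumptions for illustration, and the probabilities are unsmoothed relative frequencies.

```python
from collections import Counter, defaultdict

def context(seq, i, k, two_sided=False):
    """The k symbols before position i, plus the k symbols after it if two_sided."""
    before = tuple(seq[max(0, i - k):i])
    after = tuple(seq[i + 1:i + 1 + k]) if two_sided else ()
    return before, after

def fit(training, k, two_sided=False):
    """Count which symbols occur in each context across the training sequences."""
    counts = defaultdict(Counter)
    for seq in training:
        for i, sym in enumerate(seq):
            counts[context(seq, i, k, two_sided)][sym] += 1
    return counts

def prob(seq, i, counts, k, two_sided=False):
    """Estimated probability of seq[i] given its context (0.0 for unseen contexts)."""
    seen = counts.get(context(seq, i, k, two_sided), Counter())
    total = sum(seen.values())
    return seen[seq[i]] / total if total else 0.0

training = ["ACG", "ATT"]  # toy data, made up for this sketch
one_sided = fit(training, k=1)
both_sides = fit(training, k=1, two_sided=True)

test = "ATG"
p_temporal = prob(test, 1, one_sided, k=1)                     # context: 'A' before
p_placement = prob(test, 1, both_sides, k=1, two_sided=True)   # context: 'A' before, 'G' after
print(p_temporal, p_placement)
```

With this toy data, the one-sided model finds T after A unremarkable, whereas the two-sided model has never seen T between A and G and assigns it probability zero.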



