this case because it does not make any assumptions on the relative lengths of the different time series. The approach is described in detail in Sect. 6.3.4 of Chap. 6. The main difference from the description provided in that section is the choice of the similarity function. Any of the similarity functions described in Sect. 3.4.1 of Chap. 3 may be used. The CLARANS method discussed in Sect. 7.3.1 of Chap. 7 can also be generalized to this case.
14.5.2.3 Hierarchical Methods
The hierarchical methods, discussed in Sect. 6.4 of Chap. 6, can also be generalized to any data type because they work with pairwise distances between the different data objects. In these methods, the main challenge is that distance computations between all pairs of time series are required. Many time series distance and similarity functions require expensive dynamic programming methods. This is a major disadvantage in the use of hierarchical methods. Nevertheless, the approach can still be used quite effectively in cases where the total number of time series is small.
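The cost structure described above can be made concrete with a minimal sketch. It assumes dynamic time warping (DTW) with an absolute-difference local cost as the distance function and single linkage as the merging criterion; both are illustrative choices rather than ones prescribed by the text, and the function names are hypothetical. Each DTW call is itself a quadratic-time dynamic program, and it is invoked once for every pair of series.

```python
from itertools import combinations

def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    Runs in O(len(a) * len(b)) time, which is what makes all-pairs
    computation expensive for long series.
    """
    inf = float("inf")
    # prev[j] is the best cost of aligning the prefix of `a` seen so far with b[:j].
    prev = [0.0] + [inf] * len(b)
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(b)]

def single_linkage(series, num_clusters):
    """Naive agglomerative clustering; returns clusters as lists of series indices.

    Assumes 1 <= num_clusters <= len(series).
    """
    n = len(series)
    # All n*(n-1)/2 pairwise distances are computed up front: this is the bottleneck.
    dist = {(i, j): dtw_distance(series[i], series[j])
            for i, j in combinations(range(n), 2)}
    clusters = [[i] for i in range(n)]
    while len(clusters) > num_clusters:
        # Pick the two clusters whose closest members are nearest to each other.
        best = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: min(dist[min(i, j), max(i, j)]
                              for i in clusters[p[0]] for j in clusters[p[1]]),
        )
        merged = clusters[best[0]] + clusters[best[1]]
        clusters = [c for k, c in enumerate(clusters) if k not in best] + [merged]
    return clusters

series = [[1, 2, 3], [1, 2, 2, 3], [10, 11, 12], [10, 10, 11, 12]]
result = sorted(sorted(c) for c in single_linkage(series, 2))
```

Note that the series in this example have different lengths, which DTW accommodates. For a small number of series the all-pairs table is cheap; the quadratic growth in the number of pairs is what limits the approach on larger collections.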
14.5.2.4 Graph-Based Methods
Graph-based methods provide a transformational approach to time series data clustering. The idea is to transform the time series data set into a single large graph, on which community detection algorithms can be applied. As discussed in Sect. 2.2.2.9 of Chap. 2, any data type can be converted to a similarity graph, once a similarity function has been defined. Each node in this graph corresponds to a data object. Each node is connected to its k-nearest neighbors, and the weight of the edge is equal to the similarity between the corresponding pair of objects. Once a similarity graph has been defined, any of the graph clustering algorithms discussed in Sect. 19.3 of Chap. 19 can be used to determine node clusters. The spectral method of Sect. 19.3.4 is most commonly used. The clusters (communities) of nodes can then be mapped back to clusters of time series by using the correspondence between nodes and time series data objects.
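The construction of the similarity graph can be sketched as follows. The sketch assumes equal-length series and a Gaussian kernel on the Euclidean distance as the similarity function; any similarity function could be substituted. For brevity, connected components are used as a crude stand-in for the node clustering step. This is not the spectral method mentioned above, and the function names are hypothetical.

```python
import math

def similarity(a, b, sigma=1.0):
    """Gaussian-kernel similarity on Euclidean distance (an assumed choice)."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2 * sigma ** 2))

def knn_similarity_graph(series, k):
    """Return {node: {neighbor: weight}}; each node links to its k most similar nodes."""
    n = len(series)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        sims = [(similarity(series[i], series[j]), j) for j in range(n) if j != i]
        for w, j in sorted(sims, reverse=True)[:k]:
            # Store the edge in both directions so that the graph is undirected.
            graph[i][j] = w
            graph[j][i] = w
    return graph

def connected_components(graph):
    """Stand-in for a graph clustering algorithm: groups mutually reachable nodes."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nb in graph[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        components.append(sorted(component))
    return components

series = [[0, 0, 0], [0.1, 0, 0], [5, 5, 5], [5, 5.1, 5]]
graph = knn_similarity_graph(series, k=1)
# Node i of the graph corresponds to series[i], so node clusters map back directly.
clusters = connected_components(graph)
```

Because node i corresponds to the i-th series, mapping node clusters back to clusters of time series requires no additional bookkeeping.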
14.6 Time Series Outlier Detection
As in the case of time series clustering, the problem of outlier detection in time series can be defined in two different ways.
Point outliers: A point outlier is a sudden change in a time series value at a given timestamp. This problem is closely related to forecasting, because an outlier is defined as a significant deviation from expected (or forecasted) values. Such outliers are referred to as contextual outliers because they are outliers in the context of their immediate history.
Shape outliers: In this case, a consecutive pattern of data points in a contiguous window may be defined as an anomaly. For example, in an ECG series, an irregular heart beat may be considered an anomaly when considered together, although no individual point in the series may be considered an anomaly. Such outliers are referred to as collective outliers because they are defined by combining the patterns from multiple data items.
[Figure 14.11, two panels. (a) Relative value of S&P 500 versus progress of time (May 6, 2010). (b) Relative value of S&P 500 versus progress of time (year 2001).]
Figure 14.11: Behavior of the S&P 500 (a) on the day of the flash crash (May 6, 2010), and (b) during the year 2001
To illustrate the distinction between these two kinds of anomalies, an example from financial markets will be used. The two cases illustrated in Fig. 14.11a, b show the behavior³ of the S&P 500 over different periods in time. Figure 14.11a illustrates the movement of the S&P 500 on May 6, 2010. This was the date of the stock market flash crash. This is a very unusual event both from the perspective of the point deviation at the time of the drop, and from the perspective of the shape of the drop. A different scenario is illustrated in Fig. 14.11b. Here, the variation of the S&P 500 during the year 2001 is illustrated. There are two significant drops over the course of the year, one because of stock market weakness, and the other because of the 9/11 terrorist attacks. While the specific timestamps of the drops may be considered somewhat abnormal based on deviation analysis over specific windows, the actual shape of this time series is not unusual because it is frequently encountered during bear markets (periods of market weakness). Thus, these two kinds of outliers require dedicated methods for analysis. It should be pointed out that a similar distinction between the two kinds of outliers can be defined in many contextual data types such as discrete sequence data. These are referred to as point outliers and combination outliers, respectively, for the case of discrete sequence data. The combination outliers in discrete sequence data are analogous to shape outliers in continuous time series data. This is discussed in greater detail in Chap. 15.
14.6.1 Point Outliers
Point outliers are closely related to the problem of forecasting in time series data. A data point is considered an outlier if it deviates significantly from its expected (or forecasted) value. Such point outliers correspond to unsupervised events in the underlying data. Event detection is often considered a synonym for temporal outlier detection when it is performed in real time.
Point outliers can be defined for either univariate or multivariate data. The univariate and multivariate cases are almost identical. Therefore, the more general case of multivariate data will be discussed. As in previous sections, assume that the multivariate series on which the outliers are to be detected is denoted by Y1 ... Yn. The overall approach comprises four steps:
³ The tracking Exchange Traded Fund (ETF) SPY was used.