Data Mining: The Textbook




14.6. TIME SERIES OUTLIER DETECTION





1. Determine the forecasted values of the time series at each timestamp. Depending on the nature of the underlying series, any of the univariate or multivariate methodologies discussed in Sect. 14.3 may be used. Let the forecasted value at the $r$th timestamp $t_r$ be denoted by $W_r$.




2. Compute the (possibly multivariate) time series of deviations $\Delta_1 \ldots \Delta_r \ldots$. In other words, for the $r$th timestamp $t_r$, the deviation is computed as follows:

$$\Delta_r = W_r - Y_r. \qquad (14.21)$$

3. Let the $d$ different components of $\Delta_r$ be denoted by $(\delta_{r1} \ldots \delta_{rd})$. These can be separated out into $d$ different univariate series of deviations along each dimension. The values of the $i$th series are denoted by $\delta_{1i} \ldots \delta_{ni}$. Let the mean and standard deviation of the $i$th series of deviations be denoted by $\mu_i$ and $\sigma_i$, respectively.




4. Compute the normalized deviations $\delta z_{ri}$ as follows:

$$\delta z_{ri} = \frac{\delta_{ri} - \mu_i}{\sigma_i}. \qquad (14.22)$$


The resulting deviation is essentially equal to the Z-value of a normal distribution. This approach provides a continuous alarm level of outlier scores for each of the $d$ dimensions of the multivariate time series. Unusual time instants can be detected by thresholding these scores. Because of the Z-value interpretation, an absolute threshold of 3 is usually considered sufficient.
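As a concrete illustration, the following is a minimal sketch of the four steps above, assuming NumPy arrays Y (observed values) and W (forecasts already produced by any model of Sect. 14.3); the forecasting step itself is not shown, and the function name is illustrative.

```python
import numpy as np

def normalized_deviation_scores(Y, W):
    """Steps 1-4: normalized deviation (Z-value) scores.

    Y : (n, d) array of observed values, one row per timestamp.
    W : (n, d) array of forecasted values (assumed given, e.g., from
        an autoregressive model).
    Returns an (n, d) array of scores delta_z (Eq. 14.22).
    """
    delta = W - Y                    # Eq. 14.21: deviation series
    mu = delta.mean(axis=0)          # mean of each univariate deviation series
    sigma = delta.std(axis=0)        # standard deviation of each series
    return (delta - mu) / sigma     # Eq. 14.22: normalized deviations

# Timestamps whose absolute score exceeds 3 along any dimension are
# flagged as unusual, per the Z-value interpretation:
# scores = normalized_deviation_scores(Y, W)
# unusual = np.where(np.abs(scores) > 3)
```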


In some cases, it is desirable to create a unified alarm level of deviation scores rather than a separate alarm level for each of the series. This problem is closely related to that of outlier ensemble analysis, which is discussed in Sect. 9.4 of Chap. 9. The unified alarm level $U_r$ at timestamp $r$ can be reported as the maximum of the scores across the different components of the multivariate series:

$$U_r = \max_{i \in \{1 \ldots d\}} \delta z_{ri}. \qquad (14.23)$$
The scores across different detectors can also be combined in other ways, such as by using the average or the squared aggregate over the different series.
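Continuing the earlier sketch (with scores being the (n, d) array of normalized deviations), these combination rules might look as follows; the average and squared-aggregate variants are assumed forms of the combinations just mentioned, not equations from the text.

```python
# Eq. 14.23: unified alarm level as the maximum across the d dimensions.
U_max = scores.max(axis=1)

# Alternative combinations (assumed variants): average and squared aggregate.
U_avg = scores.mean(axis=1)
U_sq = (scores ** 2).sum(axis=1)
```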


14.6.2 Shape Outliers


One of the earliest methods for finding shape-based outliers is the Hotsax approach. In this approach, outliers are defined on windows of the time series, and a k-nearest neighbor method is used to determine the outlier scores. Specifically, the Euclidean distance of a data point to its kth-nearest neighbor is used to define its outlier score.


The outlier analysis is performed over windows of length W. Therefore, the approach reports windows of unusual shapes from a time series of data points. The first step is to extract all windows of length W from the time series by using a sliding-window approach. The analysis is then performed over these newly created data objects. For each extracted window, its Euclidean distance to the other nonoverlapping windows is computed. The windows with the highest k-nearest neighbor distance values are reported as outliers. The reason for using nonoverlapping windows is to minimize the impact of trivial matches to overlapping windows. While a brute-force k-nearest neighbor approach can determine the outliers, its complexity scales with the square of the number of data points. Therefore, a pruning method is used to improve the efficiency. While this method improves the efficiency, it does not affect the final result reported by the method.
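A minimal sketch of this quadratic brute-force baseline, assuming a one-dimensional NumPy series (the function name and parameters are illustrative, not from the original):

```python
import numpy as np

def brute_force_shape_outliers(series, W, k, n_outliers):
    """Score each window of length W by the Euclidean distance to its
    k-th nearest nonoverlapping window; report the top-scoring windows.

    Complexity is quadratic in the number of windows, which is what the
    pruning method described next is designed to avoid.
    """
    m = len(series) - W + 1
    windows = np.array([series[i:i + W] for i in range(m)])
    scores = np.empty(m)
    for i in range(m):
        # Restrict to nonoverlapping windows to avoid trivial matches.
        mask = np.abs(np.arange(m) - i) >= W
        dists = np.linalg.norm(windows[mask] - windows[i], axis=1)
        scores[i] = np.sort(dists)[k - 1]  # k-th nearest neighbor distance
    # Starting offsets of the n_outliers windows with the largest scores.
    return np.argsort(-scores)[:n_outliers]
```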


The general principle of pruning for more efficient outlier detection in nearest neighbor methods was introduced in Sect. 8.5.1.2 of Chap. 8. The algorithm examines the candidate subsequences iteratively in an outer loop. For each such candidate subsequence, the k-nearest neighbors are computed progressively in an inner loop with distance computations to other subsequences. Each candidate subsequence is either included in the current set of best n outlier estimates at the end of an outer loop iteration, or discarded via early abandonment of the inner loop without computing the exact value of its k-nearest neighbor distance. The inner loop can be terminated early when the currently approximated k-nearest neighbor distance for that candidate subsequence is less than the score of the nth best outlier found so far. Clearly, such a subsequence cannot be an outlier. To obtain the best pruning results, the subsequences need to be heuristically ordered so that the earliest candidate subsequences examined in the outer loop have the greatest tendency to be outliers. The pruning is also most effective when the true outliers are found early. It remains to explain how the heuristic orderings required for good pruning are achieved.
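The nested-loop structure with early abandonment might be sketched as follows; outer_order and inner_order stand in for the heuristic orderings explained next, and the windows array is as in the brute-force sketch above.

```python
import numpy as np

def pruned_shape_outliers(windows, k, n_outliers, outer_order, inner_order):
    """Nested-loop pruning sketch: best holds the n best (score, index)
    pairs found so far, sorted in decreasing order of score."""
    W, best = windows.shape[1], []
    for i in outer_order:
        knn = []                       # k smallest distances seen so far
        pruned = False
        for j in inner_order(i):
            if abs(i - j) < W:         # skip trivially overlapping windows
                continue
            d = np.linalg.norm(windows[i] - windows[j])
            knn = sorted(knn + [d])[:k]
            # Early abandonment: the running k-NN distance only shrinks,
            # so once it drops below the n-th best outlier score, the
            # candidate cannot be an outlier.
            if len(knn) == k and len(best) == n_outliers \
                    and knn[-1] < best[-1][0]:
                pruned = True
                break
        if not pruned:                 # exact k-NN distance is now known
            best = sorted(best + [(knn[-1], i)], reverse=True)[:n_outliers]
    return best
```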


Pruning is facilitated by an approach that can measure the clustering behavior of the underlying subsequences. Clustering has a well-known complementary relationship with outlier analysis. Therefore, it is useful to examine first in the outer loop those subsequences that are members of clusters containing very few (or one) members. The SAX representation is used to create a simple mapping of the subsequences into clusters. Subsequences that map to the same SAX word are assumed to belong to a single cluster. The piecewise aggregate approximations of SAX are performed over windows of length w < W. Therefore, the length of a SAX word is W/w, and the number of distinct possibilities for a SAX word is small if W/w is small. These distinct words correspond to the different clusters, and multiple subsequences may map to the same cluster. The ordering of the candidates is therefore based on the number of data objects in the same cluster. Candidates in clusters with fewer objects are examined first because they are more likely to be outliers.
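A sketch of the SAX word mapping and the resulting outer-loop ordering, assuming W is divisible by w and a 3-symbol alphabet with the standard Gaussian breakpoints (the helper names are illustrative):

```python
from collections import Counter
import numpy as np

def sax_word(window, w):
    """Map a window of length W to a SAX word of length W // w using a
    3-symbol alphabet (breakpoints -0.43 and 0.43 of a standard normal)."""
    z = (window - window.mean()) / (window.std() + 1e-12)  # z-normalize
    paa = z.reshape(-1, w).mean(axis=1)  # piecewise aggregate approximation
    return tuple(np.searchsorted([-0.43, 0.43], paa))

def outer_ordering(windows, w):
    """Examine candidates from the rarest SAX cluster to the most common."""
    words = [sax_word(win, w) for win in windows]
    counts = Counter(words)
    order = sorted(range(len(windows)), key=lambda i: counts[words[i]])
    return order, words
```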


This cluster-based ordering is used to design an efficient pruning mechanism for outlier analysis. The candidates in the clusters are examined one by one in an outer loop. The k-nearest neighbor distances to these candidates are computed in an inner loop. For each candidate subsequence, those subsequences that map to the same word as the candidate may be considered first for computing the nearest neighbor distances in the inner loop. This provides quick and tight upper bounds on the nearest neighbor distances. As these distances are computed one by one, a tighter and tighter upper bound on the nearest neighbor distance is obtained over the progression of the inner loop. A candidate can be pruned when an upper bound on its nearest neighbor distance is guaranteed to be smaller (i.e., more similar) than the nth best outlier distance found so far. Therefore, for any given candidate series, it is not necessary to determine its exact nearest neighbor by comparing to all subsequences. Rather, early termination of the inner loop is often possible during the computation of the nearest neighbor distance. This forms the core of the pruning methodology of Hotsax, and is similar in principle to the nested-loop pruning methodology discussed in Sect. 8.5.1.2 of Chap. 8 on multidimensional outlier analysis. The main difference is in terms of how the SAX representation is used, both for ordering the candidates in the outer loop and for ordering the distance computations in the inner loop.
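Under the same assumptions, the inner-loop ordering that compares same-word subsequences first might be sketched as follows, tying the earlier pieces together:

```python
from collections import defaultdict

def make_inner_order(words):
    """Inner-loop ordering: subsequences sharing the candidate's SAX word
    are compared first, yielding tight early bounds on its k-NN distance."""
    buckets = defaultdict(list)
    for j, word in enumerate(words):
        buckets[word].append(j)
    def inner_order(i):
        same = [j for j in buckets[words[i]] if j != i]
        rest = [j for j in range(len(words)) if words[j] != words[i]]
        return same + rest
    return inner_order

# Putting the sketches together:
# order, words = outer_ordering(windows, w)
# best = pruned_shape_outliers(windows, k, n_outliers, order,
#                              make_inner_order(words))
```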



