Data Mining: The Textbook




14.6. TIME SERIES OUTLIER DETECTION





1. Determine the forecasted values of the time series at each timestamp. Depending on the nature of the underlying series, any of the univariate or multivariate methodologies discussed in Sect. 14.3 may be used. Let the forecasted value at the $r$th timestamp $t_r$ be denoted by $W_r$.




2. Compute the (possibly multivariate) time series of deviations $\Delta_1 \ldots \Delta_r \ldots$. In other words, for the $r$th timestamp $t_r$, the deviation is computed as follows:

$$\Delta_r = W_r - Y_r. \qquad (14.21)$$

3. Let the $d$ different components of $\Delta_r$ be denoted by $(\delta_{r1} \ldots \delta_{rd})$. These can be separated out into $d$ different univariate series of deviations along each dimension. The values of the $i$th series are denoted by $\delta_{1i} \ldots \delta_{ni}$. Let the mean and standard deviation of the $i$th series of deviations be denoted by $\mu_i$ and $\sigma_i$, respectively.




4. Compute the normalized deviations $\delta z_{ri}$ as follows:

$$\delta z_{ri} = \frac{\delta_{ri} - \mu_i}{\sigma_i}. \qquad (14.22)$$


The resulting deviation is essentially equal to the Z-value of a normal distribution. This approach provides a continuous alarm level of outlier scores for each of the $d$ dimensions of the multivariate time series. Unusual time instants can be detected by thresholding these scores. Because of the Z-value interpretation, an absolute threshold of 3 is usually considered sufficient.
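As a concrete illustration, the following is a minimal sketch of the four steps above, assuming NumPy arrays Y (observed values) and W (forecasts already produced by any model of Sect. 14.3); the forecasting step itself is not shown, and the function name is illustrative.

```python
import numpy as np

def normalized_deviation_scores(Y, W):
    """Steps 1-4: normalized deviation (Z-value) scores.

    Y : (n, d) array of observed values, one row per timestamp.
    W : (n, d) array of forecasted values (assumed given, e.g., from
        an autoregressive model).
    Returns an (n, d) array of scores delta_z (Eq. 14.22).
    """
    delta = W - Y                    # Eq. 14.21: deviation series
    mu = delta.mean(axis=0)          # mean of each univariate deviation series
    sigma = delta.std(axis=0)        # standard deviation of each series
    return (delta - mu) / sigma     # Eq. 14.22: normalized deviations

# Timestamps whose absolute score exceeds 3 along any dimension are
# flagged as unusual, per the Z-value interpretation:
# scores = normalized_deviation_scores(Y, W)
# unusual = np.where(np.abs(scores) > 3)
```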


In some cases, it is desirable to create a unified alarm level of deviation scores rather than a separate alarm level for each of the series. This problem is closely related to that of outlier ensemble analysis, which is discussed in Sect. 9.4 of Chap. 9. The unified alarm level $U_r$ at timestamp $r$ can be reported as the maximum of the scores across the different components of the multivariate series:

$$U_r = \max_{i \in \{1 \ldots d\}} \delta z_{ri}. \qquad (14.23)$$
The scores across different detectors can also be combined in other ways, such as by using the average or the squared aggregate over the different series.
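Continuing the earlier sketch (with scores being the (n, d) array of normalized deviations), these combination rules might look as follows; the average and squared-aggregate variants are assumed forms of the combinations just mentioned, not equations from the text.

```python
# Eq. 14.23: unified alarm level as the maximum across the d dimensions.
U_max = scores.max(axis=1)

# Alternative combinations (assumed variants): average and squared aggregate.
U_avg = scores.mean(axis=1)
U_sq = (scores ** 2).sum(axis=1)
```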


14.6.2 Shape Outliers


One of the earliest methods for finding shape-based outliers is the Hotsax approach. In this approach, outliers are defined on windows of the time series, and a k-nearest neighbor method is used to determine the outlier scores. Specifically, the Euclidean distance of a data point to its kth-nearest neighbor is used to define its outlier score.


The outlier analysis is performed over windows of length W. Therefore, the approach reports windows of unusual shapes from a time series of data points. The first step is to extract all windows of length W from the time series by using a sliding-window approach. The analysis is then performed over these newly created data objects. For each extracted window, its Euclidean distance to the other nonoverlapping windows is computed. The windows with the highest k-nearest neighbor distance values are reported as outliers. The reason for using nonoverlapping windows is to minimize the impact of trivial matches to overlapping windows. While a brute-force k-nearest neighbor approach can determine the outliers, its complexity scales with the square of the number of data points. Therefore, a pruning method is used to improve the efficiency. While this method improves the efficiency, it does not affect the final result reported by the method.
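A minimal sketch of this quadratic brute-force baseline, assuming a one-dimensional NumPy series (the function name and parameters are illustrative, not from the original):

```python
import numpy as np

def brute_force_shape_outliers(series, W, k, n_outliers):
    """Score each window of length W by the Euclidean distance to its
    k-th nearest nonoverlapping window; report the top-scoring windows.

    Complexity is quadratic in the number of windows, which is what the
    pruning method described next is designed to avoid.
    """
    m = len(series) - W + 1
    windows = np.array([series[i:i + W] for i in range(m)])
    scores = np.empty(m)
    for i in range(m):
        # Restrict to nonoverlapping windows to avoid trivial matches.
        mask = np.abs(np.arange(m) - i) >= W
        dists = np.linalg.norm(windows[mask] - windows[i], axis=1)
        scores[i] = np.sort(dists)[k - 1]  # k-th nearest neighbor distance
    # Starting offsets of the n_outliers windows with the largest scores.
    return np.argsort(-scores)[:n_outliers]
```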


The general principle of pruning for more efficient outlier detection in nearest neighbor methods was introduced in Sect. 8.5.1.2 of Chap. 8. The algorithm examines the candidate subsequences iteratively in an outer loop. For each such candidate subsequence, the k-nearest neighbors are computed progressively in an inner loop with distance computations to other subsequences. Each candidate subsequence is either included in the current set of best n outlier estimates at the end of an outer loop iteration, or discarded via early abandonment of the inner loop without computing the exact value of its k-nearest neighbor distance. The inner loop can be terminated early when the currently approximated k-nearest neighbor distance for that candidate subsequence is less than the score of the nth best outlier found so far. Clearly, such a subsequence cannot be an outlier. To obtain the best pruning results, the subsequences need to be heuristically ordered so that the earliest candidate subsequences examined in the outer loop have the greatest tendency to be outliers. The pruning is also most effective when the true outliers are found early. It remains to explain how the heuristic orderings required for good pruning are achieved.
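The nested-loop structure with early abandonment might be sketched as follows; outer_order and inner_order stand in for the heuristic orderings explained next, and the windows array is as in the brute-force sketch above.

```python
import numpy as np

def pruned_shape_outliers(windows, k, n_outliers, outer_order, inner_order):
    """Nested-loop pruning sketch: best holds the n best (score, index)
    pairs found so far, sorted in decreasing order of score."""
    W, best = windows.shape[1], []
    for i in outer_order:
        knn = []                       # k smallest distances seen so far
        pruned = False
        for j in inner_order(i):
            if abs(i - j) < W:         # skip trivially overlapping windows
                continue
            d = np.linalg.norm(windows[i] - windows[j])
            knn = sorted(knn + [d])[:k]
            # Early abandonment: the running k-NN distance only shrinks,
            # so once it drops below the n-th best outlier score, the
            # candidate cannot be an outlier.
            if len(knn) == k and len(best) == n_outliers \
                    and knn[-1] < best[-1][0]:
                pruned = True
                break
        if not pruned:                 # exact k-NN distance is now known
            best = sorted(best + [(knn[-1], i)], reverse=True)[:n_outliers]
    return best
```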


Pruning is facilitated by an approach that can measure the clustering behavior of the underlying subsequences. Clustering has a well-known complementary relationship with outlier analysis. Therefore, it is useful to examine first in the outer loop those subsequences that are members of clusters containing very few (or one) members. The SAX representation is used to create a simple mapping of the subsequences into clusters. Subsequences that map to the same SAX word are assumed to belong to a single cluster. The piecewise aggregate approximations of SAX are performed over windows of length w < W. Therefore, the length of a SAX word is W/w, and the number of distinct possibilities for a SAX word is small if W/w is small. These distinct words correspond to the different clusters, and multiple subsequences may map to the same cluster. The ordering of the candidates is therefore based on the number of data objects in the same cluster. Candidates in clusters with fewer objects are examined first because they are more likely to be outliers.
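A sketch of the SAX word mapping and the resulting outer-loop ordering, assuming W is divisible by w and a 3-symbol alphabet with the standard Gaussian breakpoints (the helper names are illustrative):

```python
from collections import Counter
import numpy as np

def sax_word(window, w):
    """Map a window of length W to a SAX word of length W // w using a
    3-symbol alphabet (breakpoints -0.43 and 0.43 of a standard normal)."""
    z = (window - window.mean()) / (window.std() + 1e-12)  # z-normalize
    paa = z.reshape(-1, w).mean(axis=1)  # piecewise aggregate approximation
    return tuple(np.searchsorted([-0.43, 0.43], paa))

def outer_ordering(windows, w):
    """Examine candidates from the rarest SAX cluster to the most common."""
    words = [sax_word(win, w) for win in windows]
    counts = Counter(words)
    order = sorted(range(len(windows)), key=lambda i: counts[words[i]])
    return order, words
```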


This cluster-based ordering is used to design an efficient pruning mechanism for outlier analysis. The candidates in the clusters are examined one by one in an outer loop. The k-nearest neighbor distances to these candidates are computed in an inner loop. For each candidate subsequence, those subsequences that map to the same word as the candidate may be considered first for computing the nearest neighbor distances in the inner loop. This provides quick and tight upper bounds on the nearest neighbor distances. As these distances are computed one by one, a tighter and tighter upper bound on the nearest neighbor distance is obtained over the progression of the inner loop. A candidate can be pruned when an upper bound on its nearest neighbor distance is guaranteed to be smaller (i.e., more similar) than the nth best outlier distance found so far. Therefore, for any given candidate series, it is not necessary to determine its exact nearest neighbor by comparing to all subsequences. Rather, early termination of the inner loop is often possible during the computation of the nearest neighbor distance. This forms the core of the pruning methodology of Hotsax, and is similar in principle to the nested-loop pruning methodology discussed in Sect. 8.5.1.2 of Chap. 8 on multidimensional outlier analysis. The main difference is in terms of how the SAX representation is used, both for ordering the candidates in the outer loop and for ordering the distance computations in the inner loop.
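Under the same assumptions, the inner-loop ordering that compares same-word subsequences first might be sketched as follows, tying the earlier pieces together:

```python
from collections import defaultdict

def make_inner_order(words):
    """Inner-loop ordering: subsequences sharing the candidate's SAX word
    are compared first, yielding tight early bounds on its k-NN distance."""
    buckets = defaultdict(list)
    for j, word in enumerate(words):
        buckets[word].append(j)
    def inner_order(i):
        same = [j for j in buckets[words[i]] if j != i]
        rest = [j for j in range(len(words)) if words[j] != words[i]]
        return same + rest
    return inner_order

# Putting the sketches together:
# order, words = outer_ordering(windows, w)
# best = pruned_shape_outliers(windows, k, n_outliers, order,
#                              make_inner_order(words))
```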



