Data Mining: The Textbook

series X be the PAA of X = (x₁ . . . x_n), and Y be the PAA of Y = (y₁ . . . y_n), then it can

be shown that:

Dist(X, Y ) ≥ m · Dist(X , Y )

Yüklə 17,13 Mb.

səhifə	290/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 286 287 288 289 290 291 292 293 ... 423

1-Data Mining tarjima

14.4. TIME SERIES MOTIFS

475

The proof of this result is as follows. Consider the time series Z = X − Y . Over any window of m data points, the second moment of elements of Z in that window, is at least2 equal to m times the square of the mean of the same elements. Other faster methods for approximation exist, such as the use of the SAX representation. When the SAX representation is used, a table of precomputed distances can be maintained for all pairs of discrete values, and a simple table lookup is required for lower bounding. Furthermore, some other time series distance functions such as dynamic time warping can also be bounded from below. The bibliographic notes contain pointers to some of these bounds. Many variations of the basic approach are possible by adding another layer of nesting, which accounts for variations in the window size.
14.4.2 Transformation to Sequential Pattern Mining

A particularly convenient method for discovering motifs in time series is to transform the problem to the sequential pattern mining problem. The setting for this case is somewhat diﬀerent, where a database of N series is available, and it is desired to determine all frequent motifs at a specified minimum support level. Since motif (pattern) mining is more naturally defined in the discrete case, this transformation facilitates the use of a wide variety of tools available for the discrete scenario. Furthermore, such an approach can also enable the discovery of noncontiguous patterns in the time series. This is because the subsequences in sequential pattern mining are allowed to be noncontiguous.

The first step is to convert the time series into discrete sequences, by discretizing the behavioral attribute value at each timestamp into categorical values. It is possible to combine discretization with binning to create a robust sequence representation. It should be pointed out that there are many diﬀerent ways of converting a time series to discrete sequences, depending on application-specific goals. For example, the discretization of the diﬀerence of the behavioral attribute values between successive timestamps is equivalent to using discretized wavelet coeﬃcients of the most detailed level of granularity. Lower order wavelet coeﬃcients will provide insights into trends over larger segments of the time series. Thus, it is even possible to perform multiresolution motif analysis by using discretized wavelet coeﬃcients of diﬀerent orders, and creating separate base sequences for wavelets of each order. In general, the approach for converting time series to discrete sequences will heavily influence the nature of the motifs found.

For all these methods, the final result of the discretization is a sequence of discrete values for each of the N time series in the database. After this new database of sequences has been constructed, any sequential pattern mining algorithm can be applied. The GSP algorithm is described in Sect. 15.2 of Chap. 15. It is important to note that the algorithms in Chap. 15 allow gaps between successive elements of the sequence. However, these algorithms can be trivially generalized to the contiguous case, by adding a maximum gap constraint in the sequential pattern mining algorithm. Constrained sequential pattern mining algorithms are briefly discussed in Sect. 15.2.2 of Chap. 15. It should be pointed out that the diﬀerent constraints discussed in Sect. 15.2.2 correspond to diﬀerent kinds of motifs. Because of the wide variation in the kinds of motifs that can be found by varying either the discretization approach or the sequential pattern mining approach, this methodology is very flexible, and it can be tailored to many diﬀerent application scenarios.

The mean of the squares is always no less than the square of the mean for any set of numeric elements. The diﬀerence between the two is equal to the variance, which is always nonnegative.

476 CHAPTER 14. MINING TIME SERIES DATA

14.4.3 Periodic Patterns

Just as DWT is used for discovering local patterns in a time series, DFT is often used for discovering periodic patterns. Recall from Sect. 14.2.4.2 that the r th component of a time series x₀ . . . x_n−₁ can be expressed in terms of n complex Fourier coeﬃcients X₀ . . . X_n−₁ as follows:

₁n−1

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 286 287 288 289 290 291 292 293 ... 423