Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə284/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   280   281   282   283   284   285   286   287   ...   423
1-Data Mining tarjima

n−1

n−1

n−1

Xk =

xr · e−irωk =

xr · cos(rωk) − ixr · sin(rωk) ∀k ∈ {0 . . . n − 1}. (14.6)

r=0

r=0

r=0

Here, ω is set to 2nπ radians, and the notation i denotes the imaginary number 1. There-fore, Xk is a complex value. One property of the Fourier coefficients is that Xn−k can be derived from Xk by flipping the sign of the imaginary part for k ≥ 1 (see Exercise 7). Therefore, only the first n/2 complex coefficients need to be retained. Furthermore, only the coefficients Xk = ak + ibk with large energy a 2k + b2k need to be retained. The top-m retained coefficients (together with their index k) can be used to approximate the time series in a compact way. Both the real and imaginary parts of the coefficients can be stored in a real-valued vector data structure. This vector provides the reduced representation of the series. The original series can be reconstructed from the coefficients as follows:








1 n−1




1 n−1

n−1




xr =







Xk · eirωk =







Xk · cos(rωk) + iXk · sin(rωk)∀r ∈ {0 . . . n − 1}.




n

n










k=0







k=0

k=0




Note that each Xk is a complex value. However, the imaginary part of the right-hand side of this equation will always evaluate to zero to yield the real series value xr.




DFT has several properties, which make it useful for data mining applications. It satisfies the additivity property; the Fourier coefficients of the sum (or difference) of two series can be obtained as the sum (or difference) of their Fourier coefficients. It also satisfies Parseval’s theorem, which states that if Xk = ak + ibk is the kth Fourier coefficient, then we have

n1 x2

=




1

n−1

(a2

+ b2 ). Because of these properties, one can compute the (scaled)




n




r=0 r




k=0

k

k




Euclidean distance between two time series by computing the Euclidean distance between their Fourier coefficients. Like DWT, DFT can also be viewed as the transformation of the time series to a new (rotated) orthogonal basis system, except that each basis vector Bk =

464 CHAPTER 14. MINING TIME SERIES DATA


[1, eiωk, e2iωk, . . . , e(n−1)iωk] of the Fourier coefficient Xk is a complex vector. Therefore, the time series may be decomposed in terms of the mutually orthogonal basis vectors B0 . . . Bn−1 as follows:



(x0 . . . xn−1) =

1 n−1

(14.7)




n

XkBk






k=0

Typically, off-the-shelf mathematical packages are available to compute the coefficients with the use of the fast Fourier transform (FFT). A closely related transform, known as the discrete cosine transform (DCT), provides even better compression.


14.2.4.3 Symbolic Aggregate Approximation (SAX)


This approach converts a time series to discrete sequence data. The basic idea is to determine piecewise aggregate approximates by averaging behavioral attribute values over successive and equally-spaced windows of the time series. The resulting continuous values are then discretized into a small number of discrete values. Depending on the application, the number of breakpoints may vary between 3 and 10. The approach selects the break points of the discretization, so that each of the symbolic values has an approximately equal frequency of representation. One possibility is to use equi-depth discretization of the continuous values, though this can be impractical or infeasible for long series or streaming series. For long series or streaming series, a Gaussian distribution assumption of the resulting averages is used to determine the discretization breakpoints. The idea is to select points on the Gaussian curve, so that the area between successive breakpoints is equal, and therefore the different symbols have approximately the same frequency.


14.2.5 Time Series Similarity Measures


Time series similarity measures are typically designed with application-specific goals in mind. The most common methods for time series similarity computation are Euclidean distance and dynamic time warping (DTW ). The Euclidean distance is defined in an iden-tical way to multidimensional data where the behavioral attribute values at the different timestamps are interpreted as dimensions. The Euclidean distance can be used only when the two series have the same length, and a one-to-one correspondence exists between the data points. This is not appropriate in unsynchronized time series where the data may be generated at different rates over different portions of the time series. The DTW method stretches and shrinks the time dimension differently in different portions of one of the series to create an optimal matching. As discussed in Sect. 16.3.4.1 of Chap. 16, DTW can also be extended to multivariate time series such as trajectory data. Two other similarity/distance functions include the Edit Distance and the Longest Common Subsequence. These mea-sures are used more commonly for discrete sequences, rather than continuous time series. All these measures are described in detail in Sect. 3.4.1 of Chap. 3.


14.3 Time Series Forecasting

Forecasting is one of the most common applications of time series analysis. The prediction of future trends has applications in retail sales, economic indicators, weather forecasting, stock markets, and many other application scenarios. In this case, we have one or more series of data values, and it is desirable to predict the future values of the series using the history of previous values.




Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   280   281   282   283   284   285   286   287   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin