Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	284/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 280 281 282 283 284 285 286 287 ... 423

1-Data Mining tarjima

n−1	n−1
n−1
X_k =	x_r · e^−irωk =	x_r · cos(rωk) − ix_r · sin(rωk) ∀k ∈ {0 . . . n − 1}. (14.6)
r=0	r=0	r=0

Here, ω is set to 2_nπ radians, and the notation i denotes the imaginary number ^√−1. There-fore, X_k is a complex value. One property of the Fourier coeﬃcients is that X_n−k can be derived from X_k by flipping the sign of the imaginary part for k ≥ 1 (see Exercise 7). Therefore, only the first n/2 complex coeﬃcients need to be retained. Furthermore, only the coeﬃcients X_k = a_k + ib_k with large energy a ²_k + b²_k need to be retained. The top-m retained coeﬃcients (together with their index k) can be used to approximate the time series in a compact way. Both the real and imaginary parts of the coeﬃcients can be stored in a real-valued vector data structure. This vector provides the reduced representation of the series. The original series can be reconstructed from the coeﬃcients as follows:

	₁n−1			₁n−1		n−1
x_r =			X_k · e^irωk =				X_k · cos(rωk) + iX_k · sin(rωk)∀r ∈ {0 . . . n − 1}.
	n			n
		k=0			k=0		k=0

Note that each X_k is a complex value. However, the imaginary part of the right-hand side of this equation will always evaluate to zero to yield the real series value x_r.

DFT has several properties, which make it useful for data mining applications. It satisfies the additivity property; the Fourier coeﬃcients of the sum (or diﬀerence) of two series can be obtained as the sum (or diﬀerence) of their Fourier coeﬃcients. It also satisfies Parseval’s theorem, which states that if X_k = a_k + ib_k is the kth Fourier coeﬃcient, then we have

ⁿ⁻¹_x2	=		1	n−1	(a²	+ b2 ). Because of these properties, one can compute the (scaled)
		n
r=0 r				k=0	k	k

Euclidean distance between two time series by computing the Euclidean distance between their Fourier coeﬃcients. Like DWT, DFT can also be viewed as the transformation of the time series to a new (rotated) orthogonal basis system, except that each basis vector B_k =

464 CHAPTER 14. MINING TIME SERIES DATA

[1, e^iωk, e²^iωk, . . . , e⁽ⁿ⁻¹⁾^iωk] of the Fourier coeﬃcient X_k is a complex vector. Therefore, the time series may be decomposed in terms of the mutually orthogonal basis vectors B₀ . . . B_n−₁ as follows:

(x₀ . . . x_n−₁) =	₁n−1		(14.7)
	n	X_kB_k

k=0

Typically, oﬀ-the-shelf mathematical packages are available to compute the coeﬃcients with the use of the fast Fourier transform (FFT). A closely related transform, known as the discrete cosine transform (DCT), provides even better compression.

14.2.4.3 Symbolic Aggregate Approximation (SAX)

This approach converts a time series to discrete sequence data. The basic idea is to determine piecewise aggregate approximates by averaging behavioral attribute values over successive and equally-spaced windows of the time series. The resulting continuous values are then discretized into a small number of discrete values. Depending on the application, the number of breakpoints may vary between 3 and 10. The approach selects the break points of the discretization, so that each of the symbolic values has an approximately equal frequency of representation. One possibility is to use equi-depth discretization of the continuous values, though this can be impractical or infeasible for long series or streaming series. For long series or streaming series, a Gaussian distribution assumption of the resulting averages is used to determine the discretization breakpoints. The idea is to select points on the Gaussian curve, so that the area between successive breakpoints is equal, and therefore the diﬀerent symbols have approximately the same frequency.

14.2.5 Time Series Similarity Measures

Time series similarity measures are typically designed with application-specific goals in mind. The most common methods for time series similarity computation are Euclidean distance and dynamic time warping (DTW ). The Euclidean distance is defined in an iden-tical way to multidimensional data where the behavioral attribute values at the diﬀerent timestamps are interpreted as dimensions. The Euclidean distance can be used only when the two series have the same length, and a one-to-one correspondence exists between the data points. This is not appropriate in unsynchronized time series where the data may be generated at diﬀerent rates over diﬀerent portions of the time series. The DTW method stretches and shrinks the time dimension diﬀerently in diﬀerent portions of one of the series to create an optimal matching. As discussed in Sect. 16.3.4.1 of Chap. 16, DTW can also be extended to multivariate time series such as trajectory data. Two other similarity/distance functions include the Edit Distance and the Longest Common Subsequence. These mea-sures are used more commonly for discrete sequences, rather than continuous time series. All these measures are described in detail in Sect. 3.4.1 of Chap. 3.

14.3 Time Series Forecasting

Forecasting is one of the most common applications of time series analysis. The prediction of future trends has applications in retail sales, economic indicators, weather forecasting, stock markets, and many other application scenarios. In this case, we have one or more series of data values, and it is desirable to predict the future values of the series using the history of previous values.

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 280 281 282 283 284 285 286 287 ... 423