Data Mining: The Textbook




14.2. TIME SERIES PREPARATION AND SIMILARITY


[Figure panels: IBM stock price (vertical axis, 165 to 200) plotted against the number of trading days (horizontal axis, 50 to 250). (a) Moving average smoothing: actual values, 20-day moving average, and 50-day moving average. (b) Exponential smoothing: actual values, exponential smoothing with α = 0.1, and exponential smoothing with α = 0.05.]

Figure 14.1: Various smoothing methods applied to IBM stock price from September 5, 2013 to September 4, 2014

Exponential Smoothing

In exponential smoothing, the smoothed value y'_i is defined as a linear combination of the current value y_i and the previously smoothed value y'_{i−1}. The smoothing parameter α ∈ (0, 1) is used for this purpose.

$$y'_i = \alpha \cdot y_i + (1 - \alpha) \cdot y'_{i-1} \tag{14.2}$$

The value of y'_0 is typically set to the first point in the series. When the value of α is 1, there are no smoothing effects, and the smoothed series is the same as the original series. When the value of α is 0, the entire series becomes smoothed to the constant value of y'_0. The approach is referred to as exponential smoothing because the value of y'_i can be expressed as an exponentially decayed sum of the series values. By recursively substituting the aforementioned equation into itself, the following can be shown:

$$y'_i = (1 - \alpha)^i \cdot y'_0 + \alpha \cdot \sum_{j=1}^{i} y_j \cdot (1 - \alpha)^{i-j} \tag{14.3}$$
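The recursive substitution can be written out step by step. The following expansion is a sketch that writes y'_i for the smoothed value and y_i for the original value, and assumes nothing beyond the recursive definition of exponential smoothing:

```latex
\begin{align*}
y'_i &= \alpha y_i + (1-\alpha)\, y'_{i-1} \\
     &= \alpha y_i + (1-\alpha)\bigl[\alpha y_{i-1} + (1-\alpha)\, y'_{i-2}\bigr] \\
     &= \alpha y_i + \alpha (1-\alpha)\, y_{i-1} + (1-\alpha)^2\, y'_{i-2} \\
     &\;\;\vdots \\
     &= (1-\alpha)^i\, y'_0 + \alpha \sum_{j=1}^{i} y_j\, (1-\alpha)^{i-j}
\end{align*}
```

Each substitution multiplies the remaining smoothed term by another factor of (1 − α), which is the source of the exponential decay.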

The choice of α regulates the decay factor: the weight of the series value y_j in the i-th smoothed value is α · (1 − α)^(i−j), so larger values of α discount older points more quickly. Unlike moving averages, exponential smoothing gives more weight to recent data points. Data points are not lost at the beginning of the series, and the impact of the lag is reduced for the same level of smoothing. Examples of moving average and exponential smoothing are illustrated in Fig. 14.1a, b, respectively. It is evident that exponential smoothing does not lose any points at the beginning of the series and generally provides slightly better smoothing for lower lag.
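The recursion and its closed form can be checked against each other numerically. The following Python sketch is illustrative only: the function names `exponential_smoothing` and `closed_form` are not from the text, and the initial smoothed value is assumed to be the first point of the series, as suggested above.

```python
def exponential_smoothing(series, alpha):
    """Recursive form: smoothed_i = alpha * y_i + (1 - alpha) * smoothed_{i-1}."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if not series:
        return []
    previous = series[0]  # assumption: initial smoothed value is the first point
    smoothed = []
    for value in series:
        previous = alpha * value + (1.0 - alpha) * previous
        smoothed.append(previous)
    return smoothed


def closed_form(series, alpha, i):
    """Closed form for the i-th smoothed value (i is 1-based)."""
    initial = series[0]  # same initialisation assumption as above
    total = (1.0 - alpha) ** i * initial
    for j in range(1, i + 1):
        # series[j - 1] holds y_j because Python lists are 0-based
        total += alpha * series[j - 1] * (1.0 - alpha) ** (i - j)
    return total


if __name__ == "__main__":
    prices = [10.0, 12.0, 11.0, 15.0]
    print(exponential_smoothing(prices, 0.5))
    print(closed_form(prices, 0.5, 4))
```

For the series [10, 12, 11, 15] with α = 0.5, the recursion yields [10.0, 11.0, 11.0, 13.0], and the closed form evaluates to 13.0 for i = 4. The output has the same length as the input, which reflects the point that no values are lost at the beginning of the series.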

14.2.3 Normalization

Time series typically need to be normalized, especially when multiple series are analyzed simultaneously. For example, one series might measure temperature, whereas another might measure pressure. Because these values are measured on different scales, they cannot be compared meaningfully. Therefore, two normalization methods are commonly used to adjust for such variations.

CHAPTER 14. MINING TIME SERIES DATA

Range-based normalization: In range-based normalization, the minimum and maximum values of the time series are determined. Let these values be denoted by min and max, respectively. Then, the time series value y_i is mapped to the new value y'_i in the range [0, 1] as follows:

$$y'_i = \frac{y_i - \mathit{min}}{\mathit{max} - \mathit{min}} \tag{14.4}$$

Standardization: In standardization, the mean and standard deviation of the series are used for normalization. This is essentially the Z-value of the time series. Let μ and σ represent the mean and standard deviation of the values in the time series. Then, the time series value y_i is mapped to a new value z_i as follows:

$$z_i = \frac{y_i - \mu}{\sigma} \tag{14.5}$$

Standardization is generally the preferred method. However, it does not guarantee a specific range of the time series values.
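Both normalization methods can be sketched in a few lines of Python. The function names `range_normalize` and `standardize` are illustrative. The text does not say whether σ is the population or the sample standard deviation, so the population form (division by n) is assumed here. Raising an error for a constant series is also a choice made for this sketch, since both formulas would otherwise divide by zero.

```python
import math


def range_normalize(series):
    """Map each value to (y - min) / (max - min)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("range-based normalization is undefined for a constant series")
    return [(y - lo) / (hi - lo) for y in series]


def standardize(series):
    """Map each value to its z-value (y - mu) / sigma."""
    n = len(series)
    mu = sum(series) / n
    # Assumption: population standard deviation (divide by n, not n - 1).
    sigma = math.sqrt(sum((y - mu) ** 2 for y in series) / n)
    if sigma == 0.0:
        raise ValueError("standardization is undefined for a constant series")
    return [(y - mu) / sigma for y in series]


if __name__ == "__main__":
    temperatures = [2.0, 4.0, 6.0]
    print(range_normalize(temperatures))
    print(standardize(temperatures))
```

For the series [2, 4, 6], range-based normalization gives [0.0, 0.5, 1.0]. The standardized values have zero mean and unit variance but are not confined to a fixed interval.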

14.2.4 Data Transformation and Reduction

A variety of preprocessing methods exist for transforming the time series data into a reduced representation. Some of these methods transform the data into a smaller number of numeric coefficients, whereas other methods transform the data into discrete values.

14.2.4.1 Discrete Wavelet Transform

The discrete wavelet transform (DWT) converts a time series to multidimensional data. While time series can also be considered as multidimensional data by viewing¹ the values at the different timestamps as dimensions, the values in successive timestamps are highly related to one another. A direct application of multidimensional methods ignores the temporal continuity in data values. In wavelets, the coefficients describe properties of different contiguous temporal regions of the series. Each coefficient is equal to half the difference in the average value of the behavioral attribute between a pair of carefully chosen contiguous segments of the series. The resulting representation can be more easily analyzed like multidimensional data because temporal locality is already incorporated within the coefficients. By using only the largest coefficients for representation, it is possible to reconstruct the entire time series accurately. Typically, the number of retained coefficients is much smaller than the length of the original time series. Thus, the approach is a dimensionality reduction method as well. DWT is described in detail in Sect. 2.4.4.1 of Chap. 2.
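The averaging-and-differencing idea described above can be illustrated with a rough Python sketch. It assumes a Haar-style decomposition, a series whose length is a power of two, raw (unnormalized) coefficients, and the sign convention "first half minus second half"; the full construction is the one in Sect. 2.4.4.1. The function names `haar_decompose`, `haar_reconstruct`, and `drop_small` are hypothetical.

```python
def haar_decompose(series):
    """Return (global_average, levels); levels[0] holds the finest coefficients."""
    n = len(series)
    if n == 0 or n & (n - 1):
        raise ValueError("series length must be a power of two")
    averages = list(series)
    levels = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        # Coefficient: half the difference between the two segment averages.
        levels.append([(a - b) / 2.0 for a, b in pairs])
        # Average of the merged segment, passed on to the next coarser level.
        averages = [(a + b) / 2.0 for a, b in pairs]
    return averages[0], levels


def haar_reconstruct(global_average, levels):
    """Invert haar_decompose: split each average into (avg + c, avg - c)."""
    averages = [global_average]
    for coefficients in reversed(levels):
        refined = []
        for avg, c in zip(averages, coefficients):
            refined.extend((avg + c, avg - c))
        averages = refined
    return averages


def drop_small(levels, threshold):
    """Zero out coefficients below the threshold (raw magnitude, a simplification)."""
    return [[c if abs(c) >= threshold else 0.0 for c in level] for level in levels]


if __name__ == "__main__":
    series = [8.0, 6.0, 2.0, 3.0, 4.0, 6.0, 6.0, 5.0]
    average, levels = haar_decompose(series)
    print(average, levels)
    print(haar_reconstruct(average, drop_small(levels, 1.0)))
```

For the series [8, 6, 2, 3, 4, 6, 6, 5], the sketch yields a global average of 5.0 and the coefficient levels [[1.0, -0.5, -1.0, 0.5], [2.25, -0.25], [-0.25]]. Reconstructing from all coefficients recovers the series exactly, while reconstructing after `drop_small` gives an approximation built from fewer nonzero values.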

14.2.4.2 Discrete Fourier Transform

Wavelets are most effective when most of the variations in the series can be captured in specific local regions of the series. In cases where the series contain global periodicity, the discrete Fourier transform (DFT) is more effective. Examples of scenarios in which either of these methods would perform well are provided in Fig. 14.2.

[Figure: two example series plotted as value (vertical axis, −1 to 6) against time index (horizontal axis, 10 to 100). One series is decomposable into local variations and is good for the discrete wavelet transform; the other is decomposable into periodic variations and is good for the discrete Fourier transform.]

Figure 14.2: Preferred scenarios for DFT and DWT

¹ The concept of “dimension” can be defined in two ways for time series data. Each behavioral attribute in a multivariate series can be viewed as a dimension. Alternatively, the different values in a univariate time series can be viewed as dimensions. The usage is often dependent on the semantics of the application at hand.

The basic idea is that any series of length n can be expressed as a linear combination of smooth periodic sinusoidal series. Along with a single constant term, the n − 1 sinusoidal series have periodicity drawn from n, n/2, n/3, ..., n/(n − 1). The data can be reduced using this decomposition because only a small number of these constituent series have large enough contributions to be included. Consider a time series x₀ ... x_{n−1}. Each coefficient X_k of the Fourier transform is a complex value which is defined as follows:
