Data Mining: The Textbook




14.2. TIME SERIES PREPARATION AND SIMILARITY









[Figure 14.1: Various smoothing methods applied to IBM stock price from September 5, 2013 to September 4, 2014. Panel (a), moving average smoothing: actual values with 20-day and 50-day moving averages. Panel (b), exponential smoothing: actual values with exponential smoothing at α = 0.1 and α = 0.05. Both panels plot IBM stock price against the number of trading days.]


Exponential Smoothing


In exponential smoothing, the smoothed value $y'_i$ is defined as a linear combination of the current value $y_i$ and the previously smoothed value $y'_{i-1}$. The smoothing parameter $\alpha \in (0, 1)$ is used for this purpose.

$$y'_i = \alpha \cdot y_i + (1 - \alpha) \cdot y'_{i-1} \tag{14.2}$$

The value of $y'_0$ is typically set to the first point in the series. When the value of $\alpha$ is 1, there are no smoothing effects, and the smoothed series is the same as the original series. When the value of $\alpha$ is 0, the entire series becomes smoothed to the constant value of $y'_0$. The approach is referred to as exponential smoothing because the value of $y'_i$ can be expressed as an exponentially decayed sum of the series values. By recursively substituting the aforementioned equation into itself, the following can be shown:

$$y'_i = (1 - \alpha)^i \cdot y'_0 + \alpha \sum_{j=1}^{i} y_j \cdot (1 - \alpha)^{i-j} \tag{14.3}$$
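As a check on the recursive substitution, the first few steps can be written out by hand. The expansion below is only an illustrative sketch; it writes $y'_i$ for the smoothed value and $y_i$ for the original value, and each line agrees with the closed form in Eq. 14.3 for $i = 1, 2, 3$.

```latex
\begin{align*}
y'_1 &= \alpha y_1 + (1-\alpha)\, y'_0 \\
y'_2 &= \alpha y_2 + (1-\alpha)\, y'_1
      = (1-\alpha)^2 y'_0 + \alpha \left[ (1-\alpha)\, y_1 + y_2 \right] \\
y'_3 &= \alpha y_3 + (1-\alpha)\, y'_2
      = (1-\alpha)^3 y'_0 + \alpha \left[ (1-\alpha)^2 y_1 + (1-\alpha)\, y_2 + y_3 \right]
\end{align*}
```

Each additional step multiplies every earlier term by one more factor of $(1-\alpha)$, which is where the exponential decay comes from.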




The choice of $\alpha$ regulates the decay factor: a value that lies $k$ steps in the past is weighted in proportion to $(1 - \alpha)^k$, so larger values of $\alpha$ discount older points more quickly. Unlike moving averages, exponential smoothing gives more weight to recent data points. No data points are lost at the beginning of the series, and the lag is smaller for the same level of smoothing. Examples of moving average and exponential smoothing are illustrated in Fig. 14.1a and b, respectively. It is evident that exponential smoothing does not lose any points at the beginning of the series and generally provides slightly better smoothing for a given amount of lag.
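The recurrence of Eq. 14.2 and the closed form of Eq. 14.3 can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names and the sample prices are made up for the example, and the boundary values of α (0 and 1) are accepted only so that the two limiting cases described above can be checked.

```python
def exponential_smoothing(series, alpha):
    """Return the smoothed series, with the first smoothed value set to series[0]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if not series:
        return []
    smoothed = [series[0]]  # y'_0 is set to the first point in the series
    for y in series[1:]:
        # Eq. 14.2: linear combination of current value and previous smoothed value
        smoothed.append(alpha * y + (1.0 - alpha) * smoothed[-1])
    return smoothed


def closed_form(series, alpha, i):
    """Evaluate Eq. 14.3 directly for index i, for comparison with the recurrence."""
    decayed_start = (1.0 - alpha) ** i * series[0]
    decayed_sum = sum(series[j] * (1.0 - alpha) ** (i - j) for j in range(1, i + 1))
    return decayed_start + alpha * decayed_sum


# Made-up values for illustration only; these are not the IBM prices of Fig. 14.1.
prices = [190.0, 192.0, 189.0, 185.0, 188.0]
smooth = exponential_smoothing(prices, alpha=0.1)
```

Calling `exponential_smoothing(prices, 1.0)` returns the original series, and `exponential_smoothing(prices, 0.0)` returns a constant series equal to the first value, matching the two limiting cases described in the text.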


14.2.3 Normalization


Time series typically need to be normalized, especially when multiple series are analyzed simultaneously. For example, one series might measure temperature, whereas another might measure pressure. Because these values are measured on different scales, they cannot be compared meaningfully. Therefore, two normalization methods are commonly used to adjust for such variations.



462 CHAPTER 14. MINING TIME SERIES DATA



  1. Range-based normalization: In range-based normalization, the minimum and maximum values of the time series are determined. Let these values be denoted by min and max, respectively. Then, the time series value $y_i$ is mapped to the new value $y'_i$ in the range [0, 1] as follows:

$$y'_i = \frac{y_i - \min}{\max - \min} \tag{14.4}$$







  2. Standardization: In standardization, the mean and standard deviation of the series are used for normalization. This is essentially the Z-value of the time series. Let $\mu$ and $\sigma$ represent the mean and standard deviation of the values in the time series. Then, the time series value $y_i$ is mapped to a new value $z_i$ as follows:

$$z_i = \frac{y_i - \mu}{\sigma} \tag{14.5}$$







Standardization is generally the preferred method. However, it does not guarantee a specific range of the time series values.
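Both normalization methods are short enough to sketch directly. The code below is an illustration under stated assumptions: the function names and the sample temperature readings are hypothetical, and the population standard deviation (`statistics.pstdev`) is used because the text does not say whether the population or sample form is intended.

```python
from statistics import mean, pstdev


def range_normalize(series):
    """Map each value into [0, 1] using the series minimum and maximum (Eq. 14.4)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("range-based normalization is undefined for a constant series")
    return [(y - lo) / (hi - lo) for y in series]


def standardize(series):
    """Map each value to its Z-value using the series mean and deviation (Eq. 14.5)."""
    mu = mean(series)
    sigma = pstdev(series)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("standardization is undefined for a constant series")
    return [(y - mu) / sigma for y in series]


# Made-up temperature readings, for illustration only.
temperature = [20.0, 22.0, 24.0, 26.0, 28.0]
scaled = range_normalize(temperature)   # [0.0, 0.25, 0.5, 0.75, 1.0]
z_values = standardize(temperature)     # zero mean, unit deviation, no fixed range
```

The standardized values have zero mean and unit standard deviation, but their minimum and maximum depend on the data, which is the lack of a guaranteed range noted above.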


14.2.4 Data Transformation and Reduction


A variety of preprocessing methods exist for transforming and reducing the time series data into a reduced representation. Some of these methods transform the data into a smaller number of numeric coefficients, whereas other methods transform the data into discrete values.


14.2.4.1 Discrete Wavelet Transform

The discrete wavelet transform (DWT) converts a time series to multidimensional data. While a time series can also be considered as multidimensional data by viewing¹ the values at the different timestamps as dimensions, the values in successive timestamps are highly related to one another. A direct application of multidimensional methods ignores the temporal continuity in data values. In wavelets, the coefficients describe properties of different contiguous temporal regions of the series. Each coefficient is equal to half the difference in the average value of the behavioral attribute between a pair of carefully chosen contiguous segments of the series. The resulting representation can be more easily analyzed like multidimensional data because temporal locality is already incorporated within the coefficients. By using only the largest coefficients for representation, it is possible to reconstruct the entire time series with little loss of accuracy. Typically, the number of retained coefficients is much smaller than the length of the original time series. Thus, the approach is a dimensionality reduction method as well. DWT is described in detail in Sect. 2.4.4.1 of Chap. 2.
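The "half the difference in the average value" description corresponds to the Haar wavelet, which can be sketched as repeated pairwise averaging. The code below is a simplified illustration, not the book's procedure from Sect. 2.4.4.1: the function names are hypothetical, the series length is assumed to be a power of two, and `keep_largest` ranks coefficients by raw magnitude, whereas a careful implementation would typically weight each coefficient by the length of the segment it covers.

```python
def haar_decompose(series):
    """Return [global average, coarsest detail, ..., finest details]."""
    n = len(series)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a length that is a power of two")
    averages = list(series)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        # Half the difference between the averages of two adjacent segments.
        level = [(a - b) / 2.0 for a, b in pairs]
        averages = [(a + b) / 2.0 for a, b in pairs]
        details = level + details  # coarser levels are placed in front
    return averages + details


def haar_reconstruct(coeffs):
    """Invert haar_decompose by expanding one level of detail at a time."""
    averages = coeffs[:1]
    pos = 1
    while pos < len(coeffs):
        level = coeffs[pos:pos + len(averages)]
        pos += len(level)
        expanded = []
        for avg, d in zip(averages, level):
            expanded.extend([avg + d, avg - d])
        averages = expanded
    return averages


def keep_largest(coeffs, k):
    """Zero out all but the k coefficients of largest magnitude (simplified ranking)."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


series = [8.0, 6.0, 2.0, 3.0, 4.0, 6.0, 6.0, 5.0]  # made-up example values
coeffs = haar_decompose(series)
approx = haar_reconstruct(keep_largest(coeffs, 3))  # lossy, reduced representation
```

Reconstructing from all coefficients recovers the series exactly; reconstructing from a few large ones gives a coarser approximation, which is the sense in which the transform acts as a dimensionality reduction method.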


14.2.4.2 Discrete Fourier Transform


Wavelets are most effective when most of the variations in the series can be captured in specific local regions of the series. In cases where the series contains global periodicity, the discrete Fourier transform (DFT) is more effective. Examples of scenarios in which each of these methods would perform well are provided in Fig. 14.2. The basic idea is that any series of length $n$ can be expressed as a linear combination of smooth periodic sinusoidal series. Along with a single constant term, the $n - 1$ sinusoidal series have periodicity drawn from $n, n/2, n/3, \ldots, n/(n - 1)$. The data can be reduced using this decomposition because only a small number of these constituent series have large enough contributions to be included.

[Figure 14.2: Preferred scenarios for DFT and DWT. Two example series are plotted as value against time index: one decomposable into periodic variations (good for the discrete Fourier transform) and one decomposable into local variations (good for the discrete wavelet transform).]

¹ The concept of “dimension” can be defined in two ways for time series data. Each behavioral attribute in a multivariate series can be viewed as a dimension. Alternatively, the different values in a univariate time series can be viewed as dimensions. The usage is often dependent on the semantics of the application at hand.

Consider a time series $x_0 \ldots x_{n-1}$. Each coefficient $X_k$ of the Fourier transform is a complex value, which is defined as follows:





