Data Mining: The Textbook



[Figure 2.5 could not be reproduced here. It depicts the series (8, 6, 2, 3, 4, 6, 6, 5) together with its wavelet decomposition: the averages (7, 2.5, 5, 5.5), (4.75, 5.25), and (5) at successively coarser granularities, the wavelet coefficients 1, -0.5, -1, 0.5 (order 3), 2.25, -0.25 (order 2), and -0.25 (order 1) with the global average 5, and the shape of each wavelet basis vector.]

Figure 2.5: Illustration of the wavelet decomposition


wavelet basis vector is a time series that represents the temporal range of this variation in the form of a simple step function. The wavelet coefficients are of different orders, depending on the length of the time-series segment analyzed, which also represents the granularity of analysis. The higher-order coefficients represent the broad trends in the series because they correspond to larger ranges. The more localized trends are captured by the lower-order coefficients. Before providing a more notational description, a simple recursive description of wavelet decomposition of a time series segment S is provided below in two steps:





  1. Report half the average difference of the behavioral attribute values between the first and second temporal halves of S as a wavelet coefficient.




  2. Recursively apply this approach to the first and second temporal halves of S.

At the end of the process, a reduction process is performed, where larger (normalized) coefficients are retained. This normalization step will be described in detail later.
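The two recursive steps above can be sketched directly in code. The function below is an illustrative implementation, not taken from the book; it returns the unnormalized coefficients together with the global average, listing each segment's coefficient before those of its two halves.

```python
def haar_coefficients(series):
    """Recursive Haar decomposition of a series whose length is a power of 2.

    Returns (coefficients, average): the coefficient for a segment is half
    the difference between the averages of its first and second halves.
    """
    if len(series) == 1:
        return [], series[0]
    half = len(series) // 2
    left_coeffs, left_avg = haar_coefficients(series[:half])
    right_coeffs, right_avg = haar_coefficients(series[half:])
    coeff = (left_avg - right_avg) / 2   # step 1: half the average difference
    avg = (left_avg + right_avg) / 2     # average passed up to the caller
    return [coeff] + left_coeffs + right_coeffs, avg

coeffs, global_avg = haar_coefficients([8, 6, 2, 3, 4, 6, 6, 5])
# global_avg == 5.0
# coeffs == [-0.25, 2.25, 1.0, -0.5, -0.25, -1.0, 0.5]
```

Note that the recursion produces exactly one coefficient per internal segment, plus the one global average needed for reconstruction.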


A more formal and notation-intensive description will be provided at this point. For ease in discussion, assume that the length q of the series is a power of 2. For each value of k ≥ 1, the Haar wavelet decomposition defines 2^{k−1} coefficients of order k. Each of these 2^{k−1} coefficients corresponds to a contiguous portion of the time series of length q/2^{k−1}. The ith of these 2^{k−1} coefficients corresponds to the segment in the series starting from position (i − 1) · q/2^{k−1} + 1 to the position i · q/2^{k−1}. Let us denote this coefficient by ψ_k^i and the corresponding time-series segment by S_k^i. At the same time, let us define the average value of the first half of S_k^i by a_k^i and that of the second half by b_k^i. Then, the value of ψ_k^i is given by (a_k^i − b_k^i)/2. More formally, if Φ_k^i denotes the average value of S_k^i, then the value of ψ_k^i can be defined recursively as follows:





ψ_k^i = (Φ_{k+1}^{2i−1} − Φ_{k+1}^{2i})/2

(2.14)

The set of Haar coefficients is defined by all the coefficients of order 1 to log_2(q). In addition, the global average Φ_1^1 is required for the purpose of perfect reconstruction. The total number of coefficients is exactly equal to the length of the original series, and the dimensionality reduction is obtained by discarding the smaller (normalized) coefficients. This will be discussed later.
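As a sanity check on the notation, Φ_k^i and ψ_k^i can be computed directly from their definitions. The helper functions below are hypothetical, written only to illustrate Eq. 2.14; the index i is 1-based, as in the text.

```python
def phi(series, k, i):
    """Average Phi_k^i of the i-th contiguous segment of length q/2^(k-1)."""
    q = len(series)
    seg_len = q // 2 ** (k - 1)
    start = (i - 1) * seg_len
    return sum(series[start:start + seg_len]) / seg_len

def psi(series, k, i):
    """Order-k coefficient psi_k^i = (Phi_{k+1}^{2i-1} - Phi_{k+1}^{2i}) / 2."""
    return (phi(series, k + 1, 2 * i - 1) - phi(series, k + 1, 2 * i)) / 2

s = [8, 6, 2, 3, 4, 6, 6, 5]
# phi(s, 1, 1) == 5.0 (the global average)
# psi(s, 1, 1) == -0.25, psi(s, 2, 1) == 2.25, psi(s, 3, 1) == 1.0
```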


















CHAPTER 2. DATA PREPARATION










[Figure 2.6 could not be reproduced here. It shows the error tree: a super-root holding the series average 5 (relevant range [1, 8]); below it the order-1 coefficient -0.25 (range [1, 8]); the order-2 coefficients 2.25 (range [1, 4]) and -0.25 (range [5, 8]); the order-3 coefficients 1, -0.5, -1, 0.5 for the relevant ranges [1, 2], [3, 4], [5, 6], [7, 8]; and the original series values 8, 6, 2, 3, 4, 6, 6, 5 as leaves, each reached by adding coefficients along left branches and subtracting them along right branches.]

Figure 2.6: The error tree from the wavelet decomposition


The coefficients of different orders provide an understanding of the major trends in the data at a particular level of granularity. For example, the coefficient ψki is half the quantity by which the first half of the segment Ski is larger than the second half of the same segment. Because larger values of k correspond to geometrically reducing segment sizes, one can obtain an understanding of the basic trends at different levels of granularity. This definition of the Haar wavelet makes it very easy to compute by a sequence of averaging and differencing operations. Table 2.2 shows the computation of the wavelet coefficients for the sequence (8, 6, 2, 3, 4, 6, 6, 5). This decomposition is illustrated in graphical form in Fig. 2.5. Note that each value in the original series can be represented as a sum of log2(8) = 3 wavelet coefficients with a positive or negative sign attached in front. In general, the entire decomposition may be represented as a tree of depth 3, which represents the hierarchical decomposition of the entire series. This is also referred to as the error tree. In Fig. 2.6, the error tree for the wavelet decomposition in Table 2.2 is illustrated. The nodes in the tree contain the values of the wavelet coefficients, except for a special super-root node that contains the series average.
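The averaging-and-differencing computation of Table 2.2 can be reproduced iteratively. This sketch, assuming the same running example, pairs adjacent values at each level, keeping half their difference as a detail coefficient and half their sum as the next coarser series.

```python
def haar_table(series):
    """Iterative Haar decomposition by averaging and differencing pairs.

    Returns one (averages, details) row per level of granularity,
    finest first, mirroring the layout of Table 2.2.
    """
    averages, rows = list(series), []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        details = [(a - b) / 2 for a, b in pairs]   # wavelet coefficients
        averages = [(a + b) / 2 for a, b in pairs]  # next coarser series
        rows.append((averages, details))
    return rows

for avgs, details in haar_table([8, 6, 2, 3, 4, 6, 6, 5]):
    print(avgs, details)
# [7.0, 2.5, 5.0, 5.5] [1.0, -0.5, -1.0, 0.5]
# [4.75, 5.25] [2.25, -0.25]
# [5.0] [-0.25]
```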

The number of wavelet coefficients in this series is 8, which is also the length of the original series. The original series has been replicated just below the error tree in Fig. 2.6, and can be reconstructed by adding or subtracting the values in the nodes along the path leading to that value. Each coefficient in a node should be added if we use the left branch below it to reach the series value. Otherwise, it should be subtracted. This natural decomposition means that an entire contiguous range along the series can be reconstructed by using only the portion of the error tree that is relevant to it.
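The root-to-leaf reconstruction can be sketched as follows. This is a hypothetical helper, with coefficients supplied level by level from the root: each left branch adds the node's coefficient, each right branch subtracts it.

```python
def reconstruct(global_avg, levels, q):
    """Rebuild a series of length q from its error tree.

    levels[k] lists the 2**k coefficients of order k+1 from left to right;
    global_avg is the value stored at the super-root.
    """
    values = []
    for t in range(q):
        v, lo, hi = global_avg, 0, q
        for level in levels:
            idx = lo // (q // len(level))  # coefficient covering [lo, hi)
            mid = (lo + hi) // 2
            if t < mid:        # left branch: add the coefficient
                v += level[idx]
                hi = mid
            else:              # right branch: subtract it
                v -= level[idx]
                lo = mid
        values.append(v)
    return values

reconstruct(5.0, [[-0.25], [2.25, -0.25], [1.0, -0.5, -1.0, 0.5]], 8)
# -> [8.0, 6.0, 2.0, 3.0, 4.0, 6.0, 6.0, 5.0]
```

Each value thus requires only the log_2(q) coefficients on its path, which is what makes range reconstruction local.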


As in all dimensionality reduction methods, smaller coefficients are ignored. We will explain the process of discarding coefficients with the help of the notion of the basis vectors associated with each coefficient:




The wavelet representation is a decomposition of the original time series of length q into the weighted sum of a set of q “simpler” time series (or wavelets) that are orthogonal to one another. These “simpler” time series are the basis vectors, and the wavelet coefficients represent the weights of the different basis vectors in the decomposition.

Figure 2.5 shows these “simpler” time series along with their corresponding coefficients.


The number of wavelet coefficients (and basis vectors) is equal to the length of the series q.





2.4. DATA REDUCTION AND TRANSFORMATION


The length of the time series representing each basis vector is also q. Each basis vector has a +1 or −1 value in the contiguous time-series segment from which a particular coefficient was derived by a differencing operation. Otherwise, the value is 0 because the wavelet is not related to variations in that region of the time series. The first half of the nonzero segment of the basis vector is +1, and the second half is −1. This gives it the shape of a wavelet when it is plotted as a time series, and also reflects the differencing operation in the relevant time-series segment. Multiplying a basis vector with the coefficient has the effect of creating a weighted time series in which the difference between the first half and second half reflects the average difference between the corresponding segments in the original time series. Therefore, by adding up all these weighted wavelets over different levels of granularity in the error tree, it is possible to reconstruct the original series. The basis vectors in Fig. 2.5 are the rows of the following matrix:






( 1  −1   0   0   0   0   0   0 )
( 0   0   1  −1   0   0   0   0 )
( 0   0   0   0   1  −1   0   0 )
( 0   0   0   0   0   0   1  −1 )
( 1   1  −1  −1   0   0   0   0 )
( 0   0   0   0   1   1  −1  −1 )
( 1   1   1   1  −1  −1  −1  −1 )

Note that the dot product of any pair of basis vectors is 0, and therefore these series are orthogonal to one another. The most detailed coefficients have only one +1 and one −1 entry, whereas the most coarse coefficient has four +1 and four −1 entries. In addition, the vector (1, 1, 1, 1, 1, 1, 1, 1) is needed to represent the series average.
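Both the orthogonality claim and the reconstruction as a weighted sum of basis vectors are easy to verify numerically. The snippet below lists the seven wavelet rows plus the all-ones average vector, checks every pairwise dot product, and rebuilds the series from the Table 2.2 coefficients.

```python
basis = [
    [1, -1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, -1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, -1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, -1],
    [1, 1, -1, -1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, -1, -1],
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, 1, 1, 1, 1, 1, 1, 1],   # average vector, listed last
]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Every distinct pair of basis vectors is orthogonal.
assert all(dot(basis[i], basis[j]) == 0
           for i in range(8) for j in range(8) if i != j)

# Unnormalized coefficients in the same row order as `basis`.
a = [1, -0.5, -1, 0.5, 2.25, -0.25, -0.25, 5]
series = [sum(ai * w[t] for ai, w in zip(a, basis)) for t in range(8)]
# series == [8.0, 6.0, 2.0, 3.0, 4.0, 6.0, 6.0, 5.0]
```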


For a time series T, let W_1 … W_q be the corresponding basis vectors. Then, if a_1 … a_q are the wavelet coefficients for the basis vectors W_1 … W_q, the time series T can be represented as follows:



T = Σ_{i=1}^{q} a_i · W_i = Σ_{i=1}^{q} (a_i · ||W_i||) · (W_i / ||W_i||)    (2.15)

The coefficients represented in Fig. 2.5 are unnormalized because the underlying basis vectors do not have unit norm. While a_i is the unnormalized value from Fig. 2.5, the values a_i · ||W_i|| represent normalized coefficients. The values of ||W_i|| are different for coefficients of different orders, and are equal to √2, √4, or √8 in this particular example. For example, in Fig. 2.5, the broadest-level unnormalized coefficient is −0.25, whereas the corresponding normalized value is −0.25 · √8. After normalization, the basis vectors W_1 … W_q are orthonormal, and, therefore, the sum of the squares of the corresponding (normalized) coefficients is equal to the retained energy in the approximated time series. Because the normalized coefficients provide a new coordinate representation after axis rotation, Euclidean distances between time series are preserved in this new representation if coefficients are not dropped. It can be shown that by retaining the coefficients with the largest normalized values, the error loss from the wavelet representation is minimized.
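The normalization and retention steps can be sketched for the running example. The norms below follow the text (√2 for the order-3 rows, √4 = 2 for order-2, √8 for the order-1 coefficient and the average vector); the choice k = 2 is arbitrary and only illustrative.

```python
import math

# Unnormalized coefficients for (8, 6, 2, 3, 4, 6, 6, 5), finest first,
# paired with the norm ||W_i|| of the corresponding basis vector.
coeffs = [1.0, -0.5, -1.0, 0.5,  # order 3, ||W|| = sqrt(2)
          2.25, -0.25,           # order 2, ||W|| = sqrt(4) = 2
          -0.25,                 # order 1, ||W|| = sqrt(8)
          5.0]                   # series average, ||W|| = sqrt(8)
norms = [math.sqrt(2)] * 4 + [2.0] * 2 + [math.sqrt(8)] * 2
normalized = [a * n for a, n in zip(coeffs, norms)]

# With orthonormal basis vectors, the total energy equals the sum of the
# squared series values: 8^2 + 6^2 + ... + 5^2 = 226.
total_energy = sum(v * v for v in normalized)

# Retain the k coefficients with the largest absolute normalized value;
# the retained energy is the sum of their squares.
k = 2
keep = sorted(range(len(normalized)),
              key=lambda i: abs(normalized[i]), reverse=True)[:k]
retained_energy = sum(normalized[i] ** 2 for i in keep)
# keep == [7, 4]; retained_energy == 220.25 of a total 226
```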

The previous discussion focused on the approximation of a single time series. In practice, one might want to convert a database of N time series into N multidimensional vectors. When a database of multiple time series is available, then two strategies can be used:





  1. The coefficient for the same basis vector is selected for each series to create a meaningful multidimensional database of low dimensionality. Therefore, the basis vectors



