Data Mining: The Textbook



\[
x_r = \frac{1}{\sqrt{n}} \left( \sum_{k=0}^{n-1} X_k \cdot \cos(r \omega k) + i \sum_{k=0}^{n-1} X_k \cdot \sin(r \omega k) \right) \quad \forall r \in \{0 \ldots n-1\}
\]
Here, ω is set to 2π/n radians. Since the imaginary part of this summation is always 0 for real values of x_r, let us expand the real part by assuming X_k = a_k + i·b_k:

\[
x_r = \frac{1}{\sqrt{n}} \left( \sum_{k=0}^{n-1} (a_k + i b_k) \cdot \cos(r \omega k) + i \sum_{k=0}^{n-1} (a_k + i b_k) \cdot \sin(r \omega k) \right) \quad \forall r \in \{0 \ldots n-1\}
\]
By ignoring the imaginary part, we obtain:

\[
x_r = \frac{1}{\sqrt{n}} \left( \sum_{k=0}^{n-1} a_k \cdot \cos(r \omega k) - \sum_{k=0}^{n-1} b_k \cdot \sin(r \omega k) \right) \quad \forall r \in \{0 \ldots n-1\}
\]
\[
= \frac{1}{\sqrt{n}} \sum_{k=0}^{n-1} \sqrt{a_k^2 + b_k^2} \cdot \cos(r \omega k + \theta_k) \quad \forall r \in \{0 \ldots n-1\}
\]

Here, we have θ_k = cos⁻¹(a_k / √(a_k² + b_k²)). All terms with k ≥ 1 are periodic. In other words,
the time series can be decomposed into n − 1 periodic sinusoidal components, so that the kth component has a periodicity of n/k and an amplitude of √(a_k² + b_k²). Therefore, if a periodic component has a very high amplitude relative to the other components, the entire series will be dominated by its periodic behavior. In order to detect such components, the mean and standard deviation of all the n amplitudes are determined. Any amplitude √(a_k² + b_k²) that is at least δ standard deviations greater than the mean is flagged. Such a component has periodicity n/k, and its periodicity will be apparent in the series because of its high amplitude. Note that the smaller Fourier coefficients are also discarded in the case of dimensionality reduction. However, when the threshold δ is chosen more aggressively (i.e., a very large positive value such as 3), only 2 or 3 coefficients remain, and the periodicity of the residual series becomes apparent. Furthermore, only values of k ∈ (β, n/α) are relevant for discovering patterns that have a period of at least α ≥ 2 and have appeared at least β ≥ 2 times in the series. The bibliographic notes contain pointers to methods for discovering partial periodic patterns.
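As a concrete illustration of this detection procedure, the following is a minimal sketch in Python. It assumes NumPy's orthonormal FFT (norm="ortho"), which matches the 1/√n convention used above; the function name, the default δ = 3, and the synthetic example are illustrative rather than taken from the book.

```python
import numpy as np

def flagged_periods(x, delta=3.0, alpha=2, beta=2):
    """Flag dominant periodic components of a real-valued series x.

    A component k is flagged when its amplitude sqrt(a_k^2 + b_k^2) is at
    least `delta` standard deviations above the mean amplitude, restricted
    to k in (beta, n/alpha) as described above. Returns (k, period) pairs.
    """
    n = len(x)
    X = np.fft.fft(x, norm="ortho")        # X_k = a_k + i * b_k
    amp = np.abs(X)                        # sqrt(a_k^2 + b_k^2)
    mean, std = amp.mean(), amp.std()
    k = np.arange(n)
    mask = (k > beta) & (k < n / alpha) & (amp > mean + delta * std)
    return [(int(j), n / j) for j in k[mask]]

# Example: a strong sinusoid of period 25 buried in unit-variance noise.
rng = np.random.default_rng(0)
t = np.arange(200)
series = 4 * np.sin(2 * np.pi * t / 25) + rng.standard_normal(200)
print(flagged_periods(series))             # expect [(8, 25.0)]: period n/k = 200/8
```

Only the component with k = 8 survives the threshold in this example; its conjugate mirror at k = 192 is excluded by the k < n/α restriction.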
14.5 Time Series Clustering

Time series data clustering can be defined in two different ways, depending on the application-specific scenario.





  1. In the first approach, real-time clustering is performed on time series that are received simultaneously in time. For example, in a financial market application, it may be desirable to segment the series into groups that coevolve over time. In this case, the values in the different time series are compared to one another in an approximately synchronized way. Typically, the analysis is performed on a small window of the recent history. The time series are clustered into groups based on correlations between series in the window. Furthermore, the clustering is performed in online fashion, and the different series may move across different clusters. For example, a stock ticker for IBM may move along with Microsoft on one day, but not the next.





  2. In the second approach, a database of time series is available. These different time series may or may not have been collected at the same instant. It is desirable to cluster these series on the basis of their shapes. For example, in an application containing electrocardiogram (ECG) time series, the different patients may have contributed a time series to the database at different instants. Shape matching typically requires the use of the time series similarity functions discussed in Sect. 3.4.1 of Chap. 3; a minimal sketch of one such function appears after this list. Thus, both the contextual attribute and the behavioral attribute(s) may be warped or scaled, depending on the nature of the similarity function. In such cases, the different time series may not even be of equal length.
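One such similarity function that handles series of unequal length is dynamic time warping (DTW). The following is a minimal sketch; the absolute-difference cost and the absence of a warping-window constraint are simplifying assumptions.

```python
import numpy as np

def dtw_distance(x, y):
    """Minimal dynamic time warping sketch for two 1-D series.

    D[i][j] holds the cost of the best alignment of x[:i] with y[:j];
    each step either repeats an element of one series or advances both.
    """
    nx, ny = len(x), len(y)
    D = np.full((nx + 1, ny + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[nx, ny]
```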

In this section, the different kinds of clustering methods will be discussed in detail. The problem becomes much more difficult when shape-based clustering is applied to multivariate time series. One solution is to generalize the time series similarity functions to the multivariate case, though a full discussion of this topic is beyond the scope of this book. Relevant pointers may be found in the bibliographic notes.


For shape-based clustering, the special case of bivariate and trivariate series can also be addressed with the use of trajectory clustering. An example of how multivariate series may be converted to trajectory data is found in Sect. 1.3.2.3 of Chap. 1. Methods for trajectory clustering are discussed in Sect. 16.3.4 of Chap. 16.


14.5.1 Online Clustering of Coevolving Series


The problem of online clustering of coevolving series is based on determining correlations across the series, in online fashion. This is useful in many real-time applications such as financial markets because it provides an understanding of the aggregate trends in the series. In these cases, the time series are clustered based on their correlations in a window of length p. Because of the use of correlations to define similarity, the approach is referred to as time series correlation clustering. The ORCLUS correlation clustering algorithm for multidimensional data was discussed in Chap. 7. A similar principle applies to time series data, except that the correlation is measured between different components (behavioral dimensions) of the multivariate time series. The same temporal window is used for the different time series in order to compute the correlations. Therefore, the analysis of the different streams is temporally synchronized.
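As a concrete illustration of the synchronized, windowed computation, the following is a minimal sketch; the function name and the use of the Pearson correlation (via np.corrcoef) over the last p ticks are illustrative assumptions.

```python
import numpy as np

def window_correlations(streams, p):
    """Pairwise correlations of d synchronized streams over the most
    recent window of length p.

    `streams` is an array of shape (d, t) whose rows are the individual
    series; the result is the d x d correlation matrix of the window.
    """
    return np.corrcoef(streams[:, -p:])
```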


A natural approach is to use regression-based similarity functions to compute the similarities between different streams. It is not necessary for the two streams to be positively correlated. Rather, the streams may be highly negatively correlated. The key issue here is the predictability of the different time series with respect to each other. For example, in Fig. 14.8, the series A and B are very similar because they are perfectly negatively correlated with one another. This is because these two series can be predicted from one another. Series C, on the other hand, is very different and has low predictability with respect to either stream. Regression-based similarity is useful in applications where it is desired to maximize the predictive power of cluster representatives. An example is sensor selection, where a subset of sensors needs to be selected to maximize the ability to predict the values of all other sensors. Because prediction is one of the most fundamental problems in real-time time series analysis, the use of regression-based similarity is natural in such scenarios. This is different from offline shape-based analysis, where more conventional time series similarity functions, such as DTW, are used. A method that directly uses regression analysis for real-time time series clustering is referred to as online time series correlation clustering.

Figure 14.8: Time series correlation clustering (plot of VALUE against TIME INDEX for SERIES A, SERIES B, and SERIES C)


For ease in discussion, we will treat the d time series as a single multivariate series with d behavioral attributes. The multivariate time series of length t is denoted by Y1 . . . Yt, where the value Yt at the tth tick contains the values (yt1 . . . ytd) of the d streams. The goal is therefore to always maintain a partition of the d series, so that highly correlated components are assigned to the same partition. A representative-based approach is used for clustering. The basic idea is to incrementally maintain a set of k representative time series from the d series in real time. This representative set, denoted by J, is similar to the representative set of a k-medoids algorithm. After the representatives have been determined, all of the time series streams can be assigned to one of the representatives with the use of a time series similarity function. Each series is assigned to its closest representative. This similarity function will be discussed later in more detail.

A natural approach is to incrementally maintain the representatives, and to add streams to, or drop streams from, the set J where necessary. The clustering is implicitly defined by the assignment of the d time series to their closest representatives. Therefore, when a new time series data point arrives, the current set of representatives J needs to be updated. Streams are iteratively exchanged between the current cluster representatives and the nonrepresentatives to optimize a quality criterion based on minimizing the error. The similarity between a representative stream i and a nonrepresentative stream j is the regression error of predicting stream j from stream i. The idea is that true cluster representatives can be used to accurately predict the other streams. To predict the stream j from stream i, a model similar to the autoregressive model is used, except that the elements of stream i are used to predict stream j, instead of its own elements. Thus, the regression model is as follows:
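The model expresses the values of stream j as a linear function of the preceding values of stream i. A minimal sketch of the resulting similarity measure and of the assignment step is given below; the lag p, the function names, and the use of the root-mean-squared error are illustrative assumptions rather than the book's exact formulation.

```python
import numpy as np

def regression_error(source, target, p=3):
    """Predict target[t] from the previous p values of source by least
    squares, and return the root-mean-squared prediction error.
    Lower error means higher similarity; the two series are assumed to
    be synchronized arrays of equal length."""
    n = len(target)
    # Row for tick t holds (source[t-1], ..., source[t-p]).
    X = np.column_stack([source[p - r: n - r] for r in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(X, target[p:], rcond=None)
    return float(np.sqrt(np.mean((X @ coef - target[p:]) ** 2)))

def assign_to_representatives(streams, J, p=3):
    """Assign each nonrepresentative stream to the representative in J
    that predicts it with the lowest regression error, mirroring the
    k-medoids-style assignment described above."""
    return {j: min(J, key=lambda i: regression_error(streams[i], streams[j], p))
            for j in range(len(streams)) if j not in J}
```

Note that a stream that is perfectly negatively correlated with a representative yields a near-zero error under this measure, which is consistent with the predictability criterion discussed above.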




