Data Mining: The Textbook




y_t^j = Σ_{r=1}^{p} a_r · y_{t−r}^i + c + ε_t

This is similar to the AR(p) model, except that the elements of stream i are being used to predict those of stream j. As in the case of the AR(p) model, least-squares regression




Algorithm UpdateClusters(Multivariate Stream: Y1 . . . Yt . . .,
        Current Set of Representatives: J)
begin
  Receive next time-stamp Yt of multivariate stream;
  repeat
    Add the stream to J that leads to the maximum decrease
      in regression error of the clustering;
    Drop the stream from J that leads to the least increase
      in regression error of the clustering;
    Assign each series to closest representative in J
      to create the clustering C;
  until (J did not change in previous iteration);
  return (J, C);
end

Figure 14.9: Dynamically maintaining cluster representatives


can be used to learn the p coefficients. Furthermore, the training data is restricted to a window of size w > p to allow for stream evolution. The squared average of the white noise error terms, or the R^2-statistic over the window of size w > p, can be used as the distance (similarity) between the two streams. Note that the regression coefficients can also be maintained incrementally because they are already known at the previous timestamp, and the model simply needs to be updated with the addition of a single data point. Most iterative optimization methods for least-squares regression, such as gradient descent, converge very fast when starting with a near-optimal solution.
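As an illustration, such a regression-based distance might be computed as follows. This is a sketch, not the book's code: the function name, the NumPy-based least-squares fit, and the use of the mean squared residual (rather than the R^2-statistic) as the distance are illustrative assumptions.

```python
import numpy as np

def regression_distance(yi, yj, p=3, w=50):
    """Asymmetric distance between streams i and j: fit the model
    yj[t] = sum_{r=1..p} a_r * yi[t-r] + c by least squares over the
    last w points, and return the mean squared residual (sketch)."""
    yi, yj = np.asarray(yi[-w:], dtype=float), np.asarray(yj[-w:], dtype=float)
    # Lagged design matrix: one column per lag r = 1..p, plus an intercept.
    X = np.column_stack([yi[p - r:len(yi) - r] for r in range(1, p + 1)])
    X = np.column_stack([X, np.ones(len(X))])
    y = yj[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    return float(np.mean(residuals ** 2))
```

Because only one point is appended per timestamp, the fitted coefficients from the previous window are a near-optimal warm start for the next one, which is what makes the incremental maintenance cheap.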

This regression-based similarity function is not symmetric because the error of predicting stream j from stream i is different from the error of predicting stream i from stream j. The representative set J can also be used to create a clustering C of the streams by assigning each stream to the representative that best predicts it. Thus, at each step, the set of representatives J and the clustering C can be incrementally reported after updating the model. The pseudocode for the online stream clustering approach is illustrated in Fig. 14.9. This approach can be useful for trend analysis in financial markets, where a representative set of stocks needs to be tracked from the vast universe of stocks. Another relevant application is sensor selection, where a subset of representative sensors needs to be determined to lower the operational costs of sensor networks. The description in this section is based on a simplification of a cost-based dynamic sensor selection algorithm [50].
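The add/drop loop of Fig. 14.9 can be sketched as follows. The function name, the dictionary of streams, and the termination test (stop when the swap does not change J or does not improve the error) are illustrative assumptions layered on the pseudocode.

```python
def update_clusters(streams, J, distance):
    """One invocation of the add/drop heuristic of Fig. 14.9 (sketch).

    streams  : dict mapping stream name -> series (assumed len(streams) > len(J) + 1)
    J        : current list of representative stream names
    distance : asymmetric error of predicting the 2nd series from the 1st
    """
    def error(reps):
        # total error when every non-representative stream is assigned
        # to the representative that predicts it best
        return sum(min(distance(streams[r], streams[s]) for r in reps)
                   for s in streams if s not in reps)

    while True:
        base = error(J)
        # add the stream that leads to the maximum decrease in error
        add = min((s for s in streams if s not in J),
                  key=lambda s: error(list(J) + [s]))
        cand = list(J) + [add]
        # drop the stream that leads to the least increase in error
        drop = min(cand, key=lambda r: error([x for x in cand if x != r]))
        new_J = [x for x in cand if x != drop]
        if drop == add or error(new_J) >= base:
            break                     # J did not change / no improvement
        J = new_J
    # assign each non-representative series to its closest representative
    clusters = {s: min(J, key=lambda r: distance(streams[r], streams[s]))
                for s in streams if s not in J}
    return J, clusters
```

With the asymmetric regression distance above, `distance(streams[r], streams[s])` is the error of predicting s from the representative r, so each cluster groups streams around the representative that predicts them best.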


14.5.2 Shape-Based Clustering


The second type of clustering is derived from shape-based clustering. In this case, the different time series may not be synchronized in time. The time series are clustered on the basis of the similarity of the shape of the overall series. A first step is the design of a shape-based similarity function. A major challenge in this case is that the different series may be scaled, translated, or stretched differently. This issue was discussed in Sect. 3.4.1 of Chap. 3. The illustration of Fig. 3.7 is replicated in Fig. 14.10. This figure illustrates different hypothetical stock tickers. In these cases, the three stocks show similar patterns, but with different scaling and random variations. Furthermore, in some cases, the time dimension may also be warped. For example, in Fig. 14.10, the entire set of values for


[Figure omitted: line plot of VALUE (0–40) versus TIME (0–500) for four series: STOCK A, STOCK B (DROPPED READINGS), STOCK C, and STOCK A (WARPED).]

Figure 14.10: Impact of scaling, translation and noise on clustering (revisiting Fig. 3.7)


stock A was stretched because the time-granularity information was not available to the analyst. This is referred to as time warping. Fortunately, the dynamic time warping (DTW) similarity function, discussed in Sect. 3.4.1 of Chap. 3, can address these issues. The design of an effective similarity function is, therefore, one of the most crucial steps in time series clustering.
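A minimal sketch of the classic DTW recurrence (unconstrained, with O(nm) time and space; the function name is hypothetical):

```python
import math

def dtw_distance(a, b):
    """Dynamic time warping distance between two series (sketch).
    Many-to-one alignments are allowed, so series that are stretched
    or compressed in time can still be matched point-for-point."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: diagonal match, repeat a[i-1], repeat b[j-1]
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]
```

For example, a series and a time-stretched copy of it (one value repeated) have DTW distance zero, whereas their Euclidean distance would be undefined or large.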

Many existing methods can be adapted to shape-based time series clustering with the use of different time series similarity functions. The k-medoids and graph-based methods can be used with almost any similarity function. Methods such as k-means can also be used, though in a more limited way. This is because the different time series need to be of the same length in order for the mean of a cluster to be defined meaningfully.


14.5.2.1 k-Means


The k-means method for multidimensional data is discussed in Sect. 6.3.1 of Chap. 6. This method can be adapted to time series data, by changing the similarity function and the computation of the means of the time series. The computation of the similarity function can be adapted from Sect. 3.4.1 of Chap. 3. The precise choice of the similarity function may depend on the application at hand, though the k-means approach is optimized for the Euclidean distance function. This is because the k-means approach can be viewed as an iterative solution to an optimization problem, in which the objective function is constructed with the Euclidean distance. This aspect is discussed in more detail in Chap. 6.


The Euclidean distance function on a time series is defined in the same way as for multidimensional data. The means of the different time series are also defined in a similar way to multidimensional data. The k-means method is best used for databases of series of the same length with a one-to-one correspondence between the time points. Thus, the centroid for each time point can be defined using this correspondence. Time warping is typically difficult to use in k-means algorithms in a meaningful way because of the assumption of a one-to-one correspondence between the time series data points. For more generic distance functions such as DTW, other methods for time series clustering are more appropriate.
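Lloyd's iterations adapted to equal-length series might look as follows. Each centroid is the pointwise mean of its cluster, which is only meaningful because the time points correspond one-to-one; the function name and NumPy-based implementation are illustrative assumptions.

```python
import numpy as np

def kmeans_series(X, k, iters=20, seed=0):
    """k-means on equal-length time series (sketch).
    X: (n_series, length) array; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # initialize centroids as k distinct series chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Euclidean distance from every series to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # centroid of each cluster is the pointwise mean of its members
        centroids = np.array([X[labels == c].mean(axis=0)
                              if np.any(labels == c) else centroids[c]
                              for c in range(k)])
    return labels, centroids
```

Replacing the Euclidean distance here with DTW would break the objective function that k-means implicitly optimizes, which is why the text recommends k-medoids for such distances.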


14.5.2.2 k-Medoids


The main problem with the k-means approach is the fact that it cannot incorporate arbitrary similarity (or distance) functions. The k-medoids approach can be used more effectively in



