$$y^j_t = \sum_{r=1}^{p} a_r \cdot y^i_{t-r} + c + \epsilon_t$$
This is similar to the AR(p) model, except that the elements of stream i are being used to predict those of stream j. As in the case of the AR(p) model, least-squares regression
Algorithm UpdateClusters(Multivariate Stream: Y1 . . . Yt . . .,
    Current Set of Representatives: J)
begin
  Receive next time-stamp Yt of multivariate stream;
  repeat
    Add the stream to J that leads to the maximum decrease in regression error of the clustering;
    Drop the stream from J that leads to the least increase in regression error of the clustering;
    Assign each series to its closest representative in J to create the clustering C;
  until (J did not change in previous iteration);
  return (J, C);
end

Figure 14.9: Dynamically maintaining cluster representatives
can be used to learn the p coefficients. Furthermore, the training data is restricted to a window of size w > p to allow for stream evolution. The squared average of the white noise error terms, or the R²-statistic over the window of size w > p, can be used as the distance (similarity) between the two streams. Note that the regression coefficients can also be maintained incrementally, because they are already known at the previous timestamp and the model simply needs to be updated with the addition of a single data point. Most iterative optimization methods for least-squares regression, such as gradient descent, converge very quickly when starting with a near-optimal solution.
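As an illustrative sketch (not the book's exact formulation), the windowed least-squares similarity can be computed as follows. The lag order `p`, the window size `w`, and the use of NumPy's batch `lstsq` solver in place of the incremental update described above are assumptions made for clarity.

```python
import numpy as np

def regression_error(x, y, p=2, w=50):
    """Asymmetric distance between streams: mean squared error of
    predicting the last w values of stream y from the p preceding
    values of stream x (plus an intercept c) via least squares.
    Requires len(y) >= w + p."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    t = len(y)
    rows = range(t - w, t)  # training restricted to a window of size w > p
    # Design matrix: p lagged values of x, plus a column of ones for c.
    X = np.array([[x[s - r] for r in range(1, p + 1)] + [1.0] for s in rows])
    target = y[t - w:t]
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return float(np.mean((target - X @ coef) ** 2))
```

A small error means that stream i predicts stream j well, so the value can be used directly as the distance in the clustering step.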
This regression-based similarity function is not symmetric, because the error of predicting stream j from stream i is different from the error of predicting stream i from stream j. The representative set J can also be used to create a clustering C of the streams, by assigning each stream to the representative that best predicts it. Thus, at each step, the set of representatives J and the clustering C can be incrementally reported after updating the model. The pseudocode for the online stream clustering approach is illustrated in Fig. 14.9. This approach can be useful for trend analysis in financial markets, where a representative set of stocks needs to be tracked from the vast universe of stocks. Another relevant application is sensor selection, where a subset of representative sensors needs to be determined to lower the operational costs of sensor networks. The description in this section is based on a simplification of a cost-based dynamic sensor selection algorithm [50].
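A direct, non-incremental sketch of the add/drop heuristic of Fig. 14.9 follows, assuming a pairwise predictive-error function `dist`; the exhaustive error recomputation here is for exposition only, whereas the actual algorithm maintains the regression models incrementally.

```python
def clustering_error(streams, J, dist):
    """Total error of assigning every stream to the representative in J
    that predicts it best; also returns the assignment (the clustering C)."""
    total, assignment = 0.0, {}
    for j in range(len(streams)):
        best = min(J, key=lambda r: dist(streams[r], streams[j]))
        assignment[j] = best
        total += dist(streams[best], streams[j])
    return total, assignment

def update_clusters(streams, J, dist):
    """Repeat the add/drop steps until the representative set J is stable."""
    J = set(J)
    changed = True
    while changed:
        changed = False
        candidates = [s for s in range(len(streams)) if s not in J]
        if not candidates:
            break
        # Add the stream whose inclusion most decreases the clustering error,
        best_add = min(candidates,
                       key=lambda s: clustering_error(streams, J | {s}, dist)[0])
        grown = J | {best_add}
        # ... then drop the representative whose removal least increases it.
        best_drop = min(grown,
                        key=lambda s: clustering_error(streams, grown - {s}, dist)[0])
        new_J = grown - {best_drop}
        if new_J != J:
            J, changed = new_J, True
    return J, clustering_error(streams, J, dist)[1]
```

Because the add and drop are performed as a pair, the size of J stays fixed while its membership adapts to stream evolution.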
14.5.2 Shape-Based Clustering
The second type of clustering is shape-based clustering. In this case, the different time series may not be synchronized in time; the series are clustered on the basis of the similarity of the shape of the overall series. A first step is the design of a shape-based similarity function. A major challenge in this case is that the different series may be scaled, translated, or stretched differently. This issue was discussed in Sect. 3.4.1 of Chap. 3, and the illustration of Fig. 3.7 is replicated in Fig. 14.10. This figure illustrates different hypothetical stock tickers. In these cases, the three stocks show similar patterns, but with different scaling and random variations. Furthermore, in some cases, the time dimension may also be warped. For example, in Fig. 14.10, the entire set of values for stock A was stretched because the time-granularity information was not available to the analyst. This is referred to as time warping. Fortunately, the dynamic time warping (DTW) similarity function, discussed in Sect. 3.4.1 of Chap. 3, can address these issues. The design of an effective similarity function is, therefore, one of the most crucial steps in time series clustering.

[Figure 14.10 plot: four series labeled STOCK A, STOCK B (DROPPED READINGS), STOCK C, and STOCK A (WARPED); VALUE on the vertical axis versus TIME on the horizontal axis.]

Figure 14.10: Impact of scaling, translation and noise on clustering (revisiting Fig. 3.7)
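As a reminder of the tool referenced here, a minimal version of the DTW recurrence from Sect. 3.4.1 can be sketched as:

```python
import math

def dtw(a, b):
    """Dynamic time warping distance between series of possibly
    different lengths: the classic O(len(a) * len(b)) recurrence."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # repeat an element of b
                                 D[i][j - 1],      # repeat an element of a
                                 D[i - 1][j - 1])  # advance both series
    return D[n][m]
```

Because the recurrence allows either series to be locally stretched, a uniformly time-warped copy of a series has distance zero from the original.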
Many existing methods can be adapted to shape-based time series clustering with the use of different time series similarity functions. The k-medoids and graph-based methods can be used with almost any similarity function. Methods such as k-means can also be used, though in a more limited way, because the different time series need to be of the same length in order for the mean of a cluster to be defined meaningfully.
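For instance, a bare-bones k-medoids loop (a sketch, with random initialization as an assumption) works with any pairwise distance, including DTW, because it never needs to average series:

```python
import random

def k_medoids(objs, k, dist, iters=10, seed=0):
    """Alternate between assigning objects to their closest medoid and
    replacing each medoid by the cluster member with the smallest total
    in-cluster distance. Only pairwise distances are ever computed."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(objs)), k)
    for _ in range(iters):
        clusters = {m: [] for m in medoids}
        for i, o in enumerate(objs):
            closest = min(medoids, key=lambda m: dist(o, objs[m]))
            clusters[closest].append(i)
        medoids = [min(members,
                       key=lambda c: sum(dist(objs[c], objs[i]) for i in members))
                   for members in clusters.values() if members]
    return medoids
```

The distance argument can be any function of two series, which is exactly the flexibility that k-means lacks.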
14.5.2.1 k-Means
The k-means method for multidimensional data is discussed in Sect. 6.3.1 of Chap. 6. This method can be adapted to time series data, by changing the similarity function and the computation of the means of the time series. The computation of the similarity function can be adapted from Sect. 3.4.1 of Chap. 3. The precise choice of the similarity function may depend on the application at hand, though the k-means approach is optimized for the Euclidean distance function. This is because the k-means approach can be viewed as an iterative solution to an optimization problem, in which the objective function is constructed with the Euclidean distance. This aspect is discussed in more detail in Chap. 6.
The Euclidean distance function on a time series is defined in the same way as for multidimensional data. The means of the different time series are also defined in a similar way to multidimensional data. The k-means method is best used for databases of series of the same length, with a one-to-one correspondence between the time points. Thus, the centroid for each time point can be defined using this correspondence. Time warping is typically difficult to use meaningfully in k-means algorithms, because of the assumed one-to-one correspondence between the time series data points. For more generic distance functions such as DTW, other methods for time series clustering are more appropriate.
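A minimal sketch of this adaptation, assuming same-length series, the Euclidean distance, and point-wise centroid computation (the initialization scheme and iteration count are illustrative choices):

```python
import numpy as np

def ts_kmeans(series, k, iters=20, seed=0):
    """k-means for same-length time series: Euclidean distance over the
    one-to-one corresponding time points; each centroid value is the mean
    of the cluster's series at that time point."""
    X = np.asarray(series, float)            # shape: (num_series, length)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each series to the closest centroid.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Recompute centroids point by point (skip empty clusters).
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers
```

Note that the centroid computation averages values at each time point across the cluster's series, which is only meaningful because all series have the same length.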
14.5.2.2 k-Medoids
The main problem with the k-means approach is that it cannot incorporate arbitrary similarity (or distance) functions. The k-medoids approach can be used more effectively in