Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	332/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 328 329 330 331 332 333 334 335 ... 423

1-Data Mining tarjima

DT W (i, j) = distance(xi, yj ) + min
DT W (i − 1, j)
repeat yj

(16.2)

DT W (i − 1, j − 1)

otherwise

In the case of a 2-dimensional trajectory, we have a multivariate time series for each trajec-tory, corresponding to the two coordinates of each trajectory. Thus, the first trajectory has two coordinates corresponding to X1 = (x1₁ . . . x1_m) and X2 = (x 2₁ . . . x2_m). The second trajectory has two coordinates, corresponding to Y 1 = (y1₁ . . . y1_n) and Y 2 = (y 2₁ . . . y2_n). Let X_i = ( x1_i, x2_i) represent the 2-dimensional position of the first trajectory at the ith timestamp, and let Y_j = (y1_j , y2_j ) represent the 2-dimensional position of the second trajec-tory at the jth timestamp. Then, the only diﬀerence from the case of univariate time series

data is the substitution of the 1-dimensional distances in the recursion with 2-dimensional distances. Therefore, the modified multidimensional DTW recursion M DT W (i, j) is as fol-lows:

M DT W (i, j − 1)

repeat

M DT W (i, j) = distance(X_i, Y_j ) + min

X_i

M DT W (i

1, j)

repeat Y_j(16.3)

−

1, j − 1)

otherwise.

M DT W (i −

Note that the multidimensional DTW recursion is virtually identical to the univariate case, except for the term distance(X_i, Y _j ) that is now a multidimensional distance between spa-tial coordinates. For example, one might use the Euclidean distances. The simplicity of the generalization is a result of the fact that time warping has little to do with the dimen-sionality of the time series. All the dimensions in the time series are warped in exactly the same way. Therefore, the 1-dimensional distance in the recursion can be substituted with multidimensional distances. It should also be pointed out that this general principle applies to most of the dynamic programming methods for computing distances between temporal series and sequences.

16.3.4.2 Similarity-Based Clustering Methods

Many conventional clustering methods, such as k-medoids and graph-based methods, are based on the similarity between data objects. Therefore, once a similarity function has been defined, these methods can be used directly for virtually any data type. It should be pointed out that these methods are very popular in diﬀerent data domains, such as time series data or discrete sequence data. It was shown in Chaps. 14 and 15, how these methods may be used for time series and discrete sequence data, respectively. The approach used here is exactly analogos to the description in these chapters, except that multivariate time series similarity measures are used for computation in this case. The reader is referred to Chap. 6 for the basic description of the k-medoids and graph-based methods, as applied to multidimensional data. The main problem with similarity-based methods, is that they work best only when the trajectory segments are relatively short. For longer trajectories, it becomes more diﬃcult to compute the similarity between pairs of objects because many portions of the trajectories may be noisy. Therefore, the choice of similarity function becomes more important. Some of the similarity functions discussed in Sect. 3.4.1 of Chap. 3, allow for gaps in the similarity computation. However, the eﬀectiveness of these methods for multivariate time series and trajectories is highly data-specific. In general, these similarity functions are best used for trajectories of shorter lengths.

16.3. TRAJECTORY MINING

551

16.3.4.3 Trajectory Clustering as a Sequence Clustering Problem

Trajectory clustering methods can be performed with the same grid-based discretization methods that are used for frequent pattern mining in trajectories. A two-step approach is used. The first step is to use grid-based discretization to convert the trajectory into a 1-dimensional discrete sequence. Once this transformation has been performed, any of the sequence clustering methods discussed in Chap. 15 may be used. The overall clustering approach may be described as follows:

Use grid-based discretization, as discussed in Sect. 16.3.3.1, to convert the N trajec-tories to N discrete sequences.

Apply any of the sequence clustering methods of Sect. 15.3 in Chap. 15 to create clusters from the sequences.

Map the sequence clusters back to trajectory clusters.

As discussed in Sect. 16.3.3.1, the grid-based sequences constructed in the first step can be based on either the grid identifiers (spatial tile transformation) only, or on a combination of grid-identifiers and time-interval identifiers (spatiotemporal tile transformation). In the first case, the resulting clusters correspond to trajectories that are close together in space, but not necessarily in time. In the second case, the trajectories in a cluster will to be close together in space and occur at the same time. In other words, such clusters represent objects that are moving together in time.

One advantage of the sequence clustering approach, over similarity-based methods, is that many of the sequence clustering methods can ignore the irrelevant parts of the sequences in the clustering process. This is because many sequence clustering methods, such as subsequence-based clustering (Sect. 15.3.3 of Chap. 15), naturally allow for noisy gaps in the trajectories during the clustering process. This is important because longer tra-jectories often share significant segments in common, but may have gaps or regions where they are not similar. The ability to account for such nonmatching regions is not quite as eﬀective with similarity function methods that compute distances between trajectories as a whole.

16.3.5 Trajectory Outlier Detection

In this problem, it is assumed that N diﬀerent trajectories are available, and it is desirable to determine outlier trajectories, as those that diﬀer significantly from the trends in the underlying data. As with all data types, the problem of trajectory outlier detection is closely related to that of trajectory clustering. In particular, both problems utilize the notion of similarity between data objects. As in the case of data clustering, one can use either a similarity-based approach, or a transformational approach to outlier detection.

16.3.5.1 Distance-Based Methods

The ability to design a distance function between trajectories provides a way to define out-liers with the use of distance-based methods. In particular, the k-nearest neighbor method, or any distance-based method can easily be generalized to trajectories, once the distance function has been defined. For example, one may use the multidimensional time warping distance function to compute the distance of a trajectory to the N − 1 other trajectories. The kth nearest neighbor distance is reported as the outlier score. Other distance-based

552 CHAPTER 16. MINING SPATIAL DATA

methods such as LOF can also be extended to trajectory data because these methods are based only on distance values, and are agnostic to the underlying data type. As in the case of clustering, the major drawback of these methods is that it can be used eﬀectively for shorter trajectories, but not quite as eﬀectively in the case of longer trajectories. This is because longer trajectories will often have many noisy segments that are not truly indicative of anomalous behavior, but are disruptive to the underlying distance function.

16.3.5.2 Sequence-Based Methods

The spatial and spatiotemporal tile transformation discussed at the beginning of Sect. 16.3.3.1 can be used to transform trajectory outlier detection into sequence outlier detection. The advantage of this approach is that many methods are available for sequence outlier detection. As in the case of the other problems such as trajectory pattern mining and clustering, the approach consists of two steps:

Convert each of the N trajectories to sequences using either spatial tile transformation or spatiotemporal tile transformation, discussed at the beginning of Sect. 16.3.3.1.

Use any of the sequence outlier detection methods discussed in Sect. 15.4 of Chap. 15, to determine the outlier sequences.

Map the sequence outliers onto trajectory outliers.

This approach is particularly rich in terms of the types of the outliers it can find, by varying on the specific subroutines used in each of the aforementioned steps. Some examples of such variations are as follows:

In the first step of sequence transformation, either spatial or spatiotemporal tiles may be used. When spatial tiles are used, the discovered outliers are based only on the shape of the trajectory, and they are not based on the objects moving together. From an application-centric perspective, consider the case where trajectories of taxis are tracked by GPS, and it is desirable to determine taxis that take anomalous routes relative to other taxis at any period of time. Such an application can be addressed well with spatial tiles. Spatiotemporal tiles track online trends. For example, for a flock of GPS-tagged animals, if a particular animal deviates from its flock, it is reported as an outlier.

The formulations for sequence outlier detection are particularly rich. For example, sequence outlier detection allows the reporting of either position outliers or combina-tion outliers. This is discussed in detail in Sect. 15.4 of Chap. 15. Position outliers in the transformed sequences, map onto anomalous positions in the trajectories. For example, a taxi cab making an unusual turn at a junction will be detected. On the other hand, combination outliers will map onto unusual segments of trajectories.

Thus, the sequence-based transformation is particularly useful in being able to detect a rich diversity of diﬀerent types of outliers. It can determine outliers based on patterns of movements either over all periods, or at a particular period. It can also discover small segments of outliers in any portion of the trajectory.

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 328 329 330 331 332 333 334 335 ... 423