16.2. MINING WITH CONTEXTUAL SPATIAL ATTRIBUTES
|
543
|
distance-based proximity. Thus, graph representations allow more generic interpretations of the contextual attribute.
Let S be the set of neighbors of a given node i. Then, the concept of spatial continuity can be used to create a predicted value of the behavioral attribute based on those of its (spatial) neighbors. The strength of the links between i and its neighbors can also be used to compute the predicted values as either the weighted mean or median on the behavioral attribute of the k nearest spatial neighbors. For a given spatial object o with the behavioral attribute value f (o), let o1 . . . ok be its k linked neighbors based on the relationship graph. Let the weight of the link (o, oi) be w(o, oi). Then, the linkage-based weighted mean may be used to compute the predicted value g(o) of the object o.
|
k
|
|
g(o) =
|
i=1 w(o, oi) · f (oi)
|
|
k
|
|
|
|
|
i=1 w(o, oi)
|
|
Alternatively, the weighted median of the neighbor values may be used for predictive pur-poses. Since the true value of the behavioral attribute is known, this can be used to model the deviations of the behavioral attributes from their predicted values. As in the case of multidimensional methods, the value of f (o) − g(o ) represents a deviation from the pre-dicted values. Extreme value analysis can be used on these deviations to determine the spatial outliers. This process is identical to that in the multidimensional case. The nodes with high values of the normalized deviation may be reported as outliers.
16.2.5.2 Shape Outliers
Shape-based outliers are relatively easy to determine in spatial data, with the use of the transformation from spatial data to time series described in Sect. 16.2.1. After the trans-formation has been performed, a k-nearest neighbor outlier detector can be applied to the resulting time series. The distance to the k th-nearest neighbor can be reported as the outlier score. A few key issues need to be kept in mind, while computing the outlier score.
The distance function needs to be modified to account for the rotation invariance of shape matching. This is achieved by comparing all cyclic shifts of one time series to the other. The rotation invariant distance can be captured by Eq. 16.1.
In some applications, mirror image invariance also needs to be accounted for. In such cases, all cyclic shifts and their reversals need to be included in the aforementioned comparison. The outliers are determined with respect to this enhanced database.
While a vanilla k-nearest neighbor detector can determine the outliers correctly, the approach can be made faster by pruning. The basic idea is similar to the Hotsax method discussed in Chap. 14, where a nested loop structure is used to maintain the top-n outliers. The outer loop corresponds to the selection of different candidates, and the inner loop cor-responds to the computation of the k-nearest neighbors of each of these candidates. The inner loop can be terminated early, when the k-nearest neighbor value is less than the nth best outlier found so far. For optimal performance, the candidates in the outer loop and the computations in the inner loop need to be ordered appropriately.
This ordering is performed as follows. A combination of the SAX representation and LSH-hashing is used to create clusters on the candidates. Candidates which map to clusters with few members are examined first in the outer loop to discover high quality outliers early in the algorithm execution. Objects which appear in the same cluster as the outer
544 CHAPTER 16. MINING SPATIAL DATA
loop candidate are examined first in the inner loop to ensure fast termination of the inner loop. This facilitates better pruning performance. The bibliographic notes contain pointers to specific details of the creation of SAX-based clusters in shape outlier detection.
16.2.6 Classification of Shapes
It is assumed that a set of N labeled shapes are used to conduct the training. This trained model is used to perform classification of test instances, for which the label is unknown. The transformation from spatial into time series data is a useful tool for distance-based classification algorithms. As in the case of clustering and outlier detection, the first step of the process is to transform the shapes into time series. This transforms the problem to the time series classification problem. A number of methods for the classification of time series are discussed in Sect. 14.7 of Chap. 14. The main difference is that the rotation invariance of the shapes needs to be accounted for. Any of the distance-based methods proposed in Sect. 14.7 of Chap. 14 for time series classification may be used after the shapes have been transformed into time series. This is because distance-based methods can be easily made rotation-invariant by using Eq. 16.1. The two main distance-based methods for time series classification include the nearest neighbor method and the graph-based collective classification approach. While the nearest-neighbor method is straightforward, the graph-based method is discussed in some detail below.
The graph-based method is transductive because it requires the test instances to be available at the time of training. When a larger number of test instances are available along with the training data, the latter method may be used. Therefore, the different methods may be more suitable in different scenarios. The overall approach for graph-based classification may be described as follows:
Transform both the training and test shapes into time series, by using the centroid sweep method described in Sect. 16.2.1.
Use any of the distance functions described in Sect. 3.4.1 of Chap. 3 to construct a neighborhood graph on the shapes. If needed, use a rotation-invariant version of the distance function, as discussed in Eq. 16.1. Each shape represents a node, which is connected to its k-nearest neighbors with edges. The labeled shapes correspond to labeled nodes. The collective classification method described in Sect. 19.4 of Chap. 19 is used to assign labels to the unlabeled nodes (i.e., the test shapes).
In some cases, rotation invariance may not be an application-specific need. In such cases, the efficiency of distance computation is improved.
16.3 Trajectory Mining
Trajectory data arises in a wide variety of spatial applications. The proliferation of GPS-enabled devices, such as mobile phones, has enabled the large-scale collection of trajectory data. Trajectory data can be analyzed for a very wide variety of insights, such as determining co-location patterns, clusters and outliers. Trajectory data is different from the other kinds of spatial data discussed in this chapter in the following respects:
In the spatial data applications addressed so far in this chapter, spatial attributes are contextual, whereas other types of attributes (e.g., temperature in a meteorolog-ical application) are behavioral. In the case of trajectory data, spatial attributes are behavioral.
Dostları ilə paylaş: |