Data Mining: The Textbook

Date: 07.01.2024

[Figure 16.5 panels. Base data: sea surface temperatures along a 4 × 4 spatial grid, with rows 75 76 75 72; 77 73 73 74; 72 71 78 80; 74 75 79 76. Global temperature average = 75, coefficient = 75. Cut along X axis: average temperature difference between left and right blocks = 7/4, coefficient = 7/8. Cut along Y axis: average temperature difference between top and bottom blocks = 9/4, coefficient = 9/8, and = 19/4, coefficient = 19/8. Binary matrices represent the 2-dimensional basis matrices.]

Figure 16.5: Illustration of the top levels of the wavelet decomposition for spatial data in a grid containing sea surface temperatures (Fig. 2.7 of Chap. 2 revisited)

be addressed by performing the decomposition separately for each behavioral attribute, and creating a separate set of dimensions for each behavioral attribute.

Like the time series wavelet, the spatial wavelet is a multiresolution representation. Trends at different levels of spatial granularity are represented in the coefficients. Higher-level coefficients represent trends in larger spatial areas, whereas lower-level coefficients represent trends in smaller spatial areas. Therefore, this approach is very powerful, and has broad usability for many spatial applications. Spatial wavelets can be used effectively for many image clustering and classification applications where (contextual) spatial data can be converted to (noncontextual) multidimensional data. Once the transformation has been performed, virtually all the multidimensional methods discussed in Chaps. 4 to 11 can be used on this representation. Such an approach opens the door to the use of a wide array of multidimensional data mining methods.
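As a rough illustration of the top levels shown in Fig. 16.5, the following Python sketch recomputes the global average and the first three difference coefficients for the 4 × 4 grid of sea surface temperatures. The function names are hypothetical, and the convention that a coefficient is half the absolute difference between two block averages is an assumption read off the values printed in the figure. It is not the full recursive decomposition.

```python
from fractions import Fraction

# Base data from Fig. 16.5: sea surface temperatures on a 4 x 4 spatial grid.
GRID = [
    [75, 76, 75, 72],
    [77, 73, 73, 74],
    [72, 71, 78, 80],
    [74, 75, 79, 76],
]


def block_average(grid, rows, cols):
    """Exact average of the cells in the given row and column ranges."""
    cells = [grid[r][c] for r in rows for c in cols]
    return Fraction(sum(cells), len(cells))


def top_level_coefficients(grid):
    """Global average plus the coefficients of the first X cut and Y cuts.

    Assumed convention: coefficient = |average(A) - average(B)| / 2.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    rows, cols = range(n_rows), range(n_cols)
    left, right = range(n_cols // 2), range(n_cols // 2, n_cols)
    top, bottom = range(n_rows // 2), range(n_rows // 2, n_rows)

    def coefficient(block_a, block_b):
        diff = block_average(grid, *block_a) - block_average(grid, *block_b)
        return abs(diff) / 2

    return {
        "global": block_average(grid, rows, cols),
        # Cut along the X axis: left half versus right half.
        "x_cut": coefficient((rows, left), (rows, right)),
        # Cut along the Y axis inside each half: top versus bottom.
        "y_cut_left": coefficient((top, left), (bottom, left)),
        "y_cut_right": coefficient((top, right), (bottom, right)),
    }


coefficients = top_level_coefficients(GRID)
for name, value in coefficients.items():
    print(name, value)
```

Under these assumptions the computed values agree with the figure: 75 for the global average, 7/8 for the cut along the X axis, and 9/8 and 19/8 for the two cuts along the Y axis.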

16.2.3 Spatial Colocation Patterns

In this problem, the contextual attributes are spatial and the behavioral attributes are typically boolean and nonspatial. Non-boolean behavioral attributes can be addressed with the use of type conversion via discretization or binarization. The goal of spatial colocation pattern mining is to discover combinations of features occurring at the same spatial location. Consider an ecology data set, where one has behavioral attributes such as fire ignition source, needle vegetation type, and a drought indicator. The spatial colocation of these features may often be a risk factor for forest fires. Therefore, the discovery of such patterns is useful in the context of data mining analysis. In many cases, a spatial event indicator of interest (e.g., disease outbreak, vegetation event, or climate event) is added to the other behavioral attributes. The discovery of useful patterns that include this indicator of interest can be used for discovering event causality. This problem is also closely related to rule-based spatial classification, where the likelihood of the event occurring in previously unseen test regions can be estimated from the resulting patterns.

16.2. MINING WITH CONTEXTUAL SPATIAL ATTRIBUTES

One challenge in the mining process is that the different behavioral attributes may be derived from different data sources, and therefore may not have precisely the same value of the contextual (spatial) attribute in their measurements. Therefore, proper data preprocessing is crucial. The data can be homogenized by partitioning the spatial region into smaller regions. For each of these regions, each behavioral attribute's value is derived heuristically from the values in the original data source. For example, if the boolean attribute has a value of 1 more than a predefined fraction of the time in a spatial region, then its value is set to 1 for that region. The contextual (spatial) attribute can be set to the centroid of that region. The mining can be performed on this preprocessed data. The overall approach is as follows:

1. Preprocess the data to create the behavioral attribute values at the same set of spatial locations.

2. For each spatial location, create a transaction containing the corresponding combination of boolean values.

3. Use any frequent pattern mining algorithm to discover the relevant patterns in these transactions.

4. For each discovered pattern, map it back to the spatial regions containing the pattern. Cluster the relevant spatial regions for each pattern, if necessary for summarization.
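The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the input format (tuples of x, y, feature name, 0/1 value), the square grid cells, the parameter names cell_size, min_fraction and min_support, and all function names are hypothetical. A brute-force itemset enumeration stands in for a real frequent pattern mining algorithm, and the optional clustering of regions in the last step is omitted.

```python
from collections import defaultdict
from itertools import combinations


def cell_centroid(cell, cell_size):
    """Centroid of a grid cell, used as its contextual (spatial) attribute."""
    return ((cell[0] + 0.5) * cell_size, (cell[1] + 0.5) * cell_size)


def build_transactions(observations, cell_size, min_fraction=0.5):
    """Steps 1 and 2: homogenize onto a grid, one transaction per cell.

    observations: iterable of (x, y, feature, value) with value in {0, 1}.
    A feature is set to 1 in a cell when it is 1 in more than min_fraction
    of that cell's measurements of the feature (a heuristic choice).
    """
    counts = defaultdict(lambda: [0, 0])  # (cell, feature) -> [ones, total]
    for x, y, feature, value in observations:
        cell = (int(x // cell_size), int(y // cell_size))
        counts[(cell, feature)][0] += value
        counts[(cell, feature)][1] += 1

    transactions = defaultdict(set)
    for (cell, feature), (ones, total) in counts.items():
        items = transactions[cell]  # make sure every observed cell appears
        if ones / total > min_fraction:
            items.add(feature)
    return dict(transactions)


def frequent_patterns(transactions, min_support):
    """Steps 3 and 4: brute-force frequent itemsets mapped back to cells.

    Returns {pattern (frozenset): sorted list of cells containing it}.
    Exponential in the number of features, so only suitable for a demo.
    """
    features = sorted(set().union(*transactions.values()))
    result = {}
    for size in range(1, len(features) + 1):
        for combo in combinations(features, size):
            cells = sorted(
                cell for cell, items in transactions.items()
                if set(combo) <= items
            )
            if len(cells) >= min_support:
                result[frozenset(combo)] = cells
    return result
```

For a real data set, the brute-force search would be replaced by any standard frequent pattern mining algorithm. The mapping from each pattern back to its cells is what allows the regions to be clustered afterwards.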

In cases where a particular behavioral attribute is an event of interest (e.g., disease outbreak), the transactions containing values of 0 and 1, respectively, for this attribute can be separately processed to discover two sets of patterns on the other behavioral attributes. The differences between these two sets of patterns can provide insights into discriminative factors for the event of interest at each spatial location. Such patterns are also useful for spatial classification of previously unseen test regions. This approach is identical to that of associative classifiers in Chap. 10.
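A minimal Python sketch of this comparison is given below. It assumes that each transaction is a set of the attribute names whose value is 1, and it again uses a brute-force itemset search as a stand-in for a proper frequent pattern miner. The function names and the min_fraction support threshold are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_fraction):
    """Itemsets contained in at least min_fraction of the transactions."""
    if not transactions:
        return set()
    items = sorted(set().union(*transactions))
    found = set()
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            hits = sum(1 for t in transactions if set(combo) <= t)
            if hits / len(transactions) >= min_fraction:
                found.add(frozenset(combo))
    return found


def discriminative_patterns(transactions, event, min_fraction):
    """Mine both groups separately and return the two set differences.

    Returns (patterns frequent only when the event is 1,
             patterns frequent only when the event is 0).
    """
    with_event = [t - {event} for t in transactions if event in t]
    without_event = [t for t in transactions if event not in t]
    positive = frequent_itemsets(with_event, min_fraction)
    negative = frequent_itemsets(without_event, min_fraction)
    return positive - negative, negative - positive
```

Patterns in the first returned set are candidates for discriminative factors of the event. Whether a simple set difference is sufficient, or a comparison of support values is needed, depends on the application.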

This model can also address time-changing data in a seamless way. In such cases, the time becomes another contextual attribute in addition to the spatial attributes. Patterns can be discovered at different temporal snapshots using the aforementioned methodology. The key changes in these patterns over time can provide insights into the nature of the spatial evolution.

16.2.4 Clustering Shapes

In many applications, it may be desirable to cluster similar shapes prior to analysis. It is assumed that a database of N shapes is available and that a total of k groups of similar shapes need to be created. This can be a useful preprocessing task in many shape categorization applications. The conversion of a shape to a time series (Sect. 16.2.1) is the appropriate approach in this scenario. Many of the time series clustering algorithms discussed in Sect. 14.5 of Chap. 14 may be used effectively, once the shape has been converted to a time series. The k-medoids, hierarchical, and graph-based methods are particularly suitable because they require only the design of an appropriate similarity function for the corresponding time series. This is an issue that will be discussed in more detail later. The main steps of shape-based clustering are as follows:

540 CHAPTER 16. MINING SPATIAL DATA

1. Use the centroid-based sweep method discussed in Sect. 16.2.1 to convert each shape into a time series. This results in a database of N different time series.

2. Apply any time series clustering algorithm, such as a hierarchical, k-medoids, or graph-based method, to the time series data, as discussed in Sect. 14.5 of Chap. 14. This will cluster the N time series into k groups.

3. Map the k groups of time series clusters to k groups of shape clusters, by mapping each time series to its corresponding shape.
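The three steps can be sketched in Python as follows. Several simplifications are assumed and should be treated as such: each shape is given as a list of contour points already ordered in sweep direction and of equal length, the centroid is approximated by the mean of those points, the Euclidean distance is used without any rotation handling, and a basic alternating k-medoids with a deterministic initialization stands in for the clustering algorithms of Sect. 14.5. All function names are hypothetical.

```python
import math


def shape_to_series(contour):
    """Step 1 (simplified): centroid-to-contour distances in sweep order."""
    cx = sum(x for x, _ in contour) / len(contour)
    cy = sum(y for _, y in contour) / len(contour)
    return [math.hypot(x - cx, y - cy) for x, y in contour]


def euclidean(s, t):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, t)))


def k_medoids(series, k, dist=euclidean, max_iter=100):
    """Step 2: basic alternating k-medoids; returns (medoids, labels).

    The first k series are used as initial medoids, purely for determinism.
    """
    n = len(series)
    d = [[dist(series[i], series[j]) for j in range(n)] for i in range(n)]

    def assign(meds):
        # Label each series with the index of its nearest medoid.
        return [min(range(k), key=lambda c: d[i][meds[c]]) for i in range(n)]

    medoids = list(range(k))
    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            # Fall back to the old medoid if ties left the cluster empty.
            members = [i for i in range(n) if labels[i] == c] or [medoids[c]]
            # New medoid: the member with the smallest total distance.
            best = min(members, key=lambda m: sum(d[m][i] for i in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)


def cluster_shapes(contours, k, dist=euclidean):
    """Steps 1 to 3: label i of the result is the cluster of contours[i]."""
    series = [shape_to_series(c) for c in contours]
    _, labels = k_medoids(series, k, dist)
    return labels
```

Because the time series are kept in the same order as the shapes, the mapping back in the third step is just the shared index. Only the dist argument needs to change to use a different time series distance.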

The aforementioned clustering algorithm depends only on the choice of the distance function. Any of the time series measures discussed in Sect. 3.4.1 of Chap. 3 may be used, depending on the desired degree of error tolerance or distortion (warping) allowed in the matching. Another important issue is the adjustment of the distance function with the varying rotations of the different shapes. In the following, the Euclidean distance will be used as an example, although the general principle can be applied to any distance function.

It is evident from the example of Fig. 16.4 that a rotation of the shape leads to a linear cyclic shifting of the time series generated by using the distances of the centroid of the shape to the contours of the shape. For a time series of length n denoted by a₁a₂ . . . aₙ, a cyclic translation by i units leads to the time series aᵢ₊₁aᵢ₊₂ . . . aₙa₁a₂ . . . aᵢ. Then, the rotation-invariant Euclidean distance RIDist(T₁, T₂) between two time series T₁ = a₁ . . . aₙ and T₂ = b₁ . . . bₙ is given by the minimum distance between T₁ and all possible rotational translations of T₂ (or vice versa). Therefore, the rotation-invariant distance is expressed as follows:

\[ RIDist(T_1, T_2) = \min_{i=1}^{n} \sum_{j=1}^{n} \left( a_j - b_{1 + (j+i) \bmod n} \right)^2. \]

In general, if a cyclic shift of the time series T₂ by i units is denoted by T₂ⁱ, then the rotation-invariant distance, using any distance function Dist(T₁, T₂), may be expressed as follows:

\[ RIDist(T_1, T_2) = \min_{i=1}^{n} Dist(T_1, T_2^{i}). \tag{16.1} \]
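Equation 16.1 translates almost directly into code. The Python sketch below assumes that each time series is a list of numbers of equal length and uses the sum of squared differences as the default distance, matching the Euclidean example in the text. The function names are hypothetical.

```python
def squared_euclidean(s, t):
    """Sum of squared differences between two equal-length series."""
    return sum((a - b) ** 2 for a, b in zip(s, t))


def cyclic_shift(series, i):
    """Cyclic translation by i units: a_{i+1} ... a_n a_1 ... a_i."""
    i %= len(series)
    return series[i:] + series[:i]


def ri_dist(t1, t2, dist=squared_euclidean):
    """Rotation-invariant distance: minimum of dist over all n shifts of t2."""
    if len(t1) != len(t2):
        raise ValueError("both series must have the same length")
    n = len(t2)
    return min(dist(t1, cyclic_shift(t2, i)) for i in range(1, n + 1))
```

For example, ri_dist([1, 2, 3, 4], [3, 4, 1, 2]) is 0 because the second series is a cyclic shift of the first, whereas the unshifted squared distance is 16. Each call evaluates the underlying distance n times, so the cost is n times that of a single comparison.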

Note that the reversal of a time series corresponds to the mirror image of the underlying shape. Therefore, mirror images can also be addressed using this approach, by incorporating the reversals of the series (and its rotations) in the distance function. This will increase the computation by a factor of 2. The precise choice of distance function used is highly application-specific, depending on whether rotations or mirror image conversions are required.
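One way to incorporate reversals is sketched below in Python, under the assumption that both series are equal-length lists. The minimum is taken over all cyclic shifts of the second series and of its reversal, so the underlying distance is evaluated 2n times instead of n times. The function names are hypothetical.

```python
def squared_euclidean(s, t):
    """Sum of squared differences between two equal-length series."""
    return sum((a - b) ** 2 for a, b in zip(s, t))


def rotation_mirror_invariant_dist(t1, t2, dist=squared_euclidean):
    """Minimum of dist over all cyclic shifts of t2 and of reversed t2."""
    if len(t1) != len(t2):
        raise ValueError("both series must have the same length")
    n = len(t2)
    candidates = (t2, t2[::-1])  # the series and its mirror image
    return min(
        dist(t1, c[i:] + c[:i])  # cyclic shift of candidate c by i units
        for c in candidates
        for i in range(n)
    )
```

With this variant, [1, 2, 3, 4] and [4, 3, 2, 1] are at distance 0, because the second series is the reversal of the first.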

16.2.5 Outlier Detection

In the context of spatial data, outliers can be either point outliers or shape outliers. These two kinds of outliers are also encountered in time series data, and in discrete sequences. In the case of spatial data, these two kinds of outliers are defined as follows:

Point outliers: These outliers are defined on a single spatial object with a variety of spatial and behavioral attributes. For example, a weather map is a spatial object that contains both spatial locations, and environmental measurements (behavioral values) at these locations. Abrupt changes in the behavioral attributes that violate spatial continuity provide useful information about the underlying contextual anomalies. For example, consider a meteorological application in which sea surface temperatures and pressure are measured. Unusually high sea surface temperature in a very small localized region is a hot-spot that may be the result of volcanic activity under the surface. Similarly, unusually low or high pressure in a small localized region may suggest the formation of hurricanes or cyclones. In all these cases, spatial continuity is violated by the attribute of interest. Such attributes are often tracked in meteorological applications on a daily basis. An example of a point outlier for spatial data is illustrated in Fig. 16.6.

[Figure 16.6 plot: a behavioral attribute value plotted over the SPATIAL X and SPATIAL Y axes, with one point marked as the outlier.]

Figure 16.6: Example of point outlier for spatial data
