Figure 3.3: Global data distributions impact distance computations
similarity PSelect(X, Y, k_d) is defined as follows:

$$PSelect(X, Y, k_d) = \left[\; \sum_{i \in S(X, Y, k_d)} \left( 1 - \frac{|x_i - y_i|}{m_i - n_i} \right)^{p} \;\right]^{1/p}. \qquad (3.5)$$
The value of the aforementioned expression will vary between 0 and |S(X, Y, k_d)| because each individual expression in the summation lies between 0 and 1. This is a similarity function because larger values imply greater similarity.
The aforementioned similarity function guarantees a nonzero similarity component only for dimensions mapping to the same bucket. The use of equidepth partitions ensures that the probability of two records sharing a bucket for a particular dimension is given by 1/k_d. Thus, on average, the aforementioned summation is likely to have d/k_d nonzero components. For more similar records, the number of such components will be greater, and each individual component is also likely to contribute more to the similarity value. The degree of dissimilarity on the distant dimensions is ignored by this approach because it is often dominated by noise. It has been shown theoretically [7] that picking k_d ∝ d achieves a constant level of contrast in high-dimensional space for certain data distributions. High values of k_d result in more stringent quality bounds for each dimension. These results suggest that in high-dimensional space, it is better to aim for higher quality bounds for each dimension, so that a smaller percentage (not number) of retained dimensions is used in similarity computation. An interesting aspect of this distance function is the nature of its sensitivity to data dimensionality. The choice of k_d with respect to d ensures that for low-dimensional applications it bears some resemblance to the L_p-norm by using most of the dimensions, whereas for high-dimensional applications it behaves like text-domain similarity functions by using similarity only on matching attributes. The distance function has also been shown to be more effective for a prototypical nearest-neighbor classification application.
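To make the bucket-matching mechanics concrete, the following is a minimal sketch of this similarity, assuming p = 1 (so the outer 1/p exponent is trivial) and equidepth buckets derived from the data via quantiles; the function names and the NumPy-based implementation are illustrative rather than taken from the text.

```python
import numpy as np

def equidepth_edges(data, kd):
    """Equidepth (equal-frequency) bucket edges for each dimension.

    Returns an array of shape (d, kd + 1) whose rows are the quantile
    boundaries that split each dimension into kd buckets of roughly
    equal population.
    """
    qs = np.linspace(0.0, 1.0, kd + 1)
    return np.quantile(data, qs, axis=0).T          # shape (d, kd + 1)

def pselect(x, y, edges):
    """Match-based similarity in the spirit of Eq. 3.5, with p = 1.

    Only dimensions on which x and y fall into the same equidepth bucket
    contribute; each contribution is 1 - |x_i - y_i| / (m_i - n_i),
    where [n_i, m_i] is the range of the shared bucket on dimension i.
    """
    sim = 0.0
    for i, (xi, yi) in enumerate(zip(x, y)):
        e = edges[i]
        # Bucket index of each value, clipped into the valid range of buckets.
        bx = int(np.clip(np.searchsorted(e, xi, side="right") - 1, 0, len(e) - 2))
        by = int(np.clip(np.searchsorted(e, yi, side="right") - 1, 0, len(e) - 2))
        if bx == by:                                # same bucket => proximity counts
            ni, mi = e[bx], e[bx + 1]
            if mi > ni:                             # guard against degenerate buckets
                sim += 1.0 - abs(xi - yi) / (mi - ni)
    return sim

# Toy usage: similarity between the first two records of a random data set.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 20))
edges = equidepth_edges(data, kd=5)
print(pselect(data[0], data[1], edges))
```

With p = 1, the returned value is simply the sum of the per-dimension proximities over the dimensions on which the two records share a bucket, and it therefore lies between 0 and |S(X, Y, k_d)| as noted above.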
3.2.1.6 Impact of Data Distribution
The L_p-norm depends only on the two data points in its argument and is invariant to the global statistics of the remaining data points. Should distances depend on the underlying data distribution of the remaining points in the data set? The answer is yes. To illustrate this point, consider the distribution illustrated in Fig. 3.3, which is centered at the origin.
Figure 3.4: Impact of nonlinear distributions on distance computations
In addition, two data points A = (1, 2) and B = (1, −2) are marked in the figure. Clearly, A and B are equidistant from the origin according to any L_p-norm. However, a question arises as to whether A and B should truly be considered equidistant from the origin O. This is because the straight line from O to A is aligned with a high-variance direction in the data, and it is statistically more likely for data points to be far away from the origin in this direction. On the other hand, many segments of the path from O to B are sparsely populated, and the corresponding direction is a low-variance direction. Statistically, it is much less likely for B to be so far away from O along this direction. Therefore, the distance from O to A ought to be less than that from O to B.
The Mahalanobis distance is based on this general principle. Let Σ be the d × d covariance matrix of the data set. In this case, the (i, j)th entry of the covariance matrix is equal to the covariance between dimensions i and j. Then, the Mahalanobis distance Maha(X, Y) between two d-dimensional data points X and Y is defined as follows:
$$Maha(X, Y) = \sqrt{(X - Y)\,\Sigma^{-1}\,(X - Y)^{T}}.$$
A different way of understanding the Mahalanobis distance is in terms of principal compo-nent analysis (PCA). The Mahalanobis distance is similar to the Euclidean distance, except that it normalizes the data on the basis of the interattribute correlations. For example, if the axis system were to be rotated to the principal directions of the data (shown in Fig. 3.3), then the data would have no (second order) interattribute correlations. The Mahalanobis distance is equivalent to the Euclidean distance in such a transformed (axes-rotated) data set after dividing each of the transformed coordinate values by the standard deviation of the data along that direction. As a result, the data point B will have a larger distance from the origin than data point A in Fig. 3.3.
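A small numeric sketch of this computation is shown below, using the two points A = (1, 2) and B = (1, −2) from Fig. 3.3. The covariance matrix here is synthetic and only assumed to mimic the figure's high-variance direction along A; it is not data from the text.

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance sqrt((x - y) Sigma^{-1} (x - y)^T)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Hypothetical covariance: large variance along the direction of A = (1, 2),
# small variance along the orthogonal direction, roughly as in Fig. 3.3.
v = np.array([1.0, 2.0]) / np.sqrt(5.0)           # high-variance direction
w = np.array([2.0, -1.0]) / np.sqrt(5.0)          # orthogonal low-variance direction
cov = 9.0 * np.outer(v, v) + 1.0 * np.outer(w, w)

origin = np.zeros(2)
A, B = np.array([1.0, 2.0]), np.array([1.0, -2.0])

print(np.linalg.norm(A), np.linalg.norm(B))       # equal Euclidean distances (sqrt(5))
print(mahalanobis(A, origin, cov))                 # smaller: A lies along the high-variance direction
print(mahalanobis(B, origin, cov))                 # larger: B lies mostly along the low-variance direction
```

Under this assumed covariance, both points have the same Euclidean norm, but the Mahalanobis distance from the origin is smaller for A than for B, matching the argument above.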
3.2.1.7 Nonlinear Distributions: ISOMAP
We now examine the case in which the data contain nonlinear distributions of arbitrary shape. For example, consider the global distribution illustrated in Fig. 3.4. Among the three data points A, B, and C, which pair is the closest to one another? At first sight, it would seem that data points A and B are the closest on the basis of Euclidean distance. However, the global data distribution tells us otherwise. One way of understanding distances is as the shortest length of the path from one data point to another, when using only point-to-point jumps from each data point to one of its k-nearest neighbors based on a standard metric such as the Euclidean measure.
Figure 3.5: Impact of ISOMAP embedding on distances
The intuitive rationale for this is that only short point-to-point jumps can accurately measure minor changes in the generative process for that point. Therefore, the overall sum of the point-to-point jumps reflects the aggregate change (distance) from one point to another (distant) point more accurately than a straight-line distance between the points. Such distances are referred to as geodesic distances. In the case of Fig. 3.4, the only way to walk from A to B with short point-to-point jumps is to walk along the entire elliptical shape of the data distribution while passing C along the way. Therefore, A and B are actually the farthest pair of data points (from A, B, and C) on this basis! The implicit assumption is that nonlinear distributions are locally Euclidean but are globally far from Euclidean.
Such distances can be computed by using an approach that is derived from a nonlinear dimensionality reduction and embedding method known as ISOMAP. The approach consists of two steps:
1. Compute the k-nearest neighbors of each point. Construct a weighted graph G with nodes representing data points, and edge weights (costs) equal to the distances between each point and its k-nearest neighbors.
2. For any pair of points X and Y, report Dist(X, Y) as the shortest-path distance between the corresponding nodes in G.
These two steps are already able to compute the distances without explicitly performing dimensionality reduction. However, an additional step of embedding the data into a multidimensional space makes repeated distance computations between many pairs of points much faster, while losing some accuracy. Such an embedding also allows the use of algorithms that work naturally on numeric multidimensional data with predefined distance metrics.
This is achieved by using the all-pairs shortest-path problem to construct the full set of distances between all pairs of nodes in G. Subsequently, multidimensional scaling (MDS) (cf. Sect. 2.4.4.2 of Chap. 2) is applied to embed the data into a lower-dimensional space. The overall effect of the approach is to “straighten out” the nonlinear shape of Fig. 3.4 and embed it into a space where the data are aligned along a flat strip. In fact, a 1-dimensional representation can approximate the data after this transformation. Furthermore, in this new space, a distance function such as the Euclidean metric will work very well as long as metric MDS was used in the final phase. A 3-dimensional example is illustrated in Fig. 3.5a, in which the data are arranged along a spiral. In this figure, data points A and C seem much closer to each other than data point B.
Figure 3.6: Impact of local distributions on distance computations
However, in the ISOMAP embedding of Fig. 3.5b, data point B is much closer to each of A and C. This example shows the drastic effect of data distributions on distance computation.
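The two-step construction above, plus the optional MDS embedding, can be sketched with standard scientific-Python tools. This is only an illustrative sketch: the choice of k, the toy data, and the use of SciPy/scikit-learn routines are assumptions, not prescriptions from the text (scikit-learn also ships a packaged `Isomap` class that bundles these steps).

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path
from sklearn.manifold import MDS

def geodesic_distances(X, k=10):
    """Steps 1-2: k-NN graph with Euclidean edge weights, then all-pairs shortest paths."""
    knn = kneighbors_graph(X, n_neighbors=k, mode="distance")   # sparse weighted graph G
    return shortest_path(knn, method="D", directed=False)       # geodesic distance matrix

def isomap_embedding(X, k=10, n_components=2):
    """Optional final step: embed the geodesic distances with metric MDS."""
    D = geodesic_distances(X, k)
    mds = MDS(n_components=n_components, dissimilarity="precomputed")
    return mds.fit_transform(D)

# Toy usage on a noisy 2-D arc of points (a simple nonlinear shape).
rng = np.random.default_rng(0)
t = rng.uniform(0, 3 * np.pi / 2, size=300)
X = np.c_[np.cos(t), np.sin(t)] + 0.02 * rng.normal(size=(300, 2))
D = geodesic_distances(X, k=10)                  # pairwise geodesic distances
emb = isomap_embedding(X, k=10, n_components=1)  # "straightened out" 1-D representation
```

Note that the k-nearest neighbor graph must be connected for all geodesic distances to be finite; in practice, k is chosen large enough to ensure this.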
In general, high-dimensional data are aligned along nonlinear low-dimensional shapes, which are also referred to as manifolds. These manifolds can be “flattened out” to a new representation where metric distances can be used effectively. Thus, this is a data transformation method that facilitates the use of standard metrics. The major computational challenge is in performing the dimensionality reduction. However, after the one-time preprocessing cost has been paid for, repeated distance computations can be implemented efficiently.
Nonlinear embeddings can also be achieved with extensions of PCA. PCA can be extended to discover nonlinear embeddings with the use of a method known as the kernel trick. Refer to Sect. 10.6.4.1 of Chap. 10 for a brief description of kernel PCA.
3.2.1.8 Impact of Local Data Distribution
The discussion so far addresses the impact of global distributions on the distance computations. However, the distribution of the data varies significantly with locality. This variation may be of two types. For example, the absolute density of the data may vary significantly with data locality, or the shape of clusters may vary with locality. The first type of variation is illustrated in Fig. 3.6a, which has two clusters containing the same number of points, but one of them is denser than the other. Even though the absolute distance between (A, B) is identical to that between (C, D), the distance between C and D should be considered greater on the basis of the local data distribution. In other words, C and D are much farther away in the context of what their local distributions look like. This problem is often encountered in many distance-based methods such as outlier detection. It has been shown that methods that adjust for the local variations in the distances typically perform much better than those that do not adjust for local variations. One of the most well-known methods for outlier detection, known as Local Outlier Factor (LOF), is based on this principle.
A second example is illustrated in Fig. 3.6b, which illustrates the impact of varying local orientation of the clusters. Here, the distance between (A, B) is identical to that between (C, D) using the Euclidean metric. However, the local clusters in each region show very different orientation. The high-variance axis of the cluster of data points relevant to (A, B)
is aligned along the path from A to B. This is not true for (C, D). As a result, the intrinsic distance between C and D is much greater than that between A and B. For example, if the local Mahalanobis distance is computed using the relevant cluster covariance statistics, then the distance between C and D will evaluate to a larger value than that between A and B.
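As a rough sketch of this local adjustment (assuming cluster memberships are already available from some clustering step; the function name and interface are hypothetical):

```python
import numpy as np

def local_mahalanobis(x, y, cluster_points):
    """Mahalanobis distance between x and y using the covariance of their local cluster.

    `cluster_points` is the set of points in the region (cluster) considered
    relevant to the pair (x, y), so the same Euclidean gap can evaluate
    differently in clusters with different orientations, as in Fig. 3.6b.
    """
    cov = np.cov(np.asarray(cluster_points), rowvar=False)
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
```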
Shared Nearest-Neighbor Similarity: The first problem can be at least partially alleviated with the use of a shared nearest-neighbor similarity. In this approach, the k-nearest neighbors of each data point are computed in a preprocessing phase. The shared nearest-neighbor similarity is equal to the number of common neighbors between the two data points. This metric is locally sensitive because it depends on the number of common neighbors, and not on the absolute values of the distances. In dense regions, the k-nearest neighbor distances will be small, and therefore data points need to be closer together to have a larger number of shared nearest neighbors. Shared nearest-neighbor methods can be used to define a similarity graph on the underlying data points in which pairs of data points with at least one shared neighbor have an edge between them. Similarity graph-based methods are almost always locality sensitive because of their local focus on the k-nearest neighbor distribution.
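A minimal sketch of the shared nearest-neighbor similarity follows; the brute-force Euclidean neighbor computation and the function names are illustrative choices, not part of the text.

```python
import numpy as np

def knn_sets(X, k):
    """Index sets of the k-nearest neighbors of every point (brute force, Euclidean)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)                  # exclude each point from its own neighbor list
    return [set(np.argsort(row)[:k]) for row in D]

def snn_similarity(i, j, neighbors):
    """Shared nearest-neighbor similarity: number of common k-nearest neighbors."""
    return len(neighbors[i] & neighbors[j])

# Toy usage: the similarity depends only on neighbor overlap, not raw distances.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
nbrs = knn_sets(X, k=10)
print(snn_similarity(0, 1, nbrs))
```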
Generic Methods: In generic local distance computation methods, the idea is to divide the space into a set of local regions. The distances are then adjusted in each region using the local statistics of this region. Therefore, the broad approach is as follows: