Data Mining: The Textbook




3.2. MULTIDIMENSIONAL DATA


Figure 3.1: Reduction in distance contrasts with increasing dimensionality and norms


Distance functions become more difficult to design when the data are high dimensional because of the varying impact of data sparsity, distribution, noise, and feature relevance. This chapter will discuss these broader principles in the context of distance function design.

3.2.1.1 Impact of Domain-Specific Relevance


In some cases, an analyst may know which features are more important than others for a particular application. For example, for a credit-scoring application, an attribute such as salary is much more relevant to the design of the distance function than an attribute such as gender, though both may have some impact. In such cases, the analyst may choose to weight the features differently if domain-specific knowledge about the relative importance of different features is available. This is often a heuristic process based on experience and skill. The generalized Lp-distance is most suitable for this case and is defined in a similar way to the Lp-norm, except that a coefficient ai is associated with the ith feature. This coefficient is used to weight the corresponding feature component in the Lp-norm:




















$$\mathrm{Dist}(X, Y) = \left( \sum_{i=1}^{d} a_i \cdot |x_i - y_i|^p \right)^{1/p} \qquad (3.2)$$

This distance is also referred to as the generalized Minkowski distance. In many cases, such domain knowledge is not available. Therefore, the Lp-norm may be used as a default option. Unfortunately, without knowledge about the most relevant features, the Lp-norm is susceptible to some undesirable effects of increasing dimensionality, as discussed subsequently.
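As a concrete illustration of Eq. 3.2, the sketch below computes the generalized Minkowski distance with analyst-supplied feature weights. This is not code from the textbook; the function name, the toy credit-scoring records, and the specific weight values are illustrative assumptions.

```python
import numpy as np

def generalized_minkowski(x, y, weights, p=2.0):
    """Generalized Lp (Minkowski) distance of Eq. 3.2.

    Each feature i contributes weights[i] * |x_i - y_i|**p to the sum,
    and the 1/p-th root of the total is returned. Setting all weights
    to 1 recovers the ordinary Lp-norm between x and y.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(weights, dtype=float)
    return float(np.sum(w * np.abs(x - y) ** p) ** (1.0 / p))

# Hypothetical credit-scoring records: [normalized salary, gender code].
x = np.array([0.75, 1.0])
y = np.array([0.60, 0.0])
w = np.array([1.0, 0.05])   # salary weighted heavily, gender only slightly
print(generalized_minkowski(x, y, w, p=2))
```

In practice, the attributes would typically be normalized (for example, to zero mean and unit variance) before such weights are applied, so that the coefficients reflect relevance rather than differences in raw scale.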


3.2.1.2 Impact of High Dimensionality


Many distance-based data mining applications lose their effectiveness as the dimensionality of the data increases. For example, a distance-based clustering algorithm may group unrelated data points because the distance function may poorly reflect the intrinsic semantic distances between data points with increasing dimensionality. As a result, distance-based models of clustering, classification, and outlier detection are often qualitatively ineffective. This phenomenon is referred to as the “curse of dimensionality,” a term first coined by Richard Bellman.




To better understand the impact of the dimensionality curse on distances, let us examine a unit cube of dimensionality d that is fully located in the nonnegative quadrant, with one corner at the origin O. What is the Manhattan distance of the corner of this cube (say, at the origin) to a randomly chosen point X inside the cube? In this case, because one endpoint is the origin, and all coordinates are nonnegative, the Manhattan distance will sum up the coordinates of X over the different dimensions. Each of these coordinates is uniformly distributed in [0, 1]. Therefore, if Yi represents the uniformly distributed random variable in [0, 1] corresponding to the ith coordinate, the Manhattan distance is given by:





$$\mathrm{Dist}(O, X) = \sum_{i=1}^{d} (Y_i - 0) \qquad (3.3)$$

The result is a random variable with a mean of μ = d/2 and a standard deviation of σ = √(d/12). For large values of d, it can be shown by the law of large numbers that the vast majority of randomly chosen points inside the cube will lie in the range [Dmin, Dmax] = [μ − 3σ, μ + 3σ]. Therefore, most of the points in the cube lie within a distance range of Dmax − Dmin = 6σ = √(3d) from the origin. Note that the expected Manhattan distance grows linearly with the dimensionality d, whereas this range grows only in proportion to √d. Therefore, the ratio of the variation in the distances to their absolute magnitude, which is referred to as Contrast(d), is given by:

$$\mathrm{Contrast}(d) = \frac{D_{max} - D_{min}}{\mu} = \sqrt{\frac{12}{d}} \qquad (3.4)$$

This ratio can be interpreted as the distance contrast between the different data points, in terms of how different the minimum and maximum distances from the origin might be. Because the contrast decreases in proportion to 1/√d, there is virtually no contrast with increasing dimensionality. Low contrasts are undesirable because they mean that the data mining algorithm will score the distances between all pairs of data points in approximately the same way and will not discriminate well between pairs of objects with varying levels of semantic relationship. The variation in contrast with increasing dimensionality is shown in Fig. 3.1a. This behavior is, in fact, observed for all Lp-norms at different values of p, though with varying severity. These differences in severity will be explored in a later section. Clearly, with increasing dimensionality, a direct use of the Lp-norm may not be effective.
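The shrinking contrast of Eq. 3.4 can be verified with a short simulation. The following sketch (an illustration, not part of the textbook) samples points uniformly from the unit cube for several values of d, measures their Manhattan distances from the origin, and compares the observed contrast with the theoretical value √(12/d); the sample size and the chosen dimensionalities are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_points = 10_000

for d in (1, 10, 100, 1_000):
    # Sample n_points uniformly from the d-dimensional unit cube.
    points = rng.random((n_points, d))
    # Manhattan (L1) distance from the origin is the sum of the coordinates.
    dists = points.sum(axis=1)
    observed = (dists.max() - dists.min()) / dists.mean()
    theoretical = np.sqrt(12.0 / d)
    # The empirical range depends on the sample size, but both quantities
    # shrink roughly as 1/sqrt(d) as the dimensionality grows.
    print(f"d={d:5d}  observed contrast={observed:.3f}  sqrt(12/d)={theoretical:.3f}")
```

The observed contrast decays toward zero as d grows, which is the behavior summarized in Fig. 3.1a.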


3.2.1.3 Impact of Locally Irrelevant Features


A more fundamental way of exploring the effects of high dimensionality is by examining the impact of irrelevant features. This is because many features are likely to be irrelevant in a typical high-dimensional data set. Consider, for example, a set of medical records, containing patients with diverse medical conditions and very extensive quantitative measurements about various aspects of an individual’s medical history. For a cluster containing diabetic patients, certain attributes such as the blood glucose level are more important for the distance computation. On the other hand, for a cluster containing epileptic patients, a different set of features will be more important. The additive effects of the natural variations in the many attribute values may be quite significant. A distance metric such as the Euclidean metric may unnecessarily contribute a high value from the more noisy components because of its square-sum approach. The key point to understand here is that the precise features that are relevant to the distance computation may sometimes be sensitive to the particular pair of objects that are being compared. This problem cannot be solved by global feature



