Data Mining: The Textbook




3.2. MULTIDIMENSIONAL DATA


Figure 3.1: Reduction in distance contrasts with increasing dimensionality and norms


Distance functions become more difficult to design when the data are high dimensional because of the varying impact of data sparsity, distribution, noise, and feature relevance. This chapter will discuss these broader principles in the context of distance function design.

3.2.1.1 Impact of Domain-Specific Relevance


In some cases, an analyst may know which features are more important than others for a particular application. For example, for a credit-scoring application, an attribute such as salary is much more relevant to the design of the distance function than an attribute such as gender, though both may have some impact. In such cases, the analyst may choose to weight the features differently if domain-specific knowledge about the relative importance of different features is available. This is often a heuristic process based on experience and skill. The generalized Lp-distance is most suitable for this case and is defined in a similar way to the Lp-norm, except that a coefficient ai is associated with the ith feature. This coefficient is used to weight the corresponding feature component in the Lp-norm:




















$$\mathrm{Dist}(X, Y) = \left( \sum_{i=1}^{d} a_i \cdot |x_i - y_i|^p \right)^{1/p} \qquad (3.2)$$

This distance is also referred to as the generalized Minkowski distance. In many cases, such domain knowledge is not available. Therefore, the Lp-norm may be used as a default option. Unfortunately, without knowledge about the most relevant features, the Lp-norm is susceptible to some undesirable effects of increasing dimensionality, as discussed subsequently.
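As a concrete illustration of Eq. 3.2, the sketch below computes the generalized Minkowski distance with analyst-supplied feature weights. This is not code from the textbook; the function name, the toy credit-scoring records, and the specific weight values are illustrative assumptions.

```python
import numpy as np

def generalized_minkowski(x, y, weights, p=2.0):
    """Generalized Lp (Minkowski) distance of Eq. 3.2.

    Each feature i contributes weights[i] * |x_i - y_i|**p to the sum,
    and the 1/p-th root of the total is returned. Setting all weights
    to 1 recovers the ordinary Lp-norm between x and y.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(weights, dtype=float)
    return float(np.sum(w * np.abs(x - y) ** p) ** (1.0 / p))

# Hypothetical credit-scoring records: [normalized salary, gender code].
x = np.array([0.75, 1.0])
y = np.array([0.60, 0.0])
w = np.array([1.0, 0.05])   # salary weighted heavily, gender only slightly
print(generalized_minkowski(x, y, w, p=2))
```

In practice, the attributes would typically be normalized (for example, to zero mean and unit variance) before such weights are applied, so that the coefficients reflect relevance rather than differences in raw scale.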


3.2.1.2 Impact of High Dimensionality


Many distance-based data mining applications lose their effectiveness as the dimensionality of the data increases. For example, a distance-based clustering algorithm may group unrelated data points because the distance function may poorly reflect the intrinsic semantic distances between data points with increasing dimensionality. As a result, distance-based models of clustering, classification, and outlier detection are often qualitatively ineffective. This phenomenon is referred to as the “curse of dimensionality,” a term first coined by Richard Bellman.




To better understand the impact of the dimensionality curse on distances, let us examine a unit cube of dimensionality d that is fully located in the nonnegative quadrant, with one corner at the origin O. What is the Manhattan distance of the corner of this cube (say, at the origin) to a randomly chosen point X inside the cube? In this case, because one endpoint is the origin, and all coordinates are nonnegative, the Manhattan distance will sum up the coordinates of X over the different dimensions. Each of these coordinates is uniformly distributed in [0, 1]. Therefore, if Yi represents the uniformly distributed random variable in [0, 1] corresponding to the ith coordinate, the Manhattan distance is given by:





$$\mathrm{Dist}(O, X) = \sum_{i=1}^{d} (Y_i - 0) \qquad (3.3)$$

The result is a random variable with a mean of μ = d/2 and a standard deviation of σ = √(d/12). For large values of d, it can be shown by the law of large numbers that the vast majority of randomly chosen points inside the cube will lie in the range [Dmin, Dmax] = [μ − 3σ, μ + 3σ]. Therefore, most of the points in the cube lie within a distance range of Dmax − Dmin = 6σ = √(3d) from the origin. Note that the expected Manhattan distance grows linearly with the dimensionality d, whereas this range grows only in proportion to √d. Therefore, the ratio of the variation in the distances to their absolute magnitude, which is referred to as Contrast(d), is given by:

$$\mathrm{Contrast}(d) = \frac{D_{max} - D_{min}}{\mu} = \sqrt{\frac{12}{d}} \qquad (3.4)$$

This ratio can be interpreted as the distance contrast between the different data points, in terms of how different the minimum and maximum distances from the origin might be. Because the contrast decreases in proportion to 1/√d, there is virtually no contrast with increasing dimensionality. Low contrasts are undesirable because they mean that the data mining algorithm will score the distances between all pairs of data points in approximately the same way and will not discriminate well between pairs of objects with varying levels of semantic relationship. The variation in contrast with increasing dimensionality is shown in Fig. 3.1a. This behavior is, in fact, observed for all Lp-norms at different values of p, though with varying severity. These differences in severity will be explored in a later section. Clearly, with increasing dimensionality, a direct use of the Lp-norm may not be effective.
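The shrinking contrast of Eq. 3.4 can be verified with a short simulation. The following sketch (an illustration, not part of the textbook) samples points uniformly from the unit cube for several values of d, measures their Manhattan distances from the origin, and compares the observed contrast with the theoretical value √(12/d); the sample size and the chosen dimensionalities are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_points = 10_000

for d in (1, 10, 100, 1_000):
    # Sample n_points uniformly from the d-dimensional unit cube.
    points = rng.random((n_points, d))
    # Manhattan (L1) distance from the origin is the sum of the coordinates.
    dists = points.sum(axis=1)
    observed = (dists.max() - dists.min()) / dists.mean()
    theoretical = np.sqrt(12.0 / d)
    # The empirical range depends on the sample size, but both quantities
    # shrink roughly as 1/sqrt(d) as the dimensionality grows.
    print(f"d={d:5d}  observed contrast={observed:.3f}  sqrt(12/d)={theoretical:.3f}")
```

The observed contrast decays toward zero as d grows, which is the behavior summarized in Fig. 3.1a.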


3.2.1.3 Impact of Locally Irrelevant Features


A more fundamental way of exploring the effects of high dimensionality is by examining the impact of irrelevant features. This is because many features are likely to be irrelevant in a typical high-dimensional data set. Consider, for example, a set of medical records, containing patients with diverse medical conditions and very extensive quantitative measurements about various aspects of an individual’s medical history. For a cluster containing diabetic patients, certain attributes such as the blood glucose level are more important for the distance computation. On the other hand, for a cluster containing epileptic patients, a different set of features will be more important. The additive effects of the natural variations in the many attribute values may be quite significant. A distance metric such as the Euclidean metric may unnecessarily contribute a high value from the more noisy components because of its square-sum approach. The key point to understand here is that the precise features that are relevant to the distance computation may sometimes be sensitive to the particular pair of objects that are being compared. This problem cannot be solved by global feature



