Data Mining: The Textbook



3.2. MULTIDIMENSIONAL DATA

Figure 3.2: Impact of p on contrast


subset selection during preprocessing, because the relevance of features is locally determined by the pair of objects that are being considered. Globally, all features may be relevant.
When many features are irrelevant, the additive noise effects of the irrelevant features can sometimes be reflected in the concentration of the distances. In any case, such irrelevant features will almost always result in errors in distance computation. Because high-dimensional data sets are often likely to contain diverse features, many of which are irrelevant, the additive effect with the use of a sum-of-squares approach, such as the L2-norm, can be very detrimental.
3.2.1.4 Impact of Different Lp-Norms

Different Lp-norms do not behave in the same way, either in terms of the impact of irrelevant features or in terms of distance contrast. Consider the extreme case when p = ∞. This amounts to using only the dimension on which the two objects are most dissimilar. Very often, this dimension merely reflects the natural variation in an irrelevant attribute, which is of little use to a similarity-based application. In fact, in a 1000-dimensional application, if two objects have similar values on 999 attributes, they should be considered very similar. However, a single irrelevant attribute on which the two objects are very different will throw off the distance value in the case of the L∞ metric. In other words, the local similarity properties of the data are de-emphasized by L∞. Clearly, this is not desirable.
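As a quick illustration of this effect, the following sketch (hypothetical values, assuming NumPy) compares the L1, L2, and L∞ distances between two 1000-dimensional points that nearly agree on 999 attributes but differ sharply on a single irrelevant one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two objects that nearly agree on all 1000 attributes ...
x = rng.normal(size=1000)
y = x + rng.normal(scale=0.01, size=1000)   # tiny noise on every attribute
y[0] += 5.0                                  # ... except one irrelevant attribute

for p, name in [(1, "L1"), (2, "L2"), (np.inf, "Linf")]:
    print(name, round(float(np.linalg.norm(x - y, ord=p)), 3))

# The Linf distance is essentially the single large difference (about 5.0),
# ignoring the near-perfect match on the other 999 attributes.
```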


This behavior is generally true for larger values of p, which increasingly emphasize the irrelevant attributes. In fact, it can also be shown that distance contrasts are poorer for larger values of p for certain data distributions. In Fig. 3.1b, the distance contrasts are illustrated for different values of p in the Lp-norm over different dimensionalities. The figure is constructed using the same approach as Fig. 3.1a. While all Lp-norms degrade with increasing dimensionality, the degradation is much faster for the plots representing larger values of p. This trend can be understood better from Fig. 3.2, where the value of p is used on the X-axis. In Fig. 3.2a, the contrast is illustrated for different values of p over data of different dimensionalities. Figure 3.2b is derived from Fig. 3.2a, except that the results show the fraction of the Manhattan performance achieved by higher-order norms. It is evident that the rate of degradation with increasing p is higher when the dimensionality of the data is large. For 2-dimensional data, there is very little degradation. This is why the choice of p matters less in lower-dimensional applications.
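The following sketch (a rough simulation, not the book's exact experiment) illustrates the same trend: for uniformly distributed points, a simple contrast measure such as (Dmax − Dmin)/Dmean shrinks with dimensionality, and shrinks faster for larger p.

```python
import numpy as np

def contrast(dim, p, n=1000, seed=0):
    """Contrast (Dmax - Dmin) / Dmean of the Lp distances of n uniform points from the origin."""
    rng = np.random.default_rng(seed)
    pts = rng.uniform(size=(n, dim))
    d = np.linalg.norm(pts, ord=p, axis=1)
    return float((d.max() - d.min()) / d.mean())

for dim in (2, 20, 200):
    print(dim, {p: round(contrast(dim, p), 3) for p in (1, 2, 10)})

# The contrast decreases as the dimensionality grows, and the decrease is
# steeper for the larger norm (p = 10) than for the Manhattan norm (p = 1).
```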




This argument has been used to propose the concept of fractional metrics, for which p ∈ (0, 1). Such fractional metrics can provide more effective results in the high-dimensional case. As a rule of thumb, the larger the dimensionality, the smaller the value of p that should be used. However, there is no exact rule for the precise choice of p, because dimensionality is not the only relevant factor. The precise choice of p should be selected in an application-specific way, with the use of benchmarking. The bibliographic notes contain discussions on the use of fractional metrics.
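A fractional dissimilarity is straightforward to compute directly from its definition; the sketch below (a hypothetical helper, assuming NumPy) uses p = 0.5, though in practice the value of p would be chosen by application-specific benchmarking as noted above.

```python
import numpy as np

def fractional_dissimilarity(x, y, p=0.5):
    """Fractional Lp dissimilarity (sum_i |x_i - y_i|^p)^(1/p) with p in (0, 1)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

# Example usage on two small records.
print(fractional_dissimilarity([1.0, 2.0, 3.0], [1.5, 2.0, 2.0], p=0.5))
```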


3.2.1.5 Match-Based Similarity Computation


Because it is desirable to select locally relevant features for distance computation, a question arises as to how this can be achieved in a meaningful and practical way for data mining applications. A simple approach that is based on the cumulative evidence of matching many attribute values has been shown to be effective in many scenarios. This approach is also relatively easy to implement efficiently.


A broader principle that seems to work well for high-dimensional data is that the impact of the noisy variation along individual attributes needs to be de-emphasized while counting the cumulative match across many dimensions. Of course, such an approach poses challenges for low-dimensional data, because the cumulative impact of matching cannot be counted in a statistically robust way with a small number of dimensions. Therefore, an approach is needed that can automatically adjust to the dimensionality of the data.


With increasing dimensionality, a record is likely to contain both relevant and irrelevant features. A pair of semantically similar objects may contain feature values that are dissimilar (at the level of one standard deviation along that dimension) because of the noisy variations in irrelevant features. Conversely, a pair of objects is unlikely to have similar values across many attributes just by chance, unless these attributes are relevant. Interestingly, the Euclidean metric (and the Lp-norm in general) achieves exactly the opposite effect by summing the squared differences in attribute values. As a result, the “noise” components from the irrelevant attributes dominate the computation and mask the similarity effects of a large number of relevant attributes. The L∞-norm provides an extreme example of this effect, where only the dimension with the largest distance value is used. In high-dimensional domains such as text, similarity functions such as the cosine measure (discussed in Sect. 3.3) tend to emphasize the cumulative effect of matches on many attribute values rather than large distances along individual attributes. This general principle can also be used for quantitative data.
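For intuition, the sketch below (hypothetical values, assuming NumPy) contrasts the two behaviors: cosine similarity remains high when two vectors agree on most attributes, whereas the L∞ distance is determined entirely by the single attribute on which they differ most.

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity, which rewards agreement accumulated across many attributes."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

rng = np.random.default_rng(1)
x = rng.uniform(size=1000)
y = x.copy()
y[0] += 5.0                       # one strongly deviating (irrelevant) attribute

print(round(cosine(x, y), 3))                                # still high (close to 1)
print(round(float(np.linalg.norm(x - y, ord=np.inf)), 1))    # exactly the single deviation: 5.0
```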


One way of de-emphasizing precise levels of dissimilarity is to use proximity thresholding in a dimensionality-sensitive way. To perform proximity thresholding, the data are discretized into equidepth buckets. Each dimension is divided into kd equidepth buckets, each containing a fraction 1/kd of the records. The number of buckets, kd, depends on the data dimensionality d.
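A minimal sketch of this discretization step is shown below (assuming NumPy; the rule for choosing kd as a function of the dimensionality d is left open here, as it is in the text). It assigns each value of every dimension to one of kd equi-depth buckets via quantile boundaries.

```python
import numpy as np

def equidepth_bucket_ids(data, kd):
    """Assign each value in every column of an (n x d) data matrix to one of kd equi-depth buckets."""
    data = np.asarray(data, dtype=float)
    n, d = data.shape
    ids = np.empty((n, d), dtype=int)
    for j in range(d):
        # Interior bucket boundaries at the 1/kd, 2/kd, ..., (kd-1)/kd quantiles of dimension j.
        edges = np.quantile(data[:, j], np.linspace(0.0, 1.0, kd + 1)[1:-1])
        ids[:, j] = np.searchsorted(edges, data[:, j], side="right")
    return ids

# Two records are then "in proximity" on dimension i whenever their bucket ids match there.
```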


Let X = (x1 . . . xd) and Y = (y1 . . . yd) be two d-dimensional records. Then, for dimension i, if both xi and yi belong to the same bucket, the two records are said to be in proximity on dimension i. The subset of dimensions on which X and Y map to the same bucket is referred to as the proximity set, and it is denoted by S(X, Y, kd). Furthermore, for each dimension i ∈ S(X, Y, kd), let mi and ni be the upper and lower bounds of the bucket in dimension i, in which the two records are proximate to one another. Then, the





