Data Mining: The Textbook



3.2. MULTIDIMENSIONAL DATA

Figure 3.2: Impact of p on contrast


subset selection during preprocessing, because the relevance of features is locally determined by the pair of objects that are being considered. Globally, all features may be relevant.
When many features are irrelevant, the additive noise effects of the irrelevant features can sometimes be reflected in the concentration of the distances. In any case, such irrelevant features will almost always result in errors in distance computation. Because high-dimensional data sets are often likely to contain diverse features, many of which are irrelevant, the additive effect with the use of a sum-of-squares approach, such as the L2-norm, can be very detrimental.
3.2.1.4 Impact of Different Lp-Norms

Different Lp-norms do not behave in the same way, either in terms of the impact of irrelevant features or in terms of distance contrast. Consider the extreme case when p = ∞. This amounts to using only the dimension on which the two objects are most dissimilar. Very often, this dimension merely reflects the natural variation in an irrelevant attribute, which is of little use to a similarity-based application. In fact, in a 1000-dimensional application, if two objects have similar values on 999 attributes, they should be considered very similar. However, a single irrelevant attribute on which the two objects are very different will throw off the distance value in the case of the L∞ metric. In other words, the local similarity properties of the data are de-emphasized by L∞. Clearly, this is not desirable.
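As a quick illustration of this effect, the following sketch (hypothetical values, assuming NumPy) compares the L1, L2, and L∞ distances between two 1000-dimensional points that nearly agree on 999 attributes but differ sharply on a single irrelevant one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two objects that nearly agree on all 1000 attributes ...
x = rng.normal(size=1000)
y = x + rng.normal(scale=0.01, size=1000)   # tiny noise on every attribute
y[0] += 5.0                                  # ... except one irrelevant attribute

for p, name in [(1, "L1"), (2, "L2"), (np.inf, "Linf")]:
    print(name, round(float(np.linalg.norm(x - y, ord=p)), 3))

# The Linf distance is essentially the single large difference (about 5.0),
# ignoring the near-perfect match on the other 999 attributes.
```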


This behavior is generally true for larger values of p, which increasingly emphasize the irrelevant attributes. In fact, it can also be shown that distance contrasts are poorer for larger values of p for certain data distributions. In Fig. 3.1b, the distance contrasts are illustrated for different values of p in the Lp-norm over different dimensionalities. The figure is constructed using the same approach as Fig. 3.1a. While all Lp-norms degrade with increasing dimensionality, the degradation is much faster for the plots representing larger values of p. This trend can be understood better from Fig. 3.2, where the value of p is used on the X-axis. In Fig. 3.2a, the contrast is illustrated for different values of p over data of different dimensionalities. Figure 3.2b is derived from Fig. 3.2a, except that the results show the fraction of the Manhattan performance achieved by higher-order norms. It is evident that the rate of degradation with increasing p is higher when the dimensionality of the data is large. For 2-dimensional data, there is very little degradation. This is why the choice of p matters less in lower-dimensional applications.
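The following sketch (a rough simulation, not the book's exact experiment) illustrates the same trend: for uniformly distributed points, a simple contrast measure such as (Dmax − Dmin)/Dmean shrinks with dimensionality, and shrinks faster for larger p.

```python
import numpy as np

def contrast(dim, p, n=1000, seed=0):
    """Contrast (Dmax - Dmin) / Dmean of the Lp distances of n uniform points from the origin."""
    rng = np.random.default_rng(seed)
    pts = rng.uniform(size=(n, dim))
    d = np.linalg.norm(pts, ord=p, axis=1)
    return float((d.max() - d.min()) / d.mean())

for dim in (2, 20, 200):
    print(dim, {p: round(contrast(dim, p), 3) for p in (1, 2, 10)})

# The contrast decreases as the dimensionality grows, and the decrease is
# steeper for the larger norm (p = 10) than for the Manhattan norm (p = 1).
```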




This argument has been used to propose the concept of fractional metrics, for which p ∈ (0, 1). Such fractional metrics can provide more effective results in the high-dimensional case. As a rule of thumb, the larger the dimensionality, the smaller the value of p that should be used. However, there is no exact rule for the precise choice of p, because dimensionality is not the only relevant factor. The precise choice of p should be selected in an application-specific way, with the use of benchmarking. The bibliographic notes contain discussions on the use of fractional metrics.
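A fractional dissimilarity is straightforward to compute directly from its definition; the sketch below (a hypothetical helper, assuming NumPy) uses p = 0.5, though in practice the value of p would be chosen by application-specific benchmarking as noted above.

```python
import numpy as np

def fractional_dissimilarity(x, y, p=0.5):
    """Fractional Lp dissimilarity (sum_i |x_i - y_i|^p)^(1/p) with p in (0, 1)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

# Example usage on two small records.
print(fractional_dissimilarity([1.0, 2.0, 3.0], [1.5, 2.0, 2.0], p=0.5))
```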


3.2.1.5 Match-Based Similarity Computation


Because it is desirable to select locally relevant features for distance computation, a question arises as to how this can be achieved in a meaningful and practical way for data mining applications. A simple approach that is based on the cumulative evidence of matching many attribute values has been shown to be effective in many scenarios. This approach is also relatively easy to implement efficiently.


A broader principle that seems to work well for high-dimensional data is that the impact of the noisy variation along individual attributes needs to be de-emphasized while counting the cumulative match across many dimensions. Of course, such an approach poses challenges for low-dimensional data, because the cumulative impact of matching cannot be counted in a statistically robust way with a small number of dimensions. Therefore, an approach is needed that can automatically adjust to the dimensionality of the data.


With increasing dimensionality, a record is likely to contain both relevant and irrelevant features. A pair of semantically similar objects may contain feature values that are dissimilar (at the level of one standard deviation along that dimension) because of the noisy variations in irrelevant features. Conversely, a pair of objects is unlikely to have similar values across many attributes just by chance, unless these attributes are relevant. Interestingly, the Euclidean metric (and the Lp-norm in general) achieves exactly the opposite effect by summing the squared differences in attribute values. As a result, the “noise” components from the irrelevant attributes dominate the computation and mask the similarity effects of a large number of relevant attributes. The L∞-norm provides an extreme example of this effect, where only the dimension with the largest distance value is used. In high-dimensional domains such as text, similarity functions such as the cosine measure (discussed in Sect. 3.3) tend to emphasize the cumulative effect of matches on many attribute values rather than large distances along individual attributes. This general principle can also be used for quantitative data.
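For intuition, the sketch below (hypothetical values, assuming NumPy) contrasts the two behaviors: cosine similarity remains high when two vectors agree on most attributes, whereas the L∞ distance is determined entirely by the single attribute on which they differ most.

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity, which rewards agreement accumulated across many attributes."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

rng = np.random.default_rng(1)
x = rng.uniform(size=1000)
y = x.copy()
y[0] += 5.0                       # one strongly deviating (irrelevant) attribute

print(round(cosine(x, y), 3))                                # still high (close to 1)
print(round(float(np.linalg.norm(x - y, ord=np.inf)), 1))    # exactly the single deviation: 5.0
```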


One way of de-emphasizing precise levels of dissimilarity is to use proximity thresholding in a dimensionality-sensitive way. To perform proximity thresholding, the data are discretized into equidepth buckets. Each dimension is divided into kd equidepth buckets, each containing a fraction 1/kd of the records. The number of buckets, kd, depends on the data dimensionality d.
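A minimal sketch of this discretization step is shown below (assuming NumPy; the rule for choosing kd as a function of the dimensionality d is left open here, as it is in the text). It assigns each value of every dimension to one of kd equi-depth buckets via quantile boundaries.

```python
import numpy as np

def equidepth_bucket_ids(data, kd):
    """Assign each value in every column of an (n x d) data matrix to one of kd equi-depth buckets."""
    data = np.asarray(data, dtype=float)
    n, d = data.shape
    ids = np.empty((n, d), dtype=int)
    for j in range(d):
        # Interior bucket boundaries at the 1/kd, 2/kd, ..., (kd-1)/kd quantiles of dimension j.
        edges = np.quantile(data[:, j], np.linspace(0.0, 1.0, kd + 1)[1:-1])
        ids[:, j] = np.searchsorted(edges, data[:, j], side="right")
    return ids

# Two records are then "in proximity" on dimension i whenever their bucket ids match there.
```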


Let X = (x1 . . . xd) and Y = (y1 . . . yd) be two d-dimensional records. Then, for dimension i, if both xi and yi belong to the same bucket, the two records are said to be in proximity on dimension i. The subset of dimensions on which X and Y map to the same bucket is referred to as the proximity set, and it is denoted by S(X, Y, kd). Furthermore, for each dimension i ∈ S(X, Y, kd), let mi and ni be the upper and lower bounds of the bucket in dimension i, in which the two records are proximate to one another. Then, the





