Data Mining: The Textbook




8.5. DISTANCE-BASED OUTLIER DETECTION







for each X ∈ R do begin
    for each Y ∈ D − S do begin
        Update current k-nearest neighbor distance estimate V^k(X)
            by computing distance of Y to X;
        if V^k(X) ≤ L then terminate inner loop;
    endfor
    if V^k(X) > L then
        include X in current r best outliers and update L to
            the new rth best outlier score;
endfor

Note that the k-nearest neighbors of a data point X do not include the data point itself. Therefore, care must be taken in the nested loop structure to ignore the trivial cases where X = Y while updating k-nearest neighbor distances.
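To make the control flow concrete, the following is a minimal Python sketch of this nested loop with early termination. The names (data set D, sample S, candidate set R, bound L) mirror the pseudocode above, while the brute-force euclidean helper and the heap-based bookkeeping are illustrative assumptions rather than part of the original formulation.

    import heapq
    import math

    def euclidean(a, b):
        # Brute-force Euclidean distance; any metric could be substituted.
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

    def top_r_outliers(D, S, R, k, r, L):
        # D: list of points; S: indices sampled in the first phase;
        # R: candidate indices that survived the sampling phase;
        # L: current lower bound on the r-th best outlier score.
        S_set = set(S)
        rest = [j for j in range(len(D)) if j not in S_set]  # points in D - S
        best = []  # min-heap of (score, index): the current r best outliers
        for x in R:
            # Seed the k-nearest neighbor estimate with distances to the
            # sample (in practice cached from the first phase, not recomputed).
            heap = [-euclidean(D[x], D[j]) for j in S if j != x]
            heapq.heapify(heap)
            while len(heap) > k:
                heapq.heappop(heap)  # retain only the k smallest distances
            pruned = False
            for y in rest:
                if y == x:
                    continue  # a point is not its own nearest neighbor
                d = euclidean(D[x], D[y])
                if len(heap) < k:
                    heapq.heappush(heap, -d)
                elif d < -heap[0]:
                    heapq.heapreplace(heap, -d)
                # -heap[0] estimates V^k(x) once k distances have been seen,
                # and the estimate can only shrink as more points are scanned.
                if len(heap) == k and -heap[0] <= L:
                    pruned = True  # x can no longer be a top-r outlier
                    break
            if not pruned and len(heap) == k and -heap[0] > L:
                heapq.heappush(best, (-heap[0], x))
                if len(best) > r:
                    heapq.heappop(best)
                if len(best) == r:
                    L = best[0][0]  # the new r-th best outlier score
        return sorted(best, reverse=True)

Because L can only increase as better outliers are found, the inner-loop pruning becomes progressively more aggressive, which is the source of the practical speedup of this scheme.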

8.5.2 Local Distance Correction Methods


Section 3.2.1.8 of Chap. 3 provides a detailed discussion of the impact of the local data distribution on distance computation. In particular, it is shown that straightforward measures, such as the Euclidean distance, do not reflect the intrinsic distances between data points when the density and shape of the clusters vary significantly with data locality. This principle was also used in Sect. 8.4 to justify the use of the local Mahalanobis distance for measuring the distances to cluster centroids, rather than the Euclidean distance. One of the earliest methods that recognized this principle in the context of varying data density was the Local Outlier Factor (LOF) method. A formal justification is based on the generative principles of data sets, but only an intuitive understanding will be provided here. It should be pointed out that the use of the Mahalanobis distance (instead of the Euclidean distance) for multivariate extreme value analysis (Sect. 8.2.2) is also based on generative principles of the likelihood of a data point conforming to the statistical properties of the underlying distribution. The main difference is that the analysis was global in that case, whereas the analysis is local in this case. The reader is also advised to revisit Sect. 3.2.1.8 of Chap. 3 for a discussion of the impact of data distributions on intrinsic distances between data points.
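To make the contrast concrete, the short sketch below (with assumed synthetic data and NumPy usage) compares the Euclidean and Mahalanobis distances of the same point from a cluster centroid: a point displaced along the long axis of an elongated cluster is far away in Euclidean terms but close in Mahalanobis terms, which is consistent with the generative view.

    import numpy as np

    rng = np.random.default_rng(0)
    # Assumed synthetic cluster, elongated along the first axis.
    cluster = rng.normal(size=(500, 2)) * np.array([5.0, 0.5])

    centroid = cluster.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(cluster, rowvar=False))

    point = np.array([6.0, 0.0])  # displaced along the long axis
    diff = point - centroid
    euclid = float(np.linalg.norm(diff))
    mahal = float(np.sqrt(diff @ cov_inv @ diff))
    print(euclid, mahal)  # Euclidean is about 6; Mahalanobis is close to 1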


To motivate the principles of local distance correction in the context of outlier analysis, two examples will be used. One of these examples illustrates the impact of varying local distribution density, whereas the other illustrates the impact of varying local cluster shape. Both these aspects can be addressed with different kinds of local normalization of distance computations. In Fig. 8.8a, two different clusters are shown, one of which is much sparser than the other. In this case, both data points A and B are clearly outliers. While outlier B will be easily detected by most distance-based algorithms, a challenge arises in the detection of outlier A. This is because the nearest neighbor distance of many data points in the sparser cluster is at least as large as the nearest neighbor distance of outlier A. As a result, depending on the distance threshold used, a k-nearest neighbor algorithm will either falsely report portions of the sparse cluster as outliers or completely miss outlier A. Simply speaking, the ranking of the outliers produced by distance-based algorithms is incorrect. This is because the true distance of a point such as A should be computed in a normalized way, based on the local data distribution. This aspect is relevant to the discussion in Sect. 3.2.1.8 of Chap. 3 on the impact of local data distributions on distance function design, and it is important for many distance-based data mining problems. The key issue here is the generative principle: data point A is much less likely to be generated by its closest (tightly knit) cluster than many slightly isolated data points belonging to the relatively diffuse cluster are likely to be generated by their cluster. Hawkins's definition of outliers, stated at the beginning of this chapter, was formulated on the basis of generative principles. It should be pointed out that the probabilistic EM algorithm of Sect. 8.3 does a much better job at recognizing these generative differences. However, the probabilistic EM method is often not used in practice because of overfitting issues with smaller data sets. The LOF approach was the first method to recognize the importance of incorporating these generative principles in nonparametric distance-based algorithms.

Figure 8.8: Impact of local variations in data distribution on distance-based outlier detection

This point can be emphasized further by examining clusters of different local shape and orientation in Fig. 8.8b. In this case, if the 1-nearest neighbor distance is used, a distance-based algorithm will report one of the data points along the long axis of one of the elongated clusters as the strongest outlier. This data point is far more likely to be generated by its closest cluster than the outlier marked by "X," even though the latter has a smaller 1-nearest neighbor distance. Therefore, the significant problem with distance-based algorithms is that they do not account for the local generative behavior of the underlying data. In this section, two methods will be discussed for addressing this issue. One of them is LOF, and the other is a direct generalization of the global Mahalanobis method for extreme value analysis. The first method can adjust for the generative variations illustrated in Fig. 8.8a, and the second can adjust for those illustrated in Fig. 8.8b.


8.5.2.1 Local Outlier Factor (LOF)


The Local Outlier Factor (LOF) approach adjusts for local variations in cluster density by normalizing distances with the average point-specific distances in a data locality. It is often popularly understood as a density-based approach, although, in practice, it is a (normalized) distance-based approach in which the normalization factor corresponds to the average local data density. This normalization is the key to addressing the challenges posed by the scenario of Fig. 8.8a.




For a given data point X, let V^k(X) be the distance to its k-nearest neighbor, and let L_k(X) be the set of points within the k-nearest neighbor distance of X. The set L_k(X) will typically contain k points, but may sometimes contain more than k points because of ties in the k-nearest neighbor distance.
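As a concrete rendering of these two definitions, the following minimal sketch computes V^k(X) and L_k(X) by brute force; math.dist and the exhaustive scan are simplifications assumed here, and an index structure would be preferable at scale.

    import math

    def knn_distance_and_set(D, x_idx, k):
        # Returns V^k(X) and L_k(X) for the point at index x_idx.
        X = D[x_idx]
        dists = sorted(
            (math.dist(X, Y), j)
            for j, Y in enumerate(D)
            if j != x_idx  # a point is not its own nearest neighbor
        )
        vk = dists[k - 1][0]  # distance to the k-th nearest neighbor
        # L_k(X): every point within distance V^k(X) of X; ties in the
        # k-nearest neighbor distance can make the set larger than k.
        Lk = {j for d, j in dists if d <= vk}
        return vk, Lk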



