for each X ∈ R do begin
    for each Y ∈ D − S do begin
        Update current k-nearest neighbor distance estimate V^k(X)
            by computing the distance of Y to X;
        if V^k(X) ≤ L then terminate inner loop;
    endfor
    if V^k(X) > L then
        include X in current r best outliers and update L to
            the new rth best outlier score;
endfor
Note that the k-nearest neighbors of a data point X do not include the data point itself. Therefore, care must be taken in the nested loop structure to ignore the trivial cases where X = Y while updating k-nearest neighbor distances.
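The pruning logic above translates directly into code. The following is a minimal Python sketch, assuming Euclidean distances and assuming that the initial sample S has already been chosen and scored exactly to seed the bound L, as described earlier in this section; the function names, the heap-based bookkeeping, and the NumPy representation are illustrative choices rather than part of the original algorithm.

import heapq
import numpy as np

def exact_knn_dist(D, i, k):
    # Exact k-nearest neighbor distance of D[i] (the point itself excluded).
    d = np.linalg.norm(np.delete(D, i, axis=0) - D[i], axis=1)
    return np.partition(d, k - 1)[k - 1]

def top_r_outliers(D, sample_idx, k, r):
    n = len(D)
    # Phase 1 (described earlier in the section): score the sample S
    # exactly to seed the r best outliers and the pruning bound L.
    top = heapq.nlargest(r, [(exact_knn_dist(D, i, k), i) for i in sample_idx])
    L = top[-1][0] if len(top) == r else -np.inf

    # Phase 2: the nested loop above, over R = D - S.
    for i in sorted(set(range(n)) - set(sample_idx)):
        knn = []          # max-heap (negated) of the k smallest distances so far
        pruned = False
        for j in range(n):
            if j == i:    # ignore the trivial case X = Y
                continue
            d = np.linalg.norm(D[i] - D[j])
            if len(knn) < k:
                heapq.heappush(knn, -d)
            elif d < -knn[0]:
                heapq.heapreplace(knn, -d)
            # Once k distances have been seen, -knn[0] is an upper bound on
            # V^k(X) that only decreases; prune X as soon as it reaches L.
            if len(knn) == k and -knn[0] <= L:
                pruned = True
                break     # terminate inner loop
        if not pruned:
            vk = -knn[0]  # exact V^k(X)
            if vk > L:
                # Include X in the current r best outliers and update L
                # to the new r-th best outlier score.
                top = heapq.nlargest(r, top + [(vk, i)])
                if len(top) == r:
                    L = top[-1][0]
    return top            # (score, index) pairs, best outlier first

The bound is valid because the running estimate of V^k(X) can only shrink as more points are examined; once it falls to L or below, X can never displace any of the current r best outliers.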
8.5.2 Local Distance Correction Methods
Section 3.2.1.8 of Chap. 3 provides a detailed discussion of the impact of the local data distribution on distance computation. In particular, it is shown that straightforward measures, such as the Euclidean distance, do not reflect the intrinsic distances between data points when the density and shape of the clusters vary significantly with data locality. This principle was also used in Sect. 8.4 to justify the use of the local Mahalanobis distance, rather than the Euclidean distance, for measuring the distances to cluster centroids. One of the earliest methods that recognized this principle in the context of varying data density was the Local Outlier Factor (LOF) method. A formal justification is based on the generative principles of data sets, but only an intuitive understanding will be provided here. It should be pointed out that the use of the Mahalanobis distance (instead of the Euclidean distance) for multivariate extreme value analysis (Sect. 8.2.2) is also based on generative principles of the likelihood of a data point conforming to the statistical properties of the underlying distribution. The main difference is that the analysis was global in that case, whereas it is local in this case.
To motivate the principles of local distance correction in the context of outlier analysis, two examples will be used. One of these examples illustrates the impact of varying local distribution density, whereas the other illustrates the impact of varying local cluster shape. Both these aspects can be addressed with different kinds of local normalization of distance computations. In Fig. 8.8a, two different clusters are shown, one of which is much sparser than the other. In this case, both data points A and B are clearly outliers. While outlier B will be easily detected by most distance-based algorithms, a challenge arises in the detection of outlier A. This is because the nearest neighbor distance of many data points in the sparser cluster is at least as large as the nearest neighbor distance of outlier A. As a result, depending on the distance threshold used, a k-nearest neighbor algorithm will either falsely report portions of the sparse cluster, or will completely miss outlier A. Simply speaking, the ranking of the outliers by distance-based algorithms is incorrect. This is because the true distance of point A should be computed in a normalized way, based on the local data distribution. This aspect is relevant to the discussion in Sect. 3.2.1.8 of Chap. 3 on the impact of local data distributions on distance function design, and it is important for many distance-based data mining problems.
Figure 8.8: Impact of local variations in data distribution on distance-based outlier detection
The key issue here is the generative principle: data point A is much less likely to be generated by its closest (tightly knit) cluster than many slightly isolated data points belonging to the relatively diffuse cluster are likely to be generated by theirs. Hawkins's definition of outliers, stated at the beginning of this chapter, was formulated on the basis of such generative principles. It should be pointed out that the probabilistic EM algorithm of Sect. 8.3 does a much better job of recognizing these generative differences. However, the probabilistic EM method is often not used in practice because of overfitting issues with smaller data sets. The LOF approach was the first method to recognize the importance of incorporating these generative principles into nonparametric distance-based algorithms.
This point can be emphasized further by examining clusters of different local shape and orientation in Fig. 8.8b. In this case, a distance-based algorithm will report one of the data points along the long axis of one of the elongated clusters as the strongest outlier, if the 1-nearest neighbor distance is used. This data point is far more likely to be generated by its closest cluster than the outlier marked by "X." However, the latter has a smaller 1-nearest neighbor distance. Therefore, the significant problem with distance-based algorithms is that they do not account for the local generative behavior of the underlying data. In this section, two methods will be discussed for addressing this issue. One of them is LOF, and the other is a direct generalization of the global Mahalanobis method for extreme value analysis. The first method can adjust for the generative variations illustrated in Fig. 8.8a, and the second method can adjust for the generative variations illustrated in Fig. 8.8b.
8.5.2.1 Local Outlier Factor (LOF)
The Local Outlier Factor (LOF) approach adjusts for local variations in cluster density by normalizing distances with the average point-specific distances in a data locality. It is often popularly understood as a density-based approach, although, in practice, it is a (normalized) distance-based approach where the normalization factor corresponds to the average local data density. This normalization is the key to addressing the challenges posed by the scenario of Fig. 8.8a.
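Before the formal definitions, the normalization idea can be seen in a simplified form. The following Python sketch scores each point by the ratio of its k-nearest neighbor distance to the average k-nearest neighbor distance of its own neighbors; this deliberately omits the reachability-distance smoothing used by the actual LOF method, and the function and variable names are illustrative only.

import numpy as np

def simplified_local_outlier_scores(D, k):
    # Pairwise Euclidean distances; the diagonal is set to infinity so
    # that a point is never counted among its own nearest neighbors.
    dist = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    n = len(D)
    nbrs = np.argsort(dist, axis=1)[:, :k]         # k nearest neighbors of each point
    vk = dist[np.arange(n)[:, None], nbrs][:, -1]  # k-nearest neighbor distance
    # Normalize each point's k-NN distance by the average k-NN distance of
    # its neighbors; scores well above 1 indicate locally sparse points.
    return vk / vk[nbrs].mean(axis=1)

In Fig. 8.8a, such a ratio would be close to 1 for points inside either cluster, but noticeably larger than 1 for point A, even though A's raw nearest neighbor distance is comparable to those of points in the sparse cluster.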
For a given data point X, let V^k(X) be the distance to its k-nearest neighbor, and let L_k(X) be the set of points within the k-nearest neighbor distance of X. The set L_k(X) will