Data Mining: The Textbook




8.5. DISTANCE-BASED OUTLIER DETECTION


The set Lk(X) typically contains k points, but may sometimes contain more than k points because of ties in the k-nearest neighbor distance.


Then, the reachability distance Rk(X, Y) of object X with respect to Y is defined as the maximum of the distance Dist(X, Y) between the pair (X, Y) and the k-nearest neighbor distance Vk(Y) of Y.



Rk(X, Y) = max{Dist(X, Y), Vk(Y)}                                    (8.10)




The reachability distance is not symmetric between X and Y. Intuitively, when Y is in a dense region and the distance between X and Y is large, the reachability distance of X with respect to Y is equal to the true distance Dist(X, Y). On the other hand, when the distance between X and Y is small, the reachability distance is smoothed out by the k-nearest neighbor distance of Y. The larger the value of k, the greater the smoothing, and correspondingly, the reachability distances with respect to different points become more similar. The reason for using this smoothing is that it makes the intermediate distance computations more stable. This is especially important when the distances between X and Y are small, because small raw distances show greater statistical fluctuations. At a conceptual level, it is possible to define a version of LOF directly in terms of raw distances, rather than reachability distances. However, such a version would be missing the stability provided by smoothing.
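The smoothed reachability distance of Eq. 8.10 can be sketched in a few lines of NumPy. This is a minimal brute-force illustration, not code from the book; the function names are chosen here for clarity.

```python
import numpy as np

def knn_distance(data, k):
    """Return (vk, dist): the k-nearest neighbor distance Vk of every point
    (excluding the point itself) and the full pairwise distance matrix."""
    diffs = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    # Column 0 of each sorted row is the zero self-distance, so column k
    # is the distance to the k-th nearest other point.
    vk = np.sort(dist, axis=1)[:, k]
    return vk, dist

def reachability(dist, vk, i, j):
    """Rk(Xi, Xj) = max{Dist(Xi, Xj), Vk(Xj)}, per Eq. 8.10."""
    return max(dist[i, j], vk[j])
```

Note the asymmetry: `reachability(dist, vk, i, j)` smooths by the neighbor density of point j, not of point i.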

The average reachability distance ARk(X) of data point X with respect to its neighborhood Lk(X) is defined as the average of its reachability distances to all objects in its neighborhood.



ARk(X) = MEAN_{Y ∈ Lk(X)} Rk(X, Y)                                   (8.11)



Here, the MEAN function simply represents the mean value over the entire set Lk(X). The Local Outlier Factor LOFk(X) is then equal to the mean ratio of ARk(X) to the corresponding values of all points in the k-neighborhood of X.
























LOFk(X) = MEAN_{Y ∈ Lk(X)} [ ARk(X) / ARk(Y) ]                       (8.12)



The use of distance ratios in the definition ensures that the local distance behavior is well accounted for. As a result, the LOF values for the objects in a cluster are often close to 1 when the data points in the cluster are homogeneously distributed. For example, in the case of Fig. 8.8a, the LOF values of data points in both clusters will be quite close to 1, even though the densities of the two clusters are different. On the other hand, the LOF values of both outlying points will be much higher, because they are computed in terms of ratios to the average neighbor reachability distances. In practice, because the best size of the neighborhood is not known in advance, the maximum value of LOFk(X) over a range of different values of k is used as the outlier score.
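Putting Eqs. 8.10 through 8.12 together, the full LOF computation can be sketched as follows. This is a brute-force NumPy illustration under simplifying assumptions: ties in the k-nearest neighbor distance are ignored by truncating each neighborhood to exactly k points, and the function name is not from the book.

```python
import numpy as np

def lof_scores(data, k):
    """Brute-force LOF per Eqs. 8.10-8.12 (neighborhoods truncated to k points)."""
    n = len(data)
    dist = np.sqrt(((data[:, None, :] - data[None, :, :]) ** 2).sum(axis=2))
    order = np.argsort(dist, axis=1)
    neighbors = order[:, 1:k + 1]          # Lk(X), self excluded
    vk = dist[np.arange(n), order[:, k]]   # k-nearest neighbor distance Vk(X)
    # Eq. 8.11: ARk(X) = mean over Y in Lk(X) of max{Dist(X, Y), Vk(Y)}
    ark = np.array([np.mean(np.maximum(dist[i, neighbors[i]], vk[neighbors[i]]))
                    for i in range(n)])
    # Eq. 8.12: LOFk(X) = mean over Y in Lk(X) of ARk(X) / ARk(Y)
    return np.array([np.mean(ark[i] / ark[neighbors[i]]) for i in range(n)])
```

On a homogeneous cluster with one distant point, the clustered points score close to 1 and the distant point scores much higher, as the text describes.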


One observation about the LOF method is that while it is popularly understood in the literature as a density-based approach, it can be more simply understood as a relative distance-based approach with smoothing. The smoothing is really a refinement to make distance computations more stable. The basic LOF method will work reasonably well on many data sets, even if the raw distances are used instead of reachability distances, for the aforementioned computations of Eq. 8.11.


The LOF method, therefore, has the ability to adjust well to regions of varying density because of this relative normalization in the denominator of each term of Eq. 8.12. In the original presentation of the LOF algorithm (see bibliographic notes), the LOF is defined in terms of a density variable. The density variable is loosely defined as the inverse of the




average of the smoothed reachability distances. This is, of course, not a precise definition of density. Density is traditionally defined in terms of the number of data points within a specified area or volume. This book provides exactly the same definition of LOF but presents it slightly differently by omitting the intermediate density variable. This is done both for simplicity, and for a definition of LOF directly in terms of (normalized) distances. The real connection of LOF to data density lies in its insightful ability to adjust to varying data density with the use of relative distances. Therefore, this book has classified this approach as a (normalized) distance-based method, rather than as a density-based method.


8.5.2.2 Instance-Specific Mahalanobis Distance


The instance-specific Mahalanobis distance is designed to adjust to varying shapes of the distributions in the locality of a particular data point, as illustrated in Fig. 8.8b. The Mahalanobis distance is directly related to the shape of the data distribution, although it is traditionally used in the global sense. Of course, it is also possible to use a local Mahalanobis distance based on the covariance structure of the neighborhood of a data point.


The problem here is that the neighborhood of a data point is hard to define with the Euclidean distance when the shape of the neighborhood cluster is not spherical. For example, the Euclidean distance to a data point is biased toward capturing the circular region around that point, rather than an elongated cluster. To address this issue, an agglomerative approach is used to determine the k-neighborhood Lk(X) of a data point X. First, the data point X is added to Lk(X). Then, data points are iteratively added to Lk(X) that have the smallest distance to their closest point in Lk(X). This approach can be viewed as a special case of single-linkage hierarchical clustering methods, where singleton points are merged with clusters. Single-linkage methods are well known for creating clusters of arbitrary shape, so this approach tends to "grow" the neighborhood with the same shape as the cluster. The mean μk(X) and covariance matrix Σk(X) of the neighborhood Lk(X) are then computed. The instance-specific Mahalanobis score LMahak(X) of a data point X provides its outlier score. This score is defined as the Mahalanobis distance of X to the mean μk(X) of the data points in Lk(X).
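The agglomerative neighborhood growth and the resulting instance-specific score can be sketched as follows. This is an illustrative NumPy implementation under stated assumptions: function names are hypothetical, ties are broken arbitrarily, and a pseudo-inverse stands in for the covariance inverse when the local covariance matrix is singular.

```python
import numpy as np

def grow_neighborhood(data, x_idx, k):
    """Grow Lk(X) agglomeratively: start from X and repeatedly add the point
    closest to any current member (single-linkage), so the neighborhood
    follows the shape of the local cluster."""
    dist = np.sqrt(((data[:, None, :] - data[None, :, :]) ** 2).sum(axis=2))
    members = [x_idx]
    while len(members) < k + 1:  # X itself plus k neighbors
        outside = [i for i in range(len(data)) if i not in members]
        # Single-linkage criterion: distance to the closest current member.
        closest = min(outside, key=lambda i: dist[i, members].min())
        members.append(closest)
    return members

def instance_specific_mahalanobis(data, x_idx, k):
    """LMahak(X): Mahalanobis distance of X to the mean of its grown neighborhood."""
    local = data[grow_neighborhood(data, x_idx, k)]
    mu = local.mean(axis=0)
    cov = np.cov(local, rowvar=False)
    diff = data[x_idx] - mu
    # pinv guards against a singular local covariance (e.g. collinear neighbors).
    return float(np.sqrt(diff @ np.linalg.pinv(cov) @ diff))
```

For an elongated cluster, the grown neighborhood follows the cluster along its axis rather than forming a circular ball around X, which is exactly the bias the agglomerative construction is meant to remove.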




