Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə153/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   149   150   151   152   153   154   155   156   ...   423
1-Data Mining tarjima

The distance-based outlier score of an object O is its distance to its kth nearest neighbor.

The aforementioned definition, which uses the k-nearest neighbor distance, is the most common one. Other variations of this definition are sometimes used, such as the average distance to the k -nearest neighbors. The value of k is a user-defined parameter. Selecting a value of k larger than 1 helps identify isolated groups of outliers. For example, in Fig. 8.7, as long as k is fixed to any value larger than 3, all data points within the small groups of closely related points will have a high outlier score. Note that the target data point, for which the outlier score is computed, is itself not included among its k-nearest neighbors. This is done to avoid scenarios where a 1-nearest neighbor method will always yield an outlier score of 0.


Distance-based methods typically use a finer granularity of analysis than clustering methods and can therefore distinguish between ambient noise and truly isolated anomalies. This is because ambient noise will typically have a lower k-nearest neighbor distance than a truly isolated anomaly. This distinction is lost in clustering methods where the distance to the closest cluster centroid does not accurately reflect the instance-specific isolation of the underlying data point.


The price of this better granularity is higher computational complexity. Consider a data set D containing n data points. The determination of the k-nearest neighbor distance requires O(n) time for each data point, when a sequential scan is used. Therefore, the



8.5. DISTANCE-BASED OUTLIER DETECTION

249

determination of the outlier scores of all data points may require O(n2 ) time. This is clearly not a feasible option for very large data sets. Therefore, a variety of methods are used to speed up the computation:





  1. Index structures: Index structures can be used to determine kth nearest neighbor distances efficiently. This is, however, not an option, if the data is high dimensional. In such cases, the effectiveness of index structures tends to degrade.




  1. Pruning tricks: In most applications, the outlier scores of all the data points are not required. It may suffice to return binary labels for the top-r outliers, together with their scores. The outlier scores of the remaining data points are irrelevant. In such cases, it may be possible to terminate a k-nearest neighbor sequential scan for an outlier candidate when its current upper bound estimate on the k-nearest neighbor distance value falls below the rth best outlier score found so far. This is because such a candidate is guaranteed to be not among the top-r outliers. This methodology is referred to as the “early termination trick,” and it is described in detail later in this section.

In some cases, it is also possible to combine the pruning approach with index structures.


8.5.1 Pruning Methods


Pruning methods are used only for the case where the top-r ranked outliers need to be returned, and the outlier scores of the remaining data points are irrelevant. Thus, pruning methods can be used only for the binary-decision version of the algorithm. The basic idea in pruning methods is to reduce the time required for the k-nearest neighbor distance computations by ruling out data points quickly that are obviously nonoutliers even with approximate computation.



8.5.1.1 Sampling Methods



















The first step is to pick a sample S of size s n from the data D, and compute all pairwise




distances between the data points in sample S and

those in database

D

. There are a total







2










of n · s such pairs. This process requires O(n · s) O(n




) distance computations. Thus, for




each of the sampled points in S, the k-nearest neighbor distance is already known exactly. number of outliers



The top rth ranked outlier in sample S is determined, where r is the




1

on the rth




to be returned. The score of the rth rank outlier provides a lower bound




L




ranked outlier score over the entire data set


Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   149   150   151   152   153   154   155   156   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin