D
|
. For the data points in
|
D − S
|
, only an upper
|
|
bound V
|
k
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
(X) on the k-nearest neighbor distance is known. This upper bound is equal to
|
|
the k-nearest neighbor distance of each point in
|
D − S
|
to the sample
|
S ⊂ D
|
. However,
|
|
if this upper bound V
|
k
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
(X) is no larger than the lower bound L already determined, then
|
|
such a data point X ∈ D − S can be excluded from further consideration as a top-r outlier. Typically, this will result in the removal of a large number of outlier candidates from D − S immediately, as long as the underlying data set is clustered well. This is because most of the data points in clusters will be removed, as long as at least one point from each cluster is included in the sample S, and at least r points in S are located in somewhat sparse regions. This can often be achieved with modest values of the sample size s in real-world data sets. After removing these data points from D − S, the remaining set of points is R ⊆ D − S. The k-nearest neighbor approach can be applied to a much smaller set of candidates R.
Note that higher k-nearest neighbor distances indicate greater outlierness.
250 CHAPTER 8. OUTLIER ANALYSIS
The top-r ranked outliers in R ∪ S are returned as the final output. Depending on the level of pruning already achieved, this can result in a very significant reduction in computational time, especially when |R ∪ S| |D| .
8.5.1.2 Early Termination Trick with Nested Loops
The approach discussed in the previous section can be improved even further by speeding up the second phase of computing the k-nearest neighbor distances of each data point in R. The idea is that the computation of the k-nearest neighbor distance of any data point X ∈ R need not be followed through to termination once it has been determined that X cannot possibly be among the top-r outliers. In such cases, the scan of the database D for computation of the k-nearest neighbor of X can be terminated early.
Note that one already has an estimate (upper bound) V k(X) of the k-nearest neighbor distance of every X ∈ R, based on distances to sample S. Furthermore, the k-nearest neigh-bor distance of the rth best outlier in S provides a lower bound on the “cut-off” required to make it to the top-r outliers. This lower-bound is denoted by L. This estimate V k(X) of the k-nearest neighbor distance of X is further tightened (reduced) as the database D − S is scanned, and the distance of X is computed to each point in D − S. Because this running estimate V k(X) is always an upper bound on the true k-nearest neighbor distance of X, the process of determining the k -nearest neighbor of X can be terminated as soon as V k(X) falls below the known lower bound L on the top-r outlier distance. This is referred to as early termination and provides significant computational savings. Then, the next data point in R can be processed. In cases where early termination is not achieved, the data point X will almost2 always be among the top-r (current) outliers. Therefore, in this case, the lower bound L can be tightened (increased) as well, to the new rth best outlier score. This will result in even better pruning when the next data point from R is processed to determine its k-nearest neighbor distance value. To maximize the benefits of pruning, the data points in R should not be processed in arbitrary order. Rather, they should be processed in decreas-ing order of the initially sampled estimate V k(·) of the k-nearest neighbor distances (based on S). This ensures that the outliers in R are found early on, and the global bound L is tightened as fast as possible for even better pruning. Furthermore, in the inner loop, the data points Y in D − S can be ordered in the opposite direction, based on increasing value of V k(Y ). Doing so ensures that the k-nearest neighbor distances are updated as fast as possible, and the advantage of early termination is maximized. The nested loop approach can also be implemented without the first phase3 of sampling, but such an approach will not have the advantage of proper ordering of the data points processed. Starting with an initial lower bound L on the rth best outlier score obtained from the sampling phase, the nested loop is executed as follows:
2We say “almost,” because the very last distance computation for X may bring V (X) below L. This scenario is unusual, but might occasionally occur.
Most descriptions in the literature omit the first phase of sampling, which is very important for efficiency maximization. A number of implementations in time-series analysis [306] do order the data points more carefully but not with sampling.
Dostları ilə paylaş: |