Data Mining: The Textbook




8.8. OUTLIER VALIDITY


enough to make these criteria unusable for outlier analysis. The reader is advised to refer to Sect. 6.9.1 of Chap. 6 for a discussion of the challenges of internal cluster validity. Most of these challenges are related to the fact that cluster validity criteria are derived from the objective function criteria of clustering algorithms. Therefore, a particular validity measure will favor (or overfit) a clustering algorithm using a similar objective function criterion. These problems become magnified in outlier analysis because of the small sample solution space. A model only needs to be correct on a few outlier data points to be considered a good model. Therefore, the overfitting of internal validity criteria, which is significant even in clustering, becomes even more problematic in outlier analysis. As a specific example, if one used the k-nearest neighbor distance as an internal validity measure, then a pure distance-based outlier detector will always outperform a locally normalized detector such as LOF. This is, of course, not consistent with known experience in real settings, where LOF usually provides more meaningful results. One can try to reduce the overfitting effect by designing a validity measure which is different from the outlier detection models being compared. However, this is not a satisfactory solution because significant uncertainty always remains about the impact of hidden interrelationships between such measures and outlier detection models. The main problem with internal measures is that the relative bias in evaluation of various algorithms is consistently present, even when the data set is varied. A biased selection of internal measures can easily be abused in algorithm benchmarking.
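The k-nearest neighbor example can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy data, the helper names, and in particular the "locally normalized" detector, which is a simplified stand-in (k-NN distance divided by the average k-NN distance of the point's neighbors) and not the actual LOF definition. The point is only that a detector ranking by k-NN distance cannot lose under an internal measure built from the same quantity.

```python
import math


def knn_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor (excluding itself)."""
    dists = sorted(math.dist(points[i], p) for j, p in enumerate(points) if j != i)
    return dists[k - 1]


def knn_neighbors(points, i, k):
    """Indices of the k nearest neighbors of points[i]."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]


def top_m(scores, m):
    """Indices of the m highest-scoring points."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:m]


# Toy data (hypothetical): a dense cluster, a sparse cluster, and one point
# lying just outside the dense cluster.
dense = [(0.1 * x, 0.1 * y) for x in range(4) for y in range(4)]
sparse = [(10.0 + 2.0 * x, 2.0 * y) for x in range(3) for y in range(3)]
points = dense + sparse + [(0.9, 0.9)]
k, m = 3, 3

# Detector A: raw k-NN distance.
knn_scores = [knn_distance(points, i, k) for i in range(len(points))]

# Detector B: k-NN distance divided by the average k-NN distance of the
# point's neighbors (a simplified local normalization, not real LOF).
local_scores = [
    knn_scores[i] / (sum(knn_scores[j] for j in knn_neighbors(points, i, k)) / k)
    for i in range(len(points))
]


def internal_measure(flagged):
    """Internal 'validity': mean k-NN distance of the flagged points."""
    return sum(knn_scores[i] for i in flagged) / len(flagged)


measure_knn = internal_measure(top_m(knn_scores, m))
measure_local = internal_measure(top_m(local_scores, m))
print("k-NN detector:", measure_knn)
print("locally normalized detector:", measure_local)
```

In this toy setting the locally normalized detector ranks the point next to the dense cluster first, while the raw k-NN detector flags only points of the sparse cluster. The internal measure nevertheless prefers the k-NN detector, because that detector selects by construction the points that maximize the measure.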


Internal measures are almost never used in outlier analysis, although they are often used in clustering evaluation. Even in clustering, the use of internal validity measures is questionable in spite of their wider acceptance. Therefore, most of the validity measures used for outlier analysis are based on external measures such as the Receiver Operating Characteristic curve.


8.8.2 Receiver Operating Characteristic


Outlier detection algorithms are typically evaluated with the use of external measures where the known outlier labels from a synthetic data set or the rare class labels from a real data set are used as the ground-truth. This ground-truth is compared systematically with the outlier score to generate the final output. While such rare classes may not always reflect all the natural outliers in the data, the results are usually reasonably representative of algorithm quality, when evaluated over many data sets.


In outlier detection models, a threshold is typically used on the outlier score to generate a binary label. If the threshold is picked too restrictively to minimize the number of declared outliers, then the algorithm will miss true outlier points (false-negatives). On the other hand, if the threshold is chosen in a more relaxed way, this will lead to too many false-positives. This leads to a trade-off between the false-positives and false-negatives. The problem is that the “correct” threshold to use is never known exactly in a real scenario. However, this entire trade-off curve can be generated, and various algorithms can be compared over the entire trade-off curve. One example of such a curve is the Receiver Operating Characteristic (ROC) curve.
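The trade-off can be made concrete with a minimal sketch. The scores and labels below are made-up illustrative values, not data from the text, and the helper name `count_errors` is hypothetical. A point is assumed to be declared an outlier when its score is at least the threshold.

```python
# Hypothetical outlier scores (higher = more outlying) and ground-truth labels.
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.30, 0.20, 0.10]
is_outlier = [True, False, True, False, False, True, False, False]


def count_errors(scores, labels, threshold):
    """Return (false_positives, false_negatives) at the given threshold."""
    false_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    false_neg = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    return false_pos, false_neg


# Sweep from a restrictive (high) threshold to a relaxed (low) one.
tradeoff = [(t, *count_errors(scores, is_outlier, t)) for t in (0.9, 0.6, 0.25, 0.0)]
for t, fp, fn in tradeoff:
    print(f"threshold={t:.2f}  false positives={fp}  false negatives={fn}")
```

On this toy input the restrictive threshold 0.9 gives no false-positives but misses two outliers, while the relaxed threshold 0.25 finds every outlier at the cost of three false-positives.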


For any given threshold t on the outlier score, the declared outlier set is denoted by S(t). As t changes, the size of S(t) changes as well. Let G represent the true set (ground-truth set) of outliers in the data set. The true-positive rate, which is also referred to as the recall, is defined as the percentage of ground-truth outliers that have been reported as outliers at threshold t.
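The definition above can be written as a short sketch. It assumes that a point belongs to S(t) when its score is at least t, since the text here does not fix the direction of the comparison. The function names `declared_outliers` and `true_positive_rate` and the sample data are hypothetical.

```python
def declared_outliers(scores, t):
    """S(t): indices of points whose outlier score is at least t (assumed convention)."""
    return {i for i, s in enumerate(scores) if s >= t}


def true_positive_rate(scores, ground_truth, t):
    """Recall at threshold t: percentage of ground-truth outliers G found in S(t)."""
    if not ground_truth:
        raise ValueError("ground-truth set G must be non-empty")
    s_t = declared_outliers(scores, t)
    return 100.0 * len(s_t & ground_truth) / len(ground_truth)


# Illustrative values only: 8 points, of which indices 0, 2 and 5 are true outliers.
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.30, 0.20, 0.10]
G = {0, 2, 5}

for t in (0.9, 0.6, 0.25):
    print(t, len(declared_outliers(scores, t)), true_positive_rate(scores, G, t))
```

Lowering t enlarges S(t), so the true-positive rate can only stay the same or increase as the threshold is relaxed.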




