Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə160/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   156   157   158   159   160   161   162   163   ...   423
1-Data Mining tarjima

T P R(t) = Recall(t) = 100 |S(t) ∩ G|


|G|

260 CHAPTER 8. OUTLIER ANALYSIS


Table 8.1: ROC construction with rank of ground-truth outliers



Algorithm


Rank of ground-truth outliers



Algorithm A

1, 5, 8, 15,

20

Algorithm B

3, 7, 11, 13, 15

Random Algorithm

17, 36, 45, 59, 66

Perfect Oracle

1, 2, 3, 4,

5

The false positive rate F P R(t) is the percentage of the falsely reported positives out of the ground-truth negatives. Therefore, for a data set D with ground-truth positives G, this measure is defined as follows:





F P R(t) = 100

|S(t) − G|

.

(8.16)




|D−G|




The ROC curve is defined by plotting the F P R(t) on the X-axis, and T P R(t) on the Y -axis for varying values of t. Note that the end points of the ROC curve are always at (0, 0) and (100, 100), and a random method is expected to exhibit performance along the diagonal line connecting these points. The lift obtained above this diagonal line provides an idea of the accuracy of the approach. The area under the ROC curve provides a concrete quantitative evaluation of the effectiveness of a particular method.


To illustrate the insights gained from these different graphical representations, consider an example of a data set with 100 points from which 5 points are outliers. Two algorithms, A and B, are applied to this data set that rank all data points from 1 to 100, with lower rank representing greater propensity to be an outlier. Thus, the true-positive rate and false-positive rate values can be generated by determining the ranks of the 5 ground-truth outlier points. In Table 8.1, some hypothetical ranks for the five ground-truth outliers have been illustrated for the different algorithms. In addition, the ranks of the ground-truth outliers for a random algorithm have been indicated. The random algorithm outputs a random outlier score for each data point. Similarly, the ranks for a “perfect oracle” algorithm, which ranks the correct top 5 points as outliers, have also been illustrated in the table. The corresponding ROC curves are illustrated in Fig. 8.9.


What do these curves really tell us? For cases in which one curve strictly dominates another, it is clear that the algorithm for the former curve is superior. For example, it is immediately evident that the oracle algorithm is superior to all algorithms, and the random algorithm is inferior to all the other algorithms. On the other hand, algorithms A and B show domination at different parts of the ROC curve. In such cases, it is hard to say that one algorithm is strictly superior. From Table 8.1, it is clear that Algorithm A, ranks three of the correct ground-truth outliers very highly, but the remaining two outliers are ranked poorly. In the case of Algorithm B, the highest ranked outliers are not as well ranked as the case of Algorithm A, though all five outliers are determined much earlier in terms of rank threshold. Correspondingly, Algorithm A dominates on the earlier part of the ROC curve whereas Algorithm B dominates on the later part. Some practitioners use the area under the ROC curve as a proxy for the overall effectiveness of the algorithm, though such a measure should be used very carefully because all parts of the ROC curve may not be equally important for different applications.






Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   156   157   158   159   160   161   162   163   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin