Data Mining: The Textbook



$$\mathrm{TPR}(t) = \mathrm{Recall}(t) = 100 \cdot \frac{|S(t) \cap G|}{|G|}$$

260 CHAPTER 8. OUTLIER ANALYSIS

Table 8.1: ROC construction with rank of ground-truth outliers

Algorithm           Rank of ground-truth outliers
Algorithm A         1, 5, 8, 15, 20
Algorithm B         3, 7, 11, 13, 15
Random Algorithm    17, 36, 45, 59, 66
Perfect Oracle      1, 2, 3, 4, 5

The false positive rate FPR(t) is the percentage of the falsely reported positives out of the ground-truth negatives. Therefore, for a data set D with ground-truth positives G, this measure is defined as follows:

$$\mathrm{FPR}(t) = 100 \cdot \frac{|S(t) - G|}{|D - G|} \tag{8.16}$$
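The two rates can be computed directly from the ranks of the ground-truth outliers. The following is a minimal sketch, assuming that S(t) is the set of points whose rank is at most t (rank 1 being the most outlying); the function name `tpr_fpr` and its parameters are illustrative and do not come from the text.

```python
def tpr_fpr(outlier_ranks, n_points, t):
    """Return (TPR, FPR) in percent when the top-t ranked points are reported.

    outlier_ranks: ranks (1 = most outlying) of the ground-truth outliers G.
    n_points: size of the data set D.
    t: rank threshold; S(t) is assumed to be the points with rank <= t.
    """
    n_outliers = len(outlier_ranks)
    true_pos = sum(1 for r in outlier_ranks if r <= t)  # |S(t) intersect G|
    false_pos = t - true_pos                            # |S(t) - G|
    tpr = 100.0 * true_pos / n_outliers                 # divide by |G|
    fpr = 100.0 * false_pos / (n_points - n_outliers)   # divide by |D - G|
    return tpr, fpr


# Hypothetical usage: five outliers among 100 points, threshold t = 10.
ranks = [1, 5, 8, 15, 20]
print(tpr_fpr(ranks, n_points=100, t=10))
```

With these ranks, three of the five outliers fall within the top 10, so the TPR is 60 and the FPR is 100 * 7/95, or about 7.4.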

The ROC curve is defined by plotting FPR(t) on the X-axis and TPR(t) on the Y-axis for varying values of t. Note that the end points of the ROC curve are always at (0, 0) and (100, 100), and a random method is expected to exhibit performance along the diagonal line connecting these points. The lift obtained above this diagonal line provides an idea of the accuracy of the approach. The area under the ROC curve provides a concrete quantitative evaluation of the effectiveness of a particular method.
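As a rough illustration of this construction, the sketch below sweeps a rank threshold t from 0 to the data set size, records one (FPR, TPR) point per threshold, and integrates the curve with the trapezoidal rule. It assumes that the method outputs a ranking and that the threshold is applied to the rank; the helper names `roc_points` and `area_under_curve` are hypothetical, and the area is rescaled from the percentage scale to [0, 1].

```python
def roc_points(outlier_ranks, n_points):
    """List of (FPR, TPR) pairs, in percent, for thresholds t = 0..n_points."""
    outliers = set(outlier_ranks)
    n_pos = len(outliers)
    n_neg = n_points - n_pos
    points = [(0.0, 0.0)]  # t = 0: nothing is reported
    true_pos = false_pos = 0
    for rank in range(1, n_points + 1):
        if rank in outliers:
            true_pos += 1
        else:
            false_pos += 1
        points.append((100.0 * false_pos / n_neg, 100.0 * true_pos / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under the curve, rescaled from percent^2 to [0, 1]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / (100.0 * 100.0)


# A ranking that places all five outliers first traces the ideal curve.
ideal = roc_points([1, 2, 3, 4, 5], n_points=100)
print(ideal[0], ideal[-1])       # end points (0, 0) and (100, 100)
print(area_under_curve(ideal))   # close to 1.0, up to floating-point rounding
```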

To illustrate the insights gained from these different graphical representations, consider an example of a data set with 100 points, of which 5 are outliers. Two algorithms, A and B, are applied to this data set. Each ranks all data points from 1 to 100, with a lower rank representing a greater propensity to be an outlier. Thus, the true-positive rate and false-positive rate values can be generated by determining the ranks of the 5 ground-truth outlier points. Table 8.1 shows some hypothetical ranks of the five ground-truth outliers for the different algorithms. In addition, it shows the ranks of the ground-truth outliers for a random algorithm, which outputs a random outlier score for each data point. Similarly, it shows the ranks for a "perfect oracle" algorithm, which ranks the correct top 5 points as outliers. The corresponding ROC curves are illustrated in Fig. 8.9.
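The step from outlier scores to the ranks of the ground-truth outliers can be sketched as follows. This is only an illustration under stated assumptions: a higher score is taken to mean "more outlying", the outlier indices in `outlier_ids` are made up, and the random scorer uses a fixed seed, so its ranks will not match the hypothetical ones in Table 8.1.

```python
import random


def ground_truth_ranks(scores, outlier_ids):
    """Rank points by descending score (rank 1 = most outlying) and
    return the sorted ranks of the ground-truth outliers."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    rank_of = {point: pos + 1 for pos, point in enumerate(order)}
    return sorted(rank_of[i] for i in outlier_ids)


n_points = 100
outlier_ids = [3, 27, 41, 68, 90]  # hypothetical ground-truth outliers

# Random algorithm: one random outlier score per data point.
rng = random.Random(0)
random_scores = [rng.random() for _ in range(n_points)]
random_ranks = ground_truth_ranks(random_scores, outlier_ids)

# Perfect oracle: the ground-truth outliers receive the highest scores.
oracle_scores = [1.0 if i in outlier_ids else 0.0 for i in range(n_points)]
oracle_ranks = ground_truth_ranks(oracle_scores, outlier_ids)

print(random_ranks)
print(oracle_ranks)  # [1, 2, 3, 4, 5]
```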

What do these curves really tell us? When one curve strictly dominates another, the algorithm for the dominating curve is clearly superior. For example, it is immediately evident that the oracle algorithm is superior to all other algorithms, and that the random algorithm is inferior to all the other algorithms. On the other hand, algorithms A and B each dominate on different parts of the ROC curve. In such cases, it is hard to say that one algorithm is strictly superior. From Table 8.1, it is clear that Algorithm A ranks three of the ground-truth outliers very highly (ranks 1, 5, and 8), but the remaining two outliers are ranked poorly (ranks 15 and 20). In the case of Algorithm B, the highest-ranked outliers are not as well ranked as in the case of Algorithm A, though all five outliers are found by a rank threshold of 15, compared with 20 for Algorithm A. Correspondingly, Algorithm A dominates on the earlier part of the ROC curve, whereas Algorithm B dominates on the later part. Some practitioners use the area under the ROC curve as a proxy for the overall effectiveness of the algorithm, though such a measure should be used very carefully because not all parts of the ROC curve are equally important for every application.
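The caution about the area under the curve can be checked on the ranks of Table 8.1. The sketch below computes the area as the fraction of (outlier, non-outlier) pairs in which the outlier is ranked first, which equals the area under the ROC curve when there are no tied ranks. The function names are illustrative, and the areas are on a [0, 1] scale rather than the percentage scale. The ranks of A and B both sum to 49, so under this calculation the two algorithms receive the same area, even though A has the higher recall at rank threshold 8 and B has the higher recall at rank threshold 15.

```python
def auc_from_ranks(outlier_ranks, n_points):
    """AUC in [0, 1]: fraction of (outlier, non-outlier) pairs ordered correctly."""
    ranks = sorted(outlier_ranks)
    n_pos = len(ranks)
    n_neg = n_points - n_pos
    # The i-th best-ranked outlier (i = 1, 2, ...) has (rank - i)
    # non-outliers ranked above it; each such pair is misordered.
    misordered = sum(r - i for i, r in enumerate(ranks, start=1))
    return 1.0 - misordered / (n_pos * n_neg)


def recall_at(outlier_ranks, t):
    """TPR in percent when the top-t ranked points are reported."""
    return 100.0 * sum(1 for r in outlier_ranks if r <= t) / len(outlier_ranks)


table_8_1 = {
    "Algorithm A": [1, 5, 8, 15, 20],
    "Algorithm B": [3, 7, 11, 13, 15],
    "Random Algorithm": [17, 36, 45, 59, 66],
    "Perfect Oracle": [1, 2, 3, 4, 5],
}

for name, ranks in table_8_1.items():
    print(name,
          round(auc_from_ranks(ranks, 100), 4),
          recall_at(ranks, 8),
          recall_at(ranks, 15))
```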
