Data Mining: The Textbook



Date: 07.01.2024

$$\text{Precision}(t) = 100 \cdot \frac{|S(t) \cap G|}{|S(t)|}$$

The value of Precision(t) is not necessarily monotonic in t because both the numerator and the denominator may change with t differently. The recall is correspondingly defined as the percentage of ground-truth positives that have been reported as positives at threshold t.




$$\text{Recall}(t) = 100 \cdot \frac{|S(t) \cap G|}{|G|}$$
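The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the book's code: the function name `precision_recall`, the toy scores, and the convention that a point is reported as positive when its score is at least t are all assumptions made for the example.

```python
def precision_recall(scores, positives, t):
    """Return (precision, recall) in percent at threshold t.

    scores    -- dict mapping a point id to its score (assumed: higher
                 score means more likely positive)
    positives -- set of ground-truth positive ids, i.e. G
    """
    # S(t): points reported as positive (assumed rule: score >= t)
    reported = {i for i, s in scores.items() if s >= t}
    hits = len(reported & positives)  # |S(t) ∩ G|
    # Precision is undefined when nothing is reported.
    precision = 100.0 * hits / len(reported) if reported else float("nan")
    recall = 100.0 * hits / len(positives)
    return precision, recall


# Hypothetical toy data: five scored points, two of them truly positive.
scores = {"a": 0.9, "b": 0.8, "c": 0.6, "d": 0.4, "e": 0.2}
G = {"a", "c"}
p, r = precision_recall(scores, G, t=0.5)
print(f"precision={p:.2f}%  recall={r:.2f}%")
```

With t = 0.5 the reported set is {a, b, c}, of which two points are true positives, so the precision is two thirds of 100 while the recall is 100.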

While a natural trade-off exists between precision and recall, this trade-off is not necessarily monotonic. One way of creating a single measure that summarizes both precision and recall is the F1-measure, which is the harmonic mean between the precision and the recall.





$$F_1(t) = \frac{2 \cdot \text{Precision}(t) \cdot \text{Recall}(t)}{\text{Precision}(t) + \text{Recall}(t)} \tag{10.81}$$
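As a small sketch of Equation 10.81, the harmonic mean can be computed directly from a precision and recall pair. The function name `f1` and the choice to return 0 when both inputs are 0 are assumptions made for this illustration; the formula itself is undefined in that case.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        # Assumed convention: the ratio is undefined here, report 0.
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# A precision of 50% with a recall of 100% gives an F1 below their
# arithmetic mean of 75%, because the harmonic mean is pulled toward
# the smaller of the two values.
print(round(f1(50.0, 100.0), 2))
```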













While the F1(t) measure provides a better quantification than either precision or recall alone, it still depends on the threshold t, and is therefore not a complete representation of the trade-off between precision and recall. The entire trade-off can be examined visually by varying the value of t and plotting the precision against the recall. As shown later with an example, the lack of monotonicity of the precision makes the results harder to interpret intuitively.
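The threshold sweep can be sketched as follows, assuming the threshold is expressed as a cutoff k on a ranked list, so that the top k points form S(t). The function name `pr_curve` and the label sequence are hypothetical and chosen only to make the non-monotonic behavior of the precision visible.

```python
def pr_curve(ranked_labels):
    """Return (recall, precision) pairs in percent, one per cutoff k.

    ranked_labels -- 0/1 ground-truth labels, ordered from the most to
                     the least likely positive according to the algorithm
    """
    total_pos = sum(ranked_labels)  # |G|
    hits = 0                        # |S(t) ∩ G| for the current cutoff
    points = []
    for k, label in enumerate(ranked_labels, start=1):
        hits += label
        recall = 100.0 * hits / total_pos
        precision = 100.0 * hits / k  # |S(t)| = k at this cutoff
        points.append((recall, precision))
    return points


# Hypothetical ranking: positives at ranks 1, 3, and 4.
for recall, precision in pr_curve([1, 0, 1, 1, 0]):
    print(f"recall={recall:6.2f}%  precision={precision:6.2f}%")
```

In this toy ranking the recall never decreases as the cutoff grows, whereas the precision falls from 100 to 50, rises to 75, and then falls again to 60.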


A second way of generating the trade-off in a more intuitive way is through the use of the ROC curve. The true-positive rate, which is the same as the recall, is defined as the percentage of ground-truth positives that have been predicted as positive instances at threshold t.




$$\text{TPR}(t) = \text{Recall}(t) = 100 \cdot \frac{|S(t) \cap G|}{|G|}$$

The false-positive rate FPR(t) is the percentage of falsely reported positives out of the ground-truth negatives. Therefore, for a data set D with ground-truth positives G, this measure is defined as follows:





$$\text{FPR}(t) = 100 \cdot \frac{|S(t) - G|}{|D - G|} \tag{10.82}$$
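Both rates follow directly from set operations on S(t), G, and D. The sketch below is a minimal illustration; the function name `tpr_fpr` and the ten-point data set are assumptions, not taken from the book.

```python
def tpr_fpr(reported, positives, all_points):
    """Return (TPR, FPR) in percent for one reported set S(t).

    reported   -- S(t), the points declared positive at threshold t
    positives  -- G, the ground-truth positives
    all_points -- D, the full data set
    """
    negatives = all_points - positives                         # D − G
    tpr = 100.0 * len(reported & positives) / len(positives)   # |S(t) ∩ G| / |G|
    fpr = 100.0 * len(reported - positives) / len(negatives)   # |S(t) − G| / |D − G|
    return tpr, fpr


# Hypothetical data: 10 points, 4 positives, 3 points reported.
D = set(range(10))
G = {0, 1, 2, 3}
S = {0, 1, 4}
tpr, fpr = tpr_fpr(S, G, D)
print(f"TPR={tpr:.2f}%  FPR={fpr:.2f}%")
```

Here two of the four positives are reported, and one of the six negatives is reported falsely.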




The ROC curve is defined by plotting FPR(t) on the X-axis and TPR(t) on the Y-axis for varying values of t. Note that the end points of the ROC curve are always at (0, 0) and (100, 100), and a random method is expected to exhibit performance along the diagonal line connecting these points. The lift obtained above this diagonal line provides an idea of the accuracy of the approach. The area under the ROC curve provides a concrete quantitative evaluation of the effectiveness of a particular method.
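One way to trace the curve and its area is to walk down a ranked list, moving up for each positive and right for each negative. The sketch below makes several assumptions: the names `roc_points` and `auc` are hypothetical, the ranking has no ties, both classes are present, and the area is obtained with the trapezoidal rule and rescaled to the range [0, 1].

```python
def roc_points(ranked_labels):
    """Return ROC points (FPR, TPR) in percent for every cutoff.

    ranked_labels -- 0/1 ground-truth labels ordered from the most to
                     the least likely positive (assumed free of ties)
    """
    n_pos = sum(ranked_labels)
    n_neg = len(ranked_labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]  # nothing reported yet
    for label in ranked_labels:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((100.0 * fp / n_neg, 100.0 * tp / n_pos))
    return points


def auc(points):
    """Trapezoidal area under the ROC points, rescaled to [0, 1]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / (100.0 * 100.0)


# Hypothetical ranking: positives at ranks 1, 2, and 4 out of 6 points.
curve = roc_points([1, 1, 0, 1, 0, 0])
print(curve[0], curve[-1])
print(round(auc(curve), 4))
```

The first and last points are (0, 0) and (100, 100), as stated above. The area for this toy ranking equals 8/9, which matches the fraction of positive-negative pairs in which the positive point is ranked ahead of the negative one.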


To illustrate the insights gained from these different graphical representations, consider an example of a data set with 100 points from which 5 points belong to the positive class. Two algorithms A and B are applied to this data set that rank all data points from 1 to 100, with lower rank representing greater propensity to belong to the positive class. Thus, the true-positive rate and false-positive rate values can be generated by determining the ranks of the five ground-truth positive label points. In Table 10.2, some hypothetical ranks for the



