Data Mining: The Textbook

Date: 07.01.2024

\[
\text{Precision}(t) = 100 \cdot \frac{|S(t) \cap G|}{|S(t)|}
\]

The value of Precision(t) is not necessarily monotonic in t because both the numerator and denominator may change with t differently. The recall is correspondingly defined as the percentage of ground-truth positives that have been reported as positives at threshold t.

\[
\text{Recall}(t) = 100 \cdot \frac{|S(t) \cap G|}{|G|}
\]
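The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `precision_recall`, the example sets, and the convention of returning a precision of 0 when nothing is reported (where the formula is undefined) are all assumptions made here. The ground-truth set G is assumed to be non-empty.

```python
def precision_recall(predicted, ground_truth):
    """Return (precision, recall) in percent.

    predicted    -- S(t), the points reported as positive at threshold t
    ground_truth -- G, the true positives (assumed non-empty)
    """
    predicted = set(predicted)
    ground_truth = set(ground_truth)
    hits = len(predicted & ground_truth)  # |S(t) ∩ G|
    # Precision is undefined when S(t) is empty; returning 0.0 is an
    # assumption made for this sketch.
    precision = 100.0 * hits / len(predicted) if predicted else 0.0
    recall = 100.0 * hits / len(ground_truth)
    return precision, recall


# Hypothetical point identifiers, chosen only for illustration.
S_t = {1, 2, 3, 7}      # reported positives
G = {1, 3, 5, 8, 9}     # ground-truth positives
print(precision_recall(S_t, G))  # (50.0, 40.0)
```

Two of the four reported points are true positives, so the precision is 50, and two of the five ground-truth positives are recovered, so the recall is 40.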

While a natural trade-off exists between precision and recall, this trade-off is not necessarily monotonic. One way of creating a single measure that summarizes both precision and recall is the F₁-measure, which is the harmonic mean of the precision and the recall.

\[
F_1(t) = \frac{2 \cdot \text{Precision}(t) \cdot \text{Recall}(t)}{\text{Precision}(t) + \text{Recall}(t)} \tag{10.81}
\]
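As a small sketch of Equation 10.81, the harmonic mean can be computed directly from a precision and recall pair. The function name `f1_measure` and the choice to return 0 when both inputs are 0 (where the formula is undefined) are assumptions of this example.

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        # The formula is undefined here; 0.0 is an assumed convention.
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical values: precision of 50 and recall of 40.
print(round(f1_measure(50.0, 40.0), 2))  # 44.44
print(f1_measure(60.0, 60.0))            # 60.0
```

Because the harmonic mean is pulled toward the smaller of its two arguments, a method cannot obtain a high F₁ value by maximizing only one of precision or recall.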

While the F₁(t) measure provides a better quantification than either precision or recall alone, it still depends on the threshold t, and is therefore not a complete representation of the trade-off between precision and recall. The entire trade-off can be examined visually by varying the value of t and plotting the precision against the recall. As shown later with an example, the lack of monotonicity of the precision makes the results harder to interpret intuitively.
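The lack of monotonicity can be seen in a small sketch that sweeps the threshold over a ranked list. It assumes that the method outputs a ranking, that the threshold t is the number of top-ranked points reported as positive, and that the ranks of the positives are the hypothetical values given in the code; none of these choices come from the text.

```python
def precision_recall_curve(positive_ranks, n_points):
    """Return a list of (t, precision, recall), with t = 1 .. n_points.

    At threshold t, the t top-ranked points are reported as positive.
    """
    positives = set(positive_ranks)
    curve = []
    hits = 0  # number of ground-truth positives among the top t
    for t in range(1, n_points + 1):
        if t in positives:
            hits += 1
        precision = 100.0 * hits / t
        recall = 100.0 * hits / len(positives)
        curve.append((t, precision, recall))
    return curve


# Hypothetical ranking of 10 points with positives at ranks 1, 4, and 5.
for t, p, r in precision_recall_curve([1, 4, 5], 10):
    print(f"t={t:2d}  precision={p:6.2f}  recall={r:6.2f}")
```

In this example the precision falls from 100 at t = 1 to about 33 at t = 3, rises again to 60 at t = 5, and then declines, whereas the recall never decreases as t grows.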

A second way of generating the trade-off in a more intuitive way is through the use of the ROC curve. The true-positive rate, which is the same as the recall, is defined as the percentage of ground-truth positives that have been predicted as positive instances at threshold t.

\[
\text{TPR}(t) = \text{Recall}(t) = 100 \cdot \frac{|S(t) \cap G|}{|G|}
\]

The false-positive rate FPR(t) is the percentage of the falsely reported positives out of the ground-truth negatives. Therefore, for a data set D with ground-truth positives G, this measure is defined as follows:

\[
\text{FPR}(t) = 100 \cdot \frac{|S(t) - G|}{|D - G|} \tag{10.82}
\]
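Both rates can be computed with plain set operations, as in the following sketch. The function name `tpr_fpr` and the example data are hypothetical, and the code assumes that the data set contains at least one positive and one negative so that both denominators are nonzero.

```python
def tpr_fpr(predicted, ground_truth, all_points):
    """Return (TPR, FPR) in percent for reported positives S(t)."""
    S, G, D = set(predicted), set(ground_truth), set(all_points)
    negatives = D - G                          # ground-truth negatives
    tpr = 100.0 * len(S & G) / len(G)          # same as the recall
    fpr = 100.0 * len(S - G) / len(negatives)  # false alarms among negatives
    return tpr, fpr


# Hypothetical data set of 25 points with 5 ground-truth positives.
D = range(1, 26)
G = {1, 3, 5, 8, 9}
S_t = {1, 2, 3, 7}
print(tpr_fpr(S_t, G, D))  # (40.0, 10.0)
```

Here two of the 20 ground-truth negatives are falsely reported, which gives a false-positive rate of 10.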

The ROC curve is defined by plotting FPR(t) on the X-axis and TPR(t) on the Y-axis for varying values of t. Note that the end points of the ROC curve are always at (0, 0) and (100, 100), and a random method is expected to exhibit performance along the diagonal line connecting these points. The lift obtained above this diagonal line provides an idea of the accuracy of the approach. The area under the ROC curve provides a concrete quantitative evaluation of the effectiveness of a particular method.
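A minimal sketch of the ROC construction is given below. It assumes that the method outputs a ranking of the points, that the threshold is swept over the rank positions, and that the area is estimated with the trapezoidal rule and rescaled to the interval [0, 1]. The function names and the example ranks are hypothetical.

```python
def roc_points(positive_ranks, n_points):
    """Return the ROC curve as a list of (FPR, TPR) pairs in percent."""
    positives = set(positive_ranks)
    n_pos = len(positives)
    n_neg = n_points - n_pos
    points = [(0.0, 0.0)]  # nothing is reported as positive
    tp = fp = 0
    for rank in range(1, n_points + 1):
        if rank in positives:
            tp += 1
        else:
            fp += 1
        points.append((100.0 * fp / n_neg, 100.0 * tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under the curve, rescaled to lie in [0, 1]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / (100.0 * 100.0)


# Hypothetical ranking of 10 points with positives at ranks 1, 2, and 5.
curve = roc_points([1, 2, 5], 10)
print(curve[0], curve[-1])  # (0.0, 0.0) (100.0, 100.0)
print(round(area_under_curve(curve), 3))

# A ranking that places all positives first attains the maximum area.
print(round(area_under_curve(roc_points([1, 2, 3], 10)), 3))  # 1.0
```

The curve always starts at (0, 0) and ends at (100, 100). A ranking that places every positive before every negative yields an area of 1, whereas a random ranking is expected to yield an area close to 0.5.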

To illustrate the insights gained from these different graphical representations, consider an example of a data set with 100 points, of which 5 belong to the positive class. Two algorithms A and B are applied to this data set, each of which ranks all data points from 1 to 100, with a lower rank representing a greater propensity to belong to the positive class. Thus, the true-positive rate and false-positive rate values can be generated by determining the ranks of the five ground-truth positive points. In Table 10.2, some hypothetical ranks for the
