Data Mining: The Textbook




Accuracy: The accuracy is the fraction of test instances in which the predicted value matches the ground-truth value.




  1. Cost-sensitive accuracy: Not all classes are equally important in every scenario when comparing accuracy. This is particularly relevant in imbalanced-class problems, which are discussed in more detail in the next chapter. For example, consider an application in which tumors are classified as malignant or nonmalignant, where the former is much rarer than the latter. In such cases, misclassifying a malignant tumor is usually far more costly than misclassifying a nonmalignant one. This is frequently quantified by imposing differential costs c1 . . . ck on the misclassification of the different classes. Let n1 . . . nk be the number of test instances belonging to each class, and let a1 . . . ak be the accuracies (expressed as fractions) on the subsets of test instances belonging to each class. Then, the overall accuracy A can be computed as a weighted combination of the accuracies over the individual labels.



A = \frac{\sum_{i=1}^{k} c_i n_i a_i}{\sum_{i=1}^{k} c_i n_i}    (10.77)

The cost-sensitive accuracy is the same as the unweighted accuracy when all costs c1 . . . ck are identical.
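As a concrete illustration, Equation 10.77 can be sketched in a few lines of Python. The class costs, counts, and per-class accuracies below are invented placeholders for the tumor example, not values from the text:

```python
# Cost-sensitive accuracy (Equation 10.77): a weighted combination of the
# per-class accuracies a_i, with class i weighted by cost c_i times size n_i.

def cost_sensitive_accuracy(costs, counts, accuracies):
    """Return A = sum(c_i * n_i * a_i) / sum(c_i * n_i)."""
    num = sum(c * n * a for c, n, a in zip(costs, counts, accuracies))
    den = sum(c * n for c, n in zip(costs, counts))
    return num / den

# Hypothetical two-class tumor example: malignant (rare, high cost)
# versus nonmalignant (common, low cost).
costs = [10.0, 1.0]         # c_1, c_2: misclassifying malignant is costlier
counts = [20, 980]          # n_1, n_2: test instances per class
accuracies = [0.70, 0.99]   # a_1, a_2: per-class accuracy fractions

A = cost_sensitive_accuracy(costs, counts, accuracies)
```

With equal costs, the function reduces to the plain unweighted accuracy, matching the remark above; the differential costs pull the overall score toward the accuracy on the rare, expensive class.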


Aside from the accuracy, the statistical robustness of a model is also an important issue. For example, if two classifiers are trained over a small number of test instances and compared, the difference in accuracy may be a result of random variations, rather than a truly statistically significant difference between the two classifiers. Therefore, it is important to design statistical measures to quantify the specific advantage of one classifier over the other.


Most statistical methodologies such as holdout, bootstrap, and cross-validation use b > 1 different randomly sampled rounds to obtain multiple estimates of the accuracy. For the purpose of discussion, let us assume that b different rounds (i.e., b different m-way partitions) of cross-validation are used. Let M1 and M2 be two models. Let Ai,1 and Ai,2 be the respective accuracies of the models M1 and M2 on the partitioning created by the ith round of cross-validation. The corresponding difference in accuracy is δai = Ai,1 − Ai,2. This results in b estimates δa1 . . . δab. Note that δai might be either positive or negative, depending on which classifier provides superior performance on a particular round of cross-validation. Let the average difference in accuracy between the two classifiers be ΔA.








\Delta A = \frac{\sum_{i=1}^{b} \delta a_i}{b}    (10.78)
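The paired comparison described above can be sketched as follows. The per-round accuracies for the two models are illustrative placeholders, not results from the text:

```python
# Paired comparison of two models M1 and M2 over b rounds of
# cross-validation, following the notation of Equation 10.78.
# The accuracy values below are hypothetical.

acc_m1 = [0.81, 0.79, 0.84, 0.80, 0.82]  # A_{i,1} for model M1, b = 5 rounds
acc_m2 = [0.78, 0.80, 0.81, 0.77, 0.80]  # A_{i,2} for model M2

# delta a_i = A_{i,1} - A_{i,2}; the sign shows which model won round i.
deltas = [a1 - a2 for a1, a2 in zip(acc_m1, acc_m2)]

# Equation 10.78: the mean difference in accuracy, Delta A.
b = len(deltas)
delta_A = sum(deltas) / b
```

Because the two models are evaluated on the same partitions in each round, the differences δa_i are paired, which removes much of the round-to-round variance from the comparison.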



















The standard deviation σ of the difference in accuracy may be estimated as follows:
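The formula itself is missing from this extract. Consistent with the notation above, the standard sample estimator over the b observed differences δa1 . . . δab would be:

```latex
\sigma = \sqrt{\frac{\sum_{i=1}^{b} (\delta a_i - \Delta A)^2}{b - 1}}
```

This is the usual unbiased-variance form (division by b − 1 rather than b); ΔA and σ together allow a t-statistic to be formed to test whether the observed advantage of one classifier is statistically significant.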








