Data Mining: The Textbook




Pruning criterion: To minimize overfitting, a portion of the training data is held out and not used for constructing the decision tree. This held-out data is then used to evaluate the squared error of prediction of the decision tree. A post-pruning strategy similar to that used for categorical class variables is applied: leaf nodes are iteratively removed if their removal improves the accuracy on the validation set, until no more nodes can be removed.

The main drawback of this approach is that overfitting of the linear regression model is a real possibility when leaf nodes do not contain enough data. Therefore, a sufficient amount of training data is required to begin with. When such data is available, however, regression trees can be very powerful because they can model complex nonlinear relationships.
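The validation-based post-pruning described above can be sketched as follows. This is a hypothetical minimal illustration, not the textbook's implementation: the `Node` structure and the convention that each node stores the mean training target (`value`) are assumptions made for the sketch. For simplicity, the leaves here predict node means rather than fitted linear models.

```python
class Node:
    """Regression-tree node; `value` holds the mean training target at this node."""
    def __init__(self, value, feature=None, threshold=None, left=None, right=None):
        self.value = value
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right

    def is_leaf(self):
        return self.left is None

def predict(node, x):
    # Descend to a leaf and return its stored prediction.
    while not node.is_leaf():
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value

def validation_sse(root, val_data):
    # Squared error of prediction on the held-out validation set.
    return sum((y - predict(root, x)) ** 2 for x, y in val_data)

def post_prune(node, root, val_data):
    """Bottom-up: collapse an internal node to a leaf if that lowers validation SSE."""
    if node.is_leaf():
        return
    post_prune(node.left, root, val_data)
    post_prune(node.right, root, val_data)
    before = validation_sse(root, val_data)
    saved = (node.left, node.right)
    node.left = node.right = None          # tentatively make this node a leaf
    if validation_sse(root, val_data) >= before:
        node.left, node.right = saved      # no improvement on validation: revert
```

For example, a stump whose two leaves predict 1.0 and 3.0 collapses into a single leaf predicting the overall mean 2.0 when the validation responses all lie near that mean, because the collapsed tree has lower validation SSE.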




11.5.6 Assessing Model Effectiveness


The effectiveness of linear regression models can be evaluated with a measure known as the $R^2$-statistic, or the coefficient of determination. The term

$$SSE = \sum_{i=1}^{n} \left( y_i - g(\overline{X_i}) \right)^2$$

yields the sum-of-squared error of prediction of the regression. Here, $g(\overline{X})$ represents the linear model used for regression. The squared error of the response variable about its mean (or total sum of squares) is

$$SST = \sum_{i=1}^{n} \left( y_i - \frac{\sum_{j=1}^{n} y_j}{n} \right)^2.$$

Then the fraction of unexplained variance is given by $SSE/SST$, and the $R^2$-statistic is as follows:

$$R^2 = 1 - \frac{SSE}{SST} \qquad (11.18)$$




This statistic always ranges between 0 and 1 for the case of linear models. Higher values are desirable. When the dimensionality is large, the adjusted R2-statistic provides a more accurate measure:


$$R^2 = 1 - \frac{(n-1)}{(n-d)} \cdot \frac{SSE}{SST} \qquad (11.19)$$

Here, $d$ denotes the dimensionality of the data.

The R2-statistic is appropriate only for the case of linear models. For nonlinear models, it is possible for the R2-statistic to be highly misleading or even negative. In such cases, one might directly use the SSE as a measure of the error.
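The statistics of Eqs. 11.18 and 11.19 can be computed directly from the definitions of $SSE$ and $SST$. The following is a minimal sketch; the function name and signature are illustrative, and the adjusted statistic uses the degrees-of-freedom correction $(n-1)/(n-d)$ from the reconstructed Eq. 11.19.

```python
def r2_statistics(y_true, y_pred, d):
    """Return (R^2, adjusted R^2); d is the dimensionality of the data."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    # Sum-of-squared error of prediction (Eq. for SSE).
    sse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Total sum of squares: squared error about the mean (Eq. for SST).
    sst = sum((y - mean_y) ** 2 for y in y_true)
    r2 = 1.0 - sse / sst                                # Eq. 11.18
    adj_r2 = 1.0 - ((n - 1) / (n - d)) * (sse / sst)    # Eq. 11.19
    return r2, adj_r2
```

For instance, `r2_statistics([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8], d=2)` yields an $R^2$ of 0.98 and a slightly lower adjusted value of 0.97, reflecting the penalty for dimensionality.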

11.6 Semisupervised Learning


In many applications, labeled data is expensive and hard to acquire, whereas unlabeled data is often copiously available. It turns out that unlabeled data can be used to significantly improve the accuracy of many mining algorithms. Unlabeled data is useful for two reasons:





  1. Unlabeled data can be used to estimate the low-dimensional manifold structure of the data. The available variation in label distribution can then be extrapolated on this manifold structure.




  2. Unlabeled data can be used to estimate the joint probability distribution of features. The joint probability distributions of features are useful for indirectly relating feature values to labels.

The two aforementioned points are closely related. To explain them, we will use two examples. Figure 11.2 illustrates a case in which only two labeled examples are available. Based only on this training data, a reasonable decision boundary is illustrated in Fig. 11.2a. Note that this is the best decision boundary one can hope to find with such limited training data. Portions of this decision boundary pass through regions of the space where almost no feature values are available. Therefore, the decision boundaries in these regions may not reflect the class behavior of unseen test instances.
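One simple way that scarce labels can be extended through unlabeled points is nearest-neighbor self-labeling: repeatedly assign to the unlabeled point closest to the labeled set the label of its nearest labeled neighbor, so labels spread along dense regions. This is a hypothetical sketch for illustration only, not an algorithm from this chapter; the function name and the greedy 1-nearest-neighbor rule are assumptions.

```python
import math

def propagate_labels(labeled, unlabeled):
    """Greedily label each unlabeled point with its nearest labeled neighbor's class.

    labeled: dict mapping point (tuple) -> class label
    unlabeled: list of points (tuples)
    """
    labeled = dict(labeled)            # work on a copy
    remaining = list(unlabeled)
    while remaining:
        # Find the (unlabeled, labeled) pair at minimum Euclidean distance.
        u, nearest = min(
            ((u, p) for u in remaining for p in labeled),
            key=lambda pair: math.dist(pair[0], pair[1]),
        )
        labeled[u] = labeled[nearest]  # inherit the nearest neighbor's label
        remaining.remove(u)
    return labeled
```

Starting from two labeled points, the labels chain outward through intermediate unlabeled points, so a point far from both labeled examples can still inherit a label through a dense path, which mirrors the smoothness assumption over dense regions.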


Now, suppose that a large number of unlabeled examples are added to the training data, as illustrated in Fig. 11.2b. The addition of these unlabeled examples makes it immediately evident that the data is distributed along two manifolds, each of which contains one of the labeled training examples. A key assumption here is that the class variable is likely to vary smoothly over dense regions of the space, but it may vary significantly over sparse regions. This leads to a new decision boundary that takes the underlying feature correlations into account in addition to the labeled instances. In the particular example of















