Data Mining: The Textbook




y = f(X) + ε                                                  (11.29)

Here, f(X) is the function representing the true (but unknown) relationship between the feature variables and the class variable, and ε is the intrinsic error in the data that cannot be modeled. Therefore, over the n test instances (X1, y1) . . . (Xn, yn), the intrinsic noise ε_a² may be estimated as follows:

ε_a² = (1/n) Σ_{i=1}^{n} (yi − f(Xi))².                       (11.30)

Table 11.1: Impact of different techniques on bias-variance trade-off

Technique              | Source/level of bias                                                          | Source/level of variance
-----------------------|-------------------------------------------------------------------------------|-------------------------------------------------------------------------------
Simple models          | Oversimplification increases bias in decision boundary                        | Low variance. Simple models do not overfit
Complex models         | Generally lower than simple models. Complex boundary can be modeled           | High variance. Complex assumptions will be overly sensitive to data variation
Shallow decision trees | High bias. Shallow tree will ignore many relevant split predicates            | Low variance. The top split levels do not depend on minor data variations
Deep decision trees    | Lower bias than shallow decision tree. Deep levels model complex boundary     | High variance because of overfitting at lower levels
Rules                  | Bias increases with fewer antecedents per rule                                | Variance increases with more antecedents per rule
Naive Bayes            | High bias from simplified model (e.g., Bernoulli) and naive assumption        | Variance in estimation of model parameters. More parameters increase variance
Linear models          | High bias. Correct boundary may not be linear                                 | Low variance. Linear separator can be modeled robustly
Kernel SVM             | Bias lower than linear SVM. Choice of kernel function                         | Variance higher than linear SVM
k-NN model             | Simplified distance function such as Euclidean causes bias. Increases with k  | Complex distance function such as local discriminant causes variance. Decreases with k
Regularization         | Increases bias                                                                | Reduces variance

Because the precise form of the function f (X) is unknown, most classification algorithms construct a model g(X, D) based on certain modeling assumptions. An example of such a modeling assumption is the linear decision boundary in SVM. This function g(X, D) may be defined either algorithmically (e.g., decision tree), or it may be defined in closed form, such as the SVM classifier, discussed in Sect. 10.6 of Chap. 10. In the latter case, the function g(X, D) is defined as follows:





g(X, D) = sign{W · X + b}.                                    (11.31)
Note that the coefficients W and b can only be estimated from the training data set D. The notation D occurs as an argument of g(X, D) because the training data set D is required to estimate model coefficients such as W and b. For a training data set D of limited size, it is often not possible to estimate g(X, D) very accurately, as in the case of the coarse decision boundaries of Fig. 11.5b. Therefore, the specific impact of the training data set on the estimated result g(X, D) can be quantified by comparing it with its expected value ED[g(X, D)] over all possible outcomes of training data sets D.


Aside from the intrinsic error ε_a², which is data-set specific, there are two primary sources of error in the modeling and estimation process:





  1. The modeling assumptions about g(X, D) may not reflect the true model. Consider the case where the linear SVM modeling assumption is used for g(X, D), and the true boundary between the classes is not linear. This would result in a bias in the model. In





practice, the bias usually arises from oversimplification in the modeling assumptions. Even if we had a very large training data set and could somehow estimate the expected value of g(X, D), the value of (f(X) − ED[g(X, D)])² would be nonzero because of the bias arising from the difference between the true model and the assumed model. This is referred to as the (squared) bias. Note that it is often possible to reduce the bias by assuming a more complex form for g(X, D), such as using a kernel SVM instead of a linear SVM in Fig. 11.5a.





  2. Even if our assumptions about g(X, D) were correct, it is not possible to exactly estimate ED[g(X, D)] with any given training data set D. Over different instantiations of a training data set D and a fixed test instance X, the predicted class label g(X, D) would be different. This is the model variance, which corresponds to ED[(g(X, D) − ED[g(X, D)])²]. Note that the expectation function ED[g(X, D)] defines a decision boundary that is usually much closer to the true decision boundary (e.g., the ensemble boundary estimate in Fig. 11.6b) than that defined by a specific instantiation D of the training data (e.g., boundaries A and B in Fig. 11.5b).




It can be shown that the expected mean squared error of the prediction, ED[MSE] = (1/n) Σ_{i=1}^{n} ED[(yi − g(Xi, D))²], over the test data points (X1, y1) . . . (Xn, yn) and over different choices of the training data set D can be decomposed into the bias, variance, and intrinsic error as follows:

ED[MSE] = (1/n) Σ_{i=1}^{n} { (f(Xi) − ED[g(Xi, D)])² + ED[(g(Xi, D) − ED[g(Xi, D)])²] } + ε_a²

Here, the first term inside the summation is the (squared) bias, the second term is the variance, and ε_a² is the intrinsic error.
The ensemble method reduces the classification error by reducing the bias and variance of the combined model. By carefully choosing the component models with specific bias-variance properties and combining them appropriately, it is often possible to reduce both the bias and the variance over an individual model.
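To make the decomposition above concrete, the following sketch (not from the text) simulates repeated draws of training sets from y = f(X) + ε and estimates the bias², variance, and intrinsic-error terms for a simple polynomial model. The true function f, the noise level, the model degree, and the sample sizes are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                       # assumed "true" function f(X)
    return np.sin(2 * np.pi * x)

sigma = 0.3                     # assumed intrinsic noise level, eps ~ N(0, sigma^2)
degree = 3                      # complexity of the assumed model g(X, D)
n_train, n_trials = 30, 500
x_test = np.linspace(0, 1, 50)

# Collect predictions g(x_test, D) over many random training sets D.
preds = np.empty((n_trials, x_test.size))
for t in range(n_trials):
    x_tr = rng.uniform(0, 1, n_train)
    y_tr = f(x_tr) + rng.normal(0, sigma, n_train)   # y = f(X) + eps
    coeffs = np.polyfit(x_tr, y_tr, degree)          # fit g(X, D)
    preds[t] = np.polyval(coeffs, x_test)

avg_pred = preds.mean(axis=0)                        # E_D[g(X, D)]
bias_sq  = np.mean((f(x_test) - avg_pred) ** 2)      # (f(X) - E_D[g(X, D)])^2
variance = np.mean(preds.var(axis=0))                # E_D[(g - E_D[g])^2]
noise    = sigma ** 2                                # intrinsic error eps_a^2

print(f"bias^2 ~ {bias_sq:.4f}, variance ~ {variance:.4f}, noise = {noise:.4f}")
# Their sum approximates the expected MSE of a single model on fresh test data.
```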

11.8.3 Specific Instantiations of Ensemble Learning


A number of ensemble approaches have been designed to increase accuracy by reducing bias, variance, or a combination of both. In the following, a discussion of selected models is provided.


11.8.3.1 Bagging

Bagging, which is also referred to as bootstrapped aggregating, is an approach that attempts to reduce the variance of the prediction. It is based on the idea that if the variance of a prediction is σ², then the variance of the average of k independent and identically distributed (i.i.d.) predictions is reduced to σ²/k. Given sufficiently independent predictors, such an approach of averaging will reduce the variance significantly.


So how does one approximate i.i.d. predictors? In bagging, data points are sampled uniformly from the original data with replacement. This sampling approach is referred to as bootstrapping and is also used for model evaluation. A sample of approximately the same size as the original data is drawn. This sample may contain duplicates, and it typically contains approximately a 1 − (1 − 1/n)^n ≈ 1 − 1/e fraction of the distinct data points in the original data. Here, e represents the base of the natural logarithm. This result is easy to show because




the probability of a data point not being included in the sample is given by (1 − 1/n)^n. A total of k different bootstrapped samples, each of size n, are drawn independently, and the classifier is trained on each of them. For a given test instance, the predicted class label is reported by the ensemble as the dominant vote of the different classifiers.
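As a quick numerical check of this argument (an illustrative snippet, not part of the original text), the fraction of distinct points in a bootstrapped sample of size n approaches 1 − 1/e ≈ 0.632 as n grows:

```python
import numpy as np

rng = np.random.default_rng(42)

for n in (100, 1000, 10000):
    sample = rng.integers(0, n, size=n)          # draw n points with replacement
    distinct_fraction = np.unique(sample).size / n
    theory = 1 - (1 - 1 / n) ** n                # 1 - (1 - 1/n)^n -> 1 - 1/e
    print(f"n={n}: observed {distinct_fraction:.3f}, theory {theory:.3f}")
# Both values approach 1 - 1/e ~ 0.632.
```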


The primary advantage of bagging is that it reduces the model variance. However, bagging does not reduce the model bias. Therefore, this approach will not be effective for the example of Fig. 11.5a, but it will be effective for the example of Fig. 11.5b. In Fig. 11.5b, the different decision tree boundaries are created by the random variations in the bootstrapped samples. The majority vote of these bootstrapped samples will, however, perform better than a model constructed on the full data set because of a reduction in the variance component. For cases in which the bias is the primary component of error, the bootstrapping approach may show a slight degradation in accuracy. Therefore, while designing a bootstrapping approach, it is advisable to design the individual components to reduce the bias at the possible expense of variance, because bagging will take care of the latter. Such a choice optimizes the bias-variance trade-off. For example, one might want to grow a deep decision tree with 100% class purity at the leaves. In fact, decision trees are an ideal choice for bagging, because they have low bias and high variance when they are grown sufficiently deep. The main problem with bagging is that the i.i.d. assumption is usually not satisfied because of correlations between ensemble components.
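A minimal bagging sketch along these lines is shown below, using deep (low bias, high variance) decision trees as component models and a majority vote for prediction. The synthetic data set, the ensemble size k, and the train/test split are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

k, n = 25, len(y_tr)                      # number of bootstrapped components
trees = []
for _ in range(k):
    idx = rng.integers(0, n, size=n)      # bootstrap sample of size n (with replacement)
    tree = DecisionTreeClassifier()       # grown deep: low bias, high variance
    trees.append(tree.fit(X_tr[idx], y_tr[idx]))

# Majority vote of the k component predictions for each test instance.
votes = np.stack([t.predict(X_te) for t in trees])   # shape (k, n_test)
bagged = (votes.mean(axis=0) >= 0.5).astype(int)     # dominant vote for binary 0/1 labels
single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).predict(X_te)
print("single tree accuracy:", np.mean(single == y_te))
print("bagged ensemble accuracy:", np.mean(bagged == y_te))
```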


11.8.3.2 Random Forests

Bagging works best when the individual predictions from the various ensemble components satisfy the i.i.d. property. If the k different predictors, each with variance σ², have a positive pairwise correlation of ρ between them, then the variance of the averaged prediction can be shown to be ρ·σ² + (1 − ρ)·σ²/k. The term ρ·σ² is invariant to the number of components k in the ensemble. This term limits the performance gains from bagging. As we will discuss below, the predictions from bootstrapped decision trees are usually positively correlated.
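The effect of the correlation term can be checked numerically. The sketch below (with assumed values of σ², ρ, and k) averages equicorrelated predictors and compares the empirical variance with ρ·σ² + (1 − ρ)·σ²/k:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, rho, k = 1.0, 0.3, 50       # assumed variance, pairwise correlation, ensemble size

# Covariance matrix of k equicorrelated predictors, each with variance sigma2.
cov = np.full((k, k), rho * sigma2)
np.fill_diagonal(cov, sigma2)

draws = rng.multivariate_normal(np.zeros(k), cov, size=200000)
empirical = draws.mean(axis=1).var()
theory = rho * sigma2 + (1 - rho) * sigma2 / k
print(f"empirical variance of average: {empirical:.4f}, formula: {theory:.4f}")
# Even as k grows, the variance cannot drop below rho * sigma2 = 0.3,
# which is why decorrelating the components matters.
```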

Random forests can be viewed as a generalization of the basic bagging method, as applied to decision trees. Random forests are defined as an ensemble of decision trees, in which randomness has explicitly been inserted into the model building process of each decision tree. While the bootstrapped sampling approach of bagging is also an indirect way of adding randomness to model building, there are some disadvantages of doing so. The main drawback of using decision trees directly with bagging is that the split choices at the top levels of the tree are statistically likely to remain approximately invariant to bootstrapped sampling. Therefore, the trees are more correlated, which limits the amount of error reduction obtained from bagging. In such cases, it makes sense to directly increase the diversity of the component decision-tree models. The idea is to use a randomized decision tree model with less correlation between the different ensemble components. The underlying variability can then be more effectively reduced by an averaging approach. The final results are often more accurate than a direct application of bagging on decision trees. Figure 11.6b provides a typical example of the effect of the approach on the decision boundary, which is similar to that of bagging, but it is usually more pronounced.


The random-split selection approach directly introduces randomness into the split criterion. An integer parameter q ≤ d is used to regulate the amount of randomness introduced in split selection. The split selection at each node is preceded by the randomized selection of a subset S of attributes of size q. The splits at that node are then executed using only this subset. Larger values of q will result in correlated trees that are similar to a tree without any injected randomness. By selecting small values of q relative to the full dimensionality






d, the correlations across different components of the random forest can be reduced. The resulting trees can also be grown more efficiently, because a smaller number of attributes need to be considered at each node. On the other hand, when a larger size q of the selected feature set S is used, then each individual ensemble component will have better accuracy, which is also desirable. Therefore, the goal is to select a good trade-off point. It has been shown that, when the number of input attributes is d, a total of q = log2(d) + 1 attributes should be selected to achieve the best trade-off. Interestingly, even a choice of q = 1 seems to work well in terms of accuracy, though it requires the use of a larger number of ensemble components. After the ensemble has been constructed, the dominant label from the various predictions of a test instance is reported. This approach is referred to as Forest-RI because it is based on random input selection.
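If a library implementation is acceptable, scikit-learn's RandomForestClassifier exposes this trade-off through its max_features parameter, which plays the role of q; the setting below roughly mirrors the log2(d) + 1 recommendation. The synthetic data set and the number of trees are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=64, n_informative=20, random_state=0)

# max_features="log2" restricts each split to a random subset of ~log2(d) attributes,
# which decorrelates the component trees (Forest-RI style random input selection).
forest = RandomForestClassifier(n_estimators=200, max_features="log2", random_state=0)
print("CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```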

This approach does not work well when the overall dimensionality d is small, and therefore it is no longer possible to use values of q much smaller than d. In such cases, a value L ≤ d is specified, which corresponds to the number of input features that are combined together. At each node, L features are randomly selected and combined linearly with coefficients generated uniformly at random from the range [−1, 1]. A total of q such combinations are generated in order to create a new subset S of multivariate attributes. As in the previous case, the splits are executed using only the attribute set S. This results in multivariate random splits. This approach is referred to as Forest-RC because it uses random linear combinations.


In the random forest approach, deep trees are grown without pruning. Each tree is grown on a bootstrapped sample of the training data to increase the variance reduction. As in bagging, the variance is reduced significantly by the random forest approach. However, the component classifiers may have higher bias because of the restricted split selection of each component classifier. This can sometimes cause a problem when the fraction of informative features in the training data is small. Therefore, the gains in random forests are a result of variance reduction. In practice, the random forests approach usually performs better than bagging and comparably to boosting. It is also resistant to noise and outliers.


11.8.3.3 Boosting


In boosting, a weight is associated with each training instance, and the different classifiers are trained with the use of these weights. The weights are modified iteratively based on classifier performance. In other words, the future models constructed are dependent on the results from previous models. Thus, each classifier in this model is constructed using the same algorithm A on a weighted training data set. The basic idea is to focus on the incorrectly classified instances in future iterations by increasing the relative weight of these instances. The hypothesis is that the errors in these misclassified instances are caused by classifier bias. Therefore, increasing the instance weight of misclassified instances will result in a new classifier that corrects for the bias on these particular instances. By iteratively using this approach and creating a weighted combination of the various classifiers, it is possible to create a classifier with lower overall bias. For example, in Fig. 11.6a, each individual SVM is not globally optimized and is accurate only near specific portions of the decision boundary, but the combination ensemble provides a very accurate decision boundary.


The most well-known approach to boosting is the AdaBoost algorithm. For simplicity, the following discussion will assume the binary class scenario. It is assumed that the class labels are drawn from {−1, +1}. This algorithm works by associating each training example with a weight that is updated in each iteration, depending on the results of the classification in the last iteration. The base classifiers therefore need to be able to work with weighted




Algorithm AdaBoost(Data Set: D, Base Classifier: A, Maximum Rounds: T)
begin
  t = 0;
  for each i initialize W1(i) = 1/n;
  repeat
    t = t + 1;
    Determine weighted error rate εt on D when base algorithm A
      is applied to weighted data set with weights Wt(·);
    αt = (1/2) loge((1 − εt)/εt);
    for each misclassified Xi ∈ D do Wt+1(i) = Wt(i)·e^αt;
    else (correctly classified instance) do Wt+1(i) = Wt(i)·e^(−αt);
    for each instance Xi do normalize Wt+1(i) = Wt+1(i)/[Σ_{j=1}^{n} Wt+1(j)];
  until ((t ≥ T) OR (εt = 0) OR (εt ≥ 0.5));
  Use ensemble components with weights αt for test instance classification;
end

Figure 11.7: The AdaBoost algorithm
instances. Weights can be incorporated either by direct modification of training models, or by (biased) bootstrap sampling of the training data. The reader should revisit the section on rare class learning for a discussion on this topic. Instances that are misclassified are given higher weights in successive iterations. Note that this corresponds to intentionally biasing the classifier in later iterations with respect to the global training data, but reducing the bias in certain local regions that are deemed “difficult” to classify by the specific model A.

In the tth round, the weight of the ith instance is Wt(i). The algorithm starts with an equal weight of 1/n for each of the n instances, and updates them in each iteration. In the event that the ith instance is misclassified, its (relative) weight is increased to Wt+1(i) = Wt(i)·e^αt, whereas in the case of a correct classification, the weight is decreased to Wt+1(i) = Wt(i)·e^(−αt). Here, αt is chosen as (1/2) loge((1 − εt)/εt), where εt is the fraction of incorrectly predicted training instances (computed after weighting with Wt(i)) by the model in the tth iteration. The approach terminates when the classifier achieves 100% accuracy on the training data (εt = 0), or when it performs worse than a random (binary) classifier (εt ≥ 0.5). An additional termination criterion is that the number of boosting rounds is bounded above by a user-defined parameter T. The overall training portion of the algorithm is illustrated in Fig. 11.7.
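A compact sketch of this training loop is shown below, using decision stumps as the weighted base classifier A. It follows the weight-update and termination rules described above, but the choice of stump learner, the synthetic data set, and the round limit T are assumptions made for the example; it is not the reference implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=400, n_features=10, random_state=0)
y = 2 * y01 - 1                          # class labels drawn from {-1, +1}

n, T = len(y), 50
w = np.full(n, 1.0 / n)                  # W_1(i) = 1/n
stumps, alphas = [], []

for t in range(T):
    stump = DecisionTreeClassifier(max_depth=1)      # weak (high bias) learner
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    eps = np.sum(w * (pred != y))                    # weighted error rate eps_t
    if eps == 0 or eps >= 0.5:                       # termination criteria
        break
    alpha = 0.5 * np.log((1 - eps) / eps)            # alpha_t
    w *= np.exp(-alpha * y * pred)                   # up-weight mistakes, down-weight hits
    w /= w.sum()                                     # normalize the weights
    stumps.append(stump)
    alphas.append(alpha)

# Test-time prediction: sign of the alpha-weighted vote over all rounds.
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(scores) == y))
```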


It remains to be explained how a particular test instance is classified with the ensemble learner. Each of the models induced in the different rounds of boosting is applied to the test instance. The prediction pt ∈ {−1, +1} of the test instance for the tth round is weighted


with αt, and these weighted predictions are aggregated. The sign of the aggregation Σt pt·αt provides the class label prediction of the test instance. Note that less accurate components are weighted less by this approach.

An error rate of εt ≥ 0.5 is as bad as or worse than the expected error rate of a random (binary) classifier. This is the reason that this case is also used as a termination criterion. In some implementations of boosting, the weights Wt(i) are reset to 1/n whenever εt ≥ 0.5, and the boosting process is continued with the reset weights. In other implementations, εt is allowed to increase beyond 0.5, and therefore some of the prediction results pt for a test instance are effectively inverted with negative values of the weight αt = (1/2) loge((1 − εt)/εt).




Boosting primarily focuses on reducing the bias. The bias component of the error is reduced because of the greater focus on misclassified instances. The combination decision boundary is a complex combination of the simpler decision boundaries, which are each optimized to specific parts of the training data. An example of how simpler decision boundaries combine to provide a complex decision boundary was already illustrated in Fig. 11.6a. Because of its focus on the bias of classifier models, such an approach is capable of combining many weak (high bias) learners to create a strong learner. Therefore, the approach should generally be used with simpler (high bias) learners with low variance in the individual ensemble components. In spite of its focus on bias, boosting can also reduce the variance slightly when reweighting is implemented with sampling. This reduction is because of the repeated construction of models on randomly sampled, albeit reweighted, instances. The amount of variance reduction depends on the reweighting scheme used. Modifying the weights less aggressively between rounds will lead to better variance reduction. For example, if the weights are not modified at all between boosting rounds, then the boosting approach defaults to bagging, which only reduces variance. Therefore, it is possible to leverage variants of boosting to explore the bias-variance trade-off in various ways.


Boosting is vulnerable to data sets with significant noise in them. This is because boosting assumes that misclassification is caused by the bias component of instances near the incorrectly modeled decision boundary, whereas it might simply be a result of the mislabeling of the data. This is the noise component that is intrinsic to the data, rather than the model. In such cases, boosting inappropriately overtrains the classifier to low-quality portions of the data. Indeed, there are many noisy real-world data sets where boosting does not perform well. Its accuracy is typically superior to bagging in scenarios where the data sets are not excessively noisy.


11.8.3.4 Bucket of Models

The bucket of models is based on the intuition that it is often difficult to know a priori which classifier will work well for a particular data set. For example, a particular data set may be suited to decision trees, whereas another data set may be well suited to support vector machines. Therefore, model selection methods are required. Given a data set, how does one determine which algorithm to use? The idea is to first divide the data set into two subsets A and B. Each algorithm is trained on subset A. The set B is then used to evaluate the performance of each of these models. The winner in this "bake-off" contest is selected. Then, the winner is retrained using the full data set. If desired, cross-validation can be used for the contest, instead of the "hold-out" approach of dividing the training data into two subsets.
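A minimal sketch of this "bake-off" procedure is given below; the candidate classifiers, the split ratio, and the synthetic data set are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Divide the data into subset A (training) and subset B (held-out evaluation).
X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "naive Bayes": GaussianNB(),
}

# Train each candidate on A, score it on B, and pick the winner of the bake-off.
scores = {name: model.fit(X_a, y_a).score(X_b, y_b) for name, model in candidates.items()}
winner = max(scores, key=scores.get)
print("held-out accuracies:", scores, "-> winner:", winner)

# Retrain the winning model on the full data set before deployment.
final_model = candidates[winner].fit(X, y)
```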


Note that the accuracy of the bucket of models is no better than the accuracy of the best classifier for a particular data set.

