Data Mining: The Textbook




y = f(X) + ε                                                  (11.29)

Here, f(X) is the function representing the true (but unknown) relationship between the feature variables and the class variable, and ε is the intrinsic error in the data that cannot be modeled. Therefore, over the n test instances (X1, y1) . . . (Xn, yn), the intrinsic noise ε_a² may be estimated as follows:

ε_a² = (1/n) Σ_{i=1}^{n} (yi − f(Xi))².                       (11.30)

Table 11.1: Impact of different techniques on bias-variance trade-off

Technique              | Source/level of bias                                                          | Source/level of variance
-----------------------|-------------------------------------------------------------------------------|-------------------------------------------------------------------------------
Simple models          | Oversimplification increases bias in decision boundary                        | Low variance. Simple models do not overfit
Complex models         | Generally lower than simple models. Complex boundary can be modeled           | High variance. Complex assumptions will be overly sensitive to data variation
Shallow decision trees | High bias. Shallow tree will ignore many relevant split predicates            | Low variance. The top split levels do not depend on minor data variations
Deep decision trees    | Lower bias than shallow decision tree. Deep levels model complex boundary     | High variance because of overfitting at lower levels
Rules                  | Bias increases with fewer antecedents per rule                                | Variance increases with more antecedents per rule
Naive Bayes            | High bias from simplified model (e.g., Bernoulli) and naive assumption        | Variance in estimation of model parameters. More parameters increase variance
Linear models          | High bias. Correct boundary may not be linear                                 | Low variance. Linear separator can be modeled robustly
Kernel SVM             | Bias lower than linear SVM. Choice of kernel function                         | Variance higher than linear SVM
k-NN model             | Simplified distance function such as Euclidean causes bias. Increases with k  | Complex distance function such as local discriminant causes variance. Decreases with k
Regularization         | Increases bias                                                                | Reduces variance

Because the precise form of the function f (X) is unknown, most classification algorithms construct a model g(X, D) based on certain modeling assumptions. An example of such a modeling assumption is the linear decision boundary in SVM. This function g(X, D) may be defined either algorithmically (e.g., decision tree), or it may be defined in closed form, such as the SVM classifier, discussed in Sect. 10.6 of Chap. 10. In the latter case, the function g(X, D) is defined as follows:





g(X, D) = sign{W · X + b}.                                    (11.31)
Note that the coefficients W and b can only be estimated from the training data set D. The notation D occurs as an argument of g(X, D) because the training data set D is required to estimate model coefficients such as W and b. For a training data set D of limited size, it is often not possible to estimate g(X, D) very accurately, as in the case of the coarse decision boundaries of Fig. 11.5b. Therefore, the specific impact of the training data set on the estimated result g(X, D) can be quantified by comparing it with its expected value ED[g(X, D)] over all possible outcomes of training data sets D.


Aside from the intrinsic error ε_a², which is data-set specific, there are two primary sources of error in the modeling and estimation process:





  1. The modeling assumptions about g(X, D) may not reflect the true model. Consider the case where the linear SVM modeling assumption is used for g(X, D), and the true boundary between the classes is not linear. This would result in a bias in the model. In





practice, the bias usually arises from oversimplification in the modeling assumptions. Even if we had a very large training data set and could somehow estimate the expected value of g(X, D), the value of (f(X) − ED[g(X, D)])² would be nonzero because of the bias arising from the difference between the true model and the assumed model. This is referred to as the (squared) bias. Note that it is often possible to reduce the bias by assuming a more complex form for g(X, D), such as using a kernel SVM instead of a linear SVM in Fig. 11.5a.





  2. Even if our assumptions about g(X, D) were correct, it is not possible to exactly estimate ED[g(X, D)] with any given training data set D. Over different instantiations of a training data set D and a fixed test instance X, the predicted class label g(X, D) would be different. This is the model variance, which corresponds to ED[(g(X, D) − ED[g(X, D)])²]. Note that the expectation function ED[g(X, D)] defines a decision boundary that is usually much closer to the true decision boundary (e.g., the ensemble boundary estimate in Fig. 11.6b) than that defined by a specific instantiation D of the training data (e.g., boundaries A and B in Fig. 11.5b).




It can be shown that the expected mean squared error of the prediction, ED[MSE] = (1/n) Σ_{i=1}^{n} ED[(yi − g(Xi, D))²], over the test data points (X1, y1) . . . (Xn, yn) and over different choices of the training data set D can be decomposed into the bias, variance, and intrinsic error as follows:

ED[MSE] = (1/n) Σ_{i=1}^{n} { (f(Xi) − ED[g(Xi, D)])² + ED[(g(Xi, D) − ED[g(Xi, D)])²] } + ε_a²

Here, the first term inside the summation is the (squared) bias, the second term is the variance, and ε_a² is the intrinsic error.
The ensemble method reduces the classification error by reducing the bias and variance of the combined model. By carefully choosing the component models with specific bias-variance properties and combining them appropriately, it is often possible to reduce both the bias and the variance over an individual model.
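To make the decomposition above concrete, the following sketch (not from the text) simulates repeated draws of training sets from y = f(X) + ε and estimates the bias², variance, and intrinsic-error terms for a simple polynomial model. The true function f, the noise level, the model degree, and the sample sizes are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                       # assumed "true" function f(X)
    return np.sin(2 * np.pi * x)

sigma = 0.3                     # assumed intrinsic noise level, eps ~ N(0, sigma^2)
degree = 3                      # complexity of the assumed model g(X, D)
n_train, n_trials = 30, 500
x_test = np.linspace(0, 1, 50)

# Collect predictions g(x_test, D) over many random training sets D.
preds = np.empty((n_trials, x_test.size))
for t in range(n_trials):
    x_tr = rng.uniform(0, 1, n_train)
    y_tr = f(x_tr) + rng.normal(0, sigma, n_train)   # y = f(X) + eps
    coeffs = np.polyfit(x_tr, y_tr, degree)          # fit g(X, D)
    preds[t] = np.polyval(coeffs, x_test)

avg_pred = preds.mean(axis=0)                        # E_D[g(X, D)]
bias_sq  = np.mean((f(x_test) - avg_pred) ** 2)      # (f(X) - E_D[g(X, D)])^2
variance = np.mean(preds.var(axis=0))                # E_D[(g - E_D[g])^2]
noise    = sigma ** 2                                # intrinsic error eps_a^2

print(f"bias^2 ~ {bias_sq:.4f}, variance ~ {variance:.4f}, noise = {noise:.4f}")
# Their sum approximates the expected MSE of a single model on fresh test data.
```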

11.8.3 Specific Instantiations of Ensemble Learning


A number of ensemble approaches have been designed to increase accuracy by reducing bias, variance, or a combination of both. In the following, a discussion of selected models is provided.


11.8.3.1 Bagging

Bagging, which is also referred to as bootstrapped aggregating, is an approach that attempts to reduce the variance of the prediction. It is based on the idea that if the variance of a prediction is σ², then the variance of the average of k independent and identically distributed (i.i.d.) predictions is reduced to σ²/k. Given sufficiently independent predictors, such an approach of averaging will reduce the variance significantly.


So how does one approximate i.i.d. predictors? In bagging, data points are sampled uniformly from the original data with replacement. This sampling approach is referred to as bootstrapping and is also used for model evaluation. A sample of approximately the same size as the original data is drawn. This sample may contain duplicates, and it typically contains approximately a 1 − (1 − 1/n)^n ≈ 1 − 1/e fraction of the distinct data points in the original data. Here, e represents the base of the natural logarithm. This result is easy to show because




the probability of a data point not being included in the sample is given by (1 − 1/n)^n. A total of k different bootstrapped samples, each of size n, are drawn independently, and the classifier is trained on each of them. For a given test instance, the predicted class label is reported by the ensemble as the dominant vote of the different classifiers.
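As a quick numerical check of this argument (an illustrative snippet, not part of the original text), the fraction of distinct points in a bootstrapped sample of size n approaches 1 − 1/e ≈ 0.632 as n grows:

```python
import numpy as np

rng = np.random.default_rng(42)

for n in (100, 1000, 10000):
    sample = rng.integers(0, n, size=n)          # draw n points with replacement
    distinct_fraction = np.unique(sample).size / n
    theory = 1 - (1 - 1 / n) ** n                # 1 - (1 - 1/n)^n -> 1 - 1/e
    print(f"n={n}: observed {distinct_fraction:.3f}, theory {theory:.3f}")
# Both values approach 1 - 1/e ~ 0.632.
```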


The primary advantage of bagging is that it reduces the model variance. However, bagging does not reduce the model bias. Therefore, this approach will not be effective for the example of Fig. 11.5a, but it will be effective for the example of Fig. 11.5b. In Fig. 11.5b, the different decision tree boundaries are created by the random variations in the bootstrapped samples. The majority vote of these bootstrapped samples will, however, perform better than a model constructed on the full data set because of a reduction in the variance component. For cases in which the bias is the primary component of error, the bootstrapping approach may show a slight degradation in accuracy. Therefore, while designing a bootstrapping approach, it is advisable to design the individual components to reduce the bias at the possible expense of variance, because bagging will take care of the latter. Such a choice optimizes the bias-variance trade-off. For example, one might want to grow a deep decision tree with 100% class purity at the leaves. In fact, decision trees are an ideal choice for bagging, because they have low bias and high variance when they are grown sufficiently deep. The main problem with bagging is that the i.i.d. assumption is usually not satisfied because of correlations between ensemble components.
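A minimal bagging sketch along these lines is shown below, using deep (low bias, high variance) decision trees as component models and a majority vote for prediction. The synthetic data set, the ensemble size k, and the train/test split are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

k, n = 25, len(y_tr)                      # number of bootstrapped components
trees = []
for _ in range(k):
    idx = rng.integers(0, n, size=n)      # bootstrap sample of size n (with replacement)
    tree = DecisionTreeClassifier()       # grown deep: low bias, high variance
    trees.append(tree.fit(X_tr[idx], y_tr[idx]))

# Majority vote of the k component predictions for each test instance.
votes = np.stack([t.predict(X_te) for t in trees])   # shape (k, n_test)
bagged = (votes.mean(axis=0) >= 0.5).astype(int)     # dominant vote for binary 0/1 labels
single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).predict(X_te)
print("single tree accuracy:", np.mean(single == y_te))
print("bagged ensemble accuracy:", np.mean(bagged == y_te))
```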


11.8.3.2 Random Forests

Bagging works best when the individual predictions from the various ensemble components satisfy the i.i.d. property. If the k different predictors, each with variance σ², have a positive pairwise correlation of ρ between them, then the variance of the averaged prediction can be shown to be ρ·σ² + (1 − ρ)·σ²/k. The term ρ·σ² is invariant to the number of components k in the ensemble. This term limits the performance gains from bagging. As we will discuss below, the predictions from bootstrapped decision trees are usually positively correlated.
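The effect of the correlation term can be checked numerically. The sketch below (with assumed values of σ², ρ, and k) averages equicorrelated predictors and compares the empirical variance with ρ·σ² + (1 − ρ)·σ²/k:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, rho, k = 1.0, 0.3, 50       # assumed variance, pairwise correlation, ensemble size

# Covariance matrix of k equicorrelated predictors, each with variance sigma2.
cov = np.full((k, k), rho * sigma2)
np.fill_diagonal(cov, sigma2)

draws = rng.multivariate_normal(np.zeros(k), cov, size=200000)
empirical = draws.mean(axis=1).var()
theory = rho * sigma2 + (1 - rho) * sigma2 / k
print(f"empirical variance of average: {empirical:.4f}, formula: {theory:.4f}")
# Even as k grows, the variance cannot drop below rho * sigma2 = 0.3,
# which is why decorrelating the components matters.
```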

Random forests can be viewed as a generalization of the basic bagging method, as applied to decision trees. Random forests are defined as an ensemble of decision trees, in which randomness has explicitly been inserted into the model building process of each decision tree. While the bootstrapped sampling approach of bagging is also an indirect way of adding randomness to model building, there are some disadvantages of doing so. The main drawback of using decision trees directly with bagging is that the split choices at the top levels of the tree are statistically likely to remain approximately invariant to bootstrapped sampling. Therefore, the trees are more correlated, which limits the amount of error reduction obtained from bagging. In such cases, it makes sense to directly increase the diversity of the component decision-tree models. The idea is to use a randomized decision tree model with less correlation between the different ensemble components. The underlying variability can then be more effectively reduced by an averaging approach. The final results are often more accurate than a direct application of bagging on decision trees. Figure 11.6b provides a typical example of the effect of the approach on the decision boundary, which is similar to that of bagging, but it is usually more pronounced.


The random-split selection approach directly introduces randomness into the split criterion. An integer parameter q ≤ d is used to regulate the amount of randomness introduced in split selection. The split selection at each node is preceded by the randomized selection of a subset S of attributes of size q. The splits at that node are then executed using only this subset. Larger values of q will result in correlated trees that are similar to a tree without any injected randomness. By selecting small values of q relative to the full dimensionality






d, the correlations across different components of the random forest can be reduced. The resulting trees can also be grown more efficiently, because a smaller number of attributes need to be considered at each node. On the other hand, when a larger size q of the selected feature set S is used, then each individual ensemble component will have better accuracy, which is also desirable. Therefore, the goal is to select a good trade-off point. It has been shown that, when the number of input attributes is d, a total of q = log2(d) + 1 attributes should be selected to achieve the best trade-off. Interestingly, even a choice of q = 1 seems to work well in terms of accuracy, though it requires the use of a larger number of ensemble components. After the ensemble has been constructed, the dominant label from the various predictions of a test instance is reported. This approach is referred to as Forest-RI because it is based on random input selection.
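If a library implementation is acceptable, scikit-learn's RandomForestClassifier exposes this trade-off through its max_features parameter, which plays the role of q; the setting below roughly mirrors the log2(d) + 1 recommendation. The synthetic data set and the number of trees are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=64, n_informative=20, random_state=0)

# max_features="log2" restricts each split to a random subset of ~log2(d) attributes,
# which decorrelates the component trees (Forest-RI style random input selection).
forest = RandomForestClassifier(n_estimators=200, max_features="log2", random_state=0)
print("CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```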

This approach does not work well when the overall dimensionality d is small, and therefore it is no longer possible to use values of q much smaller than d. In such cases, a value L ≤ d is specified, which corresponds to the number of input features that are combined together. At each node, L features are randomly selected and combined linearly with coefficients generated uniformly at random from the range [−1, 1]. A total of q such combinations are generated in order to create a new subset S of multivariate attributes. As in the previous case, the splits are executed using only the attribute set S. This results in multivariate random splits. This approach is referred to as Forest-RC because it uses random linear combinations.


In the random forest approach, deep trees are grown without pruning. Each tree is grown on a bootstrapped sample of the training data to increase the variance reduction. As in bagging, the variance is reduced significantly by the random forest approach. However, the component classifiers may have higher bias because of the restricted split selection of each component classifier. This can sometimes cause a problem when the fraction of informative features in the training data is small. Therefore, the gains in random forests are a result of variance reduction. In practice, the random forests approach usually performs better than bagging and comparably to boosting. It is also resistant to noise and outliers.


11.8.3.3 Boosting


In boosting, a weight is associated with each training instance, and the different classifiers are trained with the use of these weights. The weights are modified iteratively based on classifier performance. In other words, the future models constructed are dependent on the results from previous models. Thus, each classifier in this model is constructed using the same algorithm A on a weighted training data set. The basic idea is to focus on the incorrectly classified instances in future iterations by increasing the relative weight of these instances. The hypothesis is that the errors in these misclassified instances are caused by classifier bias. Therefore, increasing the instance weight of misclassified instances will result in a new classifier that corrects for the bias on these particular instances. By iteratively using this approach and creating a weighted combination of the various classifiers, it is possible to create a classifier with lower overall bias. For example, in Fig. 11.6a, each individual SVM is not globally optimized and is accurate only near specific portions of the decision boundary, but the combination ensemble provides a very accurate decision boundary.


The most well-known approach to boosting is the AdaBoost algorithm. For simplicity, the following discussion will assume the binary class scenario. It is assumed that the class labels are drawn from {−1, +1}. This algorithm works by associating each training example with a weight that is updated in each iteration, depending on the results of the classification in the last iteration. The base classifiers therefore need to be able to work with weighted




Algorithm AdaBoost(Data Set: D, Base Classifier: A, Maximum Rounds: T)
begin
  t = 0;
  for each i initialize W1(i) = 1/n;
  repeat
    t = t + 1;
    Determine weighted error rate εt on D when base algorithm A
      is applied to weighted data set with weights Wt(·);
    αt = (1/2) loge((1 − εt)/εt);
    for each misclassified Xi ∈ D do Wt+1(i) = Wt(i)·e^αt;
    else (correctly classified instance) do Wt+1(i) = Wt(i)·e^(−αt);
    for each instance Xi do normalize Wt+1(i) = Wt+1(i)/[Σ_{j=1}^{n} Wt+1(j)];
  until ((t ≥ T) OR (εt = 0) OR (εt ≥ 0.5));
  Use ensemble components with weights αt for test instance classification;
end

Figure 11.7: The AdaBoost algorithm
instances. Weights can be incorporated either by direct modification of training models, or by (biased) bootstrap sampling of the training data. The reader should revisit the section on rare class learning for a discussion on this topic. Instances that are misclassified are given higher weights in successive iterations. Note that this corresponds to intentionally biasing the classifier in later iterations with respect to the global training data, but reducing the bias in certain local regions that are deemed “difficult” to classify by the specific model A.

In the tth round, the weight of the ith instance is Wt(i). The algorithm starts with an equal weight of 1/n for each of the n instances, and updates them in each iteration. In the event that the ith instance is misclassified, its (relative) weight is increased to Wt+1(i) = Wt(i)·e^αt, whereas in the case of a correct classification, the weight is decreased to Wt+1(i) = Wt(i)·e^(−αt). Here, αt is chosen as (1/2) loge((1 − εt)/εt), where εt is the fraction of incorrectly predicted training instances (computed after weighting with Wt(i)) by the model in the tth iteration. The approach terminates when the classifier achieves 100% accuracy on the training data (εt = 0), or when it performs worse than a random (binary) classifier (εt ≥ 0.5). An additional termination criterion is that the number of boosting rounds is bounded above by a user-defined parameter T. The overall training portion of the algorithm is illustrated in Fig. 11.7.
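A compact sketch of this training loop is shown below, using decision stumps as the weighted base classifier A. It follows the weight-update and termination rules described above, but the choice of stump learner, the synthetic data set, and the round limit T are assumptions made for the example; it is not the reference implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=400, n_features=10, random_state=0)
y = 2 * y01 - 1                          # class labels drawn from {-1, +1}

n, T = len(y), 50
w = np.full(n, 1.0 / n)                  # W_1(i) = 1/n
stumps, alphas = [], []

for t in range(T):
    stump = DecisionTreeClassifier(max_depth=1)      # weak (high bias) learner
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    eps = np.sum(w * (pred != y))                    # weighted error rate eps_t
    if eps == 0 or eps >= 0.5:                       # termination criteria
        break
    alpha = 0.5 * np.log((1 - eps) / eps)            # alpha_t
    w *= np.exp(-alpha * y * pred)                   # up-weight mistakes, down-weight hits
    w /= w.sum()                                     # normalize the weights
    stumps.append(stump)
    alphas.append(alpha)

# Test-time prediction: sign of the alpha-weighted vote over all rounds.
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(scores) == y))
```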


It remains to be explained how a particular test instance is classified with the ensemble learner. Each of the models induced in the different rounds of boosting is applied to the test instance. The prediction pt ∈ {−1, +1} of the test instance for the tth round is weighted


with αt, and these weighted predictions are aggregated. The sign of the aggregation Σt pt·αt provides the class label prediction of the test instance. Note that less accurate components are weighted less by this approach.

An error rate of εt ≥ 0.5 is as bad as or worse than the expected error rate of a random (binary) classifier. This is the reason that this case is also used as a termination criterion. In some implementations of boosting, the weights Wt(i) are reset to 1/n whenever εt ≥ 0.5, and the boosting process is continued with the reset weights. In other implementations, εt is allowed to increase beyond 0.5, and therefore some of the prediction results pt for a test instance are effectively inverted with negative values of the weight αt = (1/2) loge((1 − εt)/εt).




Boosting primarily focuses on reducing the bias. The bias component of the error is reduced because of the greater focus on misclassified instances. The combination decision boundary is a complex combination of the simpler decision boundaries, which are each optimized to specific parts of the training data. An example of how simpler decision boundaries combine to provide a complex decision boundary was already illustrated in Fig. 11.6a. Because of its focus on the bias of classifier models, such an approach is capable of combining many weak (high bias) learners to create a strong learner. Therefore, the approach should generally be used with simpler (high bias) learners with low variance in the individual ensemble components. In spite of its focus on bias, boosting can also reduce the variance slightly when reweighting is implemented with sampling. This reduction is because of the repeated construction of models on randomly sampled, albeit reweighted, instances. The amount of variance reduction depends on the reweighting scheme used. Modifying the weights less aggressively between rounds will lead to better variance reduction. For example, if the weights are not modified at all between boosting rounds, then the boosting approach defaults to bagging, which only reduces variance. Therefore, it is possible to leverage variants of boosting to explore the bias-variance trade-off in various ways.


Boosting is vulnerable to data sets with significant noise in them. This is because boosting assumes that misclassification is caused by the bias component of instances near the incorrectly modeled decision boundary, whereas it might simply be a result of the mislabeling of the data. This is the noise component that is intrinsic to the data, rather than the model. In such cases, boosting inappropriately overtrains the classifier to low-quality portions of the data. Indeed, there are many noisy real-world data sets where boosting does not perform well. Its accuracy is typically superior to bagging in scenarios where the data sets are not excessively noisy.


11.8.3.4 Bucket of Models

The bucket of models is based on the intuition that it is often difficult to know a priori which classifier will work well for a particular data set. For example, a particular data set may be suited to decision trees, whereas another data set may be well suited to support vector machines. Therefore, model selection methods are required. Given a data set, how does one determine which algorithm to use? The idea is to first divide the data set into two subsets A and B. Each algorithm is trained on subset A. The set B is then used to evaluate the performance of each of these models. The winner in this "bake-off" contest is selected. Then, the winner is retrained using the full data set. If desired, cross-validation can be used for the contest, instead of the "hold-out" approach of dividing the training data into two subsets.
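A minimal sketch of this "bake-off" procedure is given below; the candidate classifiers, the split ratio, and the synthetic data set are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Divide the data into subset A (training) and subset B (held-out evaluation).
X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "naive Bayes": GaussianNB(),
}

# Train each candidate on A, score it on B, and pick the winner of the bake-off.
scores = {name: model.fit(X_a, y_a).score(X_b, y_b) for name, model in candidates.items()}
winner = max(scores, key=scores.get)
print("held-out accuracies:", scores, "-> winner:", winner)

# Retrain the winning model on the full data set before deployment.
final_model = candidates[winner].fit(X, y)
```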


Note that the accuracy of the bucket of models is no better than the accuracy of the best classifier for a particular data set.

