Data Mining: The Textbook




Instance-based methods: Weighted votes are used for the different classes, after determining the m nearest neighbors to a given test instance.

Thus, most classifiers can be adapted to the weighted case with relatively small changes. The advantage of weighting techniques is that they work with the original training data, and are therefore less prone to overfitting than sampling methods that manipulate the training data.
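To make the instance-based case concrete, the following is a minimal sketch of class-weighted nearest-neighbor voting. The function name, the toy data, and the specific weight values are illustrative assumptions, not part of the original text; in practice, the weights would be derived from the misclassification costs.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, class_weights, x_test, m=3):
    """Class-weighted m-nearest-neighbor vote for a single test instance."""
    # Euclidean distance from the test instance to every training point.
    dists = np.linalg.norm(X_train - x_test, axis=1)
    neighbors = np.argsort(dists)[:m]      # indices of the m nearest points
    votes = {}
    for i in neighbors:
        label = y_train[i]
        # Each neighbor contributes a vote scaled by its class weight.
        votes[label] = votes.get(label, 0.0) + class_weights[label]
    return max(votes, key=votes.get)

# Toy usage with hypothetical weights: the rare class (label 1) receives a
# large weight, so its single nearby instance outvotes two normal neighbors.
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [5.0, 5.0]])
y = np.array([0, 0, 0, 1])
print(weighted_knn_predict(X, y, {0: 1.0, 1: 10.0}, np.array([3.5, 3.5])))
```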


11.3.2 Sampling Methods

In adaptive resampling, the different classes are differentially sampled to enhance the impact of the rare class on the classification model. Sampling can be performed either with or without replacement. The rare class can be oversampled, the normal class can be undersampled, or both. The classification model is then learned on the resampled data. The sampling probabilities are typically chosen in proportion to the misclassification costs of the classes. This enhances the proportion of the rare class in the sample used for learning, and the approach is generally applicable to multiclass scenarios as well. It has generally been observed that undersampling the normal class has a number of advantages over oversampling the rare class. When undersampling is used, the sampled training data is much smaller than the original data set, which leads to better training efficiency.
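As an illustration, the following sketch resamples a training set with probabilities proportional to per-class misclassification costs. The cost values and the function name are hypothetical assumptions, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

def cost_proportional_sample(X, y, costs, n_samples):
    """Sample training instances with replacement, with probability
    proportional to the misclassification cost of each instance's class."""
    p = np.array([costs[label] for label in y], dtype=float)
    p /= p.sum()                           # normalize to a distribution
    idx = rng.choice(len(y), size=n_samples, replace=True, p=p)
    return X[idx], y[idx]

# Toy usage: a 1:99 rare-to-normal ratio with hypothetical costs 99:1
# yields a resample in which the two classes appear roughly equally often.
y = np.array([1] * 10 + [0] * 990)
X = np.arange(len(y), dtype=float).reshape(-1, 1)
_, ys = cost_proportional_sample(X, y, {0: 1.0, 1: 99.0}, n_samples=1000)
print((ys == 1).mean())                    # close to 0.5
```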


In some variations, all instances of the rare class are used in combination with a small sample of the normal class. This is also referred to as one-sided selection. The logic of this approach is that rare class instances are too valuable as training data to discard through any type of sampling (see the sketch following the list below).




Undersampling has several advantages over oversampling:





  1. The model construction phase for a smaller training data set requires much less time.




  2. The normal class is less important for modeling purposes, and all instances from the more valuable rare class are included for modeling. Therefore, the discarded instances do not impact the modeling effectiveness in a significant way.
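The sketch below illustrates one-sided selection as described above: it retains every rare-class instance and draws only a small random sample of the normal class without replacement. The function name and the 2% default are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def one_sided_selection(X, y, rare_label, normal_fraction=0.02):
    """Keep all rare-class instances plus a small random sample
    (without replacement) of the normal class."""
    rare = np.flatnonzero(y == rare_label)
    normal = np.flatnonzero(y != rare_label)
    n_keep = max(1, int(len(normal) * normal_fraction))
    keep = np.concatenate([rare, rng.choice(normal, size=n_keep, replace=False)])
    return X[keep], y[keep]
```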

11.3.2.1 Relationship Between Weighting and Sampling


Resampling methods can be understood as methods that sample the data in proportion to their weights, and then treat all examples equally. Therefore, the two methods are almost equivalent, although sampling methods have greater randomness associated with them. A direct weight-based technique is generally more reliable because of the absence of this randomness. On the other hand, sampling can be more naturally combined with ensemble methods (cf. Sect. 11.8), such as bagging, to improve accuracy. Furthermore, sampling has distinct efficiency advantages because it works with a much smaller data set. For example, for a data set with a rare-to-normal ratio of 1:99, a resampling technique can work effectively with 2% of the original data when the data is resampled into an equal mixture of the normal and anomalous classes. This kind of resampling translates to a performance improvement of a factor of 50.
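As one way such a combination with bagging might look, the following sketch trains an ensemble of decision trees, each on a balanced subsample that pairs all rare instances with an equal-sized random draw from the normal class. It assumes scikit-learn for the base classifier and binary 0/1 integer labels; the function names are hypothetical, and this is a sketch rather than the book's prescribed method.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)

def undersampled_bagging(X, y, rare_label, n_models=25):
    """Train each ensemble member on a balanced subsample that pairs
    all rare instances with an equal-sized draw from the normal class."""
    rare = np.flatnonzero(y == rare_label)
    normal = np.flatnonzero(y != rare_label)
    models = []
    for _ in range(n_models):
        sampled = rng.choice(normal, size=len(rare), replace=False)
        idx = np.concatenate([rare, sampled])
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def ensemble_predict(models, X):
    # Majority vote for binary 0/1 labels: average member predictions.
    votes = np.mean([m.predict(X) for m in models], axis=0)
    return (votes >= 0.5).astype(int)
```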


11.3.2.2 Synthetic Oversampling: SMOTE

One of the problems with oversampling the minority class is that a larger number of samples with replacement leads to repeated samples of the same data point. Repeated samples cause overfitting and reduce classification accuracy. In order to address this issue, a recent approach is to use synthetic oversampling, which creates synthetic examples without repetition.


The SMOTE approach works as follows. For each minority instance, its k nearest neighbors belonging to the same class are determined. Then, depending on the level of oversampling required, a fraction of them are chosen randomly. For each sampled example-neighbor pair, a synthetic data example is generated on the line segment connecting that minority example to its nearest neighbor. The exact position of the example is chosen uniformly at random along the line segment. These new minority examples are added to the training data, and the classifier is trained with the augmented data. The SMOTE algorithm is generally more accurate than a vanilla oversampling approach. It forces the decision region of the resampled data to become more general than one in which only members of the rare class from the original training data are oversampled.
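The following is a minimal sketch of this generation step for numeric features, assuming the minority instances are given as a matrix with more than k rows. How many synthetic points to generate per instance would in practice depend on the required oversampling level, and the function name is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote(X_min, n_synthetic, k=5):
    """Generate n_synthetic points by interpolating between minority
    instances and their k nearest same-class neighbors."""
    # Pairwise distances within the minority class only.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]      # k nearest same-class neighbors
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))       # a random minority instance
        j = nn[i, rng.integers(k)]         # one of its k neighbors
        t = rng.random()                   # uniform position on the segment
        synthetic.append(X_min[i] + t * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```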


11.4 Scalable Classification


In many applications, the training data sizes are rather large. This leads to numerous scalability challenges in building classification models. In such cases, the data will typically not fit in main memory, and the algorithms therefore need to be designed to optimize disk accesses. Although traditional decision-tree algorithms, such as C4.5, work well for smaller data sets, they are not optimized for disk-resident data. One solution is to sample the training data, but this has the disadvantage of losing the learning knowledge in the discarded training instances. Some classifiers, such as associative classifiers and



