nearest-neighbor methods, can be made faster by using more efficient subroutines for frequent pattern mining and nearest-neighbor indexing, respectively. Other classifiers, such as decision trees and support vector machines, require more careful redesign because they do not rely on any specific computationally intensive subroutine that can simply be replaced. These two classifiers are also particularly popular, and each is used widely across a variety of data domains. Therefore, this chapter will specifically focus on these two classifiers in the context of scalability. Streaming data creates an additional scalability challenge, although streaming algorithms are not discussed in this chapter. The discussion of streaming data is deferred to Chap. 12.
11.4.1 Scalable Decision Trees
The construction of a decision tree can be computationally expensive because the evaluation of a split criterion at a node can sometimes be very slow. In the following, we will discuss two well-known methods for scalable decision tree construction.
11.4.1.1 RainForest
The RainForest approach is based on the insight that the evaluation of the split criteria in univariate decision trees does not require access to the data in its multidimensional form. Because each attribute is analyzed independently in a univariate split, only the count statistics of the distinct attribute values need to be maintained over the different classes. Numeric attributes are assumed to be discretized into categorical values. These count statistics are collectively referred to as the AVC-set. The AVC-set is specific to a decision-tree node, and it provides the counts of the distinct values of an attribute, separated by class, over the data records relevant to that node. Therefore, the size of the AVC-set depends only on the number of distinct attribute values and the number of classes. This size is often extremely small in comparison to the number of data records. Therefore, the memory requirement depends on the dimensionality of the data, the number of distinct values per dimension, and the number of classes. The larger the base training data set, the greater the proportional savings.
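As a concrete illustration (not part of the original RainForest pseudocode), the following Python sketch computes the AVC-set of a single categorical attribute for a set of hypothetical records at a node:

from collections import defaultdict

def avc_set(attribute_values, class_labels):
    """Compute the AVC-set of one attribute at a decision-tree node.

    The AVC-set maps each distinct attribute value to the count of
    records in each class, which is all that a univariate split
    criterion needs.  Its size depends only on the number of distinct
    values and the number of classes, not on the number of records.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for value, label in zip(attribute_values, class_labels):
        counts[value][label] += 1
    return {value: dict(per_class) for value, per_class in counts.items()}

# Hypothetical records relevant to a node:
ages = ["young", "old", "young", "middle", "old", "young"]
labels = ["buy", "no", "no", "buy", "buy", "buy"]
print(avc_set(ages, labels))
# {'young': {'buy': 2, 'no': 1}, 'old': {'no': 1, 'buy': 1}, 'middle': {'buy': 1}}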
These AVC-sets are stored in main memory and used to evaluate the split criteria at the nodes efficiently. Splits are performed at the nodes as long as the corresponding AVC-sets fit in main memory. The data does need to be scanned when the AVC-sets are constructed for newly created nodes. By carefully interleaving the splits with the AVC-set construction, significant computational and disk-access savings can be achieved.
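For instance, a Gini-based criterion for a multiway categorical split can be evaluated directly from the in-memory AVC-set, without any further access to the underlying records. The sketch below assumes the avc_set dictionary structure from the previous example:

def gini_of_split(avc):
    """Weighted Gini index of a multiway split on one attribute,
    computed purely from its AVC-set (value -> class -> count)."""
    total = sum(sum(per_class.values()) for per_class in avc.values())
    weighted = 0.0
    for per_class in avc.values():
        n = sum(per_class.values())
        gini = 1.0 - sum((count / n) ** 2 for count in per_class.values())
        weighted += (n / total) * gini
    return weighted

# The attribute whose AVC-set yields the lowest weighted Gini would be
# chosen as the split attribute at this node:
# best = min(candidate_avc_sets, key=lambda a: gini_of_split(candidate_avc_sets[a]))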
11.4.1.2 BOAT
The Bootstrapped Optimistic Algorithm for Tree construction (BOAT) algorithm uses bootstrapped samples for decision-tree construction. In bootstrapping, the data is sampled with replacement to create b different bootstrapped samples. These are used to create b different trees denoted by T1 . . . Tb. It is then checked whether the choice of the split attribute and the splitting subset at a particular node is identical across the different bootstrapped trees. Nodes for which this is not the case are deleted, along with the corresponding subtrees. The bootstrapping is used to create an information-coarse splitting criterion, in which a confidence interval is imposed on the split point of the numeric attribute at each node. The width of this confidence interval can be controlled with the number of bootstrapped samples. At a later stage of the algorithm, the coarse splitting criterion is converted to an exact one by integrating the various confidence intervals of the splits into a crisp criterion. In effect, BOAT
uses the trees T1 . . . Tb to create a new tree that is very close to the one that would have been constructed if all of the data had been available. The BOAT algorithm is faster than RainForest, and it requires only two scans over the database. Furthermore, BOAT is also capable of performing incremental decision-tree induction and can handle tuple deletions.
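The following Python sketch illustrates only the bootstrapping and agreement-checking idea at the root node; it uses scikit-learn trees as a convenient stand-in rather than the exact BOAT construction, and the training arrays X, y and the number of samples b are hypothetical inputs:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def root_split_agreement(X, y, b=20, random_state=0):
    """Fit trees on b bootstrapped samples and inspect their root splits.

    X and y are numpy arrays of numeric features and class labels.
    If all bootstrapped trees agree on the root split attribute, the
    node is kept and the spread of the observed split points serves as
    a coarse confidence interval for the numeric split.
    """
    rng = np.random.RandomState(random_state)
    n = len(y)
    features, thresholds = [], []
    for _ in range(b):
        idx = rng.randint(0, n, size=n)            # sample with replacement
        tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
        features.append(tree.tree_.feature[0])     # split attribute at the root
        thresholds.append(tree.tree_.threshold[0]) # split point at the root
    if len(set(features)) == 1:                    # all trees agree on the attribute
        return features[0], (min(thresholds), max(thresholds))
    return None, None                              # node (and subtree) would be discarded

When the bootstrapped trees agree on the split attribute, the interval spanned by the observed split points plays the role of the coarse confidence interval that BOAT later refines into an exact, crisp split criterion.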
11.4.2 Scalable Support Vector Machines
A major problem with support vector machines is that the size of the optimization problem scales with the number of training data points, and that the memory requirements may scale with the square of the number of data points in the case of kernel-based support vector machines. For example, consider the optimization problem for SVM discussed in Sect. 10.6 of Chap. 10. The kernel-based Lagrangian dual of the problem, as adapted from Eq. 10.62 in Chap. 10, may be written as follows:
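In standard soft-margin notation, with Lagrange multipliers λ1 . . . λn, class labels yi ∈ {−1, +1} for the n training points, slack penalty C, and kernel function K(·, ·), the maximization problem is:

\[
\max_{\lambda_1, \ldots, \lambda_n} \; L_D = \sum_{i=1}^{n} \lambda_i \; - \; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j K(\overline{X_i}, \overline{X_j}), \qquad \text{subject to } 0 \leq \lambda_i \leq C \;\; \forall i \in \{1, \ldots, n\}.
\]

The quadratic number of kernel terms K(Xi, Xj) in the objective is precisely what causes the memory requirement to scale with the square of the number of training points.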