of a hill-climbing approach. In this sense, the SVM model has greater sophistication than the basic perceptron model by using the maximum margin principle to better focus on the more important decision boundary region. Furthermore, the generalization power of neural networks can be improved by using a (weighted) regularization penalty term λ||W||²/2 in the objective function. Note that this regularization term is similar to the maximum margin term in SVMs. The maximum margin term is, in fact, also referred to as the regularization term for SVMs. Variations of SVMs exist, in which the maximum margin term is replaced with an L1 penalty Σ_{i=1}^{d} |w_i|. In such cases, the regularization interpretation is more natural than a margin-based interpretation. Furthermore, certain forms of the slack term in SVMs (e.g., quadratic slack) are similar to the main objective function in other linear models (e.g., least-squares models). The main difference is that the slack term is computed from the margin separators in SVMs rather than the decision boundary. This is consistent with the philosophy of SVMs, which discourages training data points not only from being on the wrong side of the decision boundary, but also from being close to it. Therefore, various linear models share a number of conceptual similarities, but they emphasize different aspects of optimization. This is also the reason that maximum margin models are generally more robust to noise than linear models that use only distance-based penalties to reduce the number of data points on the wrong side of the separating hyperplanes. It has experimentally been observed that neural networks are sensitive to noise. On the other hand, multilayer neural networks can approximate virtually any complex function in principle.
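To make the analogy concrete, the following sketch evaluates the hinge-loss slack together with the two regularization variants mentioned above: the maximum-margin term λ||W||²/2 and the L1 penalty Σ|w_i|. The array shapes, variable names, and the regularization weight lam are illustrative assumptions and are not taken from the text.

```python
import numpy as np

def svm_objectives(X, y, W, lam):
    """Compute L2- and L1-regularized hinge-loss objectives for a linear
    separator W on data X (n x d) with labels y in {-1, +1}.
    Illustrative sketch only; lam and the variable names are assumptions."""
    margins = y * X.dot(W)                   # signed margins y_i (W . X_i)
    hinge = np.maximum(0.0, 1.0 - margins)   # slack: points inside or beyond the margin
    slack = hinge.sum()
    obj_l2 = slack + lam * 0.5 * np.dot(W, W)   # maximum-margin (L2) regularization
    obj_l1 = slack + lam * np.abs(W).sum()      # L1-penalty variant
    return obj_l2, obj_l1
```

Both objectives share the same slack term; only the regularization component differs, which is the sense in which the maximum margin term acts as a regularizer.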
10.8 Instance-Based Learning
Most of the classifiers discussed in the previous sections are eager learners in which the classification model is constructed up front and then used to classify a specific test instance. In instance-based learning, the training is delayed until the last step of classification. Such classifiers are also referred to as lazy learners. The simplest principle to describe instance-based learning is as follows:
Similar instances have similar class labels.
A natural approach for leveraging this general principle is to use nearest-neighbor classifiers. For a given test instance, the closest m training examples are determined. The dominant label among these m training examples is reported as the relevant class. In some variations of the model, an inverse distance-weighted scheme is used to account for the varying importance of the m training instances that are closest to the test instance. An example of such an inverse weight function of the distance δ is f(δ) = e^(−δ²/t²), where t is a user-defined parameter. Here, δ is the distance of the training point to the test instance. This weight is used as a vote, and the class with the largest vote is reported as the relevant label.
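A minimal sketch of this distance-weighted scheme is shown below. The function name and arguments are hypothetical, and the Euclidean distance is assumed for δ.

```python
import numpy as np
from collections import defaultdict

def knn_predict(train_X, train_y, test_x, m, t):
    """Distance-weighted m-nearest-neighbor prediction for one test instance,
    using the weight f(delta) = exp(-delta^2 / t^2).
    Sketch only: assumes numeric features and Euclidean distance."""
    dists = np.linalg.norm(train_X - test_x, axis=1)   # delta for every training point
    nearest = np.argsort(dists)[:m]                    # indices of the m closest points
    votes = defaultdict(float)
    for i in nearest:
        votes[train_y[i]] += np.exp(-dists[i] ** 2 / t ** 2)  # weighted vote
    return max(votes, key=votes.get)                   # label with the largest vote
```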
If desired, a nearest-neighbor index may be constructed up front, to enable more efficient retrieval of instances. The major challenge with the use of the nearest-neighbor classifier is the choice of the parameter m. In general, a very small value of m will not lead to robust classification results because of noisy variations within the data. On the other hand, large values of m will lose sensitivity to the underlying data locality. In practice, the appropriate value of m is chosen in a heuristic way. A common approach is to test different values of m for accuracy over the training data. While computing the m-nearest neighbors of a
training instance X, the data point X itself is not included among the nearest neighbors. A similar approach can be used to learn the value of t in the distance-weighted scheme.
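The following sketch illustrates this heuristic with leave-one-out evaluation over the training data, reusing the hypothetical knn_predict function from the earlier sketch; the candidate values of m and the parameter t are assumptions.

```python
import numpy as np

def select_m(train_X, train_y, candidate_ms, t):
    """Pick m by leave-one-out accuracy on the training data: when classifying
    training instance i, that instance is excluded from its own neighbor list.
    Assumes train_X is a NumPy array and knn_predict is defined as above."""
    n = len(train_y)
    best_m, best_acc = None, -1.0
    for m in candidate_ms:                             # e.g., [1, 3, 5, 7]
        correct = 0
        for i in range(n):
            mask = np.arange(n) != i                   # exclude the point itself
            pred = knn_predict(train_X[mask], np.asarray(train_y)[mask],
                               train_X[i], m, t)
            correct += (pred == train_y[i])
        acc = correct / n
        if acc > best_acc:
            best_m, best_acc = m, acc
    return best_m
```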
10.8.1 Design Variations of Nearest Neighbor Classifiers
A number of design variations of nearest-neighbor classifiers are able to achieve more effective classification results. This is because the Euclidean function is usually not the most effective distance metric in terms of its sensitivity to feature and class distribution. The reader is advised to review Chap. 3 on distance function design. Both unsupervised and supervised distance design methods can typically provide more effective classification results. Instead of using the Euclidean distance metric, the distance between two d-dimensional points X and Y is defined with respect to a d × d matrix A as follows:
Dist(X, Y) = √[(X − Y) A (X − Y)^T]    (10.71)