[Figure: positive-class and negative-class points lie on either side of the hyperplane defined by $\theta_0 + \sum_{i=1}^{d} \theta_i x_i = 0$; the quantity $(\theta_0 + \sum_i \theta_i x_i)$ is positive on the positive-class side, negative on the negative-class side, and proportional in magnitude to the distance from the hyperplane.]
Figure 10.6: Illustration of logistic regression in terms of linear separators
Logistic regression achieves the same result in the form of probabilities defined by the aforementioned discriminative function.
The term $\theta_0 + \sum_{i=1}^{d} \theta_i x_i$ within the exponent of the logistic function is proportional to the distance of the data point from the separating hyperplane. When the data point lies exactly on this hyperplane, both classes are assigned the probability of 0.5 according to the logistic function. Positive values of the distance will assign probability values greater than 0.5 to the positive class. Negative values of the distance will assign (symmetrically equal) probability values greater than 0.5 to the negative class. This scenario is illustrated in Fig. 10.6. Therefore, the logistic function neatly exponentiates the distances, shown in Fig. 10.6, to convert them to intuitively interpretable probabilities in (0, 1). The setup of logistic regression is similar to classical least-squares linear regression, with the difference that the logit function is used to estimate probabilities of class membership instead of constructing a squared-error objective. Consequently, instead of the least-squares optimization in linear regression, a maximum likelihood optimization model is used for logistic regression.
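As a concrete illustration of this probabilistic interpretation, the following Python sketch (not part of the original text; the helper name positive_class_probability and the convention that theta[0] stores $\theta_0$ are illustrative assumptions) converts the signed distance $\theta_0 + \sum_{i=1}^{d} \theta_i x_i$ into the probability of the positive class. A point lying exactly on the hyperplane receives a probability of 0.5.

import numpy as np

def positive_class_probability(theta, x):
    # theta has length d + 1; theta[0] is the bias term theta_0.
    # The linear term below is proportional to the signed distance of x
    # from the separating hyperplane theta_0 + sum_i theta_i * x_i = 0.
    distance = theta[0] + np.dot(theta[1:], x)
    return 1.0 / (1.0 + np.exp(-distance))

# Example: a point lying on the hyperplane gets probability 0.5.
print(positive_class_probability(np.array([0.0, 1.0, -1.0]), np.array([2.0, 2.0])))  # 0.5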
10.5.2.1 Training a Logistic Regression Classifier
The maximum likelihood approach is used to estimate the best fitting parameters of the logistic regression model. Let D+ and D− be the segments of the training data belonging
to the positive and negative classes, respectively. Let the kth data point be denoted by
$\overline{X_k} = (x_{k1} \ldots x_{kd})$. Then, the likelihood function $\mathcal{L}(\Theta)$ for the entire data set is defined as follows:

$$\mathcal{L}(\Theta) = \prod_{\overline{X_k} \in D_+} \frac{1}{1 + e^{-(\theta_0 + \sum_{i=1}^{d} \theta_i x_{ki})}} \; \prod_{\overline{X_k} \in D_-} \frac{1}{1 + e^{(\theta_0 + \sum_{i=1}^{d} \theta_i x_{ki})}}. \qquad (10.28)$$
This likelihood function is the product of the probabilities of all the training examples taking on their assigned labels according to the logistic model. The goal is to maximize this function to determine the optimal value of the parameter vector Θ. For numerical convenience, the log-likelihood is used to yield the following:
$$\mathcal{LL}(\Theta) = \log(\mathcal{L}(\Theta)) = -\sum_{\overline{X_k} \in D_+} \log\!\left(1 + e^{-(\theta_0 + \sum_{i=1}^{d} \theta_i x_{ki})}\right) - \sum_{\overline{X_k} \in D_-} \log\!\left(1 + e^{(\theta_0 + \sum_{i=1}^{d} \theta_i x_{ki})}\right). \qquad (10.29)$$
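A minimal sketch of evaluating this log-likelihood with NumPy is given below; the function name and the convention that theta[0] holds $\theta_0$ are illustrative assumptions rather than part of the original text, and np.log1p is used for numerical stability.

import numpy as np

def log_likelihood(theta, X_pos, X_neg):
    # X_pos and X_neg hold the rows of D+ and D-, with shapes (n_pos, d) and (n_neg, d).
    # theta has length d + 1, with theta[0] playing the role of theta_0.
    z_pos = theta[0] + X_pos @ theta[1:]   # theta_0 + sum_i theta_i * x_ki for D+
    z_neg = theta[0] + X_neg @ theta[1:]   # theta_0 + sum_i theta_i * x_ki for D-
    return -np.sum(np.log1p(np.exp(-z_pos))) - np.sum(np.log1p(np.exp(z_neg)))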
There is no closed-form solution for optimizing the aforementioned expression with respect to the vector Θ. Therefore, a natural approach is to use a gradient ascent method to determine the optimal value of the parameter vector Θ iteratively. The gradient vector is obtained by differentiating the log-likelihood function with respect to each of the parameters:
$$\nabla \mathcal{LL}(\Theta) = \left( \frac{\partial \mathcal{LL}(\Theta)}{\partial \theta_0}, \ldots, \frac{\partial \mathcal{LL}(\Theta)}{\partial \theta_d} \right). \qquad (10.30)$$
It is instructive to examine the ith component of the aforementioned gradient (see footnote 4), for i > 0. By computing the partial derivative of both sides of Eq. 10.29 with respect to θi, the following can be obtained:
$$\frac{\partial \mathcal{LL}(\Theta)}{\partial \theta_i} = \sum_{\overline{X_k} \in D_+} \frac{x_{ki}}{1 + e^{(\theta_0 + \sum_{j=1}^{d} \theta_j x_{kj})}} - \sum_{\overline{X_k} \in D_-} \frac{x_{ki}}{1 + e^{-(\theta_0 + \sum_{j=1}^{d} \theta_j x_{kj})}} \qquad (10.31)$$
$$= \sum_{\overline{X_k} \in D_+} P(\overline{X_k} \in D_-)\, x_{ki} - \sum_{\overline{X_k} \in D_-} P(\overline{X_k} \in D_+)\, x_{ki} \qquad (10.32)$$
$$= \sum_{\overline{X_k} \in D_+} P(\text{Mistake on } \overline{X_k})\, x_{ki} - \sum_{\overline{X_k} \in D_-} P(\text{Mistake on } \overline{X_k})\, x_{ki}. \qquad (10.33)$$
It is interesting to note that the terms $P(\overline{X_k} \in D_-)$ and $P(\overline{X_k} \in D_+)$ represent the probability of an incorrect prediction of $\overline{X_k}$ for points in the positive and negative classes, respectively. Thus, the mistakes of the current model are used to identify the steepest-ascent directions. This behavior is shared by many linear models, such as neural networks, which are also referred to as mistake-driven methods. In addition, the multiplicative factor $x_{ki}$ impacts the magnitude of the ith component of the gradient direction contributed by $\overline{X_k}$. Therefore, the update condition for $\theta_i$ is as follows:
$$\theta_i \leftarrow \theta_i + \alpha \left( \sum_{\overline{X_k} \in D_+} P(\overline{X_k} \in D_-)\, x_{ki} - \sum_{\overline{X_k} \in D_-} P(\overline{X_k} \in D_+)\, x_{ki} \right). \qquad (10.34)$$
The value of α is the step size, which can be determined by using binary search to maximize the improvement in the objective function value. The aforementioned equation uses a batch ascent method, wherein all the training data points contribute to the gradient in a single update step. In practice, it is possible to cycle through the data points one by one for the update process. It can be shown that the likelihood function is concave. Therefore, a global optimum will be found by the gradient ascent method. A number of regularization methods are also used to reduce overfitting. A typical example of a regularization term, which is added
to the log-likelihood function $\mathcal{LL}(\Theta)$ is $-\lambda \sum_{i=1}^{d} \theta_i^2 / 2$, where $\lambda$ is the balancing parameter.
The only resulting change to the gradient update is that the term $-\lambda \theta_i$ needs to be added to the ith gradient component for i ≥ 1.
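The following Python sketch pulls the pieces of this subsection together: a batch gradient-ascent update of Θ based on the mistake probabilities of Eqs. 10.32–10.34, with the optional $-\lambda \theta_i$ regularization correction for i ≥ 1. The function name, the fixed iteration count, and the fixed step size α are illustrative simplifications (the text suggests choosing α by binary search), not part of the original description.

import numpy as np

def train_logistic_regression(X, y, alpha=0.1, lam=0.0, n_iter=1000):
    # X: (n, d) training matrix; y: labels in {-1, +1}.
    # Returns theta of length d + 1, with theta[0] acting as theta_0.
    n, d = X.shape
    theta = np.zeros(d + 1)
    Xb = np.hstack([np.ones((n, 1)), X])      # prepend a constant 1 for theta_0
    for _ in range(n_iter):
        z = Xb @ theta                        # signed distances from the hyperplane
        # Probability that the current model mislabels each point:
        # P(Xk in D-) for positive points and P(Xk in D+) for negative points.
        p_mistake = 1.0 / (1.0 + np.exp(y * z))
        grad = Xb.T @ (y * p_mistake)         # mistake-driven gradient of Eq. 10.33
        grad[1:] -= lam * theta[1:]           # regularization term -lambda * theta_i, i >= 1
        theta += alpha * grad
    return theta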
10.5.2.2 Relationship with Other Linear Models
Although the logistic regression method is a probabilistic method, it is also a special case of a broader class of generalized linear models (cf. Sect. 11.5.3 of Chap. 11). There are many ways of formulating a linear model. For example, instead of using a logistic function to set
Footnote 4: For the case where i = 0, the value of $x_{ki}$ is replaced by 1.
up a likelihood criterion, one might directly optimize the squared error of the prediction. In other words, if the class label for $\overline{X_k}$ is $y_k \in \{-1, +1\}$, one might simply attempt to optimize the squared error $\sum_{\overline{X_k} \in D} \left( y_k - \text{sign}\!\left(\theta_0 + \sum_{i=1}^{d} \theta_i x_{ki}\right) \right)^2$ over all training instances. Here,
the function “sign” evaluates to +1 or −1, depending on whether its argument is positive or negative. As will be evident in Sect. 10.7, such a model is (approximately) used by neural networks. Similarly, Fisher’s linear discriminant, which was discussed at the beginning of this chapter, is also a linear least-squares model (cf. Sect. 11.5.1.1 of Chap. 11), but with a different coding of the class variable. In the next section, a linear model that uses the maximum margin principle to separate the two classes will be discussed.
10.6 Support Vector Machines
Support vector machines (SVMs) are naturally defined for binary classification of numeric data. The binary-class problem can be generalized to the multiclass case by using a variety of tricks discussed in Sect. 11.2 of Chap. 11. Categorical feature variables can also be addressed by transforming categorical attributes to binary data with the binarization approach discussed in Chap. 2.
It is assumed that the class labels are drawn from {−1, 1}. As with all linear models, SVMs use separating hyperplanes as the decision boundary between the two classes. In the case of SVMs, the optimization problem of determining these hyperplanes is set up with the notion of margin. Intuitively, a maximum margin hyperplane is one that cleanly separates the two classes, and for which a large region (or margin) exists on each side of the boundary with no training data points in it. To understand this concept, the very special case where the data is linearly separable will be discussed first. In linearly separable data, it is possible to construct a linear hyperplane that cleanly separates data points belonging to the two classes. Of course, this special case is relatively unusual because real data is rarely fully separable, and at least a few data points, such as mislabeled data points or outliers, will violate linear separability. Nevertheless, the linearly separable formulation is crucial in understanding the important principle of maximum margin. After discussing the linearly separable case, the modifications to the formulation required to enable more general (and realistic) scenarios will be addressed.
10.6.1 Support Vector Machines for Linearly Separable Data
This section will introduce the use of the maximum margin principle in linearly separable data. When the data is linearly separable, there are an infinite number of possible ways of constructing a linear separating hyperplane between the classes. Two examples of such hyperplanes are illustrated in Fig. 10.7a as hyperplane 1 and hyperplane 2. Which of these hyperplanes is better? To understand this, consider the test instance (marked by a square), which is clearly much closer to class A than to class B. Hyperplane 1 will correctly classify it to class A, whereas hyperplane 2 will incorrectly classify it to class B.
The reason for the varying performance of the two classifiers is that the test instance lies in a noisy and uncertain boundary region between the two classes, which is not easily generalizable from the available training data. In other words, there are few training data points in this uncertain region that closely resemble the test instance. In such cases, a separating hyperplane like hyperplane 1, whose minimum perpendicular distance to training points from both classes is as large as possible, is the most robust one for correct classification. This distance can be quantified using the margin of the hyperplane.
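As a simple quantification of this idea, the sketch below (an illustrative helper, not part of the original text) computes the margin of a candidate hyperplane with coefficient vector w and bias b as the minimum perpendicular distance of any training point to the hyperplane $w \cdot x + b = 0$; the maximum margin hyperplane discussed next is the separating hyperplane that maximizes this quantity.

import numpy as np

def margin(w, b, X):
    # Minimum perpendicular distance of the rows of X to the hyperplane w . x + b = 0.
    return np.min(np.abs(X @ w + b)) / np.linalg.norm(w)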