Data Mining: The Textbook




[Figure 10.6 appears here. It shows the two classes separated by the hyperplane defined by $\theta_0 + \sum_{i=1}^{d} \theta_i x_i = 0$; the quantity $(\theta_0 + \sum_i \theta_i x_i)$ is proportional to the distance from this hyperplane, positive on the positive-class side and negative on the negative-class side.]

Figure 10.6: Illustration of logistic regression in terms of linear separators


Logistic regression achieves the same result in the form of probabilities defined by the aforementioned discriminative function.


The term $\theta_0 + \sum_{i=1}^{d} \theta_i x_i$, within the exponent of the logistic function, is proportional to the distance of the data point from the separating hyperplane. When the data point lies exactly on this hyperplane, both classes are assigned the probability of 0.5 according to the logistic function. Positive values of the distance will assign probability values greater than 0.5 to the positive class. Negative values of the distance will assign (symmetrically equal) probability values greater than 0.5 to the negative class. This scenario is illustrated in Fig. 10.6. Therefore, the logistic function neatly exponentiates the distances, shown in Fig. 10.6, to convert them to intuitively interpretable probabilities in (0, 1). The setup of logistic regression is similar to classical least-squares linear regression, with the difference that the logit function is used to estimate probabilities of class membership instead of constructing a squared-error objective. Consequently, instead of the least-squares optimization in linear regression, a maximum likelihood optimization model is used for logistic regression.
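As a concrete illustration of this mapping from signed distance to probability, the short Python sketch below (illustrative only; the function and variable names are not from the text) evaluates the logistic function for a single data point.

```python
import numpy as np

def class_probabilities(theta0, theta, x):
    """Class probabilities implied by logistic regression for one point x.

    theta0 : float            -- the bias parameter theta_0
    theta  : shape-(d,) array -- the coefficients theta_1 ... theta_d
    x      : shape-(d,) array -- the data point (x_1 ... x_d)
    """
    # theta0 + sum_i theta_i * x_i is proportional to the signed distance
    # of x from the separating hyperplane.
    z = theta0 + np.dot(theta, x)
    p_positive = 1.0 / (1.0 + np.exp(-z))   # greater than 0.5 on the positive side
    return p_positive, 1.0 - p_positive

# A point lying exactly on the hyperplane (z = 0) receives probability 0.5 for both classes.
print(class_probabilities(0.0, np.array([1.0, -2.0]), np.array([2.0, 1.0])))
```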
10.5.2.1 Training a Logistic Regression Classifier

The maximum likelihood approach is used to estimate the best-fitting parameters of the logistic regression model. Let D+ and D− be the segments of the training data belonging to the positive and negative classes, respectively. Let the kth data point be denoted by Xk = (xk1 . . . xkd). Then, the likelihood function L(Θ) for the entire data set is defined as follows:

$$\mathcal{L}(\Theta) = \prod_{\overline{X_k} \in D_+} \frac{1}{1 + e^{-(\theta_0 + \sum_{i=1}^{d} \theta_i x_{ki})}} \cdot \prod_{\overline{X_k} \in D_-} \frac{1}{1 + e^{(\theta_0 + \sum_{i=1}^{d} \theta_i x_{ki})}} \qquad (10.28)$$
This likelihood function is the product of the probabilities of all the training examples taking on their assigned labels according to the logistic model. The goal is to maximize this function to determine the optimal value of the parameter vector Θ. For numerical convenience, the log-likelihood is used to yield the following:

$$LL(\Theta) = \log(\mathcal{L}(\Theta)) = -\sum_{\overline{X_k} \in D_+} \log\!\left(1 + e^{-(\theta_0 + \sum_{i=1}^{d} \theta_i x_{ki})}\right) - \sum_{\overline{X_k} \in D_-} \log\!\left(1 + e^{(\theta_0 + \sum_{i=1}^{d} \theta_i x_{ki})}\right) \qquad (10.29)$$
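A minimal sketch of Eq. 10.29 in Python follows; it is an illustrative implementation rather than code from the textbook, and it uses np.logaddexp(0, z) as a numerically stable way of computing log(1 + e^z).

```python
import numpy as np

def log_likelihood(theta0, theta, X_pos, X_neg):
    """Log-likelihood LL(Theta) of Eq. 10.29.

    X_pos : shape-(n_pos, d) array of positive-class training points
    X_neg : shape-(n_neg, d) array of negative-class training points
    """
    z_pos = theta0 + X_pos @ theta   # theta_0 + sum_i theta_i * x_ki for each positive point
    z_neg = theta0 + X_neg @ theta
    # log(1 + e^{-z}) over D+ and log(1 + e^{z}) over D-, summed and negated.
    return -np.sum(np.logaddexp(0.0, -z_pos)) - np.sum(np.logaddexp(0.0, z_neg))
```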


There is no closed-form solution for optimizing the aforementioned expression with respect to the vector Θ. Therefore, a natural approach is to use a gradient ascent method to determine the optimal value of the parameter vector Θ iteratively. The gradient vector is obtained by differentiating the log-likelihood function with respect to each of the parameters:

$$\nabla LL(\Theta) = \left( \frac{\partial LL(\Theta)}{\partial \theta_0} \;\ldots\; \frac{\partial LL(\Theta)}{\partial \theta_d} \right) \qquad (10.30)$$

It is instructive to examine the ith component⁴ of the aforementioned gradient, for i > 0. By computing the partial derivative of both sides of Eq. 10.29 with respect to θi, the following can be obtained:

$$\frac{\partial LL(\Theta)}{\partial \theta_i} = \sum_{\overline{X_k} \in D_+} \frac{x_{ki}}{1 + e^{(\theta_0 + \sum_{j=1}^{d} \theta_j x_{kj})}} \; - \sum_{\overline{X_k} \in D_-} \frac{x_{ki}}{1 + e^{-(\theta_0 + \sum_{j=1}^{d} \theta_j x_{kj})}} \qquad (10.31)$$

$$= \sum_{\overline{X_k} \in D_+} P(\overline{X_k} \in D_-)\, x_{ki} \; - \sum_{\overline{X_k} \in D_-} P(\overline{X_k} \in D_+)\, x_{ki} \qquad (10.32)$$

$$= \sum_{\overline{X_k} \in D_+} P(\text{Mistake on } \overline{X_k})\, x_{ki} \; - \sum_{\overline{X_k} \in D_-} P(\text{Mistake on } \overline{X_k})\, x_{ki} \qquad (10.33)$$




It is interesting to note that the terms P(Xk ∈ D−) and P(Xk ∈ D+) represent the probability of an incorrect prediction of Xk in the positive and negative classes, respectively. Thus, the mistakes of the current model are used to identify the steepest-ascent directions. This approach is generally true of many linear models, such as neural networks, which are also referred to as mistake-driven methods. In addition, the multiplicative factor xki impacts the magnitude of the ith component of the gradient direction contributed by Xk. Therefore, the update condition for θi is as follows:

$$\theta_i \Leftarrow \theta_i + \alpha \left( \sum_{\overline{X_k} \in D_+} P(\overline{X_k} \in D_-)\, x_{ki} \; - \sum_{\overline{X_k} \in D_-} P(\overline{X_k} \in D_+)\, x_{ki} \right) \qquad (10.34)$$
The value of α is the step size, which can be determined by using binary search to maximize the improvement in the objective function value. The aforementioned equation uses a batch ascent method, wherein all the training data points contribute to the gradient in a single update step. In practice, it is possible to cycle through the data points one by one for the update process. It can be shown that the likelihood function is concave. Therefore, a global optimum will be found by the gradient ascent method. A number of regularization methods are also used to reduce overfitting. A typical example of a regularization term, which is added to the log-likelihood function LL(Θ), is $-\lambda \sum_{i=1}^{d} \theta_i^2 / 2$, where λ is the balancing parameter. The only difference to the gradient update is that the term −λθi needs to be added to the ith gradient component for i ≥ 1.
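The following Python sketch puts these pieces together: it performs the batch gradient-ascent update of Eq. 10.34 using the mistake probabilities of Eq. 10.33, together with the −λθi regularization term for i ≥ 1. It is an illustrative implementation rather than the book's own code; in particular, the fixed step size alpha and the iteration count are arbitrary simplifications of the line-search strategy mentioned above.

```python
import numpy as np

def train_logistic(X_pos, X_neg, alpha=0.01, lam=0.1, n_iter=1000):
    """Batch gradient ascent for regularized logistic regression (Eqs. 10.33-10.34)."""
    d = X_pos.shape[1]
    theta0, theta = 0.0, np.zeros(d)

    def p_positive(X):
        # Model probability that each row of X belongs to the positive class.
        return 1.0 / (1.0 + np.exp(-(theta0 + X @ theta)))

    for _ in range(n_iter):
        mistake_pos = 1.0 - p_positive(X_pos)   # P(mistake) for positive-class points
        mistake_neg = p_positive(X_neg)         # P(mistake) for negative-class points
        # Gradient of Eq. 10.33; the bias uses x_k0 = 1, and only theta_1..theta_d
        # receive the -lambda * theta_i regularization term.
        grad0 = mistake_pos.sum() - mistake_neg.sum()
        grad = X_pos.T @ mistake_pos - X_neg.T @ mistake_neg - lam * theta
        theta0 += alpha * grad0
        theta += alpha * grad
    return theta0, theta
```

Because the log-likelihood is concave, this batch procedure approaches the global optimum for a sufficiently small step size; cycling through the data points one at a time, as mentioned above, yields the stochastic variant of the same update.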

10.5.2.2 Relationship with Other Linear Models


Although the logistic regression method is a probabilistic method, it is also a special case of a broader class of generalized linear models (cf. Sect. 11.5.3 of Chap. 11). There are many ways of formulating a linear model. For example, instead of using a logistic function to set up a likelihood criterion, one might directly optimize the squared error of the prediction. In other words, if the class label for Xk is yk ∈ {−1, +1}, one might simply attempt to optimize the squared error $\sum_{\overline{X_k} \in D} \left(y_k - \text{sign}(\theta_0 + \sum_{i=1}^{d} \theta_i x_{ki})\right)^2$ over all training instances. Here, the function "sign" evaluates to +1 or −1, depending on whether its argument is positive or negative. As will be evident in Sect. 10.7, such a model is (approximately) used by neural networks. Similarly, Fisher's linear discriminant, which was discussed at the beginning of this chapter, is also a linear least-squares model (cf. Sect. 11.5.1.1 of Chap. 11), but with a different coding of the class variable. In the next section, a linear model that uses the maximum margin principle to separate the two classes will be discussed.

⁴ For the case where i = 0, the value of xki is replaced by 1.

10.6 Support Vector Machines




Support vector machines (SVMs) are naturally defined for binary classification of numeric data. The binary-class problem can be generalized to the multiclass case by using a variety of tricks discussed in Sect. 11.2 of Chap. 11. Categorical feature variables can also be addressed by transforming categorical attributes to binary data with the binarization approach discussed in Chap. 2.

It is assumed that the class labels are drawn from {−1, 1}. As with all linear models, SVMs use separating hyperplanes as the decision boundary between the two classes. In the case of SVMs, the optimization problem of determining these hyperplanes is set up using the notion of margin. Intuitively, a maximum margin hyperplane is one that cleanly separates the two classes, and for which a large region (or margin) exists on each side of the boundary with no training data points in it. To understand this concept, the very special case where the data is linearly separable will be discussed first. In linearly separable data, it is possible to construct a linear hyperplane that cleanly separates data points belonging to the two classes. Of course, this special case is relatively unusual because real data is rarely fully separable: at least a few data points, such as mislabeled data points or outliers, will typically violate linear separability. Nevertheless, the linearly separable formulation is crucial for understanding the important principle of maximum margin. After discussing the linearly separable case, the modifications to the formulation required to enable more general (and realistic) scenarios will be addressed.


10.6.1 Support Vector Machines for Linearly Separable Data

This section introduces the use of the maximum margin principle for linearly separable data. When the data is linearly separable, there are an infinite number of possible ways of constructing a linear separating hyperplane between the classes. Two examples of such hyperplanes are illustrated in Fig. 10.7a as hyperplane 1 and hyperplane 2. Which of these hyperplanes is better? To understand this, consider the test instance (marked by a square), which is very obviously much closer to class A than class B. Hyperplane 1 will correctly classify it to class A, whereas hyperplane 2 will incorrectly classify it to class B.


The reason for the varying performance of the two classifiers is that the test instance is placed in a noisy and uncertain boundary region between the two classes, which is not easily generalizable from the available training data. In other words, there are few training data points in this uncertain region that are quite like the test instance. In such cases, a separating hyperplane like hyperplane 1, whose minimum perpendicular distance to training points from both classes is as large as possible, is the most robust one for correct classification. This distance can be quantified using the margin of the hyperplane.
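Although the formal optimization model is developed later in this section, the quantity being maximized can already be made concrete. The sketch below is illustrative only; the coefficient names w and b are introduced here for convenience rather than taken from the text above. It computes the minimum perpendicular distance of the training points from a candidate separating hyperplane w · x + b = 0.

```python
import numpy as np

def minimum_distance(w, b, X):
    """Smallest perpendicular distance from the rows of X to the hyperplane w.x + b = 0."""
    # The perpendicular distance of a point x from the hyperplane is |w.x + b| / ||w||.
    return np.min(np.abs(X @ w + b)) / np.linalg.norm(w)

# Of two separating hyperplanes, the one with the larger value of this quantity
# (such as hyperplane 1 in Fig. 10.7a) is the more robust choice near the boundary.
```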




