Data Mining: The Textbook




\[
P(C = c \mid x_1 = a_1, \ldots, x_d = a_d) = \frac{P(C = c)\, P(x_1 = a_1, \ldots, x_d = a_d \mid C = c)}{P(x_1 = a_1, \ldots, x_d = a_d)} \tag{10.18}
\]

\[
\propto P(C = c)\, P(x_1 = a_1, \ldots, x_d = a_d \mid C = c). \tag{10.19}
\]




The second relationship above is based on the fact that the term P(x1 = a1, . . . , xd = ad) in the denominator of the first relationship is independent of the class. Therefore, it suffices to only compute the numerator to determine the class with the maximum conditional probability. The value of P(C = c) is the prior probability of the class identifier c and can be estimated as the fraction of the training data points belonging to class c. The key usefulness of the Bayes rule is that the terms on the right-hand side can now be effectively approximated from the training data with the use of a naive Bayes approximation. The naive Bayes approximation assumes that the values on the different attributes x1 . . . xd are independent of one another conditional on the class. When two random events A and B are independent of one another conditional on a third event F, it follows that P(A ∩ B|F) = P(A|F)P(B|F). In the case of the naive Bayes approximation, it is assumed that the feature values are independent of one another conditional on a fixed value of the class variable. This implies the following for the conditional term on the right-hand side of Eq. 10.19.





\[
P(x_1 = a_1, \ldots, x_d = a_d \mid C = c) = \prod_{j=1}^{d} P(x_j = a_j \mid C = c) \tag{10.20}
\]

Therefore, by substituting Eq. 10.20 in Eq. 10.19, the Bayes probability can be estimated within a constant of proportionality as follows:





\[
P(C = c \mid x_1 = a_1, \ldots, x_d = a_d) \propto P(C = c) \prod_{j=1}^{d} P(x_j = a_j \mid C = c). \tag{10.21}
\]

Note that each term P(xj = aj |C = c) is much easier to estimate from the training data than P(x1 = a1, . . . , xd = ad|C = c) because enough training examples will exist in the former case to provide a robust estimate. Specifically, the maximum likelihood estimate for the value of P(xj = aj |C = c) is the fraction of training examples taking on value aj, conditional on the fact that they belong to class c. In other words, if q(aj, c) is the number of training examples corresponding to feature variable xj = aj and class c, and r(c) is the number of training examples belonging to class c, then the estimation is performed as follows:



\[
P(x_j = a_j \mid C = c) = \frac{q(a_j, c)}{r(c)}. \tag{10.22}
\]




In some cases, enough training examples may still not be available to estimate these values robustly. For example, consider a rare class c with a single training example satisfying r(c) = 1, and q(aj, c) = 0. In such a case, the conditional probability is estimated to be 0. Because of the productwise form of the Bayes expression, the entire probability will be estimated to be 0. Clearly, the use of a small number of training examples belonging to the rare class cannot provide robust estimates. To avoid this kind of overfitting, Laplacian smoothing is used. A small value of α is added to the numerator, and a value of α · mj is added to the denominator, where mj is the number of distinct values of the jth attribute:





\[
P(x_j = a_j \mid C = c) = \frac{q(a_j, c) + \alpha}{r(c) + \alpha \cdot m_j}. \tag{10.23}
\]




Here, α is the Laplacian smoothing parameter. For the case where r(c) = 0, this has the effect of estimating the probability as the unbiased value of 1/mj for all mj distinct attribute values. This is a reasonable estimate in the absence of any training data about class c. Thus, the training phase only requires the estimation of these conditional probabilities P(xj = aj |C = c) of each class–attribute–value combination, and the estimation of the prior probabilities P(C = c) of each class.
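To make the training and prediction steps concrete, the following Python sketch estimates the class priors and the Laplacian-smoothed conditional probabilities of Eq. 10.23 from categorical training data, and then predicts the class of a test instance using the proportional form of Eq. 10.21. This is a minimal illustration; the function and variable names (train_naive_bayes, predict, alpha) are chosen here for exposition and are not part of any standard library.

```python
from collections import Counter

def train_naive_bayes(X, y, alpha=1.0):
    """Estimate P(C = c) and the Laplacian-smoothed P(x_j = a_j | C = c)
    (Eqs. 10.22 and 10.23) from categorical training data."""
    n, d = len(y), len(X[0])
    r = Counter(y)                                   # r(c): training examples in class c
    priors = {c: r[c] / n for c in r}                # P(C = c)
    # q[(j, a, c)]: examples with attribute j taking value a in class c
    q = Counter((j, row[j], c) for row, c in zip(X, y) for j in range(d))
    m = [len({row[j] for row in X}) for j in range(d)]   # m_j: distinct values of attribute j

    def cond_prob(j, a, c):
        # Eq. 10.23 with smoothing parameter alpha
        return (q[(j, a, c)] + alpha) / (r[c] + alpha * m[j])

    return priors, cond_prob

def predict(priors, cond_prob, test_row):
    """Return the class maximizing the proportional Bayes score of Eq. 10.21."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for j, a in enumerate(test_row):
            score *= cond_prob(j, a, c)
        scores[c] = score
    return max(scores, key=scores.get)
```

In practice, the product in predict is usually accumulated as a sum of logarithms to avoid numerical underflow when the number of attributes d is large.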


This model is referred to as the binary or Bernoulli model for Bayes classification when it is applied to categorical data with only two outcomes of each feature attribute. For example, in text data, the two outcomes could correspond to the presence or absence of a word. In cases where more than two outcomes are possible for a feature variable, the model is referred to as the generalized Bernoulli model. The implicit generative assumption of this model is similar to that of mixture modeling algorithms in clustering (cf. Sect. 6.5 of Chap. 6). The features within each class (mixture component) are independently generated from a distribution whose probabilities are the productwise approximations of Bernoulli distributions. The estimation of model parameters in the training phase is analogous to the M-step in expectation–maximization (EM) clustering algorithms. Note that, unlike EM clustering algorithms, only the labeled training data are used to compute the maximum likelihood estimates of the parameters in the training phase. Furthermore, the E-step (or the iterative approach) is not required because the (deterministic) assignment “probabilities” of labeled data are already known. In Sect. 13.5.2.1 of Chap. 13, a more sophisticated model, referred to as the multinomial model, will be discussed. This model can address sparse frequencies associated with attributes, as in text data. In general, the Bayes model can assume any parametric form of the conditional feature distribution P(x1 = a1, . . . , xd = ad|C = c) of each class (mixture component), such as a Bernoulli model, a multinomial model, or even a Gaussian model for numeric data. The parameters of the distribution of each class are estimated in a data-driven manner. The approach discussed in this section, therefore, represents only a single instantiation from a wider array of possibilities.


The aforementioned description is based on categorical data. It can also be generalized to numeric data sets by using the process of discretization. Each discretized range becomes one of the possible categorical values of an attribute. Such an approach can, however, be sensitive to the granularity of the discretization. A second approach is to assume a specific form of the probability distribution of each mixture component (class), such as a Gaussian distribution. The mean and variance parameters of the Gaussian distribution of each class are estimated in a data-driven manner, just as the class-conditioned feature probabilities are estimated in the Bernoulli model. Specifically, the mean and variance of each Gaussian can be estimated directly as the mean and variance of the training data for the corresponding class. This is similar to the M-step in EM clustering algorithms with Gaussian mixtures. The conditional class probabilities in Eq. 10.21 for a test instance are replaced with the class-specific Gaussian densities of the test instance.
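A minimal sketch of this Gaussian variant is shown below. It assumes numeric attributes and uses a small variance floor to avoid division by zero; the function names are illustrative only, and the log-space computation replaces the product of conditional probabilities in Eq. 10.21 with a sum of log densities.

```python
import math
from collections import defaultdict

def train_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate P(C = c) and the per-class mean/variance of each numeric attribute."""
    groups = defaultdict(list)
    for row, c in zip(X, y):
        groups[c].append(row)
    priors, params = {}, {}
    for c, rows in groups.items():
        n, d = len(rows), len(rows[0])
        priors[c] = n / len(y)
        means = [sum(r[j] for r in rows) / n for j in range(d)]
        variances = [max(sum((r[j] - means[j]) ** 2 for r in rows) / n, var_floor)
                     for j in range(d)]
        params[c] = (means, variances)
    return priors, params

def predict_gaussian_nb(priors, params, test_row):
    """Pick the class maximizing the Gaussian analog of Eq. 10.21 (in log-space)."""
    def log_score(c):
        means, variances = params[c]
        s = math.log(priors[c])
        for x, mu, var in zip(test_row, means, variances):
            # Log of the Gaussian density, replacing P(x_j = a_j | C = c)
            s += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        return s
    return max(params, key=log_score)
```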


10.5.1.1 The Ranking Model for Classification

The aforementioned algorithms predict the labels of individual test instances. In some scenarios, a set of test instances is provided to the learner, and it is desired to rank these test instances by their propensity to belong to a particularly important class c. This is a common scenario in rare-class learning, which will be discussed in Sect. 11.3 of Chap. 11.


As discussed in Eq. 10.21, the probability of a test instance (a1 . . . ad) belonging to a particular class can be estimated within a constant of proportionality as follows:





\[
P(C = c \mid x_1 = a_1, \ldots, x_d = a_d) \propto P(C = c) \prod_{j=1}^{d} P(x_j = a_j \mid C = c). \tag{10.24}
\]

The constant of proportionality is irrelevant when comparing the scores of different classes for the same test instance, but it does matter when comparing scores across different test instances. This is because the constant of proportionality is the inverse of the generative probability of the specific test instance. An easy way to estimate the proportionality constant is to use normalization so that the sum of the probabilities across different classes is 1. Therefore, if the class label c is assumed to be an integer drawn from the range {1 . . . k} for a k-class problem, then the Bayes probability can be estimated as follows:

\[
P(C = c \mid x_1 = a_1, \ldots, x_d = a_d) = \frac{P(C = c) \prod_{j=1}^{d} P(x_j = a_j \mid C = c)}{\sum_{c'=1}^{k} P(C = c') \prod_{j=1}^{d} P(x_j = a_j \mid C = c')}. \tag{10.25}
\]
These normalized values can then be used to rank different test instances. It should be pointed out that most classification algorithms return a numerical score for each class, and therefore an analogous normalization can be performed for virtually any classification algorithm. However, in the Bayes method, it is more natural to intuitively interpret the normalized values as probabilities.
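A small sketch of this ranking procedure is shown below. It assumes that the unnormalized scores of Eq. 10.24 (the class prior multiplied by the product of conditional probabilities) have already been computed for each test instance, for example with the predict logic sketched earlier; the function name and the dictionary layout of the scores are illustrative.

```python
def rank_by_class(unnormalized_scores, target_class):
    """Rank test instances by their normalized probability (Eq. 10.25) of
    belonging to target_class.

    unnormalized_scores: one dict per test instance, mapping each class c to
    P(C = c) * prod_j P(x_j = a_j | C = c).
    """
    normalized = []
    for scores in unnormalized_scores:
        total = sum(scores.values())
        # Dividing by the sum over classes removes the instance-specific
        # proportionality constant, making scores comparable across instances.
        normalized.append(scores[target_class] / total if total > 0 else 0.0)
    # Indices of test instances, most likely members of target_class first.
    return sorted(range(len(normalized)), key=lambda i: normalized[i], reverse=True)
```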


10.5.1.2 Discussion of the Naive Assumption


The Bayes model is referred to as “naive” because of the assumption of conditional independence. This assumption is obviously not true in practice because the features in real data sets are almost always correlated even when they are conditioned on a specific class. Nevertheless, in spite of this approximation, the naive Bayes classifier seems to perform quite well in practice in many domains. Although it is possible to implement the Bayes model using more general multivariate estimation methods, such methods can be computationally more expensive. Furthermore, the estimation of multivariate probabilities becomes inaccurate with increasing dimensionality, especially with limited training data. Therefore, significant practical accuracy is often not gained with the use of theoretically more accurate assumptions. The bibliographic notes contain pointers to theoretical results on the effectiveness of the naive assumption.


10.5.2 Logistic Regression

While the Bayes classifier assumes a specific form of the feature probability distribution for each class, logistic regression directly models the class-membership probabilities in terms of the feature variables with a discriminative function. Thus, the nature of the modeling assumption is different in the two cases. Both are, however, probabilistic classifiers because they use a specific modeling assumption to map the feature variables to a class-membership probability. In both cases, the parameters of the underlying probabilistic model need to be estimated in a data-driven manner.


In the simplest form of logistic regression, it is assumed that the class variable is binary and drawn from {−1, +1}, although it is also possible to model nonbinary class variables. Let Θ = (θ0, θ1 . . . θd) be a vector of d + 1 different parameters. The ith parameter θi is a coefficient related to the ith dimension in the underlying data, and θ0 is an offset parameter. Then, for a record X = (x1 . . . xd), the probability that the class variable C takes on the value of +1 or −1 is modeled with the use of a logistic function.

\[
P(C = +1 \mid X) = \frac{1}{1 + e^{-(\theta_0 + \sum_{i=1}^{d} \theta_i x_i)}} \tag{10.26}
\]

\[
P(C = -1 \mid X) = \frac{1}{1 + e^{(\theta_0 + \sum_{i=1}^{d} \theta_i x_i)}} \tag{10.27}
\]

It is easy to verify that the sum of the two aforementioned probability values is 1. Logistic regression can be viewed as either a probabilistic classifier or a linear classifier. In linear classifiers, such as Fisher’s discriminant, a linear hyperplane is used to separate the two classes. Other linear classifiers, such as SVMs and neural networks, will be discussed in Sects. 10.6 and 10.7 of this chapter. In logistic regression, the parameters Θ = (θ0, θ1 . . . θd) can be viewed as the coefficients of a separating hyperplane θ0 + θ1x1 + . . . + θdxd = 0 between the two classes. The term θi is the linear coefficient of dimension i, and the term θ0 is the constant term. The value of θ0 + θ1x1 + . . . + θdxd will be either positive or negative, depending on the side of the separating hyperplane on which X is located. A positive value is predictive of the class +1, whereas a negative value is predictive of the class −1. In many other linear classifiers, the sign of this expression yields the class label of X from {−1, +1}. Logistic regression, however, converts this value into a class-membership probability with the use of the logistic function of Eqs. 10.26 and 10.27.
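The following sketch evaluates Eqs. 10.26 and 10.27 for a test record and labels it according to the side of the separating hyperplane on which it lies. It is a minimal illustration: the parameter vector theta is assumed to have already been estimated, and the function names are chosen here for exposition.

```python
import math

def logistic_probabilities(theta, x):
    """Evaluate Eqs. 10.26 and 10.27 for a record x = (x_1, ..., x_d).

    theta = (theta_0, theta_1, ..., theta_d): the offset followed by the d coefficients.
    """
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))  # theta_0 + sum_i theta_i * x_i
    p_pos = 1.0 / (1.0 + math.exp(-z))   # P(C = +1 | x), Eq. 10.26
    p_neg = 1.0 / (1.0 + math.exp(z))    # P(C = -1 | x), Eq. 10.27
    return p_pos, p_neg

def logistic_predict(theta, x):
    """Label x by the side of the hyperplane theta_0 + sum_i theta_i * x_i = 0."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return +1 if z > 0 else -1
```

Note that the two returned probabilities always sum to 1, consistent with the discussion above.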
