Data Mining: The Textbook




\[
P(C = c \mid x_1 = a_1, \ldots, x_d = a_d) = \frac{P(C = c)\, P(x_1 = a_1, \ldots, x_d = a_d \mid C = c)}{P(x_1 = a_1, \ldots, x_d = a_d)} \tag{10.18}
\]

\[
\propto P(C = c)\, P(x_1 = a_1, \ldots, x_d = a_d \mid C = c). \tag{10.19}
\]




The second relationship above is based on the fact that the term P(x1 = a1, . . . , xd = ad) in the denominator of the first relationship is independent of the class. Therefore, it suffices to only compute the numerator to determine the class with the maximum conditional probability. The value of P(C = c) is the prior probability of the class identifier c and can be estimated as the fraction of the training data points belonging to class c. The key usefulness of the Bayes rule is that the terms on the right-hand side can now be effectively approximated from the training data with the use of a naive Bayes approximation. The naive Bayes approximation assumes that the values on the different attributes x1 . . . xd are independent of one another conditional on the class. When two random events A and B are independent of one another conditional on a third event F, it follows that P(A ∩ B|F) = P(A|F)P(B|F). In the case of the naive Bayes approximation, it is assumed that the feature values are independent of one another conditional on a fixed value of the class variable. This implies the following for the conditional term on the right-hand side of Eq. 10.19.





\[
P(x_1 = a_1, \ldots, x_d = a_d \mid C = c) = \prod_{j=1}^{d} P(x_j = a_j \mid C = c) \tag{10.20}
\]

Therefore, by substituting Eq. 10.20 in Eq. 10.19, the Bayes probability can be estimated within a constant of proportionality as follows:





\[
P(C = c \mid x_1 = a_1, \ldots, x_d = a_d) \propto P(C = c) \prod_{j=1}^{d} P(x_j = a_j \mid C = c). \tag{10.21}
\]

Note that each term P(xj = aj |C = c) is much easier to estimate from the training data than P(x1 = a1, . . . , xd = ad|C = c) because enough training examples will exist in the former case to provide a robust estimate. Specifically, the maximum likelihood estimate for the value of P(xj = aj |C = c) is the fraction of training examples taking on value aj, conditional on the fact that they belong to class c. In other words, if q(aj, c) is the number of training examples corresponding to feature variable xj = aj and class c, and r(c) is the number of training examples belonging to class c, then the estimation is performed as follows:



\[
P(x_j = a_j \mid C = c) = \frac{q(a_j, c)}{r(c)}. \tag{10.22}
\]




In some cases, enough training examples may still not be available to estimate these values robustly. For example, consider a rare class c with a single training example satisfying r(c) = 1, and q(aj, c) = 0. In such a case, the conditional probability is estimated to be 0. Because of the productwise form of the Bayes expression, the entire probability will be estimated to be 0. Clearly, the use of a small number of training examples belonging to the rare class cannot provide robust estimates. To avoid this kind of overfitting, Laplacian smoothing is used. A small value of α is added to the numerator, and a value of α · mj is added to the denominator, where mj is the number of distinct values of the jth attribute:





\[
P(x_j = a_j \mid C = c) = \frac{q(a_j, c) + \alpha}{r(c) + \alpha \cdot m_j}. \tag{10.23}
\]




Here, α is the Laplacian smoothing parameter. For the case where r(c) = 0, this has the effect of estimating the probability as the unbiased value of 1/mj for all mj distinct attribute values. This is a reasonable estimate in the absence of any training data about class c. Thus, the training phase only requires the estimation of these conditional probabilities P(xj = aj |C = c) of each class–attribute–value combination, and the estimation of the prior probabilities P(C = c) of each class.
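To make the training and prediction steps concrete, the following Python sketch estimates the class priors and the Laplacian-smoothed conditional probabilities of Eq. 10.23 from categorical training data, and then predicts the class of a test instance using the proportional form of Eq. 10.21. This is a minimal illustration; the function and variable names (train_naive_bayes, predict, alpha) are chosen here for exposition and are not part of any standard library.

```python
from collections import Counter

def train_naive_bayes(X, y, alpha=1.0):
    """Estimate P(C = c) and the Laplacian-smoothed P(x_j = a_j | C = c)
    (Eqs. 10.22 and 10.23) from categorical training data."""
    n, d = len(y), len(X[0])
    r = Counter(y)                                   # r(c): training examples in class c
    priors = {c: r[c] / n for c in r}                # P(C = c)
    # q[(j, a, c)]: examples with attribute j taking value a in class c
    q = Counter((j, row[j], c) for row, c in zip(X, y) for j in range(d))
    m = [len({row[j] for row in X}) for j in range(d)]   # m_j: distinct values of attribute j

    def cond_prob(j, a, c):
        # Eq. 10.23 with smoothing parameter alpha
        return (q[(j, a, c)] + alpha) / (r[c] + alpha * m[j])

    return priors, cond_prob

def predict(priors, cond_prob, test_row):
    """Return the class maximizing the proportional Bayes score of Eq. 10.21."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for j, a in enumerate(test_row):
            score *= cond_prob(j, a, c)
        scores[c] = score
    return max(scores, key=scores.get)
```

In practice, the product in predict is usually accumulated as a sum of logarithms to avoid numerical underflow when the number of attributes d is large.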


This model is referred to as the binary or Bernoulli model for Bayes classification when it is applied to categorical data with only two outcomes of each feature attribute. For example, in text data, the two outcomes could correspond to the presence or absence of a word. In cases where more than two outcomes are possible for a feature variable, the model is referred to as the generalized Bernoulli model. The implicit generative assumption of this model is similar to that of mixture modeling algorithms in clustering (cf. Sect. 6.5 of Chap. 6). The features within each class (mixture component) are independently generated from a distribution whose probabilities are the productwise approximations of Bernoulli distributions. The estimation of model parameters in the training phase is analogous to the M-step in expectation–maximization (EM) clustering algorithms. Note that, unlike EM clustering algorithms, only the labeled training data are used to compute the maximum likelihood estimates of the parameters in the training phase. Furthermore, the E-step (or the iterative approach) is not required because the (deterministic) assignment “probabilities” of labeled data are already known. In Sect. 13.5.2.1 of Chap. 13, a more sophisticated model, referred to as the multinomial model, will be discussed. This model can address sparse frequencies associated with attributes, as in text data. In general, the Bayes model can assume any parametric form of the conditional feature distribution P(x1 = a1, . . . , xd = ad|C = c) of each class (mixture component), such as a Bernoulli model, a multinomial model, or even a Gaussian model for numeric data. The parameters of the distribution of each class are estimated in a data-driven manner. The approach discussed in this section, therefore, represents only a single instantiation from a wider array of possibilities.


The aforementioned description is based on categorical data. It can also be generalized to numeric data sets by using the process of discretization. Each discretized range becomes one of the possible categorical values of an attribute. Such an approach can, however, be sensitive to the granularity of the discretization. A second approach is to assume a specific form of the probability distribution of each mixture component (class), such as a Gaussian distribution. The mean and variance parameters of the Gaussian distribution of each class are estimated in a data-driven manner, just as the class-conditioned feature probabilities are estimated in the Bernoulli model. Specifically, the mean and variance of each Gaussian can be estimated directly as the mean and variance of the training data for the corresponding class. This is similar to the M-step in EM clustering algorithms with Gaussian mixtures. The conditional class probabilities in Eq. 10.21 for a test instance are replaced with the class-specific Gaussian densities of the test instance.
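A minimal sketch of this Gaussian variant is shown below. It assumes numeric attributes and uses a small variance floor to avoid division by zero; the function names are illustrative only, and the log-space computation replaces the product of conditional probabilities in Eq. 10.21 with a sum of log densities.

```python
import math
from collections import defaultdict

def train_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate P(C = c) and the per-class mean/variance of each numeric attribute."""
    groups = defaultdict(list)
    for row, c in zip(X, y):
        groups[c].append(row)
    priors, params = {}, {}
    for c, rows in groups.items():
        n, d = len(rows), len(rows[0])
        priors[c] = n / len(y)
        means = [sum(r[j] for r in rows) / n for j in range(d)]
        variances = [max(sum((r[j] - means[j]) ** 2 for r in rows) / n, var_floor)
                     for j in range(d)]
        params[c] = (means, variances)
    return priors, params

def predict_gaussian_nb(priors, params, test_row):
    """Pick the class maximizing the Gaussian analog of Eq. 10.21 (in log-space)."""
    def log_score(c):
        means, variances = params[c]
        s = math.log(priors[c])
        for x, mu, var in zip(test_row, means, variances):
            # Log of the Gaussian density, replacing P(x_j = a_j | C = c)
            s += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        return s
    return max(params, key=log_score)
```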


10.5.1.1 The Ranking Model for Classification

The aforementioned algorithms predict the labels of individual test instances. In some scenarios, a set of test instances is provided to the learner, and it is desired to rank these test instances by their propensity to belong to a particularly important class c. This is a common scenario in rare-class learning, which will be discussed in Sect. 11.3 of Chap. 11.


As discussed in Eq. 10.21, the probability of a test instance (a1 . . . ad) belonging to a particular class can be estimated within a constant of proportionality as follows:





\[
P(C = c \mid x_1 = a_1, \ldots, x_d = a_d) \propto P(C = c) \prod_{j=1}^{d} P(x_j = a_j \mid C = c). \tag{10.24}
\]

The constant of proportionality is irrelevant when comparing the scores of different classes for the same test instance, but it does matter when comparing scores across different test instances. This is because the constant of proportionality is the inverse of the generative probability of the specific test instance. An easy way to estimate the proportionality constant is to use normalization so that the sum of the probabilities across different classes is 1. Therefore, if the class label c is assumed to be an integer drawn from the range {1 . . . k} for a k-class problem, then the Bayes probability can be estimated as follows:

\[
P(C = c \mid x_1 = a_1, \ldots, x_d = a_d) = \frac{P(C = c) \prod_{j=1}^{d} P(x_j = a_j \mid C = c)}{\sum_{c'=1}^{k} P(C = c') \prod_{j=1}^{d} P(x_j = a_j \mid C = c')}. \tag{10.25}
\]
These normalized values can then be used to rank different test instances. It should be pointed out that most classification algorithms return a numerical score for each class, and therefore an analogous normalization can be performed for virtually any classification algorithm. However, in the Bayes method, it is more natural to intuitively interpret the normalized values as probabilities.
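A small sketch of this ranking procedure is shown below. It assumes that the unnormalized scores of Eq. 10.24 (the class prior multiplied by the product of conditional probabilities) have already been computed for each test instance, for example with the predict logic sketched earlier; the function name and the dictionary layout of the scores are illustrative.

```python
def rank_by_class(unnormalized_scores, target_class):
    """Rank test instances by their normalized probability (Eq. 10.25) of
    belonging to target_class.

    unnormalized_scores: one dict per test instance, mapping each class c to
    P(C = c) * prod_j P(x_j = a_j | C = c).
    """
    normalized = []
    for scores in unnormalized_scores:
        total = sum(scores.values())
        # Dividing by the sum over classes removes the instance-specific
        # proportionality constant, making scores comparable across instances.
        normalized.append(scores[target_class] / total if total > 0 else 0.0)
    # Indices of test instances, most likely members of target_class first.
    return sorted(range(len(normalized)), key=lambda i: normalized[i], reverse=True)
```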


10.5.1.2 Discussion of the Naive Assumption


The Bayes model is referred to as “naive” because of the assumption of conditional independence. This assumption is obviously not true in practice because the features in real data sets are almost always correlated even when they are conditioned on a specific class. Nevertheless, in spite of this approximation, the naive Bayes classifier seems to perform quite well in practice in many domains. Although it is possible to implement the Bayes model using more general multivariate estimation methods, such methods can be computationally more expensive. Furthermore, the estimation of multivariate probabilities becomes inaccurate with increasing dimensionality, especially with limited training data. Therefore, significant practical accuracy is often not gained with the use of theoretically more accurate assumptions. The bibliographic notes contain pointers to theoretical results on the effectiveness of the naive assumption.


10.5.2 Logistic Regression

While the Bayes classifier assumes a specific form of the feature probability distribution for each class, logistic regression directly models the class-membership probabilities in terms of the feature variables with a discriminative function. Thus, the nature of the modeling assumption is different in the two cases. Both are, however, probabilistic classifiers because they use a specific modeling assumption to map the feature variables to a class-membership probability. In both cases, the parameters of the underlying probabilistic model need to be estimated in a data-driven manner.


In the simplest form of logistic regression, it is assumed that the class variable is binary and drawn from {−1, +1}, although it is also possible to model nonbinary class variables. Let Θ = (θ0, θ1 . . . θd) be a vector of d + 1 different parameters. The ith parameter θi is a coefficient related to the ith dimension in the underlying data, and θ0 is an offset parameter. Then, for a record X = (x1 . . . xd), the probability that the class variable C takes on the value of +1 or −1 is modeled with the use of a logistic function.

\[
P(C = +1 \mid X) = \frac{1}{1 + e^{-(\theta_0 + \sum_{i=1}^{d} \theta_i x_i)}} \tag{10.26}
\]

\[
P(C = -1 \mid X) = \frac{1}{1 + e^{(\theta_0 + \sum_{i=1}^{d} \theta_i x_i)}} \tag{10.27}
\]

It is easy to verify that the sum of the two aforementioned probability values is 1. Logistic regression can be viewed as either a probabilistic classifier or a linear classifier. In linear classifiers, such as Fisher’s discriminant, a linear hyperplane is used to separate the two classes. Other linear classifiers, such as SVMs and neural networks, will be discussed in Sects. 10.6 and 10.7 of this chapter. In logistic regression, the parameters Θ = (θ0, θ1 . . . θd) can be viewed as the coefficients of a separating hyperplane θ0 + θ1x1 + . . . + θdxd = 0 between the two classes. The term θi is the linear coefficient of dimension i, and the term θ0 is the constant term. The value of θ0 + θ1x1 + . . . + θdxd will be either positive or negative, depending on the side of the separating hyperplane on which X is located. A positive value is predictive of the class +1, whereas a negative value is predictive of the class −1. In many other linear classifiers, the sign of this expression yields the class label of X from {−1, +1}. Logistic regression, however, converts this value into a class-membership probability with the use of the logistic function of Eqs. 10.26 and 10.27.
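The following sketch evaluates Eqs. 10.26 and 10.27 for a test record and labels it according to the side of the separating hyperplane on which it lies. It is a minimal illustration: the parameter vector theta is assumed to have already been estimated, and the function names are chosen here for exposition.

```python
import math

def logistic_probabilities(theta, x):
    """Evaluate Eqs. 10.26 and 10.27 for a record x = (x_1, ..., x_d).

    theta = (theta_0, theta_1, ..., theta_d): the offset followed by the d coefficients.
    """
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))  # theta_0 + sum_i theta_i * x_i
    p_pos = 1.0 / (1.0 + math.exp(-z))   # P(C = +1 | x), Eq. 10.26
    p_neg = 1.0 / (1.0 + math.exp(z))    # P(C = -1 | x), Eq. 10.27
    return p_pos, p_neg

def logistic_predict(theta, x):
    """Label x by the side of the hyperplane theta_0 + sum_i theta_i * x_i = 0."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return +1 if z > 0 else -1
```

Note that the two returned probabilities always sum to 1, consistent with the discussion above.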
