10.5. PROBABILISTIC CLASSIFIERS
Each of the expressions on the right-hand side is already known. The value of P(E) is 6/11, and the value of P(E|D) is 5/6. Furthermore, the prior probability P(D) before knowing the age is 6/11. Consequently, the posterior probability may be estimated as follows:
P(D|E) = \frac{(5/6)(6/11)}{6/11} = 5/6.    (10.17)
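The same computation can be reproduced mechanically; the following minimal sketch (in Python, with variable names chosen here purely for illustration) simply substitutes the probabilities stated above into Bayes theorem:

from fractions import Fraction

# Probabilities stated in the text for the Age > 50 example
p_e = Fraction(6, 11)          # P(E): fraction of individuals with Age > 50
p_e_given_d = Fraction(5, 6)   # P(E|D): fraction of donors with Age > 50
p_d = Fraction(6, 11)          # P(D): prior fraction of donors

# Bayes theorem: P(D|E) = P(E|D) * P(D) / P(E)
p_d_given_e = p_e_given_d * p_d / p_e
print(p_d_given_e)             # -> 5/6, in agreement with Equation 10.17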
Therefore, if we had 1-dimensional training data containing only the Age, along with the class variable, the probabilities could be estimated using this approach. Table 10.1 contains an example with training instances satisfying the aforementioned conditions. It is also easy to verify from Table 10.1 that the fraction of individuals above age 50 who are donors is 5/6, which is in agreement with Bayes theorem. In this particular case, the Bayes theorem is not really essential because the classes can be predicted directly from a single attribute of the training data. A question arises as to why the indirect route of using the Bayes theorem is useful, if the posterior probability P(D|E) could be estimated directly from the training data (Table 10.1) in the first place. The reason is that the conditional event E usually corresponds to a combination of constraints on d different feature variables, rather than a single one. This makes the direct estimation of P(D|E) much more difficult. For example, the probability P(Donor|Age > 50, Salary > 50,000) is harder to estimate robustly from the training data because there are fewer instances in Table 10.1 that satisfy both the conditions on age and salary. This problem becomes worse with increasing dimensionality. In general, for a d-dimensional test instance with d conditions, it may be the case that not even a single tuple in the training data satisfies all these conditions. Bayes rule provides a way of expressing P(Donor|Age > 50, Salary > 50,000) in terms of P(Age > 50, Salary > 50,000|Donor). The latter is much easier to estimate with the use of a product-wise approximation known as the naive Bayes approximation, whereas the former is not.
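As a sketch of this product-wise approximation (under the naive Bayes assumption that the individual conditions are conditionally independent of one another, given the class), the hard-to-estimate class-conditional probability factors as follows:

P(\text{Age} > 50, \text{Salary} > 50{,}000 \mid \text{Donor}) \approx P(\text{Age} > 50 \mid \text{Donor}) \cdot P(\text{Salary} > 50{,}000 \mid \text{Donor})

Each factor on the right-hand side involves only a single attribute, and can therefore be estimated from many more training instances than the joint condition on the left.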
For ease in discussion, it will be assumed that all feature variables are categorical. The numeric case is discussed later. Let C be the random variable representing the class variable of an unseen test instance with d-dimensional feature values X = (a1 . . . ad). The goal is to estimate P(C = c|X = (a1 . . . ad)). Let the random variables for the individual dimensions of X be denoted by X = (x1 . . . xd). Then, it is desired to estimate the conditional probability P(C = c|x1 = a1, . . . , xd = ad). This is difficult to estimate directly from the training data because the training data may not contain even a single record with attribute values (a1 . . . ad). Then, by using Bayes theorem, the following equivalence can be inferred: