Data Mining: The Textbook




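\begin{align}
P(C = c \mid x_1 = a_1, \ldots, x_d = a_d)
  &= \frac{P(C = c)\, P(x_1 = a_1, \ldots, x_d = a_d \mid C = c)}{P(x_1 = a_1, \ldots, x_d = a_d)} \tag{13.17} \\
  &\propto P(C = c)\, P(x_1 = a_1, \ldots, x_d = a_d \mid C = c) \tag{13.18} \\
  &\approx P(C = c) \prod_{i=1}^{d} P(x_i = a_i \mid C = c). \tag{13.19}
\end{align}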



The last of the aforementioned relationships is based on the naive assumption of conditional independence. In the binary model discussed in Chap. 10, each attribute value $a_i$ takes on the value of 1 or 0 depending on the presence or the absence of a word. Thus, if the fraction of the documents in class $c$ containing word $i$ is denoted by $p(i, c)$, then the value of $P(x_i = a_i \mid C = c)$ is estimated⁵ as either $p(i, c)$ or $1 - p(i, c)$ depending upon whether $a_i$ is 1 or 0, respectively. Note that this approach explicitly penalizes nonoccurrence of words in documents. Larger lexicon sizes will result in many words that are absent in a document. Therefore, the Bernoulli model may be dominated by word absence rather than word presence. Word absence is usually weakly related to class labels. This leads to greater noise in the evaluation. Furthermore, differential frequencies of words are ignored by this approach. Longer documents are more likely to have repeated words. The multinomial model is designed to address these issues.

⁵ The exact value will be slightly different because of Laplacian smoothing. Readers are advised to refer to Sect. 10.5.1 of Chap. 10.
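The Bernoulli estimate described above can be illustrated with a minimal Python sketch. It assumes the fractions p(i, c) are already available and lie strictly between 0 and 1 (for example, after smoothing). The function names, the two class labels, and the four-word lexicon are hypothetical and chosen only for illustration.

```python
import math

def bernoulli_log_likelihood(doc, p_c):
    """Log of the product over i of P(x_i = a_i | C = c) for a binary document.

    doc : list of 0/1 values a_i (word absent / present)
    p_c : list of fractions p(i, c), assumed strictly between 0 and 1
    """
    total = 0.0
    for a_i, p in zip(doc, p_c):
        # Present words contribute p(i, c); absent words contribute 1 - p(i, c).
        total += math.log(p if a_i == 1 else 1.0 - p)
    return total

def bernoulli_posterior(doc, priors, p):
    """Normalized posteriors P(C = c | doc) under the naive assumption."""
    log_scores = {
        c: math.log(priors[c]) + bernoulli_log_likelihood(doc, p[c])
        for c in priors
    }
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_scores.values())
    unnorm = {c: math.exp(s - m) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

# Hypothetical toy setting: two classes, lexicon of four words.
priors = {"sports": 0.5, "politics": 0.5}
p = {
    "sports": [0.8, 0.6, 0.1, 0.2],
    "politics": [0.1, 0.2, 0.7, 0.6],
}
doc = [1, 1, 0, 0]  # words 1 and 2 present, words 3 and 4 absent
posterior = bernoulli_posterior(doc, priors, p)
```

Every absent word contributes a factor of 1 − p(i, c), so with a large lexicon most of the factors in the product come from absent words, which is the effect the paragraph above describes.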


In the multinomial model, the $L$ terms in a document are treated as samples from a multinomial distribution. The total number of terms in the document (or document length) is denoted by $L = \sum_{j=1}^{d} a_j$. In this case, the value of $a_i$ is assumed to be the raw frequency of the $i$th term in the document. The posterior class probabilities of a test document with the frequency vector $(a_1 \ldots a_d)$ are defined and estimated using the following generative approach:



  1. Sample a class c with a class-specific prior probability.

  2. Sample L terms with replacement from the term distribution of the chosen class c. The term distribution is defined using a multinomial model. The sampling process generates the frequency vector (a1 . . . ad). All training and test documents are assumed to be observed samples of this generative process. Therefore, all model parameters of the generative process are estimated from the training data.

  3. Test instance classification: What is the posterior probability that the class c is selected in the first generative step, conditional on the observed word frequency (a1 . . . ad) in the test document?
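The three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the per-class term probabilities (called `term_dist` here, a hypothetical name) are taken as given and strictly positive rather than estimated from training data, and the posterior uses the standard multinomial naive Bayes score, which may differ in detail from the estimate the text goes on to derive.

```python
import math
import random

def sample_document(priors, term_dist, length, rng):
    """Steps 1 and 2: sample a class, then sample `length` terms with replacement."""
    classes = list(priors)
    c = rng.choices(classes, weights=[priors[k] for k in classes])[0]
    d = len(term_dist[c])
    terms = rng.choices(range(d), weights=term_dist[c], k=length)
    freq = [0] * d
    for t in terms:
        freq[t] += 1  # build the frequency vector (a_1 ... a_d)
    return c, freq

def multinomial_posterior(freq, priors, term_dist):
    """Step 3: posterior of each class given the observed frequency vector.

    Assumption: the score is P(C = c) times the product of q(i, c) ** a_i,
    where q(i, c) = term_dist[c][i] > 0. Any factor that depends only on
    the frequency vector is identical for all classes and cancels on
    normalization, so it is omitted.
    """
    log_scores = {}
    for c in priors:
        s = math.log(priors[c])
        for a_i, q in zip(freq, term_dist[c]):
            s += a_i * math.log(q)
        log_scores[c] = s
    m = max(log_scores.values())
    unnorm = {c: math.exp(s - m) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

# Hypothetical toy setting: two classes, lexicon of three terms.
priors = {"sports": 0.5, "politics": 0.5}
term_dist = {
    "sports": [0.6, 0.3, 0.1],
    "politics": [0.1, 0.3, 0.6],
}
rng = random.Random(0)
sampled_class, sampled_freq = sample_document(priors, term_dist, 20, rng)
posterior = multinomial_posterior([3, 1, 0], priors, term_dist)
```

Unlike the binary model, a term with frequency zero contributes nothing to the score, and repeated terms contribute in proportion to their frequency.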

When the sequential ordering of the L different samples is considered, the number of possible ways to sample the different terms to result in the representation (a1 . . . ad) is given by

