Data Mining: The Textbook


Page	276/423
Date	07.01.2024
Size	17.13 MB

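\begin{align}
P(C = c \mid x_1 = a_1, \ldots, x_d = a_d) &= \frac{P(C = c)\, P(x_1 = a_1, \ldots, x_d = a_d \mid C = c)}{P(x_1 = a_1, \ldots, x_d = a_d)} \tag{13.17} \\
&\propto P(C = c)\, P(x_1 = a_1, \ldots, x_d = a_d \mid C = c) \tag{13.18} \\
&\approx P(C = c) \prod_{i=1}^{d} P(x_i = a_i \mid C = c). \tag{13.19}
\end{align}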

The last of the aforementioned relationships is based on the naive assumption of conditional independence. In the binary model discussed in Chap. 10, each attribute value a_i takes on the value of 1 or 0 depending on the presence or the absence of a word. Thus, if the fraction of the documents in class c containing word i is denoted by p(i, c), then the value of P(x_i = a_i | C = c) is estimated⁵ as either p(i, c) or 1 − p(i, c) depending upon whether a_i is 1 or 0, respectively. Note that this approach explicitly penalizes nonoccurrence of words in documents. Larger lexicon sizes will result in many words that are absent in a document. Therefore, the Bernoulli model may be dominated by word absence rather than word presence. Word absence is usually weakly related to class labels. This leads to greater noise in the evaluation. Furthermore, differential frequencies of words are ignored by this approach. Longer documents are more likely to have repeated words. The multinomial model is designed to address these issues.

⁵ The exact value will be slightly different because of Laplacian smoothing. Readers are advised to refer to Sect. 10.5.1 of Chap. 10.
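As a rough illustration of the Bernoulli model described above, the following Python sketch scores a binary document vector against each class using the product form with p(i, c) or 1 − p(i, c) per word. The function names, the dictionary layout, and the toy class names and probabilities are assumptions made for this example; the probabilities are also assumed to lie strictly between 0 and 1 (for example, after smoothing) so that the logarithms are defined.

```python
import math

def bernoulli_log_scores(doc, prior, p):
    """Unnormalized log-posterior of each class under the Bernoulli model.

    doc:   list of 0/1 values a_i (word absent / present)
    prior: dict mapping class c -> P(C = c)
    p:     dict mapping class c -> list of p(i, c), each strictly in (0, 1)
    (All names and the data layout are hypothetical, chosen for illustration.)
    """
    scores = {}
    for c, prior_c in prior.items():
        score = math.log(prior_c)
        for a_i, p_ic in zip(doc, p[c]):
            # A present word contributes p(i, c); an absent word contributes 1 - p(i, c).
            score += math.log(p_ic if a_i == 1 else 1.0 - p_ic)
        scores[c] = score
    return scores

def normalize(scores):
    """Convert log-scores into posterior probabilities that sum to 1."""
    top = max(scores.values())
    weights = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

# Toy data: two classes, lexicon of three words (values invented for the example).
prior = {"sports": 0.5, "politics": 0.5}
p = {"sports": [0.8, 0.1, 0.3], "politics": [0.2, 0.7, 0.3]}
posterior = normalize(bernoulli_log_scores([1, 0, 0], prior, p))
```

Note that every absent word still contributes a factor of 1 − p(i, c) to the score, which is how word absence can come to dominate the result when the lexicon is large.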

In the multinomial model, the L terms in a document are treated as samples from a multinomial distribution. The total number of terms in the document (or document length) is denoted by $L = \sum_{j=1}^{d} a_j$. In this case, the value of a_i is assumed to be the raw frequency of the term in the document. The posterior class probabilities of a test document with the frequency vector (a₁ . . . a_d) are defined and estimated using the following generative approach:

1. Sample a class c with a class-specific prior probability.

2. Sample L terms with replacement from the term distribution of the chosen class c. The term distribution is defined using a multinomial model. The sampling process generates the frequency vector (a₁ . . . a_d). All training and test documents are assumed to be observed samples of this generative process. Therefore, all model parameters of the generative process are estimated from the training data.
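The two sampling steps above can be sketched in Python as follows. This is only a minimal simulation under stated assumptions: the function name, the dictionary layout, and the toy distributions are invented for illustration, and the document length is passed in as a parameter because the text does not say how L is chosen.

```python
import random
from collections import Counter

def generate_document(prior, term_dist, length, rng):
    """Simulate one document from the multinomial generative process.

    prior:     dict mapping class c -> prior probability of c
    term_dist: dict mapping class c -> list of term probabilities (length d)
    length:    document length L (assumed to be given; hypothetical parameter)
    rng:       a random.Random instance, so the simulation can be seeded
    """
    # Step 1: sample a class c according to the class-specific priors.
    classes = list(prior)
    c = rng.choices(classes, weights=[prior[k] for k in classes], k=1)[0]

    # Step 2: sample L terms with replacement from the term distribution of c.
    d = len(term_dist[c])
    sampled_terms = rng.choices(range(d), weights=term_dist[c], k=length)

    # Collapse the ordered samples into the frequency vector (a_1 ... a_d).
    counts = Counter(sampled_terms)
    a = [counts.get(i, 0) for i in range(d)]
    return c, a

# Toy parameters (invented): two classes over a lexicon of three terms.
toy_prior = {"x": 0.3, "y": 0.7}
toy_terms = {"x": [0.5, 0.3, 0.2], "y": [0.1, 0.2, 0.7]}
label, freq = generate_document(toy_prior, toy_terms, 10, random.Random(0))
```

By construction, the entries of the returned frequency vector sum to the document length L, and the order in which terms were drawn is discarded.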

Test instance classification: What is the posterior probability that the class c is selected in the first generative step, conditional on the observed word frequency (a₁ . . . a_d) in the test document?

When the sequential ordering of the L different samples is considered, the number of possible ways to sample the different terms to result in the representation (a₁ . . . a_d) is given
