Data Mining: The Textbook




The number of such ordered sequences is $L! / \prod_{i: a_i > 0} a_i!$. The probability of each of these sequences is given by $\prod_{i: a_i > 0} p(i, c)^{a_i}$, by using the naive independence assumption. In this case, p(i, c) is estimated as the fractional number of occurrences of word i in class c including repetitions. Therefore, unlike the Bernoulli model, repeated presence of word i in a document belonging to class c will increase p(i, c). If n(i, c) is the number of occurrences of word i in all documents belonging to class c, then $p(i, c) = n(i, c) / \sum_i n(i, c)$. Then, the class conditional feature distribution is estimated as follows:

$$P(x_1 = a_1, \ldots, x_d = a_d \mid C = c) \approx \frac{L!}{\prod_{i: a_i > 0} a_i!} \cdot \prod_{i: a_i > 0} p(i, c)^{a_i} \qquad (13.20)$$
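To make the estimation concrete, the following minimal Python sketch (with illustrative names such as term_counts and class_labels that are not from the text) computes p(i, c) = n(i, c)/∑_i n(i, c) from raw term frequencies and evaluates the logarithm of the class-conditional probability in Equation 13.20.

```python
import math
from collections import defaultdict

def estimate_p(term_counts, class_labels):
    """Estimate p(i, c) = n(i, c) / sum_i n(i, c) for every class c.

    term_counts:  list of dicts, each mapping word -> raw frequency a_i in a document
    class_labels: list of class labels aligned with term_counts
    """
    n_ic = defaultdict(lambda: defaultdict(float))  # n(i, c): occurrences of word i in class c
    total_c = defaultdict(float)                    # sum_i n(i, c)
    for doc, c in zip(term_counts, class_labels):
        for word, freq in doc.items():
            n_ic[c][word] += freq
            total_c[c] += freq
    return {c: {w: n / total_c[c] for w, n in words.items()}
            for c, words in n_ic.items()}

def log_class_conditional(doc, p_c):
    """Logarithm of Equation 13.20 for one class:
    log[ L! / prod_i a_i! ] + sum_{i: a_i > 0} a_i * log p(i, c)."""
    L = sum(doc.values())
    log_coeff = math.lgamma(L + 1) - sum(math.lgamma(a + 1) for a in doc.values())
    log_prod = 0.0
    for word, a in doc.items():            # only words with a_i > 0 are stored in doc
        p = p_c.get(word, 0.0)
        if p == 0.0:
            return float('-inf')           # an unseen word forces the probability to zero
        log_prod += a * math.log(p)
    return log_coeff + log_prod
```

In practice, the estimates p(i, c) are typically smoothed (for example, with Laplacian smoothing) so that unseen words do not drive the probability to zero; the sketch omits smoothing to stay close to the estimate described above.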


































Using the Bayes rule, the multinomial Bayes model computes the posterior probability for a test document as follows:



$$P(C = c \mid x_1 = a_1, \ldots, x_d = a_d) \propto P(C = c) \cdot P(x_1 = a_1, \ldots, x_d = a_d \mid C = c) \qquad (13.21)$$

$$\approx P(C = c) \cdot \frac{L!}{\prod_{i: a_i > 0} a_i!} \cdot \prod_{i: a_i > 0} p(i, c)^{a_i} \qquad (13.22)$$

$$\propto P(C = c) \cdot \prod_{i: a_i > 0} p(i, c)^{a_i}. \qquad (13.23)$$




The constant factor $L! / \prod_{i: a_i > 0} a_i!$ has been removed from the last condition because it is the same across all classes. Note that in this case, the product on the right-hand side only uses those words i for which $a_i$ is strictly larger than 0. Therefore, the nonoccurrence of words is ignored. In this case, we have assumed that each $a_i$ is the raw frequency of a word, which is an integer. It is also possible to use the multinomial Bayes model with the tf-idf frequency of a word, in which case the frequency $a_i$ might be fractional. However, the generative explanation becomes less intuitive in such a case.
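Because only the relative magnitudes of the posterior probabilities across classes matter for classification, Equation 13.23 can be applied directly in log space. The following sketch is a minimal illustration (not the book's code); it assumes that the priors P(C = c) and the word distributions p(i, c) have already been estimated, for example as in the earlier sketch.

```python
import math

def classify(doc, priors, p):
    """Pick the class maximizing Equation 13.23 in log space:
    log P(C = c) + sum_{i: a_i > 0} a_i * log p(i, c).

    doc:    dict mapping word -> raw frequency a_i of the test document
    priors: dict mapping class -> P(C = c)
    p:      dict mapping class -> {word: p(i, c)}
    """
    best_class, best_score = None, float('-inf')
    for c, prior in priors.items():
        score = math.log(prior)
        for word, a in doc.items():        # nonoccurrence of words is ignored
            p_ic = p[c].get(word, 0.0)
            if p_ic == 0.0:
                score = float('-inf')
                break
            score += a * math.log(p_ic)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```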




13.5.3 SVM Classifiers for High-Dimensional and Sparse Data


The number of terms in the Lagrangian dual of the SVM formulation scales with the square of the number of training data points. The reader is advised to refer to Sect. 10.6 of Chap. 10 and Sect. 11.4.2 of Chap. 11 for the relevant discussions. While the SVMLight method of Sect. 11.4.2 of Chap. 11 addresses this issue by making changes to the algorithmic approach, it does not modify the SVM formulation itself. Most importantly, the approach does not make any modifications to address the high-dimensional and sparse nature of text data.


The text domain is high-dimensional and sparse. Only a small subset of the dimensions takes on nonzero values for a given text document. Furthermore, linear classifiers tend to work rather well for the text domain, and it is often not necessary to use the kernelized version of the classifier. Therefore, it is natural to focus on linear classifiers and ask whether it is possible to improve the complexity of SVM classification further by exploiting the special domain-specific characteristics of text. SVMPerf is a linear-time algorithm designed for text classification. Its training complexity is O(n · s), where s is the average number of nonzero attributes per training document in the collection.
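The O(n · s) behavior is possible because, for a linear classifier, the score W · X of a sparse document can be computed by touching only its s nonzero entries rather than all d dimensions. A small illustrative sketch follows; the representation and function name are assumptions for illustration, not part of SVMPerf itself.

```python
def sparse_score(W, doc):
    """Score W . X of a linear classifier for a sparse document.

    W:   dense weight vector indexed by term id
    doc: dict mapping term id -> frequency; only the s nonzero entries are
         stored, so the cost is O(s) rather than O(d)
    """
    return sum(W[term] * freq for term, freq in doc.items())

# Example: vocabulary of d = 6 terms, but the document has only s = 2 nonzero terms.
W = [0.5, -0.2, 0.0, 1.1, 0.3, -0.7]
doc = {0: 2, 3: 1}            # term 0 occurs twice, term 3 occurs once
print(sparse_score(W, doc))   # 0.5*2 + 1.1*1 = 2.1
```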


To explain the approach, we first briefly recap the soft penalty-based SVM formulation introduced in Sect. 10.6 of Chap. 10. The problem definition, referred to as the optimization formulation (OP1), is as follows:



































(OP1): Minimize $\dfrac{||W||^2}{2} + \dfrac{C}{n} \sum_{i=1}^{n} \xi_i$

subject to:

$y_i \, (W \cdot X_i) \geq 1 - \xi_i \quad \forall i$

$\xi_i \geq 0 \quad \forall i.$













One difference from the conventional SVM formulation of Chap. 10 is that the constant term b is missing. The conventional SVM formulation uses the constraint $y_i (W \cdot X_i + b) \geq 1 - \xi_i$. The two formulations are, however, equivalent because it can be shown that adding a dummy feature with a constant value of 1 to each training instance has the same effect. The coefficient in W of this dummy feature will be equal to b. Another minor difference from the conventional formulation is that the slack component in the objective function is divided by the number of training points n. This is not a significant difference either, because the constant C can be adjusted accordingly. These minor variations in notation are adopted without loss of generality, for algebraic simplicity.
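The dummy-feature argument can be made concrete with a short sketch: appending a constant value of 1 to every training instance lets the bias b be learned as the last coordinate of W. The array names below are illustrative assumptions.

```python
import numpy as np

# X has one row per training document; appending a constant-1 column means the
# last coordinate of the learned weight vector plays the role of the bias b.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# A constraint y_i (W . X_i + b) >= 1 - xi_i is then equivalent to
# y_i (W_aug . X_aug_i) >= 1 - xi_i with W_aug = (W, b).
```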


The SVMPerf method reformulates this problem with a single slack variable ξ, and $2^n$ constraints, each generated by summing a subset of the n constraints in (OP1). Let $U = (u_1, \ldots, u_n) \in \{0, 1\}^n$ represent the indicator vector for the constraints that are summed up to create one such synthetic constraint. An alternative formulation of the SVM model is as follows:





(OP2): Minimize $\dfrac{||W||^2}{2} + C \, \xi$

subject to:

$\dfrac{1}{n} \sum_{i=1}^{n} u_i \, y_i \, (W \cdot X_i) \geq \dfrac{\sum_{i=1}^{n} u_i}{n} - \xi \quad \forall \, U \in \{0, 1\}^n$

$\xi \geq 0.$
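Each choice of the indicator vector U yields one summed constraint in (OP2). As a hedged illustration (not the SVMPerf training procedure itself, and with assumed variable names), the following sketch checks whether a candidate solution (W, ξ) satisfies the constraint associated with a particular U.

```python
import numpy as np

def op2_constraint_satisfied(W, xi, X, y, u):
    """Check the (OP2) constraint generated by the indicator vector u:
       (1/n) * sum_i u_i * y_i * (W . X_i)  >=  (sum_i u_i) / n  -  xi

    X: (n, d) matrix of training documents
    y: (n,) labels in {-1, +1}
    u: (n,) indicator vector in {0, 1}
    """
    n = X.shape[0]
    lhs = np.sum(u * y * (X @ W)) / n
    rhs = np.sum(u) / n - xi
    return lhs >= rhs
```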








































