$L! / \prod_{i: a_i > 0} a_i!$. The probability of each of these sequences is given by $\prod_{i: a_i > 0} p(i, c)^{a_i}$, by using
the naive independence assumption. In this case, p(i, c) is estimated as the fractional number
of occurrences of word i in class c including repetitions. Therefore, unlike the Bernoulli model, repeated presence of word i in a document belonging to class c will increase p(i, c). If n(i, c) is the number of occurrences of word i in all documents belonging to class c, then
$p(i, c) = \frac{n(i, c)}{\sum_i n(i, c)}$. Then, the class conditional feature distribution is estimated as follows:
$$P(x_1 = a_1, \ldots, x_d = a_d \mid C = c) \approx \frac{L!}{\prod_{i: a_i > 0} a_i!} \prod_{i: a_i > 0} p(i, c)^{a_i}. \qquad (13.20)$$
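As a concrete illustration of Eq. 13.20, the following minimal Python sketch estimates p(i, c) from per-class term counts and evaluates the class-conditional probability in log space (log L! is computed with the log-gamma function to avoid overflow). The function names and the array-based representation are illustrative assumptions, not part of the original formulation; in practice the estimate of p(i, c) is usually Laplace-smoothed.

```python
import numpy as np
from math import lgamma

def estimate_p(class_term_counts, alpha=0.0):
    """Estimate p(i, c) = n(i, c) / sum_i n(i, c) for one class c.
    class_term_counts[i] holds n(i, c); alpha > 0 adds Laplace smoothing."""
    counts = class_term_counts + alpha
    return counts / counts.sum()

def log_class_conditional(a, p_c):
    """Log of Eq. 13.20 for a document with frequency vector a = (a_1 ... a_d):
    log L! - sum_i log a_i! + sum_i a_i * log p(i, c), over terms with a_i > 0."""
    nz = a > 0
    L = a.sum()
    log_coeff = lgamma(L + 1) - sum(lgamma(ai + 1) for ai in a[nz])
    return log_coeff + np.sum(a[nz] * np.log(p_c[nz]))
```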
Using the Bayes rule, the multinomial Bayes model computes the posterior probability for a test document as follows:
$$P(C = c \mid x_1 = a_1, \ldots, x_d = a_d) \propto P(C = c) \cdot P(x_1 = a_1, \ldots, x_d = a_d \mid C = c) \qquad (13.21)$$
$$\approx P(C = c) \cdot \frac{L!}{\prod_{i: a_i > 0} a_i!} \prod_{i: a_i > 0} p(i, c)^{a_i} \qquad (13.22)$$
$$\propto P(C = c) \cdot \prod_{i: a_i > 0} p(i, c)^{a_i}. \qquad (13.23)$$
The constant factor $L! / \prod_{i: a_i > 0} a_i!$ has been removed from the last expression because it is the same across all classes. Note that in this case, the product on the right-hand side only uses those words i for which ai is strictly larger than 0. Therefore, the nonoccurrence of words is ignored. In this case, we have assumed that each ai is the raw frequency of a word, which is an integer. It is also possible to use the multinomial Bayes model with the tf-idf frequency of a word, in which case the frequency ai may be fractional. However, the generative explanation becomes less intuitive in such a case.
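The posterior comparison in Eq. 13.23 is typically carried out in log space. The sketch below assumes the log-priors and (smoothed) log p(i, c) values have already been estimated per class; the names are illustrative and not from the original text.

```python
import numpy as np

def predict_class(a, log_priors, log_p):
    """Return the class c maximizing Eq. 13.23 in log space:
    log P(C = c) + sum_{i: a_i > 0} a_i * log p(i, c).
    The class-independent factor L!/prod a_i! is dropped, as in Eq. 13.23.

    a          : term-frequency vector of the test document (numpy array)
    log_priors : dict mapping class -> log P(C = c)
    log_p      : dict mapping class -> numpy array of log p(i, c) values
    """
    nz = a > 0
    scores = {c: log_priors[c] + float(np.sum(a[nz] * log_p[c][nz]))
              for c in log_priors}
    return max(scores, key=scores.get)
```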
13.5.3 SVM Classifiers for High-Dimensional and Sparse Data
The number of terms in the Lagrangian dual of the SVM formulation scales with the square of the number of training data points. The reader is advised to refer to Sect. 10.6 of Chap. 10 and Sect. 11.4.2 of Chap. 11 for the relevant discussions. While the SVMLight method of Sect. 11.4.2 of Chap. 11 addresses this issue by making changes to the algorithmic approach, it does not modify the SVM formulation itself. Most importantly, the approach does not make any modifications to address the high-dimensional and sparse nature of text data.
The text domain is high dimensional and sparse. Only a small subset of the dimensions take on nonzero values for a given text document. Furthermore, linear classifiers tend to work rather well for the text domain, and it is often not necessary to use the kernelized version of the classifier. Therefore, it is natural to focus on linear classifiers, and ask whether it is possible to improve the complexity of SVM classification further by using the special domain-specific characteristics of text. SVMPerf is a linear-time algorithm designed for text classification. Its training complexity is O(n · s), where s is the average number of nonzero attributes per training document in the collection.
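The O(n · s) bound rests on the fact that, for a linear classifier, each document contributes work proportional only to its nonzero entries. A minimal sketch of this idea, assuming documents are stored as sparse (term index, value) pairs:

```python
def sparse_dot(w, doc):
    """W . X for a document stored as a list of (term_index, value) pairs.
    The cost is proportional to the number of nonzero terms s in the
    document, independent of the full dimensionality d of w."""
    return sum(value * w[term_index] for term_index, value in doc)
```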
To explain the approach, we first briefly recap the soft penalty-based SVM formulation introduced in Sect. 10.6 of Chap. 10. The problem definition, referred to as the optimization formulation (OP1), is as follows:
$$\text{(OP1): Minimize } \frac{||W||^2}{2} + \frac{C}{n} \sum_{i=1}^{n} \xi_i$$
$$\text{subject to: } y_i \, (W \cdot X_i) \geq 1 - \xi_i \quad \forall i$$
$$\xi_i \geq 0 \quad \forall i.$$
One difference from the conventional SVM formulation of Chap. 10 is that the constant term b is missing. The conventional SVM formulation uses the constraint yi(W · Xi + b) ≥ 1 − ξi. The two formulations are, however, equivalent, because it can be shown that adding a dummy feature with a constant value of 1 to each training instance has the same effect; the coefficient of this feature in W will be equal to b. Another minor difference from the conventional formulation is that the slack component in the objective function is scaled down by a factor of n. This is not a significant difference either, because the constant C can be adjusted accordingly. These minor variations in notation are adopted without loss of generality for algebraic simplicity.
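The dummy-feature trick mentioned above can be made concrete with a short sketch (assuming dense numpy feature matrices; this is illustrative, not part of the original formulation):

```python
import numpy as np

def add_dummy_feature(X):
    """Append a constant feature with value 1 to each training instance.
    The learned coefficient of this extra feature plays the role of the
    bias b, so the constraint y_i (W . X_i) >= 1 - xi_i needs no explicit b."""
    return np.hstack([X, np.ones((X.shape[0], 1))])
```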
The SVMPerf method reformulates this problem with a single slack variable ξ, and 2^n constraints that are generated by summing subsets of the n constraints in (OP1). Let U = (u1 . . . un) ∈ {0, 1}^n represent the indicator vector for the constraints that are summed up to create one such synthetic constraint. An alternative formulation of the SVM model is as follows:
$$\text{(OP2): Minimize } \frac{||W||^2}{2} + C\xi$$
$$\text{subject to: } \frac{1}{n} \sum_{i=1}^{n} u_i \, y_i \, (W \cdot X_i) \geq \frac{1}{n} \sum_{i=1}^{n} u_i - \xi \quad \forall \, U \in \{0, 1\}^n$$
$$\xi \geq 0.$$
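For a given indicator vector U, the corresponding (OP2) constraint is obtained by averaging the selected (OP1) constraints. The sketch below only evaluates the two sides of such a constraint; selecting which vectors U to enforce is the job of SVMPerf's cutting-plane procedure, which is not shown here.

```python
import numpy as np

def aggregated_constraint(w, X, y, u):
    """Average the (OP1) constraints y_i (W . X_i) >= 1 - xi_i selected by
    u in {0,1}^n.  The single (OP2) constraint then reads
        lhs >= (1/n) * sum_i u_i - xi.
    Returns the left-hand side and the constant part of the right-hand side."""
    n = len(y)
    lhs = float(np.sum(u * y * (X @ w))) / n
    rhs_const = float(np.sum(u)) / n
    return lhs, rhs_const
```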