Intuitively, the idea is to use the EM clustering algorithm to determine the clusters of documents most similar to the labeled classes. A partially supervised EM clustering method associates each cluster with a particular class. The conditional feature distributions in these clusters are used as a more robust proxy for the feature distributions of the corresponding classes.
The basic idea is to use a generative model to create semisupervised clusters from the data. A one-to-one correspondence between the mixture components and the classes is retained in this case. The use of EM algorithms for clustering categorical data and its semisupervised variant are discussed in Sects. 7.2.3 and 7.5.1, respectively, of Chap. 7. The reader is advised to revisit these sections for the relevant background before reading further.
For initialization, the labeled examples are used as the seeds for the EM algorithm, and the number of mixture components is set to the number of classes. A Bayes classifier is used to assign documents to clusters (classes) in the E-step. In the first iteration, the Bayes classifier uses only the labeled data to determine the initial set of posterior cluster (class) membership probabilities, as in a standard Bayes classifier. This results in a set of “soft” clusters, in which each (unlabeled) data point X has a weight w(X, c) in the range (0, 1) associated with each class c, corresponding to its posterior Bayes membership probability. Labeled documents, on the other hand, have binary weights of either 0 or 1 for each class, depending on their fixed assignments. The value of P(xj = aj | C = c) is then estimated using a weighted variant of Eq. 10.22 in Chap. 10 that leverages both the labeled and the unlabeled documents:
$$P(x_j = a_j \mid C = c) = \frac{\sum_{X \in L \cup U} w(X, c)\, I(x_j, a_j)}{\sum_{X \in L \cup U} w(X, c)} \qquad (11.20)$$
Here, L and U denote the sets of labeled and unlabeled documents, respectively, and I(xj, aj) is an indicator variable that takes on the value of 1 if the jth feature xj of X is aj, and 0 otherwise. The major difference from Eq. 10.22 is that the posterior Bayes estimates of the unlabeled documents are also used to estimate the class-conditional feature distributions. As in the standard Bayes method, the same Laplacian smoothing approach may be incorporated to reduce overfitting. The prior probabilities P(C = c) of each cluster may also be estimated by computing the average assignment probability of the data points to the corresponding class. This is the M-step of the EM algorithm. The next E-step uses these modified values of P(xj = aj | C = c) and the prior probabilities to derive the posterior Bayes probabilities with a standard Bayes classifier. Therefore, the Bayes classifier implicitly incorporates the impact of the unlabeled data.
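The weighted estimation of Eq. 11.20 is straightforward to implement. The following is a minimal sketch of the M-step, assuming binary (Bernoulli) term-presence features in a 0/1 document-term matrix; the names m_step, theta, and W, as well as the smoothing parameter alpha, are illustrative choices rather than notation from the text.

```python
import numpy as np

def m_step(X, W, alpha=1.0):
    """Weighted M-step of Eq. 11.20 with Laplacian smoothing.

    X : (n_docs, n_features) binary document-term matrix over L ∪ U.
    W : (n_docs, n_classes) weights w(X, c); rows of labeled documents
        are fixed 0/1 indicators, rows of unlabeled documents hold the
        posterior probabilities from the previous E-step.
    """
    # Numerator of Eq. 11.20: sum over L ∪ U of w(X, c) * I(x_j, 1)
    weighted_presence = W.T @ X                    # shape (n_classes, n_features)
    # Denominator of Eq. 11.20: total weight assigned to each class
    class_weight = W.sum(axis=0)[:, np.newaxis]    # shape (n_classes, 1)
    # Laplacian smoothing (two possible values per binary feature)
    theta = (weighted_presence + alpha) / (class_weight + 2 * alpha)
    # Prior P(C = c): average assignment probability over all documents
    priors = W.sum(axis=0) / W.shape[0]
    return theta, priors
```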
The algorithm may be summarized by the following two iterative steps that are continually repeated to convergence:
1. (E-step) Estimate the posterior probabilities of membership of data points in clusters (classes) using the Bayes rule.
2. (M-step) Estimate the conditional feature distributions and prior probabilities of the clusters (classes) with Eq. 11.20, using the current posterior membership probabilities as weights.
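As a rough end-to-end sketch, the two steps can be alternated as follows. This again assumes the Bernoulli feature model and reuses the m_step function from the previous snippet; e_step, semisupervised_em, and n_iter are hypothetical names, and a fixed iteration count stands in for a proper convergence check.

```python
def e_step(X, theta, priors):
    """E-step: posterior P(C = c | X) from the Bayes rule (in log space)."""
    log_lik = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
    log_post = log_lik + np.log(priors)
    log_post -= log_post.max(axis=1, keepdims=True)      # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def semisupervised_em(X_lab, y_lab, X_unlab, n_classes, n_iter=20, alpha=1.0):
    """Alternate E- and M-steps; labeled weights stay fixed at 0/1."""
    W_lab = np.eye(n_classes)[y_lab]                      # y_lab: integer class labels
    X_all = np.vstack([X_lab, X_unlab])
    # First iteration: the Bayes model is built from the labeled data alone
    theta, priors = m_step(X_lab, W_lab, alpha)
    for _ in range(n_iter):
        # E-step: soft cluster (class) memberships for the unlabeled documents
        W_unlab = e_step(X_unlab, theta, priors)
        # M-step: re-estimate feature distributions and priors over L ∪ U (Eq. 11.20)
        theta, priors = m_step(X_all, np.vstack([W_lab, W_unlab]), alpha)
    return theta, priors
```

The returned theta and priors define the final Bayes classifier, which can label previously unseen documents by applying the same E-step computation once.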