Data Mining: The Textbook




[Figure 13.3: Varying generative process of EM-clustering and PLSA. (a) EM-clustering (Sect. 13.3.2): a hidden component (cluster) is selected, and its estimated parameters are used to generate an entire row (document) of the document-term matrix, so rows are the observed frequencies. (b) PLSA: a hidden component (aspect) is selected, and its estimated parameters are used to select a position (document, term) in the document-term matrix and increase that entry by 1, so individual entries are the observed frequencies.]


PLSA can be viewed as a probabilistic variant of SVD and LSA, rather than a probabilistic variant of clustering. Nevertheless, soft clusters can also be generated with the use of this method. There are many other dimensionality reduction methods, such as nonnegative matrix factorization, which are intimately related to clustering. PLSA is, in fact, a nonnegative matrix factorization method with a maximum-likelihood objective function.

In most of the EM clustering algorithms of this book, a mixture component (cluster) is selected, and then the data record is generated based on a particular form of the distribution of that component. An example is the Bernoulli clustering model, which is discussed in Sect. 13.3.2. In PLSA, the generative process³ is inherently designed for dimensionality reduction rather than clustering, and different parts of the same document can be generated by different mixture components. It is assumed that there are k aspects (or latent topics) denoted by G1 . . . Gk. The generative process builds the document-term matrix as follows:


1. Select a latent component (aspect) Gm with probability P (Gm).


2. Generate the indices (i, j) of a document–word pair (Xi, wj ) with probabilities P (Xi|Gm) and P (wj |Gm), respectively. Increment the frequency of entry (i, j) in the document-term matrix by 1. The document and word indices are generated in a probabilistically independent way.


All the parameters of this generative process, such as P (Gm), P (Xi|Gm), and P (wj |Gm), need to be estimated from the observed frequencies in the n × d document-term matrix.
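To make the process concrete, the following minimal sketch (a hypothetical NumPy simulation, not part of the original text) draws document–word pairs according to these two steps and accumulates them into a synthetic document-term matrix; the parameter arrays p_g, p_doc_given_g, and p_word_given_g are assumed to be given.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, k = 5, 8, 2                                      # documents, words, aspects (toy sizes)
p_g = rng.dirichlet(np.ones(k))                        # P(G_m), shape (k,)
p_doc_given_g = rng.dirichlet(np.ones(n), size=k).T    # P(X_i | G_m), shape (n, k)
p_word_given_g = rng.dirichlet(np.ones(d), size=k).T   # P(w_j | G_m), shape (d, k)

D = np.zeros((n, d), dtype=int)                        # document-term matrix to be generated
for _ in range(10_000):                                # total number of observed tokens
    m = rng.choice(k, p=p_g)                           # 1. select aspect G_m
    i = rng.choice(n, p=p_doc_given_g[:, m])           # 2a. select document index
    j = rng.choice(d, p=p_word_given_g[:, m])          # 2b. select word index (independently)
    D[i, j] += 1                                       # increment entry (i, j) by 1
```

In practice these parameters are unknown and must be estimated from an observed document-term matrix, which is the role of the EM algorithm described below.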


Although the aspects G1 . . . Gk are analogous to the clusters of Sect. 13.3.2, they are not the same. Note that each iteration of the generative process of Sect. 13.3.2 creates the final frequency vector of an entire row of the document-term matrix. In PLSA, even a single matrix entry may have frequency contributions from various mixture components. Indeed, even in deterministic latent semantic analysis, a document is expressed as a linear combination of different latent directions. Therefore, the interpretation of each mixture component as a cluster is more direct in the method of Sect. 13.3.2. The generative differences between these models are illustrated in Fig. 13.3. Nevertheless, PLSA can also be used for clustering because of the highly interpretable and nonnegative nature of the underlying latent factorization. The relationship with and applicability to clustering will be discussed later.








³The original work [271] uses an asymmetric generative process, which is equivalent to the (simpler) symmetric generative process discussed here.


An important assumption in PLSA is that the selected documents and words are conditionally independent after the latent topical component Gm has been fixed. In other words, the following is assumed:





$$P(X_i, w_j \mid G_m) = P(X_i \mid G_m) \cdot P(w_j \mid G_m) \qquad (13.7)$$




This implies that the joint probability P (Xi, wj ) for selecting a document–word pair can be expressed in the following way:














$$P(X_i, w_j) = \sum_{m=1}^{k} P(G_m) \cdot P(X_i, w_j \mid G_m) = \sum_{m=1}^{k} P(G_m) \cdot P(X_i \mid G_m) \cdot P(w_j \mid G_m) \qquad (13.8)$$




It is important to note that local independence between documents and words within a latent component does not imply global independence between the same pair over the entire corpus. The local independence assumption is useful in the derivation of the EM algorithm.


In PLSA, the posterior probability P (Gm|Xi, wj ) of the latent component associated with a particular document–word pair is estimated. The EM algorithm starts by initializing P (Gm), P (Xi|Gm), and P (wj |Gm) to 1/k, 1/n, and 1/d, respectively. Here, k, n, and d denote the number of clusters, number of documents, and number of words, respectively. The algorithm iteratively executes the following E- and M-steps to convergence:





1. (E-step) Estimate the posterior probability P (Gm|Xi, wj ) in terms of P (Gm), P (Xi|Gm), and P (wj |Gm).




2. (M-step) Estimate P (Gm), P (Xi|Gm), and P (wj |Gm) in terms of the posterior probability P (Gm|Xi, wj ) and the observed data about word-document co-occurrences, using log-likelihood maximization.

These steps are iteratively repeated to convergence. It now remains to discuss the details of the E-step and the M-step. First, the E-step is discussed. The posterior probability estimated in the E-step can be expanded using the Bayes rule:

















$$P(G_m \mid X_i, w_j) = \frac{P(G_m) \cdot P(X_i, w_j \mid G_m)}{P(X_i, w_j)} \qquad (13.9)$$







The numerator of the right-hand side of the aforementioned equation can be expanded using Eq. 13.7, and the denominator can be expanded using Eq. 13.8:




















$$P(G_m \mid X_i, w_j) = \frac{P(G_m) \cdot P(X_i \mid G_m) \cdot P(w_j \mid G_m)}{\sum_{r=1}^{k} P(G_r) \cdot P(X_i \mid G_r) \cdot P(w_j \mid G_r)} \qquad (13.10)$$







This shows that the E-step can be implemented in terms of the estimated values P (Gm), P (Xi|Gm), and P (wj |Gm).
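For illustration, a minimal NumPy sketch of this E-step (a hypothetical implementation of Eq. 13.10, not from the original text) is shown below; the arrays p_g, p_doc_given_g, and p_word_given_g have shapes (k,), (n, k), and (d, k), matching the earlier generative sketch.

```python
import numpy as np

def e_step(p_g, p_doc_given_g, p_word_given_g):
    """Return posterior[i, j, m] = P(G_m | X_i, w_j) according to Eq. 13.10."""
    # Numerator of Eq. 13.10: P(G_m) * P(X_i | G_m) * P(w_j | G_m), shape (n, d, k).
    joint = (p_g[None, None, :]
             * p_doc_given_g[:, None, :]
             * p_word_given_g[None, :, :])
    # Denominator: sum of the numerator over the k aspects for each (i, j) pair.
    return joint / joint.sum(axis=2, keepdims=True)
```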

It remains to show how these values can be estimated using the observed word-document co-occurrences in the M-step. The posterior probabilities P (Gm|Xi, wj ) may be viewed as weights attached to the word-document co-occurrence pairs for each aspect Gm. These weights can be leveraged to estimate the values P (Gm), P (Xi|Gm), and P (wj |Gm) for each aspect using maximization of the log-likelihood function. The details of the log-likelihood function and the differential calculus associated with the maximization process will not be discussed here. Rather, the final estimated values will be presented directly. Let f (Xi, wj ) represent the observed frequency of the occurrence of word wj in document Xi in the corpus. Then, the estimations in the M-step are as follows:





$$P(X_i \mid G_m) \propto \sum_{w_j} f(X_i, w_j) \cdot P(G_m \mid X_i, w_j) \quad \forall i \in \{1 \ldots n\},\ m \in \{1 \ldots k\} \qquad (13.11)$$

$$P(w_j \mid G_m) \propto \sum_{X_i} f(X_i, w_j) \cdot P(G_m \mid X_i, w_j) \quad \forall j \in \{1 \ldots d\},\ m \in \{1 \ldots k\} \qquad (13.12)$$

$$P(G_m) \propto \sum_{X_i} \sum_{w_j} f(X_i, w_j) \cdot P(G_m \mid X_i, w_j) \quad \forall m \in \{1 \ldots k\} \qquad (13.13)$$
















Each of these estimations may be scaled to a probability by ensuring that they sum to 1 over all the outcomes for that random variable. This scaling corresponds to the constant of proportionality associated with the "∝" notation in the aforementioned equations. Furthermore, these estimations can be used to decompose the original document-term matrix into a product of three matrices, which is very similar to SVD/LSA. This relationship will be explored in the next section.
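A matching M-step sketch, under the same hypothetical setup as the earlier snippets, is given below; it applies Eqs. 13.11–13.13 with the observed frequencies f (Xi, wj ) stored in the n × d matrix D, and then scales each estimate so that it sums to 1 over its outcomes, as described above.

```python
import numpy as np

def m_step(D, posterior):
    """D is the n x d frequency matrix; posterior[i, j, m] = P(G_m | X_i, w_j)."""
    weighted = D[:, :, None] * posterior           # f(X_i, w_j) * P(G_m | X_i, w_j)
    p_doc_given_g = weighted.sum(axis=1)           # Eq. 13.11: sum over words, shape (n, k)
    p_word_given_g = weighted.sum(axis=0)          # Eq. 13.12: sum over documents, shape (d, k)
    p_g = weighted.sum(axis=(0, 1))                # Eq. 13.13: sum over all pairs, shape (k,)
    # Scale each estimate to a probability (the constant of proportionality).
    p_doc_given_g /= p_doc_given_g.sum(axis=0, keepdims=True)
    p_word_given_g /= p_word_given_g.sum(axis=0, keepdims=True)
    p_g /= p_g.sum()
    return p_g, p_doc_given_g, p_word_given_g

# Usage sketch: alternate the E-step and M-step until the estimates stabilize.
# for _ in range(100):
#     posterior = e_step(p_g, p_doc_given_g, p_word_given_g)
#     p_g, p_doc_given_g, p_word_given_g = m_step(D, posterior)
```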


13.4.1 Use in Dimensionality Reduction and Comparison with Latent Semantic Analysis

The three key sets of parameters estimated in the M-step are P (Xi|Gm), P (wj |Gm), and P (Gm), respectively. These sets of parameters provide an SVD-like matrix factorization of the n × d document-term matrix D. Assume that the document-term matrix D is scaled by a constant to sum to an aggregate probability value of 1. Therefore, the (i, j)th entry of D can be viewed as an observed instantiation of the probabilistic quantity P (Xi, wj ). Let Qk be the n × k matrix, for which the (i, m)th entry is P (Xi|Gm), let Σk be the k × k diagonal matrix for which the mth diagonal entry is P (Gm), and let Pk be the d × k matrix for which the (j, m)th entry is P (wj |Gm). Then, the (i, j)th entry P (Xi, wj ) of the matrix D can be expressed in terms of the entries of the aforementioned matrices according to Eq. 13.8, which is replicated here:








$$P(X_i, w_j) = \sum_{m=1}^{k} P(G_m) \cdot P(X_i \mid G_m) \cdot P(w_j \mid G_m) \qquad (13.14)$$

The LHS of the equation is equal to the (i, j)th entry of D, whereas the RHS of the equation is the (i, j)th entry of the matrix product QkΣkPkT. Depending on the number of components k, this product can only approximate the matrix D, and the approximation is denoted by Dk. By stacking up the n × d conditions of Eq. 13.14, the following matrix condition is obtained:



$$D_k = Q_k \Sigma_k P_k^{T} \qquad (13.15)$$




It is instructive to note that the matrix decomposition in Eq. 13.15 is similar to that in SVD/LSA (cf. Eq. 2.12 of Chap. 2). Therefore, as in LSA, Dk is an approximation of the document-term matrix D, and the transformed representation in k-dimensional space is given by QkΣk. However, the transformed representations will be different in PLSA and LSA. This is because different objective functions are optimized in the two cases. LSA minimizes the mean-squared error of the approximation, whereas PLSA maximizes the log-likelihood fit to a probabilistic generative model. One advantage of PLSA is that the entries of Qk and Pk and the transformed coordinate values are nonnegative and have clear probabilistic interpretability.
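Continuing the hypothetical sketches above, the factorization of Eq. 13.15 and the reduced k-dimensional representation QkΣk can be assembled directly from the EM parameter estimates, for example:

```python
import numpy as np

def plsa_factorization(p_g, p_doc_given_g, p_word_given_g):
    """Assemble the factorization of Eq. 13.15 from the EM parameter estimates."""
    Q_k = p_doc_given_g            # (i, m)th entry is P(X_i | G_m), shape (n, k)
    Sigma_k = np.diag(p_g)         # mth diagonal entry is P(G_m), shape (k, k)
    P_k = p_word_given_g           # (j, m)th entry is P(w_j | G_m), shape (d, k)
    D_k = Q_k @ Sigma_k @ P_k.T    # approximates D scaled to aggregate probability 1
    reduced_docs = Q_k @ Sigma_k   # nonnegative k-dimensional document representation
    return D_k, reduced_docs
```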







