Data Mining: The Textbook




[Figure 13.3: Varying generative process of EM-clustering and PLSA. (a) EM-clustering (Sect. 13.3.2): a hidden component (cluster) is selected, and its estimated parameters are used to generate an entire row (document) of the document-term matrix, so rows are the observed frequencies. (b) PLSA: a hidden component (aspect) is selected, and its estimated parameters are used to select a position (document, term) in the document-term matrix and increase that entry by 1, so individual entries are the observed frequencies.]


PLSA can be viewed as a probabilistic variant of SVD and LSA, rather than a probabilistic variant of clustering. Nevertheless, soft clusters can also be generated with the use of this method. There are many other dimensionality reduction methods, such as nonnegative matrix factorization, which are intimately related to clustering. PLSA is, in fact, a nonnegative matrix factorization method with a maximum-likelihood objective function.

In most of the EM clustering algorithms of this book, a mixture component (cluster) is selected, and then the data record is generated based on a particular form of the distribution of that component. An example is the Bernoulli clustering model, which is discussed in Sect. 13.3.2. In PLSA, the generative process³ is inherently designed for dimensionality reduction rather than clustering, and different parts of the same document can be generated by different mixture components. It is assumed that there are k aspects (or latent topics) denoted by G1 . . . Gk. The generative process builds the document-term matrix as follows:


1. Select a latent component (aspect) Gm with probability P (Gm).


2. Generate the indices (i, j) of a document–word pair (Xi, wj ) with probabilities P (Xi|Gm) and P (wj |Gm), respectively. Increment the frequency of entry (i, j) in the document-term matrix by 1. The document and word indices are generated in a probabilistically independent way.


All the parameters of this generative process, such as P (Gm), P (Xi|Gm), and P (wj |Gm), need to be estimated from the observed frequencies in the n × d document-term matrix.
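To make the process concrete, the following minimal sketch (a hypothetical NumPy simulation, not part of the original text) draws document–word pairs according to these two steps and accumulates them into a synthetic document-term matrix; the parameter arrays p_g, p_doc_given_g, and p_word_given_g are assumed to be given.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, k = 5, 8, 2                                      # documents, words, aspects (toy sizes)
p_g = rng.dirichlet(np.ones(k))                        # P(G_m), shape (k,)
p_doc_given_g = rng.dirichlet(np.ones(n), size=k).T    # P(X_i | G_m), shape (n, k)
p_word_given_g = rng.dirichlet(np.ones(d), size=k).T   # P(w_j | G_m), shape (d, k)

D = np.zeros((n, d), dtype=int)                        # document-term matrix to be generated
for _ in range(10_000):                                # total number of observed tokens
    m = rng.choice(k, p=p_g)                           # 1. select aspect G_m
    i = rng.choice(n, p=p_doc_given_g[:, m])           # 2a. select document index
    j = rng.choice(d, p=p_word_given_g[:, m])          # 2b. select word index (independently)
    D[i, j] += 1                                       # increment entry (i, j) by 1
```

In practice these parameters are unknown and must be estimated from an observed document-term matrix, which is the role of the EM algorithm described below.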


Although the aspects G1 . . . Gk are analogous to the clusters of Sect. 13.3.2, they are not the same. Note that each iteration of the generative process of Sect. 13.3.2 creates the final frequency vector of an entire row of the document-term matrix. In PLSA, even a single matrix entry may have frequency contributions from various mixture components. Indeed, even in deterministic latent semantic analysis, a document is expressed as a linear combination of different latent directions. Therefore, the interpretation of each mixture component as a cluster is more direct in the method of Sect. 13.3.2. The generative differences between these models are illustrated in Fig. 13.3. Nevertheless, PLSA can also be used for clustering because of the highly interpretable and nonnegative nature of the underlying latent factorization. The relationship with and applicability to clustering will be discussed later.








³The original work [271] uses an asymmetric generative process, which is equivalent to the (simpler) symmetric generative process discussed here.


An important assumption in PLSA is that the selected documents and words are conditionally independent after the latent topical component Gm has been fixed. In other words, the following is assumed:





$$P(X_i, w_j \mid G_m) = P(X_i \mid G_m) \cdot P(w_j \mid G_m) \qquad (13.7)$$




This implies that the joint probability P (Xi, wj ) for selecting a document–word pair can be expressed in the following way:














$$P(X_i, w_j) = \sum_{m=1}^{k} P(G_m) \cdot P(X_i, w_j \mid G_m) = \sum_{m=1}^{k} P(G_m) \cdot P(X_i \mid G_m) \cdot P(w_j \mid G_m) \qquad (13.8)$$




It is important to note that local independence between documents and words within a latent component does not imply global independence between the same pair over the entire corpus. The local independence assumption is useful in the derivation of the EM algorithm.


In PLSA, the posterior probability P (Gm|Xi, wj ) of the latent component associated with a particular document–word pair is estimated. The EM algorithm starts by initializing P (Gm), P (Xi|Gm), and P (wj |Gm) to 1/k, 1/n, and 1/d, respectively. Here, k, n, and d denote the number of clusters, number of documents, and number of words, respectively. The algorithm iteratively executes the following E- and M-steps to convergence:





1. (E-step) Estimate the posterior probability P (Gm|Xi, wj ) in terms of P (Gm), P (Xi|Gm), and P (wj |Gm).




2. (M-step) Estimate P (Gm), P (Xi|Gm), and P (wj |Gm) in terms of the posterior probability P (Gm|Xi, wj ) and the observed data about word-document co-occurrences, using log-likelihood maximization.

These steps are iteratively repeated to convergence. It now remains to discuss the details of the E-step and the M-step. First, the E-step is discussed. The posterior probability estimated in the E-step can be expanded using the Bayes rule:

















$$P(G_m \mid X_i, w_j) = \frac{P(G_m) \cdot P(X_i, w_j \mid G_m)}{P(X_i, w_j)} \qquad (13.9)$$







The numerator of the right-hand side of the aforementioned equation can be expanded using Eq. 13.7, and the denominator can be expanded using Eq. 13.8:




















$$P(G_m \mid X_i, w_j) = \frac{P(G_m) \cdot P(X_i \mid G_m) \cdot P(w_j \mid G_m)}{\sum_{r=1}^{k} P(G_r) \cdot P(X_i \mid G_r) \cdot P(w_j \mid G_r)} \qquad (13.10)$$







This shows that the E-step can be implemented in terms of the estimated values P (Gm), P (Xi|Gm), and P (wj |Gm).
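For illustration, a minimal NumPy sketch of this E-step (a hypothetical implementation of Eq. 13.10, not from the original text) is shown below; the arrays p_g, p_doc_given_g, and p_word_given_g have shapes (k,), (n, k), and (d, k), matching the earlier generative sketch.

```python
import numpy as np

def e_step(p_g, p_doc_given_g, p_word_given_g):
    """Return posterior[i, j, m] = P(G_m | X_i, w_j) according to Eq. 13.10."""
    # Numerator of Eq. 13.10: P(G_m) * P(X_i | G_m) * P(w_j | G_m), shape (n, d, k).
    joint = (p_g[None, None, :]
             * p_doc_given_g[:, None, :]
             * p_word_given_g[None, :, :])
    # Denominator: sum of the numerator over the k aspects for each (i, j) pair.
    return joint / joint.sum(axis=2, keepdims=True)
```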

It remains to show how these values can be estimated using the observed word-document co-occurrences in the M-step. The posterior probabilities P (Gm|Xi, wj ) may be viewed as weights attached to the word-document co-occurrence pairs for each aspect Gm. These weights can be leveraged to estimate the values P (Gm), P (Xi|Gm), and P (wj |Gm) for each aspect using maximization of the log-likelihood function. The details of the log-likelihood function and the differential calculus associated with the maximization process will not be discussed here. Rather, the final estimated values will be presented directly. Let f (Xi, wj ) represent the observed frequency of the occurrence of word wj in document Xi in the corpus. Then, the estimations in the M-step are as follows:





$$P(X_i \mid G_m) \propto \sum_{w_j} f(X_i, w_j) \cdot P(G_m \mid X_i, w_j) \quad \forall i \in \{1 \ldots n\},\ m \in \{1 \ldots k\} \qquad (13.11)$$

$$P(w_j \mid G_m) \propto \sum_{X_i} f(X_i, w_j) \cdot P(G_m \mid X_i, w_j) \quad \forall j \in \{1 \ldots d\},\ m \in \{1 \ldots k\} \qquad (13.12)$$

$$P(G_m) \propto \sum_{X_i} \sum_{w_j} f(X_i, w_j) \cdot P(G_m \mid X_i, w_j) \quad \forall m \in \{1 \ldots k\} \qquad (13.13)$$
















Each of these estimations may be scaled to a probability by ensuring that they sum to 1 over all the outcomes for that random variable. This scaling corresponds to the constant of proportionality associated with the "∝" notation in the aforementioned equations. Furthermore, these estimations can be used to decompose the original document-term matrix into a product of three matrices, which is very similar to SVD/LSA. This relationship will be explored in the next section.
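A matching M-step sketch, under the same hypothetical setup as the earlier snippets, is given below; it applies Eqs. 13.11–13.13 with the observed frequencies f (Xi, wj ) stored in the n × d matrix D, and then scales each estimate so that it sums to 1 over its outcomes, as described above.

```python
import numpy as np

def m_step(D, posterior):
    """D is the n x d frequency matrix; posterior[i, j, m] = P(G_m | X_i, w_j)."""
    weighted = D[:, :, None] * posterior           # f(X_i, w_j) * P(G_m | X_i, w_j)
    p_doc_given_g = weighted.sum(axis=1)           # Eq. 13.11: sum over words, shape (n, k)
    p_word_given_g = weighted.sum(axis=0)          # Eq. 13.12: sum over documents, shape (d, k)
    p_g = weighted.sum(axis=(0, 1))                # Eq. 13.13: sum over all pairs, shape (k,)
    # Scale each estimate to a probability (the constant of proportionality).
    p_doc_given_g /= p_doc_given_g.sum(axis=0, keepdims=True)
    p_word_given_g /= p_word_given_g.sum(axis=0, keepdims=True)
    p_g /= p_g.sum()
    return p_g, p_doc_given_g, p_word_given_g

# Usage sketch: alternate the E-step and M-step until the estimates stabilize.
# for _ in range(100):
#     posterior = e_step(p_g, p_doc_given_g, p_word_given_g)
#     p_g, p_doc_given_g, p_word_given_g = m_step(D, posterior)
```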


13.4.1 Use in Dimensionality Reduction and Comparison with Latent Semantic Analysis

The three key sets of parameters estimated in the M-step are P (Xi|Gm), P (wj |Gm), and P (Gm), respectively. These sets of parameters provide an SVD-like matrix factorization of the n × d document-term matrix D. Assume that the document-term matrix D is scaled by a constant to sum to an aggregate probability value of 1. Therefore, the (i, j)th entry of D can be viewed as an observed instantiation of the probabilistic quantity P (Xi, wj ). Let Qk be the n × k matrix, for which the (i, m)th entry is P (Xi|Gm), let Σk be the k × k diagonal matrix for which the mth diagonal entry is P (Gm), and let Pk be the d × k matrix for which the (j, m)th entry is P (wj |Gm). Then, the (i, j)th entry P (Xi, wj ) of the matrix D can be expressed in terms of the entries of the aforementioned matrices according to Eq. 13.8, which is replicated here:








$$P(X_i, w_j) = \sum_{m=1}^{k} P(G_m) \cdot P(X_i \mid G_m) \cdot P(w_j \mid G_m) \qquad (13.14)$$

The LHS of the equation is equal to the (i, j)th entry of D, whereas the RHS of the equation is the (i, j)th entry of the matrix product QkΣkPkT. Depending on the number of components k, this product can only approximate the matrix D, and the approximation is denoted by Dk. By stacking up the n × d conditions of Eq. 13.14, the following matrix condition is obtained:



$$D_k = Q_k \Sigma_k P_k^{T} \qquad (13.15)$$




It is instructive to note that the matrix decomposition in Eq. 13.15 is similar to that in SVD/LSA (cf. Eq. 2.12 of Chap. 2). Therefore, as in LSA, Dk is an approximation of the document-term matrix D, and the transformed representation in k-dimensional space is given by QkΣk. However, the transformed representations will be different in PLSA and LSA. This is because different objective functions are optimized in the two cases. LSA minimizes the mean-squared error of the approximation, whereas PLSA maximizes the log-likelihood fit to a probabilistic generative model. One advantage of PLSA is that the entries of Qk and Pk and the transformed coordinate values are nonnegative and have clear probabilistic interpretability.
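Continuing the hypothetical sketches above, the factorization of Eq. 13.15 and the reduced k-dimensional representation QkΣk can be assembled directly from the EM parameter estimates, for example:

```python
import numpy as np

def plsa_factorization(p_g, p_doc_given_g, p_word_given_g):
    """Assemble the factorization of Eq. 13.15 from the EM parameter estimates."""
    Q_k = p_doc_given_g            # (i, m)th entry is P(X_i | G_m), shape (n, k)
    Sigma_k = np.diag(p_g)         # mth diagonal entry is P(G_m), shape (k, k)
    P_k = p_word_given_g           # (j, m)th entry is P(w_j | G_m), shape (d, k)
    D_k = Q_k @ Sigma_k @ P_k.T    # approximates D scaled to aggregate probability 1
    reduced_docs = Q_k @ Sigma_k   # nonnegative k-dimensional document representation
    return D_k, reduced_docs
```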







