Data Mining: The Textbook




Figure 13.4: Matrix factorization of PLSA. (Figure: the n × d scaled document-term matrix D = [P(Xi, wj)], with the n documents as rows and the d words as columns, is factorized into the n × k matrix Qk = [P(Xi|Gm)], whose columns are the k dominant basis vectors of the inverted lists, a k × k diagonal matrix Σk containing the prior probabilities P(Gm) of the topics Gm, and the k × d matrix PkT = [P(wj|Gm)], whose rows correspond to the k dominant basis vectors of the documents.)












Figure 13.5: An example of PLSA (Revisiting Fig. 6.22 of Chap. 6). (Figure: a toy 6 × 6 document-term matrix D over the words LION, TIGER, CHEETAH, JAGUAR, PORSCHE, and FERRARI is factorized into a document-aspect matrix Qk, a diagonal matrix Σk of aspect priors, and an aspect-word matrix PkT for the two aspects CATS and CARS; the documents X1 to X3 are about cats, X5 and X6 are about cars, and X4 discusses both.)

An important advantage of PLSA over LSA is its probabilistic interpretability. By examining the probability values in each column of Pk, one can immediately infer the topical words of the corresponding aspect. This is not possible in LSA, where the entries of the corresponding matrix Pk have no clear probabilistic significance and may even be negative. Conversely, one advantage of LSA is that the transformation can be interpreted as a rotation of an orthonormal axis system: in LSA, the columns of Pk are a set of orthonormal vectors representing this rotated basis, which is not the case in PLSA. The orthogonality of the basis system in LSA enables straightforward projection of out-of-sample documents (i.e., documents not included in D) onto the new rotated axis system.
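To make the interpretability point concrete, the short Python sketch below (not from the book) prints the highest-probability words of each aspect from a word-topic matrix Pk with entries P(wj|Gm). The matrix, the vocabulary, and all probability values are hypothetical, chosen in the spirit of the cats/cars example of Fig. 13.5.

import numpy as np

def topical_words(P_k, vocabulary, top=3):
    # P_k is assumed to be a d x k matrix with entries P(w_j | G_m),
    # so each column sums to 1 over the vocabulary.
    d, k = P_k.shape
    for m in range(k):
        best = np.argsort(-P_k[:, m])[:top]      # indices of the largest P(w_j | G_m)
        summary = ", ".join(f"{vocabulary[j]} ({P_k[j, m]:.2f})" for j in best)
        print(f"Aspect G{m + 1}: {summary}")

# Hypothetical word-topic probabilities for a cats aspect and a cars aspect.
vocab = ["lion", "tiger", "cheetah", "jaguar", "porsche", "ferrari"]
P_k = np.array([[0.30, 0.00],
                [0.30, 0.00],
                [0.25, 0.00],
                [0.15, 0.20],
                [0.00, 0.40],
                [0.00, 0.40]])
topical_words(P_k, vocab)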


Interestingly, as in SVD/LSA, PLSA also reveals the latent properties of the transpose of the document matrix. Each row of PkΣk can be viewed as the transformed coordinates of the vertical, or inverted list, representation (rows of the transpose) of the document matrix D in the basis space defined by the columns of Qk. These complementary properties are illustrated in Fig. 13.4. PLSA can also be viewed as a kind of nonnegative matrix factorization method (cf. Sect. 6.8 of Chap. 6) in which the matrix elements are interpreted as probabilities, and the likelihood of a generative model is maximized rather than the Frobenius norm of the error matrix being minimized.
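Because the factorization is defined purely in terms of probabilities, it can be estimated with a standard EM procedure. The following is a minimal Python sketch of such an estimation, included only as an illustration and not as the book's algorithm statement; it assumes a small n × d matrix of raw word counts, materializes an n × d × k responsibility array, and is therefore suitable only for toy data. All names are illustrative.

import numpy as np

def plsa(counts, k, iters=100, seed=0):
    # Returns (prior, Q, P) with prior[m] = P(G_m), Q[i, m] = P(X_i | G_m),
    # and P[j, m] = P(w_j | G_m), so counts / counts.sum() is approximated
    # by Q @ np.diag(prior) @ P.T, mirroring the factorization of Fig. 13.4.
    rng = np.random.default_rng(seed)
    n, d = counts.shape
    prior = np.full(k, 1.0 / k)
    Q = rng.random((n, k))
    Q /= Q.sum(axis=0)                      # each column sums to 1 over documents
    P = rng.random((d, k))
    P /= P.sum(axis=0)                      # each column sums to 1 over words
    for _ in range(iters):
        # E-step: responsibility of each aspect for every (document, word) pair,
        # proportional to P(G_m) * P(X_i | G_m) * P(w_j | G_m).
        resp = prior[None, None, :] * Q[:, None, :] * P[None, :, :]
        resp /= resp.sum(axis=2, keepdims=True) + 1e-12
        # M-step: re-estimate the parameters from the expected counts.
        expected = counts[:, :, None] * resp
        Q = expected.sum(axis=1)
        Q /= Q.sum(axis=0, keepdims=True)
        P = expected.sum(axis=0)
        P /= P.sum(axis=0, keepdims=True)
        prior = expected.sum(axis=(0, 1))
        prior /= prior.sum()
    return prior, Q, P

# Toy 6 x 6 count matrix in the spirit of Fig. 13.5 (the values are made up).
counts = np.array([[2, 2, 1, 2, 0, 0],
                   [2, 3, 3, 3, 0, 0],
                   [1, 1, 1, 1, 0, 0],
                   [2, 2, 2, 3, 1, 1],
                   [0, 0, 0, 1, 1, 1],
                   [0, 0, 0, 2, 1, 2]], dtype=float)
prior, Q, P = plsa(counts, k=2)

In this sketch, prior, Q, and P play the roles of the diagonal of Σk, of Qk, and of Pk, respectively; maximizing the likelihood in this way is what replaces the Frobenius-norm objective of NMF.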


An approximately optimal PLSA matrix factorization of a toy 6 × 6 example, with 6 documents and 6 words, is illustrated in Fig. 13.5. This example is the same (see Fig. 6.22) as the one used for nonnegative matrix factorization (NMF) in Chap. 6. Note that the factorizations in the two cases are very similar, except that all basis vectors are normalized to sum to 1 in PLSA, and the dominance of the basis vectors is reflected in a separate diagonal matrix containing the prior probabilities. Although the factorization presented here for PLSA is shown as identical to that of NMF for ease of intuitive understanding, the two factorizations will usually be slightly different because of the difference in objective functions in the two cases; for this reason, the factorization presented here is approximately, but not exactly, optimal. Also, most of the entries in the factorized matrices will not be exactly 0 in a real example, although many of them might be quite small.


As in LSA, the problems of synonymy and polysemy are addressed by PLSA. For example, if an aspect G1 explains the topic of cats, then two documents X and Y containing the words "cat" and "kitten," respectively, will both have positive values of the transformed coordinate for aspect G1. Therefore, similarity computations between these documents will be improved in the transformed space. A word with multiple meanings (a polysemous word) may have positive components in different aspects. For example, a word such as "jaguar" can refer either to a cat or to a car. If G1 is an aspect that explains the topic of cats, and G2 is an aspect that explains the topic of cars, then both P("jaguar"|G1) and P("jaguar"|G2) may be highly positive. However, the other words in a document provide the context necessary to reinforce one of these two aspects. A document X that is mostly about cats will have a high value of P(X|G1), whereas a document Y that is mostly about cars will have a high value of P(Y|G2). This is reflected in the matrix Qk = [P(Xi|Gm)]n×k and the new transformed coordinate representation QkΣk. Therefore, the computations will also be robust in terms of adjusting for polysemy effects. In general, semantic concepts are amplified in the transformed representation QkΣk. Therefore, many data mining applications will perform more robustly on the n × k transformed representation QkΣk than on the original n × d document-term matrix.
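As a small illustration of the synonymy effect, the hypothetical Python fragment below compares the cosine similarity of two cat-related documents in the original word space, where they share no terms, and in an assumed transformed space QkΣk, where both load on the same aspect. All vectors and probability values are made up for the example.

import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Raw bag-of-words rows for two documents that use disjoint synonyms.
X = np.array([1.0, 0.0, 0.0])      # uses "cat"
Y = np.array([0.0, 1.0, 0.0])      # uses "kitten"
print(cosine(X, Y))                # 0.0 in the original space

# Assumed PLSA coordinates: rows of Q_k are P(X_i | G_m), and Sigma_k
# holds the aspect priors P(G_m) for a cats aspect and a cars aspect.
Q_k = np.array([[0.5, 0.0],
                [0.5, 0.0]])
Sigma_k = np.diag([0.6, 0.4])
transformed = Q_k @ Sigma_k        # the new coordinates Q_k Sigma_k
print(cosine(transformed[0], transformed[1]))   # 1.0: the synonymy is resolved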


13.4.2 Use in Clustering and Comparison with Probabilistic Clustering


The estimated parameters have intuitive interpretations in terms of clustering. In the Bayes model for clustering (Fig. 13.3a), the generative process is optimized for clustering documents, whereas the generative process in topic modeling (Fig. 13.3b) is optimized for discovering the latent semantic components. The latter can be shown to cluster document–word pairs, which is different from clustering documents. Therefore, although the same parameter sets P(wj|Gm) and P(Xi|Gm) are estimated in the two cases, qualitatively different results will be obtained. The model of Fig. 13.3a generates a document from a single hidden component (cluster), and the final soft clustering is a result of uncertainty in estimation from the observed data. On the other hand, in the probabilistic latent semantic model, different parts of the same document may be generated by different aspects, even at the generative modeling level. Thus, documents are not generated by individual mixture components, but by a combination of mixture components. In this sense, PLSA provides a more realistic model, because the diverse words of an unusual document discussing both cats and cars (see Fig. 13.5) can be generated by distinct aspects. In Bayes clustering, even though such a document is generated in its entirety by one of the mixture components, it may have similar assignment (posterior) probabilities with respect to two or more clusters because of estimation uncertainty. This difference arises because PLSA was originally intended as a data transformation and dimensionality reduction method, rather than as a clustering method. Nevertheless, good document clusters can usually be derived from PLSA as well. The value P(Gm|Xi) provides an assignment probability of the document Xi to the aspect (or "cluster") Gm and can be derived from the parameters estimated in the M-step using the Bayes rule as follows:
P(Gm|Xi) = [P(Gm) · P(Xi|Gm)] / [Σ(r=1 to k) P(Gr) · P(Xi|Gr)]          (13.16)

Thus, the PLSA approach can also be viewed as a soft clustering method that provides assignment probabilities of documents to clusters. In addition, the quantity P(wj|Gm), which is estimated in the M-step, provides information about the probabilistic affinity of different words to aspects (or topics). The terms with the highest probability values for a specific aspect Gm can be viewed as a cluster digest for that topic.
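A minimal sketch of this soft assignment, assuming a prior vector with entries P(Gm) and a matrix Q with entries P(Xi|Gm) from a fitted PLSA model (for example, the EM sketch given earlier), is shown below; the numerical values are hypothetical.

import numpy as np

def document_posteriors(prior, Q):
    # Apply Eq. 13.16 row by row: P(G_m | X_i) is proportional to P(G_m) * P(X_i | G_m).
    unnormalized = Q * prior[None, :]
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)

prior = np.array([0.6, 0.4])
Q = np.array([[0.5, 0.1],      # document mostly about the first aspect
              [0.3, 0.3],      # document related to both aspects
              [0.2, 0.6]])     # document mostly about the second aspect
print(document_posteriors(prior, Q))

Each row of the result sums to 1 and can be read directly as the soft cluster membership of the corresponding document.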


As the PLSA approach also provides a multidimensional n × k coordinate representation QkΣk of the documents, a different way of performing the clustering is to represent the documents in this new space and apply a k-means algorithm to the transformed corpus. Because the noise effects of synonymy and polysemy are reduced by PLSA, the k-means approach will generally be more effective on the reduced representation than on the original corpus.
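A brief sketch of this two-stage approach, using hypothetical reduced coordinates QkΣk and the k-means implementation from scikit-learn, is given below; the coordinate values are illustrative only.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical reduced n x k coordinates Q_k Sigma_k for six documents and
# two aspects, loosely in the spirit of Fig. 13.5.
reduced = np.array([[0.20, 0.00],
                    [0.18, 0.00],
                    [0.16, 0.01],
                    [0.10, 0.10],
                    [0.00, 0.19],
                    [0.00, 0.20]])

# Cluster in the k-dimensional topic space rather than the original
# d-dimensional word space.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
print(labels)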


13.4.3 Limitations of PLSA

Although PLSA is an intuitively sound probabilistic model, it has a number of practical drawbacks. The number of parameters grows linearly with the number of documents. Therefore, the approach can be slow and may overfit the training data because of the large number of estimated parameters. Furthermore, while PLSA provides a generative model of document–word pairs in the training data, it cannot easily assign probabilities to previously unseen documents. Most of the other EM mixture models discussed in this book, such as the probabilistic Bayes model, are much better at assigning probabilities to previously unseen documents. To address these issues, Latent Dirichlet Allocation (LDA) was developed. This model uses Dirichlet priors on the topics and generalizes relatively easily to new documents. In this sense, LDA is a fully generative model. The bibliographic notes contain pointers to this model.
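As a point of comparison, the scikit-learn sketch below fits an LDA model on a toy count matrix and then assigns topic proportions to a previously unseen document, which is exactly the step that is awkward for PLSA; the data values are made up for the illustration.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-term count matrix (rows are documents, columns are words).
counts = np.array([[3, 2, 0, 0],
                   [2, 3, 1, 0],
                   [0, 0, 3, 2],
                   [0, 1, 2, 3]])

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# A fitted LDA model can infer topic proportions for a new document
# without re-estimating document-specific parameters.
new_doc = np.array([[1, 2, 0, 0]])
print(lda.transform(new_doc))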


13.5 Specialized Classification Methods for Text

As in clustering, classification algorithms are affected by the nonnegative, sparse, and high-dimensional nature of text data. An important effect of sparsity is that the presence of a word in a document is more informative than the absence of the word. This observation has implications for classification methods, such as the Bernoulli model used for Bayes classification, that treat the presence and absence of a word in a symmetric way.


Popular techniques in the text domain include instance-based methods, the Bayes classifier, and the SVM classifier. The Bayes classifier is very popular because Web text is often combined with other types of features, such as URLs or side information, and it is relatively easy to incorporate these features into the Bayes classifier. The sparse, high-dimensional nature of text also necessitates the design of more refined multinomial Bayes models for the text domain. SVM classifiers are also extremely popular for text data because of their high accuracy. The major issue with the use of the SVM classifier is that the high-dimensional nature of text necessitates performance enhancements to such classifiers. In the following, some of these algorithms will be discussed.
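As a minimal illustration of two of these options, the scikit-learn sketch below trains a multinomial Bayes classifier on raw term counts and a linear SVM on tf-idf features; the tiny corpus and its labels are invented for the example and are not from the book.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["the lion and the tiger are big cats",
         "a cheetah is the fastest cat",
         "the porsche and the ferrari are fast cars",
         "a jaguar is a luxury car"]
labels = ["cats", "cats", "cars", "cars"]

# Multinomial Bayes on raw counts: only the words present in a document
# contribute to its class-conditional probability.
nb = make_pipeline(CountVectorizer(), MultinomialNB()).fit(texts, labels)

# Linear SVM on tf-idf features, a common high-accuracy choice for text.
svm = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(texts, labels)

print(nb.predict(["a tiger chased the cheetah"]))
print(svm.predict(["the jaguar is a fast car"]))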



