
Figure 13.1: Illustrating row and column reordering in co-clustering. (a) Document-term matrix; (b) re-arranged document-term matrix. [Figure omitted: a 6 × 6 matrix of word frequencies for documents D1–D6 over the words CHAMPION, ELECTRON, TROPHY, RELATIVITY, QUANTUM, and TOURNAMENT, shown before and after reordering; after reordering, the SPORTS and PHYSICS co-clusters appear as two shaded blocks.]


To illustrate this point, a toy example² of a 6 × 6 document-word matrix is illustrated in Fig. 13.1a. The entries in the matrix correspond to the word frequencies in six documents denoted by D1 . . . D6. The six words in this case are champion, electron, trophy, relativity, quantum, and tournament. Some of these words are drawn from sports-related topics, whereas others are drawn from science-related topics. The documents {D1, D4, D6} contain mostly sports-related words, whereas the documents {D2, D3, D5} contain mostly science-related words. However, this is not evident from the seemingly random arrangement of the nonzero entries in Fig. 13.1a. On the other hand, if the rows and columns are permuted so that all the sports-related rows and columns occur before all the science-related rows and columns, the resulting matrix is shown in Fig. 13.1b. In this case, there is a clear block structure to the entries, in which disjoint rectangular blocks contain most of the nonzero entries. These rectangular blocks are shaded in Fig. 13.1b. The goal of co-clustering is to discover such a reordering, so as to minimize the weight of the nonzero entries outside the shaded blocks.
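To see the effect of such a reordering concretely, the short sketch below permutes the rows and columns of a small document-word matrix so that the sports-related documents and words appear first. The frequencies are illustrative placeholders in the spirit of Fig. 13.1, not the exact entries of the figure.

```python
import numpy as np

# Illustrative word frequencies (not the exact entries of Fig. 13.1a).
# Rows: D1..D6; columns: champion, electron, trophy, relativity, quantum, tournament.
D = np.array([
    [2, 0, 1, 0, 0, 3],   # D1 (sports-related)
    [0, 2, 0, 1, 3, 0],   # D2 (science-related)
    [1, 3, 0, 1, 2, 0],   # D3 (science-related)
    [2, 0, 2, 0, 0, 1],   # D4 (sports-related)
    [0, 2, 0, 2, 3, 0],   # D5 (science-related)
    [1, 0, 3, 0, 0, 2],   # D6 (sports-related)
])

rows = [0, 3, 5, 1, 2, 4]   # D1, D4, D6 first, then D2, D3, D5
cols = [0, 2, 5, 1, 3, 4]   # champion, trophy, tournament, then the science words

# The permuted matrix exhibits the two dense diagonal blocks of Fig. 13.1b.
print(D[np.ix_(rows, cols)])
```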

How, then, can the co-clustering problem be solved? The simplest solution is to convert the problem to a bipartite graph partitioning problem, so that the aggregate weight of the nonzero entries in the nonshaded regions is equal to the aggregate weight of the edges across the partitions. A node set Nd is created, in which each node represents a document in the collection. A node set Nw is created, in which each node represents a word in the collection. An undirected bipartite graph G = (Nd ∪ Nw, A) is created, such that an edge (i, j) in A corresponds to a nonzero entry in the matrix, where i ∈ Nd and j ∈ Nw. The weight of an edge is equal to the frequency of the term in the document. The bipartite graph for the co-clustering example of Fig. 13.1 is illustrated in Fig. 13.2. A partitioning of this graph represents a simultaneous partitioning of the rows and columns. In this case, a 2-way partitioning has been illustrated for simplicity, although a k-way partitioning could be constructed in general. Note that each partition contains a set of documents and a corresponding set of words. It is easy to see that the corresponding documents and words in each graph partition of Fig. 13.2 represent the shaded areas in Fig. 13.1b.








² While the document-term matrix is square in this specific toy example, this might not be the case in general because the corpus size n and the lexicon size d are generally different.








Figure 13.2: Graph partitioning for co-clustering. [Figure omitted: the bipartite graph connecting documents D1–D6 to the six words, with a 2-way cut separating the SPORTS co-cluster from the PHYSICS co-cluster.]


It is also easy to see that the weight of edges across the partition represents the weight of the nonzero entries outside the shaded blocks of Fig. 13.1b. Therefore, a k-way co-clustering problem can be converted to a k-way graph partitioning problem. The overall co-clustering approach may be described as follows:



  1. Create a graph G = (Nd ∪ Nw, A) with nodes in Nd representing documents, nodes in Nw representing words, and edges in A with weights representing nonzero entries in matrix D.




  2. Use a k-way graph partitioning algorithm to partition the nodes in Nd ∪ Nw into k groups.




  3. Report row–column pairs (Ri, Vi) for i ∈ {1 . . . k}. Here, Ri represents the rows corresponding to nodes in Nd for the ith cluster, and Vi represents the columns corresponding to the nodes in Nw for the ith cluster.

It remains to be explained how the k-way graph partitioning may be performed. The problem of graph partitioning is addressed in Sect. 19.3 of Chap. 19. Any of these algorithms may be leveraged to determine the required graph partitions. Specialized methods for bipartite graph partitioning are also discussed in the bibliographic notes.
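For example, the three steps above can be realized with a simple spectral 2-way cut of the bipartite graph. The sketch below is a minimal illustration, not one of the specialized algorithms of Chap. 19: it builds the bipartite adjacency matrix, splits the nodes by the sign of the Fiedler vector of the graph Laplacian, and reports the resulting document–word co-clusters. The matrix D is the same illustrative toy matrix used earlier.

```python
import numpy as np

# Same illustrative document-term matrix as in the earlier sketch.
D = np.array([
    [2, 0, 1, 0, 0, 3], [0, 2, 0, 1, 3, 0], [1, 3, 0, 1, 2, 0],
    [2, 0, 2, 0, 0, 1], [0, 2, 0, 2, 3, 0], [1, 0, 3, 0, 0, 2],
])
words = ["champion", "electron", "trophy", "relativity", "quantum", "tournament"]
n, d = D.shape

# Step 1: adjacency matrix of G = (Nd ∪ Nw, A); documents come first, then words.
A = np.zeros((n + d, n + d))
A[:n, n:] = D
A[n:, :n] = D.T

# Step 2: 2-way spectral partition via the Fiedler vector of the Laplacian L = Deg - A.
L = np.diag(A.sum(axis=1)) - A
_, eigvecs = np.linalg.eigh(L)
side = eigvecs[:, 1] >= 0          # sign of the second-smallest eigenvector

# Step 3: report the row-column pair (R_i, V_i) of each partition.
for label in (True, False):
    R = [f"D{i + 1}" for i in range(n) if side[i] == label]
    V = [words[j] for j in range(d) if side[n + j] == label]
    print(R, V)
```

Off-the-shelf implementations of bipartite spectral co-clustering (for example, scikit-learn's SpectralCoclustering class) follow the same general template, operating directly on the document-term matrix.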


13.4 Topic Modeling


Topic modeling can be viewed as a probabilistic version of the latent semantic analysis (LSA) method, and the most basic version of the approach is referred to as Probabilistic Latent Semantic Analysis (PLSA). It provides an alternative method for performing dimensionality reduction and has several advantages over traditional LSA.


Probabilistic latent semantic analysis is an expectation-maximization-based mixture modeling algorithm. However, the way in which the EM algorithm is used is different from the other examples of the EM algorithm in this book. This is because the underlying generative process is different, and is optimized toward discovering the correlation structure of the words rather than the clustering structure of the documents. This is because the approach
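As a rough illustration of how EM is typically applied in this setting, the following sketch fits a PLSA-style aspect model to a document-term count matrix. It assumes the standard symmetric parameterization, in which a hidden aspect is selected first and the document and word are then generated conditionally on it; the function name, initialization, and iteration count are illustrative choices, not prescribed by the text.

```python
import numpy as np

def plsa(D, k, iters=100, seed=0):
    """Fit a k-aspect PLSA-style model to an n x d document-term count matrix D via EM."""
    rng = np.random.default_rng(seed)
    n, d = D.shape

    # Random initialization of P(z), P(doc | z), and P(word | z).
    p_z = np.full(k, 1.0 / k)
    p_d_z = rng.random((k, n)); p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_w_z = rng.random((k, d)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(iters):
        # E-step: posterior P(z | doc, word) for every document-word pair (k x n x d).
        post = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]
        post /= post.sum(axis=0, keepdims=True) + 1e-12

        # M-step: re-estimate the parameters from the expected counts.
        expected = D[None, :, :] * post
        p_d_z = expected.sum(axis=2)
        p_d_z /= p_d_z.sum(axis=1, keepdims=True) + 1e-12
        p_w_z = expected.sum(axis=1)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z = expected.sum(axis=(1, 2))
        p_z /= p_z.sum()

    return p_z, p_d_z, p_w_z
```

On the toy matrix used in the earlier sketches, calling plsa(D, k=2) would typically recover one sports-oriented and one science-oriented aspect, each with its own word distribution.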




