
Figure 13.1: Illustrating row and column reordering in co-clustering. (a) Document-term matrix; (b) re-arranged document-term matrix. [Figure omitted: a 6 × 6 matrix of word frequencies for documents D1–D6 over the words CHAMPION, ELECTRON, TROPHY, RELATIVITY, QUANTUM, and TOURNAMENT, shown before and after reordering; after reordering, the SPORTS and PHYSICS co-clusters appear as two shaded blocks.]


To illustrate this point, a toy example² of a 6 × 6 document-word matrix is illustrated in Fig. 13.1a. The entries in the matrix correspond to the word frequencies in six documents denoted by D1 . . . D6. The six words in this case are champion, electron, trophy, relativity, quantum, and tournament. Some of these words are drawn from sports-related topics, whereas others are drawn from science-related topics. The documents {D1, D4, D6} contain mostly sports-related words, whereas the documents {D2, D3, D5} contain mostly science-related words. However, this is not evident from the seemingly random arrangement of the nonzero entries in Fig. 13.1a. On the other hand, if the rows and columns are permuted so that all the sports-related rows and columns occur before all the science-related rows and columns, the resulting matrix is shown in Fig. 13.1b. In this case, there is a clear block structure to the entries, in which disjoint rectangular blocks contain most of the nonzero entries. These rectangular blocks are shaded in Fig. 13.1b. The goal of co-clustering is to discover such a reordering, so as to minimize the weight of the nonzero entries outside the shaded blocks.
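To see the effect of such a reordering concretely, the short sketch below permutes the rows and columns of a small document-word matrix so that the sports-related documents and words appear first. The frequencies are illustrative placeholders in the spirit of Fig. 13.1, not the exact entries of the figure.

```python
import numpy as np

# Illustrative word frequencies (not the exact entries of Fig. 13.1a).
# Rows: D1..D6; columns: champion, electron, trophy, relativity, quantum, tournament.
D = np.array([
    [2, 0, 1, 0, 0, 3],   # D1 (sports-related)
    [0, 2, 0, 1, 3, 0],   # D2 (science-related)
    [1, 3, 0, 1, 2, 0],   # D3 (science-related)
    [2, 0, 2, 0, 0, 1],   # D4 (sports-related)
    [0, 2, 0, 2, 3, 0],   # D5 (science-related)
    [1, 0, 3, 0, 0, 2],   # D6 (sports-related)
])

rows = [0, 3, 5, 1, 2, 4]   # D1, D4, D6 first, then D2, D3, D5
cols = [0, 2, 5, 1, 3, 4]   # champion, trophy, tournament, then the science words

# The permuted matrix exhibits the two dense diagonal blocks of Fig. 13.1b.
print(D[np.ix_(rows, cols)])
```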

How, then, can the co-clustering problem be solved? The simplest solution is to convert the problem to a bipartite graph partitioning problem, so that the aggregate weight of the nonzero entries in the nonshaded regions is equal to the aggregate weight of the edges across the partitions. A node set Nd is created, in which each node represents a document in the collection. A node set Nw is created, in which each node represents a word in the collection. An undirected bipartite graph G = (Nd ∪ Nw, A) is created, such that an edge (i, j) in A corresponds to a nonzero entry in the matrix, where i ∈ Nd and j ∈ Nw. The weight of an edge is equal to the frequency of the term in the document. The bipartite graph for the co-clustering example of Fig. 13.1 is illustrated in Fig. 13.2. A partitioning of this graph represents a simultaneous partitioning of the rows and columns. In this case, a 2-way partitioning has been illustrated for simplicity, although a k-way partitioning could be constructed in general. Note that each partition contains a set of documents and a corresponding set of words. It is easy to see that the corresponding documents and words in each graph partition of Fig. 13.2 represent the shaded areas in Fig. 13.1b.








² While the document-term matrix is square in this specific toy example, this might not be the case in general because the corpus size n and the lexicon size d are generally different.








Figure 13.2: Graph partitioning for co-clustering. [Figure omitted: the bipartite graph connecting documents D1–D6 to the six words, with a 2-way cut separating the SPORTS co-cluster from the PHYSICS co-cluster.]


It is also easy to see that the weight of edges across the partition represents the weight of the nonzero entries outside the shaded blocks of Fig. 13.1b. Therefore, a k-way co-clustering problem can be converted to a k-way graph partitioning problem. The overall co-clustering approach may be described as follows:



  1. Create a graph G = (Nd ∪ Nw, A) with nodes in Nd representing documents, nodes in Nw representing words, and edges in A with weights representing nonzero entries in matrix D.




  2. Use a k-way graph partitioning algorithm to partition the nodes in Nd ∪ Nw into k groups.




  3. Report row–column pairs (Ri, Vi) for i ∈ {1 . . . k}. Here, Ri represents the rows corresponding to nodes in Nd for the ith cluster, and Vi represents the columns corresponding to the nodes in Nw for the ith cluster.

It remains to be explained how the k-way graph partitioning may be performed. The problem of graph partitioning is addressed in Sect. 19.3 of Chap. 19. Any of these algorithms may be leveraged to determine the required graph partitions. Specialized methods for bipartite graph partitioning are also discussed in the bibliographic notes.
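For example, the three steps above can be realized with a simple spectral 2-way cut of the bipartite graph. The sketch below is a minimal illustration, not one of the specialized algorithms of Chap. 19: it builds the bipartite adjacency matrix, splits the nodes by the sign of the Fiedler vector of the graph Laplacian, and reports the resulting document–word co-clusters. The matrix D is the same illustrative toy matrix used earlier.

```python
import numpy as np

# Same illustrative document-term matrix as in the earlier sketch.
D = np.array([
    [2, 0, 1, 0, 0, 3], [0, 2, 0, 1, 3, 0], [1, 3, 0, 1, 2, 0],
    [2, 0, 2, 0, 0, 1], [0, 2, 0, 2, 3, 0], [1, 0, 3, 0, 0, 2],
])
words = ["champion", "electron", "trophy", "relativity", "quantum", "tournament"]
n, d = D.shape

# Step 1: adjacency matrix of G = (Nd ∪ Nw, A); documents come first, then words.
A = np.zeros((n + d, n + d))
A[:n, n:] = D
A[n:, :n] = D.T

# Step 2: 2-way spectral partition via the Fiedler vector of the Laplacian L = Deg - A.
L = np.diag(A.sum(axis=1)) - A
_, eigvecs = np.linalg.eigh(L)
side = eigvecs[:, 1] >= 0          # sign of the second-smallest eigenvector

# Step 3: report the row-column pair (R_i, V_i) of each partition.
for label in (True, False):
    R = [f"D{i + 1}" for i in range(n) if side[i] == label]
    V = [words[j] for j in range(d) if side[n + j] == label]
    print(R, V)
```

Off-the-shelf implementations of bipartite spectral co-clustering (for example, scikit-learn's SpectralCoclustering class) follow the same general template, operating directly on the document-term matrix.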


13.4 Topic Modeling


Topic modeling can be viewed as a probabilistic version of the latent semantic analysis (LSA) method, and the most basic version of the approach is referred to as Probabilistic Latent Semantic Analysis (PLSA). It provides an alternative method for performing dimensionality reduction and has several advantages over traditional LSA.


Probabilistic latent semantic analysis is an expectation-maximization-based mixture modeling algorithm. However, the way in which the EM algorithm is used is different from the other examples of the EM algorithm in this book. This is because the underlying generative process is different, and is optimized toward discovering the correlation structure of the words rather than the clustering structure of the documents. This is because the approach
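As a rough illustration of how EM is typically applied in this setting, the following sketch fits a PLSA-style aspect model to a document-term count matrix. It assumes the standard symmetric parameterization, in which a hidden aspect is selected first and the document and word are then generated conditionally on it; the function name, initialization, and iteration count are illustrative choices, not prescribed by the text.

```python
import numpy as np

def plsa(D, k, iters=100, seed=0):
    """Fit a k-aspect PLSA-style model to an n x d document-term count matrix D via EM."""
    rng = np.random.default_rng(seed)
    n, d = D.shape

    # Random initialization of P(z), P(doc | z), and P(word | z).
    p_z = np.full(k, 1.0 / k)
    p_d_z = rng.random((k, n)); p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_w_z = rng.random((k, d)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(iters):
        # E-step: posterior P(z | doc, word) for every document-word pair (k x n x d).
        post = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]
        post /= post.sum(axis=0, keepdims=True) + 1e-12

        # M-step: re-estimate the parameters from the expected counts.
        expected = D[None, :, :] * post
        p_d_z = expected.sum(axis=2)
        p_d_z /= p_d_z.sum(axis=1, keepdims=True) + 1e-12
        p_w_z = expected.sum(axis=1)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z = expected.sum(axis=(1, 2))
        p_z /= p_z.sum()

    return p_z, p_d_z, p_w_z
```

On the toy matrix used in the earlier sketches, calling plsa(D, k=2) would typically recover one sports-oriented and one science-oriented aspect, each with its own word distribution.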




