Figure 13.1: Illustrating row and column reordering in co-clustering. (a) Document-term matrix; (b) Re-arranged document-term matrix.
To illustrate this point, a toy example of a 6 × 6 document-word matrix is shown in Fig. 13.1a. The entries in the matrix correspond to the word frequencies in six documents denoted by D1 . . . D6. The six words in this case are champion, electron, trophy, relativity, quantum, and tournament. It is easy to see that some of the words are from sports-related topics, whereas the other words are from science-related topics. Note that the nonzero entries in the matrix of Fig. 13.1a seem to be arranged randomly. The documents {D1, D4, D6} contain words relating to sports topics, whereas the documents {D2, D3, D5} contain words relating to scientific topics. However, this is not evident from the random distribution of the entries in Fig. 13.1a. On the other hand, if the rows and columns are permuted so that all the sports-related rows/columns occur before all the science-related rows/columns, the resulting matrix is that shown in Fig. 13.1b. In this case, there is a clear block structure to the entries, in which disjoint rectangular blocks contain most of the nonzero entries. These rectangular blocks are shaded in Fig. 13.1b. The goal is to minimize the weight of the nonzero entries outside these shaded blocks.
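For concreteness, the following sketch reproduces this reordering in Python. The frequency values themselves are hypothetical (the actual entries of Fig. 13.1 are not reproduced here); only the qualitative pattern described above is assumed, with the sports words concentrated in D1, D4, and D6 and the science words in D2, D3, and D5.

```python
import numpy as np

# Hypothetical 6 x 6 document-word frequency matrix in the spirit of Fig. 13.1a.
# Columns: champion, electron, trophy, relativity, quantum, tournament
# Rows:    D1 ... D6 (sports terms in D1, D4, D6; science terms in D2, D3, D5).
words = ["champion", "electron", "trophy", "relativity", "quantum", "tournament"]
docs = ["D1", "D2", "D3", "D4", "D5", "D6"]
X = np.array([
    [2, 0, 1, 0, 0, 3],   # D1: sports
    [0, 3, 0, 1, 2, 0],   # D2: science
    [0, 1, 0, 2, 1, 0],   # D3: science
    [1, 0, 2, 0, 0, 1],   # D4: sports
    [0, 2, 0, 1, 3, 0],   # D5: science
    [3, 0, 1, 0, 0, 2],   # D6: sports
])

# Permute rows and columns so that sports documents/words precede science ones,
# exposing the block structure of Fig. 13.1b.
row_order = [0, 3, 5, 1, 2, 4]   # D1, D4, D6, D2, D3, D5
col_order = [0, 2, 5, 1, 3, 4]   # champion, trophy, tournament, electron, relativity, quantum
X_reordered = X[np.ix_(row_order, col_order)]

print([docs[i] for i in row_order])
print([words[j] for j in col_order])
print(X_reordered)
```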
How, then, can the co-clustering problem be solved? The simplest solution is to convert the problem to a bipartite graph partitioning problem, so that the aggregate weight of the nonzero entries in the nonshaded regions is equal to the aggregate weight of the edges across the partitions. A node set Nd is created, in which each node represents a document in the collection. A node set Nw is created, in which each node represents a word in the collection. An undirected bipartite graph G = (Nd ∪ Nw, A) is created, such that an edge (i, j) in A corresponds to a nonzero entry in the matrix, where i ∈ Nd and j ∈ Nw. The weight of an edge is equal to the frequency of the term in the document. The bipartite graph for the co-clustering of Fig. 13.1 is illustrated in Fig. 13.2. A partitioning of this graph represents a simultaneous partitioning of the rows and columns. In this case, a 2-way partitioning has been illustrated for simplicity, although a k-way partitioning could be constructed in general. Note that each partition contains a set of documents and a corresponding set of words. It is easy to see that the corresponding documents and words in each graph partition of Fig. 13.2 represent the shaded areas in Fig. 13.1b. It is also easy to see that the weight of the edges across the partitions is equal to the aggregate weight of the nonzero entries outside the shaded blocks of Fig. 13.1b. Therefore, a good co-clustering can be obtained by finding a graph partition with small cut weight.
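As one concrete instantiation of this reduction, the sketch below uses scikit-learn's SpectralCoclustering estimator, which implements a spectral approximation to bipartite graph partitioning, to obtain a simultaneous 2-way partition of the rows and columns, and then computes the aggregate weight of the nonzero entries falling outside the resulting diagonal blocks (the cut weight). The matrix X is assumed to be the document-word matrix from the previous sketch, and the spectral method is a heuristic rather than an exact minimum-cut solver.

```python
from sklearn.cluster import SpectralCoclustering

# Simultaneous 2-way partition of documents (rows) and words (columns),
# obtained by spectral partitioning of the bipartite document-word graph.
model = SpectralCoclustering(n_clusters=2, random_state=0)
model.fit(X)                       # X: document-word matrix from the previous sketch
row_part = model.row_labels_       # co-cluster label of each document
col_part = model.column_labels_    # co-cluster label of each word

# Cut weight: total frequency of entries whose document and word lie in
# different partitions, i.e., entries outside the shaded blocks of Fig. 13.1b.
cut_weight = X[row_part[:, None] != col_part[None, :]].sum()
print(row_part, col_part, cut_weight)
```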
Topic modeling can be viewed as a probabilistic version of the latent semantic analysis (LSA) method, and the most basic version of the approach is referred to as Probabilistic Latent Semantic Analysis (PLSA). It provides an alternative method for performing dimensionality reduction and has several advantages over traditional LSA.
Probabilistic latent semantic analysis is an expectation-maximization-based mixture modeling algorithm. However, the way in which the EM algorithm is used differs from the other examples of the EM algorithm in this book, because the underlying generative process is different: it is optimized for discovering the correlation structure of the words rather than the clustering structure of the documents. In other words, the approach is best viewed as a probabilistic analog of LSA and dimensionality reduction rather than of clustering, although the resulting reduced representation can also be used for clustering.
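As a minimal sketch of how the EM iterations of PLSA can be organized, the following code estimates P(z|d) and P(w|z) from a document-term count matrix. The function name, the dense-array representation, and the fixed iteration count are illustrative choices rather than part of the original formulation; a practical implementation would use sparse matrices and monitor the log-likelihood for convergence.

```python
import numpy as np

def plsa(X, n_topics, n_iters=100, seed=0):
    """Minimal PLSA via EM on a document-term count matrix X (n_docs x n_words)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = X.shape

    # Random initialization of P(z|d) and P(w|z), normalized along the topic/word axes.
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # E-step: responsibilities P(z|d,w) for every (document, word) pair.
        # Dense (n_docs, n_words, n_topics) array; suitable only for small matrices.
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        joint /= joint.sum(axis=2, keepdims=True) + 1e-12

        # M-step: re-estimate P(w|z) and P(z|d) from the expected counts n(d,w) * P(z|d,w).
        weighted = X[:, :, None] * joint
        p_w_z = weighted.sum(axis=0).T
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=1)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12

    return p_z_d, p_w_z
```

Applied to the toy matrix of Fig. 13.1 with two topics, one would expect the two rows of P(w|z) to concentrate on the sports and science vocabularies, respectively, illustrating how PLSA recovers word correlation structure rather than an explicit document partition.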