192 CHAPTER 6. CLUSTER ANALYSIS
Figure 6.22: An example of non-negative matrix factorization
For each row Xi of D (document vector), and each k -dimensional row Yi of U (transformed document vector), the aforementioned equation can be rewritten as follows:
This is exactly in the same form as any standard dimensionality reduction method, where the columns of V provide the basis space and row-vector Yi represents the reduced coordinates. In other words, the document vector Xi can be rewritten as an approximate (non-negative) linear combination of the k basis vectors. The value of k is typically small compared to the full dimensionality because the column vectors of V discover the latent structure in the data. Furthermore, the non-negativity of the matrices U and V ensures that the documents are expressed as a non-negative combination of the key concepts (or, clustered regions) in the term-based feature space.
An example of NMF for a toy 6 × 6 document-term matrix D is illustrated in Fig. 6.22. The rows correspond to 6 documents {X1 . . . X6} and the 6 words correspond to columns. The matrix entries correspond to word frequencies in the documents. The documents {X1, X2, X3} are related to cats, the documents {X5, X6} are related to cars, and the document X4 is related to both. Thus, there are two natural clusters in the data, and the matrix is correspondingly factorized into two matrices U and V T with rank k = 2. An approximately optimal factorization, with each entry rounded to the nearest integer, is illustrated in Fig. 6.22. Note that most of the entries in the factorized matrices will not be exactly 0 in a real-world example, but many of them might be close to 0, and almost all will be non-integer values. It is evident that the columns and rows, respectively, of U and V map to either the car or the cat cluster in the data. The 6 × 2 matrix U provides information about the relationships of 6 documents to 2 clusters, whereas the 6 × 2 matrix
provides information about the corresponding relationships of 6 words to 2 clusters. Each document can be assigned to the cluster for which it has the largest coordinate in U .
The rank-k matrix factorization U V T can be decomposed into k components by express-ing the matrix product in terms of the k columns Ui and Vi, respectively, of U and V :
Dostları ilə paylaş: |