Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	117/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 113 114 115 116 117 118 119 120 ... 423

1-Data Mining tarjima

D≈UV^T.

(6.30)

192 CHAPTER 6. CLUSTER ANALYSIS

Figure 6.22: An example of non-negative matrix factorization

For each row X_i of D (document vector), and each k -dimensional row Y_i of U (transformed document vector), the aforementioned equation can be rewritten as follows:

	≈		V ^T .	(6.31)
^Xi		^Yi

This is exactly in the same form as any standard dimensionality reduction method, where the columns of V provide the basis space and row-vector Y_i represents the reduced coordinates. In other words, the document vector X_i can be rewritten as an approximate (non-negative) linear combination of the k basis vectors. The value of k is typically small compared to the full dimensionality because the column vectors of V discover the latent structure in the data. Furthermore, the non-negativity of the matrices U and V ensures that the documents are expressed as a non-negative combination of the key concepts (or, clustered regions) in the term-based feature space.

An example of NMF for a toy 6 × 6 document-term matrix D is illustrated in Fig. 6.22. The rows correspond to 6 documents {X₁ . . . X₆} and the 6 words correspond to columns. The matrix entries correspond to word frequencies in the documents. The documents {X₁, X₂, X₃} are related to cats, the documents {X₅, X₆} are related to cars, and the document X₄ is related to both. Thus, there are two natural clusters in the data, and the matrix is correspondingly factorized into two matrices U and V T with rank k = 2. An approximately optimal factorization, with each entry rounded to the nearest integer, is illustrated in Fig. 6.22. Note that most of the entries in the factorized matrices will not be exactly 0 in a real-world example, but many of them might be close to 0, and almost all will be non-integer values. It is evident that the columns and rows, respectively, of U and V map to either the car or the cat cluster in the data. The 6 × 2 matrix U provides information about the relationships of 6 documents to 2 clusters, whereas the 6 × 2 matrix

provides information about the corresponding relationships of 6 words to 2 clusters. Each document can be assigned to the cluster for which it has the largest coordinate in U .

The rank-k matrix factorization U V T can be decomposed into k components by express-ing the matrix product in terms of the k columns U_i and V_i, respectively, of U and V :

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 113 114 115 116 117 118 119 120 ... 423