[Figure 18.6: Co-clustering of user-item graph — (a) co-cluster, (b) user-item graph]
user-based methods and item-based methods can be used to make predictions for the missing entries.
The co-clustering approach also has a nice interpretation in terms of the user-item graph. Let G = (Nu ∪ Ni, A) denote the preference graph, where Nu is the set of nodes representing users, and Ni is the set of nodes representing items. An undirected edge exists in A for each nonzero entry of the utility matrix. A co-clustering then corresponds to a clustering of this graph structure. The corresponding 2-way graph partition is illustrated in Fig. 18.6b. Because of this interpretation in terms of user-item graphs, the approach is able to exploit item-item and user-user similarity simultaneously. Co-clustering methods are also closely related to latent factor models, such as nonnegative matrix factorization, that simultaneously cluster rows and columns with the use of latent factors.
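The simultaneous clustering of users and items can be sketched with scikit-learn's `SpectralCoclustering`, which implements spectral co-clustering of the bipartite user-item graph. The utility matrix below is a hypothetical toy example constructed so that two user-item co-clusters are visible; zeros denote missing entries.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

# Hypothetical utility matrix: rows are users, columns are items;
# zero entries are unspecified (no edge in the user-item graph).
R = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 3, 4],
    [0, 0, 4, 5],
])

# Partition the bipartite user-item graph into two co-clusters.
model = SpectralCoclustering(n_clusters=2, random_state=0)
model.fit(R)

print("user labels:", model.row_labels_)
print("item labels:", model.column_labels_)
```

For this block-structured matrix, the first two users and first two items fall in one co-cluster, and the remaining users and items in the other, matching the 2-way partition illustrated in Fig. 18.6b.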
18.5.5 Latent Factor Models
The clustering methods discussed in the previous section use the aggregate properties of the data to make robust predictions. This can be achieved in a more effective way with latent factor models. This approach can be used either for ratings matrices or for positive preference utility matrices. Latent factor models have become increasingly popular in recent years. The key idea behind latent factor models is that many dimensionality reduction and matrix factorization methods summarize the correlations across rows and columns in the form of lower-dimensional vectors, or latent factors. Furthermore, collaborative filtering is essentially a missing data imputation problem, in which these correlations are used to make predictions. Therefore, these latent factors become hidden variables that encode the correlations in the data matrix in a concise way and can be used to make predictions. A robust estimation of the k dominant latent factors is often possible even from incompletely specified data, when the value of k is much less than d. This is because the concisely defined latent factors can be estimated accurately from the sparsely specified data matrix, as long as the number of specified entries is large enough.
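The use of latent factors for missing data imputation can be illustrated with a minimal sketch: a rank-k factorization R ≈ UVᵀ fitted by stochastic gradient descent over the observed entries only, after which the product UVᵀ supplies predictions for the missing ones. The ratings matrix, the choice k = 2, and the learning parameters below are all illustrative assumptions, not part of the original text.

```python
import numpy as np

# Hypothetical ratings matrix; np.nan marks missing entries.
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 1.0, 1.0],
    [1.0, 1.0, 5.0, np.nan],
    [np.nan, 1.0, 4.0, 5.0],
])
observed = ~np.isnan(R)
n, d = R.shape
k = 2  # number of latent factors; k is much less than d in realistic settings

rng = np.random.default_rng(0)
U = 0.1 * rng.standard_normal((n, k))  # k-dimensional user factors
V = 0.1 * rng.standard_normal((d, k))  # k-dimensional item factors

# Stochastic gradient descent on the squared error of the observed
# entries, with a small regularization term on the factors.
lr, reg = 0.01, 0.01
for _ in range(2000):
    for i, j in zip(*np.nonzero(observed)):
        err = R[i, j] - U[i] @ V[j]
        U[i] += lr * (err * V[j] - reg * U[i])
        V[j] += lr * (err * U[i] - reg * V[j])

# The factorization defines predictions for every entry, including
# the missing ones.
R_hat = U @ V.T
```

Because the factors are estimated only from the specified entries, the same procedure applies to sparsely specified matrices, which is the setting that makes latent factor models attractive for collaborative filtering.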
The n users are represented in terms of n corresponding k-dimensional factors, denoted by the vectors U1 . . . Un. The d items are represented by d corresponding k-dimensional