Data Mining: The Textbook




CHAPTER 18. MINING WEB DATA

[Figure 18.4 shows two 6 × 6 utility matrices over the users U1–U6 and the movies GLADIATOR, GODFATHER, BEN HUR, GOODFELLAS, SCARFACE, and SPARTACUS. In (a), the specified entries are ratings on a 1–5 scale; in (b), the specified entries are unary positive preferences. Both matrices are sparse: most entries are unspecified.]

(a) Ratings-based utility (b) Positive-preference utility

Figure 18.4: Examples of utility matrices.


the n × d utility matrix to exceed 10^5. The matrix is also extremely sparse. For example, in a movie data set, a typical user may have specified no more than 10 ratings, out of a universe of more than 10^5 movies.

At a basic level, collaborative filtering can be viewed as a missing-value estimation or matrix completion problem, in which an incomplete n × d utility matrix is specified, and it is desired to estimate the missing values. As discussed in the bibliographic notes, many methods exist in the traditional statistics literature on missing-value estimation. However, collaborative filtering problems present a particularly challenging special case in terms of data size and sparsity.
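Because so few entries are specified, an implementation would store only the known (user, item) ratings rather than the full n × d matrix. The following is a minimal sketch of such a sparse representation; the users, movies, and rating values are illustrative (they echo the movies of Figure 18.4 but are not its actual entries):

```python
# Sparse representation of an incomplete utility matrix: only the
# specified (user, item) ratings are stored, never the missing values.
ratings = {
    "U1": {"GLADIATOR": 1, "GODFATHER": 5},
    "U2": {"BEN HUR": 4, "SPARTACUS": 5},
    "U3": {"GODFATHER": 3, "SCARFACE": 3},
}

n_users = len(ratings)
items = {item for user_ratings in ratings.values() for item in user_ratings}
d_items = len(items)

# Fraction of the n x d matrix that is NOT specified (the sparsity).
specified = sum(len(r) for r in ratings.values())
sparsity = 1 - specified / (n_users * d_items)
print(f"{specified} of {n_users * d_items} entries specified; sparsity = {sparsity:.2f}")
```

Collaborative filtering then amounts to estimating the unspecified entries of this structure from the specified ones.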


18.5.1 Content-Based Recommendations

In content-based recommendations, the user is associated with a set of documents that describe his or her interests. Multiple documents may be associated with a user, corresponding to his or her specified demographic profile, interests specified at registration time, the product descriptions of items bought, and so on. These documents can then be aggregated into a single textual content-based profile of the user in a vector space representation.


The items are also associated with textual descriptions. When the textual description of an item matches the user profile, this can be viewed as an indicator of similarity. When no utility matrix is available, the content-based recommendation method uses a simple k-nearest neighbor approach: the top-k items that are closest to the user's textual profile are reported. The cosine similarity with tf-idf can be used, as discussed in Chap. 13.
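The matching step above can be sketched in pure Python. The tf-idf weighting here is the simple tf × log(n/df) variant, and the user profile and item descriptions are invented toy data, not drawn from any real system:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Compute tf-idf vectors (term -> weight) for a list of token lists."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequencies
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
        for doc in docs
    ]

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy data: a user profile plus three item descriptions (invented).
user_profile = "roman epic historical battle".split()
items = {
    "GLADIATOR": "roman epic gladiator battle arena".split(),
    "GODFATHER": "mafia crime family drama".split(),
    "BEN HUR":   "roman historical epic chariot".split(),
}

# The profile is included in the corpus so that idf is defined for its terms.
docs = [user_profile] + list(items.values())
vecs = tf_idf_vectors(docs)
profile_vec, item_vecs = vecs[0], dict(zip(items, vecs[1:]))

# Rank items by cosine similarity to the user profile; report the top-k.
top_2 = sorted(items, key=lambda i: cosine(profile_vec, item_vecs[i]), reverse=True)[:2]
print(top_2)
```

As expected, the two "roman epic" items match the profile, while GODFATHER, which shares no terms with it, has zero similarity.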


On the other hand, when a utility matrix is available, the problem of finding the most relevant items for a particular user can be viewed as a traditional classification problem. For each user, we have a set of training documents representing the descriptions of the items for which that user has specified utilities. The labels represent the utility values. The descriptions of the remaining items for that user can be viewed as the test documents for classification. When the utility matrix contains numeric ratings, the class variables are numeric. The regression methods discussed in Sect. 11.5 of Chap. 11 may be used in this case. Logistic and ordered probit regression are particularly popular. In cases where only positive preferences (rather than ratings) are available in the utility matrix, all the specified utility entries correspond to positive examples for the item. The classification is then performed only on the remaining test documents. One challenge is that only a small number of positive training examples are specified, and the remaining examples are unlabeled. In such cases, specialized classification methods that use only positive and unlabeled examples may be employed; refer to the bibliographic notes of Chap. 11. Content-based methods have the advantage that they do not even require a utility matrix, and they leverage domain-specific content information. On the other hand, content information biases the recommendation towards items described by keywords similar to those the user has seen in the past. Collaborative filtering methods work directly with the utility matrix and can therefore avoid such biases.
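The per-user supervised formulation can be illustrated with a small sketch. Note that a simple similarity-weighted nearest-neighbor regression stands in here for the regression methods of Sect. 11.5, and the item descriptions and ratings are invented:

```python
# Per-user view of content-based filtering as supervised learning:
# the user's rated item descriptions are training documents, the
# numeric ratings are labels, and unrated items are test documents.

def jaccard(a, b):
    """Token-overlap similarity between two descriptions."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Training data for one user: (description tokens, rating on a 1-5 scale).
train = [
    ("roman epic battle".split(), 5),
    ("roman historical chariot".split(), 4),
    ("mafia crime drama".split(), 1),
]

def predict(test_tokens):
    """Similarity-weighted average of the user's known ratings."""
    sims = [(jaccard(test_tokens, doc), rating) for doc, rating in train]
    total = sum(s for s, _ in sims)
    if total == 0:          # no overlap with any training item: fall back to the mean
        return sum(r for _, r in train) / len(train)
    return sum(s * r for s, r in sims) / total

print(round(predict("roman epic gladiator".split()), 2))
```

A test item overlapping the highly rated "roman" descriptions receives a high predicted rating, while an item with no overlap falls back to the user's mean rating.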


18.5.2 Neighborhood-Based Methods for Collaborative Filtering


The basic idea in neighborhood-based methods is to use either user–user similarity or item–item similarity to make recommendations from a ratings matrix.


18.5.2.1 User-Based Similarity with Ratings


In this case, the top-k similar users to each user are determined with the use of a similarity function. Thus, for the target user i, its similarity to all the other users is computed. Therefore, a similarity function needs to be defined between users. In the case of a ratings-based matrix, the similarity computation is tricky because different users may have different scales of ratings. One user may be biased towards liking most items, and another user may be biased toward not liking most of the items. Furthermore, different users may have rated different items. One measure that captures the similarity between the rating vectors of two users is the Pearson correlation coefficient. Let X = (x_1 . . . x_s) and Y = (y_1 . . . y_s) be the common (specified) ratings between a pair of users, with means x̂ = Σ_{i=1}^{s} x_i/s and ŷ = Σ_{i=1}^{s} y_i/s, respectively. Alternatively, the mean rating of a user is computed by averaging over all her specified ratings rather than using only co-rated items by the pair of users at hand. This alternative way of computing the mean is more common, and it can significantly affect the pairwise Pearson computation. Then, the Pearson correlation coefficient between the two users is defined as follows:

Pearson(X, Y) = Σ_{i=1}^{s} (x_i − x̂)(y_i − ŷ) / ( √(Σ_{i=1}^{s} (x_i − x̂)²) · √(Σ_{i=1}^{s} (y_i − ŷ)²) )
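This computation can be sketched directly. The version below computes the means over the co-rated items only (the first of the two conventions discussed above), and the two users' ratings are invented toy data, not the entries of Figure 18.4:

```python
import math

def pearson(ratings_u, ratings_v):
    """Pearson correlation over the items co-rated by two users.

    Means are taken over the co-rated items only; the more common
    alternative (means over each user's full set of ratings) would
    change only the two mean computations below.
    """
    common = sorted(set(ratings_u) & set(ratings_v))
    if len(common) < 2:
        return 0.0  # correlation is undefined on fewer than two points
    xs = [ratings_u[i] for i in common]
    ys = [ratings_v[i] for i in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs)) * \
          math.sqrt(sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0

# Toy ratings for two users (invented).
u = {"GLADIATOR": 5, "BEN HUR": 4, "GODFATHER": 1}
v = {"GLADIATOR": 4, "BEN HUR": 5, "GODFATHER": 2, "SCARFACE": 5}
print(round(pearson(u, v), 3))
```

The two users agree on which movies they like, so the correlation is strongly positive even though their raw rating values differ.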
