A
|
A
|
|
A
|
|
H HUB
|
H
|
|
|
A
|
|
A
|
A
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
H
|
|
|
H
|
|
|
A
|
|
|
|
|
H
|
H
|
|
|
|
|
|
|
H
|
|
A
|
AUTHORITY
|
A
|
|
|
|
|
|
|
|
|
|
|
H
|
|
H
|
HUBS
|
|
|
|
|
|
|
|
|
|
AUTHORITIES
|
|
(a) Hub and authority
|
(b) Network organization between
|
|
|
examples
|
hubs and authorities
|
|
Figure 18.3: Illustrating hubs and authorities
knowledge on that subject. This will result in many pages linking to the authority page. A hub is a page with many out -links to authorities. These represent a compilation of the links on a particular topic. Thus, a hub page provides guidance to Web users about where they can find the resources on a particular topic. Examples of the typical node-centric topology of hubs and authorities in the Web graph are illustrated in Fig. 18.3a.
The main insight used by the HITS algorithm is that good hubs point to many good authorities. Conversely, good authority pages are pointed to by many hubs. An example of the typical organization of hubs and authorities is illustrated in Fig. 18.3b. This mutually reinforcing relationship is leveraged by the HITS algorithm. For any query issued by the user, the HITS algorithm starts with the list of relevant pages and expands them with a
hub ranking and an authority ranking.
The HITS algorithm starts by collecting the top-r most relevant results to the search query at hand. A typical value of r is 200. This defines the root set R . Typically, a query to a commercial search engine or content-based evaluation is used to determine the root set. For each node in R, the algorithm determines all nodes immediately connected (either in-linking or out-linking) to R. This provides a larger base set S. Because the base set S can be rather large, the maximum number of in-linking nodes to any node in R that are added to S is restricted to k . A typical value of k used is around 50. Note that this still results in a rather large base set because each of the possibly 200 root nodes might bring 50 in-linking nodes, along with out-linking nodes.
Let G = (S, A) be the subgraph of the Web graph defined on the (expanded) base set S, where A is the set of edges between nodes in the root set S. The entire analysis of the HITS algorithm is restricted to this subgraph. Each page (node) i ∈ S is assigned both a hub score h(i) and authority score a(i). It is assumed that the hub and authority scores are normalized, so that the sum of the squares of the hub scores and the sum of the squares of
604 CHAPTER 18. MINING WEB DATA
the authority scores are each equal to 1. Higher values of the score indicate better quality.
The hub and authority scores are related to one another in the following way:
h(i) =
|
a(j)
|
∀i ∈ S
|
(18.10)
|
|
j:(i,j)∈A
|
|
|
a(i) =
|
h(j)
|
∀i ∈ S.
|
(18.11)
|
j:(j,i)∈A
The basic idea is to reward hubs for pointing to good authorities and reward authorities for being pointed to by good hubs. It is easy to see that the aforementioned system of equations reinforces this mutually enhancing relationship. This is a linear system of equa-tions that can be solved using an iterative method. The algorithm starts by initializing h0(i) = a0(i) = 1/ |S|. Let ht(i) and at (i) denote the hub and authority scores of the ith node, respectively, at the end of the tth iteration. For each t ≥ 0, the algorithm executes the following iterative steps in the (t + 1)th iteration:
for each i ∈ S set at+1(i) ⇐
|
j:(j,i)∈A
|
for each i ∈ S set ht+1(i) ⇐
|
j:(i,j)∈A
|
Normalize L2-norm of each of hub and
ht( j);
at+1( j);
authority vectors to 1;
For hub-vector h = [ h(1) . . . h( n)] T and authority-vector a = [ a(1) . . . a ( n)] T , the updates can be expressed as a = AT h and h = Aa, respectively, when the edge set A is treated as an |S| × |S| adjacen cy matrix. The iteration is repeated to convergence. It can be shown that the hub vector h and the authority vector a converge in directions proportional to the dominant eigenvectors of AAT and AT A (see Exercise 6), respectively. This is because the relevant pair of updates can be shown to be equivalent to power-iteration updates of AAT and AT A, respectively.
18.5 Recommender Systems
Ever since the popularization of web-based transactions, it has become increasingly easy to collect data about user buying behaviors. This data includes information about user profiles, interests, browsing behavior, buying behavior, and ratings about various items. It is natural to leverage such data to make recommendations to customers about possible buying interests.
In the recommendation problem, the user–item pairs have utility values associated with them. Thus, for n users and d items, this results in an n × d matrix D of utility values. This is also referred to as the utility-matrix. The utility value for a user-item pair could correspond to either the buying behavior or the ratings of the user for the item. Typically, a small subset of the utility values are specified in the form of either customer buying behavior or ratings. It is desirable to use these specified values to make recommendations. The nature of the utility matrix has a significant influence on the choice of recommendation algorithm:
Positive preferences only: In this case, the specified utility matrix only contains posi-tive preferences. For example, a specification of a “like” option on a social networking site, the browsing of an item at an online site, or the buying of a specified quantity of an item, corresponds to a positive preference. Thus, the utility matrix is sparse, with a prespecified set of positive preferences. For example, the utility matrix may contain the raw quantities of the item bought by each user, a normalized mathematical func-tion of the quantities, or a weighted function of buying and browsing behavior. These
18.5. RECOMMENDER SYSTEMS
|
605
|
functions are typically specified heuristically by the analyst in an application-specific way. Entries that correspond to items not bought or browsed by the user may remain unspecified.
Positive and negative preferences (ratings): In this case, the user specifies the ratings that represent their like or dislike for the item. The incorporation of user dislike in the analysis is significant because it makes the problem more complex and often requires some changes to the underlying algorithms.
An example of a ratings-based utility matrix is illustrated in Fig. 18.4a, and an example of a positive-preference utility matrix is illustrated in Fig. 18.4b. In this case, there are six users, labeled U1 . . . U6, and six movies with specified titles. Higher ratings indicate more positive feedback in Fig. 18.4a. The missing entries correspond to unspecified preferences in both cases. This difference significantly changes the algorithms used in the two cases. In particular, the two matrices in Fig. 18.4 have the same specified entries, but they provide very different insights. For example, the users U1 and U3 are very different in Fig. 18.4a because they have very different ratings for their commonly specified entries. On the other hand, these users would be considered very similar in Fig. 18.4b because these users have expressed a positive preference for the same items. The ratings-based utility provides a way for users to express negative preferences for items. For example, user U1 does not like the movie Gladiator in Fig. 18.4a. There is no mechanism to specify this in the positive-preference utility matrix of Fig. 18.4b beyond a relatively ambiguous missing entry. In other words, the matrix in Fig. 18.4b is less expressive. While Fig. 18.4b provides an example of a binary matrix, it is possible for the nonzero entries to be arbitrary positive values. For example, they could correspond to the quantities of items bought by the different users.
This difference has an impact on the types of algorithms that are used in the two cases. Allowing for positive and negative preferences generally makes the problem harder. From a data collection point of view, it is also harder to infer negative preferences when they are inferred from customer behavior rather than ratings. Recommendations can also be enhanced with the use of content in the user and item representations.
Content-based recommendations: In this case, the users and items are both associated with feature-based descriptions. For example, item profiles can be determined by using the text of the item description. A user might also have explicitly specified their interests in a profile. Alternatively, their profile can be inferred from their buying or browsing behavior.
Collaborative filtering: Collaborative filtering, as the name implies, is the leveraging of the user preferences in the form of ratings or buying behavior in a “collaborative” way, for the benefit of all users. Specifically, the utility matrix is used to determine either relevant users for specific items, or relevant items for specific users in the rec-ommendation process. A key intermediate step in this approach is the determination of similar groups of items and users. The patterns in these peer groups provide the collaborative knowledge needed in the recommendation process.
The two models are not exclusive. It is often possible to combine content-based methods with collaborative filtering methods to create a combined preference score. Collaborative filtering methods are generally among the more commonly used models and will therefore be discussed in greater detail in this section.
It is important to understand that the utility matrices used in collaborative filtering algorithms are extremely large and sparse. It is not uncommon for the values of n and d in
606
Dostları ilə paylaş: |