Data Mining: The Textbook




[Figure 18.2: Transition probabilities for PageRank computation with different types of dead ends. Panel (a) shows a dead-end node and panel (b) a dead-end component; dashed transitions are added to remove the dead end, and edge labels give transition probabilities such as 1, 1/2, 1/3, and 1/4.]


steady-state probability becomes unique and independent of the starting state. The value of α may also be viewed as a smoothing or damping probability. Large values of α typically cause the steady-state probabilities of different pages to become more even. For example, if α is chosen to be 1, then all pages will have the same steady-state probability of being visited.

How are the steady-state probabilities determined? Let G = (N, A) be the directed Web graph, in which nodes correspond to pages, and edges correspond to hyperlinks. The total number of nodes is denoted by n. It is assumed that A also includes the added edges from dead-end nodes to all other nodes. The set of nodes incident on i is denoted by In(i), and the set of end points of the outgoing links of node i is denoted by Out(i). The steady-state probability at a node i is denoted by π(i). In general, the transitions of a Web surfer can be visualized as a Markov chain, in which an n × n transition matrix P is defined for a Web graph with n nodes. The PageRank of a node i is equal to the steady-state probability π(i) for node i in the Markov chain model. The probability² p_{ij} of transitioning from node i to node j is defined as 1/|Out(i)|. Examples of transition probabilities are illustrated in Fig. 18.2. These transition probabilities do not, however, account for teleportation, which will be addressed³ separately below.
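As a concrete illustration, the following minimal sketch builds the row-stochastic transition matrix P. It assumes NumPy and an adjacency-list representation of the graph; the helper name transition_matrix is illustrative, not from the text.

    import numpy as np

    def transition_matrix(out_links, n):
        """Build the n x n row-stochastic transition matrix P.

        out_links[i] lists the nodes that node i links to. A dead-end node
        (empty list) is given transitions to all n nodes, each with
        probability 1/n, matching the edges added to A in the text.
        """
        P = np.zeros((n, n))
        for i in range(n):
            if not out_links[i]:              # dead-end node
                P[i, :] = 1.0 / n
            else:
                for j in out_links[i]:        # p_ij = 1/|Out(i)|
                    P[i, j] = 1.0 / len(out_links[i])
        return P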


Let us examine the transitions into a given node i. The steady-state probability π(i) of node i is the sum of the probability of a teleportation into it and the probability that one of the in-linking nodes directly transitions into it. The probability of a teleportation into the node is exactly α/n because a teleportation occurs in a step with probability α, and all nodes are equally likely to be the beneficiary of the teleportation. The probability of a


transition into node i is given by (1 − α) · Σ_{j∈In(i)} π(j) · p_{ji}, as the sum of the probabilities of transitions from different in-linking nodes. Therefore, at steady-state, the probability of




² In some applications such as bibliographic networks, the edge (i, j) may have a weight denoted by w_{ij}. The transition probability p_{ij} is defined in such cases by p_{ij} = w_{ij} / Σ_{j∈Out(i)} w_{ij}.




³ An alternative way to achieve this goal is to modify G by multiplying existing edge transition probabilities by the factor (1 − α) and then adding α/n to the transition probability between each pair of nodes in G. As a result, G will become a directed clique with bidirectional edges between each pair of nodes. Such strongly connected Markov chains have unique steady-state probabilities. The resulting graph can then be treated as a Markov chain without having to separately account for the teleportation component. This model is equivalent to that discussed in the chapter.



a transition into node i is given by the sum of the probabilities of the teleportation and transition events, as follows:




π(i) = α/n + (1 − α) · Σ_{j∈In(i)} π(j) · p_{ji}.        (18.3)

For example, the equation for node 2 in Fig. 18.2a can be written as follows:




π(2) = α/4 + (1 − α) · (π(1) + π(2)/4 + π(3)/3 + π(4)/2).

There will be one such equation for each node, and therefore it is convenient to write the entire system of equations in matrix form. Let π = (π(1) . . . π(n))^T be the n-dimensional column vector representing the steady-state probabilities of all the nodes, and let e be an n-dimensional column vector of all 1 values. The system of equations can be rewritten in matrix form as follows:








π = α e/n + (1 − α) · P^T π.        (18.4)




The first term on the right-hand side corresponds to a teleportation, and the second term corresponds to a direct transition from an incoming node. In addition, because the vector π represents a probability, the sum of its components Σ_{i=1}^{n} π(i) must be equal to 1:




Σ_{i=1}^{n} π(i) = 1.        (18.5)

Note that this is a linear system of equations that can be easily solved using an iterative method. The algorithm starts off by initializing π^(0) = e/n, and it derives π^(t+1) from π^(t) by repeating the following iterative step:








π^(t+1) ⇐ α e/n + (1 − α) · P^T π^(t).        (18.6)




After each iteration, the entries of π^(t+1) are normalized by scaling them to sum to 1. These steps are repeated until the difference between π^(t+1) and π^(t) is a vector with magnitude less than a user-defined threshold. This approach is also referred to as the power-iteration method. It is important to understand that PageRank computation is expensive, and it cannot be computed on the fly for a user query during Web search. Rather, the PageRank values for all the known Web pages are precomputed and stored. The stored PageRank value for a page is accessed only when the page is included in the search results for a particular query, for use in the final ranking, as indicated by Eq. 18.2.
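A minimal sketch of this power-iteration procedure follows. It assumes the matrix P built in the earlier sketch (with dead-end edges already added); the names pagerank, tol, and max_iter are illustrative, not from the text.

    def pagerank(P, alpha=0.15, tol=1e-10, max_iter=1000):
        """Power iteration for Eq. 18.6 on a transition matrix P."""
        n = P.shape[0]
        pi = np.ones(n) / n                        # initialize pi^(0) = e/n
        for _ in range(max_iter):
            pi_next = alpha / n + (1.0 - alpha) * (P.T @ pi)
            pi_next /= pi_next.sum()               # rescale to sum to 1
            if np.abs(pi_next - pi).sum() < tol:   # user-defined threshold
                return pi_next
            pi = pi_next
        return pi

For example, pagerank(transition_matrix(out_links, n)) returns the steady-state probability vector for the whole graph.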


The PageRank values can be shown to be the n components of the largest left eigenvector⁴ of the stochastic transition matrix P (see Exercise 5), for which the eigenvalue is 1. The largest eigenvalue of a stochastic transition matrix is always 1. The left eigenvectors of P are the same as the right eigenvectors of P^T. Interestingly, the largest right eigenvectors of the stochastic transition matrix P of an undirected graph can be used to construct spectral embeddings (cf. Sect. 19.3.4 of Chap. 19), which are used for network clustering.





⁴ The left eigenvector X of P is a row vector satisfying XP = λX. The right eigenvector Y is a column vector satisfying PY = λY. For asymmetric matrices, the left and right eigenvectors are not the same. However, the eigenvalues are always the same. The unqualified term “eigenvector” refers to the right eigenvector by default.
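This eigenvector relationship can be checked numerically on small graphs. The sketch below (illustrative only, and feasible only for dense matrices of modest size) folds the teleportation into the transition matrix as described in footnote 3 and extracts the eigenvector with eigenvalue 1:

    def pagerank_by_eigen(P, alpha=0.15):
        """Recover PageRank as the eigenvector with eigenvalue 1 of the
        teleportation-adjusted matrix P' = (1 - alpha)*P + (alpha/n)*J."""
        n = P.shape[0]
        P_mod = (1.0 - alpha) * P + (alpha / n) * np.ones((n, n))
        vals, vecs = np.linalg.eig(P_mod.T)    # right eigenvectors of P'^T
        k = np.argmax(vals.real)               # the largest eigenvalue is 1
        pi = np.abs(vecs[:, k].real)
        return pi / pi.sum()                   # scale into a probability vector

The result should agree with the power-iteration output to within the convergence tolerance.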





18.4.1.1 Topic-Sensitive PageRank




Topic-sensitive PageRank is designed for cases in which it is desired to provide greater importance to some topics than others in the ranking process. While personalization is less common in large-scale commercial search engines, it is more common in smaller-scale, site-specific search applications. Typically, users may be more interested in certain combinations of topics than others. The knowledge of such interests may be available to a personalized search engine because of user registration. For example, a particular user may be more interested in the topic of automobiles. Therefore, it is desirable to rank pages related to automobiles higher when responding to queries by this user. This can also be viewed as the personalization of ranking values. How can this be achieved?

The first step is to fix a list of base topics, and determine a high-quality sample of pages from each of these topics. This can be achieved with the use of a resource such as the Open Directory Project (ODP),⁵ which can provide a base list of topics and sample Web pages for each topic. The PageRank equations are now modified, so that the teleportation is performed only on this sample set of Web documents, rather than on the entire space of Web documents. Let e_p be an n-dimensional personalization (column) vector with one entry for each page. An entry in e_p takes on the value of 1 if that page is included in the sample set, and 0 otherwise. Let the number of nonzero entries in e_p be denoted by n_p. Then, the PageRank Eq. 18.4 can be modified as follows:








π = α e_p/n_p + (1 − α) · P^T π.        (18.7)




The same power-iteration method can be used to solve the personalized PageRank problem. The selective teleportations bias the random walk, so that pages in the structural locality of the sampled pages will be ranked higher. As long as the sample of pages is a good representative of different (structural) localities of the Web graph, in which pages of specific topics exist, such an approach will work well. Therefore, for each of the different topics, a separate PageRank vector can be precomputed and stored for use during query time.


In some cases, the user is interested in specific combinations of topics, such as sports and automobiles. Clearly, the number of possible combinations of interests can be very large, and it is neither practical nor necessary to prestore every personalized PageRank vector. In such cases, only the PageRank vectors for the base topics are computed. The final result for a user is defined as a weighted linear combination of the topic-specific PageRank vectors, where the weights are defined by the user-specified interest in the different topics. A sketch of both steps follows.
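The sketch below continues the earlier code and makes the same assumptions; the name personalized_pagerank is illustrative, not from the text.

    def personalized_pagerank(P, sample, alpha=0.15, tol=1e-10, max_iter=1000):
        """Power iteration for Eq. 18.7: teleport only to pages in `sample`."""
        n = P.shape[0]
        e_p = np.zeros(n)
        e_p[list(sample)] = 1.0                    # personalization vector e_p
        n_p = e_p.sum()                            # n_p = |sample|
        pi = np.ones(n) / n
        for _ in range(max_iter):
            pi_next = alpha * e_p / n_p + (1.0 - alpha) * (P.T @ pi)
            pi_next /= pi_next.sum()
            if np.abs(pi_next - pi).sum() < tol:
                return pi_next
            pi = pi_next
        return pi

    # A user with, say, 70%/30% interest in sports and automobiles receives a
    # weighted combination of the precomputed topic-specific vectors:
    # pi_user = 0.7 * pi_sports + 0.3 * pi_autos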


18.4.1.2 SimRank


The notion of SimRank was defined to compute the structural similarity between nodes. SimRank determines symmetric similarities between nodes. In other words, the similarity between nodes i and j is the same as that between j and i. Before discussing SimRank, we define a related but slightly different asymmetric ranking problem:




Given a target node i_q and a subset of nodes S ⊆ N from graph G = (N, A), rank the nodes in S in their order of similarity to i_q.

Such a query is very useful in recommender systems in which users and items are arranged in the form of a bipartite graph of preferences, in which nodes correspond to users and items, and edges correspond to preferences. The node i_q may correspond to an item node,








⁵ http://www.dmoz.org.


and the set S may correspond to user nodes. Alternatively, the node i_q may correspond to a user node, and the set S may correspond to item nodes. Recommender systems will be discussed in Sect. 18.5. They are closely related to search, in that they also perform ranking of target objects, while taking user preferences into account.


This problem can be viewed as a limiting case of topic-sensitive PageRank, in which the teleportation is performed to the single node i_q. Therefore, the personalized PageRank Eq. 18.7 can be directly adapted by using the teleportation vector e_p = e_q, that is, a vector of all 0s except for a single 1 corresponding to the node i_q. Furthermore, the value of n_p in this case is set to 1:








π = α e_q + (1 − α) · P^T π.        (18.8)




The solution to the aforementioned equation will provide high ranking values to nodes in the structural locality of i_q. This definition of similarity is asymmetric because the similarity value assigned to node j starting from query node i is different from the similarity value assigned to node i starting from query node j. Such an asymmetric similarity measure is suitable for query-centered applications such as search engines and recommender systems, but not necessarily for arbitrary network-based data mining applications. In some applications, symmetric pairwise similarity between nodes is required. While it is possible to average the two topic-sensitive PageRank values in opposite directions to create a symmetric measure, the SimRank method provides an elegant and intuitive solution.
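In code, the query-centered ranking of Eq. 18.8 is simply the single-node special case of the personalization sketch above (again with illustrative names):

    def rank_by_similarity(P, i_q, S, alpha=0.15):
        """Rank the nodes in S by personalized PageRank from the node i_q."""
        pi = personalized_pagerank(P, sample=[i_q], alpha=alpha)  # e_q teleport
        return sorted(S, key=lambda j: pi[j], reverse=True)       # most similar first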


The SimRank approach is as follows. Let In(i) represent the in-linking nodes of i. The SimRank equation is naturally defined in a recursive way, as follows:





SimRank(i, j) = (C / (|In(i)| · |In(j)|)) · Σ_{p∈In(i)} Σ_{q∈In(j)} SimRank(p, q).        (18.9)




Here C is a constant in (0, 1) that can be viewed as a kind of decay rate of the recursion. As the boundary condition, the value of SimRank(i, j) is set to 1 when i = j. When either i or j does not have in-linking nodes, the value of SimRank(i, j) is set to 0. To compute SimRank, an iterative approach is used. The value of SimRank(i, j) is initialized to 1 if i = j, and 0 otherwise. The algorithm subsequently updates the SimRank values between all node pairs iteratively using Eq. 18.9 until convergence is reached.
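A direct sketch of this iteration is shown below. It is deliberately simple and expensive, since every node pair is updated in each pass; in_links[i] is assumed to list the in-linking nodes of i, and the name simrank is illustrative.

    def simrank(in_links, n, C=0.8, num_iter=10):
        """Iteratively apply Eq. 18.9 to all node pairs."""
        S = np.identity(n)                     # SimRank(i, i) = 1, else 0
        for _ in range(num_iter):
            S_next = np.identity(n)
            for i in range(n):
                for j in range(n):
                    if i == j or not in_links[i] or not in_links[j]:
                        continue               # boundary conditions stay 1 or 0
                    total = sum(S[p, q] for p in in_links[i] for q in in_links[j])
                    S_next[i, j] = C * total / (len(in_links[i]) * len(in_links[j]))
            S = S_next
        return S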


The notion of SimRank has an interesting intuitive interpretation in terms of random walks. Consider two random surfers walking in lockstep backward from node i and node j until they meet. The number of steps taken by each of them is a random variable L(i, j). Then, SimRank(i, j) can be shown to be equal to the expected value of C^{L(i,j)}. The decay constant C is used to map random walks of length l to a similarity value of C^l. Note that because C < 1, smaller distances will lead to higher similarity and vice versa.
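This interpretation can be checked empirically with a small simulation, sketched below under stated assumptions: walks are truncated at max_len steps, and pairs whose walkers get stuck or never meet contribute a similarity of 0.

    import random

    def simrank_monte_carlo(in_links, i, j, C=0.8, trials=100000, max_len=50):
        """Estimate E[C^L(i,j)] by lockstep backward random walks."""
        total = 0.0
        for _ in range(trials):
            a, b = i, j
            for step in range(1, max_len + 1):
                if not in_links[a] or not in_links[b]:
                    break                      # a walker is stuck; contributes 0
                a = random.choice(in_links[a])
                b = random.choice(in_links[b])
                if a == b:                     # surfers meet after `step` steps
                    total += C ** step
                    break
        return total / trials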


Random walk-based methods are generally more robust than the shortest-path distance for measuring similarity between nodes. This is because random walk measures implicitly account for the number of paths between nodes, whereas shortest paths do not. A detailed discussion of this issue can be found in Sect. 3.5.1.2 of Chap. 3.


18.4.2 HITS


The Hypertext Induced Topic Search (HITS) algorithm is a query-dependent algorithm for ranking pages. The intuition behind the approach lies in an understanding of the typical structure of the Web, which is organized into hubs and authorities.


An authority is a page with many in-links. Typically, it contains authoritative content on a particular subject, and, therefore, many Web users may trust that page as a resource of



