Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	354/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 350 351 352 353 354 355 356 357 ... 423

1-Data Mining tarjima

RankScore = f (IRScore, RepScore).

(18.2)

The exact function f (·, ·) used by commercial search engines is proprietary, but it is always monotonically related to both the IRScore and RepScore . Various other factors, such as the geographic location of the browser, also seem to play a role in the ranking.

It should be pointed out, that citation-based reputation scores are not completely immune to other types of spamming that involve coordinated creation of a large num-ber of links to a Web page. Furthermore, the use of anchor text of pointing Web pages in the content portion of the rank score can sometimes lead to amusingly irrelevant search results. For example, a few years back, a search on the keyword “miserable failure” in the Google search engine, returned as its top result, the oﬃcial biography of a previous president of the United States of America. This is because many Web pages were constructed in a coordinated way to use the anchor text “miserable failure” to point to this biography. This practice of influencing search results by coordinated linkage construction to a particular site is referred to as Googlewashing. Such practices are less often economically motivated, but are more often used for comical or satirical purposes.

Therefore, the ranking algorithms used by search engines are not perfect but have, nevertheless, improved significantly over the years. The algorithms used to compute the reputation-based ranking score will be discussed in the next section.

18.4 Ranking Algorithms

The PageRank algorithm uses the linkage structure of the Web for reputation-based ranking. The PageRank method is independent of the user query, because it only precomputes the reputation portion of the score in Eq. 18.2. The HITS algorithm is query-specific. It uses a number of intuitions about how authoritative sources on various topics are linked to one another in a hyperlinked environment.

598 CHAPTER 18. MINING WEB DATA

18.4.1 PageRank

The PageRank algorithm models the importance of Web pages with the use of the citation (or linkage) structure in the Web. The basic idea is that highly reputable documents are more likely to be cited (or in-linked) by other reputable Web pages.

A random surfer model on the Web graph is used to achieve this goal. Consider a random surfer who visits random pages on the Web by selecting random links on a page. The long-term relative frequency of visits to any particular page is clearly influenced by the number of in-linking pages to it. Furthermore, the long-term frequency of visits to any page will be higher if it is linked to by other frequently visited (or reputable) pages. In other words, the PageRank algorithm models the reputation of a Web page in terms of its long-term frequency of visits by a random surfer. This long-term frequency is also referred to as the steady-state probability. This model is also referred to as the random walk model.

The basic random surfer model does not work well for all possible graph topologies. A critical issue is that some Web pages may have no outgoing links, which may result in the random surfer getting trapped at specific nodes. In fact, a probabilistic transition is not even meaningfully defined at such a node. Such nodes are referred to as dead ends. An example of a dead-end node is illustrated in Fig. 18.2a. Clearly, dead ends are undesirable because the transition process for PageRank computation cannot be defined at that node. To address this issue, two modifications are incorporated in the random surfer model. The first modification is to add links from the dead-end node (Web page) to all nodes (Web pages), including a self-loop to itself. Each such edge has a transition probability of 1/n. This does not fully solve the problem, because the dead ends can also be defined on groups of nodes. In these cases, there are no outgoing links from a group of nodes to the remaining nodes in the graph. This is referred to as a dead-end component, or absorbing component. An example of a dead-end component is illustrated in Fig. 18.2b.

Dead-end components are common in the Web graph because the Web is not strongly connected. In such cases, the transitions at individual nodes can be meaningfully defined, but the steady-state transitions will stay trapped in these dead-end components. All the steady-state probabilities will be concentrated in dead-end components because there can be no transition out of a dead-end component after a transition occurs into it. Therefore, as long as even a minuscule probability of transition into a dead-end component1 exists, all the steady-state probability becomes concentrated in such components. This situation is not desirable from the perspective of PageRank computation in a large Web graph, where dead-end components are not necessarily an indicator of popularity. Furthermore, in such cases, the final probability distribution of nodes in various dead-end components is not unique and it is dependent on the starting state. This is easy to verify by observing that random walks starting in diﬀerent dead-end components will have their respective steady-state distributions concentrated within the corresponding components.

While the addition of edges solves the problem for dead-end nodes, an additional step is required to address the more complex issue of dead-end components. Therefore, aside from the addition of these edges, a teleportation, or restart step is used within the random surfer model. This step is defined as follows. At each transition, the random surfer may either jump to an arbitrary page with probability α, or it may follow one of the links on the page with probability (1 − α). A typical value of α used is 0.1. Because of the use of teleportation, the

¹A formal mathematical treatment characterizes this in terms of the ergodicity of the underlying Markov chains. In ergodic Markov chains, a necessary requirement is that it is possible to reach any state from any other state using a sequence of one or more transitions. This condition is referred to as strong connectivity. An informal description is provided here to facilitate understanding.

18.4. RANKING ALGORITHMS

599

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 350 351 352 353 354 355 356 357 ... 423