Data Mining: The Textbook



Date: 07.01.2024

  1. Content-spamming: In this case, the Web host owner stuffs the hosted Web page with repeated keywords, even though these keywords are not actually visible to the user. This is achieved by setting the color of the text to match the background of the page. Thus, the idea is to maximize the content relevance of the Web page as seen by the search engine, without a corresponding increase in the relevance visible to the user.

  2. Cloaking: This is a more sophisticated approach, in which the Web site serves different content to crawlers than it does to users. The Web site first determines whether the incoming request is from a crawler or from a user. If the request is from a user, then the actual content (e.g., advertising content) is served. If the request is from a crawler, then the content that is most relevant to specific keywords is served. As a result, the content that the search engine uses to answer user queries differs from the content that a Web user will actually see.
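
Both tricks leave measurable traces, and the following is a minimal sketch of how they might be flagged. It is not a method described in the text. It assumes a page has already been parsed into hypothetical `(text, text_color, background_color)` spans, and that the same URL has been fetched once as a crawler and once as an ordinary user. The function names and the 0.5 threshold are illustrative assumptions.

```python
import re

WORD = re.compile(r"[a-z0-9]+")

def hidden_word_fraction(spans):
    """Fraction of words rendered in the same color as their background.

    spans: list of (text, text_color, background_color) tuples
    (a hypothetical, pre-parsed representation of the page).
    """
    total = hidden = 0
    for text, fg, bg in spans:
        n = len(WORD.findall(text.lower()))
        total += n
        if fg.lower() == bg.lower():  # invisible to the user
            hidden += n
    return hidden / total if total else 0.0

def jaccard(a, b):
    """Jaccard similarity between the word sets of two texts."""
    ta, tb = set(WORD.findall(a.lower())), set(WORD.findall(b.lower()))
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def looks_cloaked(crawler_view, user_view, threshold=0.5):
    """Flag a page whose crawler and user versions share few words.

    threshold is an assumed cut-off, not a value from the text.
    """
    return jaccard(crawler_view, user_view) < threshold
```

A high hidden-word fraction suggests content-spamming, and a low similarity between the two views suggests cloaking. Either signal would only be a starting point for further checks, since legitimate pages can also vary by visitor.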

Such spamming clearly reduces the quality of the search results. Search engines also have strong incentives to improve the quality of their results in order to support their paid advertising model, in which the explicitly marked sponsored links appearing on the sidebar of the search results are genuine paid advertisements. Search engines do not want advertisements (disguised by spamming) to be served as bona fide results to the query, especially when such results reduce the quality of the user experience. This has led to an adversarial relationship between search engines and spammers, in which the former use reputation-based algorithms to reduce the impact of spam. On the other side, among Web site owners, a search engine optimization (SEO) industry attempts to improve the placement of pages in search results by using knowledge of the algorithms used by search engines, obtained either from the general principles used by engines or through reverse engineering of search results.


For a given search, it is almost always the case that a small subset of the results is more informative or provides more accurate information. How can such pages be determined? Fortunately, the Web provides several natural voting mechanisms to determine the reputation of pages.





  1. Page citation mechanisms: This is the most common mechanism used to determine the quality of Web pages. When a page is of high quality, many other Web pages point to it. A citation can be logically viewed as a vote for the Web page. While the number of in-linking pages can be used as a rough indicator of the quality, it does not provide a complete view because it does not account for the quality of the pages pointing to it. To provide a more holistic citation-based vote, an algorithm referred to as PageRank is used.





  2. User feedback or behavioral analysis mechanisms: When a user chooses a Web page from among the results returned for a search, this is clear evidence of the relevance of that page to the user. Therefore, other similar pages, or pages accessed by other similar users, can be returned. Such an approach is generally hard to implement in search because of limited user-identification mechanisms. Some search engines, such as Excite, have used various forms of relevance feedback. While these mechanisms are used less often by search engines, they are nevertheless quite important for commercial recommender systems. In commercial recommender systems, the recommendations are made by the Web site itself during user browsing, rather than by search engines. This is because commercial sites have stronger user-identification mechanisms (e.g., user registration) that enable more powerful algorithms for inferring user interests.
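
The difference between counting in-links and weighting them by the quality of the citing page, as described for the page citation mechanism, can be illustrated with a small sketch. The code below uses a common power-iteration formulation of PageRank with random teleportation. The damping value of 0.85 and the iteration count are conventional defaults assumed here, and the tiny graph is made up for illustration.

```python
def in_link_counts(graph):
    """Count in-links per page. graph maps page -> list of pages it links to.

    Assumes every link target also appears as a key of graph.
    """
    counts = {page: 0 for page in graph}
    for targets in graph.values():
        for dst in targets:
            counts[dst] += 1
    return counts

def pagerank(graph, damping=0.85, iterations=100):
    """Power iteration for PageRank on a small in-memory graph."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the teleportation share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for src in pages:
            targets = graph[src]
            if targets:
                # A page splits its vote evenly among its out-links.
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            else:
                # Dangling page: spread its vote over all pages.
                share = damping * rank[src] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical graph: x is cited by a well-cited hub, y by an uncited page.
graph = {
    "a": ["hub"], "b": ["hub"], "c": ["hub"],
    "d": ["y"], "hub": ["x"], "x": [], "y": [],
}
counts = in_link_counts(graph)
ranks = pagerank(graph)
```

Here `x` and `y` each have exactly one in-link, so the raw count cannot separate them, while `ranks["x"]` exceeds `ranks["y"]` because the page citing `x` is itself heavily cited.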

Typically, the reputation score is determined using PageRank-like algorithms. Therefore, if IRScore and RepScore are the content- and reputation-based scores of the Web page, respectively, then the final ranking score is computed as a function of these scores:

    RankScore = f(IRScore, RepScore)
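
The text leaves the combining function unspecified. As a purely illustrative assumption, the sketch below uses a weighted sum of the two scores. The names `rank_score` and `rank_pages`, the `weight` parameter, and the example scores are all hypothetical.

```python
def rank_score(ir_score, rep_score, weight=0.5):
    """One hypothetical choice of f: a convex combination of the two scores.

    weight is the share given to the content-based score (IRScore);
    the remainder goes to the reputation-based score (RepScore).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * ir_score + (1.0 - weight) * rep_score

def rank_pages(scores, weight=0.5):
    """Order pages by descending combined score.

    scores: dict mapping page -> (IRScore, RepScore).
    """
    return sorted(
        scores,
        key=lambda page: rank_score(*scores[page], weight=weight),
        reverse=True,
    )

# Made-up scores: a keyword-stuffed page with high content relevance
# but low reputation, and a reputable page with slightly lower relevance.
scores = {"stuffed_page": (0.9, 0.1), "reputable_page": (0.7, 0.8)}
```

With `weight=1.0` (content only) the keyword-stuffed page ranks first, whereas with the default `weight=0.5` the reputable page overtakes it. This shows why a reputation term blunts content-based spam.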





