FrontierList = S;
repeat
    Use algorithm A to select URL X ∈ FrontierList;
    FrontierList = FrontierList − {X};
    Fetch URL X and add to repository;
    Add all relevant URLs in fetched document X to end of FrontierList;
until termination criterion;
end
Figure 18.1: The basic crawler algorithm
HTTP protocol. This is the same mechanism used by browsers to fetch Web pages. The main difference is that the fetching is now done by an automated program using automated selection decisions, rather than by the manual specification of a link by a user with a Web browser. The fetched page is stored in a local repository, and the URLs inside it are extracted. These URLs are then added to the frontier list, provided that they have not already been visited. Therefore, a separate data structure, in the form of a hash table, needs to be maintained to store all visited URLs. In practical implementations of crawlers, not all unvisited URLs are added to the frontier list due to Web spam, spider traps, topical preference, or simply a practical limit on the size of the frontier list. These issues will be discussed later. After the relevant URLs have been added to the frontier list, the next iteration repeats the process with the next URL on the list. The process terminates when the frontier list is empty. An empty frontier list does not necessarily imply that the entire Web has been crawled. This is because the Web is not strongly connected, and many pages are unreachable from most randomly chosen seed sets. Because most practical crawlers, such as those used by search engines, are incremental crawlers that refresh pages over previous crawls, it is usually easy to identify unvisited seeds from previous crawls and add them to the frontier list, if needed. With large seed sets, such as a previously crawled repository of the Web, it is possible to robustly crawl most pages. The basic crawler algorithm is described in Fig. 18.1.
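The following Python sketch illustrates one possible implementation of the loop in Fig. 18.1. It is a minimal sketch rather than a definitive implementation: the helper functions fetch_page and extract_urls are hypothetical and assumed to be supplied by the caller, and algorithm A is instantiated as simple first-in-first-out (breadth-first) selection.

from collections import deque

def basic_crawl(seed_urls, fetch_page, extract_urls, max_pages=1000):
    # fetch_page(url) -> HTML string or None, extract_urls(html) -> list of URLs
    # are assumed helpers, not part of the algorithm in Fig. 18.1 itself.
    frontier = deque(seed_urls)        # FrontierList, initialized with the seed set S
    visited = set(seed_urls)           # hash table of URLs already seen
    repository = {}                    # local repository: URL -> fetched page

    while frontier and len(repository) < max_pages:   # termination criterion
        url = frontier.popleft()       # algorithm A: here, FIFO (breadth-first) selection
        html = fetch_page(url)         # fetch over HTTP, as a browser would
        if html is None:
            continue
        repository[url] = html
        for out_url in extract_urls(html):
            if out_url not in visited: # add only previously unseen URLs to the frontier
                visited.add(out_url)
                frontier.append(out_url)
    return repository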
Thus, the crawler is a graph search algorithm that discovers the outgoing links from nodes by parsing Web pages and extracting the URLs. The choice of the selection algorithm A will typically result in a bias in the crawling algorithm, especially in cases where it is impossible to crawl all the relevant pages due to resource limitations. For example, a breadth-first crawler is more likely to crawl a page with many links pointing to it. Interestingly, such biases are sometimes desirable in crawlers because it is impossible for any crawler to index the entire Web. Because the indegree of a Web page is often closely related to its PageRank, a measure of a Web page's quality, this bias is not necessarily undesirable. Crawlers use a variety of other selection strategies defined by the algorithm A:
1. Because most universal crawlers are incremental crawlers that are intended to refresh previous crawls, it is desirable to crawl frequently changing pages. The change frequency can be estimated from repeated previous crawls of the same page. Some resources, such as news portals, are updated frequently. Therefore, frequently updated pages may be selected by the algorithm A.
2. The selection algorithm A may specifically choose Web pages with high PageRank from the frontier list. The computation of PageRank is discussed in Sect. 18.4.1.
In practice, a combination of these factors is used by the commercial crawlers employed by search engines.
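As an illustration, the sketch below shows one way in which algorithm A might combine two of the factors above, an estimated change frequency and a PageRank value, when selecting the next URL from the frontier list. The dictionaries change_freq and pagerank, as well as the linear weighting, are illustrative assumptions rather than a prescribed formula.

def select_next_url(frontier, change_freq, pagerank, w_freq=0.5, w_rank=0.5):
    # frontier is a list of candidate URLs; change_freq and pagerank are assumed
    # to map URLs to scores estimated from previous crawls and from Sect. 18.4.1.
    def score(url):
        return (w_freq * change_freq.get(url, 0.0) +
                w_rank * pagerank.get(url, 0.0))
    best = max(frontier, key=score)    # highest combined score wins
    frontier.remove(best)
    return best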
18.2.2 Preferential Crawlers
In the preferential crawler, only pages satisfying a user-defined criterion need to be crawled. This criterion may be specified in the form of keyword presence in the page, a topical criterion defined by a machine learning algorithm, a geographical criterion about page location, or a combination of the different criteria. In general, an arbitrary predicate may be specified by the user, which forms the basis of the crawling. In these cases, the major change is to the approach used for updating the frontier list during crawling.
1. The Web page needs to meet the user-specified criterion in order for its extracted URLs to be added to the frontier list.
2. In some cases, the anchor text may be examined to determine the relevance of the Web page to the user-specified query.
3. In context-focused crawlers, the crawler is trained to learn the likelihood that relevant pages are within a short distance of the current page, even if the current page is itself not directly relevant to the user-specified criterion. For example, a Web page on "data mining" is more likely to point to a Web page on "information retrieval," even though the data mining page may not be relevant to the query on "information retrieval." URLs from such pages may be added to the frontier list. Therefore, heuristics need to be designed to learn such context-specific relevance.
Changes may also be made to the algorithm A. For example, URLs with more relevant anchor text, or with relevant tokens in the Web address, may be selected first by algorithm A. A URL such as http://www.golf.com, with the word "golf" in the Web address, may be more relevant to the topic of "golf" than a URL without the word in it. The bibliographic notes contain pointers to a number of heuristics that are commonly used for preferential resource discovery.
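The sketch below illustrates, under stated assumptions, how a preferential crawler might update and order its frontier list. The predicate page_is_relevant and the list topic_keywords are hypothetical user-supplied inputs, and the priority score is a simple heuristic based on anchor text and tokens in the Web address; it is not the only possible heuristic.

def url_priority(url, anchor_text, topic_keywords):
    # Count occurrences of topic keywords in the Web address and the anchor text.
    text = (url + " " + anchor_text).lower()
    return sum(text.count(kw.lower()) for kw in topic_keywords)

def update_frontier(page_is_relevant, out_links, frontier, visited, topic_keywords):
    # frontier is a list of (priority, url) pairs; out_links is a list of
    # (url, anchor_text) pairs extracted from the fetched page.
    if not page_is_relevant:           # user-specified predicate failed: discard links
        return
    for url, anchor in out_links:
        if url not in visited:
            visited.add(url)
            frontier.append((url_priority(url, anchor, topic_keywords), url))
    frontier.sort(reverse=True)        # algorithm A can then pick the highest-priority URL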
18.2.3 Multiple Threads
When a crawler issues a request for a URL and waits for it, the system is idle, with no work being done at the crawler end. This would seem to be a waste of resources. A natural way to speed up the crawling is by leveraging concurrency. The idea is to use multiple threads of the crawler that update a shared data structure for visited URLs and the page repository. In such cases, it is important to implement concurrency control mechanisms for locking or unlocking the relevant data structures during updates. The concurrent design can significantly speed up a crawler with more efficient use of resources. In practical implementations of large search engines, the crawler is distributed geographically with each “sub-crawler” collecting pages in its geographical proximity.
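A minimal multi-threaded sketch is shown below, assuming the same hypothetical fetch_page and extract_urls helpers as before. A single lock protects the shared frontier list, visited set, and page repository; production crawlers typically use finer-grained concurrency control and do not exit a thread merely because the frontier is momentarily empty.

import threading
from collections import deque

def crawl_concurrently(seed_urls, fetch_page, extract_urls, num_threads=8):
    frontier = deque(seed_urls)
    visited = set(seed_urls)
    repository = {}
    lock = threading.Lock()            # concurrency control for the shared structures

    def worker():
        while True:
            with lock:
                if not frontier:       # simplification: stop when the frontier is empty
                    return
                url = frontier.popleft()
            html = fetch_page(url)     # the slow network wait happens outside the lock
            if html is None:
                continue
            out_urls = extract_urls(html)
            with lock:                 # hold the lock only while updating shared state
                repository[url] = html
                for u in out_urls:
                    if u not in visited:
                        visited.add(u)
                        frontier.append(u)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return repository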
18.2.4 Combatting Spider Traps
The main reason that the crawling algorithm always visits distinct Web pages is that it maintains a list of previously visited URLs for comparison purposes. However, some
shopping sites create dynamic URLs in which the last page visited is appended to the end of the URL, so that the server can log the user's action sequence within the URL for future analysis. For example, when a user clicks on the link for page2 from http://www.examplesite.com/page1, the new dynamically created URL will be http://www.examplesite.com/page1/page2. Pages that are visited further will continue to be appended to the end of the URL, even if these pages were visited before. A natural way to combat this is to limit the maximum size of the URL. Furthermore, a maximum limit may also be placed on the number of URLs crawled from a particular site.
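The two safeguards described above can be sketched as follows; the specific threshold values are illustrative assumptions rather than recommended settings.

from collections import defaultdict
from urllib.parse import urlparse

MAX_URL_LENGTH = 256         # reject suspiciously long (dynamically growing) URLs
MAX_PAGES_PER_SITE = 10000   # cap the number of URLs crawled from a single site

pages_per_site = defaultdict(int)

def should_enqueue(url):
    # Returns True only if the URL passes both spider-trap checks.
    if len(url) > MAX_URL_LENGTH:
        return False
    site = urlparse(url).netloc
    if pages_per_site[site] >= MAX_PAGES_PER_SITE:
        return False
    pages_per_site[site] += 1
    return True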
18.2.5 Shingling for Near Duplicate Detection
One of the major problems with the Web pages collected by a crawler is that many duplicates of the same page may be crawled. This is because the same Web page may be mirrored at multiple sites. Therefore, it is crucial to have the ability to detect near duplicates. An approach known as shingling is commonly used for this purpose.
A k-shingle from a document is simply a string of k consecutively occurring words in the document. A shingle can also be viewed as a k-gram. For example, consider the document comprising the following sentence: