Usage-centric applications: The user activity on the Web is mined to make inferences. The different ways in which user activity can be mined are as follows:
Recommender systems: In these cases, preference information in the form of either ratings for product items or product buying behavior is used to make recommendations to other like-minded users.
Web log analysis: Web logs are a useful resource for Web site owners to determine relevant patterns of user browsing. These patterns can be leveraged for making inferences such as finding anomalous patterns, user interests, and optimal Web site design.
Many of the aforementioned applications overlap with other chapters in the book. For example, content-centric data mining applications have already been covered in previous chapters of this book, especially in Chap. 13 on mining text data. Some of these methods
do need to be modified to account for the additional linkage data. Many linkage mining applications are discussed in Chap. 19 on social network analysis. Therefore, this chapter will focus on the applications that are not primarily covered by other chapters. Among the content-centric applications, Web crawling, search, and ranking will be discussed. Among the usage-centric applications, recommender systems and Web log mining applications will be discussed.
This chapter is organized as follows. Sect. 18.2 discusses Web crawlers and resource discovery. Search engine indexing and query-processing methods are discussed in Sect. 18.3. Ranking algorithms are presented in Sect. 18.4. Recommender systems are discussed in Sect. 18.5. Methods for mining Web logs are discussed in Sect. 18.6. The summary is presented in Sect. 18.7.
18.2 Web Crawling and Resource Discovery
Web crawlers are also referred to as spiders or robots. The primary motivation for Web crawling is that the resources on the Web are dispersed widely across globally distributed sites. While the Web browser provides a graphical user interface to access these pages in an interactive way, the full power of the available resources cannot be leveraged with the use of only a browser. In many applications, such as search and knowledge discovery, it is necessary to download all the relevant pages at a central location to allow machine learning algorithms to use these resources efficiently.
Web crawlers have numerous applications. The most important and well-known application is search, in which the downloaded Web pages are indexed, to provide responses to user keyword queries. All the well-known search engines, such as Google and Bing, employ crawlers to periodically refresh the downloaded Web resources at their servers. Such crawlers are also referred to as universal crawlers because they are intended to crawl all pages on the Web irrespective of their subject matter or location. Web crawlers are also used for business intelligence, in which the Web sites related to a particular subject are crawled or the sites of a competitor are monitored and incrementally crawled as they change. Such crawlers are also referred to as preferential crawlers because they discriminate between the relevance of different pages for the application at hand.
18.2.1 A Basic Crawler Algorithm
While the design of a crawler is quite complex, with a distributed architecture and many processes or threads, the following describes a simple sequential and universal crawler that captures the essence of how crawlers are constructed.
The basic crawler algorithm, described in a very general way, uses a seed set of Uniform Resource Locators (URLs) S and a selection algorithm A as the input. The algorithm A decides which document to crawl next from a current frontier list of URLs. The frontier list contains URLs extracted from already-fetched Web pages; these are the candidates for pages that the crawler can eventually fetch. The selection algorithm A is important because it regulates the basic strategy used by the crawler to discover resources. For example, if new URLs are appended to the end of the frontier list, and the algorithm A selects documents from the beginning of the list, then this corresponds to a breadth-first algorithm.
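The effect of the selection strategy can be illustrated with a toy frontier. The following is a minimal sketch, with hypothetical page names, showing that when new URLs are appended to the end of the same frontier list, reading from the front yields breadth-first order while reading from the back yields depth-first order.

from collections import deque

# Hypothetical frontier list; newly discovered URLs are appended to the end.
frontier = deque(["seed_page_1", "seed_page_2"])
frontier.extend(["link_on_seed_page_1", "another_link_on_seed_page_1"])

# The selection algorithm A amounts to choosing which end of the list to read:
breadth_first_next = frontier.popleft()   # select from the front -> breadth-first crawl
depth_first_next = frontier.pop()         # select from the back  -> depth-first crawl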
The basic crawler algorithm proceeds as follows. First, the seed set of URLs is added to the frontier list. In each iteration, the selection algorithm A picks one of the URLs from the frontier list. This URL is deleted from the frontier list and then fetched using the HTTP protocol. The fetched page is stored in a local repository, and the URLs inside it are extracted and added to the frontier list, provided that they have not already been encountered.
Algorithm BasicCrawler(Seed URLs: S, Selection Algorithm: A)
begin
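To make the description above concrete, the following is a minimal Python sketch of the sequential crawl loop, assuming a FIFO frontier and hence breadth-first behavior. The function name, the max_pages limit, and the regular-expression link extraction are illustrative simplifications rather than part of the original algorithm, and issues such as politeness delays, robots.txt handling, and distributed operation are omitted.

from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

# Crude link extraction for illustration only; a real crawler would use an HTML parser.
HREF_PATTERN = re.compile(r'href="([^"#]+)"', re.IGNORECASE)

def basic_crawler(seed_urls, max_pages=100):
    # Frontier list initialized with the seed set S; a set records URLs
    # already placed on the frontier so that no page is fetched twice.
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    repository = {}                       # local store of fetched pages

    while frontier and len(repository) < max_pages:
        url = frontier.popleft()          # selection algorithm A: pick from the front
        try:
            with urlopen(url, timeout=5) as response:
                page = response.read().decode("utf-8", errors="ignore")
        except OSError:
            continue                      # skip URLs that cannot be fetched
        repository[url] = page
        # Extract candidate URLs from the fetched page and grow the frontier.
        for link in HREF_PATTERN.findall(page):
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return repository

Replacing popleft() with pop() in the same loop would turn the crawl into a depth-first traversal, which is precisely the role played by the selection algorithm A.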