Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	352/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 348 349 350 351 352 353 354 355 ... 423

1-Data Mining tarjima

Mary had a little lamb, its fleece was white as snow.

The set of 2-shingles extracted from this sentence is “Mary had”, “had a”, “a little”, “little lamb”, “lamb its ”, “its fleece”, “fleece was”, “was white”, “white as”, and “as snow”. Note that the number of k-shingles extracted from a document is no longer than the length of the document, and 1-shingles are simply the set of words in the document. Let S₁ and S₂ be the k-shingles extracted from two documents D₁ and D₂. Then, the shingle-based similarity between D₁ and D₂ is simply the Jaccard coeﬃcient between S₁ and S₂

J(S₁, S₂) =	\|S₁	∩ S₂	\|	.	(18.1)

	\|S₁	∪ S₂	\|

Typically, the value of k ranges between 5 and 10 depending on the corpus size and applica-tion domain. The advantage of using k -shingles instead of the individual words (1 -shingles) for Jaccard coeﬃcient computation is that shingles are less likely than words to repeat in diﬀerent documents. There are rk distinct shingles for a lexicon of size r. For k ≥ 5, the chances of many shingles recurring in two documents becomes very small. Therefore, if two documents have many k-shingles in common, they are very likely to be near duplicates. To save space, the individual shingles are hashed into 4-byte (32-bit) numbers that are used for comparison purposes. Such a representation also enables better eﬃciency.

18.3 Search Engine Indexing and Query Processing

After the documents have been crawled, they are leveraged for query processing. There are two primary stages to the search index construction:

Oﬄine stage: This is the stage in which the search engine preprocesses the crawled documents to extract the tokens and constructs an index to enable eﬃcient search. A quality-based ranking score is also computed for each page at this stage.

18.3. SEARCH ENGINE INDEXING AND QUERY PROCESSING

595

Online query processing: This preprocessed collection is utilized for online query pro-cessing. The relevant documents are accessed and then ranked using both their rele-vance to the query and their quality.

The preprocessing steps for Web document processing are described in Chap. 13 on mining text data. The relevant tokens are extracted and stemmed. Stop words are removed. These documents are then transformed to the vector space representation for indexing.

After the documents have been transformed to the vector space representation, an inverted index is constructed on the document collection. The construction of inverted indices is described in Sect. 5.3.1.2 of Chap. 5. The inverted list maps each word identifier to a list of document identifiers containing it. The frequency of the word is also stored with the document identifier in the inverted list. In many implementations, the position information of the word in the document is stored as well.

Aside from the inverted index that maps words to documents, an index is needed for accessing the storage location of the inverted word lists relevant to the query terms. These locations are then used to access the inverted lists. Therefore, a vocabulary index is required as well. In practice, many indexing methods such as hashing and tries are commonly used. Typically, a hash function is applied to each word in the query term, to yield the logical address of the corresponding inverted list.

For a given set of words, all the relevant inverted lists are accessed, and the intersection of these inverted lists is determined. This intersection is used to determine the Web document identifiers that contain all, or most of, the search terms. In cases, where one is interested only in documents containing most of the search terms, the intersection of diﬀerent subsets of inverted lists is performed to determine the best match. Typically, to speed up the process, two indexes are constructed. A smaller index is constructed on only the titles of the Web page, or anchor text of pages pointing to the page. If enough documents are found in the smaller index, then the larger index is not referenced. Otherwise, the larger index is accessed. The logic for using the smaller index is that the title of a Web page and the anchor text of Web pages pointing to it, are usually highly representative of the content in the page.

Typically, the number of pages returned for common queries may be of the order of millions or more. Obviously, such a large number of query results will not be easy for a human user to assimilate. A typical browser interface will present only the first few (say 10) results to the human user in a single view of the search results, with the option of browsing other less relevant results. Therefore, one of the most important problems in search engine query processing is that of ranking. The aforementioned processing of the inverted index does provide a content-based score. This score can be leveraged for ranking. While the exact scoring methodology used by commercial engines is proprietary, a number of factors are known to influence the content-based score:

A word is given diﬀerent weights, depending upon whether it occurs in the title, body, URL token, or the anchor text of a pointing Web page. The occurrence of the term in the title or the anchor text of a Web page pointing to that page is generally given higher weight.

The number of occurrences of a keyword in a document will be used in the score. Larger numbers of occurrences are obviously more desirable.

The prominence of a term in font size and color may be leveraged for scoring. For example, larger font sizes will be given a larger score.

596 CHAPTER 18. MINING WEB DATA

When multiple keywords are specified, their relative positions in the documents are used as well. For example, if two keywords occur close together in a Web page, then this increases the score.

The content-based score is not suﬃcient, however, because it does not account for the rep-utation, or the quality, of the page. It is important to use such mechanisms because of the uncoordinated and open nature of Web development. After all, the Web allows anyone to publish almost anything, and therefore there is little control on the quality of the results. A user may publish incorrect material either because of poor knowledge on the subject, economic incentives, or with a deliberately malicious intent of publishing misleading infor-mation.

Another problem arises from the impact of Web spam, in which Web site owners inten-tionally serve misleading content to rank their results higher. Commercial Web site owners have significant economic incentives to ensure that their sites are ranked higher. For exam-ple, an owner of a business on golf equipment, would want to ensure that a search on the word “golf” ranks his or her site as high as possible. There are several strategies used by Web site owners to rank their results higher.

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 348 349 350 351 352 353 354 355 ... 423