Data Mining: The Textbook


Date: 07.01.2024



Stemming: Variations of the same word need to be consolidated. For example, singular and plural forms of the same word, and different tenses of the same word, are consolidated. In many cases, stemming refers to common root extraction from words, and the extracted root may not even be a word in and of itself. For example, the common root of hoping and hope is hop. The drawback is that the word hop has a different meaning and usage of its own. Therefore, while stemming usually improves recall in document retrieval, it can sometimes worsen precision slightly. Nevertheless, stemming usually enables higher-quality results in mining applications.

Punctuation marks: After stemming has been performed, punctuation marks, such as commas and semicolons, are removed. Numeric digits are also removed. Hyphens are removed if the removal results in distinct and meaningful words. Typically, a base dictionary may be available for these operations. The distinct parts of a hyphenated word can either be treated as separate words or merged into a single word.

After the aforementioned steps, the resulting document may contain only semantically relevant words. This document is treated as a bag-of-words, in which relative ordering is irrelevant. In spite of the obvious loss of ordering information in this representation, the bag-of-words representation is surprisingly effective.
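The punctuation and digit removal described above, followed by the bag-of-words step, can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name bag_of_words is hypothetical, the text is assumed to be ASCII English, lowercasing is an added convenience not mentioned in the text, hyphenated words are split into separate words (one of the two options described), and stemming and stop-word removal are omitted.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return word counts for a document, discarding word order."""
    # Assumption: lowercase so that "Mining" and "mining" count as one word.
    text = text.lower()
    # Replace everything except letters and whitespace with a space.
    # This drops punctuation and digits, and splits hyphenated words
    # into their parts.
    text = re.sub(r"[^a-z\s]", " ", text)
    return Counter(text.split())

doc = "Mining text, mining data: 2 well-known tasks."
counts = bag_of_words(doc)
print(counts)
```

Because only the counts are kept, two documents that contain the same words in a different order map to the same representation.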

432 CHAPTER 13. MINING TEXT DATA

13.2.1 Document Normalization and Similarity Computation

The problem of document normalization is closely related to that of similarity computation. While the issue of text similarity is discussed in Chap. 3, it is also discussed here for completeness. Two primary types of normalization are applied to documents:

Inverse document frequency: Higher frequency words tend to contribute noise to data mining operations such as similarity computation. The removal of stop words is motivated by this aspect. The concept of inverse document frequency generalizes this principle in a softer way, where words with higher frequency are weighted less.

Frequency damping: The repeated presence of a word in a document will typically bias the similarity computation significantly. To provide greater stability to the similarity computation, a damping function is applied to word frequencies so that the frequencies of different words become more similar to one another. It should be pointed out that frequency damping is optional, and the effects vary with the application at hand. Some applications, such as clustering, have shown comparable or better performance without damping. This is particularly true if the underlying data sets are relatively clean and have few spam documents.
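The effect of damping can be illustrated with a small Python sketch. The helper name damp and the sample frequencies are hypothetical; the square root and the logarithm are used here as two common choices of damping function.

```python
import math

def damp(frequencies, f=math.sqrt):
    """Apply a damping function f to each word frequency."""
    return [f(x) for x in frequencies]

# Hypothetical raw frequencies of three words in one document.
raw = [1, 4, 100]

sqrt_damped = damp(raw)             # square-root damping
log_damped = damp(raw, f=math.log)  # logarithmic damping

print(sqrt_damped)
print(log_damped)
```

With the square root, the ratio between the largest and smallest frequency drops from 100 to 10, so a single heavily repeated word dominates less. Note that math.log raises a ValueError for a frequency of 0, so in this sketch the logarithm should only be applied to words that actually occur in the document.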

In the following, these different types of normalization will be discussed. The inverse document frequency id_i of the ith term is a decreasing function of the number of documents n_i in which it occurs:

id_i = log(n/n_i).    (13.1)

Here, the number of documents in the collection is denoted by n. Other ways of computing the inverse document frequency are possible, though the impact on the similarity function is usually limited.
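As a minimal sketch of Eq. 13.1, the following Python function computes id_i = log(n/n_i) for every term in a small collection. The function name and the toy documents are hypothetical, each document is assumed to be already tokenized into a list of words, and the natural logarithm is an assumption, since the text does not fix the base.

```python
import math

def inverse_document_frequency(documents):
    """Map each term to log(n / n_i), where n is the number of documents
    and n_i is the number of documents containing the term."""
    n = len(documents)
    doc_count = {}
    for doc in documents:
        # set(doc) ensures each document is counted once per term.
        for term in set(doc):
            doc_count[term] = doc_count.get(term, 0) + 1
    return {term: math.log(n / n_i) for term, n_i in doc_count.items()}

# Hypothetical collection of three tokenized documents.
docs = [["data", "mining"], ["text", "mining"], ["mining", "graphs"]]
idf = inverse_document_frequency(docs)
print(idf)
```

In this toy collection, "mining" occurs in every document and therefore receives a weight of log(3/3) = 0, whereas "data" occurs in only one document and receives log(3).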

Next, the concept of frequency damping is discussed. This normalization ensures that the excessive presence of a single word does not throw off the similarity computation. Consider a document with word-frequency vector X = (x_1 ... x_d), where d is the size of the lexicon. A damping function f(·), such as the square root or the logarithm, is optionally applied to the frequencies before similarity computation:
