Stemming: Variations of the same word need to be consolidated. For example, singular and plural representations of the same word, and different tenses of the same word, are consolidated. In many cases, stemming refers to the extraction of a common root from words, and the extracted root may not even be a word in and of itself. For example, the common root of hoping and hope is hop. Of course, the drawback is that the word hop has a different meaning and usage of its own. Therefore, while stemming usually improves recall in document retrieval, it can sometimes worsen precision slightly. Nevertheless, stemming usually enables higher quality results in mining applications. A crude illustrative sketch follows this list.
Punctuation marks: After stemming has been performed, punctuation marks, such as commas and semicolons, are removed. Furthermore, numeric digits are removed. Hyphens are removed, if the removal results in distinct and meaningful words. Typically, a base dictionary may be available for these operations. Furthermore, the distinct parts of a hyphenated word can either be treated as separate words, or they may be merged into a single word.
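As a crude illustration of root extraction, the following sketch strips common suffixes from words. This is a hypothetical toy routine, not the Porter algorithm or any standard library stemmer; it reproduces the hoping/hope example above, including the collision with the unrelated word hop:

```python
# A deliberately naive suffix-stripping stemmer, for illustration only
SUFFIXES = ["ing", "ed", "es", "s", "e"]

def crude_stem(word):
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        # Strip the suffix only if a reasonably long stem remains
        if word.endswith(suffix) and len(stem) >= 3:
            return stem
    return word

print(crude_stem("hoping"))  # hop
print(crude_stem("hopes"))   # hop
print(crude_stem("hope"))    # hop -- collides with the unrelated verb "hop"
```

Production stemmers, such as the Porter stemmer, apply more careful rewriting rules to reduce such collisions.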
After the aforementioned steps, the resulting document may contain only semantically relevant words. This document is treated as a bag-of-words, in which relative ordering is irrelevant. In spite of the obvious loss of ordering information in this representation, the bag-of-words representation is surprisingly effective.
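Concretely, a bag-of-words can be represented as a simple frequency map; the sketch below uses Python's standard library and a made-up token list:

```python
from collections import Counter

# Tokens after stop-word removal, stemming, and punctuation handling (made up)
tokens = ["text", "mine", "extract", "pattern", "text", "data"]

# The bag-of-words retains only word frequencies; token ordering is discarded
bag = Counter(tokens)
print(bag)  # Counter({'text': 2, 'mine': 1, 'extract': 1, 'pattern': 1, 'data': 1})
```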
13.2.1 Document Normalization and Similarity Computation
The problem of document normalization is closely related to that of similarity computation. Although the issue of text similarity is discussed in Chap. 3, it is revisited here for completeness. Two primary types of normalization are applied to documents:
Inverse document frequency: Higher frequency words tend to contribute noise to data mining operations such as similarity computation. The removal of stop words is motivated by this aspect. The concept of inverse document frequency generalizes this principle in a softer way, where words with higher frequency are weighted less.
Frequency damping: The repeated presence of a word in a document will typically bias the similarity computation significantly. To provide greater stability to the similarity computation, a damping function is applied to word frequencies so that the frequencies of different words become more similar to one another. It should be pointed out that frequency damping is optional, and the effects vary with the application at hand. Some applications, such as clustering, have shown comparable or better performance without damping. This is particularly true if the underlying data sets are relatively clean and have few spam documents.
In the following, these different types of normalization will be discussed. The inverse document frequency id_i of the ith term is a decreasing function of the number of documents n_i in which it occurs:

id_i = log(n/n_i)

Here, the number of documents in the collection is denoted by n. Other ways of computing the inverse document frequency are possible, though the impact on the similarity function is usually limited.
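A minimal sketch of this computation, using Python's standard library and a made-up corpus in which each document is reduced to its set of distinct terms:

```python
import math

# Toy corpus: each document is represented by the set of terms it contains
docs = [
    {"text", "mine", "data"},
    {"text", "cluster"},
    {"data", "cluster", "mine"},
]

n = len(docs)  # number of documents in the collection

def idf(term):
    # n_i: the number of documents in which the term occurs
    # (assumes the term occurs in at least one document)
    n_i = sum(1 for doc in docs if term in doc)
    return math.log(n / n_i)

print(idf("text"))  # log(3/2): "text" occurs in 2 of the 3 documents
print(idf("mine"))  # log(3/2)
```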
Next, the concept of frequency damping is discussed. This normalization ensures that the excessive presence of a single word does not throw off the similarity computation. Consider a document with word-frequency vector X = (x_1 . . . x_d), where d is the size of the lexicon. A damping function f(·), such as the square root or the logarithm, is optionally applied to the frequencies before similarity computation:

f(x_i) = √x_i
f(x_i) = log(x_i)
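One common way to combine damping with inverse document frequency weighting is the normalized frequency h(x_i) = f(x_i) · id_i. The following is a minimal sketch of this combination, using made-up frequencies and precomputed idf values; the resulting vectors would then feed into a similarity measure such as the cosine:

```python
import math

# Toy word-frequency vector over a 4-word lexicon (made-up values)
x = [10, 1, 0, 3]            # raw term frequencies x_1 ... x_d
idf = [0.4, 1.1, 2.0, 0.7]   # assumed precomputed inverse document frequencies

def damp(freq):
    # Square-root damping; log(1 + freq) is a common alternative
    return math.sqrt(freq)

# Normalized frequency: damped raw frequency scaled by inverse document frequency
h = [damp(x_i) * id_i for x_i, id_i in zip(x, idf)]
print(h)  # the dominant word (frequency 10) no longer overwhelms the vector
```

Note that √10 ≈ 3.16, so a word occurring ten times contributes only about three times as much as a word occurring once, rather than ten times as much.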