Digital libraries: A recent trend in article and book production is to rely on digitized versions, rather than hard copies. This has led to the proliferation of digital libraries in which effective document management becomes crucial. Furthermore, mining tools are also used in some domains, such as biomedical literature, to glean useful insights.
Web and Web-enabled applications: The Web is a vast repository of documents that is further enriched with links and other types of side information. Web documents are also referred to as hypertext. The additional side information available with hypertext can be useful in the knowledge discovery process. In addition, many web-enabled applications, such as social networks, chat boards, and bulletin boards, are a significant source of text for analysis.
Newswire services: An increasing trend in recent years has been the de-emphasis of printed newspapers and a move toward electronic news dissemination. This trend creates a massive stream of news documents that can be analyzed for important events and insights.
C. C. Aggarwal, Data Mining: The Textbook, DOI 10.1007/978-3-319-14142-8_13, © Springer International Publishing Switzerland 2015
The set of features (or dimensions) of text is also referred to as its lexicon. A collection of documents is referred to as a corpus. A document can be viewed as either a sequence or a multidimensional record. A text document is, after all, a discrete sequence of words, also referred to as a string. Therefore, many sequence-mining methods discussed in Chap. 15 are theoretically applicable to text. However, such sequence-mining methods are rarely used in the text domain. This is partially because sequence-mining methods are most effective when the length of the sequences and the number of possible tokens are both relatively modest. Documents, on the other hand, are often long sequences drawn from a lexicon of several hundred thousand words.
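The sequence view described above can be illustrated with a minimal sketch. The toy corpus, the `tokenize` helper, and the regular expression used for tokenization are assumptions made for illustration; they are not part of the text.

```python
import re

# Hypothetical toy corpus; the documents are illustrative only.
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# The lexicon is the set of distinct words in the corpus
# (sorted so that each word receives a stable integer id).
lexicon = sorted({word for doc in corpus for word in tokenize(doc)})
word_to_id = {word: i for i, word in enumerate(lexicon)}

# Sequence view: each document is an ordered string of token ids.
sequences = [[word_to_id[w] for w in tokenize(doc)] for doc in corpus]

for seq in sequences:
    print(len(seq), seq)
```

Even in this toy case, the number of possible tokens equals the lexicon size. For a real corpus, both the sequence lengths and the lexicon size grow far beyond what is shown here, which is the difficulty noted above.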
In practice, text is usually represented as multidimensional data in the form of frequency-annotated bag-of-words. Words are also referred to as terms. Although such a representation loses the ordering information among the words, it also enables the use of much larger classes of multidimensional techniques. Typically, a preprocessing approach is applied in which the very common words are removed, and the variations of the same word are consolidated. The processed documents are then represented as an unordered set of words, where normalized frequencies are associated with the individual words. The resulting representation is also referred to as the vector space representation of text. The vector space representation of a document is a multidimensional vector that contains a frequency associated with each word (dimension) in the document. The overall dimensionality of this data set is equal to the number of distinct words in the lexicon. The words from the lexicon that are not present in the document are assigned a frequency of 0. Therefore, text is not very different from the multidimensional data type that has been studied in the preceding chapters.
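The construction of the vector space representation can be sketched as follows. This is a simplified illustration under stated assumptions, not a definitive implementation: the small `STOP_WORDS` set stands in for a real list of very common words, the `consolidate` helper is a crude plural stripper standing in for a real stemmer, and frequencies are normalized by document length, which is only one of several possible normalizations.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list, not a standard one.
STOP_WORDS = {"the", "a", "an", "on", "of", "and", "is", "are"}

def consolidate(word):
    """Crudely merge plural forms by stripping a trailing 's' (stand-in for a stemmer)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def preprocess(text):
    """Tokenize, drop very common words, and consolidate word variations."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [consolidate(t) for t in tokens if t not in STOP_WORDS]

def vector_space(corpus):
    """Return the lexicon and one normalized frequency vector per document."""
    docs = [preprocess(d) for d in corpus]
    lexicon = sorted({t for doc in docs for t in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)          # bag of words: ordering is discarded
        total = sum(counts.values())
        # Words of the lexicon absent from the document get frequency 0.
        vectors.append([counts[t] / total if total else 0.0 for t in lexicon])
    return lexicon, vectors

# Hypothetical toy corpus for illustration.
corpus = ["The cats chase the cat.", "Dogs chase cats and dogs."]
lexicon, vectors = vector_space(corpus)
print(lexicon)
for v in vectors:
    print(v)
```

Here the lexicon has three terms (`cat`, `chase`, `dog`), so each document becomes a three-dimensional vector, and the first document receives a frequency of 0 for `dog`. In a realistic corpus, the dimensionality would be in the hundreds of thousands, and most entries of each vector would be 0.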
Due to the multidimensional nature of the text, the techniques studied in the aforementioned chapters can also be applied to the text domain with a modest number of modifications. What are these modifications, and why are they needed? To understand these modifications, one needs to understand a number of specific characteristics that are unique to text data: