13.5.1 Instance-Based Classifiers

Instance-based classifiers work surprisingly well for text, especially when a preprocessing phase of clustering or dimensionality reduction is performed. The simplest form of the nearest-neighbor classifier returns the dominant class label among the top-k nearest neighbors under cosine similarity. Weighting each vote by its cosine similarity value often provides more robust results. However, because of the sparse and high-dimensional nature of text collections, this basic procedure can be modified in two ways to improve both efficiency and effectiveness. The first method uses dimensionality reduction in the form of latent semantic indexing. The second method uses fine-grained clustering to perform centroid-based classification.

13.5.1.1 Leveraging Latent Semantic Analysis

A major source of error in instance-based classification is the noise inherent in text collections. This noise is often a result of synonymy and polysemy. Synonymy refers to the fact that two different words may mean approximately the same thing; for example, the words comical and hilarious have nearly identical meanings. Polysemy refers to the fact that the same word may mean two different things; for example, the word jaguar could refer to a car or a cat. Typically, the significance of a word can be understood only in the context of the other words in the document. These characteristics of text create challenges for classification algorithms, because similarity computed directly from word frequencies may not be accurate. For example, two documents containing the words comical and hilarious, respectively, may not be deemed sufficiently similar because of synonymy effects. In latent semantic indexing, dimensionality reduction is applied to the collection to reduce these effects.

Latent semantic analysis (LSA) is an approach that relies on singular value decomposition (SVD) to create a reduced representation of the text collection. The reader is advised to refer to Sect. 2.4.3.3 of Chap. 2 for details of SVD and LSA. The LSA method is an application of SVD to the n × d document-term matrix D, where d is the size of the lexicon and n is the number of documents. The eigenvectors with the largest eigenvalues of the square d × d matrix D^T D are used for data representation. Because the sparsity of the data set results in a low intrinsic dimensionality, the reduction in dimensionality achieved by LSA in the text domain is rather drastic. For example, it is not uncommon to be able to represent a corpus drawn from a lexicon of size 100,000 in fewer than 300 dimensions. The removal of the dimensions with small eigenvalues typically reduces the noise effects of synonymy and polysemy. The transformed representation is no longer sparse and resembles multidimensional numeric data, so a conventional k-nearest neighbor classifier with cosine similarity can be applied to the transformed corpus. The LSA method does require additional effort up front to create the eigenvectors.
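This pipeline can be sketched compactly in code. The following is a minimal sketch only, assuming scikit-learn is available; the function name lsa_knn_predict, the choice of TF-IDF weighting, and the parameters (300 latent dimensions, k = 5) are illustrative assumptions rather than values prescribed by the text.

```python
# Minimal sketch: LSA (truncated SVD) followed by a similarity-weighted
# k-nearest neighbor vote. Parameters are illustrative, not prescribed.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def lsa_knn_predict(train_docs, train_labels, test_docs, n_dims=300, k=5):
    # Build the sparse n x d document-term matrix D.
    vectorizer = TfidfVectorizer()
    D_train = vectorizer.fit_transform(train_docs)
    D_test = vectorizer.transform(test_docs)

    # LSA: project onto the top singular vectors to obtain a dense,
    # low-dimensional representation of the corpus.
    svd = TruncatedSVD(n_components=n_dims)
    Z_train = svd.fit_transform(D_train)
    Z_test = svd.transform(D_test)

    train_labels = np.asarray(train_labels)
    predictions = []
    for z in Z_test:
        # Cosine similarity to all training documents in the latent space.
        sims = cosine_similarity(z.reshape(1, -1), Z_train).ravel()
        top_k = np.argsort(sims)[::-1][:k]
        # Similarity-weighted vote among the k nearest neighbors.
        votes = {}
        for idx in top_k:
            votes[train_labels[idx]] = votes.get(train_labels[idx], 0.0) + sims[idx]
        predictions.append(max(votes, key=votes.get))
    return predictions
```

The eigenvector computation happens once, in the call to TruncatedSVD; every subsequent query pays only the cost of a dense k-nearest neighbor search in the reduced space.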
13.5.1.2 Centroid-Based Classification

Centroid-based classification is a fast alternative to k-nearest neighbor classifiers. The basic idea is to use an off-the-shelf clustering algorithm to partition the documents of each class into clusters. The number of clusters derived from the documents of each class is proportional to the number of documents in that class. This ensures that the clusters in each class are of approximately the same granularity. Class labels are associated with the individual clusters rather than with the actual documents.

The cluster digests are extracted from the centroids by retaining only the most frequent words in each centroid. Typically, about 200 to 400 words are retained per centroid. The lexicon of each of these centroids provides a stable and topical representation of the subjects in its class. For example, the (weighted) word vectors of two classes with the labels "Business schools" and "Law schools" would each emphasize a distinct, topically coherent vocabulary.
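The scheme can be sketched as follows. This is a minimal illustration assuming scikit-learn; the use of k-means as the off-the-shelf clustering algorithm, the helper names build_centroids and classify, the cluster-count rule, and the 200-word digest size are assumptions consistent with the description above, not a prescribed implementation.

```python
# Minimal sketch of centroid-based classification. k-means as the clusterer
# and the 200-word digest size are illustrative choices, not prescriptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def build_centroids(docs, labels, clusters_per_1000_docs=10, digest_size=200):
    vectorizer = TfidfVectorizer()
    D = vectorizer.fit_transform(docs)
    labels = np.asarray(labels)

    centroids, centroid_labels = [], []
    for cls in np.unique(labels):
        D_cls = D[np.where(labels == cls)[0]]
        # Number of clusters proportional to the number of documents in the class,
        # so that clusters in each class have roughly the same granularity.
        n_clusters = max(1, int(D_cls.shape[0] * clusters_per_1000_docs / 1000))
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(D_cls)
        for centroid in km.cluster_centers_:
            # Cluster digest: keep only the highest-weight words of the centroid.
            digest = np.zeros_like(centroid)
            top = np.argsort(centroid)[::-1][:digest_size]
            digest[top] = centroid[top]
            centroids.append(digest)
            centroid_labels.append(cls)
    return vectorizer, np.vstack(centroids), np.asarray(centroid_labels)

def classify(test_docs, vectorizer, centroids, centroid_labels):
    X = vectorizer.transform(test_docs)
    # Each test document receives the label of its most similar cluster digest.
    sims = cosine_similarity(X, centroids)
    return centroid_labels[np.argmax(sims, axis=1)]
```

Because each test document is compared against a modest number of truncated cluster digests rather than against every training document, this approach is considerably faster than a plain k-nearest neighbor search over the full corpus.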