The SVMPerf and SVMLight classifiers were described in [291] and [292], respectively. A survey on SVM classification may be found in [124]. General surveys on text classification may be found in [31, 33, 453]. The first-story detection problem was first proposed in the context of the topic detection and tracking effort [557]. The micro-cluster-based novelty detection method described in this chapter was adapted from [48]. Probabilistic models for novelty detection may be found in [545]. A general discussion on the topic of first-story detection may be found in [5].

13.9 Exercises

1. Implement a computer program that parses a set of text documents and converts them to the vector space representation. Use tf-idf normalization. Download a list of stop words from http://www.ranks.nl/resources/stopwords.html and remove them from the documents before creating the vector space representation.

2. Discuss the weaknesses of the k-medoids algorithm when applied to text data.

3. Suppose you paired the shared nearest neighbor similarity function (see Chap. 2) with cosine similarity to implement the k-means clustering algorithm for text. What is its advantage over the direct use of cosine similarity?

4. Design a combination of hierarchical and k-means algorithms in which merging operations are interleaved with the assignment operations. Discuss its advantages and disadvantages with respect to the scatter/gather clustering algorithm, in which merging strictly precedes assignment.

5. Suppose that you have a large collection of short tweets from Twitter. Design a Bayes classifier that uses the identity as well as the exact position of each of the first ten words in a tweet to perform classification. How would you handle tweets containing fewer than ten words?

6. Design a modification of single-linkage text clustering algorithms that is able to avoid excessive chaining.

7. Discuss why the multinomial Bayes classification model works better on longer documents with large lexicons than the Bernoulli Bayes model.

8. Suppose that you have class labels associated with documents. Describe a simple supervised dimensionality reduction approach that uses PLSA on a derivative of the document-term matrix to yield basis vectors, each of which is biased towards one or more of the classes. You should be able to control the level of supervision with a parameter λ.

9. Design an EM algorithm for clustering text data in which the documents are generated from the multinomial distribution instead of the Bernoulli distribution. Under what scenarios would you prefer this clustering algorithm over the Bernoulli model?

10. For the case of binary classes, show that the Rocchio method defines a linear decision boundary. How would you characterize the decision boundary in the multiclass case?

11. Design a method that uses the EM algorithm to discover outlier documents.

Chapter 14
Mining Time Series Data

"The only reason for time is so that everything doesn't happen at once."—Albert Einstein

14.1 Introduction

Temporal data is common in data mining applications. Typically, it is the result of continuously occurring processes in which the data is collected by hardware or software monitoring devices. The diversity of relevant domains is significant, extending from the medical to the financial domain. Some examples of such data are as follows: