Data Mining: The Textbook





The SVMPerf and SVMLight classifiers were described in [291] and [292], respectively. A survey on SVM classification may be found in [124]. General surveys on text classification may be found in [31, 33, 453].


The first-story detection problem was first proposed in the context of the topic detection and tracking effort [557]. The micro-cluster-based novelty detection method described in this chapter was adapted from [48]. Probabilistic models for novelty detection may be found in [545]. A general discussion on the topic of first-story detection may be found in [5].


13.9 Exercises





  1. Implement a computer program that parses a set of text, and converts it to the vector space representation. Use tf-idf normalization. Download a list of stop words from http://www.ranks.nl/resources/stopwords.html and remove them from the document, before creating the vector space representation.
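A minimal sketch of such a parser, using one common tf-idf variant (tf times log(n/df)) and a tiny hard-coded stop-word list standing in for the downloaded one:

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; in practice, substitute the full
# list downloaded from the URL given in the exercise.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "on"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS]

def tfidf_vectors(corpus):
    """Map each raw document to a sparse {term: tf-idf weight} dict."""
    tokenized = [tokenize(doc) for doc in corpus]
    n = len(tokenized)
    df = Counter()                      # document frequency of each term
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors
```

Note that a term appearing in every document receives weight zero under this idf formula; smoothed variants of the idf factor are also common.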




  2. Discuss the weaknesses of the k-medoids algorithm when applied to text data.




  3. Suppose you paired the shared nearest neighbor similarity function (see Chap. 2) with cosine similarity to implement the k-means clustering algorithm for text. What is its advantage over the direct use of cosine similarity?




  4. Design a combination of hierarchical and k-means algorithms in which merging operations are interleaved with the assignment operations. Discuss its advantages and disadvantages with respect to the scatter/gather clustering algorithm in which merging strictly precedes assignment.




  5. Suppose that you have a large collection of short tweets from Twitter. Design a Bayes classifier which uses the identity as well as the exact position of each of the first ten words in the tweet to perform classification. How would you handle tweets containing less than ten words?
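One possible encoding treats each (position, word) pair as a feature and pads short tweets with a reserved token, so every tweet yields exactly ten features. A sketch with illustrative names (this is one reasonable design, not the only answer):

```python
import math
from collections import defaultdict

PAD = "<pad>"  # reserved token used to pad tweets shorter than ten words

def position_features(tweet, k=10):
    """Return exactly k (position, word) features for a tweet."""
    words = tweet.lower().split()[:k]
    words += [PAD] * (k - len(words))
    return list(enumerate(words))

class PositionalBayes:
    """Naive Bayes over (position, word) features with Laplace
    smoothing; the class and method names are illustrative only."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.feat_counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()

    def fit(self, tweets, labels):
        for tweet, label in zip(tweets, labels):
            self.class_counts[label] += 1
            for f in position_features(tweet):
                self.feat_counts[label][f] += 1
                self.vocab.add(f)

    def predict(self, tweet):
        total = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            for f in position_features(tweet):
                # Laplace smoothing over the observed feature set
                p = (self.feat_counts[label][f] + 1) / (count + len(self.vocab))
                score += math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Padding makes the pad token itself informative at each position, which is one defensible way to handle short tweets; an alternative is to simply omit the missing positions from the product.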




  6. Design a modification of single-linkage text clustering algorithms, which is able to avoid excessive chaining.




  7. Discuss why the multinomial Bayes classification model works better on longer documents with large lexicons than the Bernoulli Bayes model.




  8. Suppose that you have class labels associated with documents. Describe a simple supervised dimensionality reduction approach that uses PLSA on a derivative of the document-term matrix to yield basis vectors which are each biased towards one or more of the classes. You should be able to control the level of supervision with a parameter λ.




  9. Design an EM algorithm for clustering text data, in which the documents are generated from the multinomial distribution instead of the Bernoulli distribution. Under what scenarios would you prefer this clustering algorithm over the Bernoulli model?




 10. For the case of binary classes, show that the Rocchio method defines a linear decision boundary. How would you characterize the decision boundary in the multiclass case?
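One way to see the linearity, assuming the nearest-centroid form of the Rocchio rule with Euclidean distance (on unit-normalized vectors the cosine form reduces to the same comparison): write μ₁ and μ₂ for the two class centroids, and assign a document X to class 1 when it is closer to μ₁.

```latex
\|\overline{X}-\overline{\mu}_1\|^2 \le \|\overline{X}-\overline{\mu}_2\|^2
\;\Longleftrightarrow\;
2\,(\overline{\mu}_1-\overline{\mu}_2)\cdot \overline{X}
\;\ge\; \|\overline{\mu}_1\|^2-\|\overline{\mu}_2\|^2
```

Expanding the squared norms cancels the quadratic term X·X on both sides, leaving a condition of the form W·X ≥ b, which is a hyperplane. In the multiclass case, each pairwise comparison of centroids contributes such a hyperplane, so the decision regions form a piecewise-linear partition of the space around the centroids.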




 11. Design a method which uses the EM algorithm to discover outlier documents.



Chapter 14


Mining Time Series Data

The only reason for time is so that everything doesn’t happen at once.—Albert Einstein


14.1 Introduction


Temporal data is common in data mining applications. Typically, this is a result of continuously occurring processes in which the data is collected by hardware or software monitoring devices. The diversity of domains is quite significant and extends from the medical to the financial domain. Some examples of such data are as follows:







