Data Mining: The Textbook



6.12. EXERCISES





  6. Write a computer program to implement a hierarchical merging algorithm with the single-linkage merging criterion.
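A minimal sketch of what such a program might look like (a naive implementation that recomputes all pairwise cluster distances in every merge step; all names are illustrative, not the book's code):

```python
# Naive agglomerative clustering with the single-linkage criterion.
import math

def single_linkage(points, num_clusters):
    """Merge clusters until num_clusters remain; the inter-cluster distance
    is the minimum distance between any pair of points (single linkage)."""
    clusters = [[p] for p in points]          # start: one cluster per point
    while len(clusters) > num_clusters:
        best = (float("inf"), None, None)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-linkage distance between clusters i and j
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))   # merge the closest pair
    return clusters

# Two well-separated 1-dimensional groups
data = [(1.0,), (2.0,), (10.0,), (11.0,)]
print(single_linkage(data, 2))  # the two pairs end up in separate clusters
```

An efficient implementation would maintain a distance matrix (or a heap of candidate merges) instead of recomputing distances from scratch in every iteration.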




  7. Write a computer program to implement the EM algorithm, in which there are two spherical Gaussian clusters with the same radius. Download the Ionosphere data set from the UCI Machine Learning Repository [213]. Apply the algorithm to the data set (with randomly chosen centers), and record the centroid of each Gaussian in each iteration. Now apply the k-means algorithm implemented in Exercise 3, with the same set of initial seeds as Gaussian centroids. How do the centroids in the two algorithms compare over the different iterations?
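One possible sketch of the E- and M-steps for this special case, assuming a fixed shared variance σ² (the data loading and the comparison against k-means are left out; all names here are illustrative and the synthetic data merely replaces the Ionosphere set for a quick self-check):

```python
# EM for a mixture of two spherical Gaussians with a common, fixed radius.
import numpy as np

def em_spherical(X, centers, sigma=1.0, iters=25):
    """X: (n, d) data; centers: (2, d) initial means. Returns the fitted
    means and a history of the centroids after every iteration."""
    mu = np.array(centers, dtype=float)
    pi = np.array([0.5, 0.5])                 # mixing weights
    history = [mu.copy()]
    for _ in range(iters):
        # E-step: responsibilities under isotropic Gaussians N(mu_k, sigma^2 I)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_p = -d2 / (2 * sigma ** 2) + np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)   # numerical stability
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate means and mixing weights from responsibilities
        nk = r.sum(axis=0)
        mu = (r.T @ X) / nk[:, None]
        pi = nk / len(X)
        history.append(mu.copy())
    return mu, history

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
mu, hist = em_spherical(X, centers=[[1.0, 1.0], [5.0, 5.0]])
print(mu)   # the means move toward (0, 0) and (6, 6)
```

Recording `history` for both this algorithm and k-means (started from the same seeds) allows the two centroid trajectories to be plotted side by side, which is what the exercise asks for.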




  8. Implement the computer program of Exercise 7 with a general Gaussian distribution, rather than a spherical Gaussian.




  9. Consider a 1-dimensional data set with three natural clusters. The first cluster contains the consecutive integers {1 . . . 5}. The second cluster contains the consecutive integers {8 . . . 12}. The third cluster contains the data points {24, 28, 32, 36, 40}. Apply a k-means algorithm with initial centers of 1, 11, and 28. Does the algorithm determine the correct clusters?
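The iterations can be traced by hand, or checked against a small script such as the following sketch (illustrative code, not the book's):

```python
# Tiny 1-dimensional k-means to trace the exercise above.
def kmeans_1d(data, centers, iters=10):
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in centers]
        for x in data:
            i = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
            clusters[i].append(x)
        # re-estimation step: each center becomes its cluster mean
        centers = [sum(c) / len(c) for c in clusters]
    return centers, clusters

data = [1, 2, 3, 4, 5, 8, 9, 10, 11, 12, 24, 28, 32, 36, 40]
centers, clusters = kmeans_1d(data, [1, 11, 28])
print(clusters)  # [[1, 2, 3, 4, 5], [8, 9, 10, 11, 12], [24, 28, 32, 36, 40]]
```

With this initialization the very first assignment step already separates the three natural groups, and the centers converge to 3, 10, and 32.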




  10. If the initial centers are changed to 1, 2, and 3, does the algorithm discover the correct clusters? What does this tell you?




  11. Use the data set of Exercise 9 to show how hierarchical algorithms are sensitive to local density variations.




  12. Use the data set of Exercise 9 to show how grid-based algorithms are sensitive to local density variations.




  13. It is a fundamental fact of linear algebra that any rank-k matrix has a singular value decomposition in which exactly k singular values are non-zero. Use this result to show that the lowest error of rank-k approximation in SVD is the same as that of unconstrained matrix factorization in which basis vectors are not constrained to be orthogonal. Assume that the Frobenius norm of the error matrix is used in both cases to compute the approximation error.
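The two directions of the claim can be illustrated numerically: mixing the truncated SVD factors with any invertible matrix R yields a non-orthogonal factorization with exactly the same product, hence the same error, while the Eckart–Young theorem guarantees that no rank-k factorization achieves a lower Frobenius error than the truncated SVD. A sketch (variable names illustrative):

```python
# Truncated SVD error versus unconstrained rank-k factorization error.
import numpy as np

rng = np.random.default_rng(2)
D = rng.standard_normal((8, 6))
k = 2

U, s, Vt = np.linalg.svd(D)
D_svd = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
err_svd = np.linalg.norm(D - D_svd)          # Frobenius norm by default

# Mix the SVD factors by an invertible matrix R: the columns of U2 are no
# longer orthogonal, but the product U2 @ V2.T is unchanged.
R = np.array([[1.0, 2.0], [0.0, 1.0]])
U2 = U[:, :k] @ np.diag(s[:k]) @ R
V2 = Vt[:k, :].T @ np.linalg.inv(R).T
err_mixed = np.linalg.norm(D - U2 @ V2.T)

print(np.isclose(err_svd, err_mixed))        # True: identical error

# A random rank-k factorization can only do worse (Eckart-Young):
Ur, Vr = rng.standard_normal((8, k)), rng.standard_normal((6, k))
print(np.linalg.norm(D - Ur @ Vr.T) >= err_svd)  # True
```

The exercise itself asks for the algebraic argument; the script only confirms the two inequalities that the proof combines.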




  14. Suppose that you constructed a k-nearest neighbor similarity graph from a data set, with weights on the edges. Describe the bottom-up single-linkage algorithm in terms of the similarity graph.




  15. Suppose that a shared nearest neighbor similarity function (see Chap. 3) is used in conjunction with the k-medoids algorithm to discover k clusters from n data points. The number of nearest neighbors used to define shared nearest neighbor similarity is m. Describe how a reasonable value of m may be selected in terms of k and n, so as not to result in poor algorithm performance.




  16. Suppose that matrix factorization is used to approximately represent a data matrix D as D ≈ D′ = U V^T. Show that one or more of the rows/columns of U and V can be multiplied by constant factors, so as to represent D′ = U V^T in an infinite number of different ways. What would be a reasonable choice of U and V among these solutions?
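The scaling ambiguity in the exercise above can be illustrated numerically: multiplying the j-th column of U by a constant c and dividing the j-th column of V by the same c leaves the product U V^T unchanged (a minimal sketch; names are illustrative):

```python
# The product U @ V.T is invariant to compensating column rescalings.
import numpy as np

rng = np.random.default_rng(1)
U = rng.standard_normal((5, 2))
V = rng.standard_normal((4, 2))
D_approx = U @ V.T

c = 3.7
U2, V2 = U.copy(), V.copy()
U2[:, 0] *= c          # scale a column of U ...
V2[:, 0] /= c          # ... and undo it in the matching column of V

print(np.allclose(D_approx, U2 @ V2.T))  # True: same approximation
```

Since c is arbitrary, there are infinitely many such factorizations; a common way to fix the ambiguity is to normalize the columns of one factor (e.g., to unit norm) and absorb the scale into the other.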






  17. Explain how each of the internal validity criteria is biased toward one of the algorithms.




  18. Suppose that you generate a synthetic data set containing arbitrarily oriented Gaussian clusters. How well does the SSQ criterion reflect the quality of the clusters?




  19. Which algorithms will perform best for the method of synthetic data generation in Exercise 18?

Chapter 7


Cluster Analysis: Advanced Concepts

“The crowd is just as important as the group. It takes everything to make it work.”—Levon Helm


7.1 Introduction


In the previous chapter, the basic data clustering methods were introduced. In this chapter, several advanced clustering scenarios will be studied, such as the impact of the size, dimensionality, or type of the underlying data. In addition, it is possible to obtain significant insights with the use of advanced supervision methods, or with the use of ensemble-based algorithms. In particular, two important aspects of clustering algorithms will be addressed:





