Data Mining: The Textbook



the form of LSA [184, 416]. It has been shown in many domains [25, 184, 416] that the use of methods such as SVD, LSA, and PCA unexpectedly improves the quality of the underlying representation after performing the reduction. This improvement arises because discarding the low-variance dimensions reduces noise effects. Applications of SVD to data imputation are found in [23] and Chap. 18 of this book. Other methods for dimensionality reduction and transformation include Kalman filtering [260], Fastmap [202], and nonlinear methods such as Laplacian eigenmaps [90], MDS [328], and ISOMAP [490].
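The noise-reduction effect described above can be illustrated with a minimal sketch. The data here is synthetic and purely an assumption for illustration: a rank-2 signal in 20 dimensions with added Gaussian noise. Truncating the SVD to the top components should bring the reconstruction closer to the noise-free signal than the noisy data was.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 200 points in 20
# dimensions whose signal lies in a 2-dimensional subspace, plus noise.
n, d, k = 200, 20, 2
signal = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))
noisy = signal + 0.5 * rng.normal(size=(n, d))

# Mean-center with the column means of the noisy data.
mean = noisy.mean(axis=0)
centered = noisy - mean

# Thin SVD; singular values come back in decreasing order.
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

# Keep the top-k components and discard the low-variance ones.
denoised = (U[:, :k] * s[:k]) @ Vt[:k] + mean

# Compare distances to the noise-free signal before and after reduction.
err_noisy = np.linalg.norm(noisy - signal)
err_denoised = np.linalg.norm(denoised - signal)
print(f"error before reduction: {err_noisy:.2f}")
print(f"error after reduction:  {err_denoised:.2f}")
```

In practice the number of retained components is not known in advance and is usually chosen by inspecting how quickly the singular values decay.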

In recent years, many dimensionality reduction methods have also been proposed that perform type transformation together with the reduction process. These include wavelet transformation [475] and graph embedding methods such as ISOMAP and Laplacian eigenmaps [90, 490]. A tutorial on spectral methods for graph embedding may be found in [371].

2.7 Exercises

Consider the time-series (−3, −1, 1, 3, 5, 7, ∗). Here, a missing entry is denoted by ∗. What is the estimated value of the missing entry using linear interpolation on a window of size 3?
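As a hedged sketch of one possible reading of this exercise: because the missing entry is the last one, the estimate is obtained by fitting a least-squares line to the three known values immediately before it and evaluating that line at the missing position. The helper names `fit_line` and `estimate_missing` are hypothetical, and time stamps are assumed to be the positions 0, 1, 2, ... in the series.

```python
def fit_line(ts, ys):
    """Least-squares line through the points (ts[i], ys[i])."""
    n = len(ts)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    slope = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys)) / sum(
        (t - t_mean) ** 2 for t in ts
    )
    return slope, y_mean - slope * t_mean


def estimate_missing(series, index, window=3):
    """Estimate series[index] from the `window` known values before it."""
    ts = list(range(index - window, index))
    ys = [series[t] for t in ts]
    slope, intercept = fit_line(ts, ys)
    return slope * index + intercept


series = [-3, -1, 1, 3, 5, 7, None]  # None marks the missing entry
estimate = estimate_missing(series, index=6)
print(estimate)  # 9.0
```

Under this reading the window holds the values 3, 5, and 7, which lie exactly on a line of slope 2, so the estimate is 9.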

Suppose you had a bunch of text documents, and you wanted to determine all the personalities mentioned in these documents. What class of technologies would you use to achieve this goal?

Download the Arrhythmia data set from the UCI Machine Learning Repository [213]. Normalize all records to a mean of 0 and a standard deviation of 1. Discretize each numerical attribute into (a) 10 equi-width ranges and (b) 10 equi-depth ranges.
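The normalization and the two discretization schemes in this exercise can be sketched on a single attribute as follows. The column below is synthetic stand-in data, not the UCI data set; with real data, missing values and any constant attribute (zero standard deviation) would need separate handling first.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for one numerical attribute (an assumption for illustration).
column = rng.exponential(scale=2.0, size=1000)

# Standardize to mean 0 and standard deviation 1.
z = (column - column.mean()) / column.std()

num_bins = 10

# (a) Equi-width: split [min, max] into 10 intervals of equal length.
width_edges = np.linspace(z.min(), z.max(), num_bins + 1)
width_bins = np.digitize(z, width_edges[1:-1])  # bin ids 0..9

# (b) Equi-depth: place edges at the deciles so each bin holds
# roughly the same number of records.
depth_edges = np.quantile(z, np.linspace(0, 1, num_bins + 1))
depth_bins = np.digitize(z, depth_edges[1:-1])  # bin ids 0..9

width_counts = np.bincount(width_bins, minlength=num_bins)
depth_counts = np.bincount(depth_bins, minlength=num_bins)
print("equi-width counts:", width_counts)
print("equi-depth counts:", depth_counts)
```

On skewed data such as this, the equi-width bins tend to be very unevenly populated, while the equi-depth bins are balanced by construction.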

Suppose that you had a set of arbitrary objects of different types representing different characteristics of widgets. A domain expert gave you the similarity value between every pair of objects. How would you convert these objects into a multidimensional data set for clustering?
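One common option, sketched here under stated assumptions, is to treat the expert's similarity matrix as a matrix of dot products and recover coordinates from its top eigenvectors, in the spirit of the MDS methods cited in the notes. The function name `embed_from_similarity` is hypothetical, and the sketch assumes the similarities are symmetric and approximately positive semidefinite; negative eigenvalues are simply clipped to zero.

```python
import numpy as np


def embed_from_similarity(S, k):
    """Map n objects to k-dimensional points whose dot products approximate S."""
    S = (S + S.T) / 2.0                      # enforce symmetry
    eigvals, eigvecs = np.linalg.eigh(S)     # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]      # indices of the k largest
    lam = np.clip(eigvals[top], 0.0, None)   # clip negative eigenvalues to 0
    return eigvecs[:, top] * np.sqrt(lam)    # one row per object


# Toy check: similarities generated from hidden 2-D points.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 2))
S = hidden @ hidden.T
Y = embed_from_similarity(S, k=2)
print(np.allclose(Y @ Y.T, S))
```

The rows of `Y` can then be passed to any clustering method that expects multidimensional points. The recovered coordinates match the hidden ones only up to rotation and reflection, which does not affect distance-based clustering.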

Suppose that you had a data set, such that each data point corresponds to sea-surface temperatures over a square mile of resolution 10 × 10. In other words, each data record contains a 10 × 10 grid of temperature values with spatial locations. You also have some text associated with each 10 × 10 grid. How would you convert this data into a multidimensional data set?

Suppose that you had a set of discrete biological protein sequences that are annotated with text describing the properties of the protein. How would you create a multidimensional representation from this heterogeneous data set?

Download the Musk data set from the UCI Machine Learning Repository [213]. Apply PCA to the data set, and report the eigenvectors and eigenvalues.

Repeat the previous exercise using SVD.
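The relationship between the two preceding exercises can be sketched on stand-in data. The matrix below is synthetic, not the Musk data set, and the covariance matrix is assumed to be defined as C = DᵀD / n on mean-centered data. Under that assumption, the PCA eigenvalues equal the squared singular values divided by n, and the eigenvectors match the right singular vectors up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in data with correlated columns (not the Musk data set).
D = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))
D = D - D.mean(axis=0)                     # mean-center
n = D.shape[0]

# PCA: eigen-decomposition of the covariance matrix C = D^T D / n.
C = D.T @ D / n
eigvals, eigvecs = np.linalg.eigh(C)       # ascending order
order = np.argsort(eigvals)[::-1]          # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# SVD: D = U diag(s) V^T, singular values in decreasing order.
U, s, Vt = np.linalg.svd(D, full_matrices=False)

# On mean-centered data the two agree: eigenvalues are s^2 / n and the
# eigenvectors match the right singular vectors up to sign.
print(np.allclose(eigvals, s**2 / n))
print(np.allclose(np.abs(eigvecs.T @ Vt.T), np.eye(5), atol=1e-6))
```

If the data is not mean-centered before the SVD, the two decompositions generally differ, which is worth checking when comparing the results of the two exercises.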

For a mean-centered data set with points X₁ ... Xₙ, show that the following is true:
