Data Mining: The Textbook


11.11 Exercises

Design a modification of the uncertainty sampling approach in which the dollar costs of querying various instances are known to be different. Assume that the cost of querying instance i is known to be c_i.

Consider a situation where a classifier gives very consistent class-label predictions when trained on samples of the (training) data. Which ensemble method should you not use? Why?

Design a heuristic variant of the AdaBoost algorithm that will perform better than AdaBoost in terms of reducing the variance component of the error. Does this mean that the overall error of this ensemble variant will be lower than that of AdaBoost?

Would you rather use a linear SVM to create the ensemble component in bagging or a kernel SVM? What would you do in the case of boosting?

Consider a d-dimensional data set. Suppose that you used the 1-nearest neighbor class label in a randomly chosen subspace with dimensionality d/2 as a classification model. This classifier is repeatedly used on a test instance to create a majority-vote prediction. Discuss the bias-variance mechanism with which such a classifier will reduce error.
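To make the classifier in this exercise concrete, here is a minimal sketch, assuming NumPy, Euclidean distance, and a fixed number of voting rounds. The function name `subspace_1nn_vote` and the parameters `n_rounds` and `seed` are hypothetical and are not part of the exercise; the sketch only illustrates the setup and does not answer the bias-variance question.

```python
import numpy as np

def subspace_1nn_vote(X_train, y_train, x_test, n_rounds=25, seed=0):
    """Majority vote of 1-nearest-neighbor labels over random d/2-dimensional subspaces.

    Assumption: Euclidean distance and n_rounds independent subspace draws.
    """
    rng = np.random.default_rng(seed)
    d = X_train.shape[1]
    k = max(1, d // 2)  # subspace dimensionality d/2
    votes = []
    for _ in range(n_rounds):
        # Pick d/2 distinct dimensions uniformly at random.
        dims = rng.choice(d, size=k, replace=False)
        # Distance from the test instance to every training point, restricted to the subspace.
        dists = np.linalg.norm(X_train[:, dims] - x_test[dims], axis=1)
        votes.append(y_train[np.argmin(dists)])
    # Return the most frequent label among the votes.
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]

# Toy data (hypothetical): two well-separated clusters in four dimensions.
X_train = np.array([[0, 0, 0, 0], [1, 0, 1, 0],
                    [10, 10, 10, 10], [9, 10, 9, 10]], dtype=float)
y_train = np.array([0, 0, 1, 1])
x_test = np.array([9.5, 9.5, 9.5, 9.5])
prediction = subspace_1nn_vote(X_train, y_train, x_test)
```

Each round sees a different projection of the data, so the individual 1-nearest neighbor predictions can disagree on harder data sets; the vote aggregates them.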

For any d × n matrix A and scalar λ, use its singular value decomposition to show that the following is always true:

(AA^T + λI_d)⁻¹A = A(A^T A + λI_n)⁻¹.

Here, I_d and I_n are d × d and n × n identity matrices, respectively.
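A numerical sanity check of this identity can be sketched as follows. It is not a proof, and it assumes NumPy and a value λ > 0 so that both matrices being inverted are nonsingular. The function name `push_through_sides` and the random test matrix are hypothetical. The last lines also evaluate the form that the singular value decomposition A = UΣV^T suggests for both sides, namely U diag(σ_i / (σ_i² + λ)) V^T.

```python
import numpy as np

def push_through_sides(A, lam):
    """Return both sides of (A A^T + lam I_d)^{-1} A = A (A^T A + lam I_n)^{-1}."""
    d, n = A.shape
    # Left side: solve (A A^T + lam I_d) X = A instead of forming the inverse.
    lhs = np.linalg.solve(A @ A.T + lam * np.eye(d), A)
    # Right side: A M^{-1} = (M^{-1} A^T)^T because M = A^T A + lam I_n is symmetric.
    rhs = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T).T
    return lhs, rhs

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))  # d = 3, n = 5
lam = 0.5                        # assumed positive
lhs, rhs = push_through_sides(A, lam)

# Candidate closed form from the thin SVD A = U diag(s) V^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
svd_form = U @ np.diag(s / (s**2 + lam)) @ Vt

print(np.allclose(lhs, rhs), np.allclose(lhs, svd_form))
```

Note that the left side inverts a d × d matrix while the right side inverts an n × n matrix, which is why the identity matters computationally when d and n differ greatly.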

Let the singular value decomposition of an n × d matrix D be QΣP^T. According to Chap. 2, its pseudoinverse is PΣ⁺Q^T. Here, Σ⁺ is obtained by inverting the nonzero diagonal entries of the n × d matrix Σ and then transposing the resulting matrix.

Use this result to show that:

D⁺ = (D^T D)⁺D^T.

Show that an alternative way of computing the pseudoinverse is as follows:

D⁺ = D^T (DD^T)⁺.

Discuss the efficiency of various methods of computing the pseudoinverse of D with varying values of n and d.

Discuss the usefulness of any of the aforementioned methods for computing the pseudoinverse in the context of incorporating the kernel trick in linear regression.
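The pseudoinverse formulas in this exercise can be checked numerically with a short sketch like the one below. It assumes NumPy; the helper `pinv_from_svd`, the tolerance `tol`, and the rank-deficient test matrix are hypothetical choices made for illustration. An explicit `rcond` is passed to `np.linalg.pinv` so that singular values that are zero up to rounding error are not inverted.

```python
import numpy as np

def pinv_from_svd(D, tol=1e-10):
    """Pseudoinverse P Sigma^+ Q^T built from the thin SVD D = Q Sigma P^T."""
    Q, s, Pt = np.linalg.svd(D, full_matrices=False)
    # Invert only the singular values that are nonzero up to a relative tolerance.
    s_inv = np.zeros_like(s)
    keep = s > tol * s.max()
    s_inv[keep] = 1.0 / s[keep]
    return Pt.T @ np.diag(s_inv) @ Q.T

rng = np.random.default_rng(1)
# Rank-2 matrix with n = 6 rows and d = 4 columns, so D^T D and D D^T are both singular.
D = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))

D_plus = pinv_from_svd(D)                                # P Sigma^+ Q^T
via_gram = np.linalg.pinv(D.T @ D, rcond=1e-10) @ D.T    # (D^T D)^+ D^T, uses a d x d matrix
via_outer = D.T @ np.linalg.pinv(D @ D.T, rcond=1e-10)   # D^T (D D^T)^+, uses an n x n matrix

print(np.allclose(D_plus, via_gram), np.allclose(D_plus, via_outer))
```

The first form works with the d × d matrix D^T D and the second with the n × n matrix DD^T, whose entries are dot products between pairs of rows of D. Both observations are relevant starting points for the efficiency and kernel-trick parts of the exercise.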

Chapter 12

Mining Data Streams

“You never step into the same stream twice.”—Heraclitus

12.1 Introduction

Advances in hardware technology have led to new ways of collecting data at a more rapid rate than before. For example, many transactions of everyday life, such as using a credit card or a phone, lead to automated data collection. Similarly, new ways of collecting data, such as wearable sensors and mobile devices, have added to the deluge of dynamically available data. An important assumption in these forms of data collection is that the data continuously accumulate over time at a rapid rate. These dynamic data sets are referred to as data streams.

A key assumption in the streaming paradigm is that it is no longer possible to store all the data because of resource constraints. While it is possible to archive such data using distributed “big data” frameworks, this approach comes at the expense of enormous storage costs and the loss of real-time processing capabilities. In many cases, such frameworks are not practical because of high costs and other analytical considerations. The streaming framework provides an alternative approach, where real-time analysis can often be performed with carefully designed algorithms, without a significant investment in specialized infrastructure. Some examples of application domains relevant to streaming data are as follows:
