Data Mining: The Textbook




11.11. EXERCISES





  1. Design a modification of the uncertainty sampling approach in which the dollar-costs of querying various instances are known to be different. Assume that the cost of querying instance $i$ is known to be $c_i$.




  1. Consider a situation where a classifier gives very consistent class-label predictions when trained on samples of the (training) data. Which ensemble method should you not use? Why?




  1. Design a heuristic variant of the AdaBoost algorithm, which will perform better than AdaBoost in terms of reducing the variance component of the error. Does this mean that the overall error of this ensemble variant will be lower than that of AdaBoost?




  1. Would you rather use a linear SVM to create the ensemble component in bagging or a kernel SVM? What would you do in the case of boosting?




  1. Consider a d-dimensional data set. Suppose that you used the 1-nearest neighbor class label in a randomly chosen subspace with dimensionality d/2 as a classification model. This classifier is repeatedly used on a test instance to create a majority-vote prediction. Discuss the bias-variance mechanism with which such a classifier will reduce error.




  1. For any $d \times n$ matrix $A$ and scalar $\lambda$, use its singular value decomposition to show that the following is always true:

$$(A A^T + \lambda I_d)^{-1} A = A (A^T A + \lambda I_n)^{-1}.$$


Here, $I_d$ and $I_n$ are $d \times d$ and $n \times n$ identity matrices, respectively.





  1. Let the singular value decomposition of an $n \times d$ matrix $D$ be $Q \Sigma P^T$. According to Chap. 2, its pseudoinverse is $P \Sigma^+ Q^T$. Here, $\Sigma^+$ is obtained by inverting the nonzero diagonal entries of the $n \times d$ matrix $\Sigma$ and then transposing the resulting matrix.

    1. Use this result to show that:

       $$D^+ = (D^T D)^+ D^T.$$

    1. Show that an alternative way of computing the pseudoinverse is as follows:

       $$D^+ = D^T (D D^T)^+.$$

    1. Discuss the efficiency of various methods of computing the pseudoinverse of $D$ with varying values of $n$ and $d$.

    1. Discuss the usefulness of any of the aforementioned methods for computing the pseudoinverse in the context of incorporating the kernel trick in linear regression.
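Before attempting the proofs in the last two exercises, it can help to confirm the identities numerically. The following is a minimal sketch, not part of the original exercises. It assumes NumPy is available, uses small random matrices, and picks $\lambda > 0$ so that both regularized inverses exist. The matrix sizes, the seed, and the `rcond` cutoff are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

# Illustrative sizes: d features, n points, and a positive lambda (assumption).
d, n, lam = 4, 6, 0.5

# Identity for a d x n matrix A:
# (A A^T + lam I_d)^{-1} A  ==  A (A^T A + lam I_n)^{-1}
A = rng.standard_normal((d, n))
lhs = np.linalg.inv(A @ A.T + lam * np.eye(d)) @ A
rhs = A @ np.linalg.inv(A.T @ A + lam * np.eye(n))
push_through_ok = np.allclose(lhs, rhs)

# Pseudoinverse identities for an n x d matrix D.
D = rng.standard_normal((n, d))
D_pinv = np.linalg.pinv(D)

# (D^T D)^+ D^T: the pseudoinverse is taken of a d x d matrix.
via_gram = np.linalg.pinv(D.T @ D, rcond=1e-10) @ D.T

# D^T (D D^T)^+: the pseudoinverse is taken of an n x n matrix, which is
# singular when n > d, so an explicit cutoff for zero singular values is used.
via_kernel = D.T @ np.linalg.pinv(D @ D.T, rcond=1e-10)

print(push_through_ok,
      np.allclose(D_pinv, via_gram),
      np.allclose(D_pinv, via_kernel))
```

A numerical check of this kind is not a proof, but it makes the shapes involved explicit: the first pseudoinverse form works with a $d \times d$ matrix, while the second works with the $n \times n$ matrix of pairwise dot products between data points.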



Chapter 12


Mining Data Streams

“You never step into the same stream twice.”—Heraclitus


12.1 Introduction


Advances in hardware technology have led to new ways of collecting data at a more rapid rate than before. For example, many transactions of everyday life, such as using a credit card or a phone, lead to automated data collection. Similarly, new ways of collecting data, such as wearable sensors and mobile devices, have added to the deluge of dynamically available data. An important assumption in these forms of data collection is that the data continuously accumulate over time at a rapid rate. These dynamic data sets are referred to as data streams.


A key assumption in the streaming paradigm is that it is no longer possible to store all the data because of resource constraints. While it is possible to archive such data using distributed “big data” frameworks, this approach comes at the expense of enormous storage costs and the loss of real-time processing capabilities. In many cases, such frameworks are not practical because of high costs and other analytical considerations. The streaming framework provides an alternative approach, where real-time analysis can often be performed with carefully designed algorithms, without a significant investment in specialized infrastructure. Some examples of application domains relevant to streaming data are as follows:





