Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə174/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   170   171   172   173   174   175   176   177   ...   423
1-Data Mining tarjima

9.8. EXERCISES

283

9.8 Exercises





  1. Suppose that algorithm A is designed for outlier detection in numeric data, whereas algorithm B is designed for outlier detection in categorical data. Show how you can use these algorithms to perform outlier detection in a mixed-attribute data set.




  1. Design an algorithm for categorical outlier detection using the Mahalanobis distance. What are the advantages of such an approach?




  1. Implement a distance-based outlier detection algorithm with the use of match-based similarity.




  1. Design a feature bagging approach that uses arbitrary subspaces of the data rather than axis-parallel ones. Show how arbitrary subspaces may be efficiently sampled in a data distribution-sensitive way.




  1. Compare and contrast multiview clustering with subspace ensembles in outlier detec-tion.




  1. Implement any two outlier detection algorithms of your choice. Convert the scores to Z-numbers. Combine the scores using the max function.



Chapter 10


Data Classification

Science is the systematic classification of experience.”—George Henry Lewes


10.1 Introduction


The classification problem is closely related to the clustering problem discussed in Chaps. 6 and 7. While the clustering problem is that of determining similar groups of data points, the classification problem is that of learning the structure of a data set of examples, already partitioned into groups , that are referred to as categories or classes. The learning of these categories is typically achieved with a model. This model is used to estimate the group identifiers (or class labels) of one or more previously unseen data examples with unknown labels. Therefore, one of the inputs to the classification problem is an example data set that has already been partitioned into different classes. This is referred to as the training data, and the group identifiers of these classes are referred to as class labels. In most cases, the class labels have a clear semantic interpretation in the context of a specific application, such as a group of customers interested in a specific product, or a group of data objects with a desired property of interest. The model learned is referred to as the training model. The previously unseen data points that need to be classified are collectively referred to as the test data set. The algorithm that creates the training model for prediction is also sometimes referred to as the learner.


Classification is, therefore, referred to as supervised learning because an example data set is used to learn the structure of the groups, just as a teacher supervises his or her students towards a specific goal. While the groups learned by a classification model may often be related to the similarity structure of the feature variables, as in clustering, this need not necessarily be the case. In classification, the example training data is paramount in providing the guidance of how groups are defined. Given a data set of test examples, the groups created by a classification model on the test examples will try to mirror the number and structure of the groups available in the example data set of training instances. Therefore, the classification problem may be intuitively stated as follows:






C. C. Aggarwal, Data Mining: The Textbook, DOI 10.1007/978-3-319-14142-8 10

285

c Springer International Publishing Switzerland 2015



286 CHAPTER 10. DATA CLASSIFICATION



Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   170   171   172   173   174   175   176   177   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin