Data Mining: The Textbook
1.8 Bibliographic Notes

The problem of data mining is generally studied by multiple research communities corresponding to statistics, data mining, and machine learning. These communities are highly overlapping and often share many researchers in common. The machine learning and statistics communities generally approach data mining from a theoretical and statistical perspective. Some good books written in this context may be found in [95, 256, 389]. However, because the machine learning community is generally focused on supervised learning methods, these books are mostly focused on the classification scenario. More general data mining books, which are written from a broader perspective, may be found in [250, 485, 536]. Because the data mining process often has to interact with databases, a number of relevant database textbooks [434, 194] provide knowledge about data representation and integration issues.

A number of books have also been written on each of the major areas of data mining. The frequent pattern mining problem and its variations have been covered in detail in [34]. Numerous books have been written on the topic of data clustering. A well-known data clustering book [284] discusses the classical techniques from the literature. Another book [219] discusses the more recent methods for data clustering, although the material is somewhat basic. The most recent book [32] in the literature provides a very comprehensive overview of the different data clustering algorithms. The problem of data classification has been addressed in the standard machine learning books [95, 256, 389]. The classification problem has also been studied extensively by the pattern recognition community [189]. More recent surveys on the topic may be found in [33]. The problem of outlier detection has been studied in detail in [89, 259]. These books are, however, written from a statistical perspective and do not address the problem from the perspective of the computer science community. The problem has been addressed from the perspective of the computer science community in [5].

1.9 Exercises
(c) ZIP code, (d) State of residence, (e) Height, (f) Weight?
26 CHAPTER 1. AN INTRODUCTION TO DATA MINING

containing her own readings and the pressure readings. What is the process of creating such a single database called?
Chapter 2

Data Preparation

“Success depends upon previous preparation, and without such preparation there is sure to be failure.”
- Confucius

2.1 Introduction

The raw format of real data is usually widely variable. Many values may be missing, inconsistent across different data sources, and erroneous. For the analyst, this leads to numerous challenges in using the data effectively. For example, consider the case of evaluating the interests of consumers from their activity on a social media site. The analyst may first need to determine the types of activity that are valuable to the mining process. The activity might correspond to the interests entered by the user, the comments entered by the user, and the set of friendships of the user along with their interests. All these pieces of information are diverse and need to be collected from different databases within the social media site. Furthermore, some forms of data, such as raw logs, are often not directly usable because of their unstructured nature. In other words, useful features need to be extracted from these data sources. Therefore, a data preparation phase is needed. The data preparation phase is a multistage process that comprises several individual steps, some or all of which may be used in a given application. These steps are as follows: