Data Mining: The Textbook
1.8 Bibliographic Notes

The problem of data mining is generally studied by multiple research communities corresponding to statistics, data mining, and machine learning. These communities are highly overlapping and often share many researchers in common. The machine learning and statistics communities generally approach data mining from a theoretical and statistical perspective. Some good books written in this context may be found in [95, 256, 389]. However, because the machine learning community is generally focused on supervised learning methods, these books are mostly focused on the classification scenario. More general data mining books, which are written from a broader perspective, may be found in [250, 485, 536]. Because the data mining process often has to interact with databases, a number of relevant database textbooks [434, 194] provide knowledge about data representation and integration issues.

A number of books have also been written on each of the major areas of data mining. The frequent pattern mining problem and its variations have been covered in detail in [34]. Numerous books have been written on the topic of data clustering. A well-known data clustering book [284] discusses the classical techniques from the literature. Another book [219] discusses the more recent methods for data clustering, although the material is somewhat basic. The most recent book [32] in the literature provides a very comprehensive overview of the different data clustering algorithms. The problem of data classification has been addressed in the standard machine learning books [95, 256, 389]. The classification problem has also been studied extensively by the pattern recognition community [189]. More recent surveys on the topic may be found in [33]. The problem of outlier detection has been studied in detail in [89, 259].
These books are, however, written from a statistical perspective and do not address the problem from the perspective of the computer science community. The problem has been addressed from the perspective of the computer science community in [5].

1.9 Exercises

1. An analyst collects surveys from different participants about their likes and dislikes. Subsequently, the analyst uploads the data to a database, corrects erroneous or missing entries, and designs a recommendation algorithm on this basis. Which of the following actions represent data collection, data preprocessing, and data analysis? (a) Conducting surveys and uploading to database, (b) correcting missing entries, (c) designing a recommendation algorithm.

2. What is the data type of each of the following kinds of attributes: (a) Age, (b) Salary, (c) ZIP code, (d) State of residence, (e) Height, (f) Weight?

3. An analyst obtains medical notes from a physician for data mining purposes, and then transforms them into a table containing the medicines prescribed for each patient. What is the data type of (a) the original data, and (b) the transformed data? (c) What is the process of transforming the data to the new format called?

4. An analyst sets up a sensor network in order to measure the temperature of different locations over a period. What is the data type of the data collected?

5. The same analyst as discussed in Exercise 4 above finds another database from a different source containing pressure readings. She decides to create a single database containing her own readings and the pressure readings. What is the process of creating such a single database called?

6. An analyst processes Web logs in order to create records with the ordering information for Web page accesses from different users. What is the type of this data?

7. Consider a data object corresponding to a set of nucleotides arranged in a certain order. What is this type of data?

8. It is desired to partition customers into similar groups on the basis of their demographic profile. Which data mining problem is best suited to this task?

9. Suppose in Exercise 8, the merchant already knows for some of the customers whether or not they have bought widgets. Which data mining problem would be suited to the task of identifying groups among the remaining customers, who might buy widgets in the future?

10. Suppose in Exercise 9, the merchant also has information for other items bought by the customers (beyond widgets). Which data mining problem would be best suited to finding sets of items that are often bought together with widgets?

11. Suppose that a small number of customers lie about their demographic profile, and this results in a mismatch between the buying behavior and the demographic profile, as suggested by comparison with the remaining data. Which data mining problem would be best suited to finding such customers?

Chapter 2
Data Preparation

“Success depends upon previous preparation, and without such preparation there is sure to be failure.”
-- Confucius

2.1 Introduction

The raw format of real data is usually widely variable. Many values may be missing, inconsistent across different data sources, and erroneous. For the analyst, this leads to numerous challenges in using the data effectively. For example, consider the case of evaluating the interests of consumers from their activity on a social media site. The analyst may first need to determine the types of activity that are valuable to the mining process. The activity might correspond to the interests entered by the user, the comments entered by the user, and the set of friendships of the user along with their interests. All these pieces of information are diverse and need to be collected from different databases within the social media site. Furthermore, some forms of data, such as raw logs, are often not directly usable because of their unstructured nature.
In other words, useful features need to be extracted from these data sources. Therefore, a data preparation phase is needed. The data preparation phase is a multistage process that comprises several individual steps, some or all of which may be used in a given application. These steps are as follows: