Data Mining: The Textbook






Figure 1.1: The data processing pipeline (blocks: data collection; data preprocessing, comprising feature extraction and cleaning and integration; analytical processing, comprising building block 1 and building block 2; output for analyst; optional feedback)


possible to directly use a standard data mining problem, such as the four “superproblems” discussed earlier, for the application at hand. However, these four problems have such wide coverage that many applications can be broken up into components that use these different building blocks. This book will provide examples of this process.


The overall data mining process is illustrated in Fig. 1.1. Note that the analytical block in Fig. 1.1 shows multiple building blocks representing the design of the solution to a particular application. This part of the algorithmic design is dependent on the skill of the analyst and often uses one or more of the four major problems as a building block. This is, of course, not always the case, but it is frequent enough to merit special treatment of these four problems within this book. To explain the data mining process, we will use an example from a recommendation scenario.


Example 1.2.1 Consider a scenario in which a retailer has Web logs corresponding to customer accesses to Web pages at his or her site. Each of these Web pages corresponds to a product, and therefore a customer access to a page may often be indicative of interest in that particular product. The retailer also stores demographic profiles for the different customers. The retailer wants to make targeted product recommendations to customers using the customer demographics and buying behavior.

Sample Solution Pipeline In this case, the first step for the analyst is to collect the relevant data from two different sources. The first source is the set of Web logs at the site. The second is the demographic information within the retailer database that was collected during Web registration of the customer. Unfortunately, these data sets are in very different formats and cannot easily be used together for processing. For example, consider a sample log entry of the following form:


98.206.207.157 - - [31/Jul/2013:18:09:38 -0700] "GET /productA.htm HTTP/1.1" 200 328177 "-" "Mozilla/5.0 (Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10B329 Safari/8536.25" "retailer.net"

The log may contain hundreds of thousands of such entries. Here, a customer at IP address 98.206.207.157 has accessed productA.htm. The customer from the IP address can be identified using the previous login information, by using cookies, or by the IP address itself, but this may be a noisy process and may not always yield accurate results. The analyst would need to design algorithms for deciding how to filter the different log entries and use only those which provide accurate results as a part of the cleaning and extraction process. Furthermore, the raw log contains a lot of additional information that is not necessarily of any use to the retailer. In the feature extraction process, the retailer decides to create one record for each customer, with a specific choice of features extracted from the Web page accesses. Each attribute of a record corresponds to the number of accesses to one product description. Therefore, the raw logs need to be processed, and the accesses need to be aggregated during this feature extraction phase. Attributes from the retailer’s database containing demographic information are then added to these records in a data integration phase. Missing entries from the demographic records need to be estimated as a further data cleaning step. This results in a single data set containing attributes for the customer demographics and customer accesses.
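The extraction, integration, and cleaning steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the book's implementation: the regular expression, the function names (`parse_entry`, `access_counts`, `integrate`), the IP-to-customer lookup table, the `age` attribute, the second and third log lines, and the choice of mean imputation for missing values are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

# Assumed pattern for entries shaped like the sample log line; only the
# fields needed for feature extraction are captured.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)

def parse_entry(line):
    """Return (ip, path) for a successful GET request, else None."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # malformed entry: dropped during cleaning
    if match["method"] != "GET" or match["status"] != "200":
        return None  # keep only successful page accesses
    return match["ip"], match["path"]

def access_counts(lines, ip_to_customer):
    """Feature extraction: count page accesses per customer."""
    counts = defaultdict(Counter)
    for line in lines:
        parsed = parse_entry(line)
        if parsed is None:
            continue
        ip, path = parsed
        customer = ip_to_customer.get(ip)
        if customer is not None:  # skip accesses we cannot attribute
            counts[customer][path] += 1
    return counts

def integrate(counts, demographics, products):
    """Integration and cleaning: one record per customer, with a missing
    age replaced by the mean of the known ages (an assumed strategy)."""
    known = [d["age"] for d in demographics.values() if d["age"] is not None]
    mean_age = sum(known) / len(known) if known else None
    records = {}
    for customer, demo in demographics.items():
        record = {"age": mean_age if demo["age"] is None else demo["age"]}
        per_product = counts.get(customer, Counter())
        for product in products:
            record[product] = per_product[product]
        records[customer] = record
    return records

# Hypothetical inputs; only the first line is modeled on the sample entry.
logs = [
    '98.206.207.157 - - [31/Jul/2013:18:09:38 -0700] "GET /productA.htm HTTP/1.1" 200 328177 "-" "Mozilla/5.0" "retailer.net"',
    '98.206.207.157 - - [31/Jul/2013:18:10:02 -0700] "GET /productA.htm HTTP/1.1" 200 328177 "-" "Mozilla/5.0" "retailer.net"',
    '10.0.0.8 - - [31/Jul/2013:18:11:00 -0700] "GET /productB.htm HTTP/1.1" 404 512 "-" "Mozilla/5.0" "retailer.net"',
    'not a log line',
]
ip_to_customer = {"98.206.207.157": "c1", "10.0.0.8": "c2"}
demographics = {"c1": {"age": 30}, "c2": {"age": None}}
products = ["/productA.htm", "/productB.htm"]

records = integrate(access_counts(logs, ip_to_customer), demographics, products)
```

In this sketch the failed request and the malformed line are filtered out, so customer `c2` ends up with zero accesses and an imputed age. A real pipeline would likely identify customers through logins or cookies rather than a fixed IP table, since, as noted above, IP-based identification is noisy.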


At this point, the analyst has to decide how to use this cleaned data set for making recommendations. He or she decides to determine similar groups of customers, and make recommendations on the basis of the buying behavior of these similar groups. In particular, the building block of clustering is used to determine similar groups. For a given customer, the most frequent items accessed by the customers in that group are recommended. This provides an example of the entire data mining pipeline. As you will learn in Chap. 18, there are many elegant ways of performing the recommendations, some of which are more effective than the others depending on the specific definition of the problem. Therefore, the entire data mining process is an art form, which is based on the skill of the analyst, and cannot be fully captured by a single technique or building block. In practice, this skill can be learned only by working with a diversity of applications over different scenarios and data types.
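The clustering-based recommendation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the method developed later in the book: it uses a plain k-means written from scratch, seeds the centroids with the first k records so the result is deterministic, and clusters on access counts only (the demographic attributes are left out for brevity). The function names, customer identifiers, and access counts are all hypothetical.

```python
from collections import Counter

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def recommend(customer, customers, vectors, products, labels, top_n=1):
    """Return the products most frequently accessed within the customer's group."""
    group = labels[customers.index(customer)]
    totals = Counter()
    for vector, label in zip(vectors, labels):
        if label == group:
            for product, count in zip(products, vector):
                totals[product] += count
    return [product for product, _ in totals.most_common(top_n)]

# Hypothetical access counts per customer for products A, B, and C.
products = ["A", "B", "C"]
customers = ["c1", "c2", "c3", "c4"]
vectors = [
    [5, 1, 0],  # c1
    [0, 0, 4],  # c2
    [4, 2, 0],  # c3
    [0, 1, 5],  # c4
]

labels = kmeans(vectors, k=2)
print(recommend("c3", customers, vectors, products, labels))  # ['A']
print(recommend("c4", customers, vectors, products, labels))  # ['C']
```

Here `c1` and `c3` fall into one group and `c2` and `c4` into the other, so each customer is recommended the item accessed most often within their own group. In practice one might exclude items the customer has already viewed, or normalize the attributes before clustering; those choices depend on the specific definition of the problem.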


1.2.1 The Data Preprocessing Phase


The data preprocessing phase is perhaps the most crucial one in the data mining process. Yet, it is rarely explored to the extent that it deserves because most of the focus is on the analytical aspects of data mining. This phase begins after the collection of the data, and it consists of the following steps:





