Data Mining: The Textbook




2.3. DATA CLEANING


The aforementioned issues may be a significant source of inaccuracy for data mining applications. Methods are needed to remove or correct missing and erroneous entries from the data. There are several important aspects of data cleaning:





  1. Handling missing entries: Many entries in the data may remain unspecified because of weaknesses in data collection or the inherent nature of the data. Such missing entries may need to be estimated. The process of estimating missing entries is also referred to as imputation.




  2. Handling incorrect entries: In cases where the same information is available from multiple sources, inconsistencies may be detected. Such inconsistencies can be removed as a part of the analytical process. Another method for detecting the incorrect entries is to use domain-specific knowledge about what is already known about the data. For example, if a person’s height is listed as 6 m, it is most likely incorrect. More generally, data points that are inconsistent with the remaining data distribution are often noisy. Such data points are referred to as outliers. It is, however, dangerous to assume that such data points are always caused by errors. For example, a record representing credit card fraud is likely to be inconsistent with respect to the patterns in most of the (normal) data but should not be removed as “incorrect” data.




  3. Scaling and normalization: The data may often be expressed in very different scales (e.g., age and salary). This may result in some features being inadvertently weighted too much so that the other features are implicitly ignored. Therefore, it is important to normalize the different features.

The following sections will discuss each of these aspects of data cleaning.


2.3.1 Handling Missing Entries


Missing entries are common in databases where the data collection methods are imperfect. For example, user surveys are often unable to collect responses to all questions. In cases where data contribution is voluntary, the data is almost always incompletely specified. Three classes of techniques are used to handle missing entries:





  1. Any data record containing a missing entry may be eliminated entirely. However, this approach may not be practical when most of the records contain missing entries.




  2. The missing values may be estimated or imputed. However, errors created by the imputation process may affect the results of the data mining algorithm.




  3. The analytical phase is designed in such a way that it can work with missing values. Many data mining methods are inherently designed to work robustly with missing values. This approach is usually the most desirable because it avoids the additional biases inherent in the imputation process.
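The first two strategies can be illustrated with a minimal sketch. The toy records, the attribute names, and the choice of the attribute mean as the imputed value are assumptions made for illustration; the text does not prescribe a particular estimator.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing entry.
records = [
    {"age": 25, "salary": 50000},
    {"age": None, "salary": 62000},
    {"age": 41, "salary": None},
    {"age": 33, "salary": 58000},
]

def drop_incomplete(rows):
    """Strategy 1: discard every record with at least one missing entry."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute_mean(rows):
    """Strategy 2 (assumed estimator): fill each missing entry with the
    mean of the observed values of the same attribute."""
    attributes = list(rows[0].keys())
    means = {a: mean(r[a] for r in rows if r[a] is not None) for a in attributes}
    return [
        {a: (r[a] if r[a] is not None else means[a]) for a in attributes}
        for r in rows
    ]

complete = drop_incomplete(records)   # only the fully specified records survive
imputed = impute_mean(records)        # all records survive, with estimated entries
```

In this toy data set, deletion discards half of the records, which shows why the first strategy becomes impractical when missing entries are widespread.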

The problem of estimating missing entries is directly related to the classification problem. In the classification problem, a single attribute is treated specially, and the other features are used to estimate its value. In this case, the missing value can occur on any feature, and therefore the problem is more challenging, although it is fundamentally not different. Many of the methods discussed in Chaps. 10 and 11 for classification can also be used for missing value estimation. In addition, the matrix completion methods discussed in Sect. 18.5 of Chap. 18 may also be used.
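To make the connection with classification and regression concrete, the following sketch estimates a missing attribute from the other features using the nearest complete records. The nearest-neighbor estimator, the function name `knn_impute`, and the toy data are assumptions chosen for illustration. The sketch also makes the simplifying assumption that only the target attribute can be missing, whereas in general the missing value can occur on any feature.

```python
import math

def knn_impute(rows, target, k=2):
    """Fill missing values of `target` with the average of that attribute
    over the k nearest records in which it is observed.

    Distance is Euclidean over the remaining attributes, which are
    assumed to be fully observed (a simplification).
    """
    features = [a for a in rows[0] if a != target]
    known = [r for r in rows if r[target] is not None]
    filled = []
    for r in rows:
        if r[target] is not None:
            filled.append(dict(r))
            continue
        query = [r[a] for a in features]
        nearest = sorted(
            known,
            key=lambda q: math.dist(query, [q[a] for a in features]),
        )[:k]
        estimate = sum(q[target] for q in nearest) / len(nearest)
        filled.append({**r, target: estimate})
    return filled

# Hypothetical toy data: the last record is missing attribute "z".
rows = [
    {"x": 1.0, "y": 1.0, "z": 10.0},
    {"x": 1.2, "y": 0.9, "z": 12.0},
    {"x": 8.0, "y": 9.0, "z": 50.0},
    {"x": 1.1, "y": 1.0, "z": None},
]
result = knn_impute(rows, "z", k=2)
```

The two nearest records to the incomplete one are the first two rows, so the estimate is the average of their `z` values.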


Figure 2.1: Finding noise by data-centric methods (scatter plot of Feature Y against Feature X; two isolated points are marked as noise)


In the case of dependency-oriented data, such as time series or spatial data, missing value estimation is much simpler. In this case, the behavioral attribute values of contextually nearby records are used for the imputation process. For example, in a time-series data set, the average of the values at the time stamp just before or after the missing attribute may be used for estimation. Alternatively, the behavioral values at the last n time-series data stamps can be linearly interpolated to determine the missing value. For the case of spatial data, the estimation process is quite similar, where the average of values at neighboring spatial locations may be used.
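Both time-series estimates can be sketched briefly. The function names and the toy series are hypothetical, and the second function reads "linearly interpolated" as fitting a least-squares line to the last n observed values before the gap, which is one possible interpretation rather than the only one.

```python
def neighbor_average(values, i):
    """Average of the observed values at the time stamps immediately
    before and after position i (whichever of the two are available)."""
    neighbors = [
        values[j] for j in (i - 1, i + 1)
        if 0 <= j < len(values) and values[j] is not None
    ]
    if not neighbors:
        raise ValueError("no observed neighbor to impute from")
    return sum(neighbors) / len(neighbors)

def linear_from_last_n(times, values, i, n=3):
    """Fit a least-squares line to the last n observed points before
    position i and evaluate it at times[i]."""
    pts = [(times[j], values[j]) for j in range(i) if values[j] is not None][-n:]
    if len(pts) < 2:
        raise ValueError("need at least two earlier observations")
    t_mean = sum(t for t, _ in pts) / len(pts)
    v_mean = sum(v for _, v in pts) / len(pts)
    slope = (
        sum((t - t_mean) * (v - v_mean) for t, v in pts)
        / sum((t - t_mean) ** 2 for t, _ in pts)
    )
    return v_mean + slope * (times[i] - t_mean)

# Hypothetical series with one missing value at time stamp 3.
times = [0, 1, 2, 3, 4]
values = [2.0, 4.0, 6.0, None, 10.0]

avg_estimate = neighbor_average(values, 3)
line_estimate = linear_from_last_n(times, values, 3, n=3)
```

Because the toy series is exactly linear, the two estimates coincide here; on real data they generally differ.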

2.3.2 Handling Incorrect and Inconsistent Entries

The key methods that are used for removing or correcting the incorrect and inconsistent entries are as follows:





  1. Inconsistency detection: This is typically done when the data is available from different sources in different formats. For example, a person’s name may be spelled out in full in one source, whereas the other source may only contain the initials and a last name. In such cases, the key issues are duplicate detection and inconsistency detection. These topics are studied under the general umbrella of data integration within the database field.




  2. Domain knowledge: A significant amount of domain knowledge is often available in terms of the ranges of the attributes or rules that specify the relationships across different attributes. For example, if the country field is “United States,” then the city field cannot be “Shanghai.” Many data scrubbing and data auditing tools have been developed that use such domain knowledge and constraints to detect incorrect entries.




  3. Data-centric methods: In these cases, the statistical behavior of the data is used to detect outliers. For example, the two isolated data points in Fig. 2.1 marked as “noise” are outliers. These isolated points might have arisen because of errors in the data collection process. However, this may not always be the case because the anomalies may be the result of interesting behavior of the underlying system. Therefore, any detected outlier may need to be manually examined before it is discarded. The use of data-centric methods for cleaning can sometimes be dangerous because they can result in the removal of useful knowledge from the underlying system. Outlier detection is an important analytical technique in its own right, and is discussed in detail in Chaps. 8 and 9.
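A data-centric check in the spirit of Fig. 2.1 can be sketched as follows. The scoring rule (distance to the k-th nearest neighbor), the threshold, and the sample points are assumptions chosen for illustration and are not taken from the figure. The function only flags points for manual review and does not delete them, in keeping with the caution above.

```python
import math

def isolation_scores(points, k=2):
    """Score each point by its Euclidean distance to its k-th nearest
    neighbor; isolated points receive large scores."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

def flag_for_review(points, k=2, threshold=3.0):
    """Return the points whose score exceeds the (assumed) threshold."""
    return [p for p, s in zip(points, isolation_scores(points, k)) if s > threshold]

# Hypothetical data: two tight clusters plus two isolated points.
points = [
    (1.0, 4.0), (1.5, 4.5), (2.0, 4.2), (1.2, 3.8),
    (9.0, 8.0), (9.5, 8.4), (10.0, 8.1), (9.3, 7.7),
    (5.0, 10.0), (14.0, 5.0),
]
suspects = flag_for_review(points)
```

Whether a flagged point is an error or an interesting anomaly still has to be decided by examining it.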


The methods for addressing erroneous and inconsistent entries are generally highly domain specific.
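As an illustration of domain-specific checking, the sketch below encodes one range constraint and one cross-attribute rule. The field names, the height range, and the small country-to-city lookup table are hypothetical; a real auditing tool would draw on much richer reference data.

```python
# Hypothetical reference data: plausible human height range in meters
# and a deliberately tiny lookup of known cities per country.
HEIGHT_RANGE_M = (0.3, 2.8)
KNOWN_CITIES = {
    "United States": {"New York", "Chicago"},
    "China": {"Shanghai", "Beijing"},
}

def find_violations(record):
    """Return a list of human-readable rule violations for one record."""
    violations = []

    # Range constraint on a single attribute.
    height = record.get("height_m")
    if height is not None and not (HEIGHT_RANGE_M[0] <= height <= HEIGHT_RANGE_M[1]):
        violations.append("height_m out of range")

    # Rule relating two attributes; skipped when the country is unknown
    # to the lookup table or the city is missing.
    country, city = record.get("country"), record.get("city")
    if country in KNOWN_CITIES and city is not None and city not in KNOWN_CITIES[country]:
        violations.append("city inconsistent with country")

    return violations

bad = find_violations({"country": "United States", "city": "Shanghai", "height_m": 6.0})
good = find_violations({"country": "China", "city": "Shanghai", "height_m": 1.7})
```

Because the toy lookup table is incomplete, it would also flag legitimate cities that it does not list, so violations are best treated as candidates for correction rather than as confirmed errors.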


2.3.3 Scaling and Normalization


In many scenarios, the different features represent different scales of reference and may therefore not be comparable to one another. For example, an attribute such as age is drawn on a very different scale than an attribute such as salary. The latter attribute is typically orders of magnitude larger than the former. As a result, any aggregate function computed on the different features (e.g., Euclidean distances) will be dominated by the attribute of larger magnitude.
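A small numeric example shows the effect. The three (age, salary) records are hypothetical.

```python
import math

# Hypothetical records: (age in years, annual salary).
a = (25, 50000)
b = (65, 50500)   # 40 years older than a, salary differs by 500
c = (26, 52000)   # 1 year older than a, salary differs by 2000

d_ab = math.dist(a, b)  # sqrt(40**2 + 500**2), roughly 501.6
d_ac = math.dist(a, c)  # sqrt(1**2 + 2000**2), roughly 2000.0
```

On the raw scales, record a is closer to b than to c, even though a and b differ in age by 40 years. The salary difference decides the outcome almost entirely, and the age attribute is effectively ignored.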


To address this problem, it is common to use standardization. Consider the case where the jth attribute has mean $\mu_j$ and standard deviation $\sigma_j$. Then, the jth attribute value $x_i^j$ of the ith record $X_i$ may be normalized as follows:

$$z_i^j = \frac{x_i^j - \mu_j}{\sigma_j} \tag{2.2}$$

The vast majority of the normalized values will typically lie in the range [−3, 3] under the normal distribution assumption.
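Standardization of a single attribute can be sketched as follows. The use of the population standard deviation (`statistics.pstdev`) rather than the sample standard deviation is an assumption, since the text does not specify which is intended, and the sample values are hypothetical.

```python
from statistics import mean, pstdev

def standardize(column):
    """Subtract the attribute mean and divide by its standard deviation.

    Uses the population standard deviation (an assumption).
    """
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        raise ValueError("a constant attribute cannot be standardized")
    return [(x - mu) / sigma for x in column]

# Hypothetical attribute values.
ages = [20, 30, 40, 50, 60]
z = standardize(ages)
```

After the transformation, the attribute has mean 0 and standard deviation 1, so attributes that were originally on very different scales become comparable.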


A second approach uses min-max scaling to map all attributes to the range [0, 1]. Let $\min_j$ and $\max_j$ represent the minimum and maximum values of attribute j. Then, the jth attribute value $x_i^j$ of the ith record $X_i$ may be scaled as follows:

$$y_i^j = \frac{x_i^j - \min_j}{\max_j - \min_j}$$
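Min-max scaling of a single attribute can be sketched in the same way, using the standard form in which the attribute minimum is subtracted and the result is divided by the attribute range. The function name and the sample values are hypothetical.

```python
def min_max_scale(column):
    """Map the values of one attribute linearly onto [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        raise ValueError("a constant attribute cannot be min-max scaled")
    return [(x - lo) / (hi - lo) for x in column]

# Hypothetical attribute values.
salaries = [30000, 45000, 60000, 90000]
scaled = min_max_scale(salaries)
```

The smallest value maps to 0 and the largest to 1. Because the minimum and maximum define the mapping, a single extreme value compresses all the other scaled values into a narrow part of the range.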

