Data Mining: The Textbook


2.3. DATA CLEANING

The aforementioned issues may be a significant source of inaccuracy for data mining applications. Methods are needed to remove or correct missing and erroneous entries from the data. There are several important aspects of data cleaning:

1. Handling missing entries: Many entries in the data may remain unspecified because of weaknesses in data collection or the inherent nature of the data. Such missing entries may need to be estimated. The process of estimating missing entries is also referred to as imputation.

2. Handling incorrect entries: In cases where the same information is available from multiple sources, inconsistencies may be detected. Such inconsistencies can be removed as a part of the analytical process. Another method for detecting incorrect entries is to use domain-specific knowledge about the data. For example, if a person’s height is listed as 6 m, it is most likely incorrect. More generally, data points that are inconsistent with the remaining data distribution are often noisy. Such data points are referred to as outliers. It is, however, dangerous to assume that such data points are always caused by errors. For example, a record representing credit card fraud is likely to be inconsistent with respect to the patterns in most of the (normal) data but should not be removed as “incorrect” data.

3. Scaling and normalization: The data may often be expressed in very different scales (e.g., age and salary). This may result in some features being inadvertently weighted too much so that the other features are implicitly ignored. Therefore, it is important to normalize the different features.

The following sections will discuss each of these aspects of data cleaning.

2.3.1 Handling Missing Entries

Missing entries are common in databases where the data collection methods are imperfect. For example, user surveys are often unable to collect responses to all questions. In cases where data contribution is voluntary, the data is almost always incompletely specified. Three classes of techniques are used to handle missing entries:

1. Any data record containing a missing entry may be eliminated entirely. However, this approach may not be practical when most of the records contain missing entries.

2. The missing values may be estimated or imputed. However, errors created by the imputation process may affect the results of the data mining algorithm.

3. The analytical phase is designed in such a way that it can work with missing values. Many data mining methods are inherently designed to work robustly with missing values. This approach is usually the most desirable because it avoids the additional biases inherent in the imputation process.

The problem of estimating missing entries is directly related to the classification problem. In the classification problem, a single attribute is treated specially, and the other features are used to estimate its value. In this case, the missing value can occur on any feature, and therefore the problem is more challenging, although it is fundamentally not different. Many of the methods discussed in Chaps. 10 and 11 for classification can also be used for missing value estimation. In addition, the matrix completion methods discussed in Sect. 18.5 of Chap. 18 may also be used.
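To make the link between imputation and prediction concrete, the following is a minimal sketch in Python. It assumes numeric features and uses a nearest-neighbor estimate, which is only one simple stand-in for the classification methods referred to above, not a method prescribed by the text. The function name `impute_missing`, the parameter `k`, and the toy records are hypothetical.

```python
import math


def impute_missing(records, k=2):
    """Fill None entries using the k nearest complete records.

    Distance is Euclidean over the features that the incomplete record
    actually has; each missing feature is set to the neighbors' mean.
    """
    complete = [r for r in records if None not in r]
    if not complete:
        raise ValueError("need at least one complete record")
    filled = []
    for rec in records:
        rec = list(rec)
        missing = [j for j, v in enumerate(rec) if v is None]
        if missing:
            observed = [j for j, v in enumerate(rec) if v is not None]
            # Rank complete records by distance on the observed features only.
            neighbors = sorted(
                complete,
                key=lambda c: math.sqrt(sum((c[j] - rec[j]) ** 2 for j in observed)),
            )[:k]
            for j in missing:
                rec[j] = sum(n[j] for n in neighbors) / len(neighbors)
        filled.append(rec)
    return filled


# Hypothetical toy data: the last record is missing its second feature.
records = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [10.0, 100.0],
    [2.5, None],
]
filled = impute_missing(records, k=2)
print(filled[-1])
```

The estimate is only as good as the neighbors it borrows from, which is one way the imputation errors mentioned above can leak into later analysis.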


Figure 2.1: Finding noise by data-centric methods (scatter plot of FEATURE Y against FEATURE X; two isolated points are marked as noise)

In the case of dependency-oriented data, such as time series or spatial data, missing value estimation is much simpler. In this case, the behavioral attribute values of contextually nearby records are used for the imputation process. For example, in a time-series data set, the average of the values at the time stamp just before or after the missing attribute may be used for estimation. Alternatively, the behavioral values at the last n time-series data stamps can be linearly interpolated to determine the missing value. For the case of spatial data, the estimation process is quite similar, where the average of values at neighboring spatial locations may be used.
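The time-series case can be sketched as follows. This is an illustrative Python example rather than a definitive implementation: it interpolates linearly between the nearest observed values on either side of a gap, which for a single missing time stamp reduces to the average of the values just before and after it. The function name `interpolate_missing`, the use of `None` for missing values, and the handling of gaps at the ends of the series are assumptions.

```python
def interpolate_missing(series):
    """Linearly interpolate None entries between the nearest observed values.

    Leading or trailing gaps have only one observed neighbor, so they are
    filled by copying it (an assumption; the text does not cover this case).
    """
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    out = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        before = [k for k in known if k < i]
        after = [k for k in known if k > i]
        if before and after:
            lo, hi = before[-1], after[0]
            frac = (i - lo) / (hi - lo)
            out[i] = series[lo] + frac * (series[hi] - series[lo])
        else:
            out[i] = series[before[-1]] if before else series[after[0]]
    return out


# Hypothetical series with one single gap, one double gap, and a trailing gap.
series = [10.0, None, 14.0, None, None, 20.0, None]
print(interpolate_missing(series))
```

The same idea carries over to spatial data by replacing "previous and next time stamp" with "neighboring locations".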

2.3.2 Handling Incorrect and Inconsistent Entries

The key methods that are used for removing or correcting the incorrect and inconsistent entries are as follows:

1. Inconsistency detection: This is typically done when the data is available from different sources in different formats. For example, a person’s name may be spelled out in full in one source, whereas the other source may only contain the initials and a last name. In such cases, the key issues are duplicate detection and inconsistency detection. These topics are studied under the general umbrella of data integration within the database field.

2. Domain knowledge: A significant amount of domain knowledge is often available in terms of the ranges of the attributes or rules that specify the relationships across different attributes. For example, if the country field is “United States,” then the city field cannot be “Shanghai.” Many data scrubbing and data auditing tools have been developed that use such domain knowledge and constraints to detect incorrect entries.

3. Data-centric methods: In these cases, the statistical behavior of the data is used to detect outliers. For example, the two isolated data points in Fig. 2.1 marked as “noise” are outliers. These isolated points might have arisen because of errors in the data collection process. However, this may not always be the case because the anomalies may be the result of interesting behavior of the underlying system. Therefore, any detected outlier may need to be manually examined before it is discarded. The use of data-centric methods for cleaning can sometimes be dangerous because they can result in the removal of useful knowledge from the underlying system. The outlier detection problem is an important analytical technique in its own right, and is discussed in detail in Chaps. 8 and 9.
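A data-centric check of the kind described above might look like the following Python sketch. It flags points whose k-th nearest neighbor is unusually far away, which is only one simple notion of an "isolated" point. The function name `flag_isolated_points`, the parameters `k` and `threshold`, and the sample points are all hypothetical. In line with the caution above, the function returns candidates for manual review and deletes nothing.

```python
import math


def flag_isolated_points(points, k=2, threshold=3.0):
    """Return indices of points whose k-th nearest neighbor is farther than threshold."""
    flagged = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        if dists[k - 1] > threshold:
            flagged.append(i)
    return flagged


# Hypothetical data: seven points along a line plus two isolated points.
points = [(0, 3), (1, 3.5), (2, 4), (3, 4.5), (4, 5), (5, 5.5), (6, 6),
          (2, 9), (12, 4)]
candidates = flag_isolated_points(points)
print(candidates)  # indices to examine manually, not to discard automatically
```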

The methods for addressing erroneous and inconsistent entries are generally highly domain specific.
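As an illustration of how domain knowledge can be turned into checks, here is a small Python sketch of a rule-based audit. The height bounds, the city-to-country lookup, the field names, and the function `audit_record` are hypothetical stand-ins; in practice such constraints would come from domain experts or from an existing data auditing tool.

```python
# Hypothetical domain knowledge: a plausible height range in meters and a
# small city-to-country lookup. Real constraints are domain specific.
HEIGHT_RANGE_M = (0.3, 2.8)
CITY_TO_COUNTRY = {"Shanghai": "China", "Chicago": "United States"}


def audit_record(record):
    """Return a list of rule violations for one record (empty if none)."""
    problems = []
    height = record.get("height_m")
    if height is not None:
        low, high = HEIGHT_RANGE_M
        if not low <= height <= high:
            problems.append(f"height_m={height} outside [{low}, {high}]")
    # Cross-attribute rule: a known city must agree with the country field.
    expected = CITY_TO_COUNTRY.get(record.get("city"))
    if expected is not None and record.get("country") != expected:
        problems.append(
            f"city {record['city']!r} does not match country {record.get('country')!r}"
        )
    return problems


records = [
    {"height_m": 1.7, "city": "Chicago", "country": "United States"},
    {"height_m": 6.0, "city": "Chicago", "country": "United States"},
    {"height_m": 1.7, "city": "Shanghai", "country": "United States"},
]
for r in records:
    print(audit_record(r))
```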

2.3.3 Scaling and Normalization

In many scenarios, the different features represent different scales of reference and may therefore not be comparable to one another. For example, an attribute such as age is drawn on a very different scale than an attribute such as salary. The latter attribute is typically orders of magnitude larger than the former. As a result, any aggregate function computed on the different features (e.g., Euclidean distances) will be dominated by the attribute of larger magnitude.
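A quick calculation shows how strong this effect can be. The two records below are hypothetical (age in years, salary in arbitrary currency units). The sketch computes their Euclidean distance and the fraction of the squared distance that comes from salary alone.

```python
import math

# Two hypothetical records as (age in years, salary in currency units).
a = (25, 50_000)
b = (60, 51_000)

age_gap = abs(a[0] - b[0])        # 35 years
salary_gap = abs(a[1] - b[1])     # 1000 currency units
distance = math.dist(a, b)

# Share of the squared distance contributed by the salary attribute.
salary_share = salary_gap ** 2 / (age_gap ** 2 + salary_gap ** 2)
print(round(distance, 2), round(salary_share, 4))
```

Even though the age gap of 35 years is large in human terms and the salary gap is small in relative terms, the salary attribute accounts for well over 99% of the squared distance.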

To address this problem, it is common to use standardization. Consider the case where the jth attribute has mean μ_j and standard deviation σ_j. Then, the jth attribute value x^j_i of the ith record X_i may be normalized as follows:

$$ z_i^j = \frac{x_i^j - \mu_j}{\sigma_j} \qquad (2.2) $$

The vast majority of the normalized values will typically lie in the range [−3, 3] under the normal distribution assumption.
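Standardization of a single attribute can be sketched in a few lines of Python. The function name `standardize` and the sample ages are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say whether the population or sample estimate is intended.

```python
import statistics


def standardize(column):
    """Map each value x to (x - mean) / standard deviation."""
    mu = statistics.fmean(column)
    sigma = statistics.pstdev(column)  # population standard deviation (assumed)
    if sigma == 0:
        raise ValueError("a constant attribute cannot be standardized")
    return [(x - mu) / sigma for x in column]


ages = [20, 30, 40, 50, 60]
z = standardize(ages)
print([round(v, 3) for v in z])
```

After the transformation the attribute has mean 0 and standard deviation 1, so it can be compared with other standardized attributes regardless of their original units.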

A second approach uses min-max scaling to map all attributes to the range [0, 1]. Let min_j and max_j represent the minimum and maximum values of attribute j. Then, the jth attribute value x^j_i of the ith record X_i may be scaled as follows:

$$ y_i^j = \frac{x_i^j - \min_j}{\max_j - \min_j} $$
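Min-max scaling can be sketched in the same way. The code below assumes the usual form of the transformation, in which the attribute minimum is subtracted from each value and the result is divided by the attribute range (max_j − min_j), so that the smallest value maps to 0 and the largest to 1. The function name `min_max_scale` and the sample salaries are hypothetical.

```python
def min_max_scale(column):
    """Map each value x to (x - min) / (max - min), giving values in [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        raise ValueError("a constant attribute cannot be min-max scaled")
    return [(x - lo) / (hi - lo) for x in column]


salaries = [30_000, 45_000, 60_000, 90_000]
scaled = min_max_scale(salaries)
print(scaled)
```

Because the result depends only on the two extreme values, a single erroneous extreme entry can compress all other values into a narrow part of [0, 1].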
