\[ \frac{x_i^j - \min_j}{\max_j - \min_j} \qquad (2.3) \]
This approach is not effective when the maximum and minimum values are extreme value outliers because of some mistake in data collection. For example, consider the age attribute where a mistake in data collection caused an additional zero to be appended to an age, resulting in an age value of 800 years instead of 80. In this case, most of the scaled data along the age attribute will be in the range [0, 0.1], as a result of which this attribute may be de-emphasized. Standardization is more robust to such scenarios.
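The effect of such an outlier can be checked numerically. The following is a minimal sketch, not a definitive implementation: the age values are invented for illustration, the helper names min_max_scale and standardize are hypothetical, and standardization is assumed to use the population standard deviation.

```python
from statistics import mean, pstdev

# Hypothetical data: 100 plausible ages spread evenly over [20, 80],
# plus one data-entry error (800 recorded instead of 80).
valid_ages = [20 + 60 * k / 99 for k in range(100)]
ages = valid_ages + [800.0]


def min_max_scale(values):
    """Map values to [0, 1] using the attribute's minimum and maximum."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


scaled = min_max_scale(ages)
standardized = standardize(ages)

# Look only at the 100 valid records (the outlier is the last element).
valid_scaled = scaled[:-1]
valid_standardized = standardized[:-1]

# Under min-max scaling, every valid record is squeezed below 0.1.
print("largest valid min-max value:", max(valid_scaled))

# Under standardization, the valid records keep a noticeably wider spread.
print("spread of valid z-scores:",
      max(valid_standardized) - min(valid_standardized))
```

In this sketch the outlier still inflates the standard deviation, so standardization is only more robust than min-max scaling, not immune to the error.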
2.4 Data Reduction and Transformation
The goal of data reduction is to represent the data more compactly. When the data size is smaller, it is much easier to apply sophisticated and computationally expensive algorithms. The reduction of the data may be in terms of the number of rows (records) or in terms of the number of columns (dimensions). Data reduction does result in some loss of information. The use of a more sophisticated algorithm may sometimes compensate for the loss in information resulting from data reduction. Different types of data reduction are used in various applications:
Data sampling: The records from the underlying data are sampled to create a much smaller database. Sampling is generally much harder in the streaming scenario where the sample needs to be dynamically maintained.
Feature selection: Only a subset of features from the underlying data is used in the analytical process. Typically, these subsets are chosen in an application-specific way. For example, a feature selection method that works well for clustering may not work well for classification and vice versa. Therefore, this section will discuss the issue of feature subsetting only in a limited way and defer a more detailed discussion to later chapters.
Data reduction with axis rotation: The correlations in the data are leveraged to represent it in a smaller number of dimensions. Examples of such data reduction methods include principal component analysis (PCA), singular value decomposition (SVD), or latent semantic analysis (LSA) for the text domain.
Data reduction with type transformation: This form of data reduction is closely related to data type portability. For example, time series are converted to multidimensional data of a smaller size and lower complexity by discrete wavelet transformations. Similarly, graphs can be converted to multidimensional representations by using embedding techniques.
Each of the aforementioned aspects will be discussed in different segments of this section.
2.4.1 Sampling
The main advantage of sampling is that it is simple, intuitive, and relatively easy to implement. The type of sampling used may vary with the application at hand.
2.4.1.1 Sampling for Static Data
It is much simpler to sample data when the entire data is already available, and therefore the number of base data points is known in advance. In the unbiased sampling approach, a predefined fraction f of the data points is selected and retained for analysis. This is extremely simple to implement, and can be achieved in two different ways, depending upon whether or not replacement is used.
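The two variants of unbiased sampling can be sketched with Python's standard random module. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the sample size n · f is rounded to the nearest integer, and the data set is a made-up list of record identifiers.

```python
import random


def sample_without_replacement(records, f, rng):
    """Pick round(n * f) distinct positions from the data set.

    Duplicates appear in the sample only if `records` itself has duplicates.
    """
    k = round(len(records) * f)
    return rng.sample(records, k)


def sample_with_replacement(records, f, rng):
    """Draw round(n * f) records independently from the full data set.

    The same record may be drawn more than once.
    """
    k = round(len(records) * f)
    return rng.choices(records, k=k)


rng = random.Random(0)           # fixed seed so the sketch is repeatable
data = list(range(1000))         # hypothetical data set D with n = 1000 records
f = 0.1                          # fraction of records to retain

sample_a = sample_without_replacement(data, f, rng)
sample_b = sample_with_replacement(data, f, rng)

print(len(sample_a), len(set(sample_a)))  # sizes match: no duplicates
print(len(sample_b), len(set(sample_b)))  # distinct count may be smaller
```

Both functions return n · f records; only the second can return the same record more than once.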
In sampling without replacement from a data set D with n records, a total of n · f records are randomly picked from the data. Thus, no duplicates are included in the sample, unless the original data set D also contains duplicates. In sampling with replacement from a data set D with n records, the records are sampled sequentially and independently from the entire data set D for a total of n · f times. Thus, duplicates are possible because the same record may be included in the sample over sequential selections. Generally, most applications do not use replacement because unnecessary duplicates can be a nuisance for some data mining applications, such as outlier detection. Some other specialized forms of sampling are as follows:
Biased sampling: In biased sampling, some parts of the data are intentionally emphasized because of their greater importance to the analysis. A classical example is that of temporal-decay bias where more recent records have a larger chance of being included in the sample, and stale records have a lower chance of being included. In exponential-decay bias, the probability p(X) of sampling a data record X, which was generated