Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə28/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   24   25   26   27   28   29   30   31   ...   423
1-Data Mining tarjima

2.2. FEATURE EXTRACTION AND PORTABILITY

31







Table 2.1: Portability of different data types




























Source data type

Destination data type




Methods




















































Numeric

Categorical




Discretization







Categorical

Numeric




Binarization










Text

Numeric

Latent semantic analysis (LSA)










Time series

Discrete sequence




SAX










Time series

Numeric multidimensional




DWT, DFT










Discrete sequence

Numeric multidimensional




DWT, DFT










Spatial

Numeric multidimensional




2-d DWT










Graphs

Numeric multidimensional




MDS, spectral










Any type

Graphs




Similarity graph



















(Restricted applicability)

















































ranges [a, a · α], [a · α, a · α2], and so on, for some α > 1. This kind of range may be useful when the attribute shows an exponential distribution across a range. In fact, if the attribute frequency distribution for an attribute can be modeled in functional form, then a natural approach would be to select ranges [a, b] such that f (b) − f (a) is the same for some function f (·). The idea is to select this function f (·) in such a way that each range contains an approximately similar number of records. However, in most cases, it is hard to find such a function f (·) in closed form.



  1. Equi-depth ranges: In this case, the ranges are selected so that each range has an equal number of records. The idea is to provide the same level of granularity to each range. An attribute can be divided into equi-depth ranges by first sorting it, and then selecting the division points on the sorted attribute value, such that each range contains an equal number of records.

The process of discretization can also be used to convert time-series data to discrete sequence data.


2.2.2.2 Categorical to Numeric Data: Binarization


In some cases, it is desirable to use numeric data mining algorithms on categorical data. Because binary data is a special form of both numeric and categorical data, it is possible to convert the categorical attributes to binary form and then use numeric algorithms on the binarized data. If a categorical attribute has φ different values, then φ different binary attributes are created. Each binary attribute corresponds to one possible value of the cate-gorical attribute. Therefore, exactly one of the φ attributes takes on the value of 1, and the remaining take on the value of 0.


2.2.2.3 Text to Numeric Data

Although the vector-space representation of text can be considered a sparse numeric data set with very high dimensionality, this special numeric representation is not very amenable to conventional data mining algorithms. For example, one typically uses specialized simi-larity functions, such as the cosine, rather than the Euclidean distance for text data. This is the reason that text mining is a distinct area in its own right with its own family of specialized algorithms. Nevertheless, it is possible to convert a text collection into a form


32 CHAPTER 2. DATA PREPARATION


that is more amenable to the use of mining algorithms for numeric data. The first step is to use latent semantic analysis (LSA) to transform the text collection to a nonsparse rep-resentation with lower dimensionality. Furthermore, after transformation, each document



X = (x1 . . . xd) needs to be scaled to







d

2 (x1 . . . xd). This scaling is necessary to ensure












1


































i=1 xi




that documents of varying length are treated in a uniform way. After this scaling, traditional


numeric measures, such as the Euclidean distance, work more effectively. LSA is discussed in Sect. 2.4.3.3 of this chapter. Note that LSA is rarely used in conjunction with this kind of scaling. Rather, traditional text mining algorithms are directly applied to the reduced representation obtained from LSA.


2.2.2.4 Time Series to Discrete Sequence Data


Time-series data can be converted to discrete sequence data using an approach known as symbolic aggregate approximation (SAX). This method comprises two steps:





  1. Window-based averaging: The series is divided into windows of length w, and the average time-series value over each window is computed.




  1. Value-based discretization: The (already averaged) time-series values are discretized into a smaller number of approximately equi-depth intervals. This is identical to the equi-depth discretization of numeric attributes that was discussed earlier. The idea is to ensure that each symbol has an approximately equal frequency in the time series. The interval boundaries are constructed by assuming that the time-series values are distributed with a Gaussian assumption. The mean and standard deviation of the (windowed) time-series values are estimated in the data-driven manner to instantiate the parameters of the Gaussian distribution. The quantiles of the Gaussian distribu-tion are used to determine the boundaries of the intervals. This is more efficient than sorting all the data values to determine quantiles, and it may be a more practical approach for a long (or streaming) time series. The values are discretized into a small number (typically 3 to 10) of intervals for the best results. Each such equi-depth inter-val is mapped to a symbolic value. This creates a symbolic representation of the time series, which is essentially a discrete sequence.

Thus, SAX might be viewed as an equi-depth discretization approach after window-based averaging.


2.2.2.5 Time Series to Numeric Data


This particular transformation is very useful because it enables the use of multidimensional algorithms for time-series data. A common method used for this conversion is the discrete wavelet transform (DWT). The wavelet transform converts the time series data to multidi-mensional data, as a set of coefficients that represent averaged differences between different portions of the series. If desired, a subset of the largest coefficients may be used to reduce the data size. This approach will be discussed in Sect. 2.4.4.1 on data reduction. An alterna-tive method, known as the discrete Fourier transform (DFT), is discussed in Sect. 14.2.4.2 of Chap. 14. The common property of these transforms is that the various coefficients are no longer as dependency oriented as the original time-series values.



2.2. FEATURE EXTRACTION AND PORTABILITY

33

2.2.2.6 Discrete Sequence to Numeric Data


This transformation can be performed in two steps. The first step is to convert the discrete sequence to a set of (binary) time series, where the number of time series in this set is equal to the number of distinct symbols. The second step is to map each of these time series into a multidimensional vector using the wavelet transform. Finally, the features from the different series are combined to create a single multidimensional record.


To convert a sequence to a binary time series, one can create a binary string in which the value denotes whether or not a particular symbol is present at a position. For example, consider the following nucleotide sequence, which is drawn on four symbols:


ACACACTGTGACTG


This series can be converted into the following set of four binary time series corresponding to the symbols A, C, T, and G, respectively:


10101000001000


01010100000100


00000010100010


00000001010001


A wavelet transformation can be applied to each of these series to create a multidimensional set of features. The features from the four different series can be appended to create a single numeric multidimensional record.


2.2.2.7 Spatial to Numeric Data


Spatial data can be converted to numeric data by using the same approach that was used for time-series data. The main difference is that there are now two contextual attributes (instead of one). This requires modification of the wavelet transformation method. Section 2.4.4.1 will briefly discuss how the one-dimensional wavelet approach can be generalized when there are two contextual attributes. The approach is fairly general and can be used for any number of contextual attributes.


2.2.2.8 Graphs to Numeric Data


Graphs can be converted to numeric data with the use of methods such as multidimen-sional scaling (MDS) and spectral transformations. This approach works for those appli-cations where the edges are weighted, and represent similarity or distance relationships between nodes. The general approach of MDS can achieve this goal, and it is discussed in Sect. 2.4.4.2. A spectral approach can also be used to convert a graph into a multi-dimensional representation. This is also a dimensionality reduction scheme that converts the structural information into a multidimensional representation. This approach will be discussed in Sect. 2.4.4.3.


2.2.2.9 Any Type to Graphs for Similarity-Based Applications

Many applications are based on the notion of similarity. For example, the clustering problem is defined as the creation of groups of similar objects, whereas the outlier detection problem is defined as one in which a subset of objects differing significantly from the remaining objects are identified. Many forms of classification models, such as nearest neighbor classi-fiers, are also dependent on the notion of similarity. The notion of pairwise similarity can


34 CHAPTER 2. DATA PREPARATION


be best captured with the use of a neighborhood graph. For a given set of data objects





  • = {O1 . . . On}, a neighborhood graph is defined as follows:




    1. A single node is defined for each object in O. This is defined by the node set N , containing n nodes where the node i corresponds to the object Oi.




    1. An edge exists between Oi and Oj , if the distance d(Oi, Oj ) is less than a particular threshold . Alternatively, the k-nearest neighbors of each node may be used. Because the k-nearest neighbor relationship is not symmetric, this results in a directed graph. The directions on the edges are ignored, and the parallel edges are removed. The weight wij of the edge (i, j) is equal to a kernelized function of the distance between the objects Oi and Oj , so that larger weights indicate greater similarity. An example is the heat kernel:

wij = ed(Oi,Oj )2/t2

(2.1)

Here, t is a user-defined parameter.

A wide variety of data mining algorithms are available for network data. All these methods can also be used on the similarity graph. Note that the similarity graph can be crisply defined for data objects of any type, as long as an appropriate distance function can be defined. This is the reason that distance function design is so important for virtually any data type. The issue of distance function design will be addressed in Chap. 3. Note that this approach is useful only for applications that are based on the notion of similarity or distances. Nevertheless, many data mining problems are directed or indirectly related to notions of similarity and distances.


2.3 Data Cleaning

The data cleaning process is important because of the errors associated with the data collection process. Several sources of missing entries and errors may arise during the data collection process. Some examples are as follows:





  1. Some data collection technologies, such as sensors, are inherently inaccurate because of the hardware limitations associated with collection and transmission. Sometimes sensors may drop readings because of hardware failure or battery exhaustion.




  1. Data collected using scanning technologies may have errors associated with it because optical character recognition techniques are far from perfect. Furthermore, speech-to-text data is also prone to errors.




  1. Users may not want to specify their information for privacy reasons, or they may specify incorrect values intentionally. For example, it has often been observed that users sometimes specify their birthday incorrectly on automated registration sites such as those of social networks. In some cases, users may choose to leave several fields empty.




  1. A significant amount of data is created manually. Manual errors are common during data entry.




  1. The entity in charge of data collection may not collect certain fields for some records, if it is too costly. Therefore, records may be incompletely specified.


Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   24   25   26   27   28   29   30   31   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin