Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	28/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 24 25 26 27 28 29 30 31 ... 423

1-Data Mining tarjima

2.2. FEATURE EXTRACTION AND PORTABILITY	31

		Table 2.1: Portability of diﬀerent data types

	Source data type		Destination data type		Methods


	Numeric		Categorical		Discretization
	Categorical		Numeric		Binarization
	Text		Numeric	Latent semantic analysis (LSA)
	Time series		Discrete sequence		SAX
	Time series		Numeric multidimensional		DWT, DFT
	Discrete sequence		Numeric multidimensional		DWT, DFT
	Spatial		Numeric multidimensional		2-d DWT
	Graphs		Numeric multidimensional		MDS, spectral
	Any type		Graphs		Similarity graph
					(Restricted applicability)

ranges [a, a · α], [a · α, a · α²], and so on, for some α > 1. This kind of range may be useful when the attribute shows an exponential distribution across a range. In fact, if the attribute frequency distribution for an attribute can be modeled in functional form, then a natural approach would be to select ranges [a, b] such that f (b) − f (a) is the same for some function f (·). The idea is to select this function f (·) in such a way that each range contains an approximately similar number of records. However, in most cases, it is hard to find such a function f (·) in closed form.

Equi-depth ranges: In this case, the ranges are selected so that each range has an equal number of records. The idea is to provide the same level of granularity to each range. An attribute can be divided into equi-depth ranges by first sorting it, and then selecting the division points on the sorted attribute value, such that each range contains an equal number of records.

The process of discretization can also be used to convert time-series data to discrete sequence data.

2.2.2.2 Categorical to Numeric Data: Binarization

In some cases, it is desirable to use numeric data mining algorithms on categorical data. Because binary data is a special form of both numeric and categorical data, it is possible to convert the categorical attributes to binary form and then use numeric algorithms on the binarized data. If a categorical attribute has φ diﬀerent values, then φ diﬀerent binary attributes are created. Each binary attribute corresponds to one possible value of the cate-gorical attribute. Therefore, exactly one of the φ attributes takes on the value of 1, and the remaining take on the value of 0.

2.2.2.3 Text to Numeric Data

Although the vector-space representation of text can be considered a sparse numeric data set with very high dimensionality, this special numeric representation is not very amenable to conventional data mining algorithms. For example, one typically uses specialized simi-larity functions, such as the cosine, rather than the Euclidean distance for text data. This is the reason that text mining is a distinct area in its own right with its own family of specialized algorithms. Nevertheless, it is possible to convert a text collection into a form

32 CHAPTER 2. DATA PREPARATION

that is more amenable to the use of mining algorithms for numeric data. The first step is to use latent semantic analysis (LSA) to transform the text collection to a nonsparse rep-resentation with lower dimensionality. Furthermore, after transformation, each document

X = (x₁ . . . x_d) needs to be scaled to			d	₂(x₁ . . . x_d). This scaling is necessary to ensure
	√	1

			i=1 ^xi

that documents of varying length are treated in a uniform way. After this scaling, traditional

numeric measures, such as the Euclidean distance, work more eﬀectively. LSA is discussed in Sect. 2.4.3.3 of this chapter. Note that LSA is rarely used in conjunction with this kind of scaling. Rather, traditional text mining algorithms are directly applied to the reduced representation obtained from LSA.

2.2.2.4 Time Series to Discrete Sequence Data

Time-series data can be converted to discrete sequence data using an approach known as symbolic aggregate approximation (SAX). This method comprises two steps:

Window-based averaging: The series is divided into windows of length w, and the average time-series value over each window is computed.

Value-based discretization: The (already averaged) time-series values are discretized into a smaller number of approximately equi-depth intervals. This is identical to the equi-depth discretization of numeric attributes that was discussed earlier. The idea is to ensure that each symbol has an approximately equal frequency in the time series. The interval boundaries are constructed by assuming that the time-series values are distributed with a Gaussian assumption. The mean and standard deviation of the (windowed) time-series values are estimated in the data-driven manner to instantiate the parameters of the Gaussian distribution. The quantiles of the Gaussian distribu-tion are used to determine the boundaries of the intervals. This is more eﬃcient than sorting all the data values to determine quantiles, and it may be a more practical approach for a long (or streaming) time series. The values are discretized into a small number (typically 3 to 10) of intervals for the best results. Each such equi-depth inter-val is mapped to a symbolic value. This creates a symbolic representation of the time series, which is essentially a discrete sequence.

Thus, SAX might be viewed as an equi-depth discretization approach after window-based averaging.

2.2.2.5 Time Series to Numeric Data

This particular transformation is very useful because it enables the use of multidimensional algorithms for time-series data. A common method used for this conversion is the discrete wavelet transform (DWT). The wavelet transform converts the time series data to multidi-mensional data, as a set of coeﬃcients that represent averaged diﬀerences between diﬀerent portions of the series. If desired, a subset of the largest coeﬃcients may be used to reduce the data size. This approach will be discussed in Sect. 2.4.4.1 on data reduction. An alterna-tive method, known as the discrete Fourier transform (DFT), is discussed in Sect. 14.2.4.2 of Chap. 14. The common property of these transforms is that the various coeﬃcients are no longer as dependency oriented as the original time-series values.

2.2. FEATURE EXTRACTION AND PORTABILITY

2.2.2.6 Discrete Sequence to Numeric Data

This transformation can be performed in two steps. The first step is to convert the discrete sequence to a set of (binary) time series, where the number of time series in this set is equal to the number of distinct symbols. The second step is to map each of these time series into a multidimensional vector using the wavelet transform. Finally, the features from the diﬀerent series are combined to create a single multidimensional record.

To convert a sequence to a binary time series, one can create a binary string in which the value denotes whether or not a particular symbol is present at a position. For example, consider the following nucleotide sequence, which is drawn on four symbols:

ACACACTGTGACTG

This series can be converted into the following set of four binary time series corresponding to the symbols A, C, T, and G, respectively:

10101000001000

01010100000100

00000010100010

00000001010001

A wavelet transformation can be applied to each of these series to create a multidimensional set of features. The features from the four diﬀerent series can be appended to create a single numeric multidimensional record.

2.2.2.7 Spatial to Numeric Data

Spatial data can be converted to numeric data by using the same approach that was used for time-series data. The main diﬀerence is that there are now two contextual attributes (instead of one). This requires modification of the wavelet transformation method. Section 2.4.4.1 will briefly discuss how the one-dimensional wavelet approach can be generalized when there are two contextual attributes. The approach is fairly general and can be used for any number of contextual attributes.

2.2.2.8 Graphs to Numeric Data

Graphs can be converted to numeric data with the use of methods such as multidimen-sional scaling (MDS) and spectral transformations. This approach works for those appli-cations where the edges are weighted, and represent similarity or distance relationships between nodes. The general approach of MDS can achieve this goal, and it is discussed in Sect. 2.4.4.2. A spectral approach can also be used to convert a graph into a multi-dimensional representation. This is also a dimensionality reduction scheme that converts the structural information into a multidimensional representation. This approach will be discussed in Sect. 2.4.4.3.

2.2.2.9 Any Type to Graphs for Similarity-Based Applications

Many applications are based on the notion of similarity. For example, the clustering problem is defined as the creation of groups of similar objects, whereas the outlier detection problem is defined as one in which a subset of objects diﬀering significantly from the remaining objects are identified. Many forms of classification models, such as nearest neighbor classi-fiers, are also dependent on the notion of similarity. The notion of pairwise similarity can

34 CHAPTER 2. DATA PREPARATION

be best captured with the use of a neighborhood graph. For a given set of data objects

= {O₁ . . . O_n}, a neighborhood graph is defined as follows:

A single node is defined for each object in O. This is defined by the node set N , containing n nodes where the node i corresponds to the object O_i.

An edge exists between O_i and O_j , if the distance d(O_i, O_j ) is less than a particular threshold . Alternatively, the k-nearest neighbors of each node may be used. Because the k-nearest neighbor relationship is not symmetric, this results in a directed graph. The directions on the edges are ignored, and the parallel edges are removed. The weight w_ij of the edge (i, j) is equal to a kernelized function of the distance between the objects O_i and O_j , so that larger weights indicate greater similarity. An example is the heat kernel:

_w_ij= _e−^d⁽^Oi^,Oj ⁾²^/t²

(2.1)

Here, t is a user-defined parameter.

A wide variety of data mining algorithms are available for network data. All these methods can also be used on the similarity graph. Note that the similarity graph can be crisply defined for data objects of any type, as long as an appropriate distance function can be defined. This is the reason that distance function design is so important for virtually any data type. The issue of distance function design will be addressed in Chap. 3. Note that this approach is useful only for applications that are based on the notion of similarity or distances. Nevertheless, many data mining problems are directed or indirectly related to notions of similarity and distances.

2.3 Data Cleaning

The data cleaning process is important because of the errors associated with the data collection process. Several sources of missing entries and errors may arise during the data collection process. Some examples are as follows:

Some data collection technologies, such as sensors, are inherently inaccurate because of the hardware limitations associated with collection and transmission. Sometimes sensors may drop readings because of hardware failure or battery exhaustion.

Data collected using scanning technologies may have errors associated with it because optical character recognition techniques are far from perfect. Furthermore, speech-to-text data is also prone to errors.

Users may not want to specify their information for privacy reasons, or they may specify incorrect values intentionally. For example, it has often been observed that users sometimes specify their birthday incorrectly on automated registration sites such as those of social networks. In some cases, users may choose to leave several fields empty.

A significant amount of data is created manually. Manual errors are common during data entry.

The entity in charge of data collection may not collect certain fields for some records, if it is too costly. Therefore, records may be incompletely specified.

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 24 25 26 27 28 29 30 31 ... 423