$$p_1 = \frac{n-k}{n} \cdot \frac{k}{n-1} = \frac{k(n-k)}{n(n-1)}$$
The incoming data point is inserted into the reservoir. The probability of Case II is therefore equal to the insertion probability k/n of the incoming data point. When such an insertion occurs, each existing reservoir point is retained with probability (k − 1)/k because exactly one of the k points is ejected. By the inductive assumption, any of the earlier points in the data stream was present in the reservoir with probability k/(n − 1). Therefore, the probability that such a point remains in the reservoir together with the occurrence of the Case II event is given by the product p2 of the three aforementioned probabilities:
$$p_2 = \frac{k}{n} \cdot \frac{k-1}{k} \cdot \frac{k}{n-1} = \frac{k(k-1)}{n(n-1)} \qquad (2.5)$$
Therefore, the total probability of a stream point being retained in the reservoir after the arrival of the nth data point is given by the sum of p1 and p2:

$$p_1 + p_2 = \frac{k(n-k) + k(k-1)}{n(n-1)} = \frac{k(n-1)}{n(n-1)} = \frac{k}{n}$$

This is exactly the required retention probability of k/n, which completes the inductive argument.
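The insertion and ejection rule analyzed above can be summarized in a short sketch of unbiased reservoir sampling. This is an illustrative sketch rather than an implementation from the text; the function name reservoir_sample, the choice of Python, and the use of the standard random module are assumptions made here.

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of k points from a stream of unknown length."""
    reservoir = []
    for n, point in enumerate(stream, start=1):
        if n <= k:
            # The first k points always enter the reservoir.
            reservoir.append(point)
        elif random.random() < k / n:
            # Case II: the nth point is inserted with probability k/n,
            # ejecting a uniformly chosen existing reservoir point.
            reservoir[random.randrange(k)] = point
        # Case I (otherwise): the nth point is discarded.
    return reservoir
```

For example, reservoir_sample(range(10**6), 100) returns a sample of k = 100 stream values, and each stream point is present in the reservoir with probability k/n at every stage, as established by the argument above.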
It is possible to extend reservoir sampling to cases where temporal bias is present in the data stream. In particular, the case of exponential bias has been addressed in [35].
2.4.2 Feature Subset Selection
A second method for data preprocessing is feature subset selection. Some features can be discarded when they are known to be irrelevant. Which features are relevant? Clearly, this decision depends on the application at hand. There are two primary types of feature selection:
Unsupervised feature selection: This corresponds to the removal of noisy and redundant attributes from the data. Unsupervised feature selection is best defined in terms of its impact on clustering applications, though the applicability is much broader. It is difficult to comprehensively describe such feature selection methods without using the clustering problem as a proper context. Therefore, a discussion of methods for unsupervised feature selection is deferred to Chap. 6 on data clustering.
[Plot omitted: DATA POINTS plotted against FEATURE X and FEATURE Y, with EIGENVECTOR 1, EIGENVECTOR 2, and EIGENVECTOR 3 overlaid.]
Figure 2.2: Highly correlated data represented in a small number of dimensions in an axis system that is rotated appropriately
Supervised feature selection: This type of feature selection is relevant to the problem of data classification. In this case, only the features that can effectively predict the class attribute are relevant. Such feature selection methods are often closely integrated with analytical methods for classification. A detailed discussion is deferred to Chap. 10 on data classification.
Feature selection is an important part of the data mining process because it defines the quality of the input data.
2.4.3 Dimensionality Reduction with Axis Rotation
In real data sets, a significant number of correlations exist among different attributes. In some cases, hard constraints or rules between attributes may uniquely define some attributes in terms of others. For example, the date of birth of an individual (represented quantitatively) is perfectly correlated with his or her age. In most cases, the correlations may not be quite as perfect, but significant dependencies may still exist among the different features. Unfortunately, real data sets contain many such redundancies that escape the attention of the analyst during the initial phase of data creation. These correlations and constraints correspond to implicit redundancies because they imply that knowledge of some subsets of the dimensions can be used to predict the values of the other dimensions. For example, consider the 3-dimensional data set illustrated in Fig. 2.2. In this case, if the axis system is rotated to the orientation illustrated in the figure, the correlations and redundancies in the newly transformed feature values are removed. As a result of this redundancy removal, the entire data can be (approximately) represented along a 1-dimensional line. Thus, the intrinsic dimensionality of this 3-dimensional data set is 1. The other two axes correspond to the low-variance dimensions. If the data is represented as coordinates in the new axis system illustrated in Fig. 2.2, then the coordinate values along these low-variance dimensions will not vary much. Therefore, after the axis system has been rotated, these dimensions can be removed without much information loss.
A natural question arises as to how a correlation-removing axis system, such as that in Fig. 2.2, may be determined in an automated way. Two natural methods to achieve this goal
are those of principal component analysis (PCA) and singular value decomposition (SVD). These two methods, while not exactly identical at the definition level, are closely related. Although the notion of principal component analysis is intuitively easier to understand, SVD is a more general framework and can be used to perform PCA as a special case.
2.4.3.1 Principal Component Analysis
PCA is generally applied after subtracting the mean of the data set from each data point. This operation is referred to as mean centering, and it results in a data set centered at the origin. However, it is also possible to apply PCA without mean centering, as long as the mean of the data is stored separately. The goal of PCA is to rotate the data into an axis system in which the greatest amount of variance is captured in a small number of dimensions. It is intuitively evident from the example of Fig. 2.2 that such an axis system is affected by the correlations between attributes. An important observation, which will be shown below, is that the variance of a data set along any particular direction can be expressed directly in terms of its covariance matrix.
Let C be the d × d symmetric covariance matrix of the n × d data matrix D. Thus, the (i, j)th entry cij of C denotes the covariance between the ith and jth columns (dimensions) of the data matrix D. Let μi represent the mean along the ith dimension. Specifically, if xmk denotes the mth dimension of the kth record, then the value of the covariance entry cij is as follows:

$$c_{ij} = \frac{\sum_{k=1}^{n} x_{ik}\,x_{jk}}{n} - \mu_i \mu_j \qquad \forall i, j \in \{1, \ldots, d\}$$
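Anticipating the observation mentioned above, the variance of the mean-centered data along any unit direction v is given by the quadratic form v^T C v; in particular, the eigenvalues of C are the variances along its eigenvector directions. The following sketch illustrates mean centering, the covariance computation, and the axis rotation of Fig. 2.2. It is an illustrative example rather than an implementation from the text; the function name pca_rotate and the use of NumPy are assumptions made here.

```python
import numpy as np

def pca_rotate(D, num_components):
    """Rotate an n x d data matrix into the axis system of its top eigenvectors."""
    # Mean centering: subtract the column means so the data is centered at the origin.
    centered = D - D.mean(axis=0)
    n = D.shape[0]
    # Covariance matrix C; its (i, j)th entry matches the formula for c_ij above.
    C = centered.T @ centered / n
    # Eigenvectors of the symmetric matrix C, ordered by decreasing eigenvalue (variance).
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    order = np.argsort(eigenvalues)[::-1]
    top_vectors = eigenvectors[:, order[:num_components]]
    # Coordinates of the data in the rotated, reduced axis system.
    return centered @ top_vectors, eigenvalues[order]
```

For the data of Fig. 2.2, calling pca_rotate with num_components set to 1 would retain only the high-variance direction (eigenvector 1) and discard the two low-variance dimensions with little information loss.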