Data Mining: The Textbook




F_{(h_s,h_t)}(X, t) = C_f \cdot \sum_{i=1}^{|S|} K_{(h_s,h_t)}(X - X_i,\, t - t_i).

Here K_(h_s,h_t)(·, ·) is a spatiotemporal kernel smoothing function, h_s is the vector of spatial kernel widths, and h_t is the temporal kernel width. The set S denotes the points that arrived in the window (t − h_t, t). The kernel function K_(h_s,h_t)(X − X_i, t − t_i) is a smooth distribution that decreases with increasing value of t − t_i. The value of C_f is a suitably chosen normalization constant, so that the entire density over the spatial plane integrates to one unit. Thus, C_f is defined as follows:




\int_{\text{all } X} F_{(h_s,h_t)}(X, t)\, \delta X = 1.

The reverse time-slice density estimate is calculated differently from the forward time-slice density estimate. Assume that the set of points in the time interval (t, t + h_t) is denoted by U. As before, the value of C_r is chosen as a normalization constant. Correspondingly, the reverse time-slice density estimate R_(h_s,h_t)(X, t) is defined as follows:




R_{(h_s,h_t)}(X, t) = C_r \cdot \sum_{i=1}^{|U|} K_{(h_s,h_t)}(X - X_i,\, t_i - t).

In this case, t_i − t is used in the argument instead of t − t_i. Thus, the reverse time-slice density in the interval (t, t + h_t) would be exactly the same as the forward time-slice density, if time were reversed, and the data stream arrived in reverse order, starting at t + h_t and ending at t.
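To make these definitions concrete, here is a minimal Python sketch of the two estimates (the function names, the array layout, and the generic kernel callback are illustrative assumptions, not code from the text; S and U are taken to be the points falling in (t − h_t, t] and (t, t + h_t), consistent with the kernel vanishing outside that temporal range):

```python
import numpy as np

def forward_density(x, t, points, times, kernel, h_s, h_t, c_f=1.0):
    """Forward time-slice density F_(hs,ht)(x, t), summing kernel
    contributions of the points S that arrived in (t - h_t, t].
    points: (n, d) array of locations; times: (n,) array of arrival times."""
    mask = (times > t - h_t) & (times <= t)
    return c_f * sum(kernel(x - xi, t - ti, h_s, h_t)
                     for xi, ti in zip(points[mask], times[mask]))

def reverse_density(x, t, points, times, kernel, h_s, h_t, c_r=1.0):
    """Reverse time-slice density R_(hs,ht)(x, t), built from the points U
    in (t, t + h_t) with the reversed temporal argument t_i - t."""
    mask = (times > t) & (times < t + h_t)
    return c_r * sum(kernel(x - xi, ti - t, h_s, h_t)
                     for xi, ti in zip(points[mask], times[mask]))
```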


The velocity density V_(h_s,h_t)(X, T) at spatial location X and time T is defined as follows:


V_{(h_s,h_t)}(X, T) = \frac{F_{(h_s,h_t)}(X, T) - R_{(h_s,h_t)}(X, T - h_t)}{h_t}.
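Continuing the sketch above, the velocity density is then just the scaled difference of the two estimates (again a hypothetical helper rather than the book's code):

```python
def velocity_density(x, T, points, times, kernel, h_s, h_t):
    """Velocity density V_(hs,ht)(x, T): the forward estimate at time T
    minus the reverse estimate anchored at T - h_t, scaled by the window."""
    f = forward_density(x, T, points, times, kernel, h_s, h_t)
    r = reverse_density(x, T - h_t, points, times, kernel, h_s, h_t)
    return (f - r) / h_t
```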

Note that the reverse time-slice density estimate is defined with a temporal argument of (T − h_t), and therefore the future points with respect to (T − h_t) are known at time T. A positive value of the velocity density corresponds to an increase in the data density at a given point. A negative value of the velocity density corresponds to a reduction in the data density at a given point. In general, it has been shown that when the spatiotemporal kernel function is defined as below, then the velocity density is directly proportional to the rate of change of the data density at a given point.




K_{(h_s,h_t)}(X, t) = (1 - t/h_t) \cdot K_{h_s}(X).

This kernel function is defined only for values of t in the range (0, h_t). The Gaussian spatial kernel function K_{h_s}(·) was used because of its well-known effectiveness. Specifically, K_{h_s}(·) is the product of d identical Gaussian kernel functions, and h_s = (h_s^1, ..., h_s^d), where h_s^i is the smoothing parameter for dimension i.
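A minimal sketch of this particular kernel, assuming standard Gaussian density factors per dimension; it can be passed as the `kernel` argument of the earlier sketches:

```python
import numpy as np

def spatiotemporal_kernel(dx, dt, h_s, h_t):
    """K_(hs,ht)(dx, dt) = (1 - dt/h_t) * K_hs(dx), where K_hs(dx) is the
    product of d Gaussian kernels with per-dimension bandwidths h_s[i]."""
    if not (0.0 < dt < h_t):
        return 0.0  # the temporal factor is defined only for dt in (0, h_t)
    dx, h_s = np.asarray(dx, float), np.asarray(h_s, float)
    gauss = np.exp(-0.5 * (dx / h_s) ** 2) / (np.sqrt(2.0 * np.pi) * h_s)
    return (1.0 - dt / h_t) * float(np.prod(gauss))
```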

The velocity density is associated with a data point as well as a time instant, and therefore this definition allows the labeling of both data points and time instants as outliers. However, the interpretation of a data point as an outlier in the context of aggregate change analysis is slightly different from the previous definitions in this section. An outlier is defined on an aggregate basis, rather than in a way specific to that point. Because outliers are data points in regions where abrupt change has occurred, outliers are defined as data points X at time instants t with unusually large absolute values of the local velocity density. If desired, a normal distribution could be used to determine the extreme values among the absolute velocity density values. Thus, the velocity density approach is able to convert the multidimensional data distributions into a quantification that can be used in conjunction with extreme-value analysis.

It is important to note that the data point X is an outlier only in the context of aggregate changes occurring in its locality, rather than its own properties as an outlier. In the context of the news-story example, this corresponds to a news story belonging to a particular burst of related articles. Thus, such an approach could detect the sudden emergence of local clusters in the data, and report the corresponding data points in a timely fashion. Furthermore, it is also possible to compute the aggregate absolute level of change (over all regions) occurring in the underlying data stream. This is achieved by computing the average absolute velocity density over the entire data space by summing the changes at sample points in the space. Time instants with large values of the aggregate velocity density may be declared as outliers.
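The following hedged sketch illustrates both ideas: a simple z-score test under a fitted normal distribution flags unusually large absolute velocity densities, and an average over sample locations gives the aggregate level of change (the 3σ default and the function names are illustrative choices, and `velocity_density` is reused from the earlier sketch):

```python
import numpy as np

def flag_extreme(abs_velocity_values, num_sd=3.0):
    """Flag entries (data points or time instants) whose absolute velocity
    density is an extreme value under a fitted normal distribution."""
    v = np.asarray(abs_velocity_values, float)
    return (v - v.mean()) / (v.std() + 1e-12) > num_sd

def aggregate_change(sample_locations, T, points, times, kernel, h_s, h_t):
    """Aggregate absolute level of change at time T: the average absolute
    velocity density over sample locations covering the data space."""
    return float(np.mean([abs(velocity_density(x, T, points, times,
                                               kernel, h_s, h_t))
                          for x in sample_locations]))
```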


12.6 Streaming Classification


The problem of streaming classification is especially challenging because of the impact of concept drift. One simple approach is to use a reservoir sample to create a concise representation of the training data. This concise representation can be used to create an offline model. If desired, a decay-based reservoir sample can be used to handle concept drift. Such an approach has the advantage that any conventional classification algorithm can be used since the challenges associated with the streaming paradigm have already been addressed at the sampling stage. A number of dedicated methods have also been proposed for streaming classification.
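As one possible realization, the sketch below maintains an exponentially biased reservoir: each arrival either fills an empty slot or overwrites a uniformly chosen resident with probability equal to the current fill fraction, so older points survive with geometrically decaying probability. The class name and the exact decay mechanism are assumptions for illustration, not the book's prescribed algorithm:

```python
import random

class DecayReservoir:
    """Sketch of a decay-based reservoir sample of a labeled stream."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []  # (point, label) training pairs

    def add(self, point, label):
        fill_fraction = len(self.data) / self.capacity
        if random.random() < fill_fraction:
            # overwrite a uniformly chosen resident; recent points are favored
            self.data[random.randrange(len(self.data))] = (point, label)
        else:
            self.data.append((point, label))
```

Any conventional classifier can then be trained offline, and periodically retrained, on `reservoir.data`.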


12.6.1 VFDT Family



Very fast decision trees (VFDT) are based on the principle of Hoeffding trees. The basic idea is that a decision tree can be constructed on a sample of a very large data set, using a carefully designed approach, so that the resulting tree is the same as what would have been achieved with the original data set with high probability. The Hoeffding bound is used to estimate this probability, and therefore the intermediate steps of the approach are designed with this bound in mind. This is the reason that such trees are referred to as Hoeffding trees.


The Hoeffding tree can be constructed incrementally by growing the tree simultaneously with stream arrival. An important assumption is that the stream does not evolve, and therefore the currently arrived set of points can be viewed as a sample of the full stream. The higher levels of the tree are constructed at earlier stages of the stream, when enough tuples have been collected to quantify the accuracy of the corresponding split criteria. The lower level nodes are constructed later because statistics about lower level nodes can be collected only after the higher level nodes have been constructed. Thus, successive levels of the tree are constructed, as more examples stream in and the tree continues to grow. The key in the Hoeffding tree algorithm is to quantify the point at which statistically sufficient tuples have been collected in order to perform a split, so that the split is approximately the same as what would have been performed with knowledge of the full stream.


The same decision tree will be constructed on the current stream sample and the full stream, as long as the same splits are used at each stage. Therefore, the goal of the approach is to ensure that the splits on the sample are identical to the splits on the full stream. For ease in discussion, consider the case where each attribute⁶ is binary. In this case, the two algorithms will produce exactly the same tree, as long as the same split attribute is selected at each point. The split attribute is selected using a measure such as the Gini index. Consider a particular node in the tree constructed on the original data, and the same node constructed on the sampled data. What is the probability that the same attribute will be selected for the stream sample as for the full stream?


Consider the best and second-best attributes for a split, indexed by i and j, respectively, in the sampled data. Let G_i and G_i′ be the Gini index values of the split attribute i, as computed on the full stream and the sampled data, respectively. Because the attribute i was selected for a split in the sampled data, it is evident that G_i′ < G_j′. The problem is that the sampling might cause an error. In other words, for the original data, it might be the case that G_j < G_i. Let the difference G_j′ − G_i′ between G_j′ and G_i′ be ε > 0. If the number of samples n for evaluating the split is large enough, then it can be shown with the use of the Hoeffding bound that the undesirable case where G_j < G_i will not occur with at least a user-defined probability 1 − δ. The required value of n would be a function of ε and δ. In the context of data streams with continuously accumulating samples, the key is to wait for a large enough sample size n before performing the split. In the Hoeffding tree, the Hoeffding bound is used to determine the value of n in terms of ε and δ as follows:

n = \frac{R^2 \cdot \ln(1/\delta)}{2\epsilon^2}. \qquad (12.32)
The value of R denotes the range of the split criterion. For the Gini index, the value of R is 1, and for the entropy criterion, the value is log(k), where k is the number of classes. Near ties in the split criterion correspond to small values of ε. According to Eq. 12.32, such ties will lead to large sample-size requirements, and therefore a longer waiting time until one can be sufficiently confident of performing a split with the available stream sample.
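A small helper makes Eq. 12.32 concrete (a hypothetical function, with math.ceil rounding up to a whole sample count):

```python
import math

def required_samples(epsilon, delta, R=1.0):
    """Eq. 12.32: n = R^2 * ln(1/delta) / (2 * epsilon^2). Use R = 1 for
    the Gini index and R = log(k) for the entropy criterion, k classes."""
    return math.ceil(R ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))

# Example: a Gini gap of epsilon = 0.05 at confidence 1 - delta = 0.99
# gives required_samples(0.05, 0.01) == 922 samples at the node.
```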


The Hoeffding tree approach determines whether the current difference in the Gini index between the best and second-best split attributes is at least \sqrt{R^2 \cdot \ln(1/\delta)/(2n)} in order to initiate a split. This provides a guarantee on the quality of a split at a particular node. In cases,
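Putting the pieces together, a hedged sketch of the split test: the bound below is the ε obtained by inverting Eq. 12.32 for the current sample size n, and a node splits once the observed Gini gap exceeds it (function names and the default δ are illustrative):

```python
import math

def hoeffding_bound(R, delta, n):
    """Epsilon such that the true criterion value lies within epsilon of
    the sample estimate with probability 1 - delta, after n samples."""
    return math.sqrt(R ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(gini_best, gini_second, n, delta=1e-6, R=1.0):
    """Initiate a split when the gap between the best (lowest) and
    second-best Gini values exceeds the Hoeffding bound."""
    return gini_second - gini_best >= hoeffding_bound(R, delta, n)
```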






⁶ The argument also applies to general attributes by first transforming them to binary data with discretization and binarization.
